Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Some C++ code is sporadically crashing on specific Snapdragon chips #1884

Closed
finagolfin opened this issue May 13, 2023 · 14 comments
Closed
Labels

Comments

@finagolfin
Copy link

finagolfin commented May 13, 2023

Description

I maintain the LLVM toolchain that runs directly on Android devices in the Termux app. Ever since updating to LLVM 16 a couple months ago, I'm seeing sporadic lld crashes about 2-5% of the time: see bugs llvm/llvm-project#62165 and termux/termux-packages#15867 for more info, including repro instructions in the Termux app using a Swift project. The output seems to indicate some kind of memory leak, since the new tagged pointers on Android AArch64 are also being modified.

We've spent some time gathering data and this weirdly only seems to affect some Snapdragon devices. I can easily reproduce on a Snapdragon 865 phone and a Snapdragon 865+ tablet, and another user just reported the same issue on a Snapdragon 660 phone, llvm/llvm-project#62605. Conversely, I was not able to reproduce on an Exynos 2100 phone, nor was someone else on a Snapdragon 888 phone.

Obviously, this could just be a problem in our non-NDK code, either an lld 16 bug or some mistake in our lightly-patched, NDK 25c-based Termux toolchain. However, the fact that this is not reproducible with lld 15 cross-compiled with the exact same Termux NDK toolchain, and not even with lld 16 on non-Snapdragon devices suggests the problem may go deeper. My guess is some NDK codegen issue that hits some hardware bug on 3-5 year-old Snapdragon chips, triggered by some code change in lld 16 (could be something like this previous Exynos bug).

I'm looking for feedback from the NDK devs and community, if anybody else has seen a similar issue. If anybody thinks this is not an issue in NDK 23c or the upcoming 26, I can cross-compile lld 16 with those and check.

@DanAlbert, if you have rr working on a Snapdragon device exhibiting this bug, rr-debugger/rr#3433, maybe you can repro in Termux then rewind in rr to get to the root cause? Obviously these leaks are not usually easy to pin down, but with more modern tooling, we may be able to now.

Upstream bug

llvm/llvm-project#62165

Commit to cherry-pick

No response

Affected versions

r25

Canary version

No response

Host OS

Linux

Host OS version

Ubuntu 20.04 and 22.04

Affected ABIs

arm64-v8a

@finagolfin finagolfin added the bug label May 13, 2023
@DanAlbert

This comment was marked as off-topic.

@DanAlbert
Copy link
Member

Am I understanding correctly that this is a bug for aarch64 hosts, not targets?

@finagolfin
Copy link
Author

finagolfin commented May 16, 2023

@DanAlbert, no, not as the term "hosts" is commonly used for the NDK, let me explain.

The Termux app runs on all four Android architectures and allows users to run command-line tools on their Android phone or tablet. One of those tools is the LLVM toolchain, lightly patched to run on Android itself alongside a lightly-patched NDK sysroot. We cross-compile the LLVM toolchain for Android from Ubuntu 22.04 x86_64, so this bug does not occur when the NDK is running on an Ubuntu host, only when the produced executable runs on Android AArch64.

When linking in the Termux app with lld 16 on Snapdragon devices running Android, it will occasionally fail with errors like these:

error: link command failed with exit code 1 (use -v to see invocation)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /data/data/com.termux/files/usr/bin/ld.lld --sysroot=/data/data/com.termux/files -pie -EL --fix-cortex-a53-843419 -z now -z relro -z max-page-size=4096 --hash-style=gnu -rpath=/data/data/com.termux/files/usr/lib --eh-frame-hdr -m aarch64linux -dynamic-linker /system/bin/linker64 -o ...
clang-16: error: unable to execute command: Segmentation fault
clang-16: error: linker command failed due to signal (use -v to see invocation)

Other times, memory tagging is tripped:

error: link command failed with exit code 1 (use -v to see invocation)
Pointer tag for 0xe2a9a7d900000000 was truncated, see 'https://source.android.com/devices/tech/debug/tagged-pointers'.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
clang-16: error: unable to execute command: Aborted
clang-16: error: linker command failed due to signal (use -v to see invocation)

Given that this only happens on certain Snapdragon devices after the source updates in lld 16, I'm reporting it here as it is likely an NDK codegen bug, ie your patched clang 14 in NDK 25c is probably generating some code that causes memory leaks with certain C++ source on some Snapdragon chips. Obviously, our particular example of lld 16 is not important to you, but if this codegen issue hits other NDK devs and causes memory leaks on these popular Snapdragon chips, it may be worth looking into a workaround.

That is why I'm reporting it, for that possible general case, not our specific case in the Termux app.

@DanAlbert
Copy link
Member

The aarch64 codegen bug is just speculation at this point, right? Is there any evidence to support that the NDK is producing bad code?

@arttet
Copy link

arttet commented May 16, 2023

Hello @DanAlbert!

I have the similar problem related to Segmentation fault for the aarch64-linux-android target. So, I have been investigating this issue. But I can reproduce it for Android NDK 25c and cannot reproduce for Android NDK 24. I will provide other details tomorrow. I have a static binary, and I got a core dump. It happened before main. The core dump showed that the problem related to the null pointer here.

@DanAlbert
Copy link
Member

Try it with the r26 canary: https://ci.android.com/builds/branches/aosp-master-ndk/grid. There won't be any more releases of r25.

@DanAlbert
Copy link
Member

The core dump showed that the problem related to the null pointer here.

I don't understand, that doesn't appear related to the topic in the OP.

@finagolfin
Copy link
Author

The aarch64 codegen bug is just speculation at this point, right?

Yep, I started off by saying that "My guess is some NDK codegen issue that hits some hardware bug on 3-5 year-old Snapdragon chips."

Is there any evidence to support that the NDK is producing bad code?

None so far, but as this is likely a hardware bug, it will require a codegen fix if we want to avoid it.

@DanAlbert
Copy link
Member

Okay, so this was just filed as a heads up that there's maybe an upstream bug? We don't keep tracking bugs for all upstream LLVM reports, and it doesn't sound like this is impacting any NDK users, so I'm going to close this. lmk if I've misunderstood.

@DanAlbert DanAlbert closed this as not planned Won't fix, can't repro, duplicate, stale May 16, 2023
@finagolfin
Copy link
Author

finagolfin commented May 17, 2023

Okay, so this was just filed as a heads up that there's maybe an upstream bug?

No, this was filed to see if you want to look into the underlying issue and gather feedback from the dev community using the NDK.

it doesn't sound like this is impacting any NDK users, so I'm going to close this.

Isn't it too early to tell? This issue was only open for two weekdays. Even if you don't want to look into this, I think this should be kept open so other devs using the NDK are made aware and can report. I've already got multiple independent reports from Termux users when trying to link using lld 16, so it's not just me seeing this.

If you're hoping this is fixed upstream, my read of that bug report is that the LLVM devs are not too incented to track down chip-specific bugs like this. That probably falls to Qualcomm or you NDK devs, and I'm unsure how to report this to Qualcomm, maybe you know.

@DanAlbert
Copy link
Member

I think you may be reading too much from the "closed" status of this bug. It's closed because we're not going to spend engineering time on this. With the data we have been given, we don't have a reason to believe this is affecting app developers. I am not saying that there's 100% without a doubt no bug here. I'm saying that we're not going to chase this one. Upstream have over 5k open bugs. We only mirror the ones that we're actively seeking a cherry-pick for.

Right now, we're treating this the way we would treat any other bug reporting an app crash on a specific device: almost all of those are bugs in the app or bugs in the device, neither of which are in our power to fix. We are able to offer debugging help for reduced test cases, but we've never had the bandwidth to help debug full apps.

There are a number of things that could be at fault here. An incomplete list:

  1. LLD bug that only affects snapdragon hosts
  2. Bug in one of your patches to LLD
  3. Bug in how termux LLD was built
  4. Device-specific bug
  5. SoC specific bug
  6. aarch64 codegen bug

The only ones that the NDK will track are 5 and 6. If it's 5 and the vendor submits an errata fix for it to LLVM (as happened for a53, for example), and that errata is severe enough to justify it, we can make a cherry-pick and make it on by default. We aren't the ones that do that though, that's qualcomm. If it's 6, we're still reliant on upstream LLVM to fix the bug, but we'll track it here if it's affecting app developers. You've raised this with upstream, so the people that are able to fix this have already been notified.

If you're hoping this is fixed upstream

Yes. This is all we're staffed for.

I think this should be kept open so other devs using the NDK are made aware

It's closed, not deleted. The record is here.

@finagolfin
Copy link
Author

Understood, I'll see if we can narrow it down to 5 or 6, which based on the error reporting it almost certainly is but the root cause simply has not been found yet, and will ask to reopen if so.

@KaliforniaShell
Copy link

I received this error message on a 2022 Moto G stylus...

= note: Pointer tag for 0x4cb4000000060106 was truncated, see 'https://source.android.com/devices/tech/debug/tagged-pointers'.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
cc: error: unable to execute command: Aborted
cc: error: linker command failed due to signal (use -v to see invocation)

        error: could not compile `thiserror` (build script) due to 1 previous error

        Caused by:
          process didn't exit successfully: `rustc --crate-name build_script_build --edition=2021 /data/data/com.termux/files/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/thiserror-1.0.61/build.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type bin --emit=dep-info,link -C embed-bitcode=no -C debug-assertions=off --check-cfg 'cfg(docsrs)' --check-cfg 'cfg(feature, values())' -C metadata=834aff333b47f15f -C extra-filename=-834aff333b47f15f --out-dir /data/data/com.termux/files/usr/tmp/pip-install-3s8biry_/maturin_6eaa9c16721d40ceaaf3973e543fd9a4/target/release/build/thiserror-834aff333b47f15f -C strip=debuginfo -L dependency=/data/data/com.termux/files/usr/tmp/pip-install-3s8biry_/maturin_6eaa9c16721d40ceaaf3973e543fd9a4/target/release/deps --cap-lints allow` (exit status: 1)
        warning: build failed, waiting for other jobs to finish...
        error: `cargo build --manifest-path Cargo.toml --message-format=json-render-diagnostics --release -v --no-default-features --locked` failed with code 101
        [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for maturin
    Building wheel for cffi (pyproject.toml): started
    Building wheel for cffi (pyproject.toml): finished with status 'done'
    Created wheel for cffi: filename=cffi-1.17.0-cp311-cp311-linux_aarch64.whl size=333543 sha256=dbb034eb47469693c907f54ca9f6cee9d5104235cd290d9cb3ea28caa186c93c
    Stored in directory: /data/data/com.termux/files/home/.cache/pip/wheels/1e/f6/68/a4412a1d1612ba40403ffa99005a9053166e806ea5e15580c6
  Successfully built cffi
  Failed to build maturin
  ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (maturin)
  [end of output]

@finagolfin
Copy link
Author

@KaliforniaShell, thanks for the report, but this issue has been closed for more than a year. Take a look at the more detailed info upstream, llvm/llvm-project#62165, and you can either help debug that or just ignore this memory leak race bug for now, as simply running the same command again will usually succeed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants