-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Some C++ code is sporadically crashing on specific Snapdragon chips #1884
Comments
This comment was marked as off-topic.
This comment was marked as off-topic.
Am I understanding correctly that this is a bug for aarch64 hosts, not targets? |
@DanAlbert, no, not as the term "hosts" is commonly used for the NDK, let me explain. The Termux app runs on all four Android architectures and allows users to run command-line tools on their Android phone or tablet. One of those tools is the LLVM toolchain, lightly patched to run on Android itself alongside a lightly-patched NDK sysroot. We cross-compile the LLVM toolchain for Android from Ubuntu 22.04 x86_64, so this bug does not occur when the NDK is running on an Ubuntu host, only when the produced executable runs on Android AArch64. When linking in the Termux app with lld 16 on Snapdragon devices running Android, it will occasionally fail with errors like these: error: link command failed with exit code 1 (use -v to see invocation)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: /data/data/com.termux/files/usr/bin/ld.lld --sysroot=/data/data/com.termux/files -pie -EL --fix-cortex-a53-843419 -z now -z relro -z max-page-size=4096 --hash-style=gnu -rpath=/data/data/com.termux/files/usr/lib --eh-frame-hdr -m aarch64linux -dynamic-linker /system/bin/linker64 -o ...
clang-16: error: unable to execute command: Segmentation fault
clang-16: error: linker command failed due to signal (use -v to see invocation) Other times, memory tagging is tripped: error: link command failed with exit code 1 (use -v to see invocation)
Pointer tag for 0xe2a9a7d900000000 was truncated, see 'https://source.android.com/devices/tech/debug/tagged-pointers'.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
clang-16: error: unable to execute command: Aborted
clang-16: error: linker command failed due to signal (use -v to see invocation) Given that this only happens on certain Snapdragon devices after the source updates in lld 16, I'm reporting it here as it is likely an NDK codegen bug, ie your patched clang 14 in NDK 25c is probably generating some code that causes memory leaks with certain C++ source on some Snapdragon chips. Obviously, our particular example of lld 16 is not important to you, but if this codegen issue hits other NDK devs and causes memory leaks on these popular Snapdragon chips, it may be worth looking into a workaround. That is why I'm reporting it, for that possible general case, not our specific case in the Termux app. |
The aarch64 codegen bug is just speculation at this point, right? Is there any evidence to support that the NDK is producing bad code? |
Hello @DanAlbert! I have the similar problem related to |
Try it with the r26 canary: https://ci.android.com/builds/branches/aosp-master-ndk/grid. There won't be any more releases of r25. |
I don't understand, that doesn't appear related to the topic in the OP. |
Yep, I started off by saying that "My guess is some NDK codegen issue that hits some hardware bug on 3-5 year-old Snapdragon chips."
None so far, but as this is likely a hardware bug, it will require a codegen fix if we want to avoid it. |
Okay, so this was just filed as a heads up that there's maybe an upstream bug? We don't keep tracking bugs for all upstream LLVM reports, and it doesn't sound like this is impacting any NDK users, so I'm going to close this. lmk if I've misunderstood. |
No, this was filed to see if you want to look into the underlying issue and gather feedback from the dev community using the NDK.
Isn't it too early to tell? This issue was only open for two weekdays. Even if you don't want to look into this, I think this should be kept open so other devs using the NDK are made aware and can report. I've already got multiple independent reports from Termux users when trying to link using lld 16, so it's not just me seeing this. If you're hoping this is fixed upstream, my read of that bug report is that the LLVM devs are not too incented to track down chip-specific bugs like this. That probably falls to Qualcomm or you NDK devs, and I'm unsure how to report this to Qualcomm, maybe you know. |
I think you may be reading too much from the "closed" status of this bug. It's closed because we're not going to spend engineering time on this. With the data we have been given, we don't have a reason to believe this is affecting app developers. I am not saying that there's 100% without a doubt no bug here. I'm saying that we're not going to chase this one. Upstream have over 5k open bugs. We only mirror the ones that we're actively seeking a cherry-pick for. Right now, we're treating this the way we would treat any other bug reporting an app crash on a specific device: almost all of those are bugs in the app or bugs in the device, neither of which are in our power to fix. We are able to offer debugging help for reduced test cases, but we've never had the bandwidth to help debug full apps. There are a number of things that could be at fault here. An incomplete list:
The only ones that the NDK will track are 5 and 6. If it's 5 and the vendor submits an errata fix for it to LLVM (as happened for a53, for example), and that errata is severe enough to justify it, we can make a cherry-pick and make it on by default. We aren't the ones that do that though, that's qualcomm. If it's 6, we're still reliant on upstream LLVM to fix the bug, but we'll track it here if it's affecting app developers. You've raised this with upstream, so the people that are able to fix this have already been notified.
Yes. This is all we're staffed for.
It's closed, not deleted. The record is here. |
Understood, I'll see if we can narrow it down to 5 or 6, which based on the error reporting it almost certainly is but the root cause simply has not been found yet, and will ask to reopen if so. |
I received this error message on a 2022 Moto G stylus... = note: Pointer tag for 0x4cb4000000060106 was truncated, see 'https://source.android.com/devices/tech/debug/tagged-pointers'.
|
@KaliforniaShell, thanks for the report, but this issue has been closed for more than a year. Take a look at the more detailed info upstream, llvm/llvm-project#62165, and you can either help debug that or just ignore this memory leak race bug for now, as simply running the same command again will usually succeed. |
Description
I maintain the LLVM toolchain that runs directly on Android devices in the Termux app. Ever since updating to LLVM 16 a couple months ago, I'm seeing sporadic lld crashes about 2-5% of the time: see bugs llvm/llvm-project#62165 and termux/termux-packages#15867 for more info, including repro instructions in the Termux app using a Swift project. The output seems to indicate some kind of memory leak, since the new tagged pointers on Android AArch64 are also being modified.
We've spent some time gathering data and this weirdly only seems to affect some Snapdragon devices. I can easily reproduce on a Snapdragon 865 phone and a Snapdragon 865+ tablet, and another user just reported the same issue on a Snapdragon 660 phone, llvm/llvm-project#62605. Conversely, I was not able to reproduce on an Exynos 2100 phone, nor was someone else on a Snapdragon 888 phone.
Obviously, this could just be a problem in our non-NDK code, either an lld 16 bug or some mistake in our lightly-patched, NDK 25c-based Termux toolchain. However, the fact that this is not reproducible with lld 15 cross-compiled with the exact same Termux NDK toolchain, and not even with lld 16 on non-Snapdragon devices suggests the problem may go deeper. My guess is some NDK codegen issue that hits some hardware bug on 3-5 year-old Snapdragon chips, triggered by some code change in lld 16 (could be something like this previous Exynos bug).
I'm looking for feedback from the NDK devs and community, if anybody else has seen a similar issue. If anybody thinks this is not an issue in NDK 23c or the upcoming 26, I can cross-compile lld 16 with those and check.
@DanAlbert, if you have rr working on a Snapdragon device exhibiting this bug, rr-debugger/rr#3433, maybe you can repro in Termux then rewind in rr to get to the root cause? Obviously these leaks are not usually easy to pin down, but with more modern tooling, we may be able to now.
Upstream bug
llvm/llvm-project#62165
Commit to cherry-pick
No response
Affected versions
r25
Canary version
No response
Host OS
Linux
Host OS version
Ubuntu 20.04 and 22.04
Affected ABIs
arm64-v8a
The text was updated successfully, but these errors were encountered: