lld 16 appears to have a threading-related memory leak on some Android AArch64 devices #62165
Comments
@llvm/issue-subscribers-lld-elf |
Can you run the link commands with the option |
@futurejones, do you think you could try to reproduce with lld 16 on your AArch64 server, by using a script similar to the github Action script I linked but with the official lld 16.0.1 build for linux AArch64? @DimitryAndric, that just saves the command line and object files used? I doubt it is specific to certain build input, as I see it randomly across various projects when linking, something like a 1-3% failure rate. |
Well, it is pretty much impossible to fix an issue if it cannot be reproduced reliably. If you are really experiencing totally random crashes, I would first assume your hardware is faulty, but that can usually be tested by e.g. memtest, or running the build on another known-good system. |
@DimitryAndric, unlikely to be hardware-related since lld 15 was working well, and I think @xtkoba said he reproduced on other hardware in the linked Termux issue. My guess would be some AArch64-specific codegen issue, which is why I asked for more testing on linux AArch64. As for tracking it down, I would try running repeatedly on a system where this is happening with a replayable debugger like rr, then try to work back from the segfault to the presumed leak. That is why checking linux AArch64 is important, as much better tools like rr are available there. |
It is useful to attach a tarball of a crashed instance here, since it might be difficult for somebody else to reproduce your exact build environment, etc. That is usually the trickiest part of finding the cause for bugs. |
OK, I will take a look at what that |
@buttaface Probably you misunderstand something. This has not reproduced for me yet. |
I was able to reproduce fairly quickly, when building Swift NIO on my Android AArch64 phone with If you want to try it out with the same lld 16.0.1 that crashed for me, install the Termux app on an Android device and then run |
Still seeing this with 16.0.2, will try on other Android devices next to rule out the bad hardware possibility mentioned. |
I tried the Swift build above on an Android tablet and saw an lld crash on the first run. I then tried it around 15 times on another Android phone with no crash so far; I will keep trying to make sure. There is no correlation with the amount of memory: the crashing devices have 6 and 8 GB of RAM, and the non-crashing phone has 8 GB. Both crashing devices have a Snapdragon 865 CPU, whereas the non-crashing one has an Exynos chip. Could this be a chip-specific AArch64 codegen issue? |
Now tried it more than two dozen times on the Exynos without a single lld 16.0.2 crash. I'll keep trying, but it looks like this bug might be specific to certain AArch64 CPUs. |
I have checked a
|
@llvm/issue-subscribers-backend-aarch64 |
@MaskRay, what platform was the lld you used run on? As indicated earlier, I can only reproduce when running lld on some Android AArch64 devices right now, so I mentioned this is likely an AArch64 codegen bug, as lld is built by the patched clang in the Android NDK. |
@DanAlbert, does this failure mode ring any bells for you? I just saw your rr pull for Android, rr-debugger/rr#3433, does rr work well for you and does it need higher privileges like root/adb? Maybe I can use rr on Android to track this down. |
rr is nowhere near ready for debugging apps. If all you have is a binary that can run out of |
@Sonicadvance1, you appear to be the only person to ever get a commit into this project with Snapdragon in the log message, 045d84f. The Android NDK devs say these chip-specific issues are not their domain, android/ndk#1884. Any idea how we can get Qualcomm to look into this seeming issue with some of their SoCs? |
I think it is unlikely to be a particular issue with the SoC. It is more likely to be what features that particular SoC enables. Does this reproduce at all on a linux aarch64 host, or has that not been tried yet? If we can reproduce outside of an Android phone then this will be much easier to reproduce. Without that I think you may need to add your own trace to LLD and run it on the device to find out where the crash is. Building LLD with debug information may also help as it may end up with a better stack trace when it crashes. Given the error message sometimes says "Pointer tag for 0xe2a9a7d900000000 was truncated" I expect this is something to do with the information in https://source.android.com/docs/security/test/tagged-pointers this says:
It is possible that this only reproduces on a TBI supporting device. |
AFAIK, not been tried, as that is still a rare configuration. Also, it is possible that whatever Snapdragon features that are causing this are not there in AArch64 server CPUs, like the Exynos bug I linked.
Trying that next, with ASan enabled too, though I'm skeptical of the stack trace showing much because usually the segfault is much downstream of the actual leak.
No, a user just reported the same issue with a Snapdragon 660 device, #62605, running Android 10, and tagged pointers were not added till Android 11, aosp-mirror/platform_bionic@3b21ada, so I don't think that is a factor. More likely that some of these leaks are simply overwriting that Android malloc tag sometimes, so that failsafe is occasionally being triggered. |
No idea how to contact QCom, but as said it's unlikely to be a SoC specific issue and more an issue of what features the SoC has. Tagged pointers are fairly new as an example and my devices don't even have that feature in their kernel yet. Probably more. Alternatively, I don't do any Android development, my commit was done entirely on a Linux device. |
Finally reproduced this with a lld build with debug info and ASan enabled on a Snapdragon 865 phone, got the following output when it failed (excuse the weird formatting):
Looks like some kind of race, which would explain why it occurs sporadically with the exact same input. I'm not too familiar with ASan output, so I'm pasting this here in case others can read it much faster and figure it out. This was on the very first run. I then ran the reproducible response.txt on the object files that I gzipped above about 100 times before I got essentially the same leak again:
Maybe a good sign that this is repeatable, though only about 2% of the time. I'll try it on some more devices next. If you want to reproduce, install the Termux app on a Snapdragon device, get the AArch64 lld debug zip file here, and run the following commands:
@xtkoba, would you try this on your Snapdragon device and let me know what you find? |
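Since the failure shows up in only about 2% of runs, the repeated runs can be scripted. The following is a minimal Python sketch, not something taken from this thread: `count_failures` is a hypothetical helper, and the command passed to it is a placeholder for whatever link command is being replayed. On POSIX systems, `subprocess` reports a process killed by a signal as a negative return code, so on Linux a segfault would typically appear as -11.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return (run_index, returncode) for each failure.

    `cmd` is a list of arguments, e.g. the replayed link command (placeholder).
    """
    failures = []
    for i in range(runs):
        # Capture stderr so a failing run's diagnostics (e.g. ASan output) are kept.
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
        if result.returncode != 0:
            failures.append((i, result.returncode))
            # Echo the captured diagnostics of the failing run.
            sys.stderr.write(result.stderr.decode(errors="replace"))
    return failures
```

Dividing `len(failures)` by `runs` then gives a rough failure rate that can be compared across devices.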
I ran the same procedure on an Android tablet with a Snapdragon 865+ and was able to reproduce quickly and multiple times. Then I tried on a phone with an Exynos 2100 more than a thousand times, without a single failure. Clearly there is some difference between CPUs, whether in hardware features or maybe even in the speed variance of the big.LITTLE cores. |
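To put a number on how unlikely a thousand clean runs would be if the Exynos phone had the same failure rate, here is a small Python sketch. It assumes that runs are independent and that the roughly 2% rate seen on the Snapdragon 865 is constant; both are simplifying assumptions, and the function names are made up for illustration.

```python
import math


def p_no_failure(failure_rate, runs):
    """Probability of seeing zero failures in `runs` independent runs."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of clean runs after which a bug with the given
    failure rate would have shown up with probability >= `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    print(p_no_failure(0.02, 15))    # about 15 clean runs: still quite likely by chance
    print(p_no_failure(0.02, 1000))  # 1000 clean runs: vanishingly unlikely by chance
    print(runs_needed(0.02, 0.95))   # clean runs needed for 95% confidence
```

Under these assumptions, around 15 clean runs say very little, while more than a thousand clean runs would be expected less than once in a hundred million attempts, which supports the view that the Exynos device behaves differently.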
The change to run relocation scanning in parallel is https://reviews.llvm.org/D133003. IIUC the relocations are expected to be added to per-thread relocation vectors and then merged. I remember one issue was found, https://reviews.llvm.org/D142317, but this was merged for the 16.0.0 release, #60338. To experiment with the race condition theory, it could be worth trying to reproduce with --threads=1 or, if you are rebuilding LLD from scratch, serialising the relocation scanning as is done for Mips and PPC.
If it is as simple as LLD writing to a shared location, I would expect it to reproduce on other platforms though. If my web search of the Exynos and Snapdragon SoCs is correct, then the Exynos is based on a Cortex-X1 and the Snapdragon is based on an A77, both of which are architecturally (not microarchitecturally) the same. |
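The "separate buffers, then merge" pattern described above can be illustrated in a few lines of Python. This is only a sketch of the general technique, not LLD's implementation: `scan_section` and the section data are hypothetical, and the buffers here are per task rather than per thread. The point is that no worker writes to a shared container, and the merge happens on one thread in a fixed order, so the result should not depend on the thread count.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_section(section):
    """Hypothetical stand-in for scanning one input section.

    Each call builds and returns its own private list, so workers never
    append to a container shared with other workers.
    """
    name, relocs = section
    return [(name, r) for r in relocs]


def scan_parallel(sections, threads):
    """Scan sections on a thread pool, then merge the private results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map yields results in input order, whichever task finishes first.
        per_task = list(pool.map(scan_section, sections))
    merged = []
    for chunk in per_task:  # the merge runs on a single thread
        merged.extend(chunk)
    return merged


def scan_serial(sections):
    """Serial fallback, similar in spirit to running with one thread."""
    return scan_parallel(sections, threads=1)
```

If a parallel scan and a serial scan of the same input ever disagree, that points at state being shared somewhere outside the private buffers.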
Thanks for all the relevant info, @smithp35, adding I will try patching the source as you suggested, and see if that works around this problem in the Termux app. |
I could not reproduce after forcing that lld code to run serially, so I have pushed @smithp35's suggestion as a workaround in the Termux app, termux/termux-packages@e4dd29c. Another Termux dev tried to reproduce with the debug build and sees this same issue on several Snapdragon devices; some more testing should determine if this is Qualcomm-specific. |
@finagolfin Are you actually using the sources for LLVM 16 or just taking the version number from somewhere in Android? Android versions aren't always the official LLVM release versions. In this case, I do see that https://reviews.llvm.org/D142317 is in the upcoming r26 compiler, but it wouldn't be in any r25 compiler (but I also don't know how you would have the parallelization patch in an r25 toolchain either, since that patch only landed in upstream in September). |
@stephenhines, as I explained in more detail in the NDK issue I subsequently opened, this is indeed a lightly patched LLVM 16.0.4 running on Android Snapdragon devices, after cross-compiling it with an NDK 25c toolchain from linux x86_64 with a lightly patched sysroot. Since the lld 16.0.4 I'm building on linux and running on Android has that patch you linked, I don't think it is that issue, though it may be related to some of the other issues raised in that review thread. I have now turned off that parallelized code in lld 16.0.4 and I no longer see any segfaults when running it on Android, so the issue is clearly in that multi-threading, perhaps interacting with some Snapdragon hardware bug. |
An update: after disabling threading in lld, I no longer see this crash with the reproducer I provided, but I still see an occasional crash when using lld 16.0.5 to link some random project. So this bug is clearly not isolated to threading alone, but the crashes are now much rarer, i.e. less than 0.1%. |
I see random crashes when linking that always go away if I run the same command again, i.e. they are usually not repeatable. This has only happened since the LLVM 16 update; LLVM 15 worked well, as noted in termux/termux-packages#15867.
Sometimes, it'll just segfault:
Other times, it'll claim issues with memory tagging:
Most likely, this is a memory leak that sometimes causes segfaults.
I tried to reproduce on Ubuntu 20.04 x86_64 by building a Swift package repeatedly with the official Clang 16.0.0 build on GitHub Actions, but could not after 50+ builds of a couple of Swift packages, implying this is an issue only on AArch64 or with some Bionic interaction.
If someone could test LLD 16 on linux AArch64 and report their results, that should help narrow this down. It may be related to prior LLD issues #58056 or #60456.