runtime: crashes with "runtime: traceback stuck" #62086
It looks like it is stuck when collecting a goroutine profile, which takes stack traces for all goroutines. The place it is stuck at is the entry of a deferred function; it should never unwind to there. The value there looks like a stack slot used to save the defer record, rather than a return PC, so the unwinding had already gone wrong by that point. It could be due to failing to unwind a frame lower down the stack, or to the stack being corrupted. @aciduck is your program a pure-Go program, or does it use cgo? Does it use any unsafe code? Is there a way we can reproduce the failure? Thanks.
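For context, the goroutine profile referred to here is the one collected through the standard runtime/pprof package; a minimal sketch of the call that makes the runtime take a stack trace of every goroutine (the destination and debug level are just for illustration):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Collecting the "goroutine" profile makes the runtime unwind the stack
	// of every live goroutine, which is the code path implicated above.
	p := pprof.Lookup("goroutine")
	if p != nil {
		// debug=1 prints symbolized stacks; debug=0 emits the binary proto format.
		_ = p.WriteTo(os.Stdout, 1)
	}
}
```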
This happens in multiple services; some are compiled with cgo and some aren't.
Could you show a more complete error message? There should be other goroutine stack dumps below, which may have information about how the unwinder got to that point (it is okay to have symbol names redacted). Thanks.
Full dump is in the attached file:
Hi @cherrymui, is there anything else we can provide to help investigate this? We have upgraded to 1.21.4 in the past weeks and the problem still persists.
Sorry for the delay. Thanks for the full stack trace. Reading the stack trace, it seems to fail when unwinding the stack at memory address
This matches goroutine 594:
When the runtime crashes, it successfully unwound this stack. So it is weird that the first round of unwinding (for taking the goroutine profile) failed. It is possible that the goroutine executed and changed its stack between the two unwindings, but given that it had been sitting there for 2 minutes, and the crash probably took much less time, it probably didn't change. Besides, goroutine 105 is interesting:
At the second frame, it seems this failure occurs while taking a goroutine profile. Could you try
Thanks.
Even after the fix for #54332, we continue to see traceback stuck approximately once a month across our arm64 fleet under continuous profiling, including on the most recent
I've symbolised
https://github.com/golang/go/blob/go1.21.6/src/net/http/server.go#L1864-L1882
Thanks @lizthegrey for the information.
This looks like it is the entry address of a deferred function. Could you confirm whether it is indeed the function entry, or somewhere in the function prologue, perhaps with

What is PC 0x25f88? It is the first argument of

Thanks.
Confirm, function entry.
Hi, since upgrading to 1.21 the stack traces changed a bit, and it became obvious that the problem is triggered by the runtime profiling we run every 30 seconds. We kept CPU and heap profiling enabled and disabled goroutine profiling, and the problem almost disappeared.
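For anyone looking to apply the same mitigation, a rough sketch of a periodic profiler that keeps CPU and heap profiles but skips the goroutine profile; the 30-second interval comes from this thread, while the file names and sampling window are placeholders:

```go
package main

import (
	"os"
	"runtime/pprof"
	"time"
)

// profileOnce captures a short CPU profile and a heap profile, but
// deliberately skips the "goroutine" profile that was correlated with
// the crashes discussed here.
func profileOnce() error {
	cpu, err := os.Create("cpu.pprof")
	if err != nil {
		return err
	}
	defer cpu.Close()
	if err := pprof.StartCPUProfile(cpu); err != nil {
		return err
	}
	time.Sleep(10 * time.Second) // sampling window (arbitrary)
	pprof.StopCPUProfile()

	heap, err := os.Create("heap.pprof")
	if err != nil {
		return err
	}
	defer heap.Close()
	return pprof.WriteHeapProfile(heap)
}

func main() {
	for range time.Tick(30 * time.Second) {
		if err := profileOnce(); err != nil {
			_ = err // error handling elided for brevity
		}
	}
}
```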
@aciduck have you tried Go 1.22? Also, in your crash dumps it seems some lines are missing or reordered, as if frames from different stacks are mixed together. Could you paste an unfiltered stack dump? (It is fine to redact function names.) Also,
the SP and the frame are not within the bounds of the stack. Could it be that the G and the stack don't match? Are these crashes from a cgo program or pure Go? Thanks.
@cherrymui We haven't updated to 1.22 yet; that will probably happen in the coming weeks.
We've done our 1.22 upgrade, will report if we see any stuck tracebacks.
@cherrymui: got our first stuck traceback since go1.22. Let me know what you need symbolised from this
Hi, we upgraded to Go 1.22.1 and the problem keeps happening sporadically. Two such crashes are attached:
Another crash, let me know if you need me to symbolise it.
Can we please get an update here? I know the 1.23 window is soon, and it would be nice to get a fix or additional telemetry into that release.
Sorry for the delay. I still plan to look into this, and address it in 1.23 (we can still fix bugs during the freeze). Thanks.
@aciduck in extract-2024-03-27T12_12_46.194Z.txt
The signal PC (where the traceback starts) is 0x8a2cc, which is close to
So it may be something in the runtime, and probably assembly code. Is it possible to get what that PC is? Also, from the SP,
it seems to be on this goroutine
So apparently when it crashes, it actually can unwind the stack successfully. It starts with 0x4005a13a10, and the next frame is at 0x4005a13b00. Why does the profiling unwinder get stuck in the middle of that frame, at 0x4005a13ae0? Also, the first frame has FP=SP, which looks like it is in the prologue of calling morestack. Maybe it is related to profiling from

@lizthegrey's cases are weird.
They get stuck at SP=0x402ec0f2d0. The saved LR, or next PC, should be *SP=0x5c4528 (https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=373). But that is not the one reported on the first line, which is 0x25d80, which is actually
For unwinding for profiling, we are already a bit permissive: in a number of cases, if we can't find a valid frame, instead of throwing an error we simply give up. Maybe we should do the same here: if it is going to fall into an unwinding loop, just give up and leave a truncated stack trace. On the other hand, the "traceback stuck" error did uncover a few real bugs. I'm not sure...
How about a compromise? Leave a truncated stack, but output the error message somewhere, so that we can pick it up and use it for debugging in the rare case where it does happen?
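To make the proposed compromise concrete, here is a rough illustration (not actual runtime code; the frame type and helper names are invented) of "give up, keep the truncated trace, and report the anomaly" in an unwinding-style loop:

```go
package main

import "fmt"

// frame is a stand-in for whatever the unwinder tracks per step.
type frame struct{ pc, sp uintptr }

// unwindNext is a placeholder for the real "step to the caller" logic.
// This stub always fails to advance, to exercise the stuck path.
func unwindNext(f frame) (frame, bool) {
	return f, false
}

// collect illustrates the compromise: if the unwinder stops making
// progress, keep the truncated trace and report the anomaly instead of
// throwing a fatal "traceback stuck" error.
func collect(start frame, maxFrames int) []frame {
	trace := []frame{start}
	f := start
	for i := 0; i < maxFrames; i++ {
		next, ok := unwindNext(f)
		if !ok || next.sp <= f.sp { // stuck or not advancing
			reportStuck(f) // surface the diagnostic somewhere visible
			return trace   // truncated, but the profile still completes
		}
		trace = append(trace, next)
		f = next
	}
	return trace
}

func reportStuck(f frame) {
	fmt.Printf("traceback stuck at pc=%#x sp=%#x (truncated)\n", f.pc, f.sp)
}

func main() {
	_ = collect(frame{pc: 0x1234, sp: 0x8000}, 64)
}
```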
Can we get an update here? This is still happening from time to time. There are a few suggestions in this thread to add additional debug info if it is still a mystery.
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
We can't use 1.20.6 because of #61431

What operating system and processor architecture are you using (go env)?
go env Output

What did you do?
Our production environment runs tens of thousands of containers written in Go, most of them running only for a few minutes or hours. About once every day one of them crashes with runtime: traceback stuck. It is not always the same service, and this has been happening for months across multiple Go versions, going back to at least Go 1.18. We are not sure exactly when it started.

We did see a common pattern: the stack trace is always of the goroutine running our internal MemoryMonitor. It is a small library that runs in all our services, samples the cgroup memory parameters every second from procfs, and logs all the running operations if we use 90% of available memory. When we turn off this functionality the problem disappears, so we know it is related. None of the containers that crashed reached this limit during their run, so only the sampling occurred.
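For readers unfamiliar with this kind of monitor, a minimal sketch of what it roughly does; this is a reconstruction, not the actual library, and the cgroup v2 file paths, the 90% threshold action, and the log output are assumptions:

```go
package main

import (
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// readBytes parses an integer value from a cgroup file such as
// /sys/fs/cgroup/memory.current (cgroup v2 layout, assumed).
func readBytes(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	const (
		usageFile = "/sys/fs/cgroup/memory.current" // assumed path
		limitFile = "/sys/fs/cgroup/memory.max"     // assumed path
	)
	for range time.Tick(time.Second) {
		usage, err1 := readBytes(usageFile)
		limit, err2 := readBytes(limitFile)
		if err1 != nil || err2 != nil || limit == 0 {
			continue // e.g. memory.max may be "max" (unlimited)
		}
		if float64(usage)/float64(limit) > 0.9 {
			// The real library logs all running operations here; this is a placeholder.
			log.Printf("memory pressure: %d/%d bytes used", usage, limit)
		}
	}
}
```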
Another thing we always see in the dump is an active runtime profile being taken by the DataDog agent we integrate with. It runs every 30 seconds and takes a CPU and memory profile using the standard pprof library. We are not sure if this is related.
What did you expect to see?
No crashes.
What did you see instead?