Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: crashes with "runtime: traceback stuck" #62086

Open
aciduck opened this issue Aug 17, 2023 · 23 comments
Open

runtime: crashes with "runtime: traceback stuck" #62086

aciduck opened this issue Aug 17, 2023 · 23 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@aciduck
Copy link

aciduck commented Aug 17, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.5 linux/arm64

Does this issue reproduce with the latest release?

We can't use 1.20.6 because of #61431

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

What did you do?

Our production environments runs tens of thousands of containers written in Go, most of them running only for a few minutes or hours. About once every day one of them crashes with runtime: traceback stuck. It is not always the same service, and has been happening for months and across multiple Go versions, going back to at least Go 1.18. We are not sure exactly when it started.
We did saw a common pattern, where the stack trace is always of the routine running our internal MemoryMonitor. It is a small library that runs in all our services, samples the cgroup memory parameters every second from procfs, and logs all the running operation if we use 90% of available memory. When we turn off this functionality the problem disappear, so we know it is related. All the containers that crashed didn't reach this limit during their run, so only the sampling occurred.
Another thing to note that we always see in the dump is an active runtime profiling running by the DataDog agent we are integrating with. It is running every 30 seconds and takes a CPU and memory profile using the standard pprof library. We are not sure if this is related.

What did you expect to see?

No crashes.

What did you see instead?

runtime: traceback stuck. pc=0xfdfcf0 sp=0x40097fbf90
stack: frame={sp:0x40097fbf90, fp:0x40097fbf90} stack=[0x40097fa000,0x40097fc000)
0x00000040097fbe90: 0x00000040075ae200 0x0000004003a53d00
0x00000040097fbea0: 0x0000004007587d40 0x0000000000000001
0x00000040097fbeb0: 0x0000000001b38580 <github.com/xxx/xxx/commonlib/memmonitor.(*memoryMonitor).run.func1.1+0x0000000000000000> 0x000000400771e500
0x00000040097fbec0: 0x0000004003a13e60 0x0000000000000000
0x00000040097fbed0: 0x0000000000000000 0x0000000000000000
0x00000040097fbee0: 0x00000040097fbeb0 0x0000004000507ef8
0x00000040097fbef0: 0x0000000000fe02c8 <github.com/xxx/xxx/commonlib/waitgroup.(*SafeErrorGroup).Go.func1+0x0000000000000088> 0x0000004000507f58
0x00000040097fbf00: 0x0000000000fdfbfc <golang.org/x/sync/errgroup.(*Group).Go.func1+0x000000000000005c> 0x0000004004eed080
0x00000040097fbf10: 0x0300000000000000 0x0000000000000000
0x00000040097fbf20: 0x0000000000000000 0x0000000000fe0330 <github.com/xxx/xxx/commonlib/waitgroup.(*SafeErrorGroup).Go.func1.1+0x0000000000000000>
0x00000040097fbf30: 0x0000000004324400 0x0000004007587d40
0x00000040097fbf40: 0x00000040097fbf18 0x00000000036f2020
0x00000040097fbf50: 0x00000040097fbf28 0x0000000000000000
0x00000040097fbf60: 0x000000000007fd44 <runtime.goexit+0x0000000000000004> 0x00000040076be780
0x00000040097fbf70: 0x0000004002347b60 0x0000000000000000
0x00000040097fbf80: 0x0100004004d079e0 0x0000004007585f00
0x00000040097fbf90: >0x0000000000fdfcf0 <golang.org/x/sync/errgroup.(*Group).Go.func1.2+0x0000000000000000> 0x0000004007585f00
0x00000040097fbfa0: 0x0000004000507f48 0x0000000000000000
0x00000040097fbfb0: 0x000000000007fd44 <runtime.goexit+0x0000000000000004> 0x000000400074f7a0
0x00000040097fbfc0: 0x00000040097fbf90 0x0000000000000000
0x00000040097fbfd0: 0x0000000000000000 0x0000000000000000
0x00000040097fbfe0: 0x0000000000000000 0x0000000000000000
0x00000040097fbff0: 0x0000000000000000 0x0000000000000000
fatal error: traceback stuck
runtime stack:
runtime.throw({0x30b0158?, 0x63afaa0?})
/usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x4000111bc0 sp=0x4000111b90 pc=0x49cd0
runtime.gentraceback(0x4000111f98?, 0x76480?, 0x50e55d58a9b8?, 0x4000102b60, 0x0, 0x4009100c00, 0x20, 0x0, 0x0?, 0x0)
/usr/local/go/src/runtime/traceback.go:487 +0xdd8 fp=0x4000111f30 sp=0x4000111bc0 pc=0x70658
runtime.saveg(0x0?, 0x7d92c?, 0x4005e36600?, 0x4009100c00)
/usr/local/go/src/runtime/mprof.go:1181 +0x44 fp=0x4000111f90 sp=0x4000111f30 pc=0x42154
runtime.doRecordGoroutineProfile.func1()
/usr/local/go/src/runtime/mprof.go:1093 +0x48 fp=0x4000111fc0 sp=0x4000111f90 pc=0x420e8
runtime.systemstack()
/usr/local/go/src/runtime/asm_arm64.s:243 +0x6c fp=0x4000111fd0 sp=0x4000111fc0 pc=0x7d92c
@bcmills bcmills added compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 17, 2023
@bcmills bcmills changed the title Crashes with "runtime: traceback stuck" runtime: crashes with "runtime: traceback stuck" Aug 17, 2023
@mknyszek mknyszek added this to the Backlog milestone Aug 23, 2023
@cherrymui
Copy link
Member

It looks like it is stuck when collecting a goroutine profile, which takes stack traces for all goroutines. The place it is stuck is at the entry of a deferred function. It should never unwind to there. It looks like a stack slot to save the defer record, instead of a return PC. So the unwinding already went off at this point. It could be due to failing to unwind a frame lower down the stack, or the stack being corrupted.

@aciduck is your program a pure-Go program or it uses cgo? Does it use any unsafe code? Is there a way we can reproduce the failure? Thanks.

@cherrymui cherrymui added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 13, 2023
@aciduck
Copy link
Author

aciduck commented Sep 14, 2023

This happens in multiple services, some are compiled with cgo and some don't.
We don't use unsafe code directly, but it might be called from some dependency we have.
Unfortunately we can't reproduce it consistently or in a simple program. The only pattern we found is that this only happens if our MemoryMonitor code is running, and we suspect it is related to the fact it executes frequent syscalls.

@cherrymui
Copy link
Member

Could you show a more complete error message? There should be other goroutine stack dumps below, and my have information about how the unwinder get to that point (It is okay to have symbol names redacted). Thanks.

@aciduck
Copy link
Author

aciduck commented Oct 9, 2023

Full dump is in the attached file:

extract-2023-08-16T12_49_41.656Z - Copy.txt

@aciduck
Copy link
Author

aciduck commented Dec 25, 2023

Hi @cherrymui , is there anything else we can provide to help investigating this? We have upgraded to 1.21.4 in the past weeks and the problem still persists.

@cherrymui
Copy link
Member

cherrymui commented Dec 28, 2023

Sorry the the delay. Thanks for the full stack trace.

Reading the stack trace, it seems to fail when unwinding the stack at memory address [0x40097fa000,0x40097fc000)

runtime: traceback stuck. pc=0xfdfcf0 sp=0x40097fbf90
stack: frame={sp:0x40097fbf90, fp:0x40097fbf90} stack=[0x40097fa000,0x40097fc000)

This matches goroutine 594:

goroutine 594 [select, 2 minutes]:
runtime.gopark(0x400941fec0?, 0x2?, 0x2?, 0x0?, 0x400941fe84?)
/usr/local/go/src/runtime/proc.go:381 +0xe0 fp=0x40097fbd10 sp=0x40097fbcf0 pc=0x4c740
runtime.selectgo(0x40097fbec0, 0x400941fe80, 0x4007587d40?, 0x0, 0x2?, 0x1)
/usr/local/go/src/runtime/select.go:327 +0x68c fp=0x40097fbe20 sp=0x40097fbd10 pc=0x5d1ac
github.com/xxx/xxx/commonlib/memmonitor.(*memoryMonitor).run.func1()
/home/circleci/project/commonlib/memmonitor/monitor.go:73 +0xb0 fp=0x40097fbef0 sp=0x40097fbe20 pc=0x1b383f0
github.com/xxx/xxx/commonlib/waitgroup.(*SafeWaitGroup).Go.func1()
/home/circleci/project/commonlib/waitgroup/waitgroup.go:83 +0x24 fp=0x40097fbf00 sp=0x40097fbef0 pc=0xfe0684
github.com/xxx/xxx/commonlib/waitgroup.(*SafeErrorGroup).Go.func1()
/home/circleci/project/commonlib/waitgroup/waitgroup.go:40 +0x88 fp=0x40097fbf60 sp=0x40097fbf00 pc=0xfe02c8
golang.org/x/sync/errgroup.(*Group).Go.func1()
/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x5c fp=0x40097fbfd0 sp=0x40097fbf60 pc=0xfdfbfc

When the runtime crashes, it successfully unwound this stack. So it is weird that the first round of unwinding (for taking the goroutine profile) failed. It is possible that the goroutine has executed and changed its stack between the two unwinding. But given that it's been sitting there for 2 minutes, and the crash probably took much less time, it probably didn't change.

Besides, goroutine 105 is interesting:

goroutine 105 [select, 2 minutes]:
runtime.gopark(0x40097fbf48?, 0x3?, 0x80?, 0xd?, 0x40097fbefa?)
/usr/local/go/src/runtime/proc.go:381 +0xe0 fp=0x4004b92d70 sp=0x4004b92d50 pc=0x4c740
runtime.selectgo(0x4004b92f48, 0x40097fbef4, 0x40044f0a00?, 0x0, 0x3?, 0x1)
/usr/local/go/src/runtime/select.go:327 +0x68c fp=0x4004b92e80 sp=0x4004b92d70 pc=0x5d1ac
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(0x40005f5220)
/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.16.0/trace/batch_span_processor.go:293 +0xd4 fp=0x4004b92f90 sp=0x4004b92e80 pc=0x912a74
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func1()
/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.16.0/trace/batch_span_processor.go:128 +0x58 fp=0x4004b92fd0 sp=0x4004b92f90 pc=0x911e68
runtime.goexit()

At the second frame, runtime.selectgo, the second argument 0x40097fbef4 (which is pointer-typed) points to goroutine 594's stack, which should not happen. Perhaps something weird is going on with that stack...

It seems this failure occurs while taking a goroutine profile. Could you try

  • running the program without enabling goroutine profiles, and see if the error still occurs?
  • with Go 1.20.x, changing go119ConcurrentGoroutineProfile to false, and see if the error still occurs? You'll need to use a Go 1.20.x source tree and modify the source code, then rebuild your program. (This constant only exists in Go 1.20.x, not Go 1.21 or later)
  • running with Go 1.22rc1 (https://go.dev/dl/#go1.22rc1) or tip. If the error still occurs, this will print some additional information.

Thanks.

@cherrymui cherrymui removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 28, 2023
@lizthegrey
Copy link

lizthegrey commented Jan 29, 2024

We also continue to see even after fix to #54332, traceback stuck approximately once a month across our arm64 fleet under continuous profiling, including on most recent go1.21.6. Most recent dump here:

runtime: traceback stuck. pc=0x2c4670 sp=0x42091d9da0
stack: frame={sp:0x42091d9da0, fp:0x42091d9da0} stack=[0x420e72a000,0x420e72e000)

fatal error: traceback stuck

goroutine 0 [idle]:
runtime.throw({0x217a524?, 0x71700?})
        /usr/local/go/src/runtime/panic.go:1077 +0x40 fp=0x4000514620 sp=0x40005145f0 pc=0x4a8d0
runtime.(*unwinder).next(0x4000514888)
        /usr/local/go/src/runtime/traceback.go:486 +0x344 fp=0x40005146b0 sp=0x4000514620 pc=0x720a4
runtime.tracebackPCs(0x4000514888, 0x0?, {0x40005149e8, 0x40, 0x40})
        /usr/local/go/src/runtime/traceback.go:628 +0x58 fp=0x4000514840 sp=0x40005146b0 pc=0x723c8
runtime.sigprof(0x25f88, 0x4000514c48?, 0x4000514c58?, 0x4000502b60, 0x400050c000)
        /usr/local/go/src/runtime/proc.go:4969 +0x18c fp=0x4000514bf0 sp=0x4000514840 pc=0x5739c
runtime.sighandler(0x1b?, 0x400050c000?, 0x4000514c98?, 0x4000502b60?)
        /usr/local/go/src/runtime/signal_unix.go:642 +0xf4 fp=0x4000514c60 sp=0x4000514bf0 pc=0x61f94
runtime.sigtrampgo(0x1b, 0x4000514da0, 0x4000514e20)
        /usr/local/go/src/runtime/signal_unix.go:490 +0xd8 fp=0x4000514ce0 sp=0x4000514c60 pc=0x619f8
runtime.sigtramp()
        /usr/local/go/src/runtime/sys_linux_arm64.s:462 +0x54 fp=0x4000514da0 sp=0x4000514ce0 pc=0x832d4

I've symbolised 0x2c4670, it is:

$ go tool addr2line shepherd.full 
0x2c4670
net/http.(*conn).serve.func1
/usr/local/go/src/net/http/server.go:1864

https://github.com/golang/go/blob/go1.21.6/src/net/http/server.go#L1864-L1882

@cherrymui
Copy link
Member

Thanks @lizthegrey for the information.

$ go tool addr2line shepherd.full
0x2c4670
net/http.(*conn).serve.func1
/usr/local/go/src/net/http/server.go:1864
https://github.com/golang/go/blob/go1.21.6/src/net/http/server.go#L1864-L1882

This looks like it is the entry address of a deferred function. Could you confirm that it is indeed the function entry, or it is somewhere in the function prologue, perhaps with go tool objdump?

What is PC 0x25f88 ? It is the first argument of runtime.sigprof, i.e. where the profiling signal lands, and where the unwinding starts. That would be helpful to see why it unwinds to a function entry (or prologue). It looks like it is pointing to a runtime function?

Thanks.

@lizthegrey
Copy link

$ go tool addr2line shepherd.full
0x2c4670
net/http.(*conn).serve.func1
/usr/local/go/src/net/http/server.go:1864
go1.21.6/src/net/http/server.go#L1864-L1882

This looks like it is the entry address of a deferred function. Could you confirm that it is indeed the function entry, or it is somewhere in the function prologue, perhaps with go tool objdump?

Confirm, function entry.

TEXT net/http.(*conn).serve.func1(SB) /usr/local/go/src/net/http/server.go
  server.go:1864        0x2c4670                f9400b90                MOVD 16(R28), R16                                       
  server.go:1864        0x2c4674                d10083f1                SUB $32, RSP, R17                                       
  server.go:1864        0x2c4678                eb10023f                CMP R16, R17                                            
  server.go:1864        0x2c467c                54000e09                BLS 112(PC)                                             
  server.go:1864        0x2c4680                f8160ffe                MOVD.W R30, -160(RSP)                                   
  server.go:1864        0x2c4684                f81f83fd                MOVD R29, -8(RSP)                                       
  server.go:1864        0x2c4688                d10023fd                SUB $8, RSP, R29                                        
  server.go:1865        0x2c468c                910283e1                ADD $160, RSP, R1                                       
  server.go:1865        0x2c4690                91002020                ADD $8, R1, R0                                          
  server.go:1864        0x2c4694                f9400741                MOVD 8(R26), R1                                         
  server.go:1864        0x2c4698                f9002be1                MOVD R1, 80(RSP)                                        
  server.go:1864        0x2c469c                f9400b42                MOVD 16(R26), R2                                        
  server.go:1864        0x2c46a0                f90033e2                MOVD R2, 96(RSP)                                        
  server.go:1865        0x2c46a4                97f61867                CALL runtime.gorecover(SB)                              
  server.go:1865        0x2c46a8                f9001fe0                MOVD R0, 56(RSP)                                        
  server.go:1869        0x2c46ac                f90027e1                MOVD R1, 72(RSP)                                        
  server.go:1865        0x2c46b0                b4000200                CBZ R0, 16(PC)                                          
  server.go:1865        0x2c46b4                b001d09b                ADRP 60887040(PC), R27                                  
  server.go:1865        0x2c46b8                f947f363                MOVD 4064(R27), R3                                      
  server.go:1865        0x2c46bc                b4000043                CBZ R3, 2(PC)                                           
  server.go:1865        0x2c46c0                f9400463                MOVD 8(R3), R3                                          
  server.go:1865        0x2c46c4                eb03001f                CMP R3, R0                                              
  server.go:1865        0x2c46c8                54000060                BEQ 3(PC)                                               
  server.go:1865        0x2c46cc                b24003e3                ORR $1, ZR, R3                                          
  server.go:1865        0x2c46d0                14000009                JMP 9(PC)      

What is PC 0x25f88 ? It is the first argument of runtime.sigprof, i.e. where the profiling signal lands, and where the unwinding starts. That would be helpful to see why it unwinds to a function entry (or prologue). It looks like it is pointing to a runtime function?

$ go tool addr2line shepherd.full 
0x25f88
runtime.findObject
/usr/local/go/src/runtime/mbitmap.go:340

@aciduck
Copy link
Author

aciduck commented Feb 18, 2024

Hi, since upgrading to 1.21 the stack traces changed a bit, and it became obvious that the problem is triggered by the runtime profiling we are running every 30 seconds. We kept CPU and heap profile enabled and disabled goroutine profiling, and the problem almost disappeared.
Instead of a random container crashing almost daily, it happened only twice in the past 3 weeks. Both error dumps are attached.
extract-2024-02-18T09_43_34.174Z.txt
extract-2024-02-18T09_39_23.548Z.txt

@cherrymui
Copy link
Member

@aciduck have you tried Go 1.22 ? Also, in your crash dumps it seems some lines are missing or reordered, like, frames from different stacks are mixed together. Could you paste an unfiltered stack dump? (It is fine to redact function names.)

Also,

runtime: traceback stuck. pc=0x266f680 sp=0x400369f160
stack: frame={sp:0x400369f160, fp:0x400369f160} stack=[0x40167fe000,0x4016802000)
fatal error: traceback stuck

the SP and the frame are not in the bounds of the stack. This might be that the G and the stack don't match? Are these crashes from a cgo program or pure Go? Thanks.

@aciduck
Copy link
Author

aciduck commented Feb 24, 2024

@cherrymui We haven't updated to 1.22 yet, will probably happen in the coming weeks.
This output is the best we have, we don't get the output directly from the executables but from a log collecting service, so some lines might have been swapped.
Those two components were compiled with CGO_ENABLED=0.

@lizthegrey
Copy link

We've done our 1.22 upgrade, will report if we see any stuck tracebacks.

@lizthegrey
Copy link

lizthegrey commented Mar 6, 2024

@cherrymui: got our first stuck traceback since go1.22. Let me know what you need symbolised from this

runtime: traceback stuck. pc=0x25d80 sp=0x402ec0f2d0
stack: frame={sp:0x402ec0f2d0, fp:0x402ec0f2d0} stack=[0x402ec0e000,0x402ec10000)
0x000000402ec0f1d0:  0x3fc5c01a39fbd68c  0xd0d3df5a4e91cfca 
0x000000402ec0f1e0:  0x73a17ce7dbd37d1b  0x4e7c0e5d2b3f15a7 
0x000000402ec0f1f0:  0x64a1b1812af5a73d  0xffffffffffffffff 
0x000000402ec0f200:  0xffffffffffffffff  0xffffffffffffffff 
0x000000402ec0f210:  0xffffffffffffffff  0x82b34e68bd8b9149 
0x000000402ec0f220:  0xcc4956d8c5ab7c86  0xcf938da9c9928796 
0x000000402ec0f230:  0x713ee03c43e459c9  0xc56fd90763005b1a 
0x000000402ec0f240:  0x8f1bbcdc8f1bbcdc  0xca62c1d6ca62c1d6 
0x000000402ec0f250:  0x6ed9eba15a827999  0x00000000c6e15407 
0x000000402ec0f260:  0x0fb105ae0fb105ae  0x81e09f8c81e09f8c 
0x000000402ec0f270:  0x47686bec47686bec  0xc6608605c6608605 
0x000000402ec0f280:  0x44badf1b44badf1b  0xfb169498fb169498 
0x000000402ec0f290:  0x0bccfead0bccfead  0x14547546c6e6ffdc 
0x000000402ec0f2a0:  0x276b0427333f7161  0x000000005f83d1d4 
0x000000402ec0f2b0:  0x0605040702010003  0x000000402ec0f2c8 
0x000000402ec0f2c0:  0x0000000000025d80 <runtime.mapaccess2_faststr+0x00000000000003f0>  0x000000402ec0f3c8 
0x000000402ec0f2d0: >0x00000000005c4528 <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x0000000000000688>  0x000000414d184000 
0x000000402ec0f2e0:  0x000000414ce60000  0x0000000000020000 <runtime.mallocgc+0x0000000000000400> 
0x000000402ec0f2f0:  0x00000000005c460c <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x000000000000076c>  0x00000040d0456698 
0x000000402ec0f300:  0x0000004137ad3800  0x0000000000000100 
0x000000402ec0f310:  0x0000000000000100  0x0000000000000117 
0x000000402ec0f320:  0x00074b0d0006ccc0  0x0000000900000000 
0x000000402ec0f330:  0x0000000000000220  0x0000000c00000004 
0x000000402ec0f340:  0x0000000000000001  0x00074bb60007fff6 
0x000000402ec0f350:  0x0000021d00000563  0x0000001200074bc2 
0x000000402ec0f360:  0x0000004100074b0c  0x000000402ec0f3c8 
0x000000402ec0f370:  0x0000021d005c44d8  0x000000000000000b 
0x000000402ec0f380:  0x0000000000080000 <runtime.goroutineProfileWithLabelsConcurrent.func2+0x0000000000000010>  0x0000000001000000 <github.com/bradfitz/gomemcache/memcache.(*ServerList).SetServers+0x0000000000000180> 
0x000000402ec0f390:  0x00000000000010b9  0x0000000000020000 <runtime.mallocgc+0x0000000000000400> 
0x000000402ec0f3a0:  0x0000000000000b0f  0x0000004153870000 
0x000000402ec0f3b0:  0x000000414d184048  0x000000414d1c4048 
0x000000402ec0f3c0:  0x000000414d3f0000  0x000000402ec0f558 
fatal error: traceback stuck

goroutine 0 gp=0x4000802380 m=18 mp=0x4000780808 [idle]:
runtime.throw({0x21b3a9f?, 0xfffffff000000000?})
        /usr/local/go/src/runtime/panic.go:1023 +0x40 fp=0x4000d0e730 sp=0x4000d0e700 pc=0x4cd90
runtime.(*unwinder).next(0x4000d0e978)
        /usr/local/go/src/runtime/traceback.go:480 +0x340 fp=0x4000d0e7c0 sp=0x4000d0e730 pc=0x79b70
runtime.tracebackPCs(0x4000d0e978, 0x408f4b72d8?, {0x4000d0e9d8, 0x40, 0x40})
        /usr/local/go/src/runtime/traceback.go:622 +0x5c fp=0x4000d0e940 sp=0x4000d0e7c0 pc=0x79e9c
runtime.sigprof(0x8a620, 0x4000d0ec48?, 0x4000d0ec58?, 0x401de59340, 0x4000780808)
        /usr/local/go/src/runtime/proc.go:5375 +0x18c fp=0x4000d0ebf0 sp=0x4000d0e940 pc=0x5b4dc
runtime.sighandler(0x1b?, 0x4000780808?, 0x4000d0ec98?, 0x40041f48c0?)
        /usr/local/go/src/runtime/signal_unix.go:642 +0x61c fp=0x4000d0ec60 sp=0x4000d0ebf0 pc=0x66c7c
runtime.sigtrampgo(0x1b, 0x4000d0eda0, 0x4000d0ee20)
        /usr/local/go/src/runtime/signal_unix.go:490 +0xd8 fp=0x4000d0ece0 sp=0x4000d0ec60 pc=0x661b8
runtime.sigtramp()
        /usr/local/go/src/runtime/sys_linux_arm64.s:462 +0x54 fp=0x4000d0eda0 sp=0x4000d0ece0 pc=0x8ac14

goroutine 4203 gp=0x401de59340 m=18 mp=0x4000780808 [running]:

@aciduck
Copy link
Author

aciduck commented Mar 27, 2024

Hi, we upgraded to Go 1.22.1 and the problem keeps happening sporadically. Two such crashes are attached:
extract-2024-03-27T12_14_33.042Z.txt
extract-2024-03-27T12_12_46.194Z.txt

@lizthegrey
Copy link

Another crash, let me know if you need me to symbolise it.

goroutine 4458 gp=0x401229b500 m=25 mp=0x400052e808 [running]:
	runtime/sys_linux_arm64.s:462 +0x54 fp=0x4000bfeda0 sp=0x4000bfece0 pc=0x8ac14
runtime.sigtramp()
	runtime/signal_unix.go:490 +0xd8 fp=0x4000bfece0 sp=0x4000bfec60 pc=0x661b8
runtime.sigtrampgo(0x1b, 0x4000bfeda0, 0x4000bfee20)
	runtime/signal_unix.go:642 +0x61c fp=0x4000bfec60 sp=0x4000bfebf0 pc=0x66c7c
runtime.sighandler(0x1b?, 0x400052e808?, 0x4000bfec98?, 0x4000ba0fc0?)
	runtime/proc.go:5375 +0x18c fp=0x4000bfebf0 sp=0x4000bfe940 pc=0x5b4dc
runtime.sigprof(0x8a620, 0x4000bfec48?, 0x4000bfec58?, 0x401229b500, 0x400052e808)
	runtime/traceback.go:622 +0x5c fp=0x4000bfe940 sp=0x4000bfe7c0 pc=0x79e9c
runtime.tracebackPCs(0x4000bfe978, 0x4018186f58?, {0x4000bfe9d8, 0x40, 0x40})
	runtime/traceback.go:480 +0x340 fp=0x4000bfe7c0 sp=0x4000bfe730 pc=0x79b70
runtime.(*unwinder).next(0x4000bfe978)
	runtime/panic.go:1023 +0x40 fp=0x4000bfe730 sp=0x4000bfe700 pc=0x4cd90
runtime.throw({0x21e954a?, 0xfffffff000000000?})
goroutine 0 gp=0x4000ba0e00 m=25 mp=0x400052e808 [idle]:
fatal error: traceback stuck
0x0000004399cc93c0:  0x00000044d6ca2000  0x0000004399cc9558 
0x0000004399cc93b0:  0x000000452dcd0048  0x000000452dd10048 
0x0000004399cc93a0:  0x00000000000002f3  0x000000452de12000 
0x0000004399cc9390:  0x0000000000000da1  0x0000000000020000 <runtime.mallocgc+0x0000000000000400> 
0x0000004399cc9380:  0x0000000000040000 <runtime.(*pageAlloc).update+0x0000000000000490>  0x0000000001000000 <github.com/honeycombio/hound/keys.KeyList.UserIDs+0x00000000000002c0> 
0x0000004399cc9370:  0x00001c8f005c3568  0x0000000000000000 
0x0000004399cc9360:  0x0000004400024e45  0x0000004399cc93c8 
0x0000004399cc9350:  0x00001c8f0000868c  0x0000000500025047 
0x0000004399cc9340:  0x0000000000000001  0x000250420003fff6 
0x0000004399cc9330:  0x0000000000001c92  0x0000000100000010 
0x0000004399cc9320:  0x00024e46000233b3  0x0000000200000006 
0x0000004399cc9310:  0x0000000000000100  0x000000000000018f 
0x0000004399cc9300:  0x00000044c3d15400  0x0000000000000100 
0x0000004399cc92f0:  0x00000000005c369c <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x000000000000076c>  0x00000040b3d80e78 
0x0000004399cc92e0:  0x00000044e848c000  0x0000000000020000 <runtime.mallocgc+0x0000000000000400> 
0x0000004399cc92d0: >0x00000000005c35b8 <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x0000000000000688>  0x000000452dcd0000 
0x0000004399cc92c0:  0x0000000000015900 <indexbody+0x0000000000000110>  0x0000004399cc93c8 
0x0000004399cc92b0:  0x0605040702010003  0x0000004399cc92c8 
0x0000004399cc92a0:  0x4ed8aa4a391c0cb3  0x000000007bdcb9b0 
0x0000004399cc9290:  0xb22567d4b22567d4  0x1e376c0819a4c116 
0x0000004399cc9280:  0xf5c2a259f5c2a259  0x8a78a8298a78a829 
0x0000004399cc9270:  0x97263d2997263d29  0x3a9cc8fc3a9cc8fc 
0x0000004399cc9260:  0x663d6a46663d6a46  0xd403e495d403e495 
0x0000004399cc9250:  0x6ed9eba15a827999  0x0000000043b3cc49 
0x0000004399cc9240:  0x8f1bbcdc8f1bbcdc  0xca62c1d6ca62c1d6 
0x0000004399cc9230:  0xf2d710aef8f89c7e  0xf26b17cdac4d3039 
0x0000004399cc9220:  0xbf363a3e9d7e0b9e  0x934a39ccb5589239 
0x0000004399cc9210:  0x3c5b0fbb53a804d7  0x19f140fa74ab92f0 
0x0000004399cc9200:  0xba6eb3e8ab4e6384  0xf54c2e3362ed086f 
0x0000004399cc91f0:  0x37ba530109445563  0xe64d8eecc33e1c87 
0x0000004399cc91e0:  0xd07788fc7b01bab7  0x50ebbb82c7975f4b 
0x0000004399cc91d0:  0x73531f6ab7360646  0x645c308cfb8ce111 
stack: frame={sp:0x4399cc92d0, fp:0x4399cc92d0} stack=[0x4399cc8000,0x4399cca000)
runtime: traceback stuck. pc=0x15900 sp=0x4399cc92d0

@lizthegrey
Copy link

Can we please get an update here? I know the 1.23 window is soon, and it would be nice to get a fix or additional telemetry into that release.

@cherrymui
Copy link
Member

Sorry for the delay. I still plan to look into this, and address it in 1.23 (we can still fix bugs during the freeze). Thanks.

@cherrymui
Copy link
Member

@aciduck in extract-2024-03-27T12_12_46.194Z.txt

runtime.sigprof(0x8a2cc, 0x4000150c48?, 0x4000150c58?, 0x4000005180, 0x400012d008)

The signal PC (where the traceback starts) is 0x8a2cc, which is close to

runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_arm64.s:200 +0x8 fp=0x4007620e60 sp=0x4007620e50 pc=0x8a218

So it may be something in the runtime, and probably assembly code. Is it possible to get what that PC is?

Also, from the SP,

runtime: traceback stuck. pc=0x170340 sp=0x4005a13ae0
stack: frame={sp:0x4005a13ae0, fp:0x4005a13ae0} stack=[0x4005a0e000,0x4005a16000)

it seems to be on this goroutine

goroutine 713 gp=0x4006610e00 m=7 mp=0x4000180808 [runnable]:
encoding/json.stateInString(0x4017495e00?, 0x6f?)
/usr/local/go/src/encoding/json/scanner.go:337 +0x1b4 fp=0x4005a13a10 sp=0x4005a13a10 pc=0x172544
encoding/json.appendCompact({0x40097dea80, 0x0, 0xa80}, {0x40097de000, 0x927, 0xa80}, 0x1)
/usr/local/go/src/encoding/json/indent.go:70 +0x44c fp=0x4005a13b00 sp=0x4005a13a10 pc=0x17002c
encoding/json.marshalerEncoder(0x4013aaaf00, {0x3572b20?, 0x4004ee0af0?, 0x35640a0?}, {0x20?, 0x2b?})
/usr/local/go/src/encoding/json/encode.go:444 +0x128 fp=0x4005a13bc0 sp=0x4005a13b00 pc=0x169f88

So apparently when it crashes, it actually can unwind the stack successfully. It starts with 0x4005a13a10, and the next frame is at 0x4005a13b00. Why is the profiling unwinding stuck in the middle of that frame at 0x4005a13ae0? Also, the first frame has FP=SP, which looks like it is in the prologue of calling morestack. Maybe it is related to profiling from morestack?

@lizthegrey 's cases are weird.

runtime: traceback stuck. pc=0x25d80 sp=0x402ec0f2d0
0x000000402ec0f2c0: 0x0000000000025d80 <runtime.mapaccess2_faststr+0x00000000000003f0> 0x000000402ec0f3c8
0x000000402ec0f2d0: >0x00000000005c4528 <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x0000000000000688> 0x000000414d184000

It stuck at SP=0x402ec0f2d0. The saved LR, or next PC, should be *SP=0x5c4528 (https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=373). But that is not the one reported on the first line, which is 0x25d80, which is actually *(SP-16). This is super weird. Unless these PC, SP, and LR are not from unwinding a frame, but something else? It is not the first frame where the traceback starts, as the argument of sigprof (the signal PC) is 0x8a620. A stack transition? In that case the PC, SP, and LR are from gp.sched that is populated in morestack. But I don't see how morestack could produce such PC and SP, either... Nevertheless, it seems also related to morestack...

@lizthegrey
Copy link

lizthegrey commented Jun 13, 2024

It's clearly not a very frequent crash, so it's an edge case of an edge case of an edge case we suspect.

Screenshot 2024-06-13 13 38 56

@cherrymui
Copy link
Member

For unwinding for profiling, we are already a bit permissive -- in a number of cases if we can't find a valid frame, instead of throwing an error, we simply give up. Maybe we should do this here as well: if it is going to fall into an unwinding loop, just give up and leave a truncated stack trace.

On the other hand, the "traceback stuck" error did uncover a few real bugs. I'm not sure...

@lizthegrey
Copy link

For unwinding for profiling, we are already a bit permissive -- in a number of cases if we can't find a valid frame, instead of throwing an error, we simply give up. Maybe we should do this here as well: if it is going to fall into an unwinding loop, just give up and leave a truncated stack trace.

On the other hand, the "traceback stuck" error did uncover a few real bugs. I'm not sure...

How about a compromise? Leave a truncated stack, but output the error message somewhere, so that we can pick it up and use it for debugging in the rare case where it does happen?

@AdallomRoy
Copy link

Can we get an update here? This is still happening from time to time. There are a few suggestions in this thread to add additional debug info if it is still a mystery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

No branches or pull requests

6 participants