Nomad server daemon hung with 100% CPU #8163
Comments
ah sorry, didn't look far enough back for other issues. |
Hi @kung-foo, @shoenig. I was looking at the goroutine stack dump (we are thinking about upgrading our prod environment to 0.11.3, but we are afraid of this) and saw something that might help a bit. Judging by the logs, the server stopped working at Jun 14 10:03:12 and stayed that way until the SIGABRT signal at Jun 15 09:13:08. That's about 1390 minutes. There are several things that seem to have been blocked since that time. What was weird to me was this goroutine:
The code of blocked_evals is the following:

    // prune is a long lived function that prunes unnecessary objects on a timer.
    func (b *BlockedEvals) prune(stopCh <-chan struct{}) {
        ticker := time.NewTicker(pruneInterval) // pruneInterval is a constant set to 5 minutes
        defer ticker.Stop()
        for {
            select {
            case <-stopCh:
                return
            case <-ticker.C:
                b.pruneUnblockIndexes()
            }
        }
    }

That shouldn't be blocked for that long. I'm not an expert in Go, but this goroutine should run at some point, if I'm not mistaken. Could this be related to golang/go#38051 (comment)? |
@jorgemarey, thanks for pointing this out! |
Here's some debug output that might be interesting: spinning on sched_yield
current goroutine dump:
other delve tidbits:
|
I've been following these (3-4) issues closely, and until today we hadn't encountered this problem. But today it happened: the server was gone and the Nomad process was running at full CPU capacity (100% on each core). UPDATE: This doesn't just affect Nomad server nodes. We also had a Nomad client node with the same issue (100% CPU; restarting the Nomad process fixed it). Version details
Stacktrace (via SIGABRT) of server node: https://gist.github.com/rkettelerij/7df7ee22ea76292182e715cd05a048f9 |
This happened again today, this time on a Nomad client (not server). For completeness' sake I'm posting the stacktrace of the client: https://gist.github.com/rkettelerij/aad2738ae23df519515dcba82af7b953 |
It looks like @jorgemarey was absolutely right. |
We've updated all our servers to |
Thanks for the update @kung-foo ! Will close this out, definitely let us know if the issue arises again with Nomad v0.11.3 or later. |
We've been running 0.11.3 for 7 days now without any issues. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Operating system and Environment details
Issue
Nomad server unresponsive. Daemon hung with 100% CPU.
Azure's view of the CPU usage:
Nomad Server logs
Logs looked normal until they just stopped. (SIGABRT is from me.)
goroutine stack dump: