-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harden shutdown logic #1037
Harden shutdown logic #1037
Conversation
@@ -509,8 +514,10 @@ func (i *cmdInvocation) setupInterruptHandler() { | |||
case 0: | |||
fmt.Println(shutdownMessage) | |||
go func() { | |||
wg.Add(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is racey-- the Add should go in the same goroutine as the wg gets initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the add always needs to happen before a goroutine is spun off
@torarnv this is looking good! on the wait group, maybe something like: type IntrCloser struct {
c chan os.Signal
wg sync.WaitGroup
}
func NewIntrCloser() *IntrCloser {
var ic := &IntrCloser{}
ic.c = allInterruptSignals()
ic.wg.Add(1)
return ic
}
func (ic *IntrCloser) Done() {
ic.wg.Done()
}
func (ic *IntrCloser) Close() error {
signal.Stop(c)
ic.wg.Wait()
return nil
}
func (i *cmdInvocation) setupInterruptHandler() *sync.WaitGroup {
intrc := NewIntrCloser()
...
// make sure to call intrc.Done()
...
return intrc
} |
can actually just replace the waitgroup with a channel: type IntrCloser struct {
c chan os.Signal
done chan struct{}
}
func NewIntrCloser() *IntrCloser {
var ic := &IntrCloser{}
ic.c = allInterruptSignals()
ic.done = make(chan struct{})
return ic
}
func (ic *IntrCloser) Done() {
ic.done<- struct{}{}
}
func (ic *IntrCloser) Close() error {
signal.Stop(c)
<-ic.exit
return nil
}
func (i *cmdInvocation) setupInterruptHandler() *sync.WaitGroup {
intrc := NewIntrCloser()
...
// make sure to call intrc.Done()
...
return intrc
}
|
@torarnv done with CR. gave suggestion on how to deal with the waitgroup problem, but no need to do it that way. |
hey @torarnv! i'd love to merge this in soon. any chance you could handle the CR comments? else i can do it. |
@jbenet Hey! Sorry, had some stuff come up. I'll look at this today! |
@torarnv will want to rebase-- i think diverged |
6c64360
to
9914ca5
Compare
#1081 got in there, thought github might detect the dependency? |
@torarnv when we merge the other one, the merged commits will disappear from this listing |
Need to double-check the tests, but general feedback welcome! |
a04cad4
to
ef409c0
Compare
btw @torarnv one thing to do is run (every commit has to pass so that we can rebase easily in the future-- it makes it really annoying if we do not follow this practice.) |
(will CR later tonight) |
@jbenet all tests pass locally for each commit in the PR 🎆 |
select { | ||
case <-ctx.Context.Done(): | ||
log.Info("Gracefully shut down daemon") | ||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of the default here? Wouldnt we want to wait and print the log message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's based on the possibility that the node/daemon may shut down through other means, eg through an API call or similar that wouldn't cancel the context, only close the node. In that case we wouldn't want to block. But if that's not really an option, or we do want to go through the context for those shutdowns as well, it shouldn't have a default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, thanks!
If I could get a couple comments on the new struct and functions you added, that would be awesome. Otherwise this all looks good to me. |
You mean comments from me? Or you want to comment on them? 😃 |
@@ -475,59 +469,62 @@ func writeHeapProfileToFile() error { | |||
return pprof.WriteHeapProfile(mprof) | |||
} | |||
|
|||
// listen for and handle SIGTERM | |||
func (i *cmdInvocation) setupInterruptHandler() { | |||
type IntrHandler struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on this struct
Pointed out the two i'd like commented, your commit messages are far nicer than mine, thank for you that :) . I mostly like having the comments on structs and functions to make it easy to grok the code through GoDoc |
Will be great to have this in. |
(this will probably need rebasing on master, sorry. happy to rebase it for you if desired) |
This reverts commit f74e71f. The 'Online' flag of the command context does not seem to be set in any code paths, at least not when running commands such as 'ipfs daemon' or 'ipfs ping'. The result after f74e71f is that we never shutdown cleanly, as we'll always os.Exit(0) from the interrupt handler. The os.Exit(0) itself is also dubious, as conceptually the interrupt handler should ask whatever is stalling to stop stalling, so that main() can return like normal. Exiting with -1 in error cases where the interrupt handler is unable to stop the stall is fine, but the normal case of interrupting cleanly should exit through main().
The server may stay alive for quite a while due to waiting on open connections to close before shutting down. We should find ways to terminate these connections in a more controlled manner, but in the meantime it's helpful to be able to see why a shutdown of the ipfs daemon is taking so long.
When closing a node, the node itself only takes care of tearing down its own children. As corehttp sets up a server based on a node, it needs to also ensure that the server is accounted for when determining if the node has been fully closed.
Once the server is asked to shut down, we stop accepting new connections, but the 'manners' graceful shutdown will wait for all existing connections closed to close before finishing. For keep-alive connections this will never happen unless the client detects that the server is shutting down through the ipfs API itself, and closes the connection in response. This is a problem e.g. with the webui's connections visualization, which polls the swarm/peers endpoint once a second, and never detects that the API server was shut down. We can mitigate this by telling the server to disable keep-alive, which will add a 'Connection: close' header to the next HTTP response on the connection. A well behaving client should then treat that correspondingly by closing the connection. Unfortunately this doesn't happen immediately in all cases, presumably depending on the keep-alive timeout of the browser that set up the connection, but it's at least a step in the right direction.
If a command invocation such as 'daemon' is interrupted, the interrupt handler asks the node to close. The closing of the node will result in the command invocation finishing, and possibly returning from main() before the interrupt handler is done. In particular, the info logging that a graceful shutdown was completed may never reach reach stdout. As the whole point of logging "Gracefully shut down." is to give confidence when debugging that the shutdown was clean, this is slightly unfortunate. The interrupt handler needs to be set up in main() instead of Run(), so that we can defer the closing of the interrupt handler until just before returning from main, not when Run() returns with a streaming result reader.
The context passed on from main() may change before we hit callCommand, so setting it in Parse is a bit optimistic.
The context may be cancelled while a request is in flight. We need to handle this and cancel the request. The code is based on the ideas from https://blog.golang.org/context
When the response includes the X-Chunked-Output header, we treat that as channel output, and fire up a goroutine to decode the chunks. This routine need to look for context cancellation so that it can exit cleanly.
Instead of assuming the command is the daemon command and closing the node, which resulted in bugs like ipfs#1053, we cancel the context and let the context children detect the cancellation and gracefully clean up after themselves. The shutdown logging has been moved into the daemon command, where it makes more sense, so that commands like ping will not print out the same output on cancellation.
Instead of just terminating right there and then, we cancel the context, and let the daemon exit cleanly. This make take a few seconds, as the node builder and its child processes do not care too much about the context state while building nodes, but this can be improved by injecting checks for ctx.Done() before time-consuming steps.
ef409c0
to
bfd1211
Compare
Added comments and rebased. |
@torarnv this is now in master. thank you very much! |
Awesome! 🎉 Let me know if there are regressions the testsuite didn't catch, or similar issues/related areas that I could tackle as an extension of this. |
@torarnv no issues so far. so-- related issues are daemon init + startup. there's enough issues around this to warrant a whole sprint: https://github.com/ipfs/go-ipfs/labels/daemon%20%2B%20init may be worth picking small ones of those or scoping out what changes need to happen (there's plenty of cmdslib crap to do too there-- cc @whyrusleeping ) |
Fixes in my quest to make the daemon shut down cleanly on Ctrl+C. Having the daemon stuck, especially without any logging/info to the user of why it's stuck, gives a bad impression to new users, as it's not evident that pressing Ctrl+C a second time will not interrupt an important task and corrupt something.
Spawned from ipfs/ipfs-webui#28