eth: close engine before handler for graceful shutdown #1189
Description
This PR attempts to fix #1146.
TL;DR: Close the consensus engine before the eth handler during shutdown to prevent a deadlock during block processing (when heimdall isn't responsive).
Note: More evaluation is needed to confirm that this change has no repercussions.
Reproduction steps and explanation
This issue occurs when bor is configured with a wrong or unresponsive heimdall endpoint and we then try to stop it (with Ctrl+C or a kill signal). The following command should suffice to reproduce it locally.
Once block synchronisation starts, the process doesn't shut down gracefully when asked to. On further debugging, I narrowed it down to the order in which the backend shuts down its components (handler, txpool, miner, consensus engine, blockchain, etc.). The handler, being one of the core processes for syncing, is stopped first. On digging further, it turns out that the handler keeps waiting for a wait group to be released. The wait group that is never released is used by the loop() function of the sync module, which in turn waits for block processing to complete. More evidence can be gathered from the stack trace below.
Ultimately, it keeps waiting for ProcessBlock to complete, which waits for consensus.Finalize to complete. As the heimdall endpoint is wrong, consensus keeps retrying heimdall endlessly, because without a response it can't validate the block and proceed. Since consensus.Engine hasn't been closed yet, the stop signal hasn't reached the heimdall client (in bor). This causes a deadlock and bor hangs. The PR simply closes the engine before the handler, which unblocks block processing so that bor can exit gracefully (with an error, though).
Changes
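The ordering problem described above can be illustrated with a small standalone sketch. This is not bor's code: `engine`, `handler`, `shutdownCompletes`, and the timeout are hypothetical stand-ins, assuming only what the description states (the handler's stop waits on a wait group held by a sync loop, and that loop is stuck inside a finalize call that retries until the engine is closed).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// engine is a hypothetical stand-in for the consensus engine. Finalize keeps
// retrying an unreachable endpoint until Close is called.
type engine struct {
	closeCh chan struct{}
	once    sync.Once
}

func newEngine() *engine { return &engine{closeCh: make(chan struct{})} }

func (e *engine) Finalize() error {
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-e.closeCh:
			return errors.New("engine closed while finalizing")
		case <-ticker.C:
			// Simulated failed request to the endpoint; retry.
		}
	}
}

// Close signals every pending Finalize call to give up. Safe to call twice.
func (e *engine) Close() { e.once.Do(func() { close(e.closeCh) }) }

// handler is a hypothetical stand-in for the eth handler. Stop blocks until
// the sync loop goroutine has returned.
type handler struct{ wg sync.WaitGroup }

func (h *handler) Start(e *engine) {
	h.wg.Add(1)
	go func() {
		defer h.wg.Done()
		_ = e.Finalize() // block processing stuck in the engine
	}()
}

func (h *handler) Stop() { h.wg.Wait() }

// shutdownCompletes reports whether shutdown finishes within a short timeout
// for the given ordering.
func shutdownCompletes(engineFirst bool) bool {
	e, h := newEngine(), &handler{}
	h.Start(e)

	done := make(chan struct{})
	go func() {
		defer close(done)
		if engineFirst {
			e.Close() // unblocks Finalize, so the sync loop can return
			h.Stop()
		} else {
			h.Stop() // waits on a loop that only ends once the engine closes
			e.Close()
		}
	}()

	select {
	case <-done:
		return true
	case <-time.After(100 * time.Millisecond):
		e.Close() // release the stuck goroutines before reporting the hang
		return false
	}
}

func main() {
	fmt.Println("handler closed first, shutdown completes:", shutdownCompletes(false))
	fmt.Println("engine closed first, shutdown completes:", shutdownCompletes(true))
}
```

In this sketch, stopping the handler first never returns on its own, because the only thing that can end the retry loop is the engine's close signal. Closing the engine first lets the finalize call return with an error, which matches the "exits gracefully, with an error though" behaviour described above.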
Checklist
Cross repository changes
Testing