Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eth: close engine before handler for graceful shutdown #1189

Merged
merged 2 commits into from
Apr 15, 2024

Conversation

manav2401
Copy link
Contributor

@manav2401 manav2401 commented Mar 16, 2024

Description

This PR attempts to fix #1146.

TL;DR: Move the precedence of closing consensus engine above eth handler to prevent deadlock during block processing (when heimdall isn't responsive).

Note; More evaluation needs to be done to confirm if this change doesn't have any repercussions.

Reproduction steps and explanation

This issue occurs when we use a wrong / unresponsive heimdall endpoint in bor and then try to stop bor (by Ctrl+C or kill signals). Following command should suffice to reproduce it locally.

bor server --bor.heimdall=http://1.2.3.4:1234

Once the block synchronisation starts, the process doesn't gracefully shutdown if tried to. On further debugging, I could narrow down to the order in which the backend handles shutting different processes down (like handler, txpool, miner, consensus engine, blockchain, etc). The handler being one of the core process for syncing takes precedence. On digging further, it turns out that handler keeps waiting for a wait group to free up. The one which doesn't free up is used by the loop() function of sync module which further waits for block processing to be completed. More evidence could be gathered from this stack trace below. Ultimately it keeps waiting for ProcessBlock to complete which waits for consensus.Finalize to complete. As the heimdall endpoint is wrong, consensus keeps trying to reach out endlessly to heimdall else it can't validate the block and proceed. We haven't closed the consensus.Engine first so the stop signal hasn't reached yet to heimdall client (in bor). This causes a deadlock and causes bor to halt. The PR simply closes the engine first before the handler which unblocks block processing and it can gracefully exit (with an error though).

goroutine 858 gp=0xc009777340 m=nil [chan receive]:
runtime.gopark(0xc00ac2d5f0?, 0xc00ac7c800?, 0x58?, 0x6?, 0xac76e00?)
	/Users/manav/.go/src/runtime/proc.go:402 +0xce fp=0xc00a7887e0 sp=0xc00a7887c0 pc=0x10271d66e
runtime.chanrecv(0xc00ac66900, 0xc00a788a00, 0x1)
	/Users/manav/.go/src/runtime/chan.go:583 +0x3bf fp=0xc00a788858 sp=0xc00a7887e0 pc=0x1026e77ff
runtime.chanrecv1(0x9d23449bd1284f65?, 0x1bf7f13870f21210?)
	/Users/manav/.go/src/runtime/chan.go:442 +0x12 fp=0xc00a788880 sp=0xc00a788858 pc=0x1026e7412
github.com/ethereum/go-ethereum/core.(*BlockChain).ProcessBlock(0xc00051f408, 0xc00a6ada40, 0xc00ac84c88)
	/Users/manav/polygon/bor/core/blockchain.go:598 +0x385 fp=0xc00a788a78 sp=0xc00a788880 pc=0x1031e8185
github.com/ethereum/go-ethereum/core.(*BlockChain).insertChain(0xc00051f408, {0xc00a687908, 0x180, 0x180}, 0x1)
	/Users/manav/polygon/bor/core/blockchain.go:2224 +0x1d38 fp=0xc00a7896e8 sp=0xc00a788a78 pc=0x1031fa4f8
github.com/ethereum/go-ethereum/core.(*BlockChain).InsertChain(0xc00051f408, {0xc00a687908, 0x180, 0x180})
	/Users/manav/polygon/bor/core/blockchain.go:1949 +0xb16 fp=0xc00a789a08 sp=0xc00a7896e8 pc=0x1031f85d6
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).importBlockResults(0xc000b90e00, {0xc00a686c88, 0x180, 0x105b70f40?})
	/Users/manav/polygon/bor/eth/downloader/downloader.go:1644 +0x544 fp=0xc00a789d58 sp=0xc00a789a08 pc=0x1033988c4
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).processFullSyncContent(0xc000b90e00, 0x0, 0x0)
	/Users/manav/polygon/bor/eth/downloader/downloader.go:1609 +0x16c fp=0xc00a789f60 sp=0xc00a789d58 pc=0x103397d8c
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).syncWithPeer.func9()
	/Users/manav/polygon/bor/eth/downloader/downloader.go:692 +0x1f fp=0xc00a789f88 sp=0xc00a789f60 pc=0x103391cdf
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).spawnSync.func1()
	/Users/manav/polygon/bor/eth/downloader/downloader.go:707 +0x66 fp=0xc00a789fe0 sp=0xc00a789f88 pc=0x103392246
runtime.goexit({})
	/Users/manav/.go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00a789fe8 sp=0xc00a789fe0 pc=0x1027567a1
created by github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).spawnSync in goroutine 688
	/Users/manav/polygon/bor/eth/downloader/downloader.go:707 +0x6e

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewer or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • In case link the PR here:
  • This PR requires changes to matic-cli
    • In case link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on mumbai
  • I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

@marcello33
Copy link
Contributor

LGTM, only unit-tests to be fixed.

Copy link
Member

@pratikspatil024 pratikspatil024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noice!

Copy link

This PR is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the Stale label Apr 11, 2024
@manav2401 manav2401 removed the Stale label Apr 12, 2024
@manav2401 manav2401 merged commit f6969db into develop Apr 15, 2024
11 of 12 checks passed
@manav2401 manav2401 deleted the manav/fix-graceful-shutdown branch April 15, 2024 05:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants