eth: close engine before handler for graceful shutdown #1189

manav2401 · 2024-03-16T07:09:22Z

Description

This PR attempts to fix #1146.

TL;DR: Move the precedence of closing consensus engine above eth handler to prevent deadlock during block processing (when heimdall isn't responsive).

Note; More evaluation needs to be done to confirm if this change doesn't have any repercussions.

Reproduction steps and explanation

This issue occurs when we use a wrong / unresponsive heimdall endpoint in bor and then try to stop bor (by Ctrl+C or kill signals). Following command should suffice to reproduce it locally.

bor server --bor.heimdall=http://1.2.3.4:1234

Once the block synchronisation starts, the process doesn't gracefully shutdown if tried to. On further debugging, I could narrow down to the order in which the backend handles shutting different processes down (like handler, txpool, miner, consensus engine, blockchain, etc). The handler being one of the core process for syncing takes precedence. On digging further, it turns out that handler keeps waiting for a wait group to free up. The one which doesn't free up is used by the loop() function of sync module which further waits for block processing to be completed. More evidence could be gathered from this stack trace below. Ultimately it keeps waiting for ProcessBlock to complete which waits for consensus.Finalize to complete. As the heimdall endpoint is wrong, consensus keeps trying to reach out endlessly to heimdall else it can't validate the block and proceed. We haven't closed the consensus.Engine first so the stop signal hasn't reached yet to heimdall client (in bor). This causes a deadlock and causes bor to halt. The PR simply closes the engine first before the handler which unblocks block processing and it can gracefully exit (with an error though).

goroutine 858 gp=0xc009777340 m=nil [chan receive]:
runtime.gopark(0xc00ac2d5f0?, 0xc00ac7c800?, 0x58?, 0x6?, 0xac76e00?)
	/Users/manav/.go/src/runtime/proc.go:402 +0xce fp=0xc00a7887e0 sp=0xc00a7887c0 pc=0x10271d66e
runtime.chanrecv(0xc00ac66900, 0xc00a788a00, 0x1)
	/Users/manav/.go/src/runtime/chan.go:583 +0x3bf fp=0xc00a788858 sp=0xc00a7887e0 pc=0x1026e77ff
runtime.chanrecv1(0x9d23449bd1284f65?, 0x1bf7f13870f21210?)
	/Users/manav/.go/src/runtime/chan.go:442 +0x12 fp=0xc00a788880 sp=0xc00a788858 pc=0x1026e7412
github.com/ethereum/go-ethereum/core.(*BlockChain).ProcessBlock(0xc00051f408, 0xc00a6ada40, 0xc00ac84c88)
	/Users/manav/polygon/bor/core/blockchain.go:598 +0x385 fp=0xc00a788a78 sp=0xc00a788880 pc=0x1031e8185
github.com/ethereum/go-ethereum/core.(*BlockChain).insertChain(0xc00051f408, {0xc00a687908, 0x180, 0x180}, 0x1)
	/Users/manav/polygon/bor/core/blockchain.go:2224 +0x1d38 fp=0xc00a7896e8 sp=0xc00a788a78 pc=0x1031fa4f8
github.com/ethereum/go-ethereum/core.(*BlockChain).InsertChain(0xc00051f408, {0xc00a687908, 0x180, 0x180})
	/Users/manav/polygon/bor/core/blockchain.go:1949 +0xb16 fp=0xc00a789a08 sp=0xc00a7896e8 pc=0x1031f85d6
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).importBlockResults(0xc000b90e00, {0xc00a686c88, 0x180, 0x105b70f40?})
	/Users/manav/polygon/bor/eth/downloader/downloader.go:1644 +0x544 fp=0xc00a789d58 sp=0xc00a789a08 pc=0x1033988c4
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).processFullSyncContent(0xc000b90e00, 0x0, 0x0)
	/Users/manav/polygon/bor/eth/downloader/downloader.go:1609 +0x16c fp=0xc00a789f60 sp=0xc00a789d58 pc=0x103397d8c
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).syncWithPeer.func9()
	/Users/manav/polygon/bor/eth/downloader/downloader.go:692 +0x1f fp=0xc00a789f88 sp=0xc00a789f60 pc=0x103391cdf
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).spawnSync.func1()
	/Users/manav/polygon/bor/eth/downloader/downloader.go:707 +0x66 fp=0xc00a789fe0 sp=0xc00a789f88 pc=0x103392246
runtime.goexit({})
	/Users/manav/.go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00a789fe8 sp=0xc00a789fe0 pc=0x1027567a1
created by github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).spawnSync in goroutine 688
	/Users/manav/polygon/bor/eth/downloader/downloader.go:707 +0x6e

Changes

Bugfix (non-breaking change that solves an issue)
Hotfix (change that solves an urgent issue, and requires immediate attention)
New feature (non-breaking change that adds functionality)
Breaking change (change that is not backwards-compatible and/or changes current functionality)
Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

I have added at least 2 reviewer or the whole pos-v1 team
I have added sufficient documentation in code
I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

This PR requires changes to heimdall
- In case link the PR here:
This PR requires changes to matic-cli
- In case link the PR here:

Testing

I have added unit tests
I have added tests to CI
I have tested this code manually on local environment
I have tested this code manually on remote devnet using express-cli
I have tested this code manually on mumbai
I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

marcello33 · 2024-03-18T09:00:07Z

LGTM, only unit-tests to be fixed.

pratikspatil024

Noice!

github-actions · 2024-04-11T00:17:13Z

This PR is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 14 days.

eth: close engine before handler

3a3a19a

temaniarpit27 approved these changes Mar 18, 2024

View reviewed changes

marcello33 approved these changes Mar 18, 2024

View reviewed changes

ssandeep approved these changes Mar 18, 2024

View reviewed changes

pratikspatil024 approved these changes Mar 20, 2024

View reviewed changes

github-actions bot added the Stale label Apr 11, 2024

manav2401 removed the Stale label Apr 12, 2024

Merge branch 'develop' into manav/fix-graceful-shutdown

b6d92aa

manav2401 merged commit f6969db into develop Apr 15, 2024
11 of 12 checks passed

manav2401 deleted the manav/fix-graceful-shutdown branch April 15, 2024 05:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eth: close engine before handler for graceful shutdown #1189

eth: close engine before handler for graceful shutdown #1189

manav2401 commented Mar 16, 2024 •

edited

Loading

marcello33 commented Mar 18, 2024

pratikspatil024 left a comment

github-actions bot commented Apr 11, 2024

eth: close engine before handler for graceful shutdown #1189

eth: close engine before handler for graceful shutdown #1189

Conversation

manav2401 commented Mar 16, 2024 • edited Loading

Description

Reproduction steps and explanation

Changes

Breaking changes

Nodes audience

Checklist

Cross repository changes

Testing

Manual tests

Additional comments

marcello33 commented Mar 18, 2024

pratikspatil024 left a comment

Choose a reason for hiding this comment

github-actions bot commented Apr 11, 2024

manav2401 commented Mar 16, 2024 •

edited

Loading