[neuron-1 testnet] node sync crash when restarting node process #371

Closed
EG-easy opened this issue Oct 26, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@EG-easy
Contributor

EG-easy commented Oct 26, 2021

Summary of Bug

nibiru version
neuron-1.1

  • cosmos-sdk v0.44.2
  • tendermint v0.34.12

Environment
ubuntu 20.04
6 vCore/16 GB

What happened

  • My node crashed with the error below soon after restarting the node
  • Restarting succeeded only when the block height was under 1000 and address.json contained no peer info
  • Even when using peers running the same version (seeds and persistent_peers), the node still crashed
  • Even when using synced data from another node, it still crashed after running nibirud start
  • The only solution for this issue is to resync from scratch with the command nibirud unsafe-reset-all

Logs

panic: Failed to process committed block (15253:313A220A98EB569C01B7BCEF20F9E25AAB761C586D9E17617FCEF95CE8EF3A12): wrong Block.Header.LastResultsHash.  Expected 1E5CCA645833276B4ED18012F98634C8A7CC7661F5D4D1C0796703E9B843F2DD, got 455F0B1ADF97CF95C59C1D6D037B4B6A9160ED3CBF209F12FB517658291FE20F
Oct 25 04:46:48 neuron-dev-2 nibirud[4410]: goroutine 163 [running]:
Oct 25 04:46:48 neuron-dev-2 nibirud[4410]: github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc0002348c0, 0x0)
Oct 25 04:46:48 neuron-dev-2 nibirud[4410]:         /root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.12/blockchain/v0/reactor.go:401 +0x1265
Oct 25 04:46:48 neuron-dev-2 nibirud[4410]: created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).OnStart
Oct 25 04:46:48 neuron-dev-2 nibirud[4410]:         /root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.12/blockchain/v0/reactor.go:110 +0x85
Oct 25 04:46:48 neuron-dev-2 systemd[1]: nibirud.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
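As a small diagnostic aid, the Go sketch below pulls the crashed height and the two hashes out of the panic line, so the rollback target from Possible solution (crashed height - 1) can be read off directly. The regular expression is an assumption written against this single log line; other Tendermint versions may word the message differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// panicRe matches the panic message from the log above. It is written
// against this one line; other Tendermint versions may word it differently.
var panicRe = regexp.MustCompile(`Failed to process committed block \((\d+):([0-9A-F]+)\): ` +
	`wrong Block\.Header\.LastResultsHash\.\s+Expected ([0-9A-F]+), got ([0-9A-F]+)`)

// crashInfo holds the fields extracted from the panic line.
type crashInfo struct {
	Height    int64  // height of the block that failed to apply
	BlockHash string // hash of that block
	Expected  string // LastResultsHash held in the node's local state
	Got       string // LastResultsHash carried in the block header
}

// parseCrash extracts the crashed height and hashes from a panic line.
func parseCrash(line string) (crashInfo, error) {
	m := panicRe.FindStringSubmatch(line)
	if m == nil {
		return crashInfo{}, fmt.Errorf("no LastResultsHash panic found in line")
	}
	h, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return crashInfo{}, err
	}
	return crashInfo{Height: h, BlockHash: m[2], Expected: m[3], Got: m[4]}, nil
}

func main() {
	line := "panic: Failed to process committed block (15253:313A220A98EB569C01B7BCEF20F9E25AAB761C586D9E17617FCEF95CE8EF3A12): wrong Block.Header.LastResultsHash.  Expected 1E5CCA645833276B4ED18012F98634C8A7CC7661F5D4D1C0796703E9B843F2DD, got 455F0B1ADF97CF95C59C1D6D037B4B6A9160ED3CBF209F12FB517658291FE20F"
	info, err := parseCrash(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The rollback target discussed under "Possible solution" is height - 1.
	fmt.Printf("crashed at height %d; rollback target is %d\n", info.Height, info.Height-1)
}
```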

Possible cause

This error is raised during block replay: after replaying, the node tries to apply the block and finds that block.LastResultsHash does not equal state.LastResultsHash. For details, see: https://github.com/tendermint/tendermint/blob/d030cddca01c0c3ff0ce41e051e123cdc9872a4e/consensus/replay.go#L404-L424
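To make the failing comparison concrete, here is a minimal Go sketch of it. The State and Header types are simplified, hypothetical stand-ins for Tendermint's types, and resultsHash is a toy: in Tendermint the real value is a Merkle root over the previous block's DeliverTx results, whereas this sketch hashes them flat with SHA-256 only to show that any divergence in results changes the hash. Because the header of block H carries the results hash of block H-1, a mismatch reported at 15253 points at the results this node computed (or stored) for block 15252.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// State is a hypothetical, simplified stand-in for Tendermint's state.
type State struct {
	LastResultsHash []byte // results hash this node computed for the previous block
}

// Header is a hypothetical, simplified stand-in for a block header.
type Header struct {
	Height          int64
	LastResultsHash []byte // results hash of the previous block, as agreed by the network
}

// resultsHash is a toy: a flat SHA-256 over the results, NOT Tendermint's
// Merkle construction. It only shows that different results give a different hash.
func resultsHash(results [][]byte) []byte {
	h := sha256.New()
	for _, r := range results {
		h.Write(r)
	}
	return h.Sum(nil)
}

// validateLastResultsHash models the comparison that produced the panic:
// the header's LastResultsHash must equal the value held in local state.
func validateLastResultsHash(st State, hdr Header) error {
	if !bytes.Equal(hdr.LastResultsHash, st.LastResultsHash) {
		return fmt.Errorf("wrong Block.Header.LastResultsHash.  Expected %X, got %X",
			st.LastResultsHash, hdr.LastResultsHash)
	}
	return nil
}

func main() {
	network := [][]byte{[]byte("tx1:code=0"), []byte("tx2:code=0")}
	local := [][]byte{[]byte("tx1:code=0"), []byte("tx2:code=5")} // this node diverged on tx2
	hdr := Header{Height: 15253, LastResultsHash: resultsHash(network)}

	fmt.Println(validateLastResultsHash(State{LastResultsHash: resultsHash(network)}, hdr)) // <nil>
	fmt.Println(validateLastResultsHash(State{LastResultsHash: resultsHash(local)}, hdr))
}
```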

Possible solution

Roll back to block [crashed height - 1] on both the app state side and the Tendermint state side, as discussed in the related issues.
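A toy model of the proposed rollback, assuming each side keeps one snapshot per committed height. The store type, rollbackBoth and the heights used in main are hypothetical and do not reflect the Tendermint or Cosmos SDK rollback implementations. What it illustrates is that both stores must be rewound to the same height, and that both are checked before either is modified, so a failed rollback cannot leave them disagreeing.

```go
package main

import "fmt"

// store is a hypothetical versioned store keeping one snapshot per committed
// height. It stands in for either the app state or Tendermint's state store.
type store struct {
	name     string
	versions map[int64]string // height -> snapshot label
	latest   int64
}

// newStore builds a store with snapshots for heights 1..latest.
func newStore(name string, latest int64) *store {
	s := &store{name: name, versions: map[int64]string{}, latest: latest}
	for h := int64(1); h <= latest; h++ {
		s.versions[h] = fmt.Sprintf("%s-snapshot-%d", name, h)
	}
	return s
}

// canRollbackTo reports whether a snapshot for height h is available.
func (s *store) canRollbackTo(h int64) bool {
	_, ok := s.versions[h]
	return ok && h <= s.latest
}

// rollbackTo discards every version above h.
func (s *store) rollbackTo(h int64) {
	for v := h + 1; v <= s.latest; v++ {
		delete(s.versions, v)
	}
	s.latest = h
}

// rollbackBoth rewinds the app store and the Tendermint store to
// crashedHeight-1. Both are checked first, so on error neither is modified.
func rollbackBoth(app, tm *store, crashedHeight int64) error {
	target := crashedHeight - 1
	for _, s := range []*store{app, tm} {
		if !s.canRollbackTo(target) {
			return fmt.Errorf("%s: no snapshot at height %d", s.name, target)
		}
	}
	app.rollbackTo(target)
	tm.rollbackTo(target)
	return nil
}

func main() {
	// Illustrative heights only.
	app := newStore("app", 10)
	tm := newStore("tendermint", 10)
	if err := rollbackBoth(app, tm, 10); err != nil {
		fmt.Println("rollback failed:", err)
		return
	}
	fmt.Println(app.latest, tm.latest) // 9 9
}
```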

Related Issues

Tendermint
Add command to roll-back a single block #3845
Cosmos-SDK
Add rollback support in the event of an incorrect hash #10281
Terra
[BUG] terrad cannot start after Columbus 5 Upgrade #582

Effect to the testnet

Because restarting a node is currently difficult, Neuron Incentivized Testnet Mission 6 (Upgrade Node) will be skipped.

How to change validator node settings without restarting during the rest of the testnet period
Changes to config.toml take effect only after the node is restarted. However, restarting the node causes it to crash, which means the node has to be resynchronized from the beginning, leading to missed signatures.

One idea is to move the original validator's private key from server A to another server B, edit server B's config.toml, let server B sync until it is just about to catch up with the latest block, and stop the node process on server A at that moment. Obviously, this carries a risk of double signing and should be done at your own risk.
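If you try this approach, the decision of when to stop server A can be sketched in Go. The sketch parses a Tendermint RPC /status response, reading only result.sync_info.latest_block_height and catching_up, and applies a hypothetical rule: stop A once B is within margin blocks of the network tip while still catching up, and refuse once B has already caught up, because at that point both servers may be signing. The margin value and the source of tipHeight (for example, another fully synced node's /status) are assumptions, and this does not remove the double-signing risk described above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// statusResponse models only the /status fields used here.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string `json:"latest_block_height"` // string-encoded int64
			CatchingUp        bool   `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

// syncHeight parses a /status body and returns the height and catching_up flag.
func syncHeight(body []byte) (int64, bool, error) {
	var s statusResponse
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, false, err
	}
	h, err := strconv.ParseInt(s.Result.SyncInfo.LatestBlockHeight, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("bad latest_block_height: %w", err)
	}
	return h, s.Result.SyncInfo.CatchingUp, nil
}

// shouldStopServerA is a hypothetical rule: stop A once B is within margin
// blocks of the tip but still catching up. If B has already caught up,
// it returns an error, since both servers may already be signing.
func shouldStopServerA(bHeight int64, bCatchingUp bool, tipHeight, margin int64) (bool, error) {
	if !bCatchingUp {
		return false, fmt.Errorf("server B already caught up at height %d: both servers may be signing", bHeight)
	}
	return tipHeight-bHeight <= margin, nil
}

func main() {
	// Example /status body from server B (values are illustrative).
	body := []byte(`{"jsonrpc":"2.0","id":-1,"result":{"sync_info":{"latest_block_height":"15240","catching_up":true}}}`)
	h, catchingUp, err := syncHeight(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// tipHeight would come from another, fully synced node; margin is a guess.
	stop, err := shouldStopServerA(h, catchingUp, 15250, 20)
	fmt.Println(stop, err) // true <nil>
}
```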

Effect to the mainnet

Currently, the rollback feature is implemented only on the Tendermint side (v0.34.14). For rollback to work flawlessly, we have to wait until this issue is resolved on the app state side.

@EG-easy EG-easy added the bug Something isn't working label Oct 26, 2021
@EG-easy
Contributor Author

EG-easy commented Dec 28, 2021

We have released the neuron-1.2 binary, which resolves this bug if it occurs.
https://github.com/cosmos-gaminghub/nibiru/releases/tag/neuron-1.2

@EG-easy EG-easy closed this as completed Dec 28, 2021