Problem: wrong Block.Header.AppHash crashes #284

Closed
hpmv opened this issue Dec 29, 2021 · 12 comments

hpmv commented Dec 29, 2021

Describe the bug
My node once in a while crashes with errors like this:

panic: Failed to process committed block (780414:CF2396E752FF8BD24E64AB4F926EB4B2EB488AB070611DFD29F1EFB8AA45B9A0): wrong Block.Header.AppHash.  Expected FB6CC3B73190529D528E5EB318CBA4FB1E47D48030C952A47C6E2429958F4E29, got 6875B5CB3A6F7FD9C1AC10FE1F16D4DD5A6CE28727408669C4254E36F3A4B351

Restarting the node gives a similar error.
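
When reporting crashes like this, it helps to pull the height and the two hashes out of the log line. The following is a minimal sketch, assuming the panic message is on a single line with any terminal wrapping removed and follows the format shown above; the function name `parse_app_hash_panic` is made up for illustration.

```python
import re

# Pattern modelled on the panic line quoted in this issue; other node
# versions may word the message differently (assumption).
PANIC_RE = re.compile(
    r"Failed to process committed block \((\d+):([0-9A-Fa-f]+)\): "
    r"wrong Block\.Header\.AppHash\.\s+"
    r"Expected ([0-9A-Fa-f]+), got ([0-9A-Fa-f]+)"
)


def parse_app_hash_panic(message):
    """Return height, block hash, and both app hashes, or None if no match."""
    match = PANIC_RE.search(message)
    if match is None:
        return None
    height, block_hash, expected, got = match.groups()
    return {
        "height": int(height),
        "block_hash": block_hash,
        "expected": expected,
        "got": got,
    }
```

Calling it on the message above yields the height 780414 together with the expected and actual AppHash values, which are the details asked for when triaging.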

To Reproduce
I cannot reproduce this reliably; it just happens roughly once every couple of days while the node is running. I'm using version v0.6.1.

Expected behavior
App really should recover from such errors by automatically reverting to the previous height. Or, a manual tool like state_recover from BSC would also be great. Right now there's no solution other than recovering from a disk backup.

Could a dev tell me how to revert to the previous height manually by setting leveldb keys? I know I need to set a couple of keys on the Tendermint side, but the app side is too confusing for me to dig into. Thanks!

Contributor

tomtau commented Dec 30, 2021

@hpmv Thanks for reporting the issue. For the wrong app hash error, it'd be helpful if you could provide more details: which block heights, which network (I assume the mainnet beta?), etc. this happens on. Given it was on v0.6.1, it could also be the case that there were some unnoticed consensus state breaking changes between 0.6.1 and 0.6.5.

The latest Tendermint has a rollback feature, but it hasn't been used in Cosmos SDK yet: cosmos/cosmos-sdk#10281

@yihuang @JayT106 may advise if there's a manual workaround in the meantime.

@tomtau changed the title from "Node crashes with corrupted database: wrong Block.Header.AppHash" to "Problem: wrong Block.Header.AppHash crashes" on Dec 30, 2021

Author

hpmv commented Dec 30, 2021

Thanks @tomtau! The network is mainnet beta, and the height was as shown in the error message: 780414. I just got it again at another height 789757.

What's the version (commit hash) that's supposed to be running in Mainnet beta?

Collaborator

JayT106 commented Dec 30, 2021

@hpmv you might try to update the DB state by modifying the WAL: remove the latest messages back to the previous EndHeight # message. Be sure to back up your data first.
You can find the script tool in the Tendermint project:
https://github.com/tendermint/tendermint/tree/master/scripts

However, this is not guaranteed to work, because the root cause of the AppHash crashes is unknown. We need more investigation to understand the issue.
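
The truncation step described above can be sketched as follows. This is only an illustration under stated assumptions: the WAL has already been dumped to text with one JSON message per line (for example with a converter from the Tendermint scripts directory, if your version ships one), and end-of-height messages can be recognised by the substring `EndHeightMessage`. Both the marker and the function name are assumptions, so check them against your own dump, and work on a backup copy only.

```python
# Assumed marker: adjust it to whatever your WAL dump actually contains.
END_HEIGHT_MARKER = "EndHeightMessage"


def truncate_after_last_end_height(lines, marker=END_HEIGHT_MARKER):
    """Keep the lines up to and including the last end-of-height message.

    Everything after that message (the partially processed height) is dropped.
    """
    last = -1
    for index, line in enumerate(lines):
        if marker in line:
            last = index
    if last < 0:
        # Without a marker there is no safe cut point, so do nothing silently.
        raise ValueError("no end-of-height message found; refusing to truncate")
    return lines[: last + 1]
```

The result would then have to be converted back to the binary WAL format before the node is restarted. As noted above, none of this is guaranteed to recover the node.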

Author

hpmv commented Dec 30, 2021

Thanks Jay! Is this a known issue in the community (I see a previous bug filed about this too)?

Contributor

tomtau commented Dec 30, 2021

ok, it seems this may be a duplicate issue: #256
That issue was with 0.6.4. It may also be due to non-deterministic operations, so it could happen irrespective of the changes between 0.6.1 and 0.6.5.

Collaborator

yihuang commented Mar 8, 2022

We observed this in one of our RPC nodes after upgrading to 0.6.6. After inspecting and comparing the iavl storage using the iaview tool, we found that this transaction's sender's balance differs between the problematic node and a normal node. The numbers match the hypothesis that the tx was reverted on the problematic node (with the sender's balance deducted by "gas limit * gas price") but executed successfully on the normal nodes.
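
The check behind that hypothesis is simple arithmetic. As a minimal sketch, assuming you have the sender's balance before the block and the balance found on the problematic node, the following tests whether the difference equals "gas limit * gas price". The function name is hypothetical, and no real balances from the affected nodes are reproduced here.

```python
def matches_revert_hypothesis(balance_before, balance_after, gas_limit, gas_price):
    """True if the balance dropped by exactly gas_limit * gas_price.

    All arguments are integers in the chain's base denomination.
    """
    return balance_before - balance_after == gas_limit * gas_price
```

A match on the problematic node, combined with a different deduction on a healthy node, is consistent with the transaction having been reverted on one and executed on the other.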

Collaborator

JayT106 commented Mar 8, 2022

Is the PR #377 the root cause of the AppHash mismatch in 0.6.6?

Collaborator

yihuang commented Mar 8, 2022

> Is the PR #377 the root cause of the AppHash mismatch in 0.6.6?

No, that one is released in 0.6.8, but the issue we are seeing today affects 0.6.6 and above.

Collaborator

JayT106 commented Mar 11, 2022

@hpmv, what are your settings for fast_sync and statesync.enable? I am trying to reproduce this, so it would be great if you could provide them. Thanks.

Collaborator

JayT106 commented Apr 5, 2022

From investigating the recent crash cases, we suspect the EVM module might be causing the non-deterministic result. However, we need more crashed databases to identify which part of the EVM module causes the issue.

@yihuang closed this as completed Apr 19, 2022
@yihuang reopened this Apr 19, 2022

Collaborator

yihuang commented May 3, 2022

> App really should recover from such errors by automatically reverting to the previous height. Or, a manual tool like state_recover from BSC would also be great. Right now there's no solution other than recovering from a disk backup.

This rollback command may help in the future: cosmos/cosmos-sdk#11361

Collaborator

yihuang commented May 26, 2022

cosmos/cosmos-sdk#12012

We believe the root cause has been found. The workaround for now is to increase the open file limit using ulimit.
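
The workaround above is normally applied in the shell or service manager that launches the node. As a minimal sketch of the same idea from inside a process, assuming a Unix-like system, the following inspects the open-file limit (`RLIMIT_NOFILE`) and raises the soft limit toward a target without exceeding the hard limit. The function name and any target value you pass are illustrative and do not come from the linked issue.

```python
import resource


def raise_nofile_soft_limit(target):
    """Raise the soft open-file limit toward `target`; return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        # Already unlimited, nothing to raise.
        return soft, hard
    # Never ask for more than the hard limit allows.
    wanted = target if hard == unlimited else min(target, hard)
    if wanted > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            # Some systems impose a lower ceiling; keep the current limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

The change applies only to the calling process and the children it starts, so the node would have to be launched from that process for it to take effect. Setting the limit in the launching shell or service unit is the more usual route.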

@yihuang closed this as completed May 26, 2022