upgrades: pd exits too early, before submitting pre-upgrade block #4432
`priv_validator_state` munging can fail
Caught it again:
The two validators clearly have a different height in their cometbft configs, but both had stopped generating blocks at the halt height, as expected. To work around the problem, I tried manually decrementing the value via
Next, I decremented the height to
Proceeding with more testing, but this problem is already persistent enough that we need a solution before shipping.
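For concreteness, the manual workaround described above amounts to rewinding the signing-state height by one. A minimal Python sketch, where the function name and warning text are my own (and, as later comments argue, this munging should not be an endorsed fix):

```python
import json
from pathlib import Path

def decrement_privval_height(path: Path) -> int:
    """Rewind CometBFT's signing-state height by one (manual workaround).

    WARNING: hand-editing priv_validator_state.json risks double-signing;
    this sketch only documents the munging step described above.
    """
    state = json.loads(path.read_text())
    # CometBFT stores the height as a JSON string, e.g. "540".
    new_height = int(state["height"]) - 1
    state["height"] = str(new_height)
    path.write_text(json.dumps(state, indent=2))
    return new_height
```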
I think this is a different error, though: running a migration over chain state that has already been migrated. In the future this should be caught by the halt bit flag.
Fair, but even decrementing to 0 didn't work. Should I perhaps have decremented to 1? My impulse is to loosen the constraint to LTE for the
so we need a clear plan on how to handle this error.
No, none of this manually messing around with the height should be required. The behavior of the current code is correct. The question that should be answered and isn’t is why Comet ends up past the halt height in the first place.
This is never the correct solution, as it asks the validator to double-sign a block.
Comet shouldn’t be signing the next block immediately after getting a commit message from the application, unless the block time has been turned all the way down to zero, which means we are testing a situation we don’t care about. Can you confirm that this behavior occurs when using the intended 5-second block time?
The migrate command is not idempotent (yet). So my guess of what happened here is that the exported chain state worked from a migrated chain state (with application height zero). It generated a genesis with

@hdevalence We're looking into it. The current working theory is that this is a race condition. There's a minuscule window of time between the moment we return
To answer your latest comment, AFAICT comet will actually update its signing state as soon as it creates a new proposal, even before checking that it is valid at all. |
Yes, still using the default 5s timeout-commit value. However, the devnet has customized values for
This feels right to me: pd hasn't moved past halt-1, but cometbft updates its priv val state to the halt height.
Why is comet creating a new proposal immediately? What value of timeout_commit is in use? Should that not prevent Comet from immediately continuing? |
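For reference, `timeout_commit` lives in CometBFT's `config.toml` under the `[consensus]` section; the excerpt below shows only that one knob, with the 5s default discussed in this thread:

```toml
[consensus]
# How long to wait after committing a block before starting on the next
# height. "0s" lets comet race ahead immediately; the default is 5s.
timeout_commit = "5s"
```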
Encountered again. Target height in the upgrade plan was
This time, before running the migration, I decremented height=540 to height=539, then ran the migration. The migration succeeded, and the chain resumed generating blocks without any errors. I understand that we want to avoid manual munging of priv val state, but providing this info to help us diagnose the root cause.
Great, I'm going to log into the kube instance to inspect the comet WAL. |
Re @hdevalence: Yeah agreed, on second look the race condition story doesn't check out, since the commit timeout should slow down consensus. Moreover, re-reading the initial error, we can notice that we're already collecting prevotes (
@conorsch does this specific setup have a different restart cadence than the others? It shouldn't matter, but it might help pinpoint exactly the cause.
@conorsch can you suspend pd/comet without winding down the containers or having them restart? I want to acquire a lock on the comet db, but it's running. My hunch is that the "pd crash" logic that we have is bad, but I need logs for conclusive evidence.
I am certain that this is the problem. The full node should stop, period, but instead it keeps serving ABCI |
The way we perform the migrations is to place the pods in "maintenanceMode", replacing the usual pd and cometbft commands with

More relevant for our needs are probably the pre-upgrade snapshot archives, which are available for fetching here:
And less interesting but still available are the post-upgrade migration archives, taken immediately after the migration was run, but before the network was resumed:
It's worth noting the crash behavior we're investigating is in the
The problem is that we don't actually halt the full node. We leave it in an intermediate state where it's able to serve certain requests, but not others. This is because we crash two of its four core services (`Consensus` and `Mempool`). The node eventually crashes once comet starts making consensus requests, because it believes the application is "ready". This is really not ideal. Crucially, it's unrelated to whether we use a halt bit or a halt counter, or any other mechanism. The same issue will creep up regardless of what the "halting rule" is. I will push a PR, but for today's update we will have to do without it, so we need to make sure that the validators do not have auto-restart enabled, to avoid having to touch the signing state.
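A hypothetical sketch of the alternative being described: gate service startup on an explicit readiness check, instead of crashing two of the four services and waiting for comet to trip over them. The names echo the eventual PR (`App::is_ready`, `--force`), but this toy implementation is illustrative only:

```python
class App:
    """Toy stand-in for the application handle (illustrative only)."""

    def __init__(self, halted: bool):
        # In the real node, this would be read from chain state
        # (the "halt bit" mentioned in this thread).
        self.halted = halted

    def is_ready(self) -> bool:
        return not self.halted

def start_services(app: App, force: bool = False) -> list[str]:
    """Refuse to spin up *any* services on a halted node, unless forced."""
    if not app.is_ready() and not force:
        raise RuntimeError(
            "node is halted; refusing to start services (pass --force to override)"
        )
    # The four core ABCI services a full node serves.
    return ["consensus", "mempool", "snapshot", "info"]
```

The point of the gate is that a halted node starts nothing at all, rather than a partially-alive subset of services.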
First attempt at resolution was #4436, which successfully caused pd to exit completely, but a bit too early: on several test runs, pd exited before cometbft had broadcast the halt-minus-1 block. We observed fullnodes stalled at halt-minus-2, which poses problems for migration. This issue remains a problem, and should be resolved prior to proceeding with the next chain upgrade. Ideally, the fix for this issue will land in a point release 0.75.1, after which we can proceed with the scheduled 0.76.0 upgrade-via-migration.
`priv_validator_state` munging can fail

Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (`Consensus` and `Mempool`), relying on future CometBFT consensus requests to crash the node. This PR adds an `App::is_ready` method that callers (pd) SHOULD call in order to make sure that the application is ready, so that they can avoid spinning up any services unless an override flag (`--force`) is specified. Fix #4432. Fix #4443.

## Checklist before requesting a review

- [x] If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason:

> Full node mechanical refactor
Describe the bug

As of #4339, the `pd migrate` operation inspects CometBFT's `priv_validator_state.json` file, and errors out if the height is higher than the halt height. That's a good sanity check, but it's possible for the constraint to be violated. During a recent migration test against an online devnet, I observed `val-0` migrating successfully, but `val-1` failed during migration, erroring with:

The contents of the `priv_validator_state.json` for `val-1` was:

So `pd` did properly halt, but CometBFT still proceeded with preparing block 300. Across the half dozen end-to-end migration tests I've run in the past few days, I've encountered this situation only once, but it feels like a race condition: once pd halts, its exit will close the ABCI port and cause cometbft to crash, but it's possible cometbft has advanced slightly.

Fortunately, since pd was stopped, this shouldn't have any bearing on the app state. I believe the proper workaround, if this is encountered in the wild, is to manually decrement the height value in the priv val state by 1, then rerun the migration. More research is required to understand what CometBFT is doing behind the scenes, and whether the priv val state is best understood as a WAL or as a watermark (or both).