
upgrades: pd exits too early, before submitting pre-upgrade block #4432

Closed
conorsch opened this issue May 22, 2024 · 20 comments · Fixed by #4436
Assignees
Labels
A-upgrades Area: Relates to chain upgrades

Comments

@conorsch
Contributor

Describe the bug
As of #4339, the pd migrate operation inspects CometBFT's priv_validator_state.json file, and errors out of the height is higher than the halt height. That's a good sanity check, but it's possible that the constraint is violated.

During a recent migration test against an online devnet, I observed val-0 migrating successfully, but val-1 failed during migration, erroring with:

2024-05-21T19:10:42.069620Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: parsed genesis file genesis_json=Object {"genesis_time": String("2024-05-21T18:52:33.849663975Z"), "chain_id": String("penumbra-testnet-deimos-8-77aa47d1"), "initial_height": String("300"), "consensus_params": Object {"block": Object {"max_bytes": String("22020096"), "max_gas": String("-1"), "time_iota_ms": String("500")}, "evidence": Object {"max_age_num_blocks": String("100000"), "max_age_duration": String("86400000000000"), "max_bytes": String("1048576")}, "validator": Object {"pub_key_types": Array [String("ed25519")]}, "abci": Object {}}, "validators": Array [], "app_hash": String("47FF8AAAEE45784C6B6637DC46F1157010DAF9036D8047BA5508192A04FE5E55"), "app_state": Object {"genesisCheckpoint": String("R/+Kqu5FeExrZjfcRvEVcBDa+QNtgEe6VQgZKgT+XlU=")}}
2024-05-21T19:10:42.069672Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.json"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 300 is already greater than or equal to initial_height 300
command terminated with exit code 1

The contents of the priv_validator_state.json for val-1 were:

{
  "height": "300",
  "round": 0,
  "step": 2,
  "signature": "9dfsrfrUDqtdK+7MUbDgf85idj4orCzrfSZiC/wABWCdhU6cue/h741jptRd+NRJsrmp+R24soIsOEOKBEFBAg==",
  "signbytes": "3D0801112C010000000000002A0C0880DBB3B206108BDD8FA102322270656E756D6272612D746573746E65742D6465696D6F732D382D3737616134376431"
}

So pd did properly halt, but CometBFT still proceeded with preparing block 300. Across the half dozen end-to-end migration tests I've run in the past few days, I've encountered this situation only once, but it feels like a race condition: once pd halts, its exit closes the ABCI port and causes cometbft to crash, but it's possible cometbft has advanced slightly by then.

Fortunately, since pd was stopped, this shouldn't have any bearing on the app state. I believe the proper workaround, if this is encountered in the wild, is to manually decrement the height value in the priv val state by 1, then rerun the migration. More research is required to understand what CometBFT is doing behind the scenes, and whether the priv val state is best understood as a WAL or as a watermark (or both).
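The decrement workaround described above amounts to the following (illustrative Python only; note that later comments in this thread conclude this is never the correct fix, since it allows double-signing):

```python
import json

def decrement_privval_height(path: str) -> None:
    # WARNING: later comments in this issue conclude this is never the
    # correct fix, as it permits double-signing. Shown for illustration only.
    with open(path) as f:
        state = json.load(f)
    state["height"] = str(int(state["height"]) - 1)  # height is a JSON string
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
```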

@conorsch conorsch added the A-upgrades Area: Relates to chain upgrades label May 22, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Penumbra May 22, 2024
@github-actions github-actions bot added the needs-refinement unclear, incomplete, or stub issue that needs work label May 22, 2024
@conorsch
Contributor Author

Talked this through with @erwanor earlier today, and he's already digging into it. The good news is, we don't believe this is a showstopper for the upcoming release (#4402).

@erwanor erwanor changed the title upgrades: priv_validator_state munging can fail upgrades:priv_validator_state munging can fail May 22, 2024
@conorsch
Contributor Author

Caught it again:

2024-05-22T00:50:32.293269Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.json"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 275 is already greater than or equal to initial_height 275
command terminated with exit code 1

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "275",
  "round": 0,
  "step": 2,
  "signature": "FJEQK3cNzlgsyl6/t7RAl3Mxn5maj+6uPaqungu/iQLC5RFw+LmrcPs57ZiDIjmDt49njCMqrdwds4vE8CMkCg==",
  "signbytes": "3D08011113010000000000002A0C08C6FCB4B20610F78DE6E001322270656E756D6272612D746573746E65742D6465696D6F732D382D3865653166333130"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "274",
  "round": 0,
  "step": 3,
  "signature": "SCmvwdmG/QK/r0tgtXGqSJgKTP+RHUAhKbwB2F9K5M56r5lFB5S52yVPijBnxVShtoZEXmZUWwt96IBb17S3AQ==",
  "signbytes": "8701080211120100000000000022480A203D9B6DDAEC9D347F0E3CF34DD7EF33A0CFB73A28B22BFD238C1F11DDF43CEC5C12240801122024A3D9F8A1092256B56FFA6DAC726AB1E56A21FF305FE01C3BCD6C4ED9780A022A0C08BDFCB4B2061090F2AF9E01322270656E756D6272612D746573746E65742D6465696D6F732D382D3865653166333130"
}

The two validators clearly have different heights in their cometbft state files, but both had stopped generating blocks at the halt height, as expected. To work around the problem, I tried manually decrementing the value via perl -npi -E 's/.*height.*/ "height": "274",/' priv_validator_state.json and rerunning the migration. It failed again, because pd and cometbft disagreed about the height:

2024-05-22T00:56:43.980023Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.json"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 274 is already greater than or equal to initial_height 1
command terminated with exit code 1

Next, I decremented the height to "0", and the migration completed successfully after that, but the network would not resume. Error messages include:

E[2024-05-22|01:11:21.021] Peer sent us invalid msg                     module=consensus peer="Peer{MConn{34.16.34.194:26656} 6bc340b9a9b2fbb9d1fac53c972973c22b252c75 out}" msg="[NewRoundStep H:275 R:0 S:RoundStepPrevote LCR:-1]" err="LastCommitRound can only be negative for initial height 1"
E[2024-05-22|01:11:21.021] Stopping peer for error                      module=p2p peer="Peer{MConn{34.16.34.194:26656} 6bc340b9a9b2fbb9d1fac53c972973c22b252c75 out}" err="LastCommitRound can only be negative for initial height 1"

Proceeding with more testing, but this problem is already persistent enough that we need a solution before shipping.

@erwanor
Member

erwanor commented May 22, 2024

priv_validator_state height 274 is already greater than or equal to initial_height 1

I think this is a different error though, running a migration over chain state that has already been migrated. In the future this should be caught by the halt bit flag.

@conorsch
Contributor Author

Fair, but even decrementing to 0 didn't work. Should I perhaps have decremented to 1? My impulse is to loosen the constraint to LTE for the pd migrate height check, but I don't fully understand the implications of doing so. The devnet is dead in the water:

❯ cargo run -q --release --bin pcli -- --home ~/.local/share/pcli-devnet/ q chain info
Chain Info:

 Current Block Height   0
 Current Epoch          0
 Total Validators       2
 Active Validators      2
 Inactive Validators    0
 Jailed Validators      0
 Tombstoned Validators  0
 Disabled Validators    0

so we need a clear plan on how to handle this error.

@hdevalence
Member

No, none of this manual messing around with the height should be required. The behavior of the current code is correct.

The question that should be answered and isn’t is why Comet ends up past the halt height in the first place.

@hdevalence
Member

I believe the proper workaround, if this is encountered in the wild, is to manually decrement the height value in the priv val state by 1, then rerun the migration.

This is never the correct solution, as it asks the validator to double-sign a block.

@hdevalence
Member

Across the half dozen end-to-end migration tests I've run in the past few days, I've encountered this situation only once, but it feels like a race condition: once pd halts, its exit closes the ABCI port and causes cometbft to crash, but it's possible cometbft has advanced slightly by then.

Comet shouldn’t be signing the next block immediately after getting a commit message from the application, unless the block time has been turned all the way down to zero, which means we are testing a situation we don’t care about. Can you confirm that this behavior occurs when using the intended 5-second block time?

@erwanor
Member

erwanor commented May 22, 2024

The migrate command is not idempotent (yet). So my guess at what happened here is that the export worked from chain state that had already been migrated (with application height zero). It generated a genesis with initialHeight: 1. This isn't a problem with the priv validator state, even if the error messages look somewhat similar. It's a problem of corrupted chain state, which is unfortunate, but should be solved altogether for the next upgrade.

@hdevalence We're looking into it. The current working theory is that this is a race condition.

There's a minuscule window of time between the moment we return Commit and when we interrupt the ABCI connection, which leaves comet just enough time to produce and sign a new proposal and tick its signing state (the fact that the mempool is empty helps). AFAICT, it does so even before doing an app round trip to check that the proposal is correct.

@erwanor
Member

erwanor commented May 22, 2024

To answer your latest comment, AFAICT comet will actually update its signing state as soon as it creates a new proposal, even before checking that it is valid at all.
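That matches treating the signing state as a high-water mark: the mark advances as soon as a proposal is signed, before any app-side validation, and nothing at or below it may be signed again. A toy sketch of the comparison (illustrative only; not CometBFT's actual Go code, and `may_sign` is a hypothetical name):

```python
def may_sign(last: tuple, candidate: tuple) -> bool:
    """last and candidate are (height, round, step) tuples.
    The watermark only moves forward; anything at or below it is
    refused, to prevent double-signing."""
    return candidate > last

# The failing scenario from this issue: the mark is already at the halt
# height (300, 0, 2), so nothing at or below that point can be signed again.
```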

@conorsch
Contributor Author

Can you confirm that this behavior occurs when using the intended 5-second block time?

Yes, still using the default 5s timeout_commit value. However, the devnet has customized values for proposal_voting_blocks=50 and epoch_duration=20.

AFAICT comet will actually update its signing state as soon as it creates a new proposal

This feels right to me: pd hasn't moved past halt-1, but cometbft updates its priv val state to the halt height.

@hdevalence
Member

Why is comet creating a new proposal immediately?

What value of timeout_commit is in use? Should that not prevent Comet from immediately continuing?

@conorsch
Contributor Author

Encountered again. Target height in the upgrade plan was 540. Validators showed:

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "539",
  "round": 0,
  "step": 3,
  "signature": "6esGuH2dM9ZWQJKfAi2yTr0G9OrTGN/mwWwoo91eX5nVW/L1rqiQH5SEZK6yXKfBtAKRymyXsKX7KXfxkoEBDA==",
  "signbytes": "87010802111B0200000000000022480A202E5BB4D8B3D30CAF6D5019A8A84C5D3D4E68BAC1B3C34451B1510C957876322F122408011220C2FC1A22BD906AD4053369038BB83DD190E270404A9D4290B914F2EE9CA482992A0C08F6B1B5B20610A3EEC0D601322270656E756D6272612D746573746E65742D6465696D6F732D382D6563363732313937"
}

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "540",
  "round": 0,
  "step": 2,
  "signature": "bLvn3a/agsR9heeTaBk7kZuDCO0ZR5HMuadxj5MMAovOTiWuulDHyqjlFoZr41o/G0q53HMzEZfFDzk+CluBAw==",
  "signbytes": "3D0801111C020000000000002A0C0882B2B5B206109387FE9403322270656E756D6272612D746573746E65742D6465696D6F732D382D6563363732313937"
}

This time, before running the migration, I decremented height=540 to height=539, then ran the migration. The migration succeeded, and the chain resumed generating blocks without any errors. I understand that we want to avoid manual munging of the priv val state, but I'm providing this info to help us diagnose the root cause.

@erwanor
Member

erwanor commented May 22, 2024

Great, I'm going to log into the kube instance to inspect the comet WAL.

@erwanor
Member

erwanor commented May 22, 2024

Re @hdevalence: Yeah, agreed; on second look the race condition story doesn't check out, since the commit timeout should slow down consensus. Moreover, re-reading the initial error, we can see that we're already collecting prevotes (step=2):

{
  "height": "300",
  "round": 0,
  "step": 2,
sigdata
}

@erwanor
Member

erwanor commented May 22, 2024

@conorsch does this specific setup have a different restart cadence than the others? It shouldn't matter, but it might help pinpoint the exact cause. I notice that it's always val-1.

@erwanor
Member

erwanor commented May 22, 2024

@conorsch can you suspend pd/comet without winding down the containers or having them restart? I want to acquire a lock on the comet db but it's running.

My hunch is that the "pd crash" logic that we have is bad, but I need logs for conclusive evidence.

@erwanor
Member

erwanor commented May 22, 2024

My hunch is that the "pd crash" logic that we have is bad, but I need logs for conclusive evidence.

I am certain that this is the problem. The full node should stop, period, but instead it keeps serving ABCI Info requests. This interacts with the restart logic: after some time, the proposing node restarts, is able to query the full node, and observes that the timeout period has elapsed.

@conorsch
Contributor Author

@conorsch can you suspend pd/comet without winding down the containers or have them restart? I want to acquire a lock on the comet db but it's runnning.

The way we perform the migrations is to place the pods in "maintenanceMode", replacing the usual pd and cometbft commands with sleep infinity. That way, the same environment is available, but no services are running or touching the state. It's essentially equivalent to systemctl stop penumbra cometbft in a VM environment. I've done that on the still-running devnet, so you can poke around.

More relevant for our needs are probably the pre-upgrade snapshot archives, which are available for fetching here:

And less interesting but still available are the post-upgrade migration archives, taken immediately after the migration was run, but before the network was resumed:

My hunch is that the "pd crash" logic that we have is bad [...] The full node should stop, period, but instead it keeps serving ABCI Info requests.

It's worth noting the crash behavior we're investigating is in the v0.75.0 code. We've used the halt-count mechanism, now deprecated, to halt the chain successfully before. But I agree with you, we're not seeing the behavior I'd expect.

@erwanor
Member

erwanor commented May 22, 2024

The problem is that we don't actually halt the full node. We leave it in an intermediate state where it's able to serve certain requests, but not others. This is because we crash two of its four core services (Consensus and Mempool). We should stop all four, or they should never start; that's the correct behavior.

The node eventually crashes once comet starts making consensus requests, because it believes the application is "ready". This is really not ideal. Crucially, it's unrelated to whether we use a halt bit, a halt counter, or any other mechanism. The same issue will creep up regardless of what the "halting rule" is.

I will push a PR, but today's update will have to do without it, so we need to make sure that validators do not have auto-restart enabled, to avoid having to touch the signing state.
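In outline, the fix gates service startup on application readiness. A Python-flavored sketch, assuming the shape of the fix described later in this thread (#4436); the real change is in Rust, the `App` class here is a toy stand-in, and only `is_ready` and the `--force` override come from that PR:

```python
class App:
    """Toy stand-in for the application; everything except the
    `is_ready` concept and the --force override is hypothetical."""
    def __init__(self, halted: bool):
        self.halted = halted
        self.started = False

    def is_ready(self) -> bool:
        # Not ready if a halt (pending upgrade) has been recorded.
        return not self.halted

    def start_all_services(self) -> None:
        # Stand-in for starting all four ABCI services together.
        self.started = True

def start_node(app: App, force: bool = False) -> None:
    # Refuse to spin up any services on a halted app unless overridden,
    # instead of leaving two of the four ABCI services half-running.
    if not app.is_ready() and not force:
        raise SystemExit("app is not ready (pending upgrade?); use --force to override")
    app.start_all_services()
```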

@aubrika aubrika removed the needs-refinement unclear, incomplete, or stub issue that needs work label May 22, 2024
@aubrika aubrika moved this from Backlog to In progress in Penumbra May 22, 2024
@conorsch
Contributor Author

First attempt at resolution was #4436, which successfully caused pd to exit completely, but a bit too early: on several test runs, pd exited before cometbft had broadcast the halt-minus-1 block. We observed fullnodes stalled at halt-minus-2, which poses problems for migration. This issue remains a problem and should be resolved before proceeding with the next chain upgrade. Ideally, the fix will land in a point release 0.75.1, after which we can proceed with the scheduled 0.76.0 upgrade-via-migration.

@conorsch conorsch changed the title upgrades:priv_validator_state munging can fail upgrades: pd exits too early, before submitting pre-upgrade block May 22, 2024
conorsch pushed a commit that referenced this issue May 23, 2024
Previously, the halting logic was structured such that full nodes would
partially crash two of their four ABCI services (`Consensus` and
`Mempool`); relying on future CometBFT consensus requests to crash the
node.

This PR adds an `App::is_ready` method that callers (pd) SHOULD call in
order to make sure that the application is ready, so that they can avoid
spinning up any services unless an override flag (`--force`) is
specified.

Fix #4432. Fix #4443.

## Checklist before requesting a review

- [x] If this code contains consensus-breaking changes, I have added the
"consensus-breaking" label. Otherwise, I declare my belief that there
are not consensus-breaking changes, for the following reason:

  > Full node mechanical refactor
@github-project-automation github-project-automation bot moved this from In progress to Done in Penumbra May 23, 2024
erwanor added a commit that referenced this issue May 23, 2024, with the same message as above.
conorsch pushed four further commits that referenced this issue May 23, 2024, each cherry-picked from commit 2c9c3f3 with the same message.