
upgrades: pd exits too early, before submitting pre-upgrade block #4432

Closed
conorsch opened this issue May 22, 2024 · 20 comments · Fixed by #4436
Assignees
Labels
A-upgrades Area: Relates to chain upgrades

Comments

@conorsch
Contributor

Describe the bug
As of #4339, the pd migrate operation inspects CometBFT's priv_validator_state.json file, and errors out of the height is higher than the halt height. That's a good sanity check, but it's possible that the constraint is violated.

During a recent migration test against an online devnet, I observed val-0 migrating successfully, but val-1 failed during migration, erroring with:

2024-05-21T19:10:42.069620Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: parsed genesis file genesis_json=Object {"genesis_time": String("2024-05-21T18:52:33.849663975Z"), "chain_id": String("penumbra-testnet-deimos-8-77aa47d1"), "initial_height": String("300"), "consensus_params": Object {"block": Object {"max_bytes": String("22020096"), "max_gas": String("-1"), "time_iota_ms": String("500")}, "evidence": Object {"max_age_num_blocks": String("100000"), "max_age_duration": String("86400000000000"), "max_bytes": String("1048576")}, "validator": Object {"pub_key_types": Array [String("ed25519")]}, "abci": Object {}}, "validators": Array [], "app_hash": String("47FF8AAAEE45784C6B6637DC46F1157010DAF9036D8047BA5508192A04FE5E55"), "app_state": Object {"genesisCheckpoint": String("R/+Kqu5FeExrZjfcRvEVcBDa+QNtgEe6VQgZKgT+XlU=")}}
2024-05-21T19:10:42.069672Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.json"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 300 is already greater than or equal to initial_height 300
command terminated with exit code 1

The contents of the priv_validator_state.json for val-1 were:

{
  "height": "300",
  "round": 0,
  "step": 2,
  "signature": "9dfsrfrUDqtdK+7MUbDgf85idj4orCzrfSZiC/wABWCdhU6cue/h741jptRd+NRJsrmp+R24soIsOEOKBEFBAg==",
  "signbytes": "3D0801112C010000000000002A0C0880DBB3B206108BDD8FA102322270656E756D6272612D746573746E65742D6465696D6F732D382D3737616134376431"
}

So pd did properly halt, but CometBFT still proceeded with preparing block 300. Across the half dozen end-to-end migration tests I've run in the past few days, I've encountered this situation only once, but it feels like a race condition: once pd halts, its exit closes the ABCI port and causes cometbft to crash, but it's possible cometbft has advanced slightly by then.

Fortunately, since pd was stopped, this shouldn't have any bearing on the app state. I believe the proper workaround, if this is encountered in the wild, is to manually decrement the height value in the priv val state by 1, then rerun the migration. More research is required to understand what CometBFT is doing behind the scenes, and whether the priv val state is best understood as a WAL or as a watermark (or both).
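The decrement workaround described above amounts to the following (illustrative Python only; note that later comments in this thread conclude this is never the correct fix, since it allows double-signing):

```python
import json

def decrement_privval_height(path: str) -> None:
    # WARNING: later comments in this issue conclude this is never the
    # correct fix, as it permits double-signing. Shown for illustration only.
    with open(path) as f:
        state = json.load(f)
    state["height"] = str(int(state["height"]) - 1)  # height is a JSON string
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
```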

@conorsch conorsch added the A-upgrades Area: Relates to chain upgrades label May 22, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Penumbra May 22, 2024
@github-actions github-actions bot added the needs-refinement unclear, incomplete, or stub issue that needs work label May 22, 2024
@conorsch
Contributor Author

Talked this through with @erwanor earlier today, and he's already digging into it. The good news is, we don't believe this is a showstopper for the upcoming release (#4402).

@erwanor erwanor changed the title upgrades: priv_validator_state munging can fail upgrades:priv_validator_state munging can fail May 22, 2024
@conorsch
Contributor Author

Caught it again:

2024-05-22T00:50:32.293269Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.json"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 275 is already greater than or equal to initial_height 275
command terminated with exit code 1

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "275",
  "round": 0,
  "step": 2,
  "signature": "FJEQK3cNzlgsyl6/t7RAl3Mxn5maj+6uPaqungu/iQLC5RFw+LmrcPs57ZiDIjmDt49njCMqrdwds4vE8CMkCg==",
  "signbytes": "3D08011113010000000000002A0C08C6FCB4B20610F78DE6E001322270656E756D6272612D746573746E65742D6465696D6F732D382D3865653166333130"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "274",
  "round": 0,
  "step": 3,
  "signature": "SCmvwdmG/QK/r0tgtXGqSJgKTP+RHUAhKbwB2F9K5M56r5lFB5S52yVPijBnxVShtoZEXmZUWwt96IBb17S3AQ==",
  "signbytes": "8701080211120100000000000022480A203D9B6DDAEC9D347F0E3CF34DD7EF33A0CFB73A28B22BFD238C1F11DDF43CEC5C12240801122024A3D9F8A1092256B56FFA6DAC726AB1E56A21FF305FE01C3BCD6C4ED9780A022A0C08BDFCB4B2061090F2AF9E01322270656E756D6272612D746573746E65742D6465696D6F732D382D3865653166333130"
}

The two validators clearly have different heights in their cometbft state files, but both had stopped generating blocks at the halt height, as expected. To work around the problem, I tried manually decrementing the value via perl -npi -E 's/.*height.*/ "height": "274",/' priv_validator_state.json and rerunning the migration. It failed again, because pd and cometbft disagreed about the height:

2024-05-22T00:56:43.980023Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.json"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 274 is already greater than or equal to initial_height 1
command terminated with exit code 1

Next, I decremented the height to "0", and the migration completed successfully after that, but the network would not resume. Error messages include:

E[2024-05-22|01:11:21.021] Peer sent us invalid msg                     module=consensus peer="Peer{MConn{34.16.34.194:26656} 6bc340b9a9b2fbb9d1fac53c972973c22b252c75 out}" msg="[NewRoundStep H:275 R:0 S:RoundStepPrevote LCR:-1]" err="LastCommitRound can only be negative for initial height 1"
E[2024-05-22|01:11:21.021] Stopping peer for error                      module=p2p peer="Peer{MConn{34.16.34.194:26656} 6bc340b9a9b2fbb9d1fac53c972973c22b252c75 out}" err="LastCommitRound can only be negative for initial height 1"

Proceeding with more testing, but this problem is already persistent enough that we need a solution before shipping.

@erwanor
Member

erwanor commented May 22, 2024

priv_validator_state height 274 is already greater than or equal to initial_height 1

I think this is a different error though, running a migration over chain state that has already been migrated. In the future this should be caught by the halt bit flag.

@conorsch
Contributor Author

Fair, but even decrementing to 0 didn't work. Should I perhaps have decremented to 1? My impulse is to loosen the constraint to LTE for the pd migrate height check, but I don't fully understand the implications of doing so. The devnet is dead in the water:

❯ cargo run -q --release --bin pcli -- --home ~/.local/share/pcli-devnet/ q chain info
Chain Info:

 Current Block Height   0
 Current Epoch          0
 Total Validators       2
 Active Validators      2
 Inactive Validators    0
 Jailed Validators      0
 Tombstoned Validators  0
 Disabled Validators    0

so we need a clear plan on how to handle this error.

@hdevalence
Member

No, none of this manual messing around with the height should be required. The behavior of the current code is correct.

The question that should be answered and isn’t is why Comet ends up past the halt height in the first place.

@hdevalence
Member

I believe the proper workaround, if this is encountered in the wild, is to manually decrement the height value in the priv val state by 1, then rerun the migration.

This is never the correct solution, as it asks the validator to double-sign a block.

@hdevalence
Member

Across the half dozen end-to-end migration tests I've run in the past few days, I've encountered this situation only once, but it feels like a race condition: once pd halts, its exit closes the ABCI port and causes cometbft to crash, but it's possible cometbft has advanced slightly by then.

Comet shouldn’t be signing the next block immediately after getting a commit message from the application, unless the block time has been turned all the way down to zero, which means we are testing a situation we don’t care about. Can you confirm that this behavior occurs when using the intended 5-second block time?

@erwanor
Member

erwanor commented May 22, 2024

The migrate command is not idempotent (yet). So my guess at what happened here is that the export worked from chain state that had already been migrated (with application height zero). It generated a genesis with initialHeight: 1. This isn't a problem with the priv validator state, even if the error messages look somewhat similar. It's a problem of corrupted chain state, which is unfortunate, but should be solved altogether for the next upgrade.

@hdevalence We're looking into it. The current working theory is that this is a race condition.

There's a minuscule window of time between the moment we return Commit and when we interrupt the ABCI connection, which leaves comet just enough time to produce and sign a new proposal and tick its signing state (the fact that the mempool is empty helps). AFAICT, it does so even before doing an app round trip to check that the proposal is correct.

@erwanor
Member

erwanor commented May 22, 2024

To answer your latest comment, AFAICT comet will actually update its signing state as soon as it creates a new proposal, even before checking that it is valid at all.
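That matches treating the signing state as a high-water mark: the mark advances as soon as a proposal is signed, before any app-side validation, and nothing at or below it may be signed again. A toy sketch of the comparison (illustrative only; not CometBFT's actual Go code, and `may_sign` is a hypothetical name):

```python
def may_sign(last: tuple, candidate: tuple) -> bool:
    """last and candidate are (height, round, step) tuples.
    The watermark only moves forward; anything at or below it is
    refused, to prevent double-signing."""
    return candidate > last

# The failing scenario from this issue: the mark is already at the halt
# height (300, 0, 2), so nothing at or below that point can be signed again.
```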

@conorsch
Contributor Author

Can you confirm that this behavior occurs when using the intended 5-second block time?

Yes, still using the default 5s timeout_commit value. However, the devnet has customized values for proposal_voting_blocks=50 and epoch_duration=20.

AFAICT comet will actually update its signing state as soon as it creates a new proposal

This feels right to me: pd hasn't moved past halt-1, but cometbft updates its priv val state to the halt height.

@hdevalence
Member

Why is comet creating a new proposal immediately?

What value of timeout_commit is in use? Should that not prevent Comet from immediately continuing?

@conorsch
Contributor Author

Encountered again. Target height in the upgrade plan was 540. Validators showed:

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "539",
  "round": 0,
  "step": 3,
  "signature": "6esGuH2dM9ZWQJKfAi2yTr0G9OrTGN/mwWwoo91eX5nVW/L1rqiQH5SEZK6yXKfBtAKRymyXsKX7KXfxkoEBDA==",
  "signbytes": "87010802111B0200000000000022480A202E5BB4D8B3D30CAF6D5019A8A84C5D3D4E68BAC1B3C34451B1510C957876322F122408011220C2FC1A22BD906AD4053369038BB83DD190E270404A9D4290B914F2EE9CA482992A0C08F6B1B5B20610A3EEC0D601322270656E756D6272612D746573746E65742D6465696D6F732D382D6563363732313937"
}

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "540",
  "round": 0,
  "step": 2,
  "signature": "bLvn3a/agsR9heeTaBk7kZuDCO0ZR5HMuadxj5MMAovOTiWuulDHyqjlFoZr41o/G0q53HMzEZfFDzk+CluBAw==",
  "signbytes": "3D0801111C020000000000002A0C0882B2B5B206109387FE9403322270656E756D6272612D746573746E65742D6465696D6F732D382D6563363732313937"
}

This time, before running the migration, I decremented height=540 to height=539, then ran the migration. The migration succeeded, and the chain resumed generating blocks without any errors. I understand that we want to avoid manual munging of the priv val state, but I'm providing this info to help us diagnose the root cause.

@erwanor
Member

erwanor commented May 22, 2024

Great, I'm going to log into the kube instance to inspect the comet WAL.

@erwanor
Member

erwanor commented May 22, 2024

Re @hdevalence: Yeah, agreed; on second look the race condition story doesn't check out, since the commit timeout should slow down consensus. Moreover, re-reading the initial error, we can see that we're already collecting prevotes (step=2):

{
  "height": "300",
  "round": 0,
  "step": 2,
sigdata
}

@erwanor
Member

erwanor commented May 22, 2024

@conorsch does this specific setup have a different restart cadence than the others? It shouldn't matter, but it might help pinpoint the exact cause. I notice that it's always val-1.

@erwanor
Member

erwanor commented May 22, 2024

@conorsch can you suspend pd/comet without winding down the containers or having them restart? I want to acquire a lock on the comet db but it's running.

My hunch is that the "pd crash" logic that we have is bad, but I need logs for conclusive evidence.

@erwanor
Member

erwanor commented May 22, 2024

My hunch is that the "pd crash" logic that we have is bad, but I need logs for conclusive evidence.

I am certain that this is the problem. The full node should stop, period, but instead it keeps serving ABCI Info requests. This interacts with the restart logic: after some time, the proposing node restarts, is able to query the full node, and observes that the timeout period has elapsed.

@conorsch
Contributor Author

@conorsch can you suspend pd/comet without winding down the containers or have them restart? I want to acquire a lock on the comet db but it's runnning.

The way we perform the migrations is to place the pods in "maintenanceMode", replacing the usual pd and cometbft commands with sleep infinity. That way, the same environment is available, but no services are running or touching the state. It's essentially equivalent to systemctl stop penumbra cometbft in a VM environment. I've done that on the still-running devnet, so you can poke around.

More relevant for our needs are probably the pre-upgrade snapshot archives, which are available for fetching here:

And less interesting but still available are the post-upgrade migration archives, taken immediately after the migration was run, but before the network was resumed:

My hunch is that the "pd crash" logic that we have is bad [...] The full node should stop, period, but instead it keeps serving ABCI Info requests.

It's worth noting the crash behavior we're investigating is in the v0.75.0 code. We've used the halt-count mechanism, now deprecated, to halt the chain successfully before. But I agree with you, we're not seeing the behavior I'd expect.

@erwanor
Member

erwanor commented May 22, 2024

The problem is that we don't actually halt the full node. We leave it in an intermediate state where it's able to serve certain requests, but not others. This is because we crash two of its four core services (Consensus and Mempool). We should stop all four, or they should never start; that's the correct behavior.

The node eventually crashes once comet starts making consensus requests, because it believes the application is "ready". This is really not ideal. Crucially, it's unrelated to whether we use a halt bit, a halt counter, or any other mechanism. The same issue will creep up regardless of what the "halting rule" is.

I will push a PR, but today's update will have to do without it, so we need to make sure that validators do not have auto-restart enabled, to avoid having to touch the signing state.
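In outline, the fix gates service startup on application readiness. A Python-flavored sketch, assuming the shape of the fix described later in this thread (#4436); the real change is in Rust, the `App` class here is a toy stand-in, and only `is_ready` and the `--force` override come from that PR:

```python
class App:
    """Toy stand-in for the application; everything except the
    `is_ready` concept and the --force override is hypothetical."""
    def __init__(self, halted: bool):
        self.halted = halted
        self.started = False

    def is_ready(self) -> bool:
        # Not ready if a halt (pending upgrade) has been recorded.
        return not self.halted

    def start_all_services(self) -> None:
        # Stand-in for starting all four ABCI services together.
        self.started = True

def start_node(app: App, force: bool = False) -> None:
    # Refuse to spin up any services on a halted app unless overridden,
    # instead of leaving two of the four ABCI services half-running.
    if not app.is_ready() and not force:
        raise SystemExit("app is not ready (pending upgrade?); use --force to override")
    app.start_all_services()
```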

@aubrika aubrika removed the needs-refinement unclear, incomplete, or stub issue that needs work label May 22, 2024
@aubrika aubrika moved this from Backlog to In progress in Penumbra May 22, 2024
@conorsch
Contributor Author

First attempt at resolution was #4436, which successfully caused pd to exit completely, but a bit too early: on several test runs, pd exited before cometbft had broadcast the halt-minus-1 block. We observed fullnodes stalled at halt-minus-2, which poses problems for migration. This issue remains a problem and should be resolved before proceeding with the next chain upgrade. Ideally, the fix will land in a point release 0.75.1, after which we can proceed with the scheduled 0.76.0 upgrade-via-migration.

@conorsch conorsch changed the title upgrades:priv_validator_state munging can fail upgrades: pd exits too early, before submitting pre-upgrade block May 22, 2024
conorsch pushed a commit that referenced this issue May 23, 2024
Previously, the halting logic was structured such that full nodes would
partially crash two of their four ABCI services (`Consensus` and
`Mempool`); relying on future CometBFT consensus requests to crash the
node.

This PR adds an `App::is_ready` method that callers (pd) SHOULD call in
order to make sure that the application is ready, so that they can avoid
spinning up any services unless an override flag (`--force`) is
specified.

Fix #4432. Fix #4443.

## Checklist before requesting a review

- [x] If this code contains consensus-breaking changes, I have added the
"consensus-breaking" label. Otherwise, I declare my belief that there
are not consensus-breaking changes, for the following reason:

  > Full node mechanical refactor
@github-project-automation github-project-automation bot moved this from In progress to Done in Penumbra May 23, 2024
erwanor added a commit that referenced this issue May 23, 2024, with the same message as above.
conorsch pushed four further commits that referenced this issue May 23, 2024, each cherry-picked from commit 2c9c3f3 with the same message.