Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syncing a node configured as a validator fails #157

Closed
kayabaNerve opened this issue Nov 14, 2022 · 4 comments
Closed

Syncing a node configured as a validator fails #157

kayabaNerve opened this issue Nov 14, 2022 · 4 comments
Labels
bug Something isn't working tendermint

Comments

@kayabaNerve
Copy link
Member

The Tendermint machine is "stepped" whenever a block is imported, with finalizations, yet wasn't created by the machine itself. This is to recover from network discrepancies, or if there's some delay with the Tendermint subnet, move forward as soon as possible.

If we're initially syncing, as a validator, we may step the machine multiple times per second. With each step, we create a proposal for what we think the next block should be. If we have transactions in our mempool, this may be a time-consuming operation. If the time to step a machine exceeds the time to sync a block, creating a backlog of greater than 256 blocks, we will fail to create proposals as we'll have deleted the state for the block we're building on top of.

Accordingly, we should:

A) Instead of only getting SYN notifications of a block being synced, also send ACK to create a rate limit. The issue is we'd only send ACK if we're a validator (we can have non-validators immediately ACK?).

B) Be able to provide a stub-block whenever. This may be achievable by instructing the block builder to spend 0 seconds building a block (and the current propose function takes in such a boolean). The issue is, we can't always default to this when we sync a block from over the network, as if the network block was simply first available, we'll then propose an empty block next when we should be proposing an actual block. It should only be done when initially syncing which is... difficult, to discover. It also doesn't completely remove this race condition, just generally absolves it.

I'll also note not just the Tendermint machine is at risk of this race condition, yet also the Kafka connection being built into the node. On a new block, it's supposed to index the data from it and submit it to Kafka. If it's doing that when initially syncing, it also has the same time concerns. Considering it can't simply stub it, we may want to take route A (potentially using our own notification system?) and require multiple ACKs (from each party who takes a SYN).

@kayabaNerve kayabaNerve added bug Something isn't working substrate labels Nov 14, 2022
@kayabaNerve kayabaNerve added this to the Protonet 0 milestone Nov 26, 2022
@kayabaNerve kayabaNerve changed the title Syncing a node configured as a validator may fail Syncing a node configured as a validator fails Nov 30, 2022
@kayabaNerve
Copy link
Member Author

Right now, we have an unused stub variable to spend zero time on producing blocks. It's intended to be set to true on initial sync.

What we can actually do is try to create a block, and if the state is missing, now that we're creating a pointless proposal anyways (and accordingly do our own stub there).

We should also not broadcast our own messages if we know we're behind.

@kayabaNerve
Copy link
Member Author

We can void this by recreating the machine. The existing handle construction code is already quite large, so moving it to its own function may be beneficial, and doing so would provide a clearer path on this...

@kayabaNerve
Copy link
Member Author

This also impacts the processor/coordinator, if they attempt to process state which has been pruned.

Immediately, running a node in archive mode will solve this by never pruning the state. In the long run, instead of rate limiting sync, we can configure manual pruning for validators, where Kafka+Tendermint release their right to old state, letting it be pruned.

@kayabaNerve kayabaNerve removed this from the Protonet 0 milestone Jan 28, 2023
@kayabaNerve
Copy link
Member Author

Removing milestone due to workaround.

kayabaNerve added a commit that referenced this issue Mar 26, 2023
Updates to polkadot-v0.9.40, with a variety of dependency updates accordingly.
Substrate thankfully now uses k256 0.13, pathing the way for #256. We couldn't
upgrade to polkadot-v0.9.40 without this due to polkadot-v0.9.40 having
fundamental changes to syncing. While we could've updated tendermint, it's not
worth the continued development effort given its inability to work with
multiple validator sets.

Purges sc-tendermint. Keeps tendermint-machine for #163.

Closes #137, #148, #157, #171. #96 and #99 should be re-scoped/clarified. #134
and #159 also should be clarified. #169 is also no longer a priority since
we're only considering temporal deployments of tendermint. #170 also isn't
since we're looking at effectively sharded validator sets, so there should
be no singular large set needing high performance.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working tendermint
Projects
None yet
Development

No branches or pull requests

2 participants