Syncing a node configured as a validator fails #157
Right now, we have an unused [...]. What we can actually do is try to create a block and, if the state is missing, accept that we're creating a pointless proposal anyway (and accordingly do our own stub there). We should also not broadcast our own messages if we know we're behind.
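A minimal sketch of that fallback, assuming hypothetical `state_available`/`propose`/`should_broadcast` helpers rather than the real node APIs:

```rust
struct Block;

struct Node {
    best_number: u64,
    network_best: u64,
}

impl Node {
    // Assumed helper: do we still hold the state needed to build on `parent`?
    fn state_available(&self, parent: u64) -> bool {
        self.best_number.saturating_sub(parent) <= 256
    }

    // Try to create a real block; fall back to a stub if the state is gone.
    fn propose(&self, parent: u64) -> Block {
        if self.state_available(parent) {
            // ... real block building against `parent`'s state ...
            Block
        } else {
            // The proposal is pointless anyway, so return our own cheap stub.
            Block
        }
    }

    // Don't broadcast our own messages if we know we're behind.
    fn should_broadcast(&self) -> bool {
        self.best_number >= self.network_best
    }
}

fn main() {
    let node = Node { best_number: 10_300, network_best: 10_500 };
    let _proposal = node.propose(10_000); // parent state pruned: stub
    assert!(!node.should_broadcast());    // behind: stay quiet
}
```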
We can avoid this by recreating the machine. The existing handle construction code is already quite large, so moving it to its own function may be beneficial, and doing so would provide a clearer path on this...
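A sketch of the shape that refactor could take (all names here are hypothetical): once construction lives in its own function, recreating the machine is just another call to it.

```rust
// Hypothetical stand-in for the machine; the real construction is far larger.
struct TendermintMachine {
    start_block: u64,
}

// Moving the (currently inline) construction into its own function means
// "recreate the machine" is just another call to it.
fn new_machine(start_block: u64) -> TendermintMachine {
    TendermintMachine { start_block }
}

// On detecting we're behind, drop the old machine and build a fresh one
// at the chain's current head.
fn recreate_if_behind(machine: &mut TendermintMachine, best_block: u64) {
    if best_block > machine.start_block {
        *machine = new_machine(best_block);
    }
}

fn main() {
    let mut machine = new_machine(100);
    recreate_if_behind(&mut machine, 500);
    assert_eq!(machine.start_block, 500);
}
```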
This also impacts the processor/coordinator if they attempt to process state which has been pruned. In the immediate term, running a node in archive mode solves this by never pruning the state. In the long run, instead of rate limiting sync, we can configure manual pruning for validators, where Kafka+Tendermint release their right to old state, letting it be pruned; a sketch of that idea follows.
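One way to picture that manual-pruning scheme (a minimal sketch, not the actual design; the `PruningRights` type and consumer names are hypothetical): each consumer independently releases its right to state below some height, and the node only prunes below the minimum released height.

```rust
use std::collections::HashMap;

// Sketch of manual pruning rights: each consumer (e.g. the Tendermint
// machine, the Kafka indexer) advances its released height independently,
// and state may only be pruned below the minimum of those heights.
#[derive(Default)]
struct PruningRights {
    released: HashMap<&'static str, u64>,
}

impl PruningRights {
    fn register(&mut self, consumer: &'static str) {
        self.released.insert(consumer, 0);
    }

    // A consumer declares it no longer needs state below `height`.
    fn release(&mut self, consumer: &'static str, height: u64) {
        if let Some(h) = self.released.get_mut(consumer) {
            *h = (*h).max(height);
        }
    }

    // Everything below this height is safe to prune.
    fn prunable_below(&self) -> u64 {
        self.released.values().copied().min().unwrap_or(0)
    }
}

fn main() {
    let mut rights = PruningRights::default();
    rights.register("tendermint");
    rights.register("kafka");
    rights.release("tendermint", 1_000);
    rights.release("kafka", 750);
    // Kafka is the laggard, so only state below 750 may be pruned.
    assert_eq!(rights.prunable_below(), 750);
}
```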
Removing milestone due to workaround.
Updates to polkadot-v0.9.40, with a variety of dependency updates accordingly. Substrate thankfully now uses k256 0.13, paving the way for #256. We couldn't upgrade to polkadot-v0.9.40 without this change, as polkadot-v0.9.40 makes fundamental changes to syncing. While we could've updated tendermint, it's not worth the continued development effort given its inability to work with multiple validator sets.

Purges sc-tendermint. Keeps tendermint-machine for #163.

Closes #137, #148, #157, #171. #96 and #99 should be re-scoped/clarified. #134 and #159 should also be clarified. #169 is also no longer a priority since we're only considering temporal deployments of tendermint, and #170 isn't either since we're looking at effectively sharded validator sets, so there should be no singular large set needing high performance.
The Tendermint machine is "stepped" whenever a block which has finalizations, yet wasn't created by the machine itself, is imported. This is to recover from network discrepancies and, if there's some delay within the Tendermint subnet, to move forward as soon as possible.
If we're initially syncing as a validator, we may step the machine multiple times per second. With each step, we create a proposal for what we think the next block should be; if we have transactions in our mempool, this may be a time-consuming operation. If the time to step the machine exceeds the time to sync a block, creating a backlog of greater than 256 blocks, we will fail to create proposals, as we'll have deleted the state for the block we're building on top of.
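To make the failure mode concrete, here's a minimal sketch of the arithmetic, assuming a retained window of 256 blocks (the `parent_state_retained` helper is illustrative, not the actual API):

```rust
// Sketch of the backlog race. `PRUNING_WINDOW` is the assumed number of
// recent blocks whose state is retained (256 in this issue).
const PRUNING_WINDOW: u64 = 256;

// True if we can still build a proposal on `parent`: its state hasn't
// been pruned away by the advancing sync.
fn parent_state_retained(best_synced: u64, parent: u64) -> bool {
    best_synced.saturating_sub(parent) <= PRUNING_WINDOW
}

fn main() {
    // While syncing, the head races ahead of the block we're building on.
    let parent = 10_000;
    assert!(parent_state_retained(10_200, parent));  // backlog of 200: fine
    assert!(!parent_state_retained(10_300, parent)); // backlog of 300: proposal fails
}
```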
Accordingly, we should:
A) Instead of only getting SYN notifications of a block being synced, also send an ACK back to create a rate limit. The issue is we'd only send an ACK if we're a validator (though we could have non-validators immediately ACK?). A channel-based sketch of this appears at the end of this issue.
B) Be able to provide a stub block at any time. This may be achievable by instructing the block builder to spend 0 seconds building a block (the current propose function takes in such a boolean); see the sketch after this list. The issue is that we can't always default to this when we sync a block from over the network: if the network's block was simply available first, we'd then propose an empty block when we should be proposing an actual one. It should only be done when initially syncing, which is... difficult to detect. It also doesn't completely remove this race condition, just generally mitigates it.
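A sketch of option B, assuming the propose entry point threads a stub boolean down to the block builder's deadline (the builder API and the 6-second budget here are illustrative):

```rust
use std::time::Duration;

struct Block;

// Illustrative block builder: `deadline` bounds how long we spend
// pulling transactions from the mempool.
fn build_block(deadline: Duration) -> Block {
    if deadline.is_zero() {
        // Spend no time on transactions: an empty stub block.
        return Block;
    }
    // ... real transaction selection within `deadline` would go here ...
    Block
}

// Option B: the propose function takes a boolean selecting a stub.
fn propose(stub: bool) -> Block {
    let deadline = if stub {
        Duration::ZERO
    } else {
        Duration::from_secs(6) // assumed normal proposal budget
    };
    build_block(deadline)
}

fn main() {
    let _fast = propose(true);  // while (we believe) we're initially syncing
    let _full = propose(false); // normal operation
}
```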
I'll also note that it's not just the Tendermint machine at risk of this race condition: the Kafka connection being built into the node is as well. On a new block, it's supposed to index the block's data and submit it to Kafka. If it's doing that while initially syncing, it has the same timing concerns. Considering it can't simply stub its work, we may want to take route A (potentially using our own notification system?) and require multiple ACKs, one from each party who takes a SYN; a sketch follows.
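A sketch of route A with standard library channels (the consumer names and channel shapes are assumptions): the sync loop sends each party a SYN per block and waits for an ACK from every one of them before importing the next; a non-validator could simply ACK immediately.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // One SYN channel per consumer (e.g. tendermint, kafka), and one
    // shared ACK channel back to the sync loop.
    let (ack_tx, ack_rx) = mpsc::channel::<&'static str>();
    let consumers = ["tendermint", "kafka"];
    let mut syn_txs = Vec::new();

    for name in consumers {
        let (syn_tx, syn_rx) = mpsc::channel::<u64>();
        syn_txs.push(syn_tx);
        let ack_tx = ack_tx.clone();
        thread::spawn(move || {
            // Each consumer processes the block, then ACKs.
            // (A non-validator would ACK without doing any work.)
            for block in syn_rx {
                // ... index the block / step the machine here ...
                let _ = block;
                ack_tx.send(name).unwrap();
            }
        });
    }

    // Sync loop: rate limited by requiring an ACK from every consumer
    // before moving on to the next block.
    for block in 0..10u64 {
        for tx in &syn_txs {
            tx.send(block).unwrap();
        }
        for _ in 0..consumers.len() {
            ack_rx.recv().unwrap();
        }
    }
}
```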