Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover nodes requesting ChangeView when possible #579

Merged
merged 153 commits into from
Feb 16, 2019

Conversation

jsolman
Copy link
Contributor

@jsolman jsolman commented Jan 30, 2019

Recovery messages can be used in response to ChangeView messages in order to bring nodes back to the current state of the consensus. This is achieved safely by keeping the payloads from the network messages and including pertinent parts along with the witness invocation scripts in a recovery message. The recovery messages facilitate regenerating the network messages and reprocessing them.

The recovery is all encompassing, including:

  • Replaying ChangeView messages
  • Replaying the PrepareRequest mesage if it was present
  • Replaying PrepareResponse messages were present.
  • Replaying the Commit messages.

Considering n = 3f + 1, only f nodes generally must respond to the ChangeView message with the recovery message; the exception is in the case that a node has their CommitSent flag set; in which case they will always reply to ChangeView messages with a recovery message.

Nodes that have committed CommitSent are not allowed to change views.

Some complications arise from needing to accept a previously generated prepare request if the primary is being recovered when coming from another view, or the same view having not yet sent the prepare request. In these cases the primary will act as a accept its own previously generated prepare request, but won't send any prepare response.

Since ChangeView may result in recovery messages being sent each time they are received, it is necessary to prevent replay of such messages from generating a duplicate recovery message response to prevent the potential for DDOS. This is achieved by causing ChangeView messages to always generate a new timestamp so each request will have a unique hash. In this way ChangeView messages are only replied to once by broadcasting a recovery message.

The code has been tested for a number of scenarios and is ready for thorough review and subsequent merge.

Here are some of the cases I tested:

  • Repeatedly restoring view when no prepare request sent
    • kill off all nodes except 1 that isn't the primary; bring back up 1 node at a time while killing previously brought up node each time -- validated view is always restored on nodes
  • Restore the primary's view with no pepare request sent and have it generate the prepare request
  • Kill the primary after having sent the prepare request and bring him back up with only other node up and verify that the primary restores and accepts his own previously sent prepare request
  • Restore node to a view with prepare request sent, but not enough recovery messages to commit
  • Restore node to a view with prepare request sent and enough messages to commit, verify restore and commit
  • Restore node from a higher view number on it's change view to a lower number that has commit set and verify it restores to the lower view and commits

Closes #426

@jsolman jsolman requested a review from erikzhang January 30, 2019 00:36
@jsolman
Copy link
Contributor Author

jsolman commented Jan 31, 2019

By keeping the signatures from the change view that moved to the current view, we will be able to allow nodes to always accept regeneration messages to a newer view. I'm working on making that change now.

Copy link
Member

@vncoelho vncoelho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, Jeff, It looks like going in the precise and right direction.
I still did not revise it, just looked at the main variables.

@jsolman
Copy link
Contributor Author

jsolman commented Jan 31, 2019

@erikzhang May want to take a look at b938f96 . In some testing where I had killed one CN node after sending commit, it got stuck on start-up after attempting once to send the commit. May want to cherry-pick b938f96 directly from here into consensus/improved_dbft if you like it.

@jsolman
Copy link
Contributor Author

jsolman commented Jan 31, 2019

Since the regeneration message knows the change view witness signatures, it should be possible to send recovery even if the PrepareRequest wasn't sent. If the nodes were able to agree on a view to move to, it needs to recover other nodes to that view also, it may even be able to recover the now that will be the primary and greatly help speed up consensus.

@jsolman
Copy link
Contributor Author

jsolman commented Feb 16, 2019

My last commit message should have said Remove unused private member variable

@jsolman
Copy link
Contributor Author

jsolman commented Feb 16, 2019

I updated the top overview again.

@neo-project neo-project deleted a comment from erikzhang Feb 16, 2019
@jsolman
Copy link
Contributor Author

jsolman commented Feb 16, 2019

Oops didn't mean to delete your comment. I agree with the change from 3c801dc

@jsolman
Copy link
Contributor Author

jsolman commented Feb 16, 2019

It's looking pretty clean. Shall I give it one more round of testing now?

erikzhang pushed a commit that referenced this pull request Mar 3, 2019
* Add commit phase to consensus algorithm (#534)

* Add commit phase to consensus algorithm

* fix tests

* Prevent repeated sending of `Commit` messages

* RPC call gettransactionheight (#541)

* getrawtransactionheight

Nowadays two calls are need to get a transaction height, `getrawtransaction` with `verbose` and then use the `blockhash`.
Other option is to use `confirmations`, but it can be misleading.

* Minnor fix

* Shargon's tip

* modified

* Allow to use the wallet inside a RPC plugin (#536)

* Clean code

* Clean code

* Minor fix on mempoolVerified

* Add MemoryPool Unit tests. Fix bug on initital start of Persisting the Genesis block.

* Prevent `ConsensusService` from receiving messages before starting (#573)

* Prevent `ConsensusService` from receiving messages before starting

* fixed tests - calling OnStart now

* Consensus recovery log (#572)

* Pass store to `ConsensusService`

* Implement `ISerializable` in `ConsensusContext`

* Start from recovery log

* Fix unit tests due to constructor taking the store.

* Add unit tests for serializing and deserializing the consensus context.

* Combine `ConsensusContext.ChangeView()` and `ConsensusContext.Reset()`

* Add `PreparationHash` field to `PrepareResponse` to prevent replay attacks from malicious primary (#576)

* Fixed a problem where `PrepareResponse.PreparationHash` was not assigned.

* Load context from store only when height matches

* Recover nodes requesting ChangeView when possible (#579)

* Fixes bug in `OnPrepareRequestReceived()`

* Send `RecoveryMessage` only when `message.NewViewNumber <= context.ViewNumber`

* Fix and optimize view changing (#590)

* Allow to ignore the recovery logs

* Add `isRecovering` (#594)

* Fix accepting own prepare request (#596)

* Pick some changes from #575.

* Fixes `Prefixes`

* Restore transactions from saved consensus context (#598)

* Refactoring

* AggressiveInlining (#606)

* Reset Block reference when consensus context is initialized after block persist. (#608)

* Change `ConsensusPayload` for compatibility (#609)
Thacryba pushed a commit to simplitech/neo that referenced this pull request Feb 17, 2020
* update consensus

fix minor issues and change the formatting

* minor issues
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Critical Issues (bugs) that need to be fixed ASAP Enhancement Type - Changes that may affect performance, usability or add new features to existing modules.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants