-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent Commit after ChangeView #642
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just detected this issue this afternoon.
Perfect and well done, @jsolman.
Thanks for solving this quick and in a simple manner. Good vision.
I will soon complete some basic test cases and approve it.
Check this, Jeff: [22:44:38.406] initialize: height=550 view=0 index=3 role=Backup
[22:44:40.282] OnPrepareRequestReceived: height=550 view=0 index=2 tx=1
[22:44:40.282] send prepare response
[22:44:40.946] timeout: height=550 view=0
[22:44:40.946] request change view: height=550 view=0 nv=1
[22:44:40.997] OnCommitReceived: height=550 view=0 index=1
[22:44:41.003] OnCommitReceived: height=550 view=0 index=2
[22:44:41.009] OnCommitReceived: height=550 view=0 index=0
[22:44:41.014] relay block: 0x596994b17b42de6480863dfa4df823fdb56a22f054f8d7efcc91d54c377c9dfd
[22:44:41.044] persist block: 0x596994b17b42de6480863dfa4df823fdb56a22f054f8d7efcc91d54c377c9dfd
[22:44:41.044] initialize: height=551 view=0 index=3 role=Primary aehauheua, as expected relaying a block by acting as a bridge. :D 💃 |
One more adjustment is needed to prevent things getting stuck. If we detect that |
For those who want to check the logs of this last problem. [23:17:07.861] initialize: height=2055 view=0 index=2 role=Backup
[23:17:09.819] OnPrepareRequestReceived: height=2055 view=0 index=3 tx=1
[23:17:09.820] send prepare response
[23:17:09.872] timeout: height=2055 view=0
[23:17:09.878] request change view: height=2055 view=0 nv=1
[23:17:10.645] OnCommitReceived: height=2055 view=0 index=3
[23:17:10.655] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.664] OnChangeViewReceived: height=2055 view=0 index=1 nv=1
[23:17:10.672] OnChangeViewReceived: height=2055 view=0 index=2 nv=1 [23:17:07.927] initialize: height=2055 view=0 index=0 role=Backup
[23:17:09.820] OnPrepareRequestReceived: height=2055 view=0 index=3 tx=1
[23:17:09.820] send prepare response
[23:17:09.877] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.878] send commit
[23:17:09.935] OnCommitReceived: height=2055 view=0 index=3
[23:17:10.848] OnRecoveryMessageReceived: height=2055 view=0 index=3
[23:17:10.929] timeout: height=2055 view=0 [23:17:08.812] initialize: height=2055 view=0 index=3 role=Primary
[23:17:09.816] timeout: height=2055 view=0
[23:17:09.816] send prepare request: height=2055 view=0
[23:17:09.836] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.837] OnPrepareResponseReceived: height=2055 view=0 index=0
[23:17:09.838] send commit
[23:17:10.718] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.846] timeout: height=2055 view=0
[23:17:10.846] send recovery to resend commit [23:17:07.931] initialize: height=2055 view=0 index=1 role=Backup
[23:17:09.823] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.832] OnPrepareResponseReceived: height=2055 view=0 index=0
[23:17:09.938] timeout: height=2055 view=0
[23:17:09.940] request change view: height=2055 view=0 nv=1
[23:17:09.994] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.005] OnCommitReceived: height=2055 view=0 index=3 |
Let's just let this comment and let it flow...aehuaheahuea // A possible attack can happen if the last committed node does not want to recover the others
// In addition, currently, if a the node asking change views crashes it will come back and possibly accept recover from any committed node, thus, splitting nodes among committed nodes and possible stalling the network
public static bool FNodesValidCommitted(this IConsensusContext context) => context.CommitPayloads.Count(p => p != null) >= context.F(); |
…twork to stall creating blocks.
neo/Consensus/ConsensusService.cs
Outdated
@@ -293,7 +296,7 @@ private void OnRecoveryMessageReceived(ConsensusPayload payload, RecoveryMessage | |||
ReverifyAndProcessPayload(changeViewPayload); | |||
} | |||
if (message.ViewNumber != context.ViewNumber) return; | |||
if (!context.CommitSent()) | |||
if (!context.ViewChanging() && !context.CommitSent()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should allow receiving PrepareRequest
and PrepareResponse
even when ViewChanging() == true
. Because ViewChanging()
indicates that I want to change views, it may not be able to change successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comments below if it accepts prepare requests and prepare responses in the current view after it has sent a ChangeView
message it can end up sending the other nodes to the next view, where they will never be able to receive the recovery message from this view with the commits, since it will be a lower view number; thus the network will be stalled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the view has not been changed, then we should allow consensus on the current view. Otherwise, we may not be able to reach a consensus or change views.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the view is changing for a node (then the view has to be considered to be changed from it's perspective), because it has already sent a change view message that can be used to allow others to accept a change view, so it cannot commit in the current view. If it did commit in the current view after sending a change view, then the others can all accept moving to the later view, where they won't be able to reach consensus.
If more than F
nodes have committed, then this code does allow receiving PrepareRequest
and PrepareResponse
in the current view effectively to obtain consensus by treating the ViewChanging
flag as not changing. Maybe you missed this line that is changed in this PR:
Line 39 in ee72d58
public static bool MoreThanFNodesCommitted(this IConsensusContext context) => context.CommitPayloads.Count(p => p != null) > context.F(); |
If less than or equal to F
nodes have been committed then they will still be able to obtain consensus in a higher view. So this code should address your concern of nodes being able to reach consensus in the current view when it is not possible to change view anymore.
…r changing view and coming to consensus.
It will be good to have a baseline before this change. However, it doesn't take very severe network latency to potentially cause the stall that should be fixed by this issue, so I suspect we should have this tested afterward if it stalls even once during NGD testing. |
@erikzhang, until this commit 57065f8 it basically solves the issue. Maybe we can revert at this point, @jsolman and open a new PR with 442bfe8. For me, both are good, but until that commit looks quite necessary at the moment. |
@vncoelho I already did remove the code that saved consensus context to a separate PR. This only has the minimal required now to fix the issue. |
Thanks, @jsolman. |
…cepting PrepareRequest and PrepareResponse from different views." This reverts commit 9838afc.
Conflicts: neo/Consensus/ConsensusService.cs neo/Consensus/Helper.cs
As always, great job @erikzhang, it is looks cleaner now. |
NGD will test it these days. |
I'm running a test using the P2P plugin now that drops commits and prepare responses 50% of the time.
Edit : I added dropping prepare response 50% of the time above since this simulates the problem this solves. |
So far testing is going well. No stalls. |
Stressing it harder with dropping other message types now too. Dropping prepare response 50% of the time is a good way to simulate the problem this solves. |
Testing indicates this is working well so far. I've heard good things from NGD so far also. I look forward to hearing their final result. |
Good test @jsolman ! |
I’ve thought of a better solution than this. Expect an alternative PR within 48 hours that solves the same problem as this PR, while being cleaner and maintaining liveness. Also without needing #643. |
@jsolman, I like the idea of allowing committed nodes to move to higher views! I think it is a necessary assumption of the liveness of However, I think we need #643 and also the lock protection for nodes with flag It is necessary to discuss this "special rule" in detail and think about in order to be sure that it will not break our assumption of never creating We created an example in the thread above. Let's discuss that. |
Give me a couple of days for testing, i am trying to make some stats. |
Go for it, @shargon. In addition, PR643 has potential in "helping" (not completely "avoiding") a "crashed" node to not contribute in a lower view instead of the last one it knew. |
Such a hard answer...@erikzhang...aehauheauhea I am joking, Erik. I do not see a problem in merging this and later discussing #653, even if it is after 1-2 weeks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erikzhang, fell free to merge when NGD finishes tests.
I think this should go to Testnet asap.
If a not satisfactory behavior is detected on Testnet we move in another direction.
But if it is ok let's put on Mainnet and keep discussions going on. Maybe in some more couple of months/weeks/even_days we can achieve an even more polished version.
@erikzhang I agree with moving forward with this when you are ready. Other solutions like #653 still have issues and need further adjustments before they are ready; arguably they have other issues that can be potentially worse. |
Since we do not ever allow the view to jump backward during recovery, we cannot allow change view to occur after more than
F
nodes have committed to any view. Therefore, if a node has sent a change view message, it can not be allowed to reach commit in the view from which it has requested to change. This fix addresses the issue and prevents a stall.