You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I tried running the raft tests with the patch above to test the current coverage of PreVote enabled raft, I then did the same with CheckQuorum and the different permutations of these two opt-in raft enhancements. Naively it seems that by virtue of these two features being opt-in, a lot of the existing tests don't directly test the newly introduced codepaths. With just the patch above:
A lot of this (take TestLeaderSupersedingWithCheckQuorum for e.g.) is of course coupled tightly with CheckQuorum being enabled, which it does set manually, but evidently it's behavior that fails with PreVote simultaneously enabled. Again, naively, if this in fact is the intended permutation to be tested I would expect the test to manually set PreVote to false. More often than not however I don't think it is and we're potentially missing bugs.
Similarly if defaulting CheckQuorum to true, a lot of tests fail given we induce leader elections in the test fixtures with simple MsgHup messages, which under CheckQuorum will get dropped if the follower nodes have heard from the leader within r.electionElapsed < r.electionTimeout.
In light of this it seems that simpler things (like TestProposal is just not tested for when {CheckQuorum,PreVote} is enabled, even when it would make sense to do so. This is not universally true, for TestLeaderElection for example we do have TestLeaderElection{PreVote,WithCheckQuorum} for either variant but notably we're missing a variant testing leader election with both.
I wanted to get your thoughts on the matter, presumably ironing these out in tests first would have uncovered some subtle bugs we've experienced in the past. With PreVote we had #8288 and more recently cockroachdb/cockroach#18151, some other instances of this include #8334 and #8346 which seem like very targeted/ad hoc fixes. At the very least it'll allow us to reason about (and hopefully document) when to use what, under what circumstances does it make sense to use CheckQuorumandPreVote, when does it make sense to explicitly use one and not the other.
Refer to raft thesis 9.6 Preventing disruptions when a server rejoins the cluster.
it seems Pre-Vote is good for network partition or jitter. For example.
Assume cluster have 3 nodes [A, B, C], the network between C and A,B is not stable, the link will jitter every hour.
If without Pre-vote, node C will election periodically to make the cluster leader step down. But if with Pre-Vote, node A/B will work as node C cannot increase term to step down cluster leader.
checkQuorum seems good for disruptive server which had been removed from the cluster, but it don't resolve the issue of network jitter. (thesis 4.2.3 Disruptive servers)
So it's good to enable both checkQuorum and Pre-vote.
Also checkQuorum can step down stale leader, see thesis 6.2 Routing requests to the leader.
Leaders: A server might be in the leader state, but if it isn’t the current leader, it could be
needlessly delaying client requests. For example, suppose a leader is partitioned from the
rest of the cluster, but it can still communicate with a particular client. Without additional
mechanism, it could delay a request from that client forever, being unable to replicate a log
entry to any other servers. Meanwhile, there might be another leader of a newer term that is
able to communicate with a majority of the cluster and would be able to commit the client’s
request. Thus, a leader in Raft steps down if an election timeout elapses without a successful
round of heartbeats to a majority of its cluster; this allows clients to retry their requests with
another server.
I think currently ReadOnlyLeaseBased also relying on checkQuorum. see #8450
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
I tried running the raft tests with the patch above to test the current coverage of
PreVote
enabled raft, I then did the same withCheckQuorum
and the different permutations of these two opt-in raft enhancements. Naively it seems that by virtue of these two features being opt-in, a lot of the existing tests don't directly test the newly introduced codepaths. With just the patch above:A lot of this (take
TestLeaderSupersedingWithCheckQuorum
for e.g.) is of course coupled tightly withCheckQuorum
being enabled, which it does set manually, but evidently it's behavior that fails withPreVote
simultaneously enabled. Again, naively, if this in fact is the intended permutation to be tested I would expect the test to manually setPreVote
to false. More often than not however I don't think it is and we're potentially missing bugs.Similarly if defaulting
CheckQuorum
to true, a lot of tests fail given we induce leader elections in the test fixtures with simpleMsgHup
messages, which underCheckQuorum
will get dropped if the follower nodes have heard from the leader withinr.electionElapsed < r.electionTimeout
.In light of this it seems that simpler things (like
TestProposal
is just not tested for when{CheckQuorum,PreVote}
is enabled, even when it would make sense to do so. This is not universally true, forTestLeaderElection
for example we do haveTestLeaderElection{PreVote,WithCheckQuorum}
for either variant but notably we're missing a variant testing leader election with both.I wanted to get your thoughts on the matter, presumably ironing these out in tests first would have uncovered some subtle bugs we've experienced in the past. With
PreVote
we had #8288 and more recently cockroachdb/cockroach#18151, some other instances of this include #8334 and #8346 which seem like very targeted/ad hoc fixes. At the very least it'll allow us to reason about (and hopefully document) when to use what, under what circumstances does it make sense to useCheckQuorum
andPreVote
, when does it make sense to explicitly use one and not the other.+cc @bdarnell @xiang90.
The text was updated successfully, but these errors were encountered: