raft: lower test coverage for {CheckQuorum,PreVote} permutations #8498

irfansharif · 2017-09-05T16:14:36Z

diff --git i/raft/raft_test.go w/raft/raft_test.go
index 357e9bdf6..38e0de3e1 100644
--- i/raft/raft_test.go
+++ w/raft/raft_test.go
@@ -3469,6 +3469,7 @@ func newTestConfig(id uint64, peers []uint64, election, heartbeat int, storage S
                Storage:         storage,
                MaxSizePerMsg:   noLimit,
                MaxInflightMsgs: 256,
+               PreVote:         true,
        }
 }

I tried running the raft tests with the patch above to test the current coverage of PreVote enabled raft, I then did the same with CheckQuorum and the different permutations of these two opt-in raft enhancements. Naively it seems that by virtue of these two features being opt-in, a lot of the existing tests don't directly test the newly introduced codepaths. With just the patch above:

~ env PKG=raft PASSES=unit ./test | grep "FAIL"
--- FAIL: TestNodeReadIndexToOldLeader (0.00s)
--- FAIL: TestFollowerStartElection (0.00s)
--- FAIL: TestCandidateStartNewElection (0.00s)
--- FAIL: TestLeaderElectionInOneRoundRPC (0.00s)
--- FAIL: TestCandidateFallback (0.00s)
--- FAIL: TestLeaderSyncFollowerLog (0.02s)
--- FAIL: TestVoteRequest (0.00s)
--- FAIL: TestLeaderElection (0.01s)
--- FAIL: TestDuelingCandidates (0.00s)
--- FAIL: TestProposal (0.01s)
--- FAIL: TestLeaderSupersedingWithCheckQuorum (0.00s)
--- FAIL: TestLeaderElectionWithCheckQuorum (0.00s)
--- FAIL: TestFreeStuckCandidateWithCheckQuorum (0.00s)
--- FAIL: TestNodeReadIndexToOldLeader (0.00s)
--- FAIL: TestFollowerStartElection (0.00s)
--- FAIL: TestCandidateStartNewElection (0.00s)
--- FAIL: TestLeaderElectionInOneRoundRPC (0.00s)
--- FAIL: TestCandidateFallback (0.00s)
--- FAIL: TestLeaderSyncFollowerLog (0.02s)
--- FAIL: TestVoteRequest (0.00s)
--- FAIL: TestLeaderElection (0.01s)
--- FAIL: TestDuelingCandidates (0.00s)
--- FAIL: TestProposal (0.01s)
--- FAIL: TestLeaderSupersedingWithCheckQuorum (0.00s)
--- FAIL: TestLeaderElectionWithCheckQuorum (0.00s)
--- FAIL: TestFreeStuckCandidateWithCheckQuorum (0.00s)
--- FAIL: TestNodeReadIndexToOldLeader (0.00s)
--- FAIL: TestFollowerStartElection (0.00s)
--- FAIL: TestCandidateStartNewElection (0.00s)
--- FAIL: TestLeaderElectionInOneRoundRPC (0.00s)
--- FAIL: TestCandidateFallback (0.00s)
--- FAIL: TestLeaderSyncFollowerLog (0.03s)
--- FAIL: TestVoteRequest (0.00s)
--- FAIL: TestLeaderElection (0.01s)
--- FAIL: TestDuelingCandidates (0.00s)
--- FAIL: TestProposal (0.01s)
--- FAIL: TestLeaderSupersedingWithCheckQuorum (0.00s)
--- FAIL: TestLeaderElectionWithCheckQuorum (0.00s)
--- FAIL: TestFreeStuckCandidateWithCheckQuorum (0.00s)
FAIL
FAIL    github.com/coreos/etcd/raft     3.611s

A lot of this (take TestLeaderSupersedingWithCheckQuorum for e.g.) is of course coupled tightly with CheckQuorum being enabled, which it does set manually, but evidently it's behavior that fails with PreVote simultaneously enabled. Again, naively, if this in fact is the intended permutation to be tested I would expect the test to manually set PreVote to false. More often than not however I don't think it is and we're potentially missing bugs.
Similarly if defaulting CheckQuorum to true, a lot of tests fail given we induce leader elections in the test fixtures with simple MsgHup messages, which under CheckQuorum will get dropped if the follower nodes have heard from the leader within r.electionElapsed < r.electionTimeout.
In light of this it seems that simpler things (like TestProposal is just not tested for when {CheckQuorum,PreVote} is enabled, even when it would make sense to do so. This is not universally true, for TestLeaderElection for example we do have TestLeaderElection{PreVote,WithCheckQuorum} for either variant but notably we're missing a variant testing leader election with both.

I wanted to get your thoughts on the matter, presumably ironing these out in tests first would have uncovered some subtle bugs we've experienced in the past. With PreVote we had #8288 and more recently cockroachdb/cockroach#18151, some other instances of this include #8334 and #8346 which seem like very targeted/ad hoc fixes. At the very least it'll allow us to reason about (and hopefully document) when to use what, under what circumstances does it make sense to use CheckQuorum and PreVote, when does it make sense to explicitly use one and not the other.

+cc @bdarnell @xiang90.

The text was updated successfully, but these errors were encountered:

lishuai87 · 2017-09-07T13:33:47Z

see comments: #8334 (comment)

Refer to raft thesis 9.6 Preventing disruptions when a server rejoins the cluster.
it seems Pre-Vote is good for network partition or jitter. For example.

    Assume cluster have 3 nodes [A, B, C], the network between C and A,B is not stable, the link will jitter every hour.
    If without Pre-vote, node C will election periodically to make the cluster leader step down. But if with Pre-Vote, node A/B will work as node C cannot increase term to step down cluster leader.

checkQuorum seems good for disruptive server which had been removed from the cluster, but it don't resolve the issue of network jitter.  (thesis 4.2.3 Disruptive servers)

So it's good to enable both checkQuorum and Pre-vote.

Also checkQuorum can step down stale leader, see thesis 6.2 Routing requests to the leader.

Leaders: A server might be in the leader state, but if it isn’t the current leader, it could be
needlessly delaying client requests. For example, suppose a leader is partitioned from the
rest of the cluster, but it can still communicate with a particular client. Without additional
mechanism, it could delay a request from that client forever, being unable to replicate a log
entry to any other servers. Meanwhile, there might be another leader of a newer term that is
able to communicate with a majority of the cluster and would be able to commit the client’s
request. Thus, a leader in Raft steps down if an election timeout elapses without a successful
round of heartbeats to a majority of its cluster; this allows clients to retry their requests with
another server.

I think currently ReadOnlyLeaseBased also relying on checkQuorum. see #8450

stale · 2020-04-07T07:12:06Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

irfansharif mentioned this issue Sep 7, 2017

etcdserver: use PreVote so majority partition's leader isn't forced to step down #8499

Closed

heyitsanthony assigned xiang90 Sep 12, 2017

heyitsanthony added the area/raft label Sep 12, 2017

bdarnell mentioned this issue Jan 23, 2018

storage: re-enable Raft PreVote RPC cockroachdb/cockroach#16950

Closed

stale bot added the stale label Apr 7, 2020

stale bot closed this as completed Apr 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

raft: lower test coverage for {CheckQuorum,PreVote} permutations #8498

raft: lower test coverage for {CheckQuorum,PreVote} permutations #8498

irfansharif commented Sep 5, 2017

lishuai87 commented Sep 7, 2017

stale bot commented Apr 7, 2020

raft: lower test coverage for {CheckQuorum,PreVote} permutations #8498

raft: lower test coverage for {CheckQuorum,PreVote} permutations #8498

Comments

irfansharif commented Sep 5, 2017

lishuai87 commented Sep 7, 2017

stale bot commented Apr 7, 2020