Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: lower test coverage for {CheckQuorum,PreVote} permutations #8498

Closed
irfansharif opened this issue Sep 5, 2017 · 2 comments
Closed
Assignees

Comments

@irfansharif
Copy link
Contributor

diff --git i/raft/raft_test.go w/raft/raft_test.go
index 357e9bdf6..38e0de3e1 100644
--- i/raft/raft_test.go
+++ w/raft/raft_test.go
@@ -3469,6 +3469,7 @@ func newTestConfig(id uint64, peers []uint64, election, heartbeat int, storage S
                Storage:         storage,
                MaxSizePerMsg:   noLimit,
                MaxInflightMsgs: 256,
+               PreVote:         true,
        }
 }

I tried running the raft tests with the patch above to test the current coverage of PreVote enabled raft, I then did the same with CheckQuorum and the different permutations of these two opt-in raft enhancements. Naively it seems that by virtue of these two features being opt-in, a lot of the existing tests don't directly test the newly introduced codepaths. With just the patch above:

~ env PKG=raft PASSES=unit ./test | grep "FAIL"
--- FAIL: TestNodeReadIndexToOldLeader (0.00s)
--- FAIL: TestFollowerStartElection (0.00s)
--- FAIL: TestCandidateStartNewElection (0.00s)
--- FAIL: TestLeaderElectionInOneRoundRPC (0.00s)
--- FAIL: TestCandidateFallback (0.00s)
--- FAIL: TestLeaderSyncFollowerLog (0.02s)
--- FAIL: TestVoteRequest (0.00s)
--- FAIL: TestLeaderElection (0.01s)
--- FAIL: TestDuelingCandidates (0.00s)
--- FAIL: TestProposal (0.01s)
--- FAIL: TestLeaderSupersedingWithCheckQuorum (0.00s)
--- FAIL: TestLeaderElectionWithCheckQuorum (0.00s)
--- FAIL: TestFreeStuckCandidateWithCheckQuorum (0.00s)
--- FAIL: TestNodeReadIndexToOldLeader (0.00s)
--- FAIL: TestFollowerStartElection (0.00s)
--- FAIL: TestCandidateStartNewElection (0.00s)
--- FAIL: TestLeaderElectionInOneRoundRPC (0.00s)
--- FAIL: TestCandidateFallback (0.00s)
--- FAIL: TestLeaderSyncFollowerLog (0.02s)
--- FAIL: TestVoteRequest (0.00s)
--- FAIL: TestLeaderElection (0.01s)
--- FAIL: TestDuelingCandidates (0.00s)
--- FAIL: TestProposal (0.01s)
--- FAIL: TestLeaderSupersedingWithCheckQuorum (0.00s)
--- FAIL: TestLeaderElectionWithCheckQuorum (0.00s)
--- FAIL: TestFreeStuckCandidateWithCheckQuorum (0.00s)
--- FAIL: TestNodeReadIndexToOldLeader (0.00s)
--- FAIL: TestFollowerStartElection (0.00s)
--- FAIL: TestCandidateStartNewElection (0.00s)
--- FAIL: TestLeaderElectionInOneRoundRPC (0.00s)
--- FAIL: TestCandidateFallback (0.00s)
--- FAIL: TestLeaderSyncFollowerLog (0.03s)
--- FAIL: TestVoteRequest (0.00s)
--- FAIL: TestLeaderElection (0.01s)
--- FAIL: TestDuelingCandidates (0.00s)
--- FAIL: TestProposal (0.01s)
--- FAIL: TestLeaderSupersedingWithCheckQuorum (0.00s)
--- FAIL: TestLeaderElectionWithCheckQuorum (0.00s)
--- FAIL: TestFreeStuckCandidateWithCheckQuorum (0.00s)
FAIL
FAIL    github.com/coreos/etcd/raft     3.611s

A lot of this (take TestLeaderSupersedingWithCheckQuorum for e.g.) is of course coupled tightly with CheckQuorum being enabled, which it does set manually, but evidently it's behavior that fails with PreVote simultaneously enabled. Again, naively, if this in fact is the intended permutation to be tested I would expect the test to manually set PreVote to false. More often than not however I don't think it is and we're potentially missing bugs.
Similarly if defaulting CheckQuorum to true, a lot of tests fail given we induce leader elections in the test fixtures with simple MsgHup messages, which under CheckQuorum will get dropped if the follower nodes have heard from the leader within r.electionElapsed < r.electionTimeout.
In light of this it seems that simpler things (like TestProposal is just not tested for when {CheckQuorum,PreVote} is enabled, even when it would make sense to do so. This is not universally true, for TestLeaderElection for example we do have TestLeaderElection{PreVote,WithCheckQuorum} for either variant but notably we're missing a variant testing leader election with both.


I wanted to get your thoughts on the matter, presumably ironing these out in tests first would have uncovered some subtle bugs we've experienced in the past. With PreVote we had #8288 and more recently cockroachdb/cockroach#18151, some other instances of this include #8334 and #8346 which seem like very targeted/ad hoc fixes. At the very least it'll allow us to reason about (and hopefully document) when to use what, under what circumstances does it make sense to use CheckQuorum and PreVote, when does it make sense to explicitly use one and not the other.

+cc @bdarnell @xiang90.

@lishuai87
Copy link
Contributor

see comments: #8334 (comment)

Refer to raft thesis 9.6 Preventing disruptions when a server rejoins the cluster.
it seems Pre-Vote is good for network partition or jitter. For example.

    Assume cluster have 3 nodes [A, B, C], the network between C and A,B is not stable, the link will jitter every hour.
    If without Pre-vote, node C will election periodically to make the cluster leader step down. But if with Pre-Vote, node A/B will work as node C cannot increase term to step down cluster leader.

checkQuorum seems good for disruptive server which had been removed from the cluster, but it don't resolve the issue of network jitter.  (thesis 4.2.3 Disruptive servers)

So it's good to enable both checkQuorum and Pre-vote.

Also checkQuorum can step down stale leader, see thesis 6.2 Routing requests to the leader.

Leaders: A server might be in the leader state, but if it isn’t the current leader, it could be
needlessly delaying client requests. For example, suppose a leader is partitioned from the
rest of the cluster, but it can still communicate with a particular client. Without additional
mechanism, it could delay a request from that client forever, being unable to replicate a log
entry to any other servers. Meanwhile, there might be another leader of a newer term that is
able to communicate with a majority of the cluster and would be able to commit the client’s
request. Thus, a leader in Raft steps down if an election timeout elapses without a successful
round of heartbeats to a majority of its cluster; this allows clients to retry their requests with
another server.

I think currently ReadOnlyLeaseBased also relying on checkQuorum. see #8450

@stale
Copy link

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants