When playing with the etcd Raft library, I noticed the following issue when PreVote is enabled.
Consider a three-node Raft cluster with nodes A, B, and C. Say C is isolated from both A and B by a network partition, so it is expected to keep its initial Term value of 1. A and B hold a few elections and thus move their terms to a higher value, say 5. Let's also assume that B is the current leader and A is a follower.
Now if C recovers from the network partition and the leader B crashes at roughly the same time, we end up with A at term 5 and C at term 1. When they start PreVote, C's MsgPreVote and MsgPreVoteResp messages carry term values lower than A's term, so both are ignored by A (code here). C is expected to receive a MsgPreVote from A carrying a higher term value, but C's term will not change in response to a MsgPreVote (see comments here). There is currently no leader, so there are no MsgHeartbeat messages in the system that could bring C's term up to date.
It seems to me that the cluster thus gets stuck, as no node can complete the PreVote phase. @bdarnell and @xiang90, could you please have a look and shed some light on it? Did I miss something? Many thanks!
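The stalemate can be illustrated with a toy model that is independent of the raft package. This is only a sketch under stated assumptions: the `node` and `message` types and the `preVoteTally` helper are hypothetical and are not part of etcd's API, the pre-vote request is assumed to carry the candidate's term plus one, and log comparison is left out entirely. It encodes just the two rules described above: a message with a lower term is ignored, and handling a MsgPreVote never changes the receiver's term, so the response carries the receiver's unchanged term.

```go
package main

import "fmt"

// msgType is a hypothetical stand-in for the message types named in the report.
type msgType int

const (
	msgPreVote msgType = iota
	msgPreVoteResp
)

// message is a toy message, not pb.Message.
type message struct {
	typ  msgType
	term uint64
}

// node is a toy node, not the raft struct from the library.
type node struct {
	name string
	term uint64
}

// accepts models rule 1: messages carrying a lower term are ignored.
func (n *node) accepts(m message) bool {
	return m.term >= n.term
}

// handlePreVote models rule 2: a pre-vote request never changes the
// receiver's term, so the response carries the receiver's own term.
// Log comparison is omitted (assumption of this sketch).
func (n *node) handlePreVote(m message) (message, bool) {
	if !n.accepts(m) {
		return message{}, false
	}
	return message{typ: msgPreVoteResp, term: n.term}, true
}

// preVoteTally counts the pre-votes a candidate collects from one peer,
// including its own vote. The request term of term+1 is an assumption.
func preVoteTally(candidate, peer *node) int {
	votes := 1 // the candidate votes for itself
	req := message{typ: msgPreVote, term: candidate.term + 1}
	if resp, ok := peer.handlePreVote(req); ok && candidate.accepts(resp) {
		votes++
	}
	return votes
}

func main() {
	a := &node{name: "A", term: 5}
	c := &node{name: "C", term: 1}
	const quorum = 2 // majority of a three-node cluster; B is down

	fmt.Printf("%s collects %d pre-vote(s), needs %d\n", a.name, preVoteTally(a, c), quorum)
	fmt.Printf("%s collects %d pre-vote(s), needs %d\n", c.name, preVoteTally(c, a), quorum)
}
```

In this model A's request reaches C, but C's response carries term 1 and A drops it, while C's request carries term 2 and A drops it outright. Neither side reaches a quorum, and neither term ever changes.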
I hacked together a quick test for the scenario described above.
func TestNodeWithSmallerTermCanCompleteElection(t *testing.T) {
	a := newTestRaft(1, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
	b := newTestRaft(2, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
	c := newTestRaft(3, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())

	a.becomeFollower(1, None)
	b.becomeFollower(1, None)
	c.becomeFollower(1, None)

	a.preVote = true
	b.preVote = true
	c.preVote = true

	// cause a network partition to isolate node 3
	nt := newNetwork(a, b, c)
	nt.cut(1, 3)
	nt.cut(2, 3)

	// start a few elections to bump the term values
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})
	nt.send(pb.Message{From: 3, To: 3, Type: pb.MsgHup})

	sm := nt.peers[1].(*raft)
	if sm.state != StateLeader {
		t.Errorf("peer 1 state: %s, want %s", sm.state, StateLeader)
	}
	sm = nt.peers[2].(*raft)
	if sm.state != StateFollower {
		t.Errorf("peer 2 state: %s, want %s", sm.state, StateFollower)
	}
	sm = nt.peers[3].(*raft)
	if sm.state != StatePreCandidate {
		t.Errorf("peer 3 state: %s, want %s", sm.state, StatePreCandidate)
	}

	nt.send(pb.Message{From: 2, To: 2, Type: pb.MsgHup})
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})
	nt.send(pb.Message{From: 2, To: 2, Type: pb.MsgHup})
	nt.send(pb.Message{From: 2, To: 2, Type: pb.MsgProp, Entries: []pb.Entry{{Data: []byte("some data")}}})

	// check whether the term values are expected
	// a.Term == 5
	// b.Term == 5
	// c.Term == 1
	sm = nt.peers[1].(*raft)
	if sm.Term != 5 {
		t.Errorf("peer 1 term: %d, want %d", sm.Term, 5)
	}
	sm = nt.peers[2].(*raft)
	if sm.Term != 5 {
		t.Errorf("peer 2 term: %d, want %d", sm.Term, 5)
	}
	sm = nt.peers[3].(*raft)
	if sm.Term != 1 {
		t.Errorf("peer 3 term: %d, want %d", sm.Term, 1)
	}

	// check state
	// a == follower
	// b == leader
	// c == pre-candidate
	sm = nt.peers[1].(*raft)
	if sm.state != StateFollower {
		t.Errorf("peer 1 state: %s, want %s", sm.state, StateFollower)
	}
	sm = nt.peers[2].(*raft)
	if sm.state != StateLeader {
		t.Errorf("peer 2 state: %s, want %s", sm.state, StateLeader)
	}
	sm = nt.peers[3].(*raft)
	if sm.state != StatePreCandidate {
		t.Errorf("peer 3 state: %s, want %s", sm.state, StatePreCandidate)
	}

	sm.logger.Infof("going to bring back peer 3 and kill peer 2")
	// recover the network, then immediately isolate b, which is currently
	// the leader, to emulate the crash of b
	nt.recover()
	nt.cut(2, 1)
	nt.cut(2, 3)

	// call for election
	nt.send(pb.Message{From: 3, To: 3, Type: pb.MsgHup})
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})

	// do we have a leader among the two surviving peers (a and c)?
	sma := nt.peers[1].(*raft)
	smc := nt.peers[3].(*raft)
	if sma.state != StateLeader && smc.state != StateLeader {
		t.Errorf("no leader")
	}
}
We don't currently use PreVote in CockroachDB - we tried it briefly and ran into problems, which were probably similar to this one. Thank you for the test, this should help us get to the bottom of this problem quickly. CC @irfansharif, who just started looking into this.