Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: adjust election timeout on restart #9364

Closed
wants to merge 6 commits into from

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Feb 26, 2018

Still advance ticks on bootstrapping to fresh cluster.
But on restart, only advance 1/10 of original election ticks.

Address #9333.

Manually tested that it adjusts election ticks.

/cc @xiang90 @jpbetz

@gyuho gyuho force-pushed the sync-rafthttp branch 2 times, most recently from 5bfa52a to fe9bff5 Compare February 26, 2018 22:34
@codecov-io
Copy link

codecov-io commented Feb 27, 2018

Codecov Report

Merging #9364 into master will increase coverage by 0.39%.
The diff coverage is 98.48%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9364      +/-   ##
==========================================
+ Coverage   72.36%   72.75%   +0.39%     
==========================================
  Files         362      362              
  Lines       30795    30846      +51     
==========================================
+ Hits        22285    22443     +158     
+ Misses       6869     6787      -82     
+ Partials     1641     1616      -25
Impacted Files Coverage Δ
rafthttp/remote.go 71.42% <100%> (ø) ⬆️
rafthttp/peer_status.go 88.88% <100%> (+2.68%) ⬆️
etcdserver/membership/cluster.go 87.2% <100%> (+0.21%) ⬆️
etcdserver/raft.go 89.36% <100%> (ø) ⬆️
etcdserver/server.go 80.41% <100%> (+1.16%) ⬆️
rafthttp/peer.go 90.22% <100%> (+1.5%) ⬆️
rafthttp/transport.go 83.06% <75%> (-0.19%) ⬇️
raft/node.go 90.62% <0%> (-1.79%) ⬇️
etcdctl/ctlv3/command/printer_simple.go 72.72% <0%> (-1.4%) ⬇️
lease/lessor.go 85.81% <0%> (-0.81%) ⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5132017...3257823. Read the comment docs.

@wojtek-t
Copy link
Contributor

/subscribe

cc @mborsz

@gyuho gyuho force-pushed the sync-rafthttp branch 3 times, most recently from d63b940 to 5405135 Compare March 1, 2018 23:21
@xiang90
Copy link
Contributor

xiang90 commented Mar 1, 2018

the approach seems fine. can someone reproduce the observed problem with/without this patch to make sure the problem is fixed by the patch?

@@ -417,7 +407,6 @@ func startNode(cfg ServerConfig, cl *membership.RaftCluster, ids []types.ID) (id
raftStatusMu.Lock()
raftStatus = n.Status
raftStatusMu.Unlock()
advanceTicksForElection(n, c.ElectionTick)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still should advanceTicks for newly start node. is there a reason not to do so?

@@ -521,9 +523,54 @@ func NewServer(cfg ServerConfig) (srv *EtcdServer, err error) {
}
srv.r.transport = tr

activePeers := 0
for _, m := range cl.Members() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

establishing connection can take time. probably need some delay here.

plog.Infof("%s is advancing %d ticks for faster election (election tick %d)", srv.ID(), tick, cfg.ElectionTicks)
advanceTicksForElection(n, tick)
} else {
// on restart, there is likely an active peer already
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even for restart case, we should still consider the number of active member. if there is none, we still can advance ticks, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we would need wait until the local node finds its peers (cl.Members() > 0), to do that. I will play around it to address https://github.com/coreos/etcd/pull/9364/files#r171727961.

@wojtek-t
Copy link
Contributor

wojtek-t commented Mar 2, 2018

@xiang90 @gyuho - we have patched this PR internall and will run tests with that;
we will get back to you probably some time early next week

@gyuho
Copy link
Contributor Author

gyuho commented Mar 3, 2018

@xiang90 @wojtek-t

Last four commits add more detailed logging and better estimation on active peers.

Out of 3-node cluster,

Case 1. All 3 nodes start fresh (bootstrapping). In this case, fast-forward election ticks with last tick left.

embed: 729934363faa4a24 is advancing 9 ticks for faster election (election tick 10, restarted 'false', 3 found peer(s))

embed: 729934363faa4a24 started serving peers with 1 found peer(s) (no fast-forwarding election ticks after serving peers)

Case 2. Only 2 nodes are up, 1 node is down. The 1-node restarted. In this case, do not advance election ticks.

embed: 729934363faa4a24 started serving peers with 1 found peer(s) (no fast-forwarding election ticks after serving peers)

Case 3. All 3 nodes are down. Third node restarts with no active peer.
In this case, it advances only 1 tick out of 10 ticks (because there is no active peer).

embed: 7339c4e5e833c029 is advancing 1 ticks after serving peers (election tick 10, restarted 'true', 0 found peer(s))

@xiang90
Copy link
Contributor

xiang90 commented Mar 3, 2018

Why do we need to differentiate restart vs fresh start?

The strategy should be simple as this.

  1. if single node cluster -> advance ticks (only self can be leader anyway)

  2. if not single node cluster, waits for the peer connection reports.

    • If all the connection are failed, do nothing (needs to wait for other peer to be online anyway; advancing ticks here have no effect at all)
    • If there is any active peers, advance ticks (if leader exists, this peer should receive a heartbeat soon; or we should start voting)

The only change we should do in this PR is to wait for the peer to be connected or the first connection is failed before advance ticks.

@xiang90
Copy link
Contributor

xiang90 commented Mar 3, 2018

The most serious problem before is just that we failed to wait for the connection status before advancing ticks.

@gyuho
Copy link
Contributor Author

gyuho commented Mar 3, 2018

@xiang90 I differentiated fresh cluster to get away with waiting, since its member list is already populated on start. But you are right, we can simplify this (since we also have discovery services on fresh cluster).

waits for the peer connection reports

Will make server wait up to 5 second, which is rafthttp.ConnReadTimeout, to simplify this logic.

If there is any active peers, advance ticks (if leader exists, this peer should receive a heartbeat soon; or we should start voting)

You mean advancing with adjusted ticks, right? Rejoining node to existing cluster can still have >1 active peers after 5-second wait time. If we have only one tick left, it can be still disruptive when the last tick elapse before leader heartbeat. Fast-forwarding with election tick / 10 (or any other adjustment) is the number that I am experimenting with.

Will clean this up.

@xiang90
Copy link
Contributor

xiang90 commented Mar 3, 2018

Will make server wait up to 5 second, which is rafthttp.ConnReadTimeout, to simplify this logic.

Blindly waiting for 5 seconds is bad. Peer might be connected well before 5 seconds.

@gyuho
Copy link
Contributor Author

gyuho commented Mar 3, 2018

@xiang90

Blindly waiting for 5 seconds is bad. Peer might be connected well before 5 seconds.

Yeah, I was thinking of adding notify routine from rafthttp, so that we discover the connectivity earlier.

@xiang90
Copy link
Contributor

xiang90 commented Mar 3, 2018

If we have only one tick left, it can be still disruptive when the last tick elapse before leader heartbeat.

then forward it until there are two ticks. Leader should send a heartbeat within one tick. Giving it one more tick should be enough.

@gyuho
Copy link
Contributor Author

gyuho commented Mar 3, 2018

Giving it one more tick should be enough.

Sounds good. Will work on it.

Thanks!

@gyuho
Copy link
Contributor Author

gyuho commented Mar 5, 2018

Now

fresh node (to 3-node cluster)

etcdserver: b548c2511513015 fast-forwarding 9 ticks (election ticks 10) with 3 found member(s)

restart node (to 3-node cluster)

rafthttp: peer 729934363faa4a24 became active
etcdserver: b548c2511513015 fast-forwarding 8 ticks (election ticks 10) with 3 found member(s)

restarted single-node cluster

etcdserver: 8e9e05c52164694d waited 5s but no active peer found (or restarted 1-node cluster)

@xiang90
Copy link
Contributor

xiang90 commented Mar 5, 2018

restarted single-node cluster

This we do not fast forward ticks for this case?

The strategy should be simple as this.

  1. if single node cluster -> advance ticks (only self can be leader anyway)

  2. if not single node cluster, waits for the peer connection reports.
    If all the connection are failed, do nothing (needs to wait for other peer to be online anyway; advancing ticks here have no effect at all)
    If there is any active peers, advance ticks (if leader exists, this peer should receive a heartbeat soon; or we should start voting)

if we follow this policy, it should be fast forwarded, no?

@gyuho
Copy link
Contributor Author

gyuho commented Mar 5, 2018

if we follow this policy, it should be fast forwarded, no?

restarted single-node cluster

This we do not fast forward ticks for this case?

I left it as TODO for now.

Let me see if we can also handle the single-node case as well.

@gyuho gyuho force-pushed the sync-rafthttp branch 2 times, most recently from 1f03eea to 9d4440d Compare March 6, 2018 16:10
@gyuho gyuho force-pushed the sync-rafthttp branch 4 times, most recently from d02b04e to eab6108 Compare March 7, 2018 22:34
@mborsz
Copy link
Contributor

mborsz commented Mar 8, 2018

With @wojtek-t we have patched this pull request (the version from March 2nd: commits fe9b909 and 5405135) internally and tested this week.
This patch seems to work -- I haven't seen any member rejoining etcd cluster that triggered leadership election.

@gyuho gyuho mentioned this pull request Mar 8, 2018
25 tasks
gyuho added 6 commits March 8, 2018 11:06
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
@gyuho
Copy link
Contributor Author

gyuho commented Mar 8, 2018

We've made adjust logic more fine-grained so that it can handle the restarting 1-node cluster. It would be great if we can confirm with latest commits as well. Thanks.

@@ -527,6 +539,62 @@ func NewServer(cfg ServerConfig) (srv *EtcdServer, err error) {
}
srv.r.transport = tr

// fresh start
if !haveWAL {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to care about restart vs fresh start?

see #9364 (comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just easier, so that fresh start does not need to synchronize with peer connection reports. But as you suggested, let me simplify the logic (#9364 (comment)).


srv.goAttach(func() {
select {
case <-cl.InitialAddNotify():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty complicated. let us just get the peer list from the existing snapshot. we do not need to ensure all the configuration in the wal file are executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason for that is reconfiguration in infrequent. and moving from one -> N nodes cluster is even more infrequent. snapshot will contain the correct information 99% of the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to cover all cases where there's no snapshot (which needs to populate member lists from WAL). But, agree that this should be simplified by loading members from snapshot. Will rework on this.

@gyuho
Copy link
Contributor Author

gyuho commented Mar 8, 2018

Xiang has a good point. This is a bit too complicated. I will create a separate PR with simpler solution.

@gyuho
Copy link
Contributor Author

gyuho commented Mar 10, 2018

This will be replaced by #9415.

@gyuho gyuho closed this Mar 10, 2018
@gyuho gyuho deleted the sync-rafthttp branch March 14, 2018 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants