
storage: re-enable Raft PreVote RPC #16950

Closed
tbg opened this issue Jul 10, 2017 · 20 comments

@tbg (Member) commented Jul 10, 2017

Enable the Raft PreVote feature. Note that the implementation of Raft PreVote already exists, but stability issues arose when it was turned on. This task is about fixing those issues (if they still exist).

Raft PreVote is a mechanism to lessen the impact of a rogue replica calling a Raft election. When a replica calls a Raft election, it forces the current leader to step down and we have to wait for the election to complete, which disrupts activity on the affected Range. A replica might call an election mistakenly if it was temporarily partitioned away. The unfortunate part about it calling an election when it rejoins the cluster is that the election can't possibly succeed, because the replica is not up to date. With PreVote, a replica first checks whether the other replicas would vote for it before it actually holds an election.
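For context, at the etcd/raft level PreVote is just a flag on raft.Config. The following is a minimal sketch, not CockroachDB's actual setup; the tick and size values are illustrative:

```go
package example

import "github.com/coreos/etcd/raft"

// newRaftConfig sketches a raft.Config with PreVote enabled. The tick and
// size values below are illustrative, not the ones CockroachDB uses.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:              id,
		ElectionTick:    15, // election timeout, in ticks
		HeartbeatTick:   5,  // heartbeat interval, in ticks
		Storage:         storage,
		MaxSizePerMsg:   16 * 1024,
		MaxInflightMsgs: 64,
		// With PreVote, a would-be candidate first asks its peers whether
		// they would grant it a vote, and only bumps its term and starts a
		// real election if a quorum says yes. A stale replica rejoining
		// after a partition therefore cannot depose a healthy leader.
		PreVote: true,
		// CheckQuorum is the alternative protection discussed below; we
		// enable one or the other, never both.
		CheckQuorum: false,
	}
}
```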

Archaeology: we don't really know why it wasn't working, and simply disabled it.

commit 1d59f01c9b3523f31d8415ecbd1096d0b51fbbc6
Author: Tamir Duberstein <tamird@gmail.com>
Date:   Thu Nov 3 12:05:59 2016 -0400

    GLOCKFILE: update dependencies
    
    Summary of not-definitely-irrelevant changes:
    - github.com/cenkalti/backoff:
      - moved from github.com/cenk/backoff: https://github.com/cenkalti/backoff/commit/b02f2bb
    - github.com/cockroachdb/c-protobuf:
      - updated upstream to 3.1.0: https://github.com/cockroachdb/c-protobuf/pull/20
    - github.com/gogo/protobuf:
      - rename generated variables data->dAtA: https://github.com/gogo/protobuf/pull/215
    - github.com/etcd/raft:
      - PreVote-related panic fix: https://github.com/coreos/etcd/pull/6749
    - github.com/lib/pq:
      - data race fix: https://github.com/lib/pq/commit/473284b
    
    Not reviewed in detail:
    - github.com/go-sql-driver/mysql
    - github.com/google/go-github
    - golang.org/x/crypto
    - golang.org/x/net
    - golang.org/x/oauth2
    - golang.org/x/sys
    - golang.org/x/text
    - golang.org/x/tools
    - google.golang.org/appengine
    
    Skipped:
    
    - github.com/cockroachdb/c-rocksdb (#9616)
    - github.com/docker/docker (https://github.com/docker/docker/pull/27912) breaks all dependents
    - google.golang.org/grpc (#9697)

commit 9ce883365db916baa8a6d66d16bc153e7b3977e0
Merge: 99e14f718 f61b94d05
Author: Alex Robinson <alexdwanerobinson@gmail.com>
Date:   Wed Nov 2 12:10:33 2016 -0400

    Merge pull request #10395 from bdarnell/bdarnell/disable-prevote
    
    storage: Switch back from PreVote to CheckQuorum

commit f61b94d05fa83ec31885281c19d89fb97b5d796f
Author: Ben Darnell <ben@cockroachlabs.com>
Date:   Wed Nov 2 22:41:06 2016 +0800

    storage: Switch back from PreVote to CheckQuorum
    
    PreVote has bugs that show up under chaos testing.

commit b0bb16e39a762ab49b0c87308829548be0238c24
Merge: 791502496 601cb79aa
Author: Ben Darnell <ben@bendarnell.com>
Date:   Wed Nov 2 19:42:01 2016 +0800

    Merge pull request #10218 from bdarnell/prevote
    
    storage: Use raft PreVote instead of CheckQuorum

commit 601cb79aa65b9eebd2f142b1c64b753ce995b3be
Author: Ben Darnell <ben@cockroachlabs.com>
Date:   Wed Oct 26 08:46:00 2016 +0900

    storage: Stop calling raftGroup.TickQuiesced
    
    This was a workaround for the interaction between quiesced ranges and
    CheckQuorum, and is no longer needed now that we have switched from
    CheckQuorum to PreVote.
    
    Reverts #9407

commit 65963433dea55bee2b9e707d1497f97838df700c
Author: Ben Darnell <ben@cockroachlabs.com>
Date:   Wed Oct 19 11:07:32 2016 +0800

    storage: Use raft PreVote instead of CheckQuorum
    
    CheckQuorum interacts badly with quiesced ranges and coalesced
    heartbeats; the new PreVote implementation provides the same benefits in
    a way that is more compatible with our use.
    
    Fixes #9561
@tbg added this to the 1.1 milestone on Jul 10, 2017
@bdarnell (Contributor)

The symptom, IIRC, was that after a leader died, the next election just wouldn't succeed. I can't remember whether the range would just do one election and get stuck, or whether it would go into a loop of elections (or pre-elections) that never resolved.

@irfansharif (Contributor) commented Jul 17, 2017

Upstream issue tracked in etcd-io/etcd#8243; it looks like @distributed-sys has already identified the offending codepaths (plus a reproducible test).

@irfansharif (Contributor) commented Jul 20, 2017

Fix: etcd-io/etcd#8288, but per today's discussion it might not make it into 1.1.

@a-robinson (Contributor)

Indigo has been having problems the last few days that pre-vote would probably help with quite a bit.

Specifically, the node liveness range is on two nodes that are close together and one that's farther away (45-50ms). Every so often, the far-away node will run into a MsgTimeoutNow and start an election at the next term. It gets elected leader, but in the process it disrupts liveness heartbeats enough that some take too long and their nodes' epochs get incremented.

There are definitely other things going wrong as well (including a lot more raft log truncation than I'd expect on the liveness range, every 4-10 seconds), but there are a lot of raft elections being caused by timeouts.

[screenshot from 2017-07-31, 11:33 AM]

@irfansharif (Contributor)

I was going to start chaos testing this patch post feature freeze and left this issue open in case other things crop up when enabling it. But if you're willing to experiment, running with COCKROACH_ENABLE_PREVOTE=true and COCKROACH_TICK_QUIESCED=false should use the PreVote mechanism.
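For reference, a rough sketch of how such env-var gates are plausibly wired up, assuming CockroachDB's envutil helper; the variable names and defaults here are illustrative, not necessarily what the storage package actually does:

```go
package storage

import "github.com/cockroachdb/cockroach/pkg/util/envutil"

// Illustrative env-var gates; the actual names and defaults in the code may differ.
var (
	// COCKROACH_ENABLE_PREVOTE=true switches raft from CheckQuorum to PreVote.
	enablePreVote = envutil.EnvOrDefaultBool("COCKROACH_ENABLE_PREVOTE", false)
	// COCKROACH_TICK_QUIESCED=false stops advancing the raft logical clock on
	// quiesced replicas; it defaults to true.
	tickQuiesced = envutil.EnvOrDefaultBool("COCKROACH_TICK_QUIESCED", true)
)
```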

@a-robinson (Contributor)

To circle back on this, I ended up increasing COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS instead and it seems to have settled things down.

@bdarnell (Contributor) commented Aug 7, 2017

Interesting. When PreVote is disabled, we use CheckQuorum instead, which is supposed to have more or less the same effect, but it's a little more timing sensitive. I had thought that the timing sensitivity was unlikely to matter (the main benefit of switching would be to get rid of the TickQuiesced hack), but it looks like PreVote would improve behavior in at least some cases.

I'd also suggest increasing the tick interval instead of the election timeout, since that will also reduce the frequency of heartbeats and reproposals. (Although the default tick interval should be fine for 50ms of latency; I think there's something else going on here that we haven't identified yet, and changing the election timeout or even enabling PreVote is just covering up symptoms.)
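To make that trade-off concrete, the relevant arithmetic (the constants below are illustrative assumptions, not CockroachDB's actual defaults):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative values only; not CockroachDB's actual defaults.
	tickInterval := 200 * time.Millisecond // one raft tick
	electionTicks := 15                    // cf. COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS
	heartbeatTicks := 5

	// Raising electionTicks only stretches the election timeout.
	fmt.Println("election timeout:", time.Duration(electionTicks)*tickInterval) // 3s
	// Raising tickInterval stretches everything measured in ticks, so
	// heartbeats (and hence reproposal checks) also become less frequent.
	fmt.Println("heartbeat interval:", time.Duration(heartbeatTicks)*tickInterval) // 1s
}
```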

@bdarnell (Contributor) commented Aug 7, 2017

We may need to do another dependency update to pick up etcd-io/etcd#8346 before enabling PreVote.

@a-robinson (Contributor)

Yeah, there are still problems with raft behavior on indigo that I need to look into even with the increased election timeout. There are still too many leader-not-leaseholder ranges. Also, I don't think we have it in any metrics anywhere yet, but judging by the verbose logs there are a lot more MsgApps getting rejected than I'd expect.

The reason that I changed the election timeout rather than the tick interval, though, is that we don't expose any option for changing the tick interval. If we think it needs to be increased in high-latency clusters (which it may), we should go back to exposing it. We removed it because we didn't want it to be a flag (#15547), but we didn't add an environment variable to replace it.

@irfansharif (Contributor)

> We may need to do another dependency update to pick up etcd-io/etcd#8346 before enabling PreVote.

etcd-io/etcd#8334 too, right?

@bdarnell (Contributor) commented Aug 7, 2017

etcd-io/etcd#8334 shouldn't matter for us since it is about the interaction between PreVote and CheckQuorum, and we only enable one or the other; never both.
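In other words (an illustrative sketch of that exclusivity, not the actual configuration code):

```go
package example

import "github.com/coreos/etcd/raft"

// applyVoteProtection is an illustrative sketch: either PreVote or
// CheckQuorum is enabled, never both, so the upstream PreVote+CheckQuorum
// interaction is never exercised here.
func applyVoteProtection(cfg *raft.Config, enablePreVote bool) {
	cfg.PreVote = enablePreVote
	cfg.CheckQuorum = !enablePreVote
}
```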

@petermattis (Collaborator)

@irfansharif Have we tested this on production clusters?

@irfansharif (Contributor) commented Aug 17, 2017

Somewhat, insofar as I've only run this against ultramarine, which is just four nodes. I'll follow up with bigger clusters (blue or indigo) to run over this weekend.

@bdarnell (Contributor) commented Sep 1, 2017

For the record: pre-vote was enabled in #17808, then disabled again in #18128

@irfansharif (Contributor)

tracking the recent wedged pre-candidates: #18129

@petermattis modified the milestones: 1.2, 1.1 on Sep 5, 2017
@bdarnell assigned bdarnell and unassigned irfansharif on Jan 22, 2018
@bdarnell (Contributor)

OK, I'm paging all of this back in to take another shot at enabling PreVote in 2.0.

@bdarnell (Contributor) commented Feb 1, 2018

In my chaos testing today, #18151 appears to be fixed. However, there are still some performance problems: when I kill a node, p99 (and p90) write latency on the other nodes spikes up to multiple seconds. (We don't see spikes like this without PreVote. Some requests must wait several seconds for lease expiration in any case, but normally few enough of them that they don't show up in the p99 graph.) I don't have any leads yet, but I'm still investigating.

@bdarnell (Contributor) commented Feb 2, 2018

It turns out that TickQuiesced is important whether PreVote is enabled or not.

We have tied the use of TickQuiesced to the inverse of the enablePreVote setting because it is necessary to prevent deadlocks when using CheckQuorum (which we enable if and only if we are not using PreVote). However, whether CheckQuorum is used or not, TickQuiesced has the side benefit of improving our responsiveness to node outages (at the expense of a somewhat higher risk of contested elections).

While a Replica is quiesced, time doesn't pass normally for it. Without TickQuiesced, this means that the election timeout starts when the Replica unquiesces, so the request that unquiesced the range must wait for the entire election timeout. TickQuiesced keeps the (raft logical) clock running but does not itself trigger elections. This leaves the Replica primed to start an election immediately when something unquiesces it.

Note that no matter what happens at the raft level, we have to wait for leases to expire. However, time spent waiting for leases to expire is real time and applies to all leases on all ranges. The logical time used within raft is per-replica, so it affects a larger number of requests. This is an example where hybrid logical clocks are better than purely logical ones.

Accumulating time in TickQuiesced effectively disables one of the raft algorithm's protections against contested elections: the randomized election timeout (between N ticks and 2*N ticks). Since we've been running with TickQuiesced for over a year, though, this doesn't seem to be a significant issue.

TickQuiesced has recently shown up as a performance issue for nodes with large numbers of ranges. With PreVote (but not with CheckQuorum), I think we can still mitigate this by changing from ticking the logical clock while quiesced to triggering an immediate campaign when unquiescing. (This would normally raise concerns about contested elections but it doesn't seem to matter much in practice).

Further testing is in progress but I think that unconditionally enabling TickQuiesced will be enough to let us switch from CheckQuorum to PreVote.
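A rough sketch of the ticking behavior described above, using the RawNode.Tick/TickQuiesced methods from the vendored etcd/raft; the surrounding function is illustrative, not the actual Replica code:

```go
package example

import "github.com/coreos/etcd/raft"

// tick sketches the per-replica tick described above; illustrative only.
func tick(rn *raft.RawNode, quiesced bool) {
	if quiesced {
		// Advance raft's logical clock without any other processing. No
		// election can fire from here, but the elapsed-tick counter keeps
		// accumulating, so the replica is primed to campaign (or to notice
		// an expired leader) as soon as something unquiesces it.
		rn.TickQuiesced()
		return
	}
	// Normal tick: leaders send heartbeats, followers check their
	// (randomized) election timeout.
	rn.Tick()
}
```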

@petermattis (Collaborator)

Interesting. The waking of a quiesced range usually happens on the leaseholder which likely mitigates the lack of randomized election timeout.

@bdarnell (Contributor) commented Feb 2, 2018

The waking of a quiesced range always happens on the leaseholder as long as the lease is valid. When a lease expires, one of the other nodes will unquiesce their replica. If a leaseholder dies and then both followers of the range receive traffic at the same time, we'll have a contested election. That doesn't happen very often with our uniform kv workloads. It might be more of a problem with more skewed workloads. But even in the worst case I think this just adds 1 RTT to the time taken to complete the election, since once both followers are unquiesced, the normal randomized timeouts apply.

This is backed up by Heidi Howard's ARC paper, which showed benefits from using a shorter non-random timeout for the first election attempt, and then using randomized elections if there is a conflict.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Feb 2, 2018
TickQuiesced is beneficial both with and without PreVote, so we want
it enabled even when PreVote is on.

Updates cockroachdb#16950

Release note: None