
storage: re-enable Raft PreVote RPC #16950

Closed
tbg opened this issue Jul 10, 2017 · 20 comments

@tbg (Member) commented Jul 10, 2017

Enable the Raft PreVote feature. Note that the implementation of Raft PreVote already exists, but stability issues arose when it was turned on. This task is about fixing those issues (if they still exist).

Raft PreVote is a mechanism to lessen the impact of a rogue replica calling a Raft election. When a replica calls a Raft election, it forces the current leader to step down and we have to wait for the election to complete, which disrupts activity on the affected Range. A replica might call an election mistakenly if it was temporarily partitioned away. The unfortunate part about it calling an election when it rejoins the cluster is that the election can't possibly succeed, because the replica is not up to date. With PreVote, a replica first checks whether the other replicas would vote for it before it actually holds an election.
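For context, at the etcd/raft level PreVote is just a flag on raft.Config. The following is a minimal sketch, not CockroachDB's actual setup; the tick and size values are illustrative:

```go
package example

import "github.com/coreos/etcd/raft"

// newRaftConfig sketches a raft.Config with PreVote enabled. The tick and
// size values below are illustrative, not the ones CockroachDB uses.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:              id,
		ElectionTick:    15, // election timeout, in ticks
		HeartbeatTick:   5,  // heartbeat interval, in ticks
		Storage:         storage,
		MaxSizePerMsg:   16 * 1024,
		MaxInflightMsgs: 64,
		// With PreVote, a would-be candidate first asks its peers whether
		// they would grant it a vote, and only bumps its term and starts a
		// real election if a quorum says yes. A stale replica rejoining
		// after a partition therefore cannot depose a healthy leader.
		PreVote: true,
		// CheckQuorum is the alternative protection discussed below; we
		// enable one or the other, never both.
		CheckQuorum: false,
	}
}
```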

Archaeology: we don't really know why it wasn't working, and simply disabled it.

commit 1d59f01c9b3523f31d8415ecbd1096d0b51fbbc6
Author: Tamir Duberstein <tamird@gmail.com>
Date:   Thu Nov 3 12:05:59 2016 -0400

    GLOCKFILE: update dependencies
    
    Summary of not-definitely-irrelevant changes:
    - github.com/cenkalti/backoff:
      - moved from github.com/cenk/backoff: https://github.com/cenkalti/backoff/commit/b02f2bb
    - github.com/cockroachdb/c-protobuf:
      - updated upstream to 3.1.0: https://github.com/cockroachdb/c-protobuf/pull/20
    - github.com/gogo/protobuf:
      - rename generated variables data->dAtA: https://github.com/gogo/protobuf/pull/215
    - github.com/etcd/raft:
      - PreVote-related panic fix: https://github.com/coreos/etcd/pull/6749
    - github.com/lib/pq:
      - data race fix: https://github.com/lib/pq/commit/473284b
    
    Not reviewed in detail:
    - github.com/go-sql-driver/mysql
    - github.com/google/go-github
    - golang.org/x/crypto
    - golang.org/x/net
    - golang.org/x/oauth2
    - golang.org/x/sys
    - golang.org/x/text
    - golang.org/x/tools
    - google.golang.org/appengine
    
    Skipped:
    
    - github.com/cockroachdb/c-rocksdb (#9616)
    - github.com/docker/docker (https://github.com/docker/docker/pull/27912) breaks all dependents
    - google.golang.org/grpc (#9697)

commit 9ce883365db916baa8a6d66d16bc153e7b3977e0
Merge: 99e14f718 f61b94d05
Author: Alex Robinson <alexdwanerobinson@gmail.com>
Date:   Wed Nov 2 12:10:33 2016 -0400

    Merge pull request #10395 from bdarnell/bdarnell/disable-prevote
    
    storage: Switch back from PreVote to CheckQuorum

commit f61b94d05fa83ec31885281c19d89fb97b5d796f
Author: Ben Darnell <ben@cockroachlabs.com>
Date:   Wed Nov 2 22:41:06 2016 +0800

    storage: Switch back from PreVote to CheckQuorum
    
    PreVote has bugs that show up under chaos testing.

commit b0bb16e39a762ab49b0c87308829548be0238c24
Merge: 791502496 601cb79aa
Author: Ben Darnell <ben@bendarnell.com>
Date:   Wed Nov 2 19:42:01 2016 +0800

    Merge pull request #10218 from bdarnell/prevote
    
    storage: Use raft PreVote instead of CheckQuorum

commit 601cb79aa65b9eebd2f142b1c64b753ce995b3be
Author: Ben Darnell <ben@cockroachlabs.com>
Date:   Wed Oct 26 08:46:00 2016 +0900

    storage: Stop calling raftGroup.TickQuiesced
    
    This was a workaround for the interaction between quiesced ranges and
    CheckQuorum, and is no longer needed now that we have switched from
    CheckQuorum to PreVote.
    
    Reverts #9407

commit 65963433dea55bee2b9e707d1497f97838df700c
Author: Ben Darnell <ben@cockroachlabs.com>
Date:   Wed Oct 19 11:07:32 2016 +0800

    storage: Use raft PreVote instead of CheckQuorum
    
    CheckQuorum interacts badly with quiesced ranges and coalesced
    heartbeats; the new PreVote implementation provides the same benefits in
    a way that is more compatible with our use.
    
    Fixes #9561
@tbg added this to the 1.1 milestone on Jul 10, 2017
@bdarnell (Contributor)

The symptom, IIRC, was that after a leader died, the next election just wouldn't succeed. I can't remember whether the range would just do one election and get stuck, or whether it would go into a loop of elections (or pre-elections) that never resolved.

@irfansharif (Contributor) commented Jul 17, 2017

Upstream issue tracked in etcd-io/etcd#8243; it looks like @distributed-sys has already identified the offending codepaths (plus a reproducible test).

@irfansharif (Contributor) commented Jul 20, 2017

Fix: etcd-io/etcd#8288, but per today's discussion it might not make it into 1.1.

@a-robinson (Contributor)

Indigo has been having problems the last few days that pre-vote would probably help with quite a bit.

Specifically, the node liveness range is on two nodes that are close together and one that's farther away (45-50ms). Every so often, the far-away node will run into a MsgTimeoutNow and start an election at the next term. It gets elected leader, but in the process it disrupts liveness heartbeats enough that some take too long and their nodes' epochs get incremented.

There are definitely other things going wrong as well (including a lot more raft log truncation than I'd expect on the liveness range, every 4-10 seconds), but there are a lot of raft elections being caused by timeouts.

[screenshot from 2017-07-31, 11:33 AM]

@irfansharif (Contributor)

I was going to start chaos testing this patch post feature freeze and left this issue open in case other things crop up when enabling it. But if you're willing to experiment, running with COCKROACH_ENABLE_PREVOTE=true and COCKROACH_TICK_QUIESCED=false should use the PreVote mechanism.
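For reference, a rough sketch of how such env-var gates are plausibly wired up, assuming CockroachDB's envutil helper; the variable names and defaults here are illustrative, not necessarily what the storage package actually does:

```go
package storage

import "github.com/cockroachdb/cockroach/pkg/util/envutil"

// Illustrative env-var gates; the actual names and defaults in the code may differ.
var (
	// COCKROACH_ENABLE_PREVOTE=true switches raft from CheckQuorum to PreVote.
	enablePreVote = envutil.EnvOrDefaultBool("COCKROACH_ENABLE_PREVOTE", false)
	// COCKROACH_TICK_QUIESCED=false stops advancing the raft logical clock on
	// quiesced replicas; it defaults to true.
	tickQuiesced = envutil.EnvOrDefaultBool("COCKROACH_TICK_QUIESCED", true)
)
```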

@a-robinson (Contributor)

To circle back on this, I ended up increasing COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS instead and it seems to have settled things down.

@bdarnell (Contributor) commented Aug 7, 2017

Interesting. When PreVote is disabled, we use CheckQuorum instead, which is supposed to have more or less the same effect, but it's a little more timing sensitive. I had thought that the timing sensitivity was unlikely to matter (the main benefit of switching would be to get rid of the TickQuiesced hack), but it looks like PreVote would improve behavior in at least some cases.

I'd also suggest increasing the tick interval instead of the election timeout, since that will also reduce the frequency of heartbeats and reproposals. (Although the default tick interval should be fine for 50ms of latency; I think there's something else going on here that we haven't identified yet, and changing the election timeout or even enabling PreVote is just covering up symptoms.)
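To make that trade-off concrete, the relevant arithmetic (the constants below are illustrative assumptions, not CockroachDB's actual defaults):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative values only; not CockroachDB's actual defaults.
	tickInterval := 200 * time.Millisecond // one raft tick
	electionTicks := 15                    // cf. COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS
	heartbeatTicks := 5

	// Raising electionTicks only stretches the election timeout.
	fmt.Println("election timeout:", time.Duration(electionTicks)*tickInterval) // 3s
	// Raising tickInterval stretches everything measured in ticks, so
	// heartbeats (and hence reproposal checks) also become less frequent.
	fmt.Println("heartbeat interval:", time.Duration(heartbeatTicks)*tickInterval) // 1s
}
```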

@bdarnell (Contributor) commented Aug 7, 2017

We may need to do another dependency update to pick up etcd-io/etcd#8346 before enabling PreVote.

@a-robinson (Contributor)

Yeah, there are still problems with raft behavior on indigo that I need to look into even with the increased election timeout. There are still too many leader-not-leaseholder ranges. Also, I don't think we have it in any metrics anywhere yet, but judging by the verbose logs there are a lot more MsgApps getting rejected than I'd expect.

The reason that I changed the election timeout rather than the tick interval, though, is that we don't expose any option for changing the tick interval. If we think it needs to be increased in high-latency clusters (which it may), we should go back to exposing it. We removed it because we didn't want it to be a flag (#15547), but we didn't add an environment variable to replace it.

@irfansharif (Contributor)

> We may need to do another dependency update to pick up etcd-io/etcd#8346 before enabling PreVote.

etcd-io/etcd#8334 too, right?

@bdarnell (Contributor) commented Aug 7, 2017

etcd-io/etcd#8334 shouldn't matter for us since it is about the interaction between PreVote and CheckQuorum, and we only enable one or the other; never both.
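In other words (an illustrative sketch of that exclusivity, not the actual configuration code):

```go
package example

import "github.com/coreos/etcd/raft"

// applyVoteProtection is an illustrative sketch: either PreVote or
// CheckQuorum is enabled, never both, so the upstream PreVote+CheckQuorum
// interaction is never exercised here.
func applyVoteProtection(cfg *raft.Config, enablePreVote bool) {
	cfg.PreVote = enablePreVote
	cfg.CheckQuorum = !enablePreVote
}
```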

@petermattis (Collaborator)

@irfansharif Have we tested this on production clusters?

@irfansharif (Contributor) commented Aug 17, 2017

Somewhat, insofar as I've only run this against ultramarine, which is just four nodes. I'll follow up with bigger clusters (blue or indigo) to run over this weekend.

@bdarnell (Contributor) commented Sep 1, 2017

For the record: pre-vote was enabled in #17808, then disabled again in #18128

@irfansharif (Contributor)

tracking the recent wedged pre-candidates: #18129

@petermattis modified the milestones: 1.2, 1.1 on Sep 5, 2017
@bdarnell assigned bdarnell and unassigned irfansharif on Jan 22, 2018
@bdarnell (Contributor)

OK, I'm paging all of this back in to take another shot at enabling PreVote in 2.0.

@bdarnell (Contributor) commented Feb 1, 2018

In my chaos testing today, #18151 appears to be fixed. However, there are still some performance problems: when I kill a node, p99 (and p90) write latency on the other nodes spikes up to multiple seconds. (We don't see spikes like this without PreVote. Some requests must wait several seconds for lease expiration in any case, but normally few enough of them that they don't show up in the p99 graph.) I don't have any leads yet, but I'm still investigating.

@bdarnell (Contributor) commented Feb 2, 2018

It turns out that TickQuiesced is important whether PreVote is enabled or not.

We have tied the use of TickQuiesced to the inverse of the enablePreVote setting because it is necessary to prevent deadlocks when using CheckQuorum (which we enable if and only if we are not using PreVote). However, whether CheckQuorum is used or not, TickQuiesced has the side benefit of improving our responsiveness to node outages (at the expense of a somewhat higher risk of contested elections).

While a Replica is quiesced, time doesn't pass normally for it. Without TickQuiesced, this means that the election timeout starts when the Replica unquiesces, so the request that unquiesced the range must wait for the entire election timeout. TickQuiesced keeps the (raft logical) clock running but does not itself trigger elections. This leaves the Replica primed to start an election immediately when something unquiesces it.

Note that no matter what happens at the raft level, we have to wait for leases to expire. However, time spent waiting for leases to expire is real time and applies to all leases on all ranges. The logical time used within raft is per-replica, so it affects a larger number of requests. This is an example where hybrid logical clocks are better than purely logical ones.

Accumulating time in TickQuiesced effectively disables one of the raft algorithm's protections against contested elections: the randomized election timeout (between N ticks and 2*N ticks). Since we've been running with TickQuiesced for over a year, though, this doesn't seem to be a significant issue.

TickQuiesced has recently shown up as a performance issue for nodes with large numbers of ranges. With PreVote (but not with CheckQuorum), I think we can still mitigate this by changing from ticking the logical clock while quiesced to triggering an immediate campaign when unquiescing. (This would normally raise concerns about contested elections but it doesn't seem to matter much in practice).

Further testing is in progress but I think that unconditionally enabling TickQuiesced will be enough to let us switch from CheckQuorum to PreVote.
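A rough sketch of the ticking behavior described above, using the RawNode.Tick/TickQuiesced methods from the vendored etcd/raft; the surrounding function is illustrative, not the actual Replica code:

```go
package example

import "github.com/coreos/etcd/raft"

// tick sketches the per-replica tick described above; illustrative only.
func tick(rn *raft.RawNode, quiesced bool) {
	if quiesced {
		// Advance raft's logical clock without any other processing. No
		// election can fire from here, but the elapsed-tick counter keeps
		// accumulating, so the replica is primed to campaign (or to notice
		// an expired leader) as soon as something unquiesces it.
		rn.TickQuiesced()
		return
	}
	// Normal tick: leaders send heartbeats, followers check their
	// (randomized) election timeout.
	rn.Tick()
}
```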

@petermattis (Collaborator)

Interesting. The waking of a quiesced range usually happens on the leaseholder which likely mitigates the lack of randomized election timeout.

@bdarnell (Contributor) commented Feb 2, 2018

The waking of a quiesced range always happens on the leaseholder as long as the lease is valid. When a lease expires, one of the other nodes will unquiesce their replica. If a leaseholder dies and then both followers of the range receive traffic at the same time, we'll have a contested election. That doesn't happen very often with our uniform kv workloads. It might be more of a problem with more skewed workloads. But even in the worst case I think this just adds 1 RTT to the time taken to complete the election, since once both followers are unquiesced, the normal randomized timeouts apply.

This is backed up by Heidi Howard's ARC paper, which showed benefits from using a shorter non-random timeout for the first election attempt, and then using randomized elections if there is a conflict.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Feb 2, 2018
TickQuiesced is beneficial both with and without PreVote, so we want
it enabled even when PreVote is on.

Updates cockroachdb#16950

Release note: None