storage: re-enable Raft PreVote RPC #16950
Comments
The symptom, IIRC, was that after a leader died, the next election just wouldn't succeed. I can't remember whether the range would just do one election and get stuck or if it would go in a loop of elections (or pre-elections) that never resolved.
Upstream issue tracked in etcd-io/etcd#8243; looks like @distributed-sys has already identified the offending codepaths (+ reproducible test).
Fix: etcd-io/etcd#8288, but as per discussion today it might not make it into 1.1.
Indigo has been having problems the last few days that pre-vote would probably help with quite a bit. Specifically, the node liveness range is on two nodes that are close together and one that's farther away (45-50ms). Every so often, the far away node will run into an election timeout. There are definitely other things going wrong as well (including a lot more raft log truncation than I'd expect on the liveness range, every 4-10 seconds), but there are a lot of raft elections being caused by timeouts.
Was going to start chaos testing this patch post feature freeze; left this issue open in case other things crop up when enabling it. But feel free to experiment with it in the meantime.
To circle back on this, I ended up increasing the election timeout instead.
Interesting. When PreVote is disabled, we use CheckQuorum instead, which is supposed to have more or less the same effect, but it's a little more timing sensitive. I had thought that the timing sensitivity was unlikely to matter (the main benefit of switching would be to get rid of the TickQuiesced hack), but it looks like PreVote would improve behavior in at least some cases.

I'd also suggest increasing the tick interval instead of the election timeout, since that will also reduce the frequency of heartbeats and reproposals. (Although the default tick interval should be fine for 50ms of latency; I think there's something else going on here that we haven't identified yet, and changing the election timeout or even enabling PreVote is just covering up symptoms.)
We may need to do another dependency update to pick up etcd-io/etcd#8346 before enabling PreVote.
Yeah, there are still problems with raft behavior on indigo that I need to look into even with the increased election timeout. There are still too many leader-not-leaseholder ranges. Also, I don't think we have it in any metrics anywhere yet, but judging by the verbose logs there are a lot more elections happening than I'd expect.

The reason that I changed the election timeout rather than the tick interval, though, is that we don't expose any option for changing the latter. If we think it needs to be increased in high-latency clusters (which it may), we should go back to exposing it. We removed it because we didn't want it to be a flag (#15547), but didn't add an environment variable to replace it.
etcd-io/etcd#8334 too, right?
etcd-io/etcd#8334 shouldn't matter for us since it is about the interaction between PreVote and CheckQuorum, and we only enable one or the other; never both.
@irfansharif Have we tested this on production clusters? |
Somewhat, insofar as I've only run this against our test clusters so far.
tracking the recent wedged pre-candidates: #18129 |
OK, I'm paging all of this back in to take another shot at enabling PreVote in 2.0.
In my chaos testing today #18151 appears to be fixed. However, there are still some performance problems: when I kill a node, p99 (and p90) write latency on the other nodes spikes up to multiple seconds. (We don't see spikes like this without PreVote. Some requests must wait several seconds for lease expiration in any case, but there are normally few enough of them that they don't show up in the p99 graph.) I don't have any leads yet, but I'm still investigating.
It turns out that TickQuiesced is important whether PreVote is enabled or not. We have tied the use of TickQuiesced to the inverse of the enablePreVote setting because it is necessary to prevent deadlocks when using CheckQuorum (which we enable if and only if we are not using PreVote). However, whether CheckQuorum is used or not, TickQuiesced has the side benefit of improving our responsiveness to node outages (at the expense of a somewhat higher risk of contested elections).

While a Replica is quiesced, time doesn't pass normally for it. Without TickQuiesced, this means that the election timeout starts when the Replica unquiesces, so the request that unquiesced the range must wait for the entire election timeout. TickQuiesced keeps the (raft logical) clock running but does not itself trigger elections. This leaves the Replica primed to start an election immediately when something unquiesces it.

Note that no matter what happens at the raft level, we have to wait for leases to expire. However, time spent waiting for leases to expire is real time and applies to all leases on all ranges. The logical time used within raft is per-replica, so it affects a larger number of requests. This is an example where hybrid logical clocks are better than purely logical ones.

Accumulating time in TickQuiesced effectively disables one of the raft algorithm's protections against contested elections: the randomized election timeout (between one and two times the base election timeout).

TickQuiesced has recently shown up as a performance issue for nodes with large numbers of ranges. With PreVote (but not with CheckQuorum), I think we can still mitigate this by changing from ticking the logical clock while quiesced to triggering an immediate campaign when unquiescing. (This would normally raise concerns about contested elections, but it doesn't seem to matter much in practice.)

Further testing is in progress, but I think that unconditionally enabling TickQuiesced will be enough to let us switch from CheckQuorum to PreVote.
Interesting. The waking of a quiesced range usually happens on the leaseholder, which likely mitigates the lack of a randomized election timeout.
The waking of a quiesced range always happens on the leaseholder as long as the lease is valid. When a lease expires, one of the other nodes will unquiesce its replica. If a leaseholder dies and then both followers of the range receive traffic at the same time, we'll have a contested election. That doesn't happen very often with our uniform workloads.

This is backed up by Heidi Howard's ARC paper, which showed benefits from using a shorter non-random timeout for the first election attempt, then using randomized elections if there is a conflict.
TickQuiesced is beneficial both with and without PreVote, so we want it enabled even when PreVote is on. Updates cockroachdb#16950 Release note: None
Enable the Raft PreVote feature. Note that the implementation of Raft PreVote exists, but stability issues arose when it was turned on. This task is about fixing those issues (if they still exist).
Raft PreVote is a mechanism to lessen the impact of a rogue replica calling a Raft election. When a replica calls an election it forces the current leader to step down, and we have to wait for the election to complete; this disrupts activity on the affected Range. A replica might call an election mistakenly if it was temporarily partitioned away. The unfortunate part is that an election called by such a replica when it rejoins the cluster can't possibly succeed, because the replica is not up to date. Raft PreVote lets a replica check whether other replicas would vote for it before it actually holds an election.
Archaeology: we don't really know why it wasn't working, and simply disabled it.