base: use separate lease duration for liveness range #93073
Conversation
@tbg I've been playing around with a few different settings here, testing with
(the "Lost leases" metric is unreliable due to the time series sampling resolution).

Even with the "old" leases of 9.0s, I had to drop the liveness lease all the way down to 3.0s to get a significant improvement, and I was still unable to eliminate the effect. It seems like regardless of whether we reduce the standard lease durations in #91947, we'll still have a large blast radius on the liveness range leaseholder, but reducing the liveness lease does improve things in both cases. This will likely be far worse in the case of a network outage, where the network timeouts also come into play.

I'm not really comfortable reducing it this far down -- it's pretty tight, and the write latency of the extension can't exceed 1s before we start having issues due to the stasis period. I think we should instead consider alternative approaches to coalesced lease extensions that don't have single points of failure, as part of the lease protocol improvements we're exploring. Would be interested to get your take here.
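For a sense of how tight a 3.0s liveness lease is, here is a rough budget sketch. The half-interval renewal point and the ~500ms stasis window are assumptions for illustration only, not the exact constants in the code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Rough budget for a liveness lease extension write. Assumptions (for
	// illustration only, not the exact CockroachDB constants): extensions are
	// attempted once half the lease interval has elapsed, and a ~500ms stasis
	// window (max clock offset) before expiration makes the lease unusable.
	const (
		livenessLease = 3 * time.Second        // the 3.0s liveness lease discussed above
		renewFraction = 0.5                    // assumed renewal point
		stasis        = 500 * time.Millisecond // assumed stasis / clock-offset window
	)
	renewAt := time.Duration(float64(livenessLease) * renewFraction)
	budget := livenessLease - renewAt - stasis
	fmt.Printf("extension write must complete within ~%s\n", budget) // ~1s
}
```

Under these assumptions the extension write has roughly a 1s budget, which lines up with the concern above.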
Ran a tiny experiment:

```diff
diff --git a/pkg/kv/kvserver/replica_proposal.go b/pkg/kv/kvserver/replica_proposal.go
index 1e3fdf6f89..52fcc49f23 100644
--- a/pkg/kv/kvserver/replica_proposal.go
+++ b/pkg/kv/kvserver/replica_proposal.go
@@ -266,6 +266,13 @@ func (r *Replica) leasePostApplyLocked(
 	// Log lease acquisitions loudly when verbose logging is enabled or when the
 	// new leaseholder is draining, in which case it should be shedding leases.
 	// Otherwise, log a trace event.
+	if prevExp, newPropTS := prevLease.GetExpiration(), newLease.ProposedTS.ToTimestamp(); prevExp.IsSet() && prevExp.Less(newPropTS) {
+		log.Warningf(ctx,
+			"new range lease %s following %s (downtime at least %.2fs, previous expired at %s, acquisition started at %s)",
+			newLease, prevLease, time.Duration(newPropTS.WallTime-prevExp.WallTime).Seconds(),
+			time.Unix(0, prevExp.WallTime), time.Unix(0, newPropTS.WallTime))
+	}
+
 	if log.V(1) || r.store.IsDraining() {
 		log.Infof(ctx, "new range lease %s following %s", newLease, prevLease)
 	} else {
```
Start a local 3-node cluster, wait for things to settle, bring down the leaseholder, grab the log message.
So it took 2.81s after the expiration of the lease until we started proposing the new lease. I had finished killing the old leaseholder at

Maybe something in this experiment is broken because I'm not isolating the liveness range (and so there are more leases lost than just liveness). Still, I wonder if we can run with this diff in your liveness-isolated test once or twice to see what outputs we get. Ideally, the number reported by this test for the "min downtime" should be ~0, as we ought to be acquiring the lease non-cooperatively the very instant this is possible. If we find that this is not happening, this is worth looking into.

If this peters out, I agree that we may not be able to fix this problem sufficiently by moving towards lower liveness lease expirations, and I share your concern about being too aggressive in lowering them. It may then make sense to postpone attacking this issue.
I suspect this is because we have to wait for a heartbeat request to come in, and they're fairly rare at 4.5s. But yeah, I can rerun an experiment with this.
We already waited 7s until that point, and definitely lost the leader during these 7s, so there ought to be 2 of them (one for each remaining node) already queued.
Yeah, but they have timeouts, so maybe they timed out and then had to wait for the next one to fire. I'll have a closer look.
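A toy sketch of the suspected interaction (purely illustrative; this is not the actual node liveness code): if each heartbeat attempt carries its own timeout and a timed-out attempt isn't retried until the next tick, a single failure can add up to a full heartbeat interval of extra delay before the next lease acquisition attempt. The 3s attempt timeout below is an assumed value.

```go
package main

import (
	"context"
	"log"
	"time"
)

// heartbeatLoop fires a heartbeat every interval, bounding each attempt by
// timeout. A timed-out attempt is not retried immediately; the next attempt
// only happens on the next tick, which is the delay hypothesized above.
func heartbeatLoop(ctx context.Context, interval, timeout time.Duration, beat func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			attemptCtx, cancel := context.WithTimeout(ctx, timeout)
			err := beat(attemptCtx)
			cancel()
			if err != nil {
				log.Printf("heartbeat failed: %v", err) // wait for the next tick to retry
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	// Stub heartbeat that always times out, to show the retry-on-next-tick behavior.
	beat := func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }
	heartbeatLoop(ctx, 4500*time.Millisecond, 3*time.Second, beat)
}
```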
Force-pushed from a3925f3 to e2f0d6a.
This patch reduces the lease interval from 9.0s to 5.0s and the Raft election timeout from 3.0s to 2.0s, to reduce unavailability following a leaseholder crash. This implies that the node heartbeat interval is reduced from 4.5s to 2.5s, the lease acquisition timeout is reduced from 6.0s to 4.0s, and the proposal timeout is reduced from 3.0s-6.0s to 2.0s-4.0s. The gossip sentinel TTL is now tied to the lease expiration interval rather than the lease renewal interval, and has thus changed from 4.5s to 5.0s.

There are three involved quantities:

* `defaultRaftHeartbeatIntervalTicks`: 5 (1.0s). Not changed, to avoid increasing Raft heartbeat costs.
* `defaultRaftElectionTimeoutTicks`: 10 (2.0s). Changed from 15 (3.0s), which still leaves ample headroom by the following reasoning:
  * The network round-trip time (RTT) is expected to be below 400ms (worst-case GCP inter-region latency is 350ms).
  * TCP prevents "dropped" heartbeats. Linux has an RTT-dependent retransmission timeout (RTO) which we can approximate as 1.5x RTT (smoothed RTT + 4x RTT variance), with a lower bound of 200ms. The worst-case RTO is thus 600ms, with another 0.5x RTT (200ms) for the retransmit to reach the peer, leaving 200ms to spare.
  * The actual election timeout per replica is multiplied by a random factor of 1-2, so it's likely to be significantly larger than 2.0s for a given follower (i.e. TCP connection).
* `defaultRangeLeaseRaftElectionTimeoutMultiplier`: 2.5 (5.0s). Changed from 3.0 (9.0s), with the following reasoning:
  * Our target lease interval is 5.0s. Reducing the lease multiplier is better than reducing the election timeout further, to leave some headroom for Raft elections and related timeouts.
  * However, we generally want Raft elections to complete before the lease expires. Since Raft elections have a random 1-2 multiplier, 2.5 ensures the nominal lease interval (5.0s) is always greater than the nominal Raft election timeout (4.0s worst-case), but the relative offset of the last Raft heartbeat (up to 1.0s) and lease extension (up to 2.5s) may cause lease acquisition to sometimes block on Raft elections, skewing the average unavailability upwards to roughly 4.0s.

Roachtest benchmarks of pMax latency during leaseholder failures show a clear improvement:

| Test                            | `master` | This PR |
|---------------------------------|----------|---------|
| `failover/non-system/crash`     | 14.5 s   | 8.3 s   |
| `failover/non-system/blackhole` | 18.3 s   | 14.5 s  |

Alternatively, keeping `defaultRangeLeaseRaftElectionTimeoutMultiplier` at 3 but reducing `defaultRaftElectionTimeoutTicks` to 8 would give a lease interval of 4.8s, with slightly lower average unavailability, but a higher probability of spurious Raft election timeouts, especially with higher RTTs.

Release note (ops change): The Raft election timeout has been reduced from 3 seconds to 2 seconds, and the lease interval from 9 seconds to 5 seconds, with a corresponding reduction in the node heartbeat interval from 4.5 seconds to 2.5 seconds. This reduces the period of unavailability following leaseholder loss, but places tighter restrictions on network latencies (no more than 500ms RTT). These timeouts can be adjusted via the environment variable `COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS`, which now defaults to 10 and will scale the timeouts proportionally.
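For reference, a quick back-of-the-envelope check of how the quantities above relate. The 200ms Raft tick interval and the "node heartbeat is half the lease interval" relation are inferred from the numbers in the commit message (5 ticks = 1.0s; 9.0s -> 4.5s), not taken from the code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed 200ms Raft tick, inferred from "5 ticks (1.0s)" above.
	const raftTick = 200 * time.Millisecond
	const electionTimeoutTicks = 10                 // was 15
	const rangeLeaseElectionTimeoutMultiplier = 2.5 // was 3.0

	electionTimeout := electionTimeoutTicks * raftTick // 2.0s (was 3.0s)
	leaseInterval := time.Duration(
		rangeLeaseElectionTimeoutMultiplier * float64(electionTimeout)) // 5.0s (was 9.0s)
	nodeHeartbeat := leaseInterval / 2 // 2.5s (was 4.5s), assuming heartbeats at half the lease interval

	// Raft randomizes each replica's election timeout by a factor in [1, 2),
	// so the worst case is ~4.0s, still below the 5.0s lease interval.
	worstCaseElection := 2 * electionTimeout

	fmt.Println(electionTimeout, leaseInterval, nodeHeartbeat, worstCaseElection)
}
```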
This patch adds a separate lease duration for the liveness range. When the liveness range leaseholder is lost, other nodes are unable to heartbeat and extend their leases, which can lead to the loss of most leases in the system. Shortening the liveness lease duration shortens the period of unavailability for other leases.

Release note (ops change): The duration of the liveness range lease has been reduced to 2.5s, half of the regular lease duration. When the liveness range leaseholder is lost (e.g. due to an infrastructure outage), other nodes are unable to heartbeat and thus extend their own leases, which can lead to the loss of many of the leases throughout the cluster. Reducing the liveness lease duration reduces the period of unavailability.
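As a rough illustration of the shape of the change (hypothetical names and structure, not the actual `pkg/base` code), the liveness range could derive its lease duration by halving the regular one, per the commit message above:

```go
// Hypothetical sketch only; the real configuration lives in pkg/base and
// differs in names and structure.
package base

import "time"

type RaftConfig struct {
	RangeLeaseActiveDuration time.Duration // e.g. 5.0s with the settings above
}

// RangeLeaseDuration returns the regular range lease duration.
func (c RaftConfig) RangeLeaseDuration() time.Duration {
	return c.RangeLeaseActiveDuration
}

// LivenessRangeLeaseDuration returns the lease duration used for the liveness
// range: half of the regular duration (2.5s with a 5.0s regular lease), so that
// losing the liveness leaseholder stalls heartbeats for a shorter window.
func (c RaftConfig) LivenessRangeLeaseDuration() time.Duration {
	return c.RangeLeaseActiveDuration / 2
}
```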
Force-pushed from e2f0d6a to 6afd5c9.
Did some runs with
Notice how the lease expired at 16:58:06, and was reacquired at 16:58:09. I cross-referenced this with liveness heartbeat requests on n2 attempting to acquire leases, and see a bunch of these continually between 16:58:06 and 16:58:09:
That last error comes from here: `pkg/kv/kvserver/replica_proposal_buf.go`, lines 1252 to 1269 (at 6ed9148).
This originates while we're proposing the lease transfer, and is rejected because we're not the Raft leader: `pkg/kv/kvserver/replica_proposal_buf.go`, lines 614 to 627 (at 6ed9148).
So in order to make this work reliably, we'll have to drop the Raft election timeout well below the lease expiration, maybe something like 1s considering it's randomized by a 1-2x factor, and that really isn't going to work very well. I think we'll have to consider other approaches here.
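To make the failure mode above concrete, here is a much-simplified sketch of the kind of check involved. The real logic in `replica_proposal_buf.go` is considerably more involved; the name, signature, and error text here are illustrative only:

```go
package kvserver

import "fmt"

// maybeRejectLeaseAcquisition is a hypothetical, simplified stand-in for the
// rejection path referenced above: a replica that is not the Raft leader, but
// knows of a live leader, refuses to propose a lease acquisition and instead
// redirects the request. This is why n2's heartbeat-driven lease attempts kept
// failing until a new Raft leader was elected.
func maybeRejectLeaseAcquisition(
	isLeaseRequest bool, selfReplicaID, raftLeaderID uint64, leaderKnownAndLive bool,
) error {
	if !isLeaseRequest {
		return nil // only lease proposals are subject to this check
	}
	if leaderKnownAndLive && raftLeaderID != selfReplicaID {
		// Proposing the lease here would just be rejected by Raft anyway;
		// return an error redirecting the client to the leader.
		return fmt.Errorf("replica %d is not the raft leader (leader is %d); refusing lease proposal",
			selfReplicaID, raftLeaderID)
	}
	return nil
}
```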
The first few commits are from #91947; only the last commit should be reviewed here.