base: reduce lease interval and Raft election timeout
This patch reduces the lease interval from 9.0s to 5.0s and Raft election timeout from 3.0s to 2.0s, to reduce unavailability following a leaseholder crash. This implies that the node heartbeat interval is reduced from 4.5s to 2.5s, the lease acquisition timeout is reduced from 6.0s to 4.0s, and the proposal timeout is reduced from 3.0s-6.0s to 2.0s-4.0s. The gossip sentinel TTL is now tied to the lease expiration interval rather than the lease renewal interval, and has thus changed from 4.5s to 5.0s.

There are three involved quantities:

* `defaultRaftHeartbeatIntervalTicks`: 5 (1.0s). Not changed, to avoid increasing Raft heartbeat costs.

* `defaultRaftElectionTimeoutTicks`: 10 (2.0s). Changed from 15 (3.0s), which still leaves ample headroom by the following reasoning:

  * The network round-trip time (RTT) is expected to be below 400ms (worst-case GCP inter-region latency is 350ms).

  * TCP prevents "dropped" heartbeats. Linux has an RTT-dependent retransmission timeout (RTO) which we can approximate as 1.5x RTT (smoothed RTT + 4x RTT variance), with a lower bound of 200ms. The worst-case RTO is thus 600ms, with another 0.5x RTT (200ms) for the retransmit to reach the peer, leaving 200ms to spare.

  * The actual election timeout per replica is multiplied by a random factor of 1-2, so it's likely to be significantly larger than 2.0s for a given follower (i.e. TCP connection).

* `defaultRangeLeaseRaftElectionTimeoutMultiplier`: 2.5 (5.0s). Changed from 3.0 (9.0s), with the following reasoning:

  * Our target lease interval is 5.0s. Reducing the lease multiplier is better than reducing the election timeout further, to leave some headroom for Raft elections and related timeouts.

  * However, we generally want Raft elections to complete before the lease expires. Since Raft elections have a random 1-2 multiplier, 2.5 ensures the nominal lease interval (5.0s) is always greater than the nominal Raft election timeout (4.0s worst-case), but the relative offset of the last Raft heartbeat (up to 1.0s) and lease extension (up to 2.5s) may cause lease acquisition to sometimes block on Raft elections, skewing the average unavailability upwards to roughly 4.0s.

Roachtest benchmarks of pMax latency during leaseholder failures show a clear improvement:

| Test                            | `master` | This PR |
|---------------------------------|----------|---------|
| `failover/non-system/crash`     | 14.5 s   | 8.3 s   |
| `failover/non-system/blackhole` | 18.3 s   | 14.5 s  |

Alternatively, keeping `defaultRangeLeaseRaftElectionTimeoutMultiplier` at 3 but reducing `defaultRaftElectionTimeoutTicks` to 8 would give a lease interval of 4.8s, with slightly lower average unavailability, but a higher probability of spurious Raft election timeouts, especially with higher RTTs.

Release note (ops change): The Raft election timeout has been reduced from 3 seconds to 2 seconds, and the lease interval from 9 seconds to 5 seconds, with a corresponding reduction in the node heartbeat interval from 4.5 seconds to 2.5 seconds. This reduces the period of unavailability following leaseholder loss, but places tighter restrictions on network latencies (no more than 500ms RTT). These timeouts can be adjusted via the environment variable `COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS`, which now defaults to 10 and will scale the timeouts proportionally.