base: use separate lease duration for liveness range #93073

Conversation

erikgrinaker
Contributor

The first few commits are from #91947; only the last commit should be reviewed here.


This patch adds a separate lease duration for the liveness range. When the liveness range leaseholder is lost, other nodes are unable to heartbeat and extend their leases, which can lead to the loss of most leases in the system. Shortening the liveness lease duration shortens the period of unavailability for other leases.

Release note (ops change): The duration of the liveness range lease has been reduced to 2.5s, half of the regular lease duration. When the liveness range leaseholder is lost (e.g. due to an infrastructure outage), other nodes are unable to heartbeat and thus extend their own leases, which can lead to the loss of many of the leases throughout the cluster. Reducing the liveness lease duration reduces the period of unavailability.
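
To make the mechanism concrete, here's a minimal sketch of the derivation (the type and field names below are illustrative only, not the actual configuration surface):

```go
// Illustrative sketch only: the type and field names are hypothetical,
// not the real cockroach configuration surface. The point is that the
// liveness range gets its own, shorter lease duration derived from the
// regular one.
package main

import (
	"fmt"
	"time"
)

type leaseConfig struct {
	// rangeLeaseDuration is the regular range lease duration.
	rangeLeaseDuration time.Duration
	// livenessLeaseFactor scales the liveness range lease relative to
	// the regular lease (assumed 0.5 here, i.e. 2.5s for a 5.0s lease).
	livenessLeaseFactor float64
}

// livenessLeaseDuration returns the lease duration used for the
// liveness range under the assumptions above.
func (c leaseConfig) livenessLeaseDuration() time.Duration {
	return time.Duration(float64(c.rangeLeaseDuration) * c.livenessLeaseFactor)
}

func main() {
	cfg := leaseConfig{rangeLeaseDuration: 5 * time.Second, livenessLeaseFactor: 0.5}
	fmt.Println(cfg.livenessLeaseDuration()) // 2.5s
}
```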

@erikgrinaker erikgrinaker requested review from tbg and a team December 5, 2022 20:39
@erikgrinaker erikgrinaker self-assigned this Dec 5, 2022
@cockroach-teamcity
Member

This change is Reviewable

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 5, 2022

@tbg I've been playing around with a few different settings here, testing with failover/liveness/crash:

| Lease | Liveness lease | Lost leases | pMax |
|-------|----------------|-------------|------|
| 9.0s  | 9.0s           | 96%         | 9.7s |
| 9.0s  | 3.0s           | 40%         | 2.3s |
| 5.0s  | 5.0s           | 46%         | 7.5s |
| 5.0s  | 2.5s           | 5%          | 2.4s |

(the "Lost leases" metric is unreliable due to the time series sampling resolution)

Even with the "old" leases of 9.0s, I had to drop the liveness lease all the way down to 3.0s to get a significant improvement, and I was still unable to eliminate the effect. It seems like regardless of whether we reduce the standard lease durations in #91947, we'll still have a large blast radius when the liveness range leaseholder is lost, but reducing the liveness lease does improve things in both cases. This will likely be far worse in the case of a network outage, where the network timeouts also come into play.

I'm not really comfortable reducing it this far down -- it's pretty tight, and the write latency of the extension can't exceed 1s before we start having issues due to the stasis period. I think we should instead consider alternative approaches to coalesced lease extensions that don't have single points of failure, as part of the lease protocol improvements we're exploring.
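
For reference, the rough budget arithmetic behind that 1s figure, under assumed numbers (the exact renewal point and stasis window in the real code differ in detail):

```go
// Assumed numbers only: the extension is taken to be proposed with
// roughly half the 2.5s liveness lease remaining, and writes landing
// within the last ~250ms are taken to fall into the stasis window.
package main

import (
	"fmt"
	"time"
)

func main() {
	livenessLease := 2500 * time.Millisecond
	renewalPoint := livenessLease / 2       // extension proposed with ~1.25s remaining (assumed)
	stasisWindow := 250 * time.Millisecond  // unusable tail before expiration (assumed)

	// The extension write must commit before the stasis window starts.
	budget := livenessLease - renewalPoint - stasisWindow
	fmt.Println("extension write budget ≈", budget) // 1s
}
```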

Would be interested to get your take here.

@tbg
Member

tbg commented Dec 6, 2022

Ran a tiny experiment:

diff --git a/pkg/kv/kvserver/replica_proposal.go b/pkg/kv/kvserver/replica_proposal.go
index 1e3fdf6f89..52fcc49f23 100644
--- a/pkg/kv/kvserver/replica_proposal.go
+++ b/pkg/kv/kvserver/replica_proposal.go
@@ -266,6 +266,13 @@ func (r *Replica) leasePostApplyLocked(
 		// Log lease acquisitions loudly when verbose logging is enabled or when the
 		// new leaseholder is draining, in which case it should be shedding leases.
 		// Otherwise, log a trace event.
+		if prevExp, newPropTS := prevLease.GetExpiration(), newLease.ProposedTS.ToTimestamp(); prevExp.IsSet() && prevExp.Less(newPropTS) {
+			log.Warningf(ctx,
+				"new range lease %s following %s (downtime at least %.2fs, previous expired at %s, acquisition started at %s)",
+				newLease, prevLease, time.Duration(newPropTS.WallTime-prevExp.WallTime).Seconds(),
+				time.Unix(0, prevExp.WallTime), time.Unix(0, newPropTS.WallTime))
+		}
+
 		if log.V(1) || r.store.IsDraining() {
 			log.Infof(ctx, "new range lease %s following %s", newLease, prevLease)
 		} else {

Start a local 3-node cluster, wait for things to settle, bring down the leaseholder, and grab the log message.

I221206 13:44:38.077957 319 kv/kvserver/replica_proposal.go:270 ⋮ [n1,s1,r2/1:‹/System/NodeLiveness{-Max}›,raft] 584 new range lease repl=(n1,s1):1 seq=6 start=1670334275.261080000,1 exp=1670334287.067535000,0 pro=1670334278.067535000,0 following repl=(n3,s3):2 seq=5 start=1670334239.716025000,1 exp=1670334275.261080000,0 pro=1670334266.261080000,0 (downtime at least 2.81s, previous expired at 2022-12-06 14:44:35.26108 +0100 CET, acquisition started at 2022-12-06 14:44:38.067535 +0100 CET)

So it took 2.81s post the expiration of the lease until we started proposing the new lease.

I had finished killing the old leaseholder at t0=14:44:28. We had to wait until 14:44:35 (t0+7s) due to the expiration, but since we heartbeat every 4.5s there were two heartbeats waiting to be served at that point in time. And yet, we spent another 2.81s doing nothing.
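
For clarity, here is the timeline implied by the log above, as plain arithmetic over the logged timestamps:

```go
// The timeline from the log above, reconstructed from the logged
// wall-clock times (local CET).
package main

import (
	"fmt"
	"time"
)

func main() {
	parse := func(s string) time.Time {
		t, _ := time.Parse("15:04:05.000", s)
		return t
	}
	killed := parse("14:44:28.000")   // old leaseholder down (t0)
	expired := parse("14:44:35.261")  // previous lease expired
	proposed := parse("14:44:38.067") // new lease acquisition started

	fmt.Println("waiting out the old lease:", expired.Sub(killed))   // ~7.3s
	fmt.Println("idle time past expiration:", proposed.Sub(expired)) // ~2.8s
}
```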

Maybe something in this experiment is broken because I'm not isolating the liveness range (and so there are more leases lost than just liveness). Still, I wonder if we can run with this diff in your liveness-isolated test once or twice to see what outputs we get. Ideally, the number reported by this test for the "min downtime" should be ~0 as we ought to be acquiring the lease non-cooperatively the very instant this is possible. If we find that this is not happening, this is worth looking into.

If this peters out - I agree that we may not be able to fix this problem sufficiently by moving towards lower lease liveness expirations, and I share your concern about being too aggressive in lowering them. It may then make sense to postpone attacking this issue.

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 6, 2022

So it took 2.81s post the expiration of the lease until we started proposing the new lease.

I suspect this is because we have to wait for a heartbeat request to come in, and they're fairly rare at 4.5s. But yeah, I can rerun an experiment with this.

@tbg
Member

tbg commented Dec 6, 2022

I suspect this is because we have to wait for a heartbeat request to come in,

We already waited 7s until that point, and definitely lost the leader during these 7s, so there ought to be 2 of them (one for each remaining node) already queued.

@erikgrinaker
Contributor Author

Yeah, but they have timeouts, so maybe they timed out and then had to wait for the next one to fire. I'll have a closer look.

@tbg tbg removed their request for review December 8, 2022 07:37
@erikgrinaker erikgrinaker force-pushed the reduce-liveness-lease-interval branch from a3925f3 to e2f0d6a on December 9, 2022 14:37
This patch reduces the lease interval from 9.0s to 5.0s and Raft
election timeout from 3.0s to 2.0s, to reduce unavailability following a
leaseholder crash. This implies that the node heartbeat interval is
reduced from 4.5s to 2.5s, the lease acquisition timeout is reduced from
6.0s to 4.0s, and the proposal timeout is reduced from 3.0s-6.0s to
2.0s-4.0s. The gossip sentinel TTL is now tied to the lease expiration
interval rather than the lease renewal interval, and has thus changed
from 4.5s to 5.0s.

There are three involved quantities (a timing sketch follows this list):

* `defaultRaftHeartbeatIntervalTicks`: 5 (1.0s). Not changed, to avoid
  increasing Raft heartbeat costs.

* `defaultRaftElectionTimeoutTicks`: 10 (2.0s). Changed from 15 (3.0s),
  which still leaves ample headroom by the following reasoning:

  * The network round-trip time (RTT) is expected to be below 400ms
    (worst-case GCP inter-region latency is 350ms).

  * TCP prevents "dropped" heartbeats. Linux has an RTT-dependent
    retransmission timeout (RTO) which we can approximate as 1.5x RTT
    (smoothed RTT + 4x RTT variance), with a lower bound of 200ms. The
    worst-case RTO is thus 600ms, with another 0.5x RTT (200ms) for
    the retransmit to reach the peer, leaving 200ms to spare.

  * The actual election timeout per replica is multiplied by a random
    factor of 1-2, so it's likely to be significantly larger than 2.0s
    for a given follower (i.e. TCP connection).

* `defaultRangeLeaseRaftElectionTimeoutMultiplier`: 2.5 (5.0s). Changed
  from 3.0 (9.0s), with the following reasoning:

  * Our target lease interval is 5.0s. Reducing the lease multiplier
    is better than reducing the election timeout further, to leave
    some headroom for Raft elections and related timeouts.

  * However, we generally want Raft elections to complete before the
    lease expires. Since Raft elections have a random 1-2 multiplier,
    2.5 ensures the nominal lease interval (5.0s) is always greater than
    the nominal Raft election timeout (4.0s worst-case), but the
    relative offset of the last Raft heartbeat (up to 1.0s) and lease
    extension (up to 2.5s) may cause lease acquisition to sometimes
    block on Raft elections, skewing the average unavailability upwards
    to roughly 4.0s.
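
The timing sketch referenced above: a back-of-the-envelope check of these relationships. The constants mirror the values in this commit message; the actual derivations in the code are more involved.

```go
// Back-of-the-envelope check of the timing budget described above.
package main

import (
	"fmt"
	"time"
)

func main() {
	const tick = 200 * time.Millisecond

	raftHeartbeat := 5 * tick    // defaultRaftHeartbeatIntervalTicks = 5  -> 1.0s
	electionTimeout := 10 * tick // defaultRaftElectionTimeoutTicks = 10   -> 2.0s
	leaseDuration := time.Duration(2.5 * float64(electionTimeout)) // multiplier 2.5 -> 5.0s

	// Dropped-heartbeat headroom at a 400ms worst-case RTT: the next
	// heartbeat is retransmitted after RTO ≈ 1.5x RTT and needs another
	// 0.5x RTT to arrive, all measured from the last delivered heartbeat,
	// and must fit inside the election timeout.
	rtt := 400 * time.Millisecond
	worstCaseDelivery := raftHeartbeat + 3*rtt/2 + rtt/2 // 1.0s + 0.6s + 0.2s = 1.8s
	fmt.Println("heartbeat headroom:", electionTimeout-worstCaseDelivery) // 200ms

	// Elections are randomized by a 1-2x factor, so the worst-case nominal
	// election takes 2x the election timeout and should stay below the
	// lease duration.
	worstCaseElection := 2 * electionTimeout // 4.0s
	fmt.Println("lease minus worst-case election:", leaseDuration-worstCaseElection) // 1s
}
```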

Roachtest benchmarks of pMax latency during leaseholder failures show a
clear improvement:

| Test                            | `master`  | This PR |
|---------------------------------|-----------|---------|
| `failover/non-system/crash`     | 14.5 s    | 8.3 s   |
| `failover/non-system/blackhole` | 18.3 s    | 14.5 s  |

Alternatively, keeping `defaultRangeLeaseRaftElectionTimeoutMultiplier`
at 3 but reducing `defaultRaftElectionTimeoutTicks` to 8 would give a
lease interval of 4.8s, with slightly lower average unavailability, but
a higher probability of spurious Raft election timeouts, especially with
higher RTTs.

Release note (ops change): The Raft election timeout has been reduced
from 3 seconds to 2 seconds, and the lease interval from 9 seconds to 5
seconds, with a corresponding reduction in the node heartbeat interval
from 4.5 seconds to 2.5 seconds. This reduces the period of
unavailability following leaseholder loss, but places tighter
restrictions on network latencies (no more than 500ms RTT). These
timeouts can be adjusted via the environment variable
`COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS`, which now defaults to 10
and will scale the timeouts proportionally.
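
As a rough illustration of the proportional scaling mentioned in the release note (the environment variable name is quoted from above, but the parsing and derivation below are simplified and not the actual implementation):

```go
// Simplified sketch: the election timeout, lease duration, and node
// heartbeat interval are all derived from the tick count, so
// overriding COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS scales them together.
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

func main() {
	const tick = 200 * time.Millisecond

	ticks := 10 // new default for defaultRaftElectionTimeoutTicks
	if v, err := strconv.Atoi(os.Getenv("COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS")); err == nil && v > 0 {
		ticks = v
	}

	electionTimeout := time.Duration(ticks) * tick
	leaseDuration := time.Duration(2.5 * float64(electionTimeout))
	heartbeatInterval := leaseDuration / 2

	// With the default of 10 ticks: 2s, 5s, 2.5s.
	fmt.Println(electionTimeout, leaseDuration, heartbeatInterval)
}
```
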
This patch adds a separate lease duration for the liveness range. When
the liveness range leaseholder is lost, other nodes are unable to
heartbeat and extend their leases, which can lead to the loss of most
leases in the system. Shortening the liveness lease duration shortens
the period of unavailability for other leases.

Release note (ops change): The duration of the liveness range lease has
been reduced to 2.5s, half of the regular lease duration. When the
liveness range leaseholder is lost (e.g. due to an infrastructure
outage), other nodes are unable to heartbeat and thus extend their own
leases, which can lead to the loss of many of the leases throughout the
cluster. Reducing the liveness lease duration reduces the period of
unavailability.
@erikgrinaker erikgrinaker force-pushed the reduce-liveness-lease-interval branch from e2f0d6a to 6afd5c9 on December 9, 2022 16:19
@erikgrinaker
Contributor Author

erikgrinaker commented Dec 9, 2022

Did some runs with failover/liveness/crash on top of this PR, looks like the delay is caused by the Raft election. Here's one particularly long lease acquisition at 2.9 seconds:

W221209 16:58:09.087724 [n2,s2,r2/3:‹/System/NodeLiveness{-Max}›,raft] new range lease repl=(n2,s2):3 seq=7 start=1670605086.189420895,1 exp=1670605091.586130392,0 pro=1670605089.086130392,0 following repl=(n4,s4):4 seq=6 start=1670605057.668047498,0 exp=1670605086.189420895,0 pro=1670605083.689420895,0 (downtime at least 2.90s, previous expired at 2022-12-09 16:58:06.189420895 +0000 UTC, acquisition started at 2022-12-09 16:58:09.086130392 +0000 UTC)

Notice how the lease expired at 16:58:06, and was reacquired at 16:58:09. I cross-referenced this with liveness heartbeat requests on n2 attempting to acquire leases, and saw a bunch of these continually between 16:58:06 and 16:58:09:

W221209 16:58:08.667060 kv/kvserver/replica_send.go:827 ⋮ [n2,s2,r2/3:‹/System/NodeLiveness{-Max}›] invalid lease for request ConditionalPut [/System/NodeLiveness/1,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/1], [txn: f1ff8912], [can-forward-ts] at 1670605088.664223743,0
W221209 16:58:08.667687 kv/kvserver/replica_send.go:833 ⋮ [n2,s2,r2/3:‹/System/NodeLiveness{-Max}›] lease response: status={<empty> 0,0 0,0 ‹ERROR› ‹› ‹liveness(nid:0 epo:0 exp:0,0)› 0,0} pErr=‹[NotLeaseHolderError] refusing to acquire lease on follower; r2: replica (n2,s2):3 not lease holder; current lease is repl=(n4,s4):4 seq=0 start=0,0 exp=<nil>›

That last error comes from here:

// rejectProposalWithRedirectLocked is part of the proposer interface.
func (rp *replicaProposer) rejectProposalWithRedirectLocked(
	ctx context.Context, prop *ProposalData, redirectTo roachpb.ReplicaID,
) {
	r := (*Replica)(rp)
	rangeDesc := r.descRLocked()
	storeID := r.store.StoreID()
	r.store.metrics.LeaseRequestErrorCount.Inc(1)
	redirectRep, _ /* ok */ := rangeDesc.GetReplicaDescriptorByID(redirectTo)
	log.VEventf(ctx, 2, "redirecting proposal to node %s; request: %s", redirectRep.NodeID, prop.Request)
	rp.rejectProposalWithErrLocked(ctx, prop, roachpb.NewError(
		roachpb.NewNotLeaseHolderErrorWithSpeculativeLease(
			redirectRep,
			storeID,
			rangeDesc,
			"refusing to acquire lease on follower"),
	))
}

That error originates while we're proposing the lease acquisition, which is rejected because we're not the Raft leader:

leaderKnownAndEligible := li.leaderKnown && li.leaderEligibleForLease
ownsCurrentLease := b.p.ownsValidLease(ctx, b.clock.NowAsClockTimestamp())
if leaderKnownAndEligible && !ownsCurrentLease && !b.testing.allowLeaseProposalWhenNotLeader {
	log.VEventf(ctx, 2, "not proposing lease acquisition because we're not the leader; replica %d is",
		li.leader)
	b.p.rejectProposalWithRedirectLocked(ctx, p, li.leader)
	if b.p.shouldCampaignOnRedirect(raftGroup) {
		log.VEventf(ctx, 2, "campaigning because Raft leader not live in node liveness map")
		if err := raftGroup.Campaign(); err != nil {
			log.VEventf(ctx, 1, "failed to campaign: %s", err)
		}
	}
	return true
}

So in order to make this work reliably we'll have to drop the Raft election timeout well below the lease expiration, maybe something like 1s considering it's randomized by a 1-2x factor, and that really isn't going to work very well.
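
Rough arithmetic behind that ~1s estimate, assuming the worst case of the 1-2x randomization has to fit within the 2.5s liveness lease:

```go
// Assumes the worst-case randomized election must complete within the
// 2.5s liveness lease; leaving additional margin for the heartbeat
// offset is roughly where the ~1s figure above comes from.
package main

import (
	"fmt"
	"time"
)

func main() {
	livenessLease := 2500 * time.Millisecond
	worstCaseRandomization := 2.0 // election timeout is randomized by 1-2x

	maxElectionTimeout := time.Duration(float64(livenessLease) / worstCaseRandomization)
	fmt.Println("election timeout upper bound ≈", maxElectionTimeout) // 1.25s
}
```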

I think we'll have to consider other approaches here.
