base: use separate lease duration for liveness range #93073

Conversation

erikgrinaker
Contributor

The first few commits are from #91947; only the last commit should be reviewed here.


This patch adds a separate lease duration for the liveness range. When the liveness range leaseholder is lost, other nodes are unable to heartbeat and extend their leases, which can lead to the loss of most leases in the system. Shortening the liveness lease duration shortens the period of unavailability for other leases.

Release note (ops change): The duration of the liveness range lease has been reduced to 2.5s, half of the regular lease duration. When the liveness range leaseholder is lost (e.g. due to an infrastructure outage), other nodes are unable to heartbeat and thus extend their own leases, which can lead to the loss of many of the leases throughout the cluster. Reducing the liveness lease duration reduces the period of unavailability.
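
To make the mechanism concrete, here's a minimal sketch of the derivation (the type and field names below are illustrative only, not the actual configuration surface):

```go
// Illustrative sketch only: the type and field names are hypothetical,
// not the real cockroach configuration surface. The point is that the
// liveness range gets its own, shorter lease duration derived from the
// regular one.
package main

import (
	"fmt"
	"time"
)

type leaseConfig struct {
	// rangeLeaseDuration is the regular range lease duration.
	rangeLeaseDuration time.Duration
	// livenessLeaseFactor scales the liveness range lease relative to
	// the regular lease (assumed 0.5 here, i.e. 2.5s for a 5.0s lease).
	livenessLeaseFactor float64
}

// livenessLeaseDuration returns the lease duration used for the
// liveness range under the assumptions above.
func (c leaseConfig) livenessLeaseDuration() time.Duration {
	return time.Duration(float64(c.rangeLeaseDuration) * c.livenessLeaseFactor)
}

func main() {
	cfg := leaseConfig{rangeLeaseDuration: 5 * time.Second, livenessLeaseFactor: 0.5}
	fmt.Println(cfg.livenessLeaseDuration()) // 2.5s
}
```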

@erikgrinaker erikgrinaker requested review from tbg and a team December 5, 2022 20:39
@erikgrinaker erikgrinaker self-assigned this Dec 5, 2022
@cockroach-teamcity
Member

This change is Reviewable

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 5, 2022

@tbg I've been playing around with a few different settings here, testing with failover/liveness/crash:

| Lease | Liveness lease | Lost leases | pMax |
|-------|----------------|-------------|------|
| 9.0s  | 9.0s           | 96%         | 9.7s |
| 9.0s  | 3.0s           | 40%         | 2.3s |
| 5.0s  | 5.0s           | 46%         | 7.5s |
| 5.0s  | 2.5s           | 5%          | 2.4s |

(the "Lost leases" metric is unreliable due to the time series sampling resolution)

Even with the "old" leases of 9.0s, I had to drop the liveness lease all the way down to 3.0s to get a significant improvement, and I was still unable to eliminate the effect. It seems like regardless of whether we reduce the standard lease durations in #91947, we'll still have a large blast radius when the liveness range leaseholder is lost, but reducing the liveness lease does improve things in both cases. This will likely be far worse in the case of a network outage, where the network timeouts also come into play.

I'm not really comfortable reducing it this far down -- it's pretty tight, and the write latency of the extension can't exceed 1s before we start having issues due to the stasis period. I think we should instead consider alternative approaches to coalesced lease extensions that don't have single points of failure, as part of the lease protocol improvements we're exploring.
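
For reference, the rough budget arithmetic behind that 1s figure, under assumed numbers (the exact renewal point and stasis window in the real code differ in detail):

```go
// Assumed numbers only: the extension is taken to be proposed with
// roughly half the 2.5s liveness lease remaining, and writes landing
// within the last ~250ms are taken to fall into the stasis window.
package main

import (
	"fmt"
	"time"
)

func main() {
	livenessLease := 2500 * time.Millisecond
	renewalPoint := livenessLease / 2       // extension proposed with ~1.25s remaining (assumed)
	stasisWindow := 250 * time.Millisecond  // unusable tail before expiration (assumed)

	// The extension write must commit before the stasis window starts.
	budget := livenessLease - renewalPoint - stasisWindow
	fmt.Println("extension write budget ≈", budget) // 1s
}
```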

Would be interested to get your take here.

@tbg
Member

tbg commented Dec 6, 2022

Ran a tiny experiment:

diff --git a/pkg/kv/kvserver/replica_proposal.go b/pkg/kv/kvserver/replica_proposal.go
index 1e3fdf6f89..52fcc49f23 100644
--- a/pkg/kv/kvserver/replica_proposal.go
+++ b/pkg/kv/kvserver/replica_proposal.go
@@ -266,6 +266,13 @@ func (r *Replica) leasePostApplyLocked(
 		// Log lease acquisitions loudly when verbose logging is enabled or when the
 		// new leaseholder is draining, in which case it should be shedding leases.
 		// Otherwise, log a trace event.
+		if prevExp, newPropTS := prevLease.GetExpiration(), newLease.ProposedTS.ToTimestamp(); prevExp.IsSet() && prevExp.Less(newPropTS) {
+			log.Warningf(ctx,
+				"new range lease %s following %s (downtime at least %.2fs, previous expired at %s, acquisition started at %s)",
+				newLease, prevLease, time.Duration(newPropTS.WallTime-prevExp.WallTime).Seconds(),
+				time.Unix(0, prevExp.WallTime), time.Unix(0, newPropTS.WallTime))
+		}
+
 		if log.V(1) || r.store.IsDraining() {
 			log.Infof(ctx, "new range lease %s following %s", newLease, prevLease)
 		} else {

Start a local 3-node cluster, wait for things to settle, bring down the leaseholder, and grab the log message.

I221206 13:44:38.077957 319 kv/kvserver/replica_proposal.go:270 ⋮ [n1,s1,r2/1:‹/System/NodeLiveness{-Max}›,raft] 584 new range lease repl=(n1,s1):1 seq=6 start=1670334275.261080000,1 exp=1670334287.067535000,0 pro=1670334278.067535000,0 following repl=(n3,s3):2 seq=5 start=1670334239.716025000,1 exp=1670334275.261080000,0 pro=1670334266.261080000,0 (downtime at least 2.81s, previous expired at 2022-12-06 14:44:35.26108 +0100 CET, acquisition started at 2022-12-06 14:44:38.067535 +0100 CET)

So it took 2.81s post the expiration of the lease until we started proposing the new lease.

I had finished killing the old leaseholder at t0=14:44:28. We had to wait until 14:44:35 (t0+7s) due to the expiration, but since we heartbeat every 4.5s there were two heartbeats waiting to be served at that point in time. And yet, we spent another 2.81s doing nothing.
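
For clarity, here is the timeline implied by the log above, as plain arithmetic over the logged timestamps:

```go
// The timeline from the log above, reconstructed from the logged
// wall-clock times (local CET).
package main

import (
	"fmt"
	"time"
)

func main() {
	parse := func(s string) time.Time {
		t, _ := time.Parse("15:04:05.000", s)
		return t
	}
	killed := parse("14:44:28.000")   // old leaseholder down (t0)
	expired := parse("14:44:35.261")  // previous lease expired
	proposed := parse("14:44:38.067") // new lease acquisition started

	fmt.Println("waiting out the old lease:", expired.Sub(killed))   // ~7.3s
	fmt.Println("idle time past expiration:", proposed.Sub(expired)) // ~2.8s
}
```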

Maybe something in this experiment is broken because I'm not isolating the liveness range (and so there are more leases lost than just liveness). Still, I wonder if we can run with this diff in your liveness-isolated test once or twice to see what outputs we get. Ideally, the number reported by this test for the "min downtime" should be ~0 as we ought to be acquiring the lease non-cooperatively the very instant this is possible. If we find that this is not happening, this is worth looking into.

If this peters out - I agree that we may not be able to fix this problem sufficiently by moving towards lower lease liveness expirations, and I share your concern about being too aggressive in lowering them. It may then make sense to postpone attacking this issue.

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 6, 2022

So it took 2.81s post the expiration of the lease until we started proposing the new lease.

I suspect this is because we have to wait for a heartbeat request to come in, and they're fairly rare at 4.5s. But yeah, I can rerun an experiment with this.

@tbg
Member

tbg commented Dec 6, 2022

I suspect this is because we have to wait for a heartbeat request to come in,

We already waited 7s until that point, and definitely lost the leader during these 7s, so there ought to be 2 of them (one for each remaining node) already queued.

@erikgrinaker
Contributor Author

Yeah, but they have timeouts, so maybe they timed out and then had to wait for the next one to fire. I'll have a closer look.

@tbg tbg removed their request for review December 8, 2022 07:37
@erikgrinaker erikgrinaker force-pushed the reduce-liveness-lease-interval branch from a3925f3 to e2f0d6a on December 9, 2022 14:37
This patch reduces the lease interval from 9.0s to 5.0s and Raft
election timeout from 3.0s to 2.0s, to reduce unavailability following a
leaseholder crash. This implies that the node heartbeat interval is
reduced from 4.5s to 2.5s, the lease acquisition timeout is reduced from
6.0s to 4.0s, and the proposal timeout is reduced from 3.0s-6.0s to
2.0s-4.0s. The gossip sentinel TTL is now tied to the lease expiration
interval rather than the lease renewal interval, and has thus changed
from 4.5s to 5.0s.

There are three involved quantities (a timing sketch follows this list):

* `defaultRaftHeartbeatIntervalTicks`: 5 (1.0s). Not changed, to avoid
  increasing Raft heartbeat costs.

* `defaultRaftElectionTimeoutTicks`: 10 (2.0s). Changed from 15 (3.0s),
  which still leaves ample headroom by the following reasoning:

  * The network round-trip time (RTT) is expected to be below 400ms
    (worst-case GCP inter-region latency is 350ms).

  * TCP prevents "dropped" heartbeats. Linux has an RTT-dependent
    retransmission timeout (RTO) which we can approximate as 1.5x RTT
    (smoothed RTT + 4x RTT variance), with a lower bound of 200ms. The
    worst-case RTO is thus 600ms, with another 0.5x RTT (200ms) for
    the retransmit to reach the peer, leaving 200ms to spare.

  * The actual election timeout per replica is multiplied by a random
    factor of 1-2, so it's likely to be significantly larger than 2.0s
    for a given follower (i.e. TCP connection).

* `defaultRangeLeaseRaftElectionTimeoutMultiplier`: 2.5 (5.0s). Changed
  from 3.0 (9.0s), with the following reasoning:

  * Our target lease interval is 5.0s. Reducing the lease multiplier
    is better than reducing the election timeout further, to leave
    some headroom for Raft elections and related timeouts.

  * However, we generally want Raft elections to complete before the
    lease expires. Since Raft elections have a random 1-2 multiplier,
    2.5 ensures the nominal lease interval (5.0s) is always greater than
    the nominal Raft election timeout (4.0s worst-case), but the
    relative offset of the last Raft heartbeat (up to 1.0s) and lease
    extension (up to 2.5s) may cause lease acquisition to sometimes
    block on Raft elections, skewing the average unavailability upwards
    to roughly 4.0s.
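
The timing sketch referenced above: a back-of-the-envelope check of these relationships. The constants mirror the values in this commit message; the actual derivations in the code are more involved.

```go
// Back-of-the-envelope check of the timing budget described above.
package main

import (
	"fmt"
	"time"
)

func main() {
	const tick = 200 * time.Millisecond

	raftHeartbeat := 5 * tick    // defaultRaftHeartbeatIntervalTicks = 5  -> 1.0s
	electionTimeout := 10 * tick // defaultRaftElectionTimeoutTicks = 10   -> 2.0s
	leaseDuration := time.Duration(2.5 * float64(electionTimeout)) // multiplier 2.5 -> 5.0s

	// Dropped-heartbeat headroom at a 400ms worst-case RTT: the next
	// heartbeat is retransmitted after RTO ≈ 1.5x RTT and needs another
	// 0.5x RTT to arrive, all measured from the last delivered heartbeat,
	// and must fit inside the election timeout.
	rtt := 400 * time.Millisecond
	worstCaseDelivery := raftHeartbeat + 3*rtt/2 + rtt/2 // 1.0s + 0.6s + 0.2s = 1.8s
	fmt.Println("heartbeat headroom:", electionTimeout-worstCaseDelivery) // 200ms

	// Elections are randomized by a 1-2x factor, so the worst-case nominal
	// election takes 2x the election timeout and should stay below the
	// lease duration.
	worstCaseElection := 2 * electionTimeout // 4.0s
	fmt.Println("lease minus worst-case election:", leaseDuration-worstCaseElection) // 1s
}
```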

Roachtest benchmarks of pMax latency during leaseholder failures show a
clear improvement:

| Test                            | `master`  | This PR |
|---------------------------------|-----------|---------|
| `failover/non-system/crash`     | 14.5 s    | 8.3 s   |
| `failover/non-system/blackhole` | 18.3 s    | 14.5 s  |

Alternatively, keeping `defaultRangeLeaseRaftElectionTimeoutMultiplier`
at 3 but reducing `defaultRaftElectionTimeoutTicks` to 8 would give a
lease interval of 4.8s, with slightly lower average unavailability, but
a higher probability of spurious Raft election timeouts, especially with
higher RTTs.

Release note (ops change): The Raft election timeout has been reduced
from 3 seconds to 2 seconds, and the lease interval from 9 seconds to 5
seconds, with a corresponding reduction in the node heartbeat interval
from 4.5 seconds to 2.5 seconds. This reduces the period of
unavailability following leaseholder loss, but places tighter
restrictions on network latencies (no more than 500ms RTT). These
timeouts can be adjusted via the environment variable
`COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS`, which now defaults to 10
and will scale the timeouts proportionally.
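
As a rough illustration of the proportional scaling mentioned in the release note (the environment variable name is quoted from above, but the parsing and derivation below are simplified and not the actual implementation):

```go
// Simplified sketch: the election timeout, lease duration, and node
// heartbeat interval are all derived from the tick count, so
// overriding COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS scales them together.
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

func main() {
	const tick = 200 * time.Millisecond

	ticks := 10 // new default for defaultRaftElectionTimeoutTicks
	if v, err := strconv.Atoi(os.Getenv("COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS")); err == nil && v > 0 {
		ticks = v
	}

	electionTimeout := time.Duration(ticks) * tick
	leaseDuration := time.Duration(2.5 * float64(electionTimeout))
	heartbeatInterval := leaseDuration / 2

	// With the default of 10 ticks: 2s, 5s, 2.5s.
	fmt.Println(electionTimeout, leaseDuration, heartbeatInterval)
}
```
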
This patch adds a separate lease duration for the liveness range. When
the liveness range leaseholder is lost, other nodes are unable to
heartbeat and extend their leases, which can lead to the loss of most
leases in the system. Shortening the liveness lease duration shortens
the period of unavailability for other leases.

Release note (ops change): The duration of the liveness range lease has
been reduced to 2.5s, half of the regular lease duration. When the
liveness range leaseholder is lost (e.g. due to an infrastructure
outage), other nodes are unable to heartbeat and thus extend their own
leases, which can lead to the loss of many of the leases throughout the
cluster. Reducing the liveness lease duration reduces the period of
unavailability.
@erikgrinaker erikgrinaker force-pushed the reduce-liveness-lease-interval branch from e2f0d6a to 6afd5c9 on December 9, 2022 16:19
@erikgrinaker
Contributor Author

erikgrinaker commented Dec 9, 2022

Did some runs with failover/liveness/crash on top of this PR, looks like the delay is caused by the Raft election. Here's one particularly long lease acquisition at 2.9 seconds:

W221209 16:58:09.087724 [n2,s2,r2/3:‹/System/NodeLiveness{-Max}›,raft] new range lease repl=(n2,s2):3 seq=7 start=1670605086.189420895,1 exp=1670605091.586130392,0 pro=1670605089.086130392,0 following repl=(n4,s4):4 seq=6 start=1670605057.668047498,0 exp=1670605086.189420895,0 pro=1670605083.689420895,0 (downtime at least 2.90s, previous expired at 2022-12-09 16:58:06.189420895 +0000 UTC, acquisition started at 2022-12-09 16:58:09.086130392 +0000 UTC)

Notice how the lease expired at 16:58:06, and was reacquired at 16:58:09. I cross-referenced this with liveness heartbeat requests on n2 attempting to acquire leases, and saw a bunch of these continually between 16:58:06 and 16:58:09:

W221209 16:58:08.667060 kv/kvserver/replica_send.go:827 ⋮ [n2,s2,r2/3:‹/System/NodeLiveness{-Max}›] invalid lease for request ConditionalPut [/System/NodeLiveness/1,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/1], [txn: f1ff8912], [can-forward-ts] at 1670605088.664223743,0
W221209 16:58:08.667687 kv/kvserver/replica_send.go:833 ⋮ [n2,s2,r2/3:‹/System/NodeLiveness{-Max}›] lease response: status={<empty> 0,0 0,0 ‹ERROR› ‹› ‹liveness(nid:0 epo:0 exp:0,0)› 0,0} pErr=‹[NotLeaseHolderError] refusing to acquire lease on follower; r2: replica (n2,s2):3 not lease holder; current lease is repl=(n4,s4):4 seq=0 start=0,0 exp=<nil>›

That last error comes from here:

// rejectProposalWithRedirectLocked is part of the proposer interface.
func (rp *replicaProposer) rejectProposalWithRedirectLocked(
	ctx context.Context, prop *ProposalData, redirectTo roachpb.ReplicaID,
) {
	r := (*Replica)(rp)
	rangeDesc := r.descRLocked()
	storeID := r.store.StoreID()
	r.store.metrics.LeaseRequestErrorCount.Inc(1)
	redirectRep, _ /* ok */ := rangeDesc.GetReplicaDescriptorByID(redirectTo)
	log.VEventf(ctx, 2, "redirecting proposal to node %s; request: %s", redirectRep.NodeID, prop.Request)
	rp.rejectProposalWithErrLocked(ctx, prop, roachpb.NewError(
		roachpb.NewNotLeaseHolderErrorWithSpeculativeLease(
			redirectRep,
			storeID,
			rangeDesc,
			"refusing to acquire lease on follower"),
	))
}

That error originates while we're proposing the lease acquisition, which is rejected because we're not the Raft leader:

leaderKnownAndEligible := li.leaderKnown && li.leaderEligibleForLease
ownsCurrentLease := b.p.ownsValidLease(ctx, b.clock.NowAsClockTimestamp())
if leaderKnownAndEligible && !ownsCurrentLease && !b.testing.allowLeaseProposalWhenNotLeader {
	log.VEventf(ctx, 2, "not proposing lease acquisition because we're not the leader; replica %d is",
		li.leader)
	b.p.rejectProposalWithRedirectLocked(ctx, p, li.leader)
	if b.p.shouldCampaignOnRedirect(raftGroup) {
		log.VEventf(ctx, 2, "campaigning because Raft leader not live in node liveness map")
		if err := raftGroup.Campaign(); err != nil {
			log.VEventf(ctx, 1, "failed to campaign: %s", err)
		}
	}
	return true
}

So in order to make this work reliably we'll have to drop the Raft election timeout well below the lease expiration, maybe something like 1s considering it's randomized by a 1-2x factor, and that really isn't going to work very well.
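
Rough arithmetic behind that ~1s estimate, assuming the worst case of the 1-2x randomization has to fit within the 2.5s liveness lease:

```go
// Assumes the worst-case randomized election must complete within the
// 2.5s liveness lease; leaving additional margin for the heartbeat
// offset is roughly where the ~1s figure above comes from.
package main

import (
	"fmt"
	"time"
)

func main() {
	livenessLease := 2500 * time.Millisecond
	worstCaseRandomization := 2.0 // election timeout is randomized by 1-2x

	maxElectionTimeout := time.Duration(float64(livenessLease) / worstCaseRandomization)
	fmt.Println("election timeout upper bound ≈", maxElectionTimeout) // 1.25s
}
```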

I think we'll have to consider other approaches here.
