storage: must never reuse state from old replicaIDs #40367

Closed
tbg opened this issue Aug 30, 2019 · 4 comments · Fixed by #40892
Labels: A-kv-distribution (Relating to rebalancing and leasing.) · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)

Comments

tbg (Member) commented Aug 30, 2019

I was looking at the range merge tests, in particular
TestStoreRangeMergeAddReplicaRace (at its time of introduction; it has since
changed due to learners and lost a lot of its context), see

ce93650

The worry is roughly that when a snapshot that does not contain a merge is
sitting on some store (which is not part of the range) and this store is then
added to the range, the replica will likely blow up while applying the merge
trigger, as the subsumed replica is very unlikely to be present - this colocation
is only achieved for the duration of the merge txn and only on the stores that
are members of the range.

At the time of the comment below, the way such a situation would come about was
that preemptive snapshots were sent before starting the replica change txn.
So you'd send a snapshot, then a merge would run, and (before we introduced the
range descriptor generation for exactly that purpose) the upreplication could
still succeed - boom.

However, this sequence of events might unfold just the same without any preemptive snaps:

  1. add replica on store
  2. remove replica from store; replicaGC is delayed
  3. run a merge
  4. readd store

Here, in 4 I don't mean "re-add as a learner and promote to voter"! It's enough to
re-add it as a learner. Raft will start catching up the replica from its old
snapshot, and boom, we hit the problem above.

This actually seems like a problem likely to occur in practice if replicas are
removed and re-added frequently while merges occur, though the fact that we still
send explicit learner snapshots could mask it somewhat (that snapshot will postdate
the merge, so it can hide the problem by fast-forwarding the old replica past the
merge).

Fixing this is probably not easy. We want to "not reuse old snapshots", i.e. when the
replicaID changes we want to delete all our state and start anew. This is far from a
trivial change, but it's in the spirit of changes that we want to make anyway to lower
the complexity in the storage package. Anyway the first step here is reproducing the
problem.
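
To make that concrete, here is a minimal sketch of the intended behavior - "when the replicaID changes, delete all state and start anew" - using simplified, hypothetical stand-in types rather than the actual storage-package code:

    package main

    import "fmt"

    // Hypothetical, simplified stand-ins for the real storage types.
    type Replica struct {
        rangeID   int64
        replicaID int64 // 0 means "not yet known"
    }

    type Store struct {
        replicas map[int64]*Replica
    }

    // setReplicaID sketches the intended behavior: instead of mutating the
    // existing Replica in place when its ID changes, discard all of its state
    // and start from scratch under the new ID.
    func (s *Store) setReplicaID(rangeID, newReplicaID int64) *Replica {
        if old, ok := s.replicas[rangeID]; ok && old.replicaID != 0 && old.replicaID != newReplicaID {
            // The store previously held this range under an older replica ID.
            // None of that state (raft log, applied state, old snapshot data)
            // may be reused; drop it entirely before re-creating the replica.
            delete(s.replicas, rangeID)
            fmt.Printf("r%d: discarded state held under replicaID %d\n", rangeID, old.replicaID)
        }
        repl := &Replica{rangeID: rangeID, replicaID: newReplicaID}
        s.replicas[rangeID] = repl
        return repl
    }

    func main() {
        s := &Store{replicas: map[int64]*Replica{}}
        s.setReplicaID(21, 3) // added as replica 3
        s.setReplicaID(21, 5) // re-added later as replica 5: no state carries over
    }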

cc @danhhz @nvanbenschoten

// TestStoreRangeMergeAddReplicaRace verifies that an add replica request that
// occurs concurrently with a merge is aborted.
//
// To see why aborting the add replica request is necessary, consider two
// adjacent and collocated ranges, Q and R. Say the replicate queue decides to
// rebalance Q onto store S4. It will initiate a ChangeReplicas command that
// will send S4 a preemptive snapshot, then launch a replica-change transaction
// to update R's range descriptor with the new replica. Now say the merge queue
// decides to merge Q and R after the preemptive snapshot of Q has been sent to
// S4 but before the replica-change transaction has started. The merge can
// succeed because the ranges are still collocated. (The new replica of Q is
// only considered added once the replica-change transaction commits.) If the
// replica-change transaction were to commit, the new replica of Q on S4 would
// have a snapshot of Q that predated the merge. In order to catch up, it would
// need to apply the merge trigger, but the merge trigger will explode because
// S4 does not have a replica of R.
//
// To avoid this scenario, ChangeReplicas commands will abort if they discover
// the range descriptor has changed between when the snapshot is sent and when
// the replica-change transaction starts.
//
// There is a particularly diabolical edge case here. Consider the same
// situation as above, except that Q and R merge together and then split at
// exactly the same key, all before the replica-change transaction starts. Q's
// range descriptor will have the same start key, end key, and next replica ID
// that it did when the preemptive snapshot started. That is, it will look
// unchanged! To protect against this, range descriptors contain a generation
// counter, which is incremented on every split or merge. The presence of this
// counter means that ChangeReplicas commands can detect and abort if any merges
// have occurred since the preemptive snapshot, even if the sequence of splits
// or merges left the keyspan of the range unchanged. This diabolical edge case
// is tested here.
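
For illustration, a minimal sketch of the generation check described in the comment above, with pared-down, hypothetical types rather than the real ChangeReplicas code:

    package main

    import (
        "errors"
        "fmt"
    )

    // RangeDescriptor is a pared-down stand-in for the real descriptor.
    type RangeDescriptor struct {
        StartKey, EndKey string
        Generation       int64 // incremented on every split or merge
    }

    var errDescriptorChanged = errors.New("descriptor changed since the snapshot was sent")

    // checkDescriptorUnchanged is the kind of check the replica-change
    // transaction performs: comparing keys alone would miss a merge followed
    // by a split at the same key, but the generation counter still catches it.
    func checkDescriptorUnchanged(atSnapshot, current RangeDescriptor) error {
        if atSnapshot.StartKey != current.StartKey ||
            atSnapshot.EndKey != current.EndKey ||
            atSnapshot.Generation != current.Generation {
            return errDescriptorChanged
        }
        return nil
    }

    func main() {
        before := RangeDescriptor{StartKey: "a", EndKey: "c", Generation: 4}
        // Q and R merged and then re-split at the same key: the keyspan looks
        // unchanged, but the generation moved, so the change is detected.
        after := RangeDescriptor{StartKey: "a", EndKey: "c", Generation: 6}
        fmt.Println(checkDescriptorUnchanged(before, after))
    }
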
tbg added the C-bug and A-kv-distribution labels Aug 30, 2019
tbg self-assigned this Aug 30, 2019
tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2019
ChangeReplicas (and AdminSplit, and AdminMerge) take a RangeDescriptor
that they use as a basis for a CPut to make sure the operations mutating
the range serialize. This is great for correctness but generally
unfortunate for usability since on a mismatch, the caller usually wanted
to do the thing they were trying to do anyway, using the new descriptor.
The fact that every split (replication change, merge) basically needs a
retry loop is a constant trickle of test flakes and UX papercuts.

It became more pressing to do something about this as we are routinely
using joint configs when atomic replication changes are enabled. A joint
configuration is transitioned out of opportunistically whenever it is
encountered, but this frequently causes a race in which actor A finds
a joint config and begins a transaction to leave it, only to be raced by
actor B getting there first. The end result is that what actor A wanted to
achieve has been achieved, though by someone else, yet actor A still gets
a spurious error.

This commit fixes that behavior in the targeted case of wanting to leave
a joint configuration: actor A will get a successful result.

Before this change,

make stress PKG=./pkg/sql TESTS=TestShowRangesWithLocal

would fail immediately when `kv.atomic_replication_changes.enabled` was
true because the splits this test carries out would run into the joint
configuration changes of the initial upreplication, and would race the
replicate queue to transition out of them, which at least one split
would typically lose. This still happens, but now it's not an error any
more.

I do think that it makes sense to use a similar strategy in general
(fail replication changes only if the *replicas* changed, allow all
splits except when the split key moves out of the current descriptor,
etc) but in the process of coding this up I got reminded of all of
the problems relating to range merges and also found what I think is
a long-standing pretty fatal bug, cockroachdb#40367, so I don't want to do anything
until the release is out of the door. But I'm basically convinced that
if we did it, it wouldn't cause a new "bug" because any replication
change carried out in that way is just one that could be triggered
just the same by a user under the old checks.

Release note: None
tbg changed the title from "storage: must never reuse snapshots from old replicaIDs" to "storage: must never reuse state from old replicaIDs" Sep 6, 2019
tbg (Member, Author) commented Sep 6, 2019

Another aspect of this came up in https://github.com/cockroachdb/cockroach/pull/39796/files (see the commit message of the testing commit). If a peer is removed, it seems possible that it would be caught up past the replication change that removes it. At the time of writing, this happens routinely when peers are removed and re-added without undergoing replicaGC in the meantime, but it's always an awkward situation to handle. With #40459 it will become the exception, so we want to make sure that it's not possible at all.

@ajwerner this is closely related to #40459. There, I think your latest approach drops incoming raft messages until the state has been GC'ed. What we want to do here is basically that when a replica sees itself get removed in a ChangeReplicasTrigger log entry, it needs to replicaGC itself. There isn't really anything to "drop", so I don't think we can use the strategy of "stalling" until the replicaGCQueue comes along. We sort of need the replica to just stop existing in-place, I think. Curious if you had any thoughts on this.

tbg added a commit to tbg/cockroach that referenced this issue Sep 6, 2019
This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571.
The descriptor lookup I added sometimes fails because replicas can
process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

There is one more bug here that is not addressed by this change,
namely the fact that there might be a tombstone for the right
hand side. This continues to be tracked in cockroachdb#40470.

See also
cockroachdb#40367 (comment)
for more work we need to do to establish sanity.

Release note: None
ajwerner (Contributor) commented Sep 6, 2019

We sort of need the replica to just stop existing in-place, I think. Curious if you had any thoughts on this.

Just to check my understanding, the issue is that we need this replica to stop applying raft entries as soon as it notices that it has been removed.

func (r *Replica) handleChangeReplicasResult(ctx context.Context, chng *storagepb.ChangeReplicas) {
    storeID := r.store.StoreID()
    var found bool
    for _, rDesc := range chng.Replicas() {
        if rDesc.StoreID == storeID {
            found = true
            break
        }
    }
    if !found {
        // This wants to run as late as possible, maximizing the chances
        // that the other nodes have finished this command as well (since
        // processing the removal from the queue looks up the Range at the
        // lease holder, being too early here turns this into a no-op).
        r.store.replicaGCQueue.AddAsync(ctx, r, replicaGCPriorityRemoved)
    }
}

It seems like if the replica were to set its destroyStatus and then find a way to bail out of entry application at that point then we'd be okay waiting for replicaGC.

Additionally we should probably ignore ticks for destroyed replicas.
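
Roughly, the idea would look something like the following sketch (simplified, hypothetical names; the real apply and tick paths are far more involved): check the destroy status before applying an entry or processing a tick, and bail out once the replica knows it has been removed.

    package main

    import "fmt"

    // Simplified stand-ins; the real destroyStatus carries more information.
    type destroyReason int

    const (
        destroyReasonAlive destroyReason = iota
        destroyReasonRemovalPending
    )

    type Replica struct {
        destroyReason destroyReason
    }

    // applyEntry bails out instead of applying once the replica has seen its
    // own removal; any remaining cleanup is left to replicaGC.
    func (r *Replica) applyEntry(entry string) error {
        if r.destroyReason != destroyReasonAlive {
            return fmt.Errorf("replica removed; refusing to apply %q", entry)
        }
        // ... apply the entry normally ...
        return nil
    }

    // tick is a no-op for destroyed replicas, so they stop campaigning and
    // generating raft traffic while waiting to be garbage collected.
    func (r *Replica) tick() bool {
        if r.destroyReason != destroyReasonAlive {
            return false
        }
        // ... normal tick processing ...
        return true
    }

    func main() {
        r := &Replica{destroyReason: destroyReasonRemovalPending}
        fmt.Println(r.applyEntry("some entry"))
        fmt.Println(r.tick())
    }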

Another concern I have is the tight loop spinning while waiting for replicaGC in tryGetOrCreateReplica():

} else if ds := repl.mu.destroyStatus; ds.reason == destroyReasonRemoved {

ajwerner (Contributor) commented Sep 6, 2019

It seems like we're ultimately trying to eliminate setReplicaID(). Is there any reason why we'd want that method to continue to exist once we have learners and prevent processing entries across a remove and re-add?

tbg (Member, Author) commented Sep 6, 2019

Just to check my understanding,

Yep, exactly. We don't want it to advance along the log past its removal. In fact, we'd do well to not even process the removal itself (or it will be harder to manage the case in which a node restarts immediately afterwards).

It seems like if the replica were to set its destroyStatus and then find a way to bail out of entry application at that point then we'd be okay waiting for replicaGC.

Yeah, that's the part I wasn't sure we could achieve reasonably well. It seems conceptually OK: propagate a "stop" flag when we see a raft command that would remove the replica from the desc, handle that in the right place to set the destroy status, and make it such that replicas which are destroyed never handle readies.
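
A rough sketch of that shape (again with made-up, simplified names rather than the real raft handling code): the apply loop reports the removal as a stop condition, and the ready loop refuses to do any further work for a destroyed replica.

    package main

    import "fmt"

    // Simplified stand-ins for the real Replica and raft command plumbing.
    type Replica struct {
        replicaID int64
        destroyed bool // stand-in for destroyStatus
    }

    type command struct {
        removedReplicaIDs []int64 // populated by a ChangeReplicasTrigger
    }

    // applyCommands returns a "stop" flag as soon as it applies a command that
    // removes this replica from the descriptor; nothing after that command runs.
    func (r *Replica) applyCommands(cmds []command) (stop bool) {
        for _, c := range cmds {
            for _, id := range c.removedReplicaIDs {
                if id == r.replicaID {
                    r.destroyed = true
                    return true
                }
            }
            // ... apply the command normally ...
        }
        return false
    }

    // handleRaftReady is guarded so that destroyed replicas never process readies.
    func (r *Replica) handleRaftReady(cmds []command) {
        if r.destroyed {
            return
        }
        if stop := r.applyCommands(cmds); stop {
            fmt.Printf("replica %d saw its own removal; stopping in place\n", r.replicaID)
            // ... hand off to synchronous removal / replicaGC here ...
        }
    }

    func main() {
        r := &Replica{replicaID: 3}
        r.handleRaftReady([]command{{removedReplicaIDs: []int64{3}}})
        r.handleRaftReady(nil) // no-op: the replica no longer handles readies
    }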

Also cc @danhhz because we spent many a merry 1:1 discussing how good it would be if we had this invariant about replicas always being in their range descriptor, and now it seems that we might be getting it earlier than we thought possible.

ajwerner self-assigned this and unassigned tbg Sep 12, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 13, 2019
WARNING: this change needs more testing to target the changes it makes
and at least some of the disabled tests should be reworked. This is a big
and scary change at this point in the cycle so I'm getting it out before
I'm really happy with it. There are some known TODOs.

On the plus side, it has not reproduced any crashes in hours of running
`partitionccl.TestRepartitioning`, which readily produces crashes on master
when run under roachprod stress within ~20 minutes.

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.
 * The queues are now replica ID aware. If a replica was added to the queue
   and the replica found when trying to pop are not the same and we knew the
   replica ID of the replica when we added it, then we should not process it.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID it never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID it never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corollary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by an RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
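
To illustrate the second bullet in the commit message above ("if a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely"), here is a minimal sketch with simplified, hypothetical types; the real change also wipes the on-disk state:

    package main

    import "fmt"

    // Simplified stand-ins for the real Store/Replica types.
    type Replica struct {
        rangeID   int64
        replicaID int64
    }

    type Store struct {
        replicas map[int64]*Replica
    }

    // handleIncoming sketches the rule from the commit message: a message
    // addressed to a higher replica ID proves the local replica was removed
    // and re-added, so the stale replica is dropped entirely rather than
    // reused under the new ID.
    func (s *Store) handleIncoming(rangeID, toReplicaID int64) *Replica {
        if repl, ok := s.replicas[rangeID]; ok && repl.replicaID < toReplicaID {
            delete(s.replicas, rangeID)
            fmt.Printf("r%d: removed stale replica %d in favor of %d\n",
                rangeID, repl.replicaID, toReplicaID)
        }
        if _, ok := s.replicas[rangeID]; !ok {
            s.replicas[rangeID] = &Replica{rangeID: rangeID, replicaID: toReplicaID}
        }
        return s.replicas[rangeID]
    }

    func main() {
        s := &Store{replicas: map[int64]*Replica{21: {rangeID: 21, replicaID: 3}}}
        s.handleIncoming(21, 5) // a message for replica 5 arrives: replica 3 must go
    }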
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
craig bot pushed a commit that referenced this issue Sep 24, 2019
40892: storage: ensure Replica objects never change replicaID r=ajwerner a=ajwerner

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios, this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID, it never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes #40367.

The first commit is #40889.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
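The destroyStatus part of the PR can be pictured the same way. The sketch below uses made-up names (destroyReason, markRemovalPending) rather than the real Replica.mu.destroyStatus API; it only shows why flipping a replica to a removal-pending state keeps at most one replica per range on a store reporting IsAlive().

```go
// Illustrative sketch with hypothetical names; not the actual implementation.
package main

import "fmt"

type destroyReason int

const (
	destroyReasonAlive          destroyReason = iota // normal, usable replica
	destroyReasonRemovalPending                      // removal decided, not yet applied
	destroyReasonRemoved                             // state deleted from the store
)

type replica struct {
	rangeID   int64
	replicaID int64
	status    destroyReason
}

// IsAlive reports whether the replica may still serve requests and apply
// commands; anything pending or past removal is treated as dead.
func (r *replica) IsAlive() bool { return r.status == destroyReasonAlive }

// markRemovalPending is what applying a ChangeReplicasTrigger that removes
// this replica would do first: flip the status so no new work starts, then
// (synchronously) delete the replica's state.
func (r *replica) markRemovalPending() { r.status = destroyReasonRemovalPending }

func main() {
	r := &replica{rangeID: 1, replicaID: 3}
	fmt.Println("alive before removal:", r.IsAlive()) // true
	r.markRemovalPending()
	fmt.Println("alive while removal pending:", r.IsAlive()) // false
	r.status = destroyReasonRemoved
	fmt.Println("alive after removal:", r.IsAlive()) // false
}
```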
@craig craig bot closed this as completed in 2020115 Sep 24, 2019