storage: *roachpb.RaftGroupDeletedError for RHS in splitPostApply #21146

Closed
tbg opened this issue Jan 1, 2018 · 15 comments · Fixed by #39571
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

@tbg
Member

tbg commented Jan 1, 2018

func splitPostApply(
	ctx context.Context, deltaMS enginepb.MVCCStats, split *roachpb.SplitTrigger, r *Replica,
) {
	// The right hand side of the split was already created (and its raftMu
	// acquired) in Replica.acquireSplitLock. It must be present here.
	rightRng, err := r.store.GetReplica(split.RightDesc.RangeID)
	if err != nil {
		log.Fatalf(ctx, "unable to find RHS replica: %s", err)
	}
	{
		rightRng.mu.Lock()
		// Already holding raftMu, see above.
		err := rightRng.initRaftMuLockedReplicaMuLocked(&split.RightDesc, r.store.Clock(), 0)
		rightRng.mu.Unlock()
		if err != nil {
			log.Fatal(ctx, err) // <-- the fatal error is raised here
		}
	}

https://sentry.io/cockroach-labs/cockroachdb/issues/426530382/

*log.safeError: store.go:1820: *roachpb.RaftGroupDeletedError
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go", line 718, in handleReplicatedEvalResult
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go", line 981, in handleEvalResultRaftMuLocked
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 4480, in processRaftCommand
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3481, in handleRaftReadyRaftMuLocked
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3173, in handleRaftReady
...
(4 additional frame(s) were not displayed)

store.go:1820: *roachpb.RaftGroupDeletedError
@tbg tbg self-assigned this Jan 1, 2018
@tbg
Member Author

tbg commented Jan 2, 2018

The scenario here should be something like

  • range splits, with one follower lagging behind
  • the RHS of the split decides to drop the lagging replica from the replica set
  • when the split executes on the (yet unsplit) range, it blows up like seen above.

In the error, the split lock has previously been acquired. This implies that getOrCreateReplica was previously called, successfully. In tracing the code, I noticed that we pass a replicaID of zero into getOrCreateReplica while acquiring the split lock. This is a problem, as this is roughly the underlying code for that call:

func (s *Store) tryGetOrCreateReplica(
	ctx context.Context,
	rangeID roachpb.RangeID,
	replicaID roachpb.ReplicaID,
	creatingReplica *roachpb.ReplicaDescriptor,
) (_ *Replica, created bool, _ error) {
	// The common case: look up an existing (initialized) replica.
	if value, ok := s.mu.replicas.Load(int64(rangeID)); ok {
		// ...
		repl.setReplicaIDRaftMuLockedMuLocked(replicaID)
		// ...
		return
	}

	// No replica currently exists, so we'll try to create one. Before creating
	// the replica, see if there is a tombstone which would indicate that this is
	// a stale message.
	tombstoneKey := keys.RaftTombstoneKey(rangeID)
	var tombstone roachpb.RaftTombstone
	if ok, err := engine.MVCCGetProto(
		ctx, s.Engine(), tombstoneKey, hlc.Timestamp{}, true, nil, &tombstone,
	); err != nil {
		return nil, false, err
	} else if ok {
		if replicaID != 0 && replicaID < tombstone.NextReplicaID {
			return nil, false, &roachpb.RaftGroupDeletedError{}
		}
	}

so we won't be checking tombstones in that case (fourth line from the bottom). Similarly repl.setReplicaIDRaftMuLockedMuLocked(replicaID) also ignores the tombstone.

So it's possible to acquire the splitLock for a replica that is in fact absent and can't be recreated with the split descriptor's replicaID. This is because we're abusing the preemptive snapshot path.
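
To make the gap concrete, here is a minimal, self-contained sketch of the tombstone condition in isolation. `checkTombstone` and `errRaftGroupDeleted` are hypothetical stand-ins invented for illustration, not the actual store code; only the condition itself mirrors the excerpt above.

```go
package main

import (
	"errors"
	"fmt"
)

// errRaftGroupDeleted stands in for *roachpb.RaftGroupDeletedError.
var errRaftGroupDeleted = errors.New("raft group deleted")

// checkTombstone (hypothetical helper) applies the same condition as the
// excerpt: a replicaID of zero skips the comparison entirely.
func checkTombstone(replicaID, tombstoneNextReplicaID int) error {
	if replicaID != 0 && replicaID < tombstoneNextReplicaID {
		return errRaftGroupDeleted
	}
	return nil
}

func main() {
	const next = 6 // assumed tombstone value: only replica IDs >= 6 are allowed

	fmt.Println(checkTombstone(3, next)) // stale ID is rejected
	fmt.Println(checkTombstone(0, next)) // zero slips through unchecked
	fmt.Println(checkTombstone(6, next)) // fresh ID is accepted
}
```

Passing zero therefore tells us nothing about whether the ID in the split descriptor would have been accepted.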

Then, in splitPostApply, we call initRaftMuLockedReplicaMuLocked (with a zero replicaID too, but this time, this is the contract of that method when initializing a replica, which is what we do in that case), and it does the proper tombstone checks, and blows up because it looks up the proper replicaID from the descriptor, and calls setReplicaIDRaftMuLockedMuLocked with it:

if replicaID < r.mu.minReplicaID {
	return &roachpb.RaftGroupDeletedError{}
}

I think the right thing to do is to pass the proper replicaID when acquiring the split lock. If that succeeds, we're guaranteed that splitPostApply won't run into the same problem as the tombstone check has been passed before, and we've been holding raftMu continuously.

That shifts the possible occurrence of the error to the acquisition of the split lock. What to do in that case? If we get *RaftGroupDeletedError, and we've passed the correct replicaID, the split is moot. We should "apply" it successfully, but not actually create the RHS. Instead, we shorten only the LHS. This is going to be a little annoying to plumb, but shouldn't be intrinsically complicated.

We could also leave the error to bubble up at the end, but interpret it in splitPostApply. This might actually be easier.
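
A rough sketch of the first option, under heavy simplification: the split lock is acquired with the real replica ID, and a deletion error is interpreted as "the split is moot for the RHS". All names here (`acquireSplitLock`, `applySplit`, `splitOutcome`) are hypothetical and only model the control flow, not the real locking or storage code.

```go
package main

import (
	"errors"
	"fmt"
)

// errRaftGroupDeleted stands in for *roachpb.RaftGroupDeletedError.
var errRaftGroupDeleted = errors.New("raft group deleted")

// splitOutcome records what applying the split did on this store.
type splitOutcome struct {
	lhsShortened bool
	rhsCreated   bool
}

// acquireSplitLock (hypothetical) checks the RHS replica ID from the split
// descriptor against the lowest replica ID still allowed on this store.
func acquireSplitLock(rhsReplicaID, minReplicaID int) error {
	if rhsReplicaID < minReplicaID {
		return errRaftGroupDeleted
	}
	return nil
}

// applySplit treats a deleted RHS as a moot split: the LHS is still
// shortened, but no RHS replica is created.
func applySplit(rhsReplicaID, minReplicaID int) (splitOutcome, error) {
	err := acquireSplitLock(rhsReplicaID, minReplicaID)
	switch {
	case err == nil:
		return splitOutcome{lhsShortened: true, rhsCreated: true}, nil
	case errors.Is(err, errRaftGroupDeleted):
		return splitOutcome{lhsShortened: true, rhsCreated: false}, nil
	default:
		return splitOutcome{}, err
	}
}

func main() {
	fmt.Println(applySplit(3, 0)) // RHS replica ID is still acceptable
	fmt.Println(applySplit(3, 6)) // RHS was removed in the meantime
}
```

The sketch says nothing about what happens to the RHS data that remains on disk once the LHS is shortened.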

@bdarnell I won't get to this any time soon, but it does seem important to fix.

@tbg tbg added high priority C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Jan 2, 2018
@bdarnell
Contributor

bdarnell commented Jan 4, 2018

We should "apply" it successfully, but not actually create the RHS. Instead, we shorten only the LHS. This is going to be a little annoying to plumb, but shouldn't be intrinsically complicated.

The complicated part of this is going to be in shortening the LHS. If we can't create the RHS replica, we can't send it through the replica GC queue. We'd need a new code path to destroy the on-disk data that's left over here.

We could also leave the error to bubble up at the end, but interpret it in splitPostApply. This might actually be easier.

Yeah, all that plumbing may not be necessary.

@tbg
Member Author

tbg commented Jan 5, 2018

The complicated part of this is going to be in shortening the LHS. If we can't create the RHS replica, we can't send it through the replica GC queue. We'd need a new code path to destroy the on-disk data that's left over here.

In this special case, since we know that the on-disk data is there, can we force-create the replica with its old replicaID (i.e., allow a recreation), and then let the replica GC queue destroy it?

But, looking into this further, it seems that the existence of the tombstone in itself is a bug, this is from removeReplicaImpl:

	if placeholder := s.getOverlappingKeyRangeLocked(desc); placeholder != rep {
		// This is a fatal error because uninitialized replicas shouldn't make it
		// this far. This method will need some changes when we introduce GC of
		// uninitialized replicas.
		s.mu.Unlock()
		log.Fatalf(ctx, "replica %+v unexpectedly overlapped by %+v", rep, placeholder)
	}

(And that Fatal should fire, since placeholder will never be uninitialized, and, because we couldn't insert a placeholder for the RHS (it overlaps the LHS), whatever is returned can't be rep; I think it'll be the LHS replica.)

And it seems like we actually never queue uninitialized replicas from Raft handling:

func (s *Store) HandleRaftResponse(ctx context.Context, resp *RaftMessageResponse) error {
	ctx = s.AnnotateCtx(ctx)
	repl, replErr := s.GetReplica(resp.RangeID)
	if replErr == nil {
		// Best-effort context annotation of replica.
		ctx = repl.AnnotateCtx(ctx)
	}
	switch val := resp.Union.GetValue().(type) {
	case *roachpb.Error:
		switch tErr := val.GetDetail().(type) {
		case *roachpb.ReplicaTooOldError:
			if replErr != nil {
				// RangeNotFoundErrors are expected here; nothing else is.
				if _, ok := replErr.(*roachpb.RangeNotFoundError); !ok {
					log.Error(ctx, replErr)
				}
				return nil
			}
			// ... now actually replicaGCQueue.Add()

Perhaps one of the other callers to Add is less careful, for example this one:

func (s *Store) canApplySnapshotLocked(
	ctx context.Context, rangeDescriptor *roachpb.RangeDescriptor,
) (*ReplicaPlaceholder, error) {
	if v, ok := s.mu.replicas.Load(int64(rangeDescriptor.RangeID)); ok &&
		(*Replica)(v).IsInitialized() {
		// We have the range and it's initialized, so let the snapshot through.
		return nil, nil
	}

	// We don't have the range (or we have an uninitialized
	// placeholder). Will we be able to create/initialize it?
	if exRng, ok := s.mu.replicaPlaceholders[rangeDescriptor.RangeID]; ok {
		return nil, errors.Errorf("%s: canApplySnapshotLocked: cannot add placeholder, have an existing placeholder %s", s, exRng)
	}

	if exRange := s.getOverlappingKeyRangeLocked(rangeDescriptor); exRange != nil {
		// We have a conflicting range, so we must block the snapshot.
		// When such a conflict exists, it will be resolved by one range
		// either being split or garbage collected.
		exReplica, err := s.GetReplica(exRange.Desc().RangeID)
		msg := IntersectingSnapshotMsg
		if err != nil {
			log.Warning(ctx, errors.Wrapf(
				err, "unable to look up overlapping replica on %s", exReplica))
		} else {
			inactive := func(r *Replica) bool {
				if r.RaftStatus() == nil {
					return true
				}
				lease, pendingLease := r.GetLease()
				now := s.Clock().Now()
				return !r.IsLeaseValid(lease, now) &&
					(pendingLease == nil || !r.IsLeaseValid(*pendingLease, now))
			}

			// If the existing range shows no signs of recent activity, give it a GC
			// run.
			if inactive(exReplica) {
				if _, err := s.replicaGCQueue.Add(exReplica, replicaGCPriorityCandidate); err != nil {
					log.Errorf(ctx, "%s: unable to add replica to GC queue: %s", exReplica, err)
				} else {
					msg += "; initiated GC:"
				}
			}

I don't see how an uninitialized replica would make it into the above code path, because it shouldn't ever overlap an incoming snapshot (it's not in replicasByKey as it doesn't have a descriptor yet). But somehow a tombstone must have been written.

There are two more callers to .Add, one GC's existing on-disk replicas on startup (which I think I can't blame here) and one in which a replica queues itself when it applies a replica change that removes it. I can't blame that one either as the uninitialized replica won't apply anything.

I'm left wondering how we wrote a tombstone. Note that we don't actually have to "write a tombstone" for the original error to pop up. We need minReplicaID to be bumped. This happens in ~5 locations, but I looked at them one by one and they seem to be guarded appropriately.

@bdarnell
Contributor

bdarnell commented Jan 8, 2018

In this special case, since we know that the on-disk data is there, can we force-create the replica with its old replicaID (i.e., allow a recreation), and then let the replica GC queue destroy it?

Yes, we probably could, but that's part of the complexity I was thinking about. Do we just special-case it in memory so we have a Replica object that is forbidden by the on-disk tombstone, or do we delete the tombstone as a part of processing the split so we can recreate the replica (and then GC it and recreate the tombstone)? (Now that I've written this out, the latter sounds like a terrible idea)

But hopefully we can forget about that and fix this by preventing the tombstone from being written in the first place.

Even if canApplySnapshotLocked calls replicaGCQueue.Add with an uninitialized replica, baseQueue.addInternal does another IsInitialized check, so we should never get into a queue.process with an uninitialized replica. Is there any other way for a tombstone to be written? The only ones I see are the replicaGCQueue and MergeRange (I guess it's possible that someone is playing with merging and triggered a sentry report, but it seems unlikely)
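
As a toy illustration of that second line of defense, the sketch below models a queue whose add step rejects uninitialized replicas, so the processing step can never see one. `gcQueue`, `replica`, and the error text are hypothetical simplifications, not the real `baseQueue` API.

```go
package main

import (
	"errors"
	"fmt"
)

// replica is a toy stand-in; only the initialized flag matters here.
type replica struct {
	rangeID     int
	initialized bool
}

// gcQueue is a hypothetical, heavily simplified queue.
type gcQueue struct {
	pending []int // range IDs waiting to be processed
}

// add refuses uninitialized replicas, so nothing uninitialized is ever
// handed to the processing step.
func (q *gcQueue) add(r *replica) (bool, error) {
	if !r.initialized {
		return false, errors.New("replica not initialized")
	}
	q.pending = append(q.pending, r.rangeID)
	return true, nil
}

func main() {
	q := &gcQueue{}
	fmt.Println(q.add(&replica{rangeID: 2, initialized: false}))
	fmt.Println(q.add(&replica{rangeID: 1, initialized: true}))
	fmt.Println(q.pending)
}
```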

Note that we don't actually have to "write a tombstone" for the original error to pop up. We need minReplicaID to be bumped.

This seems plausible. We bump minReplicaID whenever we set our replica ID to a new value:

if replicaID >= r.mu.minReplicaID {
	r.mu.minReplicaID = replicaID + 1
}

If replicaID ever moves backwards, we get this message. (update: was just looking at the one line without its surrounding if) We could almost have something like this:

  1. Range r1 splits (creating r2) while node n3 is down or behind.
  2. r2 is rebalanced from n3 (replica ID 3) to n4 (replica ID 4)
  3. r2 is rebalanced back from n4 to n3 (replica ID 5). The problem with this is that n3 won't be able to process a preemptive snapshot for r2, and I don't think we can assign a new replica ID without a preemptive snapshot. Or is there some path I'm missing here?
  4. n3 catches up with r1's raft log and processes the split. We can acquire the split lock on r2 because it asks for replica ID 0 (unchecked), but we get this error in SplitRange.

@bdarnell
Contributor

The problem with this is that n3 won't be able to process a preemptive snapshot for r2, and I don't think we can assign a new replica ID without a preemptive snapshot. Or is there some path I'm missing here?

Paging this issue back in, I'm not sure what I was thinking here - we set replica ids in response to all raft messages, not just preemptive snapshots. Store.withReplicaForRequest will create an uninitialized Replica with the new replica ID. That method can't complete while the split lock (RHS raftMu) is held, but it works if you flip the order.

  1. Range r1 splits (creating r2) while node n3 is down or behind.
  2. r2 is rebalanced from n3 (replica ID 3) to n4 (replica ID 4).
  3. r2 is rebalanced back from n4 to n3 (replica ID 5) while still down or behind.
  4. n3 catches back up. First, it receives raft messages for r2 containing its new replica ID 5, so it creates the Replica object with that ID (in memory only; this replica ID is not persisted and we do not write a tombstone)
  5. Next, processing the raft log of r1, it reaches the split. acquireSplitLock locks r2's raftMu but does not check the replica ID because we passed 0.
  6. In splitPostApply, we finally check the replica ID and it fails.
  7. After process restart, the knowledge of replica ID 5 is gone so the node can come back up as if this race never happened.

So this is a single crash that does not recur (we've seen it occur on three different clusters, and none of them experienced a second error). Does it matter that we forgot our former replica ID? I don't think it does, because the only thing we can do in our uninitialized state is vote, and the votes are correctly persisted in the HardState.
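
Steps 4 to 7 can be replayed on a toy model that keeps only the two in-memory fields involved. The `replica` type and `setReplicaID` below are hypothetical; the method simply combines the `minReplicaID` checks quoted in this thread and is not the real implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errRaftGroupDeleted stands in for *roachpb.RaftGroupDeletedError.
var errRaftGroupDeleted = errors.New("raft group deleted")

// replica is a toy model holding only in-memory state.
type replica struct {
	replicaID    int
	minReplicaID int
}

// setReplicaID rejects IDs below minReplicaID and bumps minReplicaID past
// every ID it accepts. Zero is passed through unchecked.
func (r *replica) setReplicaID(id int) error {
	if id == r.replicaID || id == 0 {
		return nil
	}
	if id < r.minReplicaID {
		return errRaftGroupDeleted
	}
	r.replicaID = id
	if id >= r.minReplicaID {
		r.minReplicaID = id + 1
	}
	return nil
}

func main() {
	r2 := &replica{} // uninitialized replica of r2 on n3

	fmt.Println(r2.setReplicaID(5)) // step 4: raft message carries the new ID
	fmt.Println(r2.setReplicaID(0)) // step 5: split lock passes zero
	fmt.Println(r2.setReplicaID(3)) // step 6: the split descriptor's ID is too old

	r2 = &replica{}                 // step 7: restart forgets the in-memory state
	fmt.Println(r2.setReplicaID(3)) // the same split now applies
}
```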

I think we can fix this by allowing the replica ID to move backwards in some cases. minReplicaID should only be set from the tombstone and not for transient replica ID bumps. Alternately, we could change the uninitialized raft-message path to avoid setting the replica ID.

Either way, I'm downgrading the priority of this since it is infrequent and doesn't cause persistent problems.

@bdarnell bdarnell modified the milestones: 2.0, 2.1 Feb 14, 2018
@tbg tbg added S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Apr 29, 2018
@tbg tbg added the A-kv-replication Relating to Raft, consensus, and coordination. label May 15, 2018
@bdarnell bdarnell modified the milestones: 2.1, 2.2 Aug 15, 2018
@tbg
Member Author

tbg commented Oct 15, 2018

Reproduced in a 32-node import roachtest: #31184 (comment)

@tbg
Member Author

tbg commented Mar 1, 2019

Have been looking at this again due to unrelated reasons (investigating the replicaid refactor) and am wondering about minReplicaID. In this issue, the restart "fixed" things because it allowed minReplicaID to regress. But what would happen if after the restart, stale messages to this replica got delivered to it? They would now be processed by the replica. This seems sketchy, though I'm not sure why exactly it would matter (i.e. I can't see any concrete harm that could be done this way). Conceptually we'd want the property that messages to different Raft peers (= different replicaIDs) don't intermingle.

@tbg
Member Author

tbg commented Mar 1, 2019

(but that property is out the window already, as we share the Raft log and HardState so conceptually it's really more as if we reused the previous replicaID)

@tbg
Member Author

tbg commented Mar 1, 2019

Taking this a step further, I wonder why we can't always mandate that replicaID == storeID. That is, a replica will only be added to store 5 with replicaID 5. If it then got removed and added again, we'd use replicaID 5 again. GC would be made more awkward perhaps, since there's no good way to tell whether a replica is stale or not (i.e. stray messages would more easily recreate removed replicas, and they would do so in dataless state, where we don't even have replicaGC). But this seems like a problem we can solve somehow.

@bdarnell
Contributor

bdarnell commented Mar 1, 2019

The whole reason for replicaID's existence is that etcd/raft doesn't handle reuse of node/replica ids. Or more specifically, it assumes that a node ID will never regress in certain ways, even across remove/add cycles. Raft may panic if it sees that replica 5 has acknowledged log position 123 and later asks for a snapshot starting before that point. Inconsistency could result if replica 5 casts different votes in the same term.

Maybe there are alternative solutions here that would be less error-prone than changing replicaIDs, though. We already have permanent tombstones; it wouldn't cost much more to keep the HardState around too. And preemptive snapshots might resolve the panic issue (although learner replicas wouldn't, unless we make a larger series of changes to raft to make it less panicky overall).

@tbg
Member Author

tbg commented Mar 7, 2019

more specifically, it assumes that a node ID will never regress in certain ways,

Is this really the reason? I think when we apply a node removal we always nuke the progress for the peer:

https://github.com/cockroachdb/vendored/blob/3f5e5955a4eeffd4c82010e2c249893d94976ffc/go.etcd.io/etcd/raft/raft.go#L1436-L1438

and the removal is processed when the command is applied. I suppose theoretically the command could apply on the Raft leader well after it commits, and another leader could step up and re-add the node which would then contact the old leader and trigger a panic.

But now I had another thought: why is the Raft group keyed on the replicaID in the first place? Assuming we keep everything as is, shouldn't we be able to key the (internal) raft group by storeID? There wouldn't be a change to the external interface, but we'd be freed from the burden of recreating the raft group every time the replicaID changes. We'd still respect deletion tombstones and would tag our replicaID into outgoing Raft messages, so this should not cause any change in functionality and no migration is needed.
We'd have to rewrite from replicaID to nodeID before feeding ConfChanges into Raft:

raftGroup.ApplyConfChange(cc)

The upshot is that debugging gets easier (since peer id == storeID) and we can potentially reduce some of the locking around the raft group, since its lifetime will be that of the surrounding Replica (mod preemptive snapshot shenanigans).
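
A minimal sketch of that rewrite step, assuming the range descriptor used for the lookup still lists the affected replica. `replicaDescriptor`, `confChange`, and `toStoreKeyed` are hypothetical, simplified stand-ins and not the real descriptor or raft types.

```go
package main

import "fmt"

// replicaDescriptor is a simplified stand-in for a descriptor entry.
type replicaDescriptor struct {
	storeID   uint64
	replicaID uint64
}

// confChange is a simplified stand-in for a raft configuration change.
type confChange struct {
	peerID uint64 // replicaID as proposed, storeID once rewritten
	remove bool
}

// toStoreKeyed rewrites the peer of a conf change from replicaID to storeID
// by looking it up in the descriptor's replica list.
func toStoreKeyed(cc confChange, replicas []replicaDescriptor) (confChange, error) {
	for _, rd := range replicas {
		if rd.replicaID == cc.peerID {
			cc.peerID = rd.storeID
			return cc, nil
		}
	}
	return confChange{}, fmt.Errorf("replica ID %d not in descriptor", cc.peerID)
}

func main() {
	replicas := []replicaDescriptor{
		{storeID: 1, replicaID: 1},
		{storeID: 2, replicaID: 2},
		{storeID: 3, replicaID: 5}, // store 3 was removed and re-added
	}
	cc, err := toStoreKeyed(confChange{peerID: 5, remove: true}, replicas)
	fmt.Println(cc, err)
}
```

With this mapping the internal raft group would always see peer 3 for store 3, regardless of how often the replica ID changes.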

Am I missing some reason for which this won't work?

BTW, re-reading the old RFCs further, the real motivation for replicaIDs seems to have been needing replica tombstones:

1. A range R has replicas on nodes A, B, and C; C is down.
2. Nodes A and B execute a `ChangeReplicas` transaction to remove
node C. Several more `ChangeReplicas` transactions follow, adding
nodes D, E, and F and removing A and B.
3. Nodes A and B garbage-collect their copies of the range.
4. Node C comes back up. When it doesn't hear from the lease holder of
range R, it starts an election.
5. Nodes A and B see that node C has a more advanced log position for
range R than they do (since they have nothing), so they vote for it.
C becomes lease holder and sends snapshots to A and B.
6. There are now two "live" versions of the range. Clients
(`DistSenders`) whose range descriptor cache is out of date may
talk to the ABC group instead of the correct DEF group.

That we went ahead and used the replicaID as the peer ID is sensible, but may have been a bad idea.

@bdarnell
Contributor

Is this really the reason? I think when we apply a node removal we always nuke the progress for the peer:

It's more about votes than log progress. If a node is removed and re-added with the same node id within a single term, it could cast a different vote in that term and elect a second leader, leading to split-brain. That's why I said we'd at least need to keep the HardState around indefinitely. I haven't thought through the log ack issues so I don't know whether that would be enough.
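
The vote problem can be shown with a toy three-node election in a single term. `voter`, `hardState`, and `leadersInTerm` are hypothetical and far simpler than etcd/raft; the sketch only shows why losing the persisted vote matters when a peer ID is reused.

```go
package main

import "fmt"

// hardState is a toy version of the persisted term and vote.
type hardState struct {
	term     int
	votedFor int // 0 means no vote cast in this term
}

// voter is a peer that grants at most one vote per term.
type voter struct {
	hs hardState
}

// requestVote grants the vote unless a different candidate already
// received it in the same term.
func (v *voter) requestVote(term, candidate int) bool {
	if term < v.hs.term {
		return false
	}
	if term > v.hs.term {
		v.hs = hardState{term: term} // new term: no vote cast yet
	}
	if v.hs.votedFor != 0 && v.hs.votedFor != candidate {
		return false
	}
	v.hs.votedFor = candidate
	return true
}

// leadersInTerm counts candidates reaching a quorum of 2 out of 3 when
// peer 3 is asked by candidates 1 and 2 in the same term.
func leadersInTerm(keepHardState bool) int {
	const term, quorum = 7, 2
	votes := map[int]int{1: 1, 2: 1} // each candidate votes for itself
	n3 := &voter{}
	if n3.requestVote(term, 1) {
		votes[1]++
	}
	if !keepHardState {
		n3 = &voter{} // removed and re-added under the same ID, state lost
	}
	if n3.requestVote(term, 2) {
		votes[2]++
	}
	leaders := 0
	for _, n := range votes {
		if n >= quorum {
			leaders++
		}
	}
	return leaders
}

func main() {
	fmt.Println(leadersInTerm(true))  // vote is remembered
	fmt.Println(leadersInTerm(false)) // vote is forgotten
}
```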

BTW, re-reading the old RFCs further, the real motivation for replicaIDs seems to have been needing replica tombstones:

This was concurrent with the discovery that we can't reuse replica IDs for the above split-brain reason.

tbg added a commit to tbg/cockroach that referenced this issue Apr 9, 2019
Currently, in-memory Replica objects can end up having a replicaID zero.
Roughly speaking, this is always the case when a Replica's range
descriptor does not contain the Replica's store, though sometimes we do
have a replicaID taken from incoming Raft messages (which then won't
survive across a restart).

We end up in this unnatural state mostly due to preemptive snapshots,
which are a snapshot of the Range state before adding a certain replica,
sent to the store that will house that replica once the configuration
change to add it has completed. The range descriptor in the snapshot
cannot yet assign the Replica a proper replicaID because none has been
allocated yet (and this allocation has to be done in the replica change
transaction, which hasn't started yet).

Even when the configuration change completes and the leader starts
"catching up" the preemptive snapshot and informs it of the replicaID,
it will take a few moments until the Replica catches up to the log entry
that actually updates the descriptor. If the node reboots before that
log entry is durably applied, the replicaID will "restart" at zero until
the leader contacts the Replica again.

This suggests that preemptive snapshots introduce fundamental complexity
which we'd like to avoid - as long as we use preemptive snapshots there
will not be sanity in this department.

This PR introduces a mechanism which delays the application of
preemptive snapshots so that we apply them only when the first request
*after* the completed configuration change comes in (at which point a
replicaID is present).

Superficially, this seems to solve the above problem (since the Replica
will only be instantiated the moment a replicaID is known), though it
doesn't do so across restarts.

However, if we synchronously persisted (not done in this PR) the
replicaID from incoming Raft messages whenever it changed, it seems that
we should always be able to assign a replicaID when creating a Replica,
even when dealing with descriptors that don't contain the replica itself
(since there would've been a Raft message with a replicaID at some
point, and we persist that). This roughly corresponds to persisting
`Replica.lastToReplica`.

We ultimately want to switch to learner replicas instead of preemptive
snapshots. Learner replicas have the advantage that they are always
represented in the replica descriptor, and so the snapshot that
initializes them will be a proper Raft snapshot containing a descriptor
containing the learner Replica itself. However, it's clear that we need
to continue supporting preemptive snapshots in 19.2 due to the need to
support mixed 19.1/19.2 clusters.

This PR in conjunction with persisting the replicaID (and auxiliary
work, for example on the split lock which currently also creates a
replica with replicaID zero and which we know [has bugs]) should allow
us to remove replicaID zero from the code base without waiting out the
19.1 release cycle.

[has bugs]: cockroachdb#21146

Release note: None
@bdarnell bdarnell removed their assignment Jul 31, 2019
@tbg
Member Author

tbg commented Aug 12, 2019

minReplicaID is set in four places (as of the time of writing):

  1. setTombstoneKey (as you'd expect)
  2. when initializing a replica from on-disk state
  3. preemptive snaps - irrelevant since it is soon dead code
  4. when the replicaID changes:
    previousReplicaID := r.mu.replicaID
    r.mu.replicaID = replicaID
    if replicaID >= r.mu.minReplicaID {
    	r.mu.minReplicaID = replicaID + 1
    }

This last code is the one that matters here (though in principle, if we had a way to replicaGC uninit'ed replicas, we'd also get honest tombstones written).

First of all, I was surprised by the +1, I thought it would just set it to the new replicaID. But really this field is more of a nextReplicaID: the field is checked only when the replicaID changes, and then we check that it's >= r.mu.minReplicaID. So the +1 is correct:

if r.mu.replicaID == replicaID {
	// The common case: the replica ID is unchanged.
	return nil
}
if replicaID == 0 {
	// If the incoming message does not have a new replica ID it is a
	// preemptive snapshot. We'll update minReplicaID if the snapshot is
	// accepted.
	return nil
}
if replicaID < r.mu.minReplicaID {
	return &roachpb.RaftGroupDeletedError{}
}

Anyway, what the split trigger really does in these scenarios is say "well, I have data here for this replica and even if the replica got removed and re-added any number of times, I'm still going to stick this old data in because I know that's safe". The safety comes from knowing that at no point could the replica have held data (other than the HardState, which we preserve). Or, in other words, it's safe to apply snapshots from past incarnations of the replica to a newer one. We just want to bypass the tombstone here.
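
One way to picture the bypass, as a hedged sketch on a toy model: the split path installs its data regardless of `minReplicaID`, and only adopts the split descriptor's replica ID when no newer incarnation is known. `initFromSplit` is a hypothetical name, and this is not claimed to be the shape of the eventual fix.

```go
package main

import (
	"errors"
	"fmt"
)

// errRaftGroupDeleted stands in for *roachpb.RaftGroupDeletedError.
var errRaftGroupDeleted = errors.New("raft group deleted")

// replica is a toy model of the in-memory state discussed above.
type replica struct {
	replicaID    int
	minReplicaID int
	initialized  bool
}

// setReplicaID is the regular path: it enforces minReplicaID.
func (r *replica) setReplicaID(id int) error {
	if id == r.replicaID || id == 0 {
		return nil
	}
	if id < r.minReplicaID {
		return errRaftGroupDeleted
	}
	r.replicaID = id
	r.minReplicaID = id + 1
	return nil
}

// initFromSplit (hypothetical) is the split path: it never fails on an old
// replica ID. A newer in-memory incarnation keeps its ID and still receives
// the split's data.
func (r *replica) initFromSplit(splitReplicaID int) {
	if splitReplicaID >= r.minReplicaID {
		r.replicaID = splitReplicaID
		r.minReplicaID = splitReplicaID + 1
	}
	r.initialized = true
}

func main() {
	r2 := &replica{}
	fmt.Println(r2.setReplicaID(5)) // newer incarnation learned from raft messages
	fmt.Println(r2.setReplicaID(3)) // regular path rejects the old ID
	r2.initFromSplit(3)             // split path tolerates it
	fmt.Println(r2.initialized, r2.replicaID, r2.minReplicaID)
}
```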

@bdarnell
Contributor

The safety comes from knowing that at no point could the replica have held data (other than the HardState, which we preserve)

Just to complete the thought: we know this because the yet-to-be-split LHS is in the way. This clearly applies to the KV data; I had to think for a minute to convince myself that it works for logs. (The reason is that until we've processed the split, we're at log index zero. We need a snapshot before we can accept any logs, so the LHS blocking the snapshots also blocks any logs.)

So bypassing the tombstone check when processing a split sounds good to me.

tbg added a commit to tbg/cockroach that referenced this issue Aug 12, 2019
The right hand side of a split can be readded before the split trigger
fires, in which case the split trigger fails.

See [bug description].

I [suggested] a test to reproduce this bug "properly", so we should look
into that. In the meantime, it'll be good to see that this passes tests.
I verified manually that setting `minReplicaID` to some large number
before the call to `rightRng.initRaftMuLockedReplicaMuLocked` reproduces
the symptoms prior to this commit, but that doesn't come as a surprise
nor does it prove that the fix works flawlessly.

[bug description]: cockroachdb#21146 (comment)
[suggested]: cockroachdb#39034 (comment)

Fixes cockroachdb#21146.

Release note (bug fix): Fixed a rare panic (message: "raft group
deleted") that could occur during splits.
@tbg
Member Author

tbg commented Aug 12, 2019

Ack, see #39571

craig bot pushed a commit that referenced this issue Aug 13, 2019
39571: storage: avoid RaftGroupDeletedError from RHS in splitTrigger r=bdarnell a=tbg

The right hand side of a split can be readded before the split trigger
fires, in which case the split trigger fails.

See [bug description].

I [suggested] a test to reproduce this bug "properly", so we should look
into that. In the meantime, it'll be good to see that this passes tests.
I verified manually that setting `minReplicaID` to some large number
before the call to `rightRng.initRaftMuLockedReplicaMuLocked` reproduces
the symptoms prior to this commit, but that doesn't come as a surprise
nor does it prove that the fix works flawlessly.

[bug description]: #21146 (comment)
[suggested]: #39034 (comment)

Fixes #21146.

Release note (bug fix): Fixed a rare panic (message: "raft group
deleted") that could occur during splits.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig craig bot closed this as completed in #39571 Aug 13, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Aug 15, 2019
This verifies the behavior of when the application of some split command
(part of the lhs's log) is delayed on some store and meanwhile the rhs
has rebalanced away and back, ending up with a larger ReplicaID than the
split thinks it will have.

Release note: None
danhhz added a commit to danhhz/cockroach that referenced this issue Aug 16, 2019
This verifies the behavior of when the application of some split command
(part of the lhs's log) is delayed on some store and meanwhile the rhs
has rebalanced away and back, ending up with a larger ReplicaID than the
split thinks it will have.

Release note: None
craig bot pushed a commit that referenced this issue Aug 16, 2019
39694: storage: add regression test for #21146 r=tbg a=danhhz

This verifies the behavior of when the application of some split command
(part of the lhs's log) is delayed on some store and meanwhile the rhs
has rebalanced away and back, ending up with a larger ReplicaID than the
split thinks it will have.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>