storage: *roachpb.RaftGroupDeletedError for RHS in splitPostApply #21146
The scenario here should be something like …
In the error, the split lock has previously been acquired. This implies that we already went through the following code:

```go
func (s *Store) tryGetOrCreateReplica(
	ctx context.Context,
	rangeID roachpb.RangeID,
	replicaID roachpb.ReplicaID,
	creatingReplica *roachpb.ReplicaDescriptor,
) (_ *Replica, created bool, _ error) {
	// The common case: look up an existing (initialized) replica.
	if value, ok := s.mu.replicas.Load(int64(rangeID)); ok {
		// ...
		repl.setReplicaIDRaftMuLockedMuLocked(replicaID)
		// ...
		return
	}
	// No replica currently exists, so we'll try to create one. Before creating
	// the replica, see if there is a tombstone which would indicate that this is
	// a stale message.
	tombstoneKey := keys.RaftTombstoneKey(rangeID)
	var tombstone roachpb.RaftTombstone
	if ok, err := engine.MVCCGetProto(
		ctx, s.Engine(), tombstoneKey, hlc.Timestamp{}, true, nil, &tombstone,
	); err != nil {
		return nil, false, err
	} else if ok {
		if replicaID != 0 && replicaID < tombstone.NextReplicaID {
			return nil, false, &roachpb.RaftGroupDeletedError{}
		}
	}
```

so we won't be checking tombstones in that case (fourth line from the bottom). Similarly …. So it's possible to acquire the …. Then, in

```go
if replicaID < r.mu.minReplicaID {
	return &roachpb.RaftGroupDeletedError{}
}
```

I think the right thing to do is to pass the proper …. That shifts the possible occurrence of the error to the acquisition of the split lock. What to do in that case? If we get …? We could also leave the error to bubble up at the end, but interpret it in ….

@bdarnell I won't get to this any time soon, but it does seem important to fix. |
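To restate the two guards above in isolation, here is a minimal, self-contained Go sketch (toy types and invented values, not the actual cockroach code): a replicaID of zero, which is what the split code path ends up using since it doesn't know the RHS's current replicaID, sails past the tombstone comparison, while the later in-memory check still fires.

```go
package main

import (
	"errors"
	"fmt"
)

var errRaftGroupDeleted = errors.New("raft group deleted")

// checkTombstone mirrors the guard in tryGetOrCreateReplica: a replicaID of
// zero bypasses the tombstone comparison entirely.
func checkTombstone(replicaID, tombstoneNextReplicaID int) error {
	if replicaID != 0 && replicaID < tombstoneNextReplicaID {
		return errRaftGroupDeleted
	}
	return nil
}

// checkMinReplicaID mirrors the later in-memory guard, which does fire.
func checkMinReplicaID(replicaID, minReplicaID int) error {
	if replicaID < minReplicaID {
		return errRaftGroupDeleted
	}
	return nil
}

func main() {
	// The RHS was removed and re-added, so only replicaIDs >= 5 are legal,
	// while the delayed split still carries the stale replicaID 2.
	fmt.Println(checkTombstone(0, 5))    // <nil>: the tombstone check is skipped
	fmt.Println(checkMinReplicaID(2, 5)) // raft group deleted: the stale ID is rejected
}
```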
The complicated part of this is going to be in shortening the LHS. If we can't create the RHS replica, we can't send it through the replica GC queue. We'd need a new code path to destroy the on-disk data that's left over here.
Yeah, all that plumbing may not be necessary. |
In this special case, since we know that the on-disk data is there, can we force-create the replica with its old replicaID (i.e., allow a recreation), and then let the replica GC queue destroy it?

But, looking into this further, it seems that the existence of the tombstone is in itself a bug; this is from …:

```go
if placeholder := s.getOverlappingKeyRangeLocked(desc); placeholder != rep {
	// This is a fatal error because uninitialized replicas shouldn't make it
	// this far. This method will need some changes when we introduce GC of
	// uninitialized replicas.
	s.mu.Unlock()
	log.Fatalf(ctx, "replica %+v unexpectedly overlapped by %+v", rep, placeholder)
}
```

(and that Fatal should fire, since …). And it seems like we actually never queue uninitialized replicas from Raft handling:

```go
func (s *Store) HandleRaftResponse(ctx context.Context, resp *RaftMessageResponse) error {
	ctx = s.AnnotateCtx(ctx)
	repl, replErr := s.GetReplica(resp.RangeID)
	if replErr == nil {
		// Best-effort context annotation of replica.
		ctx = repl.AnnotateCtx(ctx)
	}
	switch val := resp.Union.GetValue().(type) {
	case *roachpb.Error:
		switch tErr := val.GetDetail().(type) {
		case *roachpb.ReplicaTooOldError:
			if replErr != nil {
				// RangeNotFoundErrors are expected here; nothing else is.
				if _, ok := replErr.(*roachpb.RangeNotFoundError); !ok {
					log.Error(ctx, replErr)
				}
				return nil
			}
			// ... now actually replicaGCQueue.Add()
```

Perhaps one of the other callers to …:

```go
func (s *Store) canApplySnapshotLocked(
	ctx context.Context, rangeDescriptor *roachpb.RangeDescriptor,
) (*ReplicaPlaceholder, error) {
	if v, ok := s.mu.replicas.Load(int64(rangeDescriptor.RangeID)); ok &&
		(*Replica)(v).IsInitialized() {
		// We have the range and it's initialized, so let the snapshot through.
		return nil, nil
	}
	// We don't have the range (or we have an uninitialized
	// placeholder). Will we be able to create/initialize it?
	if exRng, ok := s.mu.replicaPlaceholders[rangeDescriptor.RangeID]; ok {
		return nil, errors.Errorf("%s: canApplySnapshotLocked: cannot add placeholder, have an existing placeholder %s", s, exRng)
	}
	if exRange := s.getOverlappingKeyRangeLocked(rangeDescriptor); exRange != nil {
		// We have a conflicting range, so we must block the snapshot.
		// When such a conflict exists, it will be resolved by one range
		// either being split or garbage collected.
		exReplica, err := s.GetReplica(exRange.Desc().RangeID)
		msg := IntersectingSnapshotMsg
		if err != nil {
			log.Warning(ctx, errors.Wrapf(
				err, "unable to look up overlapping replica on %s", exReplica))
		} else {
			inactive := func(r *Replica) bool {
				if r.RaftStatus() == nil {
					return true
				}
				lease, pendingLease := r.GetLease()
				now := s.Clock().Now()
				return !r.IsLeaseValid(lease, now) &&
					(pendingLease == nil || !r.IsLeaseValid(*pendingLease, now))
			}
			// If the existing range shows no signs of recent activity, give it a GC
			// run.
			if inactive(exReplica) {
				if _, err := s.replicaGCQueue.Add(exReplica, replicaGCPriorityCandidate); err != nil {
					log.Errorf(ctx, "%s: unable to add replica to GC queue: %s", exReplica, err)
				} else {
					msg += "; initiated GC:"
				}
			}
```

I don't see how an uninitialized replica will end up in the above code path, because it shouldn't ever overlap an incoming snapshot (it's not in …). There are two more callers to …. I'm left wondering how we wrote a tombstone. Note that we don't actually have to "write a tombstone" for the original error to pop up. We need … |
Yes, we probably could, but that's part of the complexity I was thinking about. Do we just special-case it in memory so we have a Replica object that is forbidden by the on-disk tombstone, or do we delete the tombstone as part of processing the split so we can recreate the replica (and then GC it and recreate the tombstone)? (Now that I've written this out, the latter sounds like a terrible idea.) But hopefully we can forget about that and fix this by preventing the tombstone from being written in the first place. Even if …
This seems plausible. We bump minReplicaID whenever we set our replica ID to a new value: `cockroach/pkg/storage/replica.go`, lines 953 to 955 in 4051ba3
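As a minimal sketch of that interaction, assuming the bump is to the new ID plus one (the linked lines would show the real rule) and using toy types rather than the real `Replica`: once the re-added RHS has made the replica adopt its new, higher ID, the delayed split can no longer install the old one, even though no tombstone was ever written.

```go
package main

import (
	"errors"
	"fmt"
)

var errRaftGroupDeleted = errors.New("raft group deleted")

// toyReplica models only the two fields relevant to this discussion.
type toyReplica struct {
	replicaID    int
	minReplicaID int
}

// setReplicaID sketches the behavior described above: every time the replica
// ID moves to a new value, the floor below which IDs are rejected is bumped,
// and any later attempt to go back below that floor fails.
func (r *toyReplica) setReplicaID(id int) error {
	if id < r.minReplicaID {
		return errRaftGroupDeleted
	}
	if id != r.replicaID {
		r.replicaID = id
		r.minReplicaID = id + 1 // assumed bump for illustration
	}
	return nil
}

func main() {
	r := &toyReplica{}
	fmt.Println(r.setReplicaID(5)) // re-added RHS talks to us first: <nil>
	fmt.Println(r.setReplicaID(2)) // delayed split tries the old ID: raft group deleted
}
```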
|
Paging this issue back in, I'm not sure what I was thinking here: we set replica IDs in response to all Raft messages, not just preemptive snapshots.
So this is a single crash that does not recur (we've seen it occur on three different clusters, and none of them experienced a second error). Does it matter that we forgot our former replica ID? I don't think it does, because the only thing we can do in our uninitialized state is vote, and the votes are correctly persisted in the HardState. I think we can fix this by allowing the replica ID to move backwards in some cases. Either way, I'm downgrading the priority of this since it is infrequent and doesn't cause persistent problems. |
Reproduced in a 32-node import roachtest: #31184 (comment) |
I've been looking at this again for unrelated reasons (investigating the replicaID refactor) and am wondering about … |
(but that property is out the window already, as we share the Raft log and HardState, so conceptually it's really more as if we reused the previous replicaID) |
Taking this a step further, I wonder why we can't always mandate that … |
The whole reason for replicaID's existence is that etcd/raft doesn't handle reuse of node/replica ids. Or more specifically, it assumes that a node ID will never regress in certain ways, even across remove/add cycles. Raft may panic if it sees that replica 5 has acknowledged log position 123 and later asks for a snapshot starting before that point. Inconsistency could result if replica 5 casts different votes in the same term. Maybe there are alternative solutions here that would be less error-prone than changing replicaIDs, though. We already have permanent tombstones; it wouldn't cost much more to keep the HardState around too. And preemptive snapshots might resolve the panic issue (although learner replicas wouldn't, unless we make a larger series of changes to raft to make it less panicky overall). |
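A small toy Go illustration of the double-vote hazard described here (nothing CockroachDB- or etcd-specific, just a made-up `hardState` type): if the durable term/vote pair is discarded when a peer is removed, a peer re-added under the same ID can vote for a second candidate in the same term, which is exactly the split-brain scenario.

```go
package main

import "fmt"

// hardState is a toy stand-in for raft's durable HardState: the highest term
// seen and the candidate voted for in that term.
type hardState struct {
	term int
	vote string
}

// castVote grants the vote only if we haven't already voted for someone else
// in this term. This invariant is what limits each term to one leader.
func (hs *hardState) castVote(term int, candidate string) bool {
	if term > hs.term {
		hs.term, hs.vote = term, candidate
		return true
	}
	if term == hs.term && (hs.vote == "" || hs.vote == candidate) {
		hs.vote = candidate
		return true
	}
	return false
}

func main() {
	hs := &hardState{}
	fmt.Println(hs.castVote(7, "candidate-A")) // true

	// The peer is removed and re-added under the same ID, and its HardState
	// is discarded instead of kept around. The reborn peer happily votes
	// again in term 7, for a different candidate; with enough such peers,
	// two leaders can win the same term.
	reborn := &hardState{}
	fmt.Println(reborn.castVote(7, "candidate-B")) // true, but should have been false
}
```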
Is this really the reason? I think when we apply a node removal we always nuke the progress for the peer: …, and the removal is processed when the command is applied. I suppose theoretically the command could apply on the Raft leader well after it commits, and another leader could step up and re-add the node, which would then contact the old leader and trigger a panic.

But now I had another thought: why is the Raft group keyed on the replicaID in the first place? Assuming we keep everything as is, shouldn't we be able to key the (internal) raft group by storeID? There wouldn't be a change to the external interface, but we'd be freed from the burden of recreating the raft group every time the replicaID changes. We'd still respect deletion tombstones and would tag our replicaID into outgoing Raft messages, so this should not cause any change in functionality, and no migration is needed.

`cockroach/pkg/storage/replica_raft.go`, line 841 in ffe2d1d
The upshot is that debugging gets easier (since peer ID == storeID) and we can potentially reduce some of the locking around the raft group, since its lifetime will be that of the surrounding Replica (mod preemptive snapshot shenanigans). Am I missing some reason for which this won't work?

BTW, re-reading the old RFCs further, the real motivation for replicaIDs seems to have been the need for replica tombstones: `cockroach/docs/RFCS/20150729_replica_tombstone.md`, lines 35 to 47 in 31b2d57
That we went ahead and used the replicaID as the peer ID is sensible, but may have been a bad idea. |
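For concreteness, here is roughly what "key the raft group by storeID" could look like at group-construction time with etcd/raft. This is a sketch under assumptions (the import path varies by etcd release, and the `storeID` value is made up), not the cockroach code; the only change relative to today would be which ID gets handed to raft, while replica tombstones and the replicaID tagged onto outgoing messages would stay as they are.

```go
package main

import (
	"go.etcd.io/etcd/raft/v3" // older releases use go.etcd.io/etcd/raft
)

func main() {
	const storeID = 3 // hypothetical store ID of this node's store

	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		// The peer ID handed to raft is the storeID rather than the current
		// replicaID, so the group never needs to be recreated when the
		// replicaID changes.
		ID:              storeID,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   4096,
		MaxInflightMsgs: 256,
	}

	// Single-member group just to show construction; real membership would
	// list the storeIDs of the other replicas' stores.
	n := raft.StartNode(cfg, []raft.Peer{{ID: storeID}})
	defer n.Stop()
}
```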
It's more about votes than log progress. If a node is removed and re-added with the same node id within a single term, it could cast a different vote in that term and elect a second leader, leading to split-brain. That's why I said we'd at least need to keep the HardState around indefinitely. I haven't thought through the log ack issues so I don't know whether that would be enough.
This was concurrent with the discovery that we can't reuse replica IDs for the above split-brain reason. |
Currently, in-memory Replica objects can end up having a replicaID zero. Roughly speaking, this is always the case when a Replica's range descriptor does not contain the Replica's store, though sometimes we do have a replicaID taken from incoming Raft messages (which then won't survive across a restart).

We end up in this unnatural state mostly due to preemptive snapshots, which are a snapshot of the Range state before adding a certain replica, sent to the store that will house that replica once the configuration change to add it has completed. The range descriptor in the snapshot cannot yet assign the Replica a proper replicaID because none has been allocated yet (and this allocation has to be done in the replica change transaction, which hasn't started yet). Even when the configuration change completes and the leader starts "catching up" the preemptive snapshot and informs it of the replicaID, it will take a few moments until the Replica catches up to the log entry that actually updates the descriptor. If the node reboots before that log entry is durably applied, the replicaID will "restart" at zero until the leader contacts the Replica again.

This suggests that preemptive snapshots introduce fundamental complexity which we'd like to avoid - as long as we use preemptive snapshots there will not be sanity in this department. This PR introduces a mechanism which delays the application of preemptive snapshots so that we apply them only when the first request *after* the completed configuration change comes in (at which point a replicaID is present).

Superficially, this seems to solve the above problem (since the Replica will only be instantiated the moment a replicaID is known), though it doesn't do so across restarts. However, if we synchronously persisted (not done in this PR) the replicaID from incoming Raft messages whenever it changed, it seems that we should always be able to assign a replicaID when creating a Replica, even when dealing with descriptors that don't contain the replica itself (since there would've been a Raft message with a replicaID at some point, and we persist that). This roughly corresponds to persisting `Replica.lastToReplica`.

We ultimately want to switch to learner replicas instead of preemptive snapshots. Learner replicas have the advantage that they are always represented in the replica descriptor, and so the snapshot that initializes them will be a proper Raft snapshot containing a descriptor containing the learner Replica itself. However, it's clear that we need to continue supporting preemptive snapshots in 19.2 due to the need to support mixed 19.1/19.2 clusters. This PR in conjunction with persisting the replicaID (and auxiliary work, for example on the split lock which currently also creates a replica with replicaID zero and which we know [has bugs]) should allow us to remove replicaID zero from the code base without waiting out the 19.1 release cycle.

[has bugs]: cockroachdb#21146

Release note: None
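As a thought experiment on the "synchronously persist the replicaID from incoming Raft messages" idea from the message above, here is a minimal Go sketch. The `replicaIDStore` type and the map standing in for the storage engine are inventions for illustration; a real change would write through the same engine that holds the HardState.

```go
package main

import (
	"fmt"
	"sync"
)

// replicaIDStore stands in for the storage engine: it durably records, per
// range, the highest replicaID this store has ever been addressed as
// (roughly, a persisted Replica.lastToReplica).
type replicaIDStore struct {
	mu  sync.Mutex
	ids map[int64]int // rangeID -> last known replicaID
}

// maybePersist is what would run when a Raft message arrives: if the message
// addresses us under a replicaID we haven't recorded yet, persist it before
// acting on the message, so that after a restart we can always (re)create
// the Replica with a nonzero replicaID.
func (s *replicaIDStore) maybePersist(rangeID int64, replicaID int) (persisted bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if replicaID > s.ids[rangeID] {
		s.ids[rangeID] = replicaID
		return true
	}
	return false
}

func main() {
	s := &replicaIDStore{ids: map[int64]int{}}
	fmt.Println(s.maybePersist(12, 3)) // first message after the config change: true
	fmt.Println(s.maybePersist(12, 3)) // subsequent messages: false, nothing to do
}
```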
This last code is the one that matters here (though in principle, if we had a way to replicaGC uninit'ed replicas, we'd also get honest tombstones written). First of all, I was surprised by the following: `cockroach/pkg/storage/replica_init.go`, lines 169 to 181 in 7dcf66d
Anyway, what the split trigger really does in these scenarios is say "well, I have data here for this replica and even if the replica got removed and re-added any number of times, I'm still going to stick this old data in because I know that's safe". The safety comes from knowing that at no point could the replica have held data (other than the HardState, which we preserve). Or, in other words, it's safe to apply snapshots from past incarnations of the replica to a newer one. We just want to bypass the tombstone here. |
Just to complete the thought: we know this because the yet-to-be-split LHS is in the way. This clearly applies to the KV data; I had to think for a minute to convince myself that it works for logs. (The reason is that until we've processed the split, we're at log index zero. We need a snapshot before we can accept any logs, so the LHS blocking the snapshots also blocks any logs.) So bypassing the tombstone check when processing a split sounds good to me. |
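A minimal sketch of "bypass the tombstone check when processing a split", with toy types and an invented `fromSplit` flag rather than the real `tryGetOrCreateReplica` signature: the split trigger may lay down the pre-split RHS data regardless of the tombstone, because the un-split LHS has kept the RHS from ever holding data (other than the HardState, which is preserved).

```go
package main

import (
	"errors"
	"fmt"
)

var errRaftGroupDeleted = errors.New("raft group deleted")

// createReplica models the tombstone guard with one extra, hypothetical
// escape hatch: a creation that comes from applying a split trigger skips
// the comparison, because the data it installs predates every incarnation
// of the RHS and is therefore safe to lay down regardless of tombstones.
func createReplica(replicaID, tombstoneNextReplicaID int, fromSplit bool) error {
	if !fromSplit && replicaID != 0 && replicaID < tombstoneNextReplicaID {
		return errRaftGroupDeleted
	}
	// ... create the replica; a since-re-added RHS would then be caught up
	// or replicaGC'ed through the normal paths.
	return nil
}

func main() {
	// Tombstone says IDs below 5 are dead; the split still carries ID 2.
	fmt.Println(createReplica(2, 5, false)) // regular path: raft group deleted
	fmt.Println(createReplica(2, 5, true))  // split trigger: allowed through
}
```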
The right hand side of a split can be re-added before the split trigger fires, in which case the split trigger fails. See [bug description]. I [suggested] a test to reproduce this bug "properly", so we should look into that. In the meantime, it'll be good to see that this passes tests.

I verified manually that setting `minReplicaID` to some large number before the call to `rightRng.initRaftMuLockedReplicaMuLocked` reproduces the symptoms prior to this commit, but that doesn't come as a surprise, nor does it prove that the fix works flawlessly.

[bug description]: cockroachdb#21146 (comment)
[suggested]: cockroachdb#39034 (comment)

Fixes cockroachdb#21146.

Release note (bug fix): Fixed a rare panic (message: "raft group deleted") that could occur during splits.
Ack, see #39571 |
39571: storage: avoid RaftGroupDeletedError from RHS in splitTrigger r=bdarnell a=tbg

The right hand side of a split can be re-added before the split trigger fires, in which case the split trigger fails. See [bug description]. I [suggested] a test to reproduce this bug "properly", so we should look into that. In the meantime, it'll be good to see that this passes tests.

I verified manually that setting `minReplicaID` to some large number before the call to `rightRng.initRaftMuLockedReplicaMuLocked` reproduces the symptoms prior to this commit, but that doesn't come as a surprise, nor does it prove that the fix works flawlessly.

[bug description]: #21146 (comment)
[suggested]: #39034 (comment)

Fixes #21146.

Release note (bug fix): Fixed a rare panic (message: "raft group deleted") that could occur during splits.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
This verifies the behavior when the application of some split command (part of the lhs's log) is delayed on some store and meanwhile the rhs has rebalanced away and back, ending up with a larger ReplicaID than the split thinks it will have.

Release note: None
39694: storage: add regression test for #21146 r=tbg a=danhhz

This verifies the behavior when the application of some split command (part of the lhs's log) is delayed on some store and meanwhile the rhs has rebalanced away and back, ending up with a larger ReplicaID than the split thinks it will have.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
https://sentry.io/cockroach-labs/cockroachdb/issues/426530382/