storage: splitPostApply can see tombstone for RHS #40470

knz · 2019-09-04T15:02:43Z

Describe the problem

node fails with

F190904 14:33:35.276986 138 storage/store.go:2148  [n1,s1,r313/4:/Table/58/1/{2525/9…-3750/1…}] split trigger found right-hand side with tombstone {NextReplicaID:5}: [n1,s1,r316/?:{-}]

To Reproduce

roachprod create knz-gce -u knz -c gce --geo -n 10
roachprod start $CLUSTER:1,4,7-8
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc --experimental-direct-ingestion"

This fails within 1-2 minutes.

Relevant log lines:

I190904 14:33:33.261425 180 server/status/runtime.go:498  [n1] runtime stats: 4.7 GiB RSS, 407 goroutines, 2.6 GiB/718 MiB/3.4 GiB GO alloc/idle/total, 1.1 GiB/1.3 GiB CGO alloc/total, 210805.4 CGO/sec, 381.5/16.5 %(u/s)time, 0.0 %gc (3x), 223 MiB/85 MiB (r/w)net
I190904 14:33:33.964062 11702 storage/replica_raft.go:291  [n1,s1,r261/1:/Table/56{-/1}] proposing REMOVE_REPLICA[(n4,s4):3]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):5] next=6
W190904 14:33:34.137913 11847 storage/replica_raft.go:105  [n1,s1,r323/1:/Table/60/1/2{633/2/…-766/4/…}] context canceled before proposing: 1 HeartbeatTxn
I190904 14:33:34.489669 12200 storage/replica_command.go:1521  [n1,replicate,s1,r247/1:/{Table/61/3-Max}] change replicas (add [] remove [(n3,s3):2]): existing descriptor r247:/{Table/61/3-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):5, next=6, gen=20]
I190904 14:33:34.669612 11610 storage/replica_raftstorage.go:793  [n1,s1,r313/4:{-}] applying LEARNER snapshot [id=faf20096 index=15]
I190904 14:33:34.944105 11610 storage/replica_raftstorage.go:814  [n1,s1,r313/4:/Table/58/1/{2525/9…-3750/1…}] applied LEARNER snapshot [total=274ms ingestion=4@217ms id=faf20096 index=15]
I190904 14:33:35.053208 12088 storage/split_queue.go:149  [n1,split,s1,r307/1:/Table/54/1/125{3/119…-5/522…}] split saw concurrent descriptor modification; maybe retrying
W190904 14:33:35.202730 12090 storage/replica_raft.go:105  [n1,s1,r304/1:/Table/53/1/250{3/7/-…-5/3/-…}] context canceled before proposing: 1 HeartbeatTxn
I190904 14:33:35.232699 12190 storage/replica_command.go:395  [n1,split,s1,r307/1:/Table/54/1/125{3/119…-5/522…}] initiating a split of this range at key /Table/54/1/1254/11837 [r329] (77 MiB above threshold size 64 MiB)
F190904 14:33:35.276986 138 storage/store.go:2148  [n1,s1,r313/4:/Table/58/1/{2525/9…-3750/1…}] split trigger found right-hand side with tombstone {NextReplicaID:5}: [n1,s1,r316/?:{-}]
goroutine 138 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000448301, 0xc000448360, 0x0, 0x7c662a)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x7c04a40, 0xc000000004, 0x73d2595, 0x10, 0x864, 0xc0006e43c0, 0x89)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x93e
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4e5e460, 0xc03ccc99e0, 0x4, 0x2, 0x4563081, 0x3a, 0xc003394730, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2cc
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4e5e460, 0xc03ccc99e0, 0x1, 0xc000000004, 0x4563081, 0x3a, 0xc003394730, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:180
github.com/cockroachdb/cockroach/pkg/storage.splitPostApply(0x4e5e460, 0xc03ccc99e0, 0x0, 0x15c142ce810fb293, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2148 +0xb3e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleSplitResult(...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application_result.go:233
github.com/cockroachdb/cockroach/pkg/storage.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc003e298c0, 0x4e5e460, 0xc03ccc99e0, 0x0, 0x0, 0xc0003e1e40, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application_state_machine.go:943 +0x852
github.com/cockroachdb/cockroach/pkg/storage.(*replicaStateMachine).ApplySideEffects(0xc003e298c0, 0x4eacee0, 0xc063842008, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application_state_machine.go:856 +0x72d
github.com/cockroachdb/cockroach/pkg/storage/apply.mapCheckedCmdIter(0x7f378dc4b0c8, 0xc003e29ad8, 0xc0033953d8, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/apply/cmd.go:182 +0x11b
github.com/cockroachdb/cockroach/pkg/storage/apply.(*Task).applyOneBatch(0xc003395800, 0x4e5e460, 0xc03ccc99e0, 0x4eacfa0, 0xc003e29a78, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/apply/task.go:276 +0x228
github.com/cockroachdb/cockroach/pkg/storage/apply.(*Task).ApplyCommittedEntries(0xc003395800, 0x4e5e460, 0xc03ccc99e0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/apply/task.go:242 +0xcf
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc003e29800, 0x4e5e460, 0xc03ccc99e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:759 +0xd87
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x4e5e460, 0xc03ccc99e0, 0xc003e29800, 0x4e5e460)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3599 +0x131
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc000adc000, 0x4e5e460, 0xc03ccc99e0, 0xc009c02200, 0xc074b31e98, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3352 +0x150

Expected behavior

Import succeeds

Context

kena@knz-gce-0001:~$ ./cockroach  version
Build Tag:    v19.2.0-alpha.20190606-2012-g58d0fc3
Build Time:   2019/09/04 11:31:21
Distribution: CCL
Platform:     linux amd64 (x86_64-unknown-linux-gnu)
Go Version:   go1.12.5
C Compiler:   gcc 6.3.0
Build SHA-1:  58d0fc3676726c7fa3ebaf41e99f54f305f25fa0
Build Type:   release

knz · 2019-09-04T15:03:16Z

cc @dt for triage
(I was trying --experimental-direct-ingestion as suggested offline - without the flag, the import seems OK)

dt · 2019-09-04T15:07:29Z

--experimental-direct-ingestion is the default in 19.2 so adding the flag should do nothing.

Is this BulkIO? That stack trace looks like it is coming from core, no? Or does core think that IMPORT is misbehaving?

knz · 2019-09-04T16:27:20Z

OK, so you're saying this failure is not deterministic, and I was just (un)lucky to see the panic when I passed the flag explicitly and not see it when I omitted the flag?

If so that's a bummer because it will make the investigation a bit lengthy.

knz · 2019-09-04T16:28:24Z

@danhhz how do you think this should be triaged?

tbg · 2019-09-04T16:29:24Z

Will be fixed by #39796.

tbg · 2019-09-04T16:33:05Z

Actually if you haven't removed this cluster yet, could you grab the logs from all nodes?

tbg · 2019-09-04T21:46:11Z

The crash here indicates that a split trigger found a tombstone for the right-hand side replica. We assert against that because the thinking is that this should never happen: we can't replicaGC a replica that is uninitialized (=has never received a snapshot), and (or so we thought) the RHS is always uninitialized before the snapshot because snapshots are blocked by the LHS until the split trigger applies. But I think there's a way in which this can be violated if the LHS comes into existence late on the crashing store. Something like this:

using preemptive snapshots still
r1 is on store s1 only
r1 gets added to s2 and gets split (creating r2) in rapid succession
preemptive snap on s2 gets replicaGC'ed
right hand side r2 manages to place a preemptive snap on s2 (LHS is not there to block it)
preemptive snap on s2/r2 gets replicaGC'ed, leaving a tombstone for actualDesc.NextReplicaID
s2/r1 catches up and executes the split trigger, finding the tombstone on the RHS

Some variant of this basically has to be it. Raft tombstones are only written in two places: when a range is subsumed and during replicaGC. On merges, we use math.MaxInt32 as NextReplicaID but the example shows a 5. So it must be ReplicaGC. ReplicaGC can only happen to initialized replicas (which includes preemptive snaps), so the RHS has to have been init'ed at some point. And this implies that the pre-split LHS wasn't there then.
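The deduction above (a tombstone with NextReplicaID 5 cannot come from a merge, so it must come from replicaGC) can be sketched as a small check. This is a toy model and not the actual storage code: the `tombstone` struct and the `tombstoneOrigin` function are hypothetical names introduced for illustration. Only the use of math.MaxInt32 for merges and the value 5 from the crash message are taken from the discussion.

```go
package main

import (
	"fmt"
	"math"
)

// tombstone is a simplified, hypothetical stand-in for the raft tombstone
// discussed above. Only the NextReplicaID field matters here.
type tombstone struct {
	NextReplicaID int32
}

// tombstoneOrigin guesses which of the two code paths wrote the tombstone.
// Assumption from the comment above: merges write math.MaxInt32, and the
// only other writer is replicaGC.
func tombstoneOrigin(t tombstone) string {
	if t.NextReplicaID == math.MaxInt32 {
		return "merge"
	}
	return "replicaGC"
}

func main() {
	// The value reported in the fatal log line of this issue.
	fmt.Println(tombstoneOrigin(tombstone{NextReplicaID: 5}))
	// The sentinel used when a range is subsumed.
	fmt.Println(tombstoneOrigin(tombstone{NextReplicaID: math.MaxInt32}))
}
```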

An alternative could work like this

add s2/r1, split, remove s2/r1 again before it has executed the trigger
r1/s2 replicagc'ed
r2/s2 gets fully added (r1/s2 is gone so snapshot goes through)
r2/s2 gets removed again and replicaGC'ed, leaving tombstone
r1/s2 gets re-added, so it catches up across the split trigger seeing the tombstone.

I think the second scenario is more likely since @knz was running 19.2-alpha which never uses preemptive snaps.
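The second scenario can be replayed against a toy model of store s2 to see how the split trigger ends up facing a tombstone. This is only a sketch under loud assumptions: the `store` type, its methods, and `runScenario` are hypothetical, the replica IDs other than the final NextReplicaID of 5 are made up, and the real code calls log.Fatalf where this model returns an error.

```go
package main

import "fmt"

type rangeID int

// store is a hypothetical, heavily simplified model of a single store.
type store struct {
	replicas   map[rangeID]int // range -> replica ID of the live replica
	tombstones map[rangeID]int // range -> NextReplicaID left by replicaGC
}

func newStore() *store {
	return &store{replicas: map[rangeID]int{}, tombstones: map[rangeID]int{}}
}

// addReplica installs a replica (for example via a snapshot). A tombstone
// rejects replica IDs below its NextReplicaID.
func (s *store) addReplica(r rangeID, replicaID int) error {
	if next, ok := s.tombstones[r]; ok && replicaID < next {
		return fmt.Errorf("r%d: replica ID %d is below tombstone NextReplicaID %d", r, replicaID, next)
	}
	s.replicas[r] = replicaID
	return nil
}

// replicaGC removes the replica and leaves a tombstone behind.
func (s *store) replicaGC(r rangeID, nextReplicaID int) {
	delete(s.replicas, r)
	s.tombstones[r] = nextReplicaID
}

// applySplit models the LHS applying the split trigger. It returns an error
// where the code in this issue hits the fatal assertion.
func (s *store) applySplit(lhs, rhs rangeID) error {
	if _, ok := s.replicas[lhs]; !ok {
		return fmt.Errorf("r%d: no local replica to apply the split", lhs)
	}
	if next, ok := s.tombstones[rhs]; ok {
		return fmt.Errorf("split trigger found right-hand side with tombstone {NextReplicaID:%d}", next)
	}
	return nil
}

// runScenario replays the five steps of the second scenario on s2.
func runScenario() error {
	s2 := newStore()
	const r1, r2 rangeID = 1, 2

	// 1. Add s2/r1; the split happens elsewhere, and s2/r1 is removed
	//    before it has executed the trigger.
	if err := s2.addReplica(r1, 2); err != nil {
		return err
	}
	// 2. r1/s2 is replicaGC'ed.
	s2.replicaGC(r1, 3)
	// 3. r2/s2 gets fully added (no LHS is present to block the snapshot).
	if err := s2.addReplica(r2, 4); err != nil {
		return err
	}
	// 4. r2/s2 is removed again and replicaGC'ed, leaving the tombstone.
	s2.replicaGC(r2, 5)
	// 5. r1/s2 is re-added and catches up across the split trigger.
	if err := s2.addReplica(r1, 3); err != nil {
		return err
	}
	return s2.applySplit(r1, r2)
}

func main() {
	fmt.Println(runScenario())
}
```

The model deliberately ignores raft logs, snapshots, and learner state; it only tracks which replicas and tombstones exist on s2 at each step.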

This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

There is one more bug here that is not addressed by this change, namely the fact that there might be a tombstone for the right hand side. This continues to be tracked in cockroachdb#40470.

See also cockroachdb#40367 (comment) for more work we need to do to establish sanity.

Release note: None

dt · 2019-09-06T13:44:35Z

Also seen in import/tpcc/warehouses=1000/nodes=32 run on 9/2 (node 30's logs) in https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&buildTypeId=Cockroach_Nightlies_WorkloadNightly

nvanbenschoten · 2019-09-11T19:24:49Z

I still see this with #39796.

We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios, this PR prevents them from occurring by actively removing replicas when we determine that they must have been removed.

Here's a high-level overview of the change:

* Once a Replica object has a non-zero Replica.mu.replicaID it will not change.
  * In this commit however, if a node crashes it may forget that it learned about a replica ID.
* If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely.
* If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command.
* Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending.

This hopefully gives us some new invariants:

* There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time.
* Once a Replica has a non-zero ReplicaID it never changes.
  * This applies only to the in-memory object, not the store itself.
* Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range.
  * Corollary: a Replica created as a learner will only ever apply commands while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible.

Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range.

Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested.

Release justification: This commit is safe for 19.2 because it fixes release blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

* cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
* cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
* cockroachdb#40470 "split trigger found right-hand side with tombstone"
* cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
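Two of the rules in the commit message above (a non-zero replica ID never changes on the in-memory object, and a message addressed to a higher replica ID removes the current replica) can be sketched as follows. This is a hypothetical illustration, not the code from the PR: the `replica` type, `handleMessage`, and both error values are invented names, and the handling of the zero-ID and lower-ID cases is an assumption that the commit message does not spell out.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errRemoved = errors.New("replica removed: message addressed to a higher replica ID")
	errStale   = errors.New("message dropped: addressed to an older replica ID")
)

// replica is a hypothetical in-memory object. The destroyed flag loosely
// plays the role that destroyStatus plays in the commit message.
type replica struct {
	replicaID int // once non-zero, never changes
	destroyed bool
}

// handleMessage applies the rules to a message addressed to toReplicaID.
func (r *replica) handleMessage(toReplicaID int) error {
	switch {
	case r.destroyed:
		// A removed object is never revived under a new ID.
		return errRemoved
	case r.replicaID == 0:
		// Assumption: an object without an ID adopts the first one it sees.
		r.replicaID = toReplicaID
		return nil
	case toReplicaID > r.replicaID:
		// The store must have been removed and re-added: destroy this
		// object instead of changing its ID.
		r.destroyed = true
		return errRemoved
	case toReplicaID < r.replicaID:
		// Assumption: messages for an older incarnation are ignored.
		return errStale
	default:
		return nil
	}
}

func main() {
	r := &replica{}
	fmt.Println(r.handleMessage(3), r.replicaID)
	fmt.Println(r.handleMessage(5), r.replicaID, r.destroyed)
}
```

In this sketch a caller that receives `errRemoved` would create a fresh object for the new replica ID, which is what keeps any single object from applying commands under two different IDs.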

knz added the A-disaster-recovery and S-2-temp-unavailability labels Sep 4, 2019

tbg mentioned this issue Sep 4, 2019

storage: avoid (one) fatal error from splitPostApply #39796

Merged

andy-kimball mentioned this issue Sep 4, 2019

19.2 release blockers list #40447

Closed

53 tasks

tbg changed the title ~~storage+tpcc: import fixtures with direct ingestion triggers assertion failure~~ storage: splitPostApply can see tombstone for RHS Sep 6, 2019

dt mentioned this issue Sep 9, 2019

roachtest: import/tpch/nodes=32 failed #39115

Closed

ajwerner mentioned this issue Sep 10, 2019

roachtest: import/tpcc/warehouses=1000/nodes=32 failed #39072

Closed

ajwerner self-assigned this Sep 12, 2019

craig bot closed this as completed in 2020115 Sep 24, 2019
