storage: Raft state corruption when receiving snapshot #39604
👍 what are you trying next? I'd disable the deletion of the ingested SSTs (perhaps manually create an additional hardlink) and also create a Rocks checkpoint ( |
You might also have luck w/ a targeted failpoint in an ingestion snap unit test where you run |
39619: Revert "storage: build SSTs from KV_BATCH snapshot" r=nvanbenschoten a=tbg
This reverts commit b320ff5. In the above commit, we are starting both to ingest multiple SSTs at the same time, and additionally these SSTs contain range deletion tombstones. Both are firsts, and it turns out that there are some kinks to work out. The commit causes quite a number of failures, so reduce churn while we do so.
See #39604.
Release note: None
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
I (hopefully) found a pretty minimal unit test reproduction of this issue. Changing Running Reproduction diff:
Hmm, sounds weird. Does the problem disappear when you flush the memtable right after the ingestion (before checking LoadLastIndex)?
Maybe. I tried the repro and found that the unexpected key is in the WAL and not yet flushed to an SST. However, I couldn't find a range tombstone in the sstable that looks like it covers the key.
unexpected key: 0x01698a757266746c000000000000001000 (/Local/RangeID/2/u/RaftLog/logIndex:16)
found in WAL:
range tombstones:
Although, considering there's a point deletion for this key in the WAL, I couldn't guess why it shows up in |
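The unexpected key quoted above can be decoded by hand to confirm that it is a Raft log entry. The sketch below is only an illustration: the byte layout (a 0x01 local prefix, 'i', a single-byte range ID offset by 0x88, 'u', "rftl", an 8-byte big-endian index, and a trailing 0x00) is an assumption inferred from the pretty-printed form in the comment, and `decodeRaftLogKey` is a hypothetical helper, not a CockroachDB function.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
)

// raftLogKey holds the fields recovered from an unreplicated Raft log key.
type raftLogKey struct {
	RangeID  uint64
	LogIndex uint64
}

// decodeRaftLogKey parses a hex-encoded key shaped like the one in the thread.
// The layout is assumed (inferred from the pretty-printed key), and only
// range IDs that fit the assumed single-byte form are handled.
func decodeRaftLogKey(hexKey string) (raftLogKey, error) {
	b, err := hex.DecodeString(hexKey)
	if err != nil {
		return raftLogKey{}, err
	}
	if len(b) != 17 {
		return raftLogKey{}, errors.New("unexpected key length")
	}
	if b[0] != 0x01 || b[1] != 'i' {
		return raftLogKey{}, errors.New("not a range-ID local key")
	}
	if b[2] < 0x88 {
		return raftLogKey{}, errors.New("range ID is not in the assumed single-byte form")
	}
	if b[3] != 'u' || !bytes.Equal(b[4:8], []byte("rftl")) {
		return raftLogKey{}, errors.New("not an unreplicated Raft log key")
	}
	if b[16] != 0x00 {
		return raftLogKey{}, errors.New("missing trailing terminator byte")
	}
	return raftLogKey{
		RangeID:  uint64(b[2] - 0x88),
		LogIndex: binary.BigEndian.Uint64(b[8:16]),
	}, nil
}

func main() {
	k, err := decodeRaftLogKey("01698a757266746c000000000000001000")
	if err != nil {
		panic(err)
	}
	fmt.Printf("/Local/RangeID/%d/u/RaftLog/logIndex:%d\n", k.RangeID, k.LogIndex)
}
```

Under these assumptions the program prints the same pretty form as the comment, range ID 2 and log index 16.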
Root cause

We're not passing the global seqnum when reading the range tombstone block so they're all treated as having seqnum zero. During ingestion preparation the range tombstone meta-block gets accessed and (probably accidentally) put in block cache. At that time the global seqno has not been determined so the cached block has no global seqno. Later after the file's ingested and we read its range tombstone meta-block, it'll be retrieved from cache. That leads us to obtain a range tombstone meta-block with no global seqno. In that case, we use the actual seqnos stored in the range tombstones, which are all just zero, so the tombstones cover nothing.

Short-term plan

Ideally Jeffrey can re-land his feature prior to internship ending (end of next week) in order to (a) finish it, and (b) see if it turns up any other fun issues. To do this we are willing to apply some quick and specific fixes to our rocksdb fork even if they aren't easily upstreamable.

The original plan is to pass a flag all the way from file ingestion preparation to ReadRangeDelBlock() (https://github.com/facebook/rocksdb/blob/90cd6c2bb17a54e97c32856254fd666c9cab8dc5/table/block_based/block_based_table_reader.cc#L1398-L1402), indicating whether to set

Alternatively, rather than plumbing through a flag to not cache the range-del block, I wonder if that block should ever be cached for an sstable which has global_seqno == 0 and the range tombstones all have a sequence number of zero. (citation: @petermattis)

Longer-term plan

We could move the range tombstone meta-block read during

The objective here is for the caller (ingestion preparation) to be able to prevent caching of all blocks by (1) specifying
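The root cause described above can be reproduced in a toy model. This is a sketch under stated assumptions, not RocksDB code: `rangeDelBlock`, `blockCache`, `readRangeDelBlock`, and `ingestAndCheck` are hypothetical names, the cache is a plain map, and a tombstone is taken to cover a key when the key's seqnum is lower than the tombstone's effective seqnum.

```go
package main

import "fmt"

// rangeTombstone deletes keys in [Start, End) that are older than Seqnum.
type rangeTombstone struct {
	Start, End string
	Seqnum     uint64
}

// rangeDelBlock is a toy stand-in for a parsed range tombstone meta-block.
// GlobalSeqno is captured when the block is read, as in the bug description.
type rangeDelBlock struct {
	Tombstones  []rangeTombstone
	GlobalSeqno uint64
}

// covers reports whether any tombstone hides a key written at keySeqnum.
// A non-zero global seqno overrides the seqnum stored in each tombstone.
func (b rangeDelBlock) covers(key string, keySeqnum uint64) bool {
	for _, t := range b.Tombstones {
		seq := t.Seqnum
		if b.GlobalSeqno != 0 {
			seq = b.GlobalSeqno
		}
		if key >= t.Start && key < t.End && keySeqnum < seq {
			return true
		}
	}
	return false
}

// sstFile models an sstable whose global seqno is assigned at ingestion.
type sstFile struct {
	name        string
	tombstones  []rangeTombstone
	globalSeqno uint64
}

// blockCache maps a file name to its cached range-del block (hypothetical).
type blockCache map[string]rangeDelBlock

// readRangeDelBlock returns the cached block if present; otherwise it builds
// one from the file's current state and caches it only when fillCache is set.
func readRangeDelBlock(c blockCache, f *sstFile, fillCache bool) rangeDelBlock {
	if blk, ok := c[f.name]; ok {
		return blk
	}
	blk := rangeDelBlock{Tombstones: f.tombstones, GlobalSeqno: f.globalSeqno}
	if fillCache {
		c[f.name] = blk
	}
	return blk
}

// ingestAndCheck reads the block during preparation, assigns a global seqno,
// reads the block again, and reports whether an older key is covered.
func ingestAndCheck(fillCacheDuringPrepare bool) bool {
	cache := blockCache{}
	f := &sstFile{name: "ingested.sst", tombstones: []rangeTombstone{{"a", "z", 0}}}
	readRangeDelBlock(cache, f, fillCacheDuringPrepare) // preparation: no global seqno yet
	f.globalSeqno = 100                                 // assigned when the file is ingested
	blk := readRangeDelBlock(cache, f, true)
	return blk.covers("m", 50) // an existing key written at seqnum 50
}

func main() {
	fmt.Println("cached during prepare, key covered:", ingestAndCheck(true))
	fmt.Println("not cached during prepare, key covered:", ingestAndCheck(false))
}
```

In this model, caching the block during preparation leaves the later read with a stale block whose tombstones have seqnum zero, so the older key stays visible. Skipping the cache during preparation lets the later read pick up the assigned global seqno, which is the effect both proposed short-term fixes aim for.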
As discussed offline, the original sin is including

I believe the caching of the range-del block happens when ingestion determines the boundaries for the table. The ingestion code opens a range-del iterator on the sstable and then finds the lower and upper bounds. Note that it also finds the lower and upper bounds for point operations. So we're probably caching the first and last data block in an sstable as well.

It should be straightforward to test this hypothesis. If it is true, there is a more serious bug here which could already be affecting CockroachDB. A key in the first or last data block of an ingested sstable which is overwriting an existing key will actually appear older than the existing key. @ajkr or @jeffrey-xiao can you verify this hypothesis?
The logic that reads the first/last point keys in the sstable during ingestion preparation is here: https://github.com/facebook/rocksdb/blob/4c70cb730614388041b97a31ae2e5addb1279284/db/external_sst_file_ingestion_job.cc#L404-L438. Note it sets |
I like this solution too as having a table-wide, mutable attribute in the |
Thanks for tracking this down. It is reassuring that exactly this bug was already considered, though unfortunate that the solution didn't apply to the range-del block case.
cockroachdb/rocksdb#43 has a temporary fix for the underlying RocksDB bug.
I don't think we should close this issue until we have a non-temporary fix.
39689: storage: reintroduce building SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao
The final commit of #38932 was previously reverted due to an underlying bug in RocksDB with ingesting range deletion tombstones with a global seqno. See #39604 for discussion on the bug and cockroachdb/rocksdb#43 for the temporary short-term resolution of the bug.
Release note: None
Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
Actually @jeffrey-xiao might try it out. No worries if you run out of time - I can reclaim it later if needed.
Upstream fix is facebook/rocksdb#5719.
@petermattis: can this be closed out?
Currently on master, acceptance/bank/cluster-recovery would occasionally fail after (*Replica).applySnapshot’s call to r.store.engine.IngestExternalFiles. Adding some logging statements around this area, it looks like the old Raft entries are not being deleted, which causes the last index to diverge from the truncated index:
The deletion of the Raft entries should be handled by the range deletion of the unreplicated range-id SST:
cockroach/pkg/storage/replica_raftstorage.go
Lines 847 to 853 in b320ff5
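The divergence described above can be illustrated with a toy model of the Raft log. This is a sketch, not CockroachDB code: `engine`, `deleteRange`, `loadLastIndex`, and `applySnapshot` here are hypothetical stand-ins, the index values are made up, and the model assumes that the last index is the highest log entry still present, falling back to the truncated index when the log is empty.

```go
package main

import (
	"fmt"
	"math"
)

// engine is a toy store holding Raft log entries keyed by log index.
type engine struct {
	entries map[uint64]string
}

// newEngineWithLog creates an engine with entries for indexes lo..hi inclusive.
func newEngineWithLog(lo, hi uint64) *engine {
	e := &engine{entries: map[uint64]string{}}
	for i := lo; i <= hi; i++ {
		e.entries[i] = fmt.Sprintf("entry-%d", i)
	}
	return e
}

// deleteRange removes every entry with an index in [start, end).
func (e *engine) deleteRange(start, end uint64) {
	for idx := range e.entries {
		if idx >= start && idx < end {
			delete(e.entries, idx)
		}
	}
}

// loadLastIndex returns the highest index still present in the log, or the
// truncated index when no entries remain (assumed behaviour for this sketch).
func (e *engine) loadLastIndex(truncatedIndex uint64) uint64 {
	var last uint64
	found := false
	for idx := range e.entries {
		if !found || idx > last {
			last, found = idx, true
		}
	}
	if !found {
		return truncatedIndex
	}
	return last
}

// applySnapshot models snapshot application: the old log should be cleared by
// a range deletion, after which the truncated index equals the snapshot index.
// tombstoneEffective=false models a range deletion that covers nothing.
func applySnapshot(e *engine, snapIndex uint64, tombstoneEffective bool) (last, truncated uint64) {
	if tombstoneEffective {
		e.deleteRange(0, math.MaxUint64)
	}
	truncated = snapIndex
	return e.loadLastIndex(truncated), truncated
}

func main() {
	last, truncated := applySnapshot(newEngineWithLog(10, 16), 25, false)
	fmt.Printf("ineffective tombstone: last=%d truncated=%d\n", last, truncated)

	last, truncated = applySnapshot(newEngineWithLog(10, 16), 25, true)
	fmt.Printf("effective tombstone:   last=%d truncated=%d\n", last, truncated)
}
```

When the range deletion has no effect, the stale entry at index 16 is reported as the last index even though the snapshot truncated the log at 25. When it works, both values agree.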
However, when I replace this range deletion tombstone with point deletions, I'm unable to reproduce the failure on acceptance/bank/cluster-recovery (~80 successful runs), which leads me to believe that it's some strange happening with the ingestion of range deletion tombstones.
CC @nvanbenschoten @tbg @ajkr