Mark Deleted Snapshot Directories with Tombstones #40228
Conversation
Pinging @elastic/es-distributed
} catch (IOException ioe) {
    // a different IOException occurred while trying to delete - will just log the issue for now
    logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
        "but failed to clean up its index folder.", metadata.name(), indexId), ioe);
    deleteSuccess = false;
}
if (deleteSuccess) {
Not adding any logic to deal with the failure scenario here because that is a fairly complex problem to automate (basically you'd want to recursively delete an index path once you're sure no more blobs will be added under it; this is trivial to do manually but non-trivial to do automatically).
    deleteSuccess = false;
}
if (deleteSuccess) {
    indicesBlobContainer.deleteBlob(tombstoneBlob(indexId));
One alternative here could be to not delete these ever:
- The storage cost of keeping them around would be minimal
- It would solve the issue of "rogue" data nodes writing shard data for an already deleted snapshot and us missing this situation because they do so after this code finishes
@ywelsch wdyt? IMO this is one step we could safely take towards a more controllable future; even though it doesn't fix anything right here and now (apart from the test), it at least enables some control of the situation in the short term.
Closing here; a more complete solution to stale data on master failover that incorporates the ideas here is incoming in https://github.com/elastic/elasticsearch/compare/master...original-brownbear:delete-lock-via-cs?expand=1
Marking this WIP for now as it's just a suggestion for how to address the issue of failing to delete index folders in the short term. This is not an actual fix for the situation, but rather a way of enabling a future fix in a safe way!
The issue with dangling index files in the blob store is that we delete the indices from the metadata before we delete the actual index files. So if deleting the index files fails and there are no more references to the index in any snapshot metadata, the index folder will never be deleted.
Enabling an automatic cleanup of these unreferenced indices in a safe way is hard. Technically it could be done by listing all index folders, reading all the snapshot metadata, and marking as stale every index folder that isn't referenced in any metadata. The trouble with this approach is that eventually consistent blob stores like S3 do not offer a consistent view of the metadata and the list of index folders, so an index folder could erroneously be assumed to be unreferenced because the metadata referencing it was not yet readable.
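As a minimal sketch (using hypothetical helpers listIndexFolders, referencedIndexFolders and deleteIndexFolder rather than the real repository API), the naive cleanup would look something like this, with the delete in the loop being exactly the unsafe step:

```java
import java.util.Set;

public class NaiveStaleIndexCleanup {

    /** Hypothetical repository abstraction, not the real Elasticsearch API. */
    interface Repository {
        Set<String> listIndexFolders();           // all index folder names in the blob store
        Set<String> referencedIndexFolders();     // folders referenced by any snapshot metadata
        void deleteIndexFolder(String indexId);   // recursively delete one index folder
    }

    static void cleanup(Repository repo) {
        Set<String> referenced = repo.referencedIndexFolders();
        for (String indexId : repo.listIndexFolders()) {
            if (referenced.contains(indexId) == false) {
                // UNSAFE on eventually consistent stores: the metadata that references this
                // folder may simply not be visible yet, so this can delete live data.
                repo.deleteIndexFolder(indexId);
            }
        }
    }
}
```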
Fortunately, S3 offers one piece of consistency that we can exploit to make such an automatic cleanup safe after all: a read after a write is consistent as long as no read has preceded the write (see https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel).
So the following would be a safe way of identifying most stale index folders without running the risk of false positives:
Instead of the current order of operations:
1. Write the updated repository metadata that no longer references the deleted snapshot.
2. Delete the index folders that are no longer referenced by any snapshot.

this PR does the following (roughly sketched in the code below):
1. Write the updated repository metadata.
2. Write a tombstone blob for every index folder that is about to be deleted.
3. Delete the now unreferenced index folders.
4. Delete the tombstone blob of every index folder that was successfully deleted.
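Roughly, and with hypothetical helper names (writeUpdatedRepositoryMetadata, tombstoneBlob, deleteIndexFolder) standing in for the actual repository code, that order of operations would look like this:

```java
import java.io.IOException;
import java.util.List;

public class SnapshotDeleteOrderingSketch {

    /** Hypothetical repository abstraction, not the real BlobStoreRepository API. */
    interface Repo {
        void writeUpdatedRepositoryMetadata() throws IOException;  // metadata no longer referencing the snapshot
        void writeBlob(String blobName) throws IOException;        // write an (empty) marker blob
        void deleteBlob(String blobName) throws IOException;
        void deleteIndexFolder(String indexId) throws IOException; // recursively delete one index folder
    }

    /** Hypothetical naming scheme for the tombstone blob of an index folder. */
    static String tombstoneBlob(String indexId) {
        return "tombstone-" + indexId;
    }

    static void deleteSnapshot(Repo repo, List<String> unreferencedIndices) throws IOException {
        // 1. write the updated metadata first, exactly as before
        repo.writeUpdatedRepositoryMetadata();
        for (String indexId : unreferencedIndices) {
            // 2. mark the folder for deletion only *after* the metadata write
            repo.writeBlob(tombstoneBlob(indexId));
            boolean deleteSuccess = true;
            try {
                // 3. attempt the actual folder deletion
                repo.deleteIndexFolder(indexId);
            } catch (IOException ioe) {
                // leave the tombstone in place so the folder stays marked as stale
                deleteSuccess = false;
            }
            if (deleteSuccess) {
                // 4. only a fully deleted folder gets its tombstone removed
                repo.deleteBlob(tombstoneBlob(indexId));
            }
        }
    }
}
```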
-> any tombstones written will be safe since they are written after the updated metadata was written.
-> this would allow for a consistent check for stale index folders: one could first find the potentially stale folders by listing the index folders and cross-checking them against the metadata, and only then check whether the tombstone blob exists (if the existence of this blob is never checked in any other situation, that check is a consistent operation, as explained in the linked S3 docs).
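A sketch of what such a cleanup check could look like, again with hypothetical helper names; the only assumption is that the tombstone blob of an index is never read anywhere else, so its existence check is the first read after the write:

```java
import java.io.IOException;
import java.util.Set;

public class TombstoneAwareCleanupSketch {

    /** Hypothetical repository abstraction, not the real Elasticsearch API. */
    interface Repo {
        Set<String> listIndexFolders() throws IOException;        // all index folder names
        Set<String> referencedIndexFolders() throws IOException;  // folders referenced by any snapshot metadata
        boolean blobExists(String blobName) throws IOException;   // existence check for a single blob
        void deleteIndexFolder(String indexId) throws IOException;
    }

    /** Must match the naming scheme used when the tombstone was written (hypothetical). */
    static String tombstoneBlob(String indexId) {
        return "tombstone-" + indexId;
    }

    static void cleanupStaleFolders(Repo repo) throws IOException {
        Set<String> referenced = repo.referencedIndexFolders();
        for (String indexId : repo.listIndexFolders()) {
            if (referenced.contains(indexId)) {
                continue; // clearly still in use
            }
            // The folder looks unreferenced, but the listing/metadata view may be stale.
            // Only treat it as stale if a tombstone confirms a deletion was attempted;
            // as long as this is the first and only read of that blob, the check is
            // consistent on S3 per the linked consistency model.
            if (repo.blobExists(tombstoneBlob(indexId))) {
                repo.deleteIndexFolder(indexId);
            }
        }
    }
}
```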
This has some limitations and would still fail to correctly mark stale folders in some scenarios, for example an index folder that already became stale before tombstones were introduced, or a master that fails between writing the updated metadata and writing the tombstones.
-> This still seems like a very safe approach that we could backport all the way to 6.x without risk, and that would allow writing a safe cleanup utility/script for stale blobs.