
Mark Deleted Snapshot Directories with Tombstones #40228

Closed
wants to merge 7 commits

Conversation

@original-brownbear (Member) commented Mar 19, 2019

Marking this WIP for now as it's just a suggestion for how to address the failing-to-delete-indices issue in the short term. This is not an actual fix for the situation, but rather a way of enabling a future fix in a safe way!

The issue with dangling index files in the blobstore is that we delete the indices from the metadata before we delete the actual index files. So if deleting the index files fails and there are no more references to the index in any snapshot metadata, it will never be deleted.

Enabling an automatic cleanup of these unreferenced indices in a safe way is hard. Technically it can be done by listing all index folders, reading all metadata, and treating as stale every index folder that is not referenced in any metadata. The trouble with this approach is that eventually consistent blob stores like S3 do not offer a consistent view of the metadata and the list of index folders. So one could run into the situation where an index folder is erroneously assumed to be unreferenced because the metadata referencing it was not yet readable.

Fortunately, S3 offers one piece of consistency that we can exploit to make such an automatic cleanup safe after all: reads after writes are consistent as long as no read has preceded the write (see https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel). So the following would be a safe way of identifying most stale index folders without running the risk of false positives:

Instead of currently doing:

  1. Update metadata (removing snapshots from it)
  2. Loop over all index blobs and delete them ignoring errors

this PR does (see the sketch after this list):

  1. Update metadata (removing snapshots from it)
  2. Write tombstone blob for each index that is now unreferenced
  3. Loop over all index blobs and keep track of errors
  4. Delete tombstone only if no errors were registered for a blob
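
To make the ordering concrete, here is a rough sketch of that flow in Java. This is not the PR's actual diff: deleteIndexBlobs() is a hypothetical stand-in for the real loop over the index folder, tombstoneBlob() mirrors the helper visible in the diff further down (its blob naming here is made up), and only basic BlobContainer calls are assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.elasticsearch.common.blobstore.BlobContainer;
import org.elasticsearch.repositories.IndexId;

// Hedged sketch of the tombstone-based delete flow, not the actual change in this PR.
final class TombstoneDeleteSketch {

    void deleteUnreferencedIndex(BlobContainer indicesBlobContainer, IndexId indexId) throws IOException {
        // Step 2: the repository metadata has already been updated and no longer references
        // this index, so write its tombstone now. Writing strictly after the metadata update
        // is what keeps the marker safe on an eventually consistent store like S3.
        indicesBlobContainer.writeBlob(tombstoneBlob(indexId), new ByteArrayInputStream(new byte[0]), 0, true);

        // Step 3: delete the blobs under the index folder, tracking failures instead of
        // silently ignoring them.
        final boolean deleteSuccess = deleteIndexBlobs(indexId);

        // Step 4: only drop the tombstone if every blob was deleted, so a partial failure
        // leaves a durable marker behind for a later cleanup pass.
        if (deleteSuccess) {
            indicesBlobContainer.deleteBlob(tombstoneBlob(indexId));
        }
    }

    private boolean deleteIndexBlobs(IndexId indexId) {
        return true; // placeholder for the real recursive delete of indices/<indexId>/...
    }

    private String tombstoneBlob(IndexId indexId) {
        return "tombstone-" + indexId.getId(); // hypothetical naming, may differ from the PR
    }
}
```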

Any tombstone written this way is safe, because it is only written after the updated metadata has been written.
This also allows a consistent check for stale index folders: first find the candidate folders by listing the index folders and cross-checking them against the metadata, then check whether the tombstone blob exists. As long as the tombstone blob is never read in any other situation, this existence check is a consistent operation, as explained in the linked S3 docs. A sketch of such a check follows.
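
Correspondingly, a cleanup utility could combine the two checks roughly as sketched below. How the folder listing and the set of referenced index ids are obtained is left out (they are passed in as plain sets); tombstoneBlobName() is the same kind of hypothetical naming helper as above, and the only repository API assumed is an existence check on the blob container.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.elasticsearch.common.blobstore.BlobContainer;

// Hedged sketch of the cleanup-side check, not an actual utility shipped with this PR.
final class StaleIndexFolderCheck {

    /**
     * @param indexFolders       folder names found by listing the indices/ path
     * @param referencedIndexIds index ids referenced by any snapshot in the repository metadata
     */
    static Set<String> findStaleIndexFolders(BlobContainer indicesBlobContainer,
                                             Set<String> indexFolders,
                                             Set<String> referencedIndexIds) throws IOException {
        final Set<String> stale = new HashSet<>();
        for (String folder : indexFolders) {
            // First filter: a folder is only a candidate if no snapshot metadata references it.
            if (referencedIndexIds.contains(folder)) {
                continue;
            }
            // Second filter: only trust the candidate if its tombstone exists. The tombstone is
            // written strictly after the metadata update and is never read anywhere else, so
            // this existence check is consistent on S3 (read-after-write consistency).
            if (indicesBlobContainer.blobExists(tombstoneBlobName(folder))) {
                stale.add(folder);
            }
        }
        return stale;
    }

    private static String tombstoneBlobName(String indexFolder) {
        return "tombstone-" + indexFolder; // hypothetical naming, may differ from the PR
    }
}
```
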
This approach has some limitations and would fail to correctly mark stale folders in two scenarios:

  • Writing a tombstone blob fails after the metadata has already been updated (possible, but a lot less likely than a failure during the subsequent, possibly long-running delete operation that loops over all the files in the index folder).
  • A data node writes a blob for a given index/shard after the master node has completed all the steps successfully and removed the tombstone (this can be overcome by never deleting tombstone blobs, as explained inline).

All in all, this still seems like a very safe approach that we could backport all the way to 6.x without risk, and it would allow writing a safe cleanup utility/script for stale blobs.

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

} catch (IOException ioe) {
    // a different IOException occurred while trying to delete - will just log the issue for now
    logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
        "but failed to clean up its index folder.", metadata.name(), indexId), ioe);
    deleteSuccess = false;
}
if (deleteSuccess) {
original-brownbear (Member, Author) commented on the snippet above:
Not adding any logic to deal with the failure scenario here because that is a fairly complex problem to automate (basically you'd want to recursively delete an index path once you're sure no more blobs will be added under it; that is trivial to do manually but non-trivial to do automatically).

    deleteSuccess = false;
}
if (deleteSuccess) {
    indicesBlobContainer.deleteBlob(tombstoneBlob(indexId));
original-brownbear (Member, Author) commented on the snippet above:

One alternative here could be to never delete these at all:

  • The storage cost of keeping them around would be minimal
  • It would solve the issue of "rogue" data nodes writing shard data for an already-deleted snapshot after this code finishes, a situation we would otherwise miss
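
Roughly, and reusing the hypothetical tombstoneBlobName() helper from the sketches in the description above, the per-folder check of a cleanup pass would then reduce to looking for the marker alone:

```java
// Hedged sketch: if tombstones are kept forever (and assuming index folder ids are not
// reused once deleted), seeing the marker is enough to treat a folder as stale, which also
// covers blobs written late by a "rogue" data node after the original delete completed.
static boolean isStaleIndexFolder(BlobContainer indicesBlobContainer, String indexFolder) throws IOException {
    return indicesBlobContainer.blobExists(tombstoneBlobName(indexFolder));
}
```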

@original-brownbear (Member, Author)

@ywelsch wdyt? IMO this is one step we could safely take towards a more controllable future. Even though it doesn't fix anything right here and now (apart from the test), it at least enables some control over the situation in the short term.

@original-brownbear (Member, Author)

Closing this: a more complete solution to stale data on master failover that incorporates the ideas here is incoming in https://github.com/elastic/elasticsearch/compare/master...original-brownbear:delete-lock-via-cs?expand=1

Successfully merging this pull request may close these issues.

[CI] DedicatedClusterSnapshotRestoreIT testSnapshotWithStuckNode failed
5 participants