[Remote Store] Fix replica failing to download segment files from remote store - NoSuchFileException #11455

Open
linuxpi opened this issue Dec 4, 2023 · 0 comments
Labels
enhancement Enhancement or improvement to existing feature or request Storage:Durability Issues and PRs related to the durability framework

Comments

Collaborator

linuxpi commented Dec 4, 2023

Describe the bug

  • While a replica is hydrating from the remote store, the primary can delete segment files from the remote store if a merge occurs and the old segment files are marked for deletion.
  • We need to ensure the replica shard doesn't fail because it cannot fetch old segment files that the primary shard has marked stale.
  • Today, as a safeguard, we retain all segment files referenced by the last N metadata files. This doesn't guarantee we won't run into this issue, but since N is dynamically configurable, we have a mitigation if a cluster faces it.
  • However, this keeps stale segment files in the remote store unnecessarily, which increases remote store usage and cost.
  • With a proper solution in place, we can also remove the current logic that retains segment files for the last N metadata files and keep only the files referenced by the latest one.
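
To make the trade-off in the retention safeguard concrete, here is a minimal sketch in Python. It is not the OpenSearch implementation: `files_to_retain`, `stale_files`, the file names, and the model of a metadata file as a plain set of segment file names are all assumptions made for illustration.

```python
from typing import List, Set


def files_to_retain(metadata_history: List[Set[str]], last_n: int) -> Set[str]:
    """Union of segment files referenced by the newest `last_n` metadata files.

    `metadata_history` is ordered oldest to newest (hypothetical model).
    """
    if last_n < 1:
        raise ValueError("last_n must be at least 1")
    retained: Set[str] = set()
    for referenced in metadata_history[-last_n:]:
        retained |= referenced
    return retained


def stale_files(remote_files: Set[str], metadata_history: List[Set[str]], last_n: int) -> Set[str]:
    """Files in the remote store that the retention rule would allow deleting."""
    return remote_files - files_to_retain(metadata_history, last_n)


# Three uploads; a merge after the first one replaces _0 and _1 with _2.
history = [
    {"_0.si", "_1.si"},
    {"_2.si"},
    {"_2.si", "_3.si"},
]
remote = {"_0.si", "_1.si", "_2.si", "_3.si"}

stale_n1 = stale_files(remote, history, last_n=1)
stale_n3 = stale_files(remote, history, last_n=3)

# A replica that started hydrating from the oldest metadata file still needs _0 and _1.
replica_needs = history[0]
replica_at_risk_n1 = bool(replica_needs & stale_n1)
replica_at_risk_n3 = bool(replica_needs & stale_n3)
```

With `last_n=1` the files the replica still needs become deletable, while `last_n=3` protects them at the cost of keeping stale files around. A larger N narrows the window but never closes it, since a slow replica can always fall more than N metadata files behind.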

Possible Solutions

  • We can acquire locks on the segment files before the replica starts hydrating from the remote store, to make sure the files are not deleted there. We already have a framework in place for this behavior, which is used by shallow snapshots with remote store.
  • An alternative might be to automatically restart recovery on the replica if it fails with NoSuchFileException. But this might not be deterministic and can also lead to false positives.
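
The lock-based option could look roughly like the sketch below. It is an in-memory Python model only: `RemoteSegmentLocks`, `delete_stale`, and the holder name are hypothetical and do not reflect the existing lock framework's API. It also ignores atomicity between the lock check and the delete, which a real implementation would have to handle.

```python
import threading
from collections import defaultdict
from typing import Dict, Iterable, Set


class RemoteSegmentLocks:
    """Hypothetical in-memory stand-in for a remote store lock manager."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        # Maps a segment file name to the set of holders that locked it.
        self._holders: Dict[str, Set[str]] = defaultdict(set)

    def acquire(self, files: Iterable[str], holder: str) -> None:
        """Record that `holder` needs every file in `files` to stay in place."""
        with self._mutex:
            for name in files:
                self._holders[name].add(holder)

    def release(self, holder: str) -> None:
        """Drop all locks held by `holder`, e.g. once recovery has finished."""
        with self._mutex:
            for name in list(self._holders):
                self._holders[name].discard(holder)
                if not self._holders[name]:
                    del self._holders[name]

    def is_locked(self, name: str) -> bool:
        with self._mutex:
            return bool(self._holders.get(name))


def delete_stale(remote: Set[str], stale: Set[str], locks: RemoteSegmentLocks) -> Set[str]:
    """Delete stale files that nobody holds a lock on; return what was deleted."""
    deletable = {name for name in stale if not locks.is_locked(name)}
    remote -= deletable  # mutates the caller's set in place
    return deletable


remote = {"_0.si", "_1.si", "_2.si"}
locks = RemoteSegmentLocks()

# The replica locks the files it is about to download before hydration starts.
locks.acquire({"_0.si", "_1.si"}, holder="replica-1")
deleted_during_recovery = delete_stale(remote, {"_0.si", "_1.si"}, locks)

# After recovery completes, the locks are released and cleanup can proceed.
locks.release("replica-1")
deleted_after_recovery = delete_stale(remote, {"_0.si", "_1.si"}, locks)
```

While the replica holds its locks, the primary's cleanup skips the stale files. Once the locks are released, the same cleanup call removes them, so only the latest metadata file's segments need to be retained by default.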
@linuxpi linuxpi added bug Something isn't working untriaged Storage:Durability Issues and PRs related to the durability framework enhancement Enhancement or improvement to existing feature or request and removed bug Something isn't working untriaged labels Dec 4, 2023
@ashking94 ashking94 moved this from 🆕 New to Ready To Be Picked in Storage Project Board Apr 18, 2024
Projects
Status: Ready To Be Picked