Use Cluster State to Track Repository Generation #49729
Conversation
Step on the road to elastic#49060. This commit adds the logic to keep track of a repository's generation across repository operations.
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
@@ -137,7 +140,7 @@ public void testRetrieveSnapshots() throws Exception {

public void testReadAndWriteSnapshotsThroughIndexFile() throws Exception {
    final BlobStoreRepository repository = setupRepo();

    final long pendingGeneration = getPendingGeneration(repository);
The generation here will now be something other than -1
because we are using the same repository name across tests. I intentionally didn't add a repository cleanup step between tests here, since this serves as a neat test (and illustration) of how the pending generation is kept consistent no matter what happens to the repo contents.
@@ -497,11 +497,10 @@ public void testSnapshotWithStuckNode() throws Exception {
logger.info("--> Go through a loop of creating and deleting a snapshot to trigger repository cleanup");
client().admin().cluster().prepareCleanupRepository("test-repo").get();

-// Subtract four files that will remain in the repository:
+// Expect two files to remain in the repository:
This whole business of keeping a "backup" index-N has long been fairly obsolete. At best it may have saved some repository status API calls from errors on S3 (when the most recent index-N would not show up in a listing ... but obviously then also serving old state anyway). With a consistent way of tracking the generation for writes now, I don't see any point in keeping this behavior around, and I simplified the deletion of old index-N blobs accordingly in production code.
@ywelsch @tlrx let me know what you think about splitting #49060 in half this way.
I've left some comments and questions. I would prefer to reuse RepositoryMetaData for this, just so that we have all repo-related information in one location.
@@ -114,6 +115,7 @@ public ClusterModule(Settings settings, ClusterService clusterService, List<Clus
public static List<Entry> getNamedWriteables() {
    List<Entry> entries = new ArrayList<>();
    // Cluster State
+   registerClusterCustom(entries, RepositoriesState.TYPE, RepositoriesState::new, RepositoriesState::readDiffFrom);
why not use the RepositoryMetaData for this? It is already persisted in the cluster state and even survives full-cluster restarts. This one is a bit odd, as it now has to keep a map for each repo again, and needs clean-up when repos are added / removed.
Hmm, I was thinking we may not actually want to persist this ... but now that I think about S3 and the pending generation, we actually probably want to ... will see what I can do about the RepositoryMetaData here :)
@ywelsch hmm on second thought:

"This one is a bit odd, as it now has to keep a map for each repo again, and needs clean-up when repos are added / removed."

This seems to be the only real downside though? And it's not much of a downside compared to the alternative: if I move this stuff into RepositoryMetaData, I get some new complications from having to adjust RepositoriesService to use custom comparisons on RepositoryMetaData that ignore the generation-related fields when deciding whether a repo has changed.

The other downside is that the generation logic then leaks into all the Repository implementations in a bunch of new spots. CCR and all the wrapping repositories don't really care about the generation; it's entirely BlobStoreRepository specific. Would we then adjust the serialization of e.g. GetRepositoriesResponse so that it doesn't include the generation? We also lose some incrementality/efficiency in the ClusterState serialization as a result of mixing somewhat dynamic and pretty static things in it, don't we?

I still think persisting this state across full-cluster restarts is a good thing, but I'm not so sure about using RepositoryMetaData here, having now tried to implement it in practice.

Do you still think it's worth it to make adjustments to things like GetRepositoriesResponse accordingly?
RepositoriesService to now use some custom comparisons on RepositoryMetaData that ignores the generation related fields to see if a repo has changed
I think this is ok (it's similar to how we have it with most items in the cluster state). Further down the line, we should not reinit the repository implementation when there is an update from the CS, but leave it to the implementation to decide whether it needs complete reinitialization or not. This will allow us to do dynamic throttling (e.g. allow the user to dynamically change max_restore_bytes_per_sec so that it takes effect directly on an ongoing restore). It's a bit of a bigger investment right now, but will hopefully pay off further down the line.
Regarding serialization, we can make RepositoryMetaData implement Diffable to have it more smartly serialize changes (this would benefit any other field in it as well). I don't expect this to matter much though.
I also think we won't need to adapt GetRepositoriesResponse but can just conditionally expose the additional fields by parameterizing toXContent if need be (I'm even fine exposing these fields on the repositories API and not having two variants).
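The "custom comparison ignoring generation fields" discussed above can be sketched with a minimal, self-contained stand-in for `RepositoryMetaData`; the field set and the method name `equalsIgnoreGenerations` are assumptions for illustration, not the actual Elasticsearch class:

```java
// Hypothetical stand-in for RepositoryMetaData; fields and method names are
// illustrative, not the real class.
class RepoMeta {
    final String name;
    final String type;
    final long generation;        // last safe (fully written) generation
    final long pendingGeneration; // highest generation a write was attempted for

    RepoMeta(String name, String type, long generation, long pendingGeneration) {
        this.name = name;
        this.type = type;
        this.generation = generation;
        this.pendingGeneration = pendingGeneration;
    }

    // Change detection for (re-)initializing repositories: the generation fields
    // change on every snapshot operation, so comparing them here would needlessly
    // tear down and recreate the repository instance on each snapshot.
    boolean equalsIgnoreGenerations(RepoMeta other) {
        return name.equals(other.name) && type.equals(other.type);
    }

    public static void main(String[] args) {
        RepoMeta before = new RepoMeta("test-repo", "fs", 4, 4);
        RepoMeta after = new RepoMeta("test-repo", "fs", 5, 6);
        // Only the generations differ, so this is not treated as a repo change.
        System.out.println(before.equalsIgnoreGenerations(after)); // prints "true"
    }
}
```

The design point is that RepositoriesService would call this relaxed comparison instead of a full equals when deciding whether to rebuild a repository on a cluster state update.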
if (currentGen != expectedGen) {
    // the index file was updated by a concurrent operation, so we were operating on stale
    // repository data
    throw new RepositoryException(metadata.name(), "concurrent modification of the index-N file, expected current generation [" +
listener.onFailure(new RepositoryException...)?
Sure, seems nicer :)
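The change being agreed on here, failing the listener instead of throwing from an async context, can be sketched with a minimal stand-in for `ActionListener`; the interface and the use of `IllegalStateException` (in place of the real `RepositoryException`) are assumptions for illustration:

```java
// Minimal stand-in for Elasticsearch's ActionListener; illustrative only.
interface Listener<T> {
    void onResponse(T value);
    void onFailure(Exception e);
}

class GenerationCheck {
    // Throwing from an async callback can leave the caller hanging if nothing
    // catches and forwards the exception; failing the listener guarantees the
    // caller is notified of the outcome exactly once.
    static void checkGeneration(long currentGen, long expectedGen, Listener<Long> listener) {
        if (currentGen != expectedGen) {
            // stand-in for: new RepositoryException(metadata.name(), "concurrent modification ...")
            listener.onFailure(new IllegalStateException(
                "concurrent modification of the index-N file, expected current generation ["
                    + expectedGen + "] but it was [" + currentGen + "]"));
            return;
        }
        listener.onResponse(currentGen);
    }

    public static void main(String[] args) {
        checkGeneration(3, 5, new Listener<Long>() {
            public void onResponse(Long gen) { System.out.println("ok " + gen); }
            public void onFailure(Exception e) { System.out.println("failed: " + e.getMessage()); }
        });
    }
}
```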
server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
final RepositoriesState.State repoState = Optional.ofNullable(state.state(repoName)).orElseGet(
    () -> RepositoriesState.builder().putState(repoName, expectedGen, expectedGen).build().state(repoName));
if (repoState.pendingGeneration() != repoState.generation()) {
    logger.warn("Trying to write new repository data of generation [{}] over unfinished write, repo is in state [{}]",
Why do we warn here? Shouldn't this just be info level logging?
Yea, this isn't such a bad state :) made it info
@Override
public void onFailure(String source, Exception e) {
    l.onFailure(e);
should we wrap the exception here to add more context where exactly something went wrong (same for setting pending generation)?
Yea added some wrapping here
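The wrapping discussed here can be sketched as follows; the simplified `RepositoryException` stand-in and the exact message format are assumptions, not the PR's literal code:

```java
// Simplified stand-in for RepositoryException; illustrative only.
class RepositoryException extends RuntimeException {
    RepositoryException(String repository, String message, Throwable cause) {
        super("[" + repository + "] " + message, cause);
    }
}

class FailureWrapping {
    // Instead of forwarding the raw cluster-state-update exception via
    // l.onFailure(e), wrap it so the failure identifies the repository and the
    // step (here: setting the pending generation) that went wrong.
    static Exception wrap(String repoName, long pendingGen, Exception cause) {
        return new RepositoryException(repoName,
            "Failed to execute cluster state update [set pending repository generation ["
                + pendingGen + "]]", cause);
    }

    public static void main(String[] args) {
        Exception e = wrap("test-repo", 7, new RuntimeException("node closed"));
        System.out.println(e.getMessage());
    }
}
```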
// Step 1: Set repository generation state to the next possible pending generation
final StepListener<Long> setPendingStep = new StepListener<>();
clusterService.submitStateUpdateTask("set pending repository generation", new ClusterStateUpdateTask() {
maybe mention generation + repo here
Sure added for both CS update steps
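The suggestion matters because a task's source string shows up in pending-tasks output and slow cluster-state-update logs. A tiny sketch of the more descriptive source (the exact format is an assumption):

```java
class TaskSources {
    // Before: "set pending repository generation" -- no way to tell which repo.
    // After: include the repository name and the generation being set.
    static String setPendingSource(String repoName, long pendingGen) {
        return "set pending repository generation [" + repoName + "][" + pendingGen + "]";
    }

    public static void main(String[] args) {
        System.out.println(setPendingSource("test-repo", 42));
        // prints: set pending repository generation [test-repo][42]
    }
}
```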
        repoState.pendingGeneration(), repoState);
}
if (expectedGen != RepositoryData.EMPTY_REPO_GEN && expectedGen != repoState.generation()) {
    throw new IllegalStateException(
when do we expect this to happen?
Come to think of it ... it's impossible already :) Removing it in favor of an assert. We already make sure not to be in a bad spot here in safeRepositoryData anyway.
});

// Step 2: Write new index-N blob to repository and update index.latest
setPendingStep.whenComplete(newGen -> threadPool().generic().execute(ActionRunnable.wrap(listener, l -> {
why use the generic threadpool here, and not the snapshot one?
Right ... moved to the snapshot pool.
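The distinction matters because a dedicated bounded pool keeps snapshot blob I/O from fanning out without limit alongside unrelated work. A self-contained sketch of routing the follow-up step to such a pool; the pool size and naming are assumptions, not Elasticsearch's actual thread pool configuration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SnapshotPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for threadPool().executor(ThreadPool.Names.SNAPSHOT): a small
        // fixed pool that bounds how many blob store writes run concurrently.
        ExecutorService snapshotPool = Executors.newFixedThreadPool(2);

        // Stand-in for the "write new index-N blob" step that is scheduled once
        // the pending-generation cluster state update has completed.
        snapshotPool.execute(() ->
            System.out.println("writing index-N on " + Thread.currentThread().getName()));

        snapshotPool.shutdown();
        snapshotPool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```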
Answered + addressed all comments; working on the below now:

"I've left some comments and questions. I would prefer to reuse RepositoryMetaData for this"
@ywelsch thanks, addressed all comments now. Let me know what you think about the
LGTM, thanks Armin
LGTM (Left one comment about an exception message)
Thanks so much Yannick + Tanguy! Just one more step left now in this ordeal I think :)
This moves the blob store repository to only use the information available in the cluster state for loading `RepositoryData`, without falling back to listing to determine a repository's generation. Relates elastic#49729 Closes elastic#38941
Step on the road to elastic#49060. This commit adds the logic to keep track of a repository's generation across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine. It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same `index-N` are not a possibility any longer. With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents. The logic for that will be introduced in elastic#49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In elastic#49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
Follow up to #49729. This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories. This saves 2-3 list operations on each snapshot create and delete operation. It also makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases. This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` fails from the assumed generation.