Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Cluster State to Track Repository Generation #49729

Merged

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented Nov 29, 2019

Step on the road to #49060.

This commit adds the logic to keep track of a repository's generation
across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine.

It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no index-N will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same index-N are not a possibility any longer.

With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents.
The logic for that will be introduced in #49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In #49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.

Step on the road to elastic#49060.

This commit adds the logic to keep track of a repository's generation
across repository operations.
@original-brownbear original-brownbear added WIP :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Nov 29, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear original-brownbear marked this pull request as ready for review November 30, 2019 19:33
@@ -137,7 +140,7 @@ public void testRetrieveSnapshots() throws Exception {

public void testReadAndWriteSnapshotsThroughIndexFile() throws Exception {
final BlobStoreRepository repository = setupRepo();

final long pendingGeneration = getPendingGeneration(repository);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generation here will now be something other than -1 because we are using the same repository name across tests. I intentionally didn't add a repository cleanup step between tests here, since this serves as a neat test (and illustration) of how the pending generation is kept consistent no matter what happens to the repo contents.

@@ -497,11 +497,10 @@ public void testSnapshotWithStuckNode() throws Exception {
logger.info("--> Go through a loop of creating and deleting a snapshot to trigger repository cleanup");
client().admin().cluster().prepareCleanupRepository("test-repo").get();

// Subtract four files that will remain in the repository:
// Expect two files to remain in the repository:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole business of keeping a "backup" index-N has long been fairly obsolete. At best it may have saved some repository status API calls from erros on S3 (when the most recent index-N would not show up in a listing ... but obviously then also serving old state anyway). With a consistent way of tracking the generation for writes now, I don't see any point in keeping this behavior around and simplified the deleting of old index-N accordingly in production code.

@original-brownbear
Copy link
Member Author

@ywelsch @tlrx let me know what you think about splitting #49060 in half this way.
I was thinking (hoping) that only fixing the write path but retaining all the fallbacks on the read side makes this somewhat easy to review because it seems to me like a relatively non-controversial change. We're not altering behavior in any other way but for not starting at generation -1 when writing to a repo that was emptied externally. This means we get by without BwC changes here and can make the difficult decisions on how to handle external repo modifications in the last step #49060 without having to discuss the details of the new structure in the CS and its handling because it's all introduced here.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments and questions. I would prefer to reuse RepositoryMetaData for this, just so that we have all repo-related information in one location.

@@ -114,6 +115,7 @@ public ClusterModule(Settings settings, ClusterService clusterService, List<Clus
public static List<Entry> getNamedWriteables() {
List<Entry> entries = new ArrayList<>();
// Cluster State
registerClusterCustom(entries, RepositoriesState.TYPE, RepositoriesState::new, RepositoriesState::readDiffFrom);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use the RepositoryMetaData for this? It is already persisted in the cluster state and even survives full-cluster restarts. This one is a bit odd, as it now has to keep a map for each repo again, and needs clean-up when repos are added / removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I was thinking we may not actually want to persist this ... but now that I think about S3 and the pending generation, we actually probably want to ... will see what I can do about the RepositoryMetaData here :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywelsch hmm on second thought:

This one is a bit odd, as it now has to keep a map for each repo again, and needs clean-up when repos are added / removed.

This seems to be the only real downside though? And it's not really a downside since if I move this stuff into RepositoryMetaData I get the some new complication from having to adjust RepositoriesService to now use some custom comparisons on RepositoryMetaData that ignores the generation related fields to see if a repo has changed.

The other downside is, that we now have the generation logic leak into all the Repository implementations in a bunch of new spots. CCR and all the wrapping repositories don't really care about the generation and it's entirely BlobStoreRepository specific. Would we then make adjustments to the serialization of e.g. GetRepositoriesResponse so that it doesn't include the generation? We also lose some incrementality/efficiency in the serialization ClusterState as a result of mixing somewhat dynamic and pretty static things in it don't we?

I think persisting this state across full-cluster restarts is a good thing still, but I'm not so sure about using RepositoryMetaData here having tried implementating this practically now.
Still think it's worth it and make adjustments to things like GetRepositoriesResponse accordingly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RepositoriesService to now use some custom comparisons on RepositoryMetaData that ignores the generation related fields to see if a repo has changed

I think this is ok (it's similar to how we have it with most items in the cluster state). Further down the line, we should not reinit the repository implementation when there is an update from the CS, but leave it to the implementation to decide whether it needs complete reinitialization or not. This will allow us to do dynamic throttling (e.g. allow user to dynamically change max_restore_bytes_per_sec which directly takes place on an ongoing restore). It's a bit bigger investment right now, but will hopefully pay off further down the line.

Regarding serialization, we can make RepositoryMetaData implement Diffable to have it more smartly serialize changes (this would benefit any other field in it as well). I don't expect this to matter much though.

I also think we won't need to adapt GetRepositoriesResponse but can just conditionally expose the additional fields by parameterizing toXContent if need be (I'm even fine exposing these fields on the repositories API and not having two variants).

if (currentGen != expectedGen) {
// the index file was updated by a concurrent operation, so we were operating on stale
// repository data
throw new RepositoryException(metadata.name(), "concurrent modification of the index-N file, expected current generation [" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listener.onFailure(new RepositoryException...)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, seems nicer :)

final RepositoriesState.State repoState = Optional.ofNullable(state.state(repoName)).orElseGet(
() -> RepositoriesState.builder().putState(repoName, expectedGen, expectedGen).build().state(repoName));
if (repoState.pendingGeneration() != repoState.generation()) {
logger.warn("Trying to write new repository data of generation [{}] over unfinished write, repo is in state [{}]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we warn here? Shouldn't this just be info level logging?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this isn't such a bad state :) made it info


@Override
public void onFailure(String source, Exception e) {
l.onFailure(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we wrap the exception here to add more context where exactly something went wrong (same for setting pending generation)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea added some wrapping here


// Step 1: Set repository generation state to the next possible pending generation
final StepListener<Long> setPendingStep = new StepListener<>();
clusterService.submitStateUpdateTask("set pending repository generation", new ClusterStateUpdateTask() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe mention generation + repo here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure added for both CS update steps

repoState.pendingGeneration(), repoState);
}
if (expectedGen != RepositoryData.EMPTY_REPO_GEN && expectedGen != repoState.generation()) {
throw new IllegalStateException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when do we expect this to happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come to think of it ... it's impossible already :) Removing it in favor of an assert. We already make sure not to be in a bad spot here in safeRepositoryData anyway

});

// Step 2: Write new index-N blob to repository and update index.latest
setPendingStep.whenComplete(newGen -> threadPool().generic().execute(ActionRunnable.wrap(listener, l -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use the generic threadpool here, and not the snapshot one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ... moved to the snapshot pool.

Copy link
Member Author

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered + addressed all comments working on the below now

I've left some comments and questions. I would prefer to reuse RepositoryMetaData for this

@@ -114,6 +115,7 @@ public ClusterModule(Settings settings, ClusterService clusterService, List<Clus
public static List<Entry> getNamedWriteables() {
List<Entry> entries = new ArrayList<>();
// Cluster State
registerClusterCustom(entries, RepositoriesState.TYPE, RepositoriesState::new, RepositoriesState::readDiffFrom);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I was thinking we may not actually want to persist this ... but now that I think about S3 and the pending generation, we actually probably want to ... will see what I can do about the RepositoryMetaData here :)

if (currentGen != expectedGen) {
// the index file was updated by a concurrent operation, so we were operating on stale
// repository data
throw new RepositoryException(metadata.name(), "concurrent modification of the index-N file, expected current generation [" +
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, seems nicer :)

final RepositoriesState.State repoState = Optional.ofNullable(state.state(repoName)).orElseGet(
() -> RepositoriesState.builder().putState(repoName, expectedGen, expectedGen).build().state(repoName));
if (repoState.pendingGeneration() != repoState.generation()) {
logger.warn("Trying to write new repository data of generation [{}] over unfinished write, repo is in state [{}]",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this isn't such a bad state :) made it info

repoState.pendingGeneration(), repoState);
}
if (expectedGen != RepositoryData.EMPTY_REPO_GEN && expectedGen != repoState.generation()) {
throw new IllegalStateException(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come to think of it ... it's impossible already :) Removing it in favor of an assert. We already make sure not to be in a bad spot here in safeRepositoryData anyway

});

// Step 2: Write new index-N blob to repository and update index.latest
setPendingStep.whenComplete(newGen -> threadPool().generic().execute(ActionRunnable.wrap(listener, l -> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ... moved to the snapshot pool.


@Override
public void onFailure(String source, Exception e) {
l.onFailure(e);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea added some wrapping here


// Step 1: Set repository generation state to the next possible pending generation
final StepListener<Long> setPendingStep = new StepListener<>();
clusterService.submitStateUpdateTask("set pending repository generation", new ClusterStateUpdateTask() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure added for both CS update steps

@original-brownbear
Copy link
Member Author

@ywelsch thanks, addressed all comments now. Let me know what you think about the RepositoriesMetaData situation :)

@original-brownbear
Copy link
Member Author

@ywelsch @tlrx all points addressed again I think :)

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Armin

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (Left one comment about an exception message)

@original-brownbear
Copy link
Member Author

Thanks so much Yannick + Tanguy! Just one more step left now in this ordeal I think :)

@original-brownbear original-brownbear merged commit b34daeb into elastic:master Dec 4, 2019
@original-brownbear original-brownbear deleted the repo-uses-cs-increment branch December 4, 2019 12:01
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 5, 2019
This moves the blob store repository to only use the information
available in the clusterstate for loading `RepositoryData` without
falling back to listing to determine a repositories' generation.

Relates elastic#49729
Closes elastic#38941
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 9, 2019
Step on the road to elastic#49060.

This commit adds the logic to keep track of a repository's generation
across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine.

It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same `index-N` are not a possibility any longer.

With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents.
The logic for that will be introduced in elastic#49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In elastic#49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
original-brownbear added a commit that referenced this pull request Dec 9, 2019
Step on the road to #49060.

This commit adds the logic to keep track of a repository's generation
across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine.

It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same `index-N` are not a possibility any longer.

With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents.
The logic for that will be introduced in #49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In #49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
original-brownbear added a commit that referenced this pull request Dec 17, 2019
)

Follow up to #49729 

This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 17, 2019
…stic#49060)

Follow up to elastic#49729

This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.
original-brownbear added a commit that referenced this pull request Dec 17, 2019
) (#50267)

Follow up to #49729

This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Step on the road to elastic#49060. 

This commit adds the logic to keep track of a repository's generation
across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine.

It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same `index-N` are not a possibility any longer. 

With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents.
The logic for that will be introduced in elastic#49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In elastic#49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
…stic#49060)

Follow up to elastic#49729 

This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.
@original-brownbear original-brownbear restored the repo-uses-cs-increment branch August 6, 2020 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants