Make snapshot deletion faster #147

Conversation

@nknize (Collaborator) commented Feb 26, 2021

Opened on behalf of @AmiStrn from logz.io:

The delete snapshot task takes longer than expected. A major reason for this is
that the (often many) stale indices are deleted iteratively.
In this commit we change the deletion to be concurrent, using the SNAPSHOT threadpool.
Note that, in order to avoid putting too many delete tasks on the threadpool
queue, a methodology similar to the one in executeOneFileSnapshot() is used, so that
other tasks can still use this threadpool without too much of a delay.

fixes #146
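
A minimal sketch of the pattern described above, for illustration only; this is not the exact code in this PR. It assumes the surroundings of BlobStoreRepository, and the helper name cleanupStaleIndicesConcurrently and its parameters are hypothetical (executeOneStaleIndexDelete() is the name discussed later in this thread). The idea: queue the stale index ids, submit at most one task per allowed worker to the SNAPSHOT executor, and have each task re-submit itself after finishing an item, mirroring the throttling style of executeOneFileSnapshot().

    // Sketch only (hypothetical wiring): groupedListener is assumed to be created by the caller
    // with a group size equal to the number of stale indices being deleted.
    private void cleanupStaleIndicesConcurrently(Map<String, BlobContainer> foundIndices,
                                                 Set<String> survivingIndexIds,
                                                 Executor snapshotExecutor,
                                                 int maxConcurrentDeletes,
                                                 GroupedActionListener<DeleteResult> groupedListener) {
        BlockingQueue<Map.Entry<String, BlobContainer>> staleIndicesToDelete = new LinkedBlockingQueue<>();
        for (Map.Entry<String, BlobContainer> entry : foundIndices.entrySet()) {
            if (survivingIndexIds.contains(entry.getKey()) == false) {
                staleIndicesToDelete.add(entry); // only stale indices are queued
            }
        }
        // Submit at most maxConcurrentDeletes tasks so the SNAPSHOT threadpool queue is not flooded
        // and other snapshot work can still be scheduled without much delay.
        int workers = Math.min(maxConcurrentDeletes, staleIndicesToDelete.size());
        for (int i = 0; i < workers; i++) {
            executeOneStaleIndexDelete(staleIndicesToDelete, snapshotExecutor, groupedListener);
        }
    }

    private void executeOneStaleIndexDelete(BlockingQueue<Map.Entry<String, BlobContainer>> staleIndicesToDelete,
                                            Executor snapshotExecutor,
                                            GroupedActionListener<DeleteResult> groupedListener) {
        Map.Entry<String, BlobContainer> indexEntry = staleIndicesToDelete.poll();
        if (indexEntry == null) {
            return; // queue drained: this worker slot stops, nothing more to submit
        }
        snapshotExecutor.execute(ActionRunnable.supply(groupedListener, () -> {
            DeleteResult deleteResult = indexEntry.getValue().delete(); // delete one stale index container
            // Pick up the next queue entry before reporting this result
            // (error handling is discussed further down in this thread).
            executeOneStaleIndexDelete(staleIndicesToDelete, snapshotExecutor, groupedListener);
            return deleteResult;
        }));
    }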

@nknize nknize added the bug label Feb 26, 2021
@odfe-release-bot

Request for Admin to accept this test.

@odfe-release-bot

❌   DCO Check Failed
Run ./dev-tools/signoff-check.sh remotes/origin/main 49cb5c84b9ff890ef845e40bb6c401df8a5e0e5b to check locally
Use git commit with -s to add 'Signed-off-by: {EMAIL}' on impacted commits

@odfe-release-bot

❌   DCO Check Failed
Run ./dev-tools/signoff-check.sh remotes/origin/main 4c1b52173712a74c86b8df7d754e2a8fea8ed178 to check locally
Use git commit with -s to add 'Signed-off-by: {EMAIL}' on impacted commits

@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch from 4c1b521 to f46cba9 on March 15, 2021 05:15
@odfe-release-bot

✅   DCO Check Passed

@nknize nknize added the v1.0.0, v1.0.0-alpha1, and v2.0.0 labels Mar 20, 2021
@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch 2 times, most recently from c86d8ea to ff56b19 on March 22, 2021 14:49
@odfe-release-bot

✅   DCO Check Passed

@odfe-release-bot

❌   DCO Check Failed
Run ./dev-tools/signoff-check.sh remotes/origin/main c86d8ea5b79f5a5c5a5db7d9bed5b5725c8bcafd to check locally
Use git commit with -s to add 'Signed-off-by: {EMAIL}' on impacted commits

@odfe-release-bot

✅   DCO Check Passed

@peterzhuamazon (Member)

start gradle precommit

@odfe-release-bot

✅   Gradle Precommit success

@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch from ff56b19 to a551d55 on March 24, 2021 22:25
@odfe-release-bot

✅   Gradle Wrapper Validation success a551d55d9f0d0ce6a4703e0999708f1ada377252

@odfe-release-bot

✅   DCO Check Passed a551d55d9f0d0ce6a4703e0999708f1ada377252

@odfe-release-bot

✅   Gradle Precommit success a551d55d9f0d0ce6a4703e0999708f1ada377252

@peterzhuamazon (Member)

@nknize Hi, you need to commit again to trigger all 3 automatic checks.
Then type start gradle check if you are already added to the admin list.

Thanks.

@odfe-release-bot

❌   Gradle Check failure a551d55d9f0d0ce6a4703e0999708f1ada377252
Log 37

Reports 37

@harold-wang (Contributor)

start gradle check

@odfe-release-bot

❌   Gradle Check failure a551d55d9f0d0ce6a4703e0999708f1ada377252
Log 39

Reports 39

@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch from a551d55 to fc0a2b7 on April 9, 2021 20:20
@odfe-release-bot

✅   Gradle Wrapper Validation success fc0a2b7b3cce044f397b43b3d81443f661f5acfa

@odfe-release-bot

✅   DCO Check Passed fc0a2b7b3cce044f397b43b3d81443f661f5acfa

@odfe-release-bot

✅   Gradle Wrapper Validation success c910998d87c2bcc4053f0b98217ce1ff2eb12388

@odfe-release-bot

✅   Gradle Precommit success c910998d87c2bcc4053f0b98217ce1ff2eb12388

AmiStrn added 2 commits April 10, 2021 22:45
The delete snapshot task takes longer than expected. A major reason for this is
that the (often many) stale indices are deleted iteratively.
In this commit we change the deletion to be concurrent, using the SNAPSHOT threadpool.
Note that, in order to avoid putting too many delete tasks on the threadpool
queue, a methodology similar to the one in `executeOneFileSnapshot()` is used, so that
other tasks can still use this threadpool without too much of a delay.

fixes issue #61513 from the Elasticsearch project

Signed-off-by: Nicholas Knize <nknize@amazon.com>
Signed-off-by: Nicholas Knize <nknize@amazon.com>
@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch from c910998 to 29e1d85 on April 11, 2021 03:46
@odfe-release-bot

✅   Gradle Wrapper Validation success 29e1d85

@odfe-release-bot

✅   DCO Check Passed 29e1d85

@odfe-release-bot

✅   Gradle Precommit success 29e1d85

@nknize (Collaborator, Author) commented Apr 12, 2021

start gradle check

@odfe-release-bot

✅   Gradle Check success 29e1d85
Log 64

Reports 64

@peterzhuamazon (Member)

start gradle precommit
start dco check
start wrapper validation

@odfe-release-bot

✅   Gradle Wrapper Validation success 29e1d85

@odfe-release-bot

✅   DCO Check Passed 29e1d85

@peterzhuamazon (Member)

start gradle precommit

@odfe-release-bot

✅   Gradle Precommit success 29e1d85

@piyushdaftary (Contributor) commented Apr 23, 2021

This PR seems to cherry-pick the product code portion of a PR I submitted to the elasticsearch repo: https://github.com/elastic/elasticsearch/pull/64513/files

Though the original PR is approved, it has yet to be merged into the mainline by the elasticsearch open source community.

I suggest picking up the original PR instead, as it has the needed test cases added.

@nknize (Collaborator, Author) commented Apr 23, 2021

This was originally cherry-picked from here, as documented in the PR description. But yes, it looks like the two changes are similar. Are there differences in your PR, @AmiStrn, that you want to highlight? If not, are you, @piyushdaftary, willing to submit a PR here so we can close this one?

@AmiStrn (Contributor) commented Apr 23, 2021

Hi @piyushdaftary and @nknize, my PR is in fact similar to the way concurrency is performed in the BlobStoreRepository (see org.opensearch.repositories.blobstore.BlobStoreRepository#executeOneFileSnapshot), and the similarities to your PR are therefore circumstantial. This is the style one would see in the same file; I wanted to maintain that style, as did @piyushdaftary, I assume.

Moreover, after reviewing https://github.com/elastic/elasticsearch/pull/64513/files, the main difference is in the function containing the concurrent behavior - executeOneStaleIndexDelete() (a naming style also used elsewhere in this class). The difference is in the way the recursion works with the GroupedActionListener. @piyushdaftary please note a major bug in your implementation:
the recursive nature of this function must still take the GroupedActionListener's countdown into account. The GroupedActionListener is set to respond after foundIndices.size() - survivingIndexIds.size() actions are registered (or a failure). In the event of an exception, the recursion is stopped even if not all the deletions were executed! This means that the grouped action listener will never count down the number of actions it is listening for, and the deletion would get stuck.
In this PR that is not the case: the recursion comes first, then the return, in line with the behavior of the deletion result before the change to concurrency.
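
As a simplified illustration of this point (assumed names, in the spirit of executeOneStaleIndexDelete(); a sketch, not the code of either PR), the two orderings differ roughly as follows:

    snapshotExecutor.execute(ActionRunnable.supply(groupedListener, () -> {
        DeleteResult deleteResult;
        try {
            deleteResult = indexEntry.getValue().delete();
        } catch (Exception e) {
            // Problematic ordering: a plain "return DeleteResult.ZERO;" here would complete this
            // worker's slot without polling the queue again. If every worker ends this way, the
            // remaining queue entries are never processed, so the GroupedActionListener (waiting for
            // foundIndices.size() - survivingIndexIds.size() responses) never completes.
            deleteResult = DeleteResult.ZERO;
        }
        // Safe ordering: always pick up the next queue entry first, then report this result.
        executeOneStaleIndexDelete(staleIndicesToDelete, snapshotExecutor, groupedListener);
        return deleteResult;
    }));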

Regarding the test, I can see that although it is not testing the concurrent behavior @piyushdaftary's PR set out to introduce, it is nonetheless a missing test case!
@piyushdaftary perhaps add a PR for the missing test case? Maybe combine the test case with this PR?
How would you like to proceed? Would you rather open a PR, so that I can add a commit for the executeOneStaleIndexDelete() implementation and this becomes a joint effort?

I am open to collaboration :)

@piyushdaftary (Contributor) commented Apr 24, 2021

Hi @AmiStrn
I revisited my original PR; the bug you are talking about seems to actually be handled in the executeOneStaleIndexDelete() method. In that method it is ensured that any exception is caught by a try/catch block and the executor returns DeleteResult.ZERO. This is done to make sure that if we get an exception while deleting one stale index (usually a repository exception, but not limited to that), we don't stop there and instead try to delete the remaining stale indices. The stale indices that couldn't be deleted this time because of the exception will be tried again in the next snapshot deletion iteration, by design. This ensures we don't stop snapshot deletion because of a transient repository exception.

This design keeps in mind the long-term goal for the snapshot deletion flow, where we want to make snapshot deletion asynchronous.

The test case added in the original PR actually replicates the same scenario: we intentionally introduce a repository exception, let the RepositoriesService throw while deleting stale indices, and validate the following:

  1. Snapshot deletion completes.
  2. The stale indices which couldn't be deleted earlier because of the exception are actually deleted and cleaned up in the next snapshot deletion.

But in case you still see a scenario in the original PR code where snapshot deletion will get stuck, I request you to please write a test case for it against the original PR code. I will be happy to collaborate with you on it.

@nknize: I will raise a PR shortly.

@AmiStrn (Contributor) commented Apr 24, 2021

@piyushdaftary we could take this offline if you would like.
Anyway, I would be glad to explain the bug via a test. I'll get on it first thing tomorrow.
Regarding the test in the PR, I ran it without the added code (TDD style) and it passes, so I am sure that what it is testing is not related to the added concurrency.

Stay tuned :)

@AmiStrn (Contributor) commented Apr 25, 2021

Since we are doing this here, it's a bit of a read -
Results from reviewing https://github.com/elastic/elasticsearch/pull/64513/files:

  1. Running testResidualStaleIndicesAreDeletedByConsecutiveDelete() without the concurrency fix in BlobStoreRepository - the test passes. This test is testing a case that is not related to the concurrency, but to the case where the deletion is incomplete and is completed the next time it is run. So this test does not have to be part of this PR, but it is important to add nonetheless, perhaps with some additions I will mention later.

  2. The test actually can fail when using the executeOneStaleIndexDelete() function in that PR:
    The existing test is passing because the case is "lucky".
    To reproduce the deletion getting stuck in that PR's test:
    a) We first create 2 more snapshots in the test, like so, at this line:

        createIndex("test-idx-3");
        ensureGreen();
        for (int j = 0; j < 10; j++) {
            index("test-idx-3", "_doc", Integer.toString( 10 + j), "foo", "bar" +  10 + j);
        }
        refresh();

        logger.info("--> creating third snapshot");
        createFullSnapshot(repositoryName, snapshot3Name);

        logger.info("--> creating index-4 and ingest data");
        createIndex("test-idx-4");
        ensureGreen();
        for (int j = 0; j < 10; j++) {
            index("test-idx-4", "_doc", Integer.toString( 10 + j), "foo", "bar" +  10 + j);
        }
        refresh();

        logger.info("--> creating fourth snapshot");
        createFullSnapshot(repositoryName, snapshot4Name); 

b) We change this line to:

client.admin().cluster().prepareDeleteSnapshot(repositoryName, snapshot2Name, snapshot3Name, snapshot4Name).get();

c) Run the test. It never finishes and gets stuck at the first attempt to delete.

Why was the original test passing? Because the test case had at most 2 snapshots to delete and the number of workers is typically 2 or more; 2 threads (each worker instantiates a new thread) were enough to achieve the required countdown for the GroupedActionListener. Trying to delete more snapshots than there are workers (threads available for the deletion), combined with a number of exceptions greater than or equal to the number of threads, will result in the deletion getting stuck.

In my PR (and in the master branch) this is not the case, as the behavior regarding exceptions does not stop the GroupedActionListener from reaching a full countdown. The behavior follows the original behavior of the method.

To better understand what happened behind the scenes and why I saw this as a bug, please copy and run this code (you can just run it from the BlobStoreRepository file). It mimics a state where the countdown is not reached and therefore onResponse never executes; commenting out the if statement in the ActionRunnable's lambda function will prove this point:

    public static void main(String[] args) throws InterruptedException {

        ActionListener<Integer> listener = new ActionListener<Integer>() {
            @Override
            public void onResponse(Integer num) {
                System.out.println("response received: " + num);
            }

            @Override
            public void onFailure(Exception e) {
                System.out.println("failed");
            }
        };

        final GroupedActionListener<Integer> groupedListener = new GroupedActionListener<>(ActionListener.wrap(results -> {
            int theResult = 0;
            for (Integer result : results) {
                theResult += result;
            }
            listener.onResponse(theResult);
        }, listener::onFailure), 10);

        // Queue 20 items; the grouped listener above only waits for 10 responses.
        BlockingQueue<Integer> lst = new LinkedBlockingQueue<>();
        for (int i = 0; i < 20; i++) lst.add(i);

        rec(groupedListener, lst);
        System.out.println("after recursive run, should be printed first");
    }

    static void rec(GroupedActionListener<Integer> listener, BlockingQueue<Integer> staleIndicesToDelete) throws InterruptedException {
        Integer indexEntry = staleIndicesToDelete.poll(0L, TimeUnit.MILLISECONDS);
        if (indexEntry != null) {
            new Thread(() -> ActionRunnable.supply(listener, () -> {

                // comment out this if statement to see the groupedActionListener respond
                if (indexEntry == 5) return 0;

                rec(listener, staleIndicesToDelete);
                return indexEntry;
            }).run()).start();
        }
    }
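
As a reading aid for the snippet above: with the early return in place, only six responses (for entries 0 through 5) ever reach the grouped listener, short of the 10 it waits for, so "response received" is never printed while "after recursive run, should be printed first" still appears; with the if statement commented out, the countdown of 10 is reached and onResponse prints the summed result.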

@piyushdaftary - I suggest we join our efforts on this matter. Please open a PR; I will commit my changes to your PR (the test update and the executeOneStaleIndexDelete() implementation) and then we can close this one. I think credit is due to you as well as to me for our work on this issue :)
@nknize would this allow both of us to be signed off on our respective changes? This is where git standards start to confuse me.

@piyushdaftary (Contributor)

Hi @AmiStrn
Thanks for writing the test to explain the bug in the original PR. Back when the original PR was raised (against ES 7.6), ES did not support batch delete (i.e., deleting multiple snapshots together), hence it was coded the way it is. But now that ES and OpenSearch do support batch snapshot delete, the code changes you suggested make sense, should be part of the fix, and deserve credit.

Regarding the test case added in the original PR: as the name suggests, it validates that residual stale indices (which couldn't be deleted by the current snapshot delete) are deleted by a consecutive snapshot delete.

@nknize: Please suggest how we should proceed with the fix.

@nknize (Collaborator, Author) commented Apr 26, 2021

Please suggest how we should proceed with the fix.

My suggestion is to cherry-pick the original commit (@piyushdaftary, I think it was yours? Could you do this and open a PR?) and then have @AmiStrn open a separate PR to explain and fix the bug. That seems to be the organic evolution here, so we should address it in that order.

@AmiStrn (Contributor) commented Apr 26, 2021

@nknize and @piyushdaftary I agree, as long as the original PR is committed and merged as-is, with the bug. Otherwise, this whole effort would have been in vain on my part.
@piyushdaftary do you also agree with this resolution?

@piyushdaftary (Contributor) commented Apr 26, 2021

Raised a PR with a cherry-pick of the original PR: #613

I agree with the resolution.

@AmiStrn (Contributor) commented Apr 27, 2021

Great! As soon as this PR is committed to the main branch I'll open a PR to fix the bug :)

@AmiStrn (Contributor) commented Apr 29, 2021

@nknize I think we can close this branch in favor of #629 and #613.
I was going to do it myself but then realized I wasn't the one who opened it 😄

@nknize (Collaborator, Author) commented Apr 29, 2021

@nknize I think we can close this branch in favor of #629 and #613.
I was going to do it myself but then realized I wasn't the one who opened it

Thank you! Closing in favor of #629 and #613.

@nknize nknize closed this Apr 29, 2021