Make snapshot deletion faster #147

Conversation

@nknize (Collaborator) commented Feb 26, 2021

Opened on behalf of @AmiStrn from logz.io:

The delete snapshot task takes longer than expected. A major reason for this is
that the (often many) stale indices are deleted iteratively.
In this commit we change the deletion to be concurrent, using the SNAPSHOT threadpool.
Note that, in order to avoid putting too many delete tasks on the threadpool
queue, a methodology similar to the one in executeOneFileSnapshot() is used, so that
other tasks can still use this threadpool without too much of a delay.

fixes #146
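
A minimal sketch of the pattern described above, for illustration only; this is not the exact code in this PR. It assumes the surroundings of BlobStoreRepository, and the helper name cleanupStaleIndicesConcurrently and its parameters are hypothetical (executeOneStaleIndexDelete() is the name discussed later in this thread). The idea: queue the stale index ids, submit at most one task per allowed worker to the SNAPSHOT executor, and have each task re-submit itself after finishing an item, mirroring the throttling style of executeOneFileSnapshot().

    // Sketch only (hypothetical wiring): groupedListener is assumed to be created by the caller
    // with a group size equal to the number of stale indices being deleted.
    private void cleanupStaleIndicesConcurrently(Map<String, BlobContainer> foundIndices,
                                                 Set<String> survivingIndexIds,
                                                 Executor snapshotExecutor,
                                                 int maxConcurrentDeletes,
                                                 GroupedActionListener<DeleteResult> groupedListener) {
        BlockingQueue<Map.Entry<String, BlobContainer>> staleIndicesToDelete = new LinkedBlockingQueue<>();
        for (Map.Entry<String, BlobContainer> entry : foundIndices.entrySet()) {
            if (survivingIndexIds.contains(entry.getKey()) == false) {
                staleIndicesToDelete.add(entry); // only stale indices are queued
            }
        }
        // Submit at most maxConcurrentDeletes tasks so the SNAPSHOT threadpool queue is not flooded
        // and other snapshot work can still be scheduled without much delay.
        int workers = Math.min(maxConcurrentDeletes, staleIndicesToDelete.size());
        for (int i = 0; i < workers; i++) {
            executeOneStaleIndexDelete(staleIndicesToDelete, snapshotExecutor, groupedListener);
        }
    }

    private void executeOneStaleIndexDelete(BlockingQueue<Map.Entry<String, BlobContainer>> staleIndicesToDelete,
                                            Executor snapshotExecutor,
                                            GroupedActionListener<DeleteResult> groupedListener) {
        Map.Entry<String, BlobContainer> indexEntry = staleIndicesToDelete.poll();
        if (indexEntry == null) {
            return; // queue drained: this worker slot stops, nothing more to submit
        }
        snapshotExecutor.execute(ActionRunnable.supply(groupedListener, () -> {
            DeleteResult deleteResult = indexEntry.getValue().delete(); // delete one stale index container
            // Pick up the next queue entry before reporting this result
            // (error handling is discussed further down in this thread).
            executeOneStaleIndexDelete(staleIndicesToDelete, snapshotExecutor, groupedListener);
            return deleteResult;
        }));
    }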

@nknize nknize added the bug label Feb 26, 2021
@odfe-release-bot

Request for Admin to accept this test.

@odfe-release-bot

❌   DCO Check Failed
Run ./dev-tools/signoff-check.sh remotes/origin/main 49cb5c84b9ff890ef845e40bb6c401df8a5e0e5b to check locally
Use git commit with -s to add 'Signed-off-by: {EMAIL}' on impacted commits

@odfe-release-bot

❌   DCO Check Failed
Run ./dev-tools/signoff-check.sh remotes/origin/main 4c1b52173712a74c86b8df7d754e2a8fea8ed178 to check locally
Use git commit with -s to add 'Signed-off-by: {EMAIL}' on impacted commits

@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch from 4c1b521 to f46cba9 on March 15, 2021 05:15
@odfe-release-bot

✅   DCO Check Passed

@nknize nknize added the v1.0.0, v1.0.0-alpha1, and v2.0.0 labels Mar 20, 2021
@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch 2 times, most recently from c86d8ea to ff56b19 on March 22, 2021 14:49
@odfe-release-bot

✅   DCO Check Passed

@odfe-release-bot

❌   DCO Check Failed
Run ./dev-tools/signoff-check.sh remotes/origin/main c86d8ea5b79f5a5c5a5db7d9bed5b5725c8bcafd to check locally
Use git commit with -s to add 'Signed-off-by: {EMAIL}' on impacted commits

@odfe-release-bot

✅   DCO Check Passed

@peterzhuamazon (Member)

start gradle precommit

@odfe-release-bot

✅   Gradle Precommit success

@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch from ff56b19 to a551d55 on March 24, 2021 22:25
@odfe-release-bot

✅   Gradle Wrapper Validation success a551d55d9f0d0ce6a4703e0999708f1ada377252

@odfe-release-bot

✅   DCO Check Passed a551d55d9f0d0ce6a4703e0999708f1ada377252

@odfe-release-bot

✅   Gradle Precommit success a551d55d9f0d0ce6a4703e0999708f1ada377252

@peterzhuamazon (Member)

@nknize Hi, you need to commit again to trigger all 3 automatic checks.
Then type start gradle check if you are already added to the admin list.

Thanks.

@odfe-release-bot

❌   Gradle Check failure a551d55d9f0d0ce6a4703e0999708f1ada377252
Log 37

Reports 37

@harold-wang (Contributor)

start gradle check

@odfe-release-bot

❌   Gradle Check failure a551d55d9f0d0ce6a4703e0999708f1ada377252
Log 39

Reports 39

@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch from a551d55 to fc0a2b7 on April 9, 2021 20:20
@odfe-release-bot

✅   Gradle Wrapper Validation success fc0a2b7b3cce044f397b43b3d81443f661f5acfa

@odfe-release-bot

✅   DCO Check Passed fc0a2b7b3cce044f397b43b3d81443f661f5acfa

@odfe-release-bot

✅   Gradle Wrapper Validation success c910998d87c2bcc4053f0b98217ce1ff2eb12388

@odfe-release-bot

✅   Gradle Precommit success c910998d87c2bcc4053f0b98217ce1ff2eb12388

AmiStrn added 2 commits April 10, 2021 22:45
The delete snapshot task takes longer than expected. A major reason for this is
that the (often many) stale indices are deleted iteratively.
In this commit we change the deletion to be concurrent, using the SNAPSHOT threadpool.
Note that, in order to avoid putting too many delete tasks on the threadpool
queue, a methodology similar to the one in `executeOneFileSnapshot()` is used, so that
other tasks can still use this threadpool without too much of a delay.

fixes issue #61513 from the Elasticsearch project

Signed-off-by: Nicholas Knize <nknize@amazon.com>
Signed-off-by: Nicholas Knize <nknize@amazon.com>
@nknize nknize force-pushed the logz/make-snapshot-deletion-faster branch from c910998 to 29e1d85 on April 11, 2021 03:46
@odfe-release-bot

✅   Gradle Wrapper Validation success 29e1d85

@odfe-release-bot

✅   DCO Check Passed 29e1d85

@odfe-release-bot

✅   Gradle Precommit success 29e1d85

@nknize (Collaborator, Author) commented Apr 12, 2021

start gradle check

@odfe-release-bot

✅   Gradle Check success 29e1d85
Log 64

Reports 64

@peterzhuamazon (Member)

start gradle precommit
start dco check
start wrapper validation

@odfe-release-bot

✅   Gradle Wrapper Validation success 29e1d85

@odfe-release-bot

✅   DCO Check Passed 29e1d85

@peterzhuamazon (Member)

start gradle precommit

@odfe-release-bot

✅   Gradle Precommit success 29e1d85

@piyushdaftary (Contributor) commented Apr 23, 2021

This PR seems to cherry-pick the product code portion of a PR I submitted to the elasticsearch repo: https://github.com/elastic/elasticsearch/pull/64513/files

Though the original PR is approved, it has yet to be merged into the mainline by the elasticsearch open source community.

I suggest picking up the original PR instead, as it has the needed test cases added.

@nknize (Collaborator, Author) commented Apr 23, 2021

This was originally cherry-picked from here, as documented in the PR description. But yes, it looks like the two changes are similar. Are there differences in your PR, @AmiStrn, that you want to highlight? If not, are you, @piyushdaftary, willing to submit a PR here so we can close this one?

@AmiStrn (Contributor) commented Apr 23, 2021

Hi @piyushdaftary and @nknize, my PR is in fact similar to the way concurrency is performed in the BlobStoreRepository (see org.opensearch.repositories.blobstore.BlobStoreRepository#executeOneFileSnapshot), and the similarities to your PR are therefore circumstantial. This is the style one would see in the same file; I wanted to maintain that style, as did @piyushdaftary, I assume.

Moreover, after reviewing https://github.com/elastic/elasticsearch/pull/64513/files, the main difference is in the function containing the concurrent behavior - executeOneStaleIndexDelete() (a naming style also used elsewhere in this class). The difference is in the way the recursion works with the GroupedActionListener. @piyushdaftary please note a major bug in your implementation:
the recursive nature of this function must still take the GroupedActionListener's countdown into account. The GroupedActionListener is set to respond after foundIndices.size() - survivingIndexIds.size() actions are registered (or a failure). In the event of an exception, the recursion is stopped even if not all the deletions were executed! This means that the grouped action listener will never count down the number of actions it is listening for, and the deletion would get stuck.
In this PR that is not the case: the recursion comes first, then the return, in line with the behavior of the deletion result before the change to concurrency.
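
As a simplified illustration of this point (assumed names, in the spirit of executeOneStaleIndexDelete(); a sketch, not the code of either PR), the two orderings differ roughly as follows:

    snapshotExecutor.execute(ActionRunnable.supply(groupedListener, () -> {
        DeleteResult deleteResult;
        try {
            deleteResult = indexEntry.getValue().delete();
        } catch (Exception e) {
            // Problematic ordering: a plain "return DeleteResult.ZERO;" here would complete this
            // worker's slot without polling the queue again. If every worker ends this way, the
            // remaining queue entries are never processed, so the GroupedActionListener (waiting for
            // foundIndices.size() - survivingIndexIds.size() responses) never completes.
            deleteResult = DeleteResult.ZERO;
        }
        // Safe ordering: always pick up the next queue entry first, then report this result.
        executeOneStaleIndexDelete(staleIndicesToDelete, snapshotExecutor, groupedListener);
        return deleteResult;
    }));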

Regarding the test, I can see that although it is not testing the concurrent behavior @piyushdaftary's PR set out to introduce, it is nonetheless a missing test case!
@piyushdaftary perhaps add a PR for the missing test case? Maybe combine the test case with this PR?
How would you like to proceed? Would you rather open a PR, so that I can add a commit for the executeOneStaleIndexDelete() implementation and this becomes a joint effort?

I am open to collaboration :)

@piyushdaftary (Contributor) commented Apr 24, 2021

Hi @AmiStrn
I revisited my original PR; the bug you are talking about seems to actually be handled in the executeOneStaleIndexDelete() method. In that method it is ensured that any exception is caught by a try/catch block and the executor returns DeleteResult.ZERO. This is done to make sure that if we get an exception while deleting one stale index (usually a repository exception, but not limited to that), we don't stop there and instead try to delete the remaining stale indices. The stale indices that couldn't be deleted this time because of the exception will be tried again in the next snapshot deletion iteration, by design. This ensures we don't stop snapshot deletion because of a transient repository exception.

This design keeps in mind the long-term goal for the snapshot deletion flow, where we want to make snapshot deletion asynchronous.

The test case added in the original PR actually replicates the same scenario: we intentionally introduce a repository exception, let the RepositoriesService throw while deleting stale indices, and validate the following:

  1. Snapshot deletion completes.
  2. The stale indices which couldn't be deleted earlier because of the exception are actually deleted and cleaned up in the next snapshot deletion.

But in case you still see a scenario in the original PR code where snapshot deletion will get stuck, I request you to please write a test case for it against the original PR code. I will be happy to collaborate with you on it.

@nknize: I will raise a PR shortly.

@AmiStrn (Contributor) commented Apr 24, 2021

@piyushdaftary we could take this offline if you would like.
Anyway, I would be glad to explain the bug via a test. I'll get on it first thing tomorrow.
Regarding the test in the PR, I ran it without the added code (TDD style) and it passes, so I am sure that what it is testing is not related to the added concurrency.

Stay tuned :)

@AmiStrn (Contributor) commented Apr 25, 2021

Since we are doing this here, it's a bit of a read -
Results from reviewing https://github.com/elastic/elasticsearch/pull/64513/files:

  1. Running testResidualStaleIndicesAreDeletedByConsecutiveDelete() without the concurrency fix in BlobStoreRepository - the test passes. This test is testing a case that is not related to the concurrency, but to the case where the deletion is incomplete and is completed the next time it is run. So this test does not have to be part of this PR, but it is important to add nonetheless, perhaps with some additions I will mention later.

  2. The test actually can fail when using the executeOneStaleIndexDelete() function in that PR:
    The existing test is passing because the case is "lucky".
    To reproduce the deletion getting stuck in that PR's test:
    a) We first create 2 more snapshots in the test, like so, at this line:

        createIndex("test-idx-3");
        ensureGreen();
        for (int j = 0; j < 10; j++) {
            index("test-idx-3", "_doc", Integer.toString( 10 + j), "foo", "bar" +  10 + j);
        }
        refresh();

        logger.info("--> creating third snapshot");
        createFullSnapshot(repositoryName, snapshot3Name);

        logger.info("--> creating index-4 and ingest data");
        createIndex("test-idx-4");
        ensureGreen();
        for (int j = 0; j < 10; j++) {
            index("test-idx-4", "_doc", Integer.toString( 10 + j), "foo", "bar" +  10 + j);
        }
        refresh();

        logger.info("--> creating fourth snapshot");
        createFullSnapshot(repositoryName, snapshot4Name); 

b) We change this line to:

client.admin().cluster().prepareDeleteSnapshot(repositoryName, snapshot2Name, snapshot3Name, snapshot4Name).get();

c) Run the test. It never finishes and gets stuck at the first attempt to delete.

Why was the original test passing? Because the test case had at most 2 snapshots to delete and the number of workers is typically 2 or more; 2 threads (each worker instantiates a new thread) were enough to achieve the required countdown for the GroupedActionListener. Trying to delete more snapshots than there are workers (threads available for the deletion), combined with a number of exceptions greater than or equal to the number of threads, will result in the deletion getting stuck.

In my PR (and in the master branch) this is not the case, as the behavior regarding exceptions does not stop the GroupedActionListener from reaching a full countdown. The behavior follows the original behavior of the method.

To better understand what happened behind the scenes and why I saw this as a bug, please copy and run this code (you can just run it from the BlobStoreRepository file). It mimics a state where the countdown is not reached and therefore onResponse never executes; commenting out the if statement in the ActionRunnable's lambda function will prove this point:

    public static void main(String[] args) throws InterruptedException {

        ActionListener<Integer> listener = new ActionListener<Integer>() {
            @Override
            public void onResponse(Integer num) {
                System.out.println("response received: " + num);
            }

            @Override
            public void onFailure(Exception e) {
                System.out.println("failed");
            }
        };

        final GroupedActionListener<Integer> groupedListener = new GroupedActionListener<>(ActionListener.wrap(results -> {
            int theResult = 0;
            for (Integer result : results) {
                theResult += result;
            }
            listener.onResponse(theResult);
        }, listener::onFailure), 10);

        // Queue 20 items; the grouped listener above only waits for 10 responses.
        BlockingQueue<Integer> lst = new LinkedBlockingQueue<>();
        for (int i = 0; i < 20; i++) lst.add(i);

        rec(groupedListener, lst);
        System.out.println("after recursive run, should be printed first");
    }

    static void rec(GroupedActionListener<Integer> listener, BlockingQueue<Integer> staleIndicesToDelete) throws InterruptedException {
        Integer indexEntry = staleIndicesToDelete.poll(0L, TimeUnit.MILLISECONDS);
        if (indexEntry != null) {
            new Thread(() -> ActionRunnable.supply(listener, () -> {

                // comment out this if statement to see the groupedActionListener respond
                if (indexEntry == 5) return 0;

                rec(listener, staleIndicesToDelete);
                return indexEntry;
            }).run()).start();
        }
    }
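
As a reading aid for the snippet above: with the early return in place, only six responses (for entries 0 through 5) ever reach the grouped listener, short of the 10 it waits for, so "response received" is never printed while "after recursive run, should be printed first" still appears; with the if statement commented out, the countdown of 10 is reached and onResponse prints the summed result.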

@piyushdaftary - I suggest we join our efforts on this matter. Please open a PR; I will commit my changes to your PR (the test update and the executeOneStaleIndexDelete() implementation) and then we can close this one. I think credit is due to you as well as to me for our work on this issue :)
@nknize would this allow both of us to be signed off on our respective changes? This is where git standards start to confuse me.

@piyushdaftary (Contributor)

Hi @AmiStrn
Thanks for writing the test to explain the bug in the original PR. Back when the original PR was raised (against ES 7.6), ES did not support batch delete (i.e., deleting multiple snapshots together), hence it was coded the way it is. But now that ES and OpenSearch do support batch snapshot delete, the code changes you suggested make sense, should be part of the fix, and deserve credit.

Regarding the test case added in the original PR: as the name suggests, it validates that residual stale indices (which couldn't be deleted by the current snapshot delete) are deleted by a consecutive snapshot delete.

@nknize: Please suggest how we should proceed with the fix.

@nknize (Collaborator, Author) commented Apr 26, 2021

Please suggest how we should proceed with the fix.

My suggestion is to cherry-pick the original commit (@piyushdaftary, I think it was yours? Could you do this and open a PR?) and then have @AmiStrn open a separate PR to explain and fix the bug. That seems to be the organic evolution here, so we should address it in that order.

@AmiStrn (Contributor) commented Apr 26, 2021

@nknize and @piyushdaftary I agree, as long as the original PR is committed and merged as-is, with the bug. Otherwise, this whole effort would have been in vain on my part.
@piyushdaftary do you also agree with this resolution?

@piyushdaftary (Contributor) commented Apr 26, 2021

Raised a PR with a cherry-pick of the original PR: #613

I agree with the resolution.

@AmiStrn (Contributor) commented Apr 27, 2021

Great! As soon as this PR is committed to the main branch I'll open a PR to fix the bug :)

@AmiStrn (Contributor) commented Apr 29, 2021

@nknize I think we can close this branch in favor of #629 and #613.
I was going to do it myself but then realized I wasn't the one who opened it 😄

@nknize (Collaborator, Author) commented Apr 29, 2021

@nknize I think we can close this branch in favor of #629 and #613.
I was going to do it myself but then realized I wasn't the one who opened it

Thank you! Closing in favor of #629 and #613.

@nknize nknize closed this Apr 29, 2021