Make snapshot deletion faster #147
@nknize Hi, you need to commit again to trigger all 3 automatic checks. Thanks. |
The delete snapshot task takes longer than expected. A major reason for this is that the (often many) stale indices are deleted iteratively. In this commit we change the deletion to be concurrent using the SNAPSHOT threadpool. Notice that in order to avoid putting too many delete tasks on the threadpool queue a similar methodology was used as in `executeOneFileSnapshot()`. This is due to the fact that the threadpool should allow other tasks to use this threadpool without too much of a delay.
fixes issue #61513 from Elasticsearch project
Signed-off-by: Nicholas Knize <nknize@amazon.com>
This PR seems to cherry-pick the product-code part of the PR I submitted to the elasticsearch repo (https://github.com/elastic/elasticsearch/pull/64513/files). Although the original PR is approved, it has yet to be merged into mainline by the Elasticsearch open source community. I suggest picking up the original PR instead, as it includes the needed test cases. |
This was originally cherry-picked from here, as documented in the PR description. But yes, it looks like the two changes are similar. Are there differences in your PR, @AmiStrn, that you want to highlight? If not, @piyushdaftary, are you willing to submit a PR here so we can close this one? |
Hi @piyushdaftary and @nknize, my PR is in fact similar to the way the concurrency is performed in the Moreover, after reviewing https://github.com/elastic/elasticsearch/pull/64513/files the main difference is the main function in the PR containing the concurrent behavior - Regarding the test, I can see that although it does not test the concurrent behavior @piyushdaftary's PR set out to introduce, it is nonetheless a missing test case! I am open to collaboration :) |
Hi @AmiStrn This design was done keeping in mind the long-term goal of the snapshot deletion flow, where we want to make snapshot deletion asynchronous. The test case added in the original PR actually replicates the same scenario, where we intentionally introduce a repository exception and let the
But in case you still see a scenario in the original PR code where snapshot deletion will get stuck, I request that you please write a test case for it against the original PR code. I will be happy to collaborate with you on it. @nknize: I will raise a PR shortly. |
@piyushdaftary we could take this offline if you would like. Stay tuned:) |
Since we are doing this here it's a bit of a read -
b) We change this line to:
c) Run the test. It never finishes and gets stuck at the first attempt to delete. Why was the original test passing? Because the test case had 2 snapshots to delete at most and the number of Whereas in my PR (and in the master branch) this is not the case, as the behavior regarding exceptions does not stop the GroupedActionListener from getting a full countdown. The behavior follows the original behavior of the method. To better understand what happened behind the scenes and why I saw this as a bug, please copy and run this code (you can just run it from the
@piyushdaftary - I suggest we join our efforts on this matter. Please open a PR, I will commit my changes to your PR (the test update and the executeOneStaleIndexDelete() implementation), and then we can close this one. I think credit is due to you as well as to me for our work on this issue:) |
Hi @AmiStrn Regarding the test case added in the original PR: as the name suggests, it validates that residual stale indices (which couldn't be deleted by the current snapshot delete) are deleted by a subsequent snapshot delete. @nknize: Please suggest how we should proceed with the fix. |
My suggestion is to cherry-pick the original commit (@piyushdaftary I think it was yours? so could you do this and open a PR?) and then could @AmiStrn open a separate PR to explain and fix the bug? Seems that's the organic evolution here so we should address in that order? |
@nknize and @piyushdaftary I agree, as long as the original PR is committed and merged as is with the bug. Otherwise, this whole thing would have been in vain on my part. |
Raised a PR with a cherry-pick of the original PR: #613. I agree with the resolution. |
Great! As soon as this PR is committed to the main branch I'll open a PR to fix the bug :) |
Opened on behalf of @AmiStrn from logz.io:
The delete snapshot task takes longer than expected. A major reason for this is
that the (often many) stale indices are deleted iteratively.
In this commit we change the deletion to be concurrent using the SNAPSHOT threadpool.
Notice that in order to avoid putting too many delete tasks on the threadpool
queue a similar methodology was used as in executeOneFileSnapshot(). This is due to
the fact that the threadpool should allow other tasks to use this threadpool without
too much of a delay.
fixes #146
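For readers following the thread, here is a minimal sketch of the two ideas discussed above: keeping only a bounded number of delete tasks on the pool at a time (each finished task picks up the next index), and making sure completion is counted even when a delete fails. It uses plain `java.util.concurrent` rather than OpenSearch classes. The names `StaleIndexDeleteSketch`, `deleteAll`, `executeOne`, the `deleteOne` callback, and the 30-second timeout are all hypothetical and do not appear in the PR; a `CountDownLatch` stands in for the `GroupedActionListener` mentioned in the discussion, and this is not the code that was merged.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch only; not the OpenSearch implementation.
class StaleIndexDeleteSketch {

    // Deletes every id using at most `workers` (>= 1) in-flight tasks.
    // Returns the number of deletes that threw an exception.
    static int deleteAll(List<String> indexIds, int workers, Consumer<String> deleteOne) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(indexIds);
        // One count per index: the latch must reach zero even when deletes fail.
        CountDownLatch done = new CountDownLatch(indexIds.size());
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Seed only `workers` tasks instead of queueing one task per index.
            int seeds = Math.min(workers, indexIds.size());
            for (int i = 0; i < seeds; i++) {
                executeOne(pool, pending, deleteOne, failures, done);
            }
            if (!done.await(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for deletes");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for deletes", e);
        } finally {
            pool.shutdown();
        }
        return failures.get();
    }

    // Takes one id from the queue, deletes it, then schedules the next one.
    private static void executeOne(ExecutorService pool, Queue<String> pending,
                                   Consumer<String> deleteOne, AtomicInteger failures,
                                   CountDownLatch done) {
        String id = pending.poll();
        if (id == null) {
            return; // nothing left for this worker
        }
        pool.execute(() -> {
            try {
                deleteOne.accept(id);
            } catch (RuntimeException e) {
                failures.incrementAndGet(); // record the failure, keep going
            } finally {
                done.countDown(); // counted on success AND on failure
                executeOne(pool, pending, deleteOne, failures, done);
            }
        });
    }
}
```

If the `countDown()` call were moved out of the `finally` block and into the success path only, a single failing delete would leave the latch above zero and the caller would wait until the timeout, which is the kind of stuck deletion debated in this thread.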