Describe the bug
The snapshot deletion task uses a GroupedActionListener within a recursive function to respond once all the deletions have finished running. If exceptions occur during the deletion, the listener's countdown is never reached, so the response is never returned and the deletion task gets stuck. The function with the bug is in BlobStoreRepository.java (lines 1089 to 1114 in eacd732); notice that the error flow ends the recursion.
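To make the countdown problem concrete, here is a minimal single-threaded sketch that does not use any OpenSearch classes. `CountdownSketch`, `CountingListener`, `deleteNext`, and `listenerFires` are hypothetical names invented for this illustration; `CountingListener` only imitates the counting behavior described above and is not the real GroupedActionListener.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

public class CountdownSketch {

    /** Hypothetical stand-in for GroupedActionListener: calls the delegate once groupSize responses have arrived. */
    static final class CountingListener {
        private final int groupSize;
        private final Consumer<List<Integer>> delegate;
        private final List<Integer> results = new ArrayList<>();

        CountingListener(int groupSize, Consumer<List<Integer>> delegate) {
            this.groupSize = groupSize;
            this.delegate = delegate;
        }

        void onResponse(int value) {
            results.add(value);
            if (results.size() == groupSize) {
                delegate.accept(results);
            }
        }
    }

    /** Handles one queue entry per call and recurses, except after a simulated failure. */
    static void deleteNext(Queue<Integer> queue, CountingListener listener, int failingEntry) {
        Integer entry = queue.poll();
        if (entry == null) {
            return; // queue drained
        }
        if (entry == failingEntry) {
            listener.onResponse(0); // error flow: reports 0 and ends the recursion
            return;
        }
        listener.onResponse(1);
        deleteNext(queue, listener, failingEntry);
    }

    /** Returns true if the listener fired; failingEntry = -1 means no deletion fails. */
    public static boolean listenerFires(int failingEntry) {
        Queue<Integer> queue = new ArrayDeque<>();
        for (int i = 0; i < 10; i++) {
            queue.add(i);
        }
        boolean[] fired = {false};
        CountingListener listener = new CountingListener(10, results -> fired[0] = true);
        deleteNext(queue, listener, failingEntry);
        return fired[0];
    }

    public static void main(String[] args) {
        System.out.println("no failure, listener fires: " + listenerFires(-1));
        System.out.println("entry 5 fails, listener fires: " + listenerFires(5));
    }
}
```

In this model the listener fires when nothing fails, but a single failure at entry 5 leaves it at 6 of 10 responses, so the delegate is never called.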
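Run the modified test (testResidualStaleIndicesAreDeletedByConsecutiveDelete() in RepositoriesIT.java, with extra snapshots added). It will get stuck at the first attempt.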
Expected behavior
The deletion should fail without getting stuck: the deletion response should return with 0 successful deletions. The next successful deletion would then also try to delete the snapshots left over from the previous failed attempt.
The exception does add 0 to the result, but it also ends the use of the allocated thread. This should not happen, because that thread is required for deleting the other snapshots that are still in the queue.
Host/Environment (please complete the following information):
OS: macOS Big Sur
Version 11.2.1
Additional context
To better understand what happens behind the scenes, and why I see this as a bug, please copy and run this code (you can just run it from the BlobStoreRepository file). It mimics a state where the countdown is not reached and therefore onResponse() never executes. Commenting out the if statement in the ActionRunnable's lambda function will prove this point:
    public static void main(String[] args) throws InterruptedException {
        // Final listener: should print the summed result once the group completes.
        ActionListener<Integer> listener = new ActionListener<Integer>() {
            @Override
            public void onResponse(Integer num) {
                System.out.println("response received: " + num);
            }

            @Override
            public void onFailure(Exception e) {
                System.out.println("failed");
            }
        };
        // Grouped listener: responds only after 10 individual responses have arrived.
        final GroupedActionListener<Integer> groupedListener = new GroupedActionListener<>(ActionListener.wrap(results -> {
            int theResult = 0;
            for (Integer result : results) {
                theResult += result;
            }
            listener.onResponse(theResult);
        }, listener::onFailure), 10);
        BlockingQueue<Integer> lst = new LinkedBlockingQueue<>();
        for (int i = 0; i < 20; i++) {
            lst.add(i);
        }
        rec(groupedListener, lst);
        System.out.println("after recursive run, should be printed first");
    }

    static void rec(GroupedActionListener<Integer> listener, BlockingQueue<Integer> staleIndicesToDelete) throws InterruptedException {
        Integer indexEntry = staleIndicesToDelete.poll(0L, TimeUnit.MILLISECONDS);
        if (indexEntry != null) {
            new Thread(() -> ActionRunnable.supply(listener, () -> {
                // comment out this if statement to see the groupedActionListener respond
                if (indexEntry == 5) return 0; // mimics the error flow: returns without recursing
                rec(listener, staleIndicesToDelete);
                return indexEntry;
            }).run()).start();
        }
    }
…aleIndexDelete()` stop
condition to be when the queue of `staleIndicesToDelete` is empty -- also in the error flow.
Otherwise the GroupedActionListener never responds and in the event of a few exceptions the
deletion task gets stuck.
Altered the test case so that the bulk deletion of many snapshots fails at the first attempt, and
the next successful deletion then also takes care of the previously failed attempt, as the test
originally intended.
The SNAPSHOT threadpool has at most 5 threads. So if we get more than 5 exceptions, there are no
more threads to handle the deletion task while there is still one more snapshot to delete in the
queue. Thus, in the test I made the number of extra snapshots one more than the max size of the
SNAPSHOT threadpool.
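
The reasoning about 5 threads and one extra snapshot can be illustrated with a simplified model built only on `java.util.concurrent`. This is a sketch under assumptions, not the repository code: `WorkerPoolSketch`, `worker`, and `unhandled` are hypothetical names, a loop stands in for the recursive call, every queued entry is assumed to fail, and a `CountDownLatch` plays the role of the grouped listener's countdown.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class WorkerPoolSketch {

    /** Each worker drains the queue; in the buggy mode it stops after its first failing entry. */
    static void worker(BlockingQueue<Boolean> queue, CountDownLatch allHandled, boolean continueOnError) {
        Boolean fails;
        while ((fails = queue.poll()) != null) {
            allHandled.countDown(); // one response per handled entry
            if (fails && !continueOnError) {
                return; // buggy error flow: this worker stops draining the queue
            }
        }
    }

    /** Returns how many entries were never handled; 0 means the grouped response would be sent. */
    public static long unhandled(int workers, int failingEntries, boolean continueOnError) {
        BlockingQueue<Boolean> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < failingEntries; i++) {
            queue.add(Boolean.TRUE); // assumption: every entry fails to delete
        }
        CountDownLatch allHandled = new CountDownLatch(failingEntries);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> worker(queue, allHandled, continueOnError));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return allHandled.getCount();
    }

    public static void main(String[] args) {
        System.out.println("5 workers, 6 failures, stop on error: " + unhandled(5, 6, false));
        System.out.println("5 workers, 5 failures, stop on error: " + unhandled(5, 5, false));
        System.out.println("5 workers, 6 failures, continue on error: " + unhandled(5, 6, true));
    }
}
```

In this model, 6 failures with 5 workers leave exactly one entry unhandled when workers stop on error, while 5 failures still complete. Draining until the queue is empty, also in the error flow, handles every entry.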
fixes - opensearch-project#627
Signed-off-by: AmiStrn <amitai.stern@logz.io>
The function with the bug:
OpenSearch/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java
Lines 1089 to 1114 in eacd732
The listener definition relies on the number of deletions to be performed in order to respond:
OpenSearch/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java
Lines 1058 to 1064 in eacd732
To Reproduce
Steps to reproduce the behavior (I am building this on top of an existing test for simplicity):
In testResidualStaleIndicesAreDeletedByConsecutiveDelete() at RepositoriesIT.java, add more snapshots to the test case at line 162.