
[BUG] Snapshot deletion may get stuck in case of exceptions during bulk delete #627

Closed
AmiStrn opened this issue Apr 28, 2021 · 0 comments · Fixed by #629
Labels
bug Something isn't working v1.0.0 Version 1.0.0 v2.0.0 Version 2.0.0

Comments

@AmiStrn
Contributor

AmiStrn commented Apr 28, 2021

Describe the bug
The snapshot deletion task uses a GroupedActionListener within a recursive function to respond when all the deletions are done running. In the event of exceptions during the deletion, the listener's countdown is not reached, so the response is never returned. This causes the deletion task to get stuck. The function with the bug is shown in the following snippet; notice that the error flow ends the recursion:

    private void executeOneStaleIndexDelete(BlockingQueue<Map.Entry<String, BlobContainer>> staleIndicesToDelete,
                                            GroupedActionListener<DeleteResult> listener) throws InterruptedException {
        Map.Entry<String, BlobContainer> indexEntry = staleIndicesToDelete.poll(0L, TimeUnit.MILLISECONDS);
        if (indexEntry == null) {
            return;
        } else {
            final String indexSnId = indexEntry.getKey();
            threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(ActionRunnable.supply(listener, () -> {
                try {
                    DeleteResult staleIndexDeleteResult = indexEntry.getValue().delete();
                    logger.debug("[{}] Cleaned up stale index [{}]", metadata.name(), indexSnId);
                    executeOneStaleIndexDelete(staleIndicesToDelete, listener);
                    return staleIndexDeleteResult;
                } catch (IOException e) {
                    logger.warn(() -> new ParameterizedMessage(
                        "[{}] index {} is no longer part of any snapshots in the repository, " +
                        "but failed to clean up their index folders", metadata.name(), indexSnId), e);
                    return DeleteResult.ZERO;
                } catch (Exception e) {
                    assert false : e;
                    logger.warn(new ParameterizedMessage("[{}] Exception during single stale index delete", metadata.name()), e);
                    return DeleteResult.ZERO;
                }
            }));
        }
    }
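For reference, the linked fix (#629) changes the stop condition so that the error flow also keeps draining the queue. Below is a minimal, self-contained sketch of that corrected control flow, using a CountDownLatch and a plain executor as stand-ins for the GroupedActionListener and the SNAPSHOT threadpool. All names here are illustrative, not OpenSearch API:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class StaleIndexDeleteSketch {

    static final int ENTRIES = 4;
    static final CountDownLatch allResults = new CountDownLatch(ENTRIES);
    static final ExecutorService pool = Executors.newFixedThreadPool(2);

    public static boolean drainAll() throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < ENTRIES; i++) queue.add(i);
        // The real code primes one chain per available SNAPSHOT-pool thread.
        executeOneDelete(queue);
        executeOneDelete(queue);
        boolean finished = allResults.await(2, TimeUnit.SECONDS);
        pool.shutdown();
        return finished;
    }

    static void executeOneDelete(BlockingQueue<Integer> queue) {
        Integer entry = queue.poll();
        if (entry == null) {
            return; // queue drained, nothing left for this chain
        }
        pool.execute(() -> {
            try {
                if (entry % 2 == 0) {
                    throw new RuntimeException("simulated bulk-delete failure");
                }
                // success path: the real code would produce a DeleteResult here
            } catch (RuntimeException e) {
                // failure path: the real code produces DeleteResult.ZERO
            } finally {
                allResults.countDown();  // the grouped listener still gets a result
                executeOneDelete(queue); // and the chain keeps draining the queue
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("all results received: " + drainAll());
    }
}
```

Because the recursive call sits on both the success and failure paths, every queued entry eventually produces a result, so the countdown always completes.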

The listener definition relies on the number of deletions to be performed in order to respond:

    final GroupedActionListener<DeleteResult> groupedListener = new GroupedActionListener<>(ActionListener.wrap(deleteResults -> {
        DeleteResult deleteResult = DeleteResult.ZERO;
        for (DeleteResult result : deleteResults) {
            deleteResult = deleteResult.add(result);
        }
        listener.onResponse(deleteResult);
    }, listener::onFailure), foundIndices.size() - survivingIndexIds.size());
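Conceptually, the grouped listener just counts results and fires once the expected count is reached. A stripped-down analogue (not the actual OpenSearch class; names are illustrative) shows why missing even one result leaves it waiting forever:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified stand-in for GroupedActionListener: collects results and invokes
// the wrapped callback only once `expected` results have arrived.
class MiniGroupedListener<T> {
    private final List<T> results = new ArrayList<>();
    private final int expected;
    private final Consumer<List<T>> onAllResults;

    MiniGroupedListener(int expected, Consumer<List<T>> onAllResults) {
        this.expected = expected;
        this.onAllResults = onAllResults;
    }

    synchronized void onResponse(T result) {
        results.add(result);
        if (results.size() == expected) {
            onAllResults.accept(results); // fires exactly once, at the countdown
        }
    }
}
```

If the recursion stops early and only, say, 9 of 10 expected results are ever delivered, the callback never runs and the caller waits indefinitely.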

To Reproduce
Steps to reproduce the behavior:
(This builds on an existing test for simplicity.)

  1. Go to testResidualStaleIndicesAreDeletedByConsecutiveDelete() in RepositoriesIT.java and add more snapshots to the test case at line 162:
        createIndex("test-idx-3");
        ensureGreen();
        for (int j = 0; j < 10; j++) {
            index("test-idx-3", "_doc", Integer.toString(10 + j), "foo", "bar" + 10 + j);
        }
        refresh();

        logger.info("--> creating third snapshot");
        createFullSnapshot(repositoryName, snapshot3Name);

        logger.info("--> creating index-4 and ingest data");
        createIndex("test-idx-4");
        ensureGreen();
        for (int j = 0; j < 10; j++) {
            index("test-idx-4", "_doc", Integer.toString(10 + j), "foo", "bar" + 10 + j);
        }
        refresh();

        logger.info("--> creating fourth snapshot");
        createFullSnapshot(repositoryName, snapshot4Name); 
  2. Change the first deletion attempt in the test at line 170 to delete more than two snapshots in bulk:
         client.admin().cluster().prepareDeleteSnapshot(repositoryName, snapshot2Name, snapshot3Name, snapshot4Name).get();
  3. Run the test. It will get stuck at the first attempt.

Expected behavior
The deletion was supposed to fail but not get stuck: a response should be returned reporting 0 successful deletions. The next successful deletion would then also delete the snapshots left over from the previous failed attempt.

The exception handler does add ZERO to the result, but it also ends the recursion on the allocated thread. This should not happen, because that thread chain is required for deleting the other snapshots that are still in the queue.

Host/Environment (please complete the following information):

  • OS: macOS BigSur
  • Version 11.2.1

Additional context

To better understand what happens behind the scenes, and why I see this as a bug, please copy and run this code (you can run it directly from the BlobStoreRepository file). It mimics a state where the countdown is not reached, so onResponse() never executes; commenting out the if statement in the ActionRunnable's lambda demonstrates this:

    public static void main(String[] args) throws InterruptedException {

        ActionListener<Integer> listener = new ActionListener<Integer>() {
            @Override
            public void onResponse(Integer num) {
                System.out.println("response received: " + num);
            }

            @Override
            public void onFailure(Exception e) {
                System.out.println("failed");
            }
        };

        final GroupedActionListener<Integer> groupedListener = new GroupedActionListener<>(ActionListener.wrap(results -> {
            int theResult = 0;
            for (Integer result : results) {
                theResult += result;
            }
            listener.onResponse(theResult);
        }, listener::onFailure), 10);

        BlockingQueue<Integer> lst = new LinkedBlockingQueue<>();
        for (int i = 0; i < 20; i++) lst.add(i);

        rec(groupedListener, lst);
        System.out.println("after recursive run, should be printed first");
    }

    static void rec(GroupedActionListener<Integer> listener, BlockingQueue<Integer> staleIndicesToDelete) throws InterruptedException {
        Integer indexEntry = staleIndicesToDelete.poll(0L, TimeUnit.MILLISECONDS);
        if (indexEntry != null) {
            new Thread(() -> ActionRunnable.supply(listener, () -> {

                // comment out this if statement to see the groupedActionListener respond
            if (indexEntry == 5) return 0;

                rec(listener, staleIndicesToDelete);
                return indexEntry;
            }).run()).start();
        }
    }
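For comparison, here is a plain-Java variant of that demo with the fix applied: the recursive call is also made on the simulated failure path, so every queue entry still yields a (possibly zero) result and a grouped listener expecting 10 results would fire. Names are illustrative; this does not use the OpenSearch classes:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FixedRecDemo {

    static int resultsDelivered = 0;

    // Fixed variant of the demo's rec(): the recursive call also happens on the
    // simulated failure path, so the queue keeps draining.
    static void rec(BlockingQueue<Integer> queue) {
        Integer entry = queue.poll();
        if (entry == null) {
            return; // queue drained
        }
        if (entry == 5) {
            rec(queue);          // keep draining despite the failure
            resultsDelivered++;  // deliver the ZERO result instead of a real one
            return;
        }
        rec(queue);
        resultsDelivered++;      // deliver the real result
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 10; i++) queue.add(i);
        rec(queue);
        // All 10 entries produced a result, so a listener expecting 10 would fire.
        System.out.println("results delivered: " + resultsDelivered);
    }
}
```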

Relevant thread for this issue.

@AmiStrn AmiStrn added the bug Something isn't working label Apr 28, 2021
AmiStrn added a commit to AmiStrn/OpenSearch that referenced this issue Apr 28, 2021
…aleIndexDelete()` stop condition to be when the queue of `staleIndicesToDelete` is empty -- also in the error flow.
Otherwise the GroupedActionListener never responds and in the event of a few exceptions the
deletion task gets stuck.
Altered the test case to fail to delete in bulk many snapshots at the first attempt, and then
the next successful deletion also takes care of the previously failed attempt as the test
originally intended.
SNAPSHOT threadpool is at most 5. So in the event we get more than 5 exceptions there are no
more threads to handle the deletion task and there is still one more snapshot to delete in the
queue. Thus, in the test I made the number of extra snapshots be one more than the max in the
SNAPSHOT threadpool.

fixes - opensearch-project#627

Signed-off-by: AmiStrn <amitai.stern@logz.io>
@nknize nknize added v1.0.0 Version 1.0.0 v2.0.0 Version 2.0.0 labels Apr 29, 2021