[BUG] Snapshot deletion may get stuck in case of exceptions during bulk delete #627

AmiStrn · 2021-04-28T09:37:33Z

Describe the bug
The snapshot deletion task uses a GroupedActionListener within a recursive function to respond when all the deletions are done running. In the event of exceptions during the deletion the listener's countdown is not reached therefore the response is never returned. This causes the deletion task to get stuck. The function with the bug is in the following snippet, notice the error flow ends the recursion:

OpenSearch/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java

Lines 1089 to 1114 in eacd732

    
           private void executeOneStaleIndexDelete(BlockingQueue<Map.Entry<String, BlobContainer>> staleIndicesToDelete, 
        
                                                   GroupedActionListener<DeleteResult> listener) throws InterruptedException { 
        
               Map.Entry<String, BlobContainer> indexEntry  = staleIndicesToDelete.poll(0L, TimeUnit.MILLISECONDS); 
        
               if (indexEntry == null) { 
        
                   return; 
        
               } else { 
        
                   final String indexSnId = indexEntry.getKey(); 
        
                   threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(ActionRunnable.supply(listener, () -> { 
        
                       try { 
        
                           DeleteResult staleIndexDeleteResult = indexEntry.getValue().delete(); 
        
                           logger.debug("[{}] Cleaned up stale index [{}]", metadata.name(), indexSnId); 
        
                           executeOneStaleIndexDelete(staleIndicesToDelete, listener); 
        
                           return staleIndexDeleteResult; 
        
                       } catch (IOException e) { 
        
                           logger.warn(() -> new ParameterizedMessage( 
        
                               "[{}] index {} is no longer part of any snapshots in the repository, " + 
        
                                   "but failed to clean up their index folders", metadata.name(), indexSnId), e); 
        
                           return DeleteResult.ZERO; 
        
                       } catch (Exception e) { 
        
                           assert false : e; 
        
                           logger.warn(new ParameterizedMessage("[{}] Exception during single stale index delete", metadata.name()), e); 
        
                           return DeleteResult.ZERO; 
        
                       } 
        
                   })); 
        
               } 
        
           }

The listener definition is relying on the number of deletions to be performed in order to respond:

OpenSearch/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java

Lines 1058 to 1064 in eacd732

    
           final GroupedActionListener<DeleteResult> groupedListener = new GroupedActionListener<>(ActionListener.wrap(deleteResults -> { 
        
               DeleteResult deleteResult = DeleteResult.ZERO; 
        
               for (DeleteResult result : deleteResults) { 
        
                   deleteResult = deleteResult.add(result); 
        
               } 
        
               listener.onResponse(deleteResult); 
        
           }, listener::onFailure), foundIndices.size() - survivingIndexIds.size());

To Reproduce
Steps to reproduce the behavior:
(i am building this over an existing test for simplicity)

Go to testResidualStaleIndicesAreDeletedByConsecutiveDelete() at RepositoriesIT.java and add more snapshots to the test case at line 162:

        createIndex("test-idx-3");
        ensureGreen();
        for (int j = 0; j < 10; j++) {
            index("test-idx-3", "_doc", Integer.toString( 10 + j), "foo", "bar" +  10 + j);
        }
        refresh();

        logger.info("--> creating third snapshot");
        createFullSnapshot(repositoryName, snapshot3Name);

        logger.info("--> creating index-4 and ingest data");
        createIndex("test-idx-4");
        ensureGreen();
        for (int j = 0; j < 10; j++) {
            index("test-idx-4", "_doc", Integer.toString( 10 + j), "foo", "bar" +  10 + j);
        }
        refresh();

        logger.info("--> creating fourth snapshot");
        createFullSnapshot(repositoryName, snapshot4Name);

Change the first deletion attempt in the test at line 170 to delete more than two snapshots in bulk:

client.admin().cluster().prepareDeleteSnapshot(repositoryName, snapshot2Name, snapshot3Name, snapshot4Name).get();

Run the test. It will get stuck at the first attempt.

Expected behavior
The deletion was supposed to fail but not get stuck, a response of the deletion should return with 0 successful deletions in the response. Then the next successful deletion would try to also delete the snapshots from the previous failed attempt.

The exception does add 0 to the result but it also ends the use of the allocated thread which should not happen because it is required for deleting the other snapshots that are still in the queue.

Host/Environment (please complete the following information):

OS: macOS BigSur
Version 11.2.1

Additional context

To better understand what happened behind the scenes and why I saw this as a bug please copy and run this code (you can just run it from the BlobStoreRepository file). It mimics a state where the countdown is not reached and therefore the onResponse() never executes, commenting out the if statement in the ActionRunnable's lambda function will prove this point:

    public static void main(String[] args) throws InterruptedException {

        ActionListener<Integer> listener = new ActionListener<Integer>() {
            @Override
            public void onResponse(Integer num) {
                System.out.println("response received: " + num);
            }

            @Override
            public void onFailure(Exception e) {
                System.out.println("failed");
            }
        };

        final GroupedActionListener<Integer> groupedListener = new GroupedActionListener<>(ActionListener.wrap(results -> {
            int theResult = 0;
            for (Integer result : results) {
                theResult += result;
            }
            listener.onResponse(theResult);
        }, listener::onFailure), 10);

        BlockingQueue<Integer> lst = new LinkedBlockingQueue<>();
        for (int i=0; i<20; i++) lst.add(i);

        rec(groupedListener, lst);
        System.out.println("after recursive run, should be printed first");
    }

    static void rec(GroupedActionListener<Integer> listener, BlockingQueue<Integer> staleIndicesToDelete) throws InterruptedException {
        Integer indexEntry = staleIndicesToDelete.poll(0L, TimeUnit.MILLISECONDS);
        if (indexEntry != null) {
            new Thread(() -> ActionRunnable.supply(listener, () -> {

                // comment out this if statement to see the groupedActionListener respond
                if (indexEntry==5) return 0;

                rec(listener, staleIndicesToDelete);
                return indexEntry;
            }).run()).start();
        }
    }

relevant thread for this issue

The text was updated successfully, but these errors were encountered:

…aleIndexDelete()` stop condition to be when the queue of `staleIndicesToDelete` is empty -- also in the error flow. Otherwise the GroupedActionListener never responds and in the event of a few exceptions the deletion task gets stuck. Altered the test case to fail to delete in bulk many snapshots at the first attempt, and then the next successful deletion also takes care of the previously failed attempt as the test originally intended. SNAPSHOT threadpool is at most 5. So in the event we get more than 5 exceptions there are no more threads to handle the deletion task and there is still one more snapshot to delete in the queue. Thus, in the test I made the number of extra snapshots be one more than the max in the SNAPSHOT threadpool. fixes - opensearch-project#627 Signed-off-by: AmiStrn <amitai.stern@logz.io>

* Added size attribute to MultiTermsAggregation Signed-off-by: Fabian Kopatschek <fabian.kopatschek@yahoo.de> * Added integrationTest check if MultiTermsAggregation is supported Signed-off-by: Fabian Kopatschek <fabian.kopatschek@yahoo.de> --------- Signed-off-by: Fabian Kopatschek <fabian.kopatschek@yahoo.de>

AmiStrn added the bug Something isn't working label Apr 28, 2021

AmiStrn mentioned this issue Apr 28, 2021

Fix snapshot deletion task getting stuck in the event of exceptions #628

Closed

5 tasks

AmiStrn mentioned this issue Apr 28, 2021

Fix snapshot deletion task getting stuck in the event of exceptions #629

Merged

5 tasks

nknize added v1.0.0 Version 1.0.0 v2.0.0 Version 2.0.0 labels Apr 29, 2021

nknize closed this as completed in #629 May 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Snapshot deletion may get stuck in case of exceptions during bulk delete #627

[BUG] Snapshot deletion may get stuck in case of exceptions during bulk delete #627

AmiStrn commented Apr 28, 2021

[BUG] Snapshot deletion may get stuck in case of exceptions during bulk delete #627

[BUG] Snapshot deletion may get stuck in case of exceptions during bulk delete #627

Comments

AmiStrn commented Apr 28, 2021