[BUG] Segment Replication - Shard close while copying files can leave listeners open #8292

mch2 · 2023-06-27T17:16:28Z

Describe the bug
With Segment Replication it is possible that a shard is closed while segment copy is still ongoing, while rare this can result in a case where a ReplicationListener is left open on shard close. This can happen because in the beforeIndexShardClosed method in SegmentReplicationTargetService we cancel ongoing replications, however this is more of a request for cancellation as the method does not block and wait for all resources to be cleaned up.

Within these components there are two listeners. One that is passed to the ReplicationTarget (ReplicationListener), that is only resolved through ReplicationCollection when it is done/failed. The other that is internal to SegmentReplicationTarget directly on execution, that is used to alert the target service a replication event is completed and the collection done/fail methods should be invoked.

Cancellation will remove the shard from SegmentReplicationTargetService's ReplicationCollection immediately in beforeIndexShardClosed, so when the internal listener to startReplication is resolved with failure/success the collection does not notify the original ReplicationListener as it is already removed from the collection. This surfaces with the following stacktrace:

REPRODUCE WITH: ./gradlew 'null' --tests "org.opensearch.index.mapper.ICUCollationKeywordFieldMapperIT.testCustomRules" -Dtests.seed=6885F19536DCDBB9 -Dtests.security.manager=false -Dtests.locale=no -Dtests.timezone=America/Adak -Druntime.java=19

java.lang.AssertionError: All incoming requests on node [node_s3] should have finished. Expected 0 but got 437; pending tasks [[{
  "node" : "node_s3",
  "id" : 415,
  "type" : "transport",
  "action" : "internal:index/shard/recovery/start_recovery",
  "description" : "",
  "start_time" : "2023-06-27T17:03:01.988Z",
  "start_time_in_millis" : 1687885381988,
  "running_time" : "1m",
  "running_time_in_nanos" : 65986775792,
  "cancellable" : false,
  "cancelled" : false,
  "headers" : { },
  "resource_stats" : {
    "average" : {
      "cpu_time_in_nanos" : 0,
      "memory_in_bytes" : 0
    },
    "total" : {
      "cpu_time_in_nanos" : 0,
      "memory_in_bytes" : 0
    },
    "min" : {
      "cpu_time_in_nanos" : 0,
      "memory_in_bytes" : 0
    },
    "max" : {
      "cpu_time_in_nanos" : 0,
      "memory_in_bytes" : 0
    },
    "thread_info" : {
      "thread_executions" : 0,
      "active_threads" : 0
    }
  }
}]]

	at __randomizedtesting.SeedInfo.seed([6885F19536DCDBB9:E8D91A8A3570CB6A]:0)
	at org.opensearch.test.InternalTestCluster.lambda$assertRequestsFinished$43(InternalTestCluster.java:2746)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1084)
	at org.opensearch.test.InternalTestCluster.assertRequestsFinished(InternalTestCluster.java:2737)
	at org.opensearch.test.InternalTestCluster.assertAfterTest(InternalTestCluster.java:2712)
	at org.opensearch.test.OpenSearchIntegTestCase.afterInternal(OpenSearchIntegTestCase.java:603)
	at org.opensearch.test.OpenSearchIntegTestCase.cleanUpCluster(OpenSearchIntegTestCase.java:2245)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)

To Reproduce
Steps to reproduce the behavior:

Enable SR by default & run any test running suite scope - I've used ICUCollationKeywordFieldMapperIT & get this after a couple hundred iterations on shard close.
I also wrote a unit test in SegmentReplicationIndexShardTests:

    public void testBeforeIndexShardClosedWhileCopyingFiles() throws Exception {
        try (ReplicationGroup shards = createGroup(1, settings, new NRTReplicationEngineFactory())) {
            shards.startAll();
            IndexShard primary = shards.getPrimary();
            final IndexShard replica = shards.getReplicas().get(0);

            primary.refresh("Test");

            final SegmentReplicationSourceFactory sourceFactory = mock(SegmentReplicationSourceFactory.class);
            final SegmentReplicationTargetService targetService = newTargetService(sourceFactory);
            SegmentReplicationSource source = new TestReplicationSource() {

                ActionListener<GetSegmentFilesResponse> listener;

                @Override
                public void getCheckpointMetadata(
                    long replicationId,
                    ReplicationCheckpoint checkpoint,
                    ActionListener<CheckpointInfoResponse> listener
                ) {
                    resolveCheckpointInfoResponseListener(listener, primary);
                }

                @Override
                public void getSegmentFiles(
                    long replicationId,
                    ReplicationCheckpoint checkpoint,
                    List<StoreFileMetadata> filesToFetch,
                    IndexShard indexShard,
                    ActionListener<GetSegmentFilesResponse> listener
                ) {
                    // set the listener, we will only fail it once we cancel the source.
                    this.listener = listener;
                    // shard is closing while we are copying files.
                    targetService.beforeIndexShardClosed(replica.shardId, replica, Settings.EMPTY);
                }

                @Override
                public void cancel() {
                    // simulate listener resolving, but only after we have issued a cancel from beforeIndexShardClosed .
                    final RuntimeException exception = new CancellableThreads.ExecutionCancelledException("retryable action was cancelled");
                    listener.onFailure(exception);
                }
            };
            when(sourceFactory.get(any())).thenReturn(source);
            startReplicationAndAssertCancellation(replica, targetService);

            shards.removeReplica(replica);
            closeShards(replica);
        }
    }

Expected behavior
All resources to be cleaned up.

The text was updated successfully, but these errors were encountered:

mch2 added bug Something isn't working untriaged labels Jun 27, 2023

Rishikesh1159 added distributed framework and removed untriaged labels Jun 27, 2023

mch2 self-assigned this Jun 28, 2023

mch2 added this to Segment Replication Jun 28, 2023

github-project-automation bot moved this to Todo in Segment Replication Jun 28, 2023

mch2 moved this from Todo to In Progress in Segment Replication Jun 28, 2023

This was referenced Jul 6, 2023

Fix bug where ReplicationListeners would not complete on cancellation. #8478

Merged

[Segment Replication] Add logic back to update tracking replication checkpoint on source #8560

Merged

mch2 closed this as completed in #8478 Jul 11, 2023

github-project-automation bot moved this from In Progress to Done in Segment Replication Jul 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Segment Replication - Shard close while copying files can leave listeners open #8292

[BUG] Segment Replication - Shard close while copying files can leave listeners open #8292

mch2 commented Jun 27, 2023

[BUG] Segment Replication - Shard close while copying files can leave listeners open #8292

[BUG] Segment Replication - Shard close while copying files can leave listeners open #8292

Comments

mch2 commented Jun 27, 2023