Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Segment Replication - Shard close while copying files can leave listeners open #8292

Closed
mch2 opened this issue Jun 27, 2023 · 0 comments · Fixed by #8478
Closed

[BUG] Segment Replication - Shard close while copying files can leave listeners open #8292

mch2 opened this issue Jun 27, 2023 · 0 comments · Fixed by #8478
Assignees
Labels
bug Something isn't working distributed framework

Comments

@mch2
Copy link
Member

mch2 commented Jun 27, 2023

Describe the bug
With Segment Replication it is possible that a shard is closed while segment copy is still ongoing, while rare this can result in a case where a ReplicationListener is left open on shard close. This can happen because in the beforeIndexShardClosed method in SegmentReplicationTargetService we cancel ongoing replications, however this is more of a request for cancellation as the method does not block and wait for all resources to be cleaned up.

Within these components there are two listeners. One that is passed to the ReplicationTarget (ReplicationListener), that is only resolved through ReplicationCollection when it is done/failed. The other that is internal to SegmentReplicationTarget directly on execution, that is used to alert the target service a replication event is completed and the collection done/fail methods should be invoked.

Cancellation will remove the shard from SegmentReplicationTargetService's ReplicationCollection immediately in beforeIndexShardClosed, so when the internal listener to startReplication is resolved with failure/success the collection does not notify the original ReplicationListener as it is already removed from the collection. This surfaces with the following stacktrace:

REPRODUCE WITH: ./gradlew 'null' --tests "org.opensearch.index.mapper.ICUCollationKeywordFieldMapperIT.testCustomRules" -Dtests.seed=6885F19536DCDBB9 -Dtests.security.manager=false -Dtests.locale=no -Dtests.timezone=America/Adak -Druntime.java=19

java.lang.AssertionError: All incoming requests on node [node_s3] should have finished. Expected 0 but got 437; pending tasks [[{
  "node" : "node_s3",
  "id" : 415,
  "type" : "transport",
  "action" : "internal:index/shard/recovery/start_recovery",
  "description" : "",
  "start_time" : "2023-06-27T17:03:01.988Z",
  "start_time_in_millis" : 1687885381988,
  "running_time" : "1m",
  "running_time_in_nanos" : 65986775792,
  "cancellable" : false,
  "cancelled" : false,
  "headers" : { },
  "resource_stats" : {
    "average" : {
      "cpu_time_in_nanos" : 0,
      "memory_in_bytes" : 0
    },
    "total" : {
      "cpu_time_in_nanos" : 0,
      "memory_in_bytes" : 0
    },
    "min" : {
      "cpu_time_in_nanos" : 0,
      "memory_in_bytes" : 0
    },
    "max" : {
      "cpu_time_in_nanos" : 0,
      "memory_in_bytes" : 0
    },
    "thread_info" : {
      "thread_executions" : 0,
      "active_threads" : 0
    }
  }
}]]

	at __randomizedtesting.SeedInfo.seed([6885F19536DCDBB9:E8D91A8A3570CB6A]:0)
	at org.opensearch.test.InternalTestCluster.lambda$assertRequestsFinished$43(InternalTestCluster.java:2746)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1084)
	at org.opensearch.test.InternalTestCluster.assertRequestsFinished(InternalTestCluster.java:2737)
	at org.opensearch.test.InternalTestCluster.assertAfterTest(InternalTestCluster.java:2712)
	at org.opensearch.test.OpenSearchIntegTestCase.afterInternal(OpenSearchIntegTestCase.java:603)
	at org.opensearch.test.OpenSearchIntegTestCase.cleanUpCluster(OpenSearchIntegTestCase.java:2245)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)

To Reproduce
Steps to reproduce the behavior:

  1. Enable SR by default & run any test running suite scope - I've used ICUCollationKeywordFieldMapperIT & get this after a couple hundred iterations on shard close.
  2. I also wrote a unit test in SegmentReplicationIndexShardTests:
    public void testBeforeIndexShardClosedWhileCopyingFiles() throws Exception {
        try (ReplicationGroup shards = createGroup(1, settings, new NRTReplicationEngineFactory())) {
            shards.startAll();
            IndexShard primary = shards.getPrimary();
            final IndexShard replica = shards.getReplicas().get(0);

            primary.refresh("Test");

            final SegmentReplicationSourceFactory sourceFactory = mock(SegmentReplicationSourceFactory.class);
            final SegmentReplicationTargetService targetService = newTargetService(sourceFactory);
            SegmentReplicationSource source = new TestReplicationSource() {

                ActionListener<GetSegmentFilesResponse> listener;

                @Override
                public void getCheckpointMetadata(
                    long replicationId,
                    ReplicationCheckpoint checkpoint,
                    ActionListener<CheckpointInfoResponse> listener
                ) {
                    resolveCheckpointInfoResponseListener(listener, primary);
                }

                @Override
                public void getSegmentFiles(
                    long replicationId,
                    ReplicationCheckpoint checkpoint,
                    List<StoreFileMetadata> filesToFetch,
                    IndexShard indexShard,
                    ActionListener<GetSegmentFilesResponse> listener
                ) {
                    // set the listener, we will only fail it once we cancel the source.
                    this.listener = listener;
                    // shard is closing while we are copying files.
                    targetService.beforeIndexShardClosed(replica.shardId, replica, Settings.EMPTY);
                }

                @Override
                public void cancel() {
                    // simulate listener resolving, but only after we have issued a cancel from beforeIndexShardClosed .
                    final RuntimeException exception = new CancellableThreads.ExecutionCancelledException("retryable action was cancelled");
                    listener.onFailure(exception);
                }
            };
            when(sourceFactory.get(any())).thenReturn(source);
            startReplicationAndAssertCancellation(replica, targetService);

            shards.removeReplica(replica);
            closeShards(replica);
        }
    }

Expected behavior
All resources to be cleaned up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working distributed framework
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants