[BUG] org.opensearch.indices.replication.SegmentReplicationIT.testReplicationPostDeleteAndForceMerge is flaky #8100

sachinpkale · 2023-06-16T09:08:30Z

Describe the bug
org.opensearch.indices.replication.SegmentReplicationIT.testReplicationPostDeleteAndForceMerge is flaky

To Reproduce

There isn't a fixed seed with which it can be reproduced. If we run the test multiple times, it fails occasionally.

java.lang.AssertionError: Unexpected AlreadyClosedException
	at __randomizedtesting.SeedInfo.seed([D263C74CAF37CC20]:0)
	at org.opensearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:2084)
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1767)
	at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1734)
	at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4243)
	at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:992)
	at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1125)
	at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: org.apache.lucene.store.AlreadyClosedException: engine is closed
	at org.opensearch.index.shard.IndexShard.getEngine(IndexShard.java:3346)
	at org.opensearch.index.shard.IndexShard.lambda$getLatestSegmentInfosAndCheckpoint$7(IndexShard.java:1601)
	at java.base/java.util.Optional.map(Optional.java:260)
	at org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1587)
	at org.opensearch.index.shard.IndexShard.getLatestReplicationCheckpoint(IndexShard.java:1557)
	at org.opensearch.index.shard.CheckpointRefreshListener.afterRefresh(CheckpointRefreshListener.java:47)
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
	at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:432)
	at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:412)
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213)
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1755)
	... 9 more


mch2 · 2023-06-19T16:48:05Z

From the stack trace you pasted, the shard is in afterRefresh while it is concurrently closing. getLatestSegmentInfosAndCheckpoint is eventually invoked and fails at line 1601 on the call to getEngine().config().getCodec().getName() because the engine is already closed. The method already has logic to return empty if the shard is not open, but the shard is closed after that check has passed. So we either need to ensure the shard cannot be closed while this method is running or, if we don't want to block shard close for this, catch the error and return gracefully.
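
To make the check-then-act window concrete, here is a minimal, self-contained sketch of the second option (catch and return empty). Everything in it is a hypothetical stand-in: `Shard`, `codecName()`, the `betweenCheckAndUse` hook and the local `AlreadyClosedException` are invented for illustration and are not the real IndexShard or Lucene classes, and this is not necessarily how #8134 addresses it.

```java
import java.util.Optional;

// Illustrative sketch only: every class here is a simplified, hypothetical stand-in,
// not the real OpenSearch IndexShard or InternalEngine.
class CheckpointRaceSketch {

    // Local stand-in for org.apache.lucene.store.AlreadyClosedException.
    static class AlreadyClosedException extends IllegalStateException {
        AlreadyClosedException(String message) {
            super(message);
        }
    }

    static class Shard {
        private volatile boolean open = true;

        void close() {
            open = false;
        }

        // Mirrors the failing call chain: throws once the shard/engine has been closed.
        private String codecName() {
            if (!open) {
                throw new AlreadyClosedException("engine is closed");
            }
            return "example-codec";
        }

        // Check-then-act: the open check passes, but the shard may close before codecName() runs.
        // The hook simulates whatever another thread does inside that window.
        Optional<String> latestCheckpointUnsafe(Runnable betweenCheckAndUse) {
            if (!open) {
                return Optional.empty();
            }
            betweenCheckAndUse.run();
            return Optional.of(codecName());
        }

        // Does not block close; a close that lands after the check is treated as "no checkpoint".
        Optional<String> latestCheckpointGraceful(Runnable betweenCheckAndUse) {
            if (!open) {
                return Optional.empty();
            }
            betweenCheckAndUse.run();
            try {
                return Optional.of(codecName());
            } catch (AlreadyClosedException e) {
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        Shard unsafe = new Shard();
        try {
            unsafe.latestCheckpointUnsafe(unsafe::close);
        } catch (AlreadyClosedException e) {
            System.out.println("unsafe variant threw: " + e.getMessage());
        }

        Shard graceful = new Shard();
        System.out.println("graceful variant returned: "
            + graceful.latestCheckpointGraceful(graceful::close));
    }
}
```

The hook only exists to force the interleaving deterministically; in the real failure the close comes from another thread. The first option (preventing close while the method runs) would need some form of lock or reference held across the check and the engine access, which is the blocking trade-off mentioned above.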

kotwanikunal · 2023-07-05T16:59:22Z

Fixed with #8134

sachinpkale added the bug and untriaged labels Jun 16, 2023

dreamer-89 added this to Segment Replication Jun 16, 2023

github-project-automation bot moved this to Todo in Segment Replication Jun 16, 2023

sachinpkale mentioned this issue Jun 19, 2023 in "Fix SegmentReplication flaky integ tests #8134" (merged, 6 tasks)

kartg added the distributed framework and flaky-test labels Jun 20, 2023

anasalkouz removed the untriaged label Jun 20, 2023

dreamer-89 mentioned this issue Jun 27, 2023 in "[Meta] Segment Replication flaky test failures #8279" (closed, 16 tasks)

kotwanikunal self-assigned this Jun 29, 2023

kotwanikunal moved this from Todo to In Progress in Segment Replication Jun 29, 2023

kotwanikunal closed this as completed Jul 5, 2023

github-project-automation bot moved this from In Progress to Done in Segment Replication Jul 5, 2023
