[BUG] org.opensearch.indices.replication.SegmentReplicationIT.testReplicationPostDeleteAndForceMerge is flaky #8100

Closed
sachinpkale opened this issue Jun 16, 2023 · 2 comments
Assignees
Labels
bug (Something isn't working), distributed framework, flaky-test (Random test failure that succeeds on second run)

Comments

@sachinpkale
Member

Describe the bug
org.opensearch.indices.replication.SegmentReplicationIT.testReplicationPostDeleteAndForceMerge is flaky

To Reproduce

  • No fixed seed reproduces the failure. The test fails occasionally when it is run multiple times.
java.lang.AssertionError: Unexpected AlreadyClosedException
	at __randomizedtesting.SeedInfo.seed([D263C74CAF37CC20]:0)
	at org.opensearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:2084)
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1767)
	at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1734)
	at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4243)
	at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:992)
	at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1125)
	at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: org.apache.lucene.store.AlreadyClosedException: engine is closed
	at org.opensearch.index.shard.IndexShard.getEngine(IndexShard.java:3346)
	at org.opensearch.index.shard.IndexShard.lambda$getLatestSegmentInfosAndCheckpoint$7(IndexShard.java:1601)
	at java.base/java.util.Optional.map(Optional.java:260)
	at org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1587)
	at org.opensearch.index.shard.IndexShard.getLatestReplicationCheckpoint(IndexShard.java:1557)
	at org.opensearch.index.shard.CheckpointRefreshListener.afterRefresh(CheckpointRefreshListener.java:47)
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
	at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:432)
	at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:412)
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213)
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1755)
	... 9 more
@mch2
Member

mch2 commented Jun 19, 2023

From the stack trace you pasted, the shard is in afterRefresh while it is concurrently closing. The getLatestSegmentInfosAndCheckpoint method is eventually invoked and fails at IndexShard.java:1601 on the call to getEngine().config().getCodec().getName() because the engine is already closed. The method already has logic to return empty if the shard is not open, but the shard is closed after that check passes. So we need to either ensure the shard is not closed while this method is running or, if we don't want to block shard close for this, catch the error and return gracefully.
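
To make the race concrete, here is a minimal, dependency-free sketch of the second option (catch the error and return empty). Everything in it is a hypothetical stand-in: `Shard`, `EngineClosedException`, `latestCheckpoint`, and the `betweenCheckAndUse` hook are illustration names, not OpenSearch or Lucene APIs, and this is not necessarily what the eventual fix does. The hook simulates another thread closing the shard after the open check has passed.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

class CheckpointRaceSketch {

    /** Hypothetical stand-in for Lucene's AlreadyClosedException. */
    static class EngineClosedException extends IllegalStateException {
        EngineClosedException(String message) {
            super(message);
        }
    }

    /** Hypothetical, heavily simplified stand-in for a shard holding an engine. */
    static class Shard {
        // A null value models a closed engine.
        private final AtomicReference<String> codecName = new AtomicReference<>("example-codec");

        boolean isOpen() {
            return codecName.get() != null;
        }

        void close() {
            codecName.set(null);
        }

        /** Throws if the engine was closed, like getEngine() in the trace. */
        String engineCodecName() {
            String name = codecName.get();
            if (name == null) {
                throw new EngineClosedException("engine is closed");
            }
            return name;
        }

        /** Check-then-act without a guard: a close between the two steps escapes as an exception. */
        Optional<String> latestCheckpointUnguarded(Runnable betweenCheckAndUse) {
            if (!isOpen()) {
                return Optional.empty();
            }
            betweenCheckAndUse.run(); // simulates a concurrent close
            return Optional.of("checkpoint[codec=" + engineCodecName() + "]");
        }

        /** Same logic, but a close that wins the race is treated like "shard not open". */
        Optional<String> latestCheckpoint(Runnable betweenCheckAndUse) {
            if (!isOpen()) {
                return Optional.empty();
            }
            betweenCheckAndUse.run(); // simulates a concurrent close
            try {
                return Optional.of("checkpoint[codec=" + engineCodecName() + "]");
            } catch (EngineClosedException e) {
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        Shard open = new Shard();
        System.out.println(open.latestCheckpoint(() -> {}));

        Shard racing = new Shard();
        System.out.println(racing.latestCheckpoint(racing::close));
    }
}
```

The trade-off is the one described above: catching the exception keeps shard close non-blocking, at the cost of the refresh listener occasionally seeing an empty result for a shard that passed the open check a moment earlier.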

@kartg added the distributed framework and flaky-test labels on Jun 20, 2023
@kotwanikunal self-assigned this on Jun 29, 2023
@kotwanikunal moved this from Todo to In Progress in Segment Replication on Jun 29, 2023
@kotwanikunal
Member

Fixed with #8134

@github-project-automation github-project-automation bot moved this from In Progress to Done in Segment Replication Jul 5, 2023
Projects
Status: Done
5 participants