Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Closing an index with ongoing primary failover can lead to deadlocks #11869

Open
linuxpi opened this issue Jan 12, 2024 · 4 comments
Open
Assignees
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Indexing:Replication Issues and PRs related to core replication framework eg segrep

Comments

@linuxpi
Copy link
Collaborator

linuxpi commented Jan 12, 2024

Describe the bug

While replica is being promoted to primary to primary failover we create a new engine pointing to last successful commit and replay the translog till global checkpoint. During this processing we acquire multiple locks.

Around the same time if CLOSE is triggered on the index, the close flow and the failover flow at some pointing are competing for the same locks from each other, which in certain race conditions can lead to a deadlock.

Failover Thread

The failover thread gets BLOCKED at acquiring the engineMutex lock at the following point

@Override
public GatedCloseable<SegmentInfos> getSegmentInfosSnapshot() {
synchronized (engineMutex) {
if (newEngineReference.get() == null) {
throw new AlreadyClosedException("engine was closed");
}
return newEngineReference.get().getSegmentInfosSnapshot();
}
}
@Override

It is already holding a readLock on the new engine being created as part of the failover. This lock is acquired here

/**
* Performs recovery from the transaction log up to {@code recoverUpToSeqNo} (inclusive).
* This operation will close the engine if the recovery fails.
* @param translogRecoveryRunner the translog recovery runner
* @param recoverUpToSeqNo the upper bound, inclusive, of sequence number to be recovered
* @return the total number of operations recovered
*/
@Override
public int recoverFromTranslog(TranslogRecoveryRunner translogRecoveryRunner, long localCheckpoint, long recoverUpToSeqNo)
throws IOException {
int opsRecovered = 0;
translogEventListener.onBeginTranslogRecovery();
try (ReleasableLock ignored = readLock.acquire()) {
engineLifeCycleAware.ensureOpen();
if (pendingTranslogRecovery.get() == false) {
throw new IllegalStateException("Engine has already been recovered");
}
try {
opsRecovered = recoverFromTranslogInternal(translogRecoveryRunner, localCheckpoint, recoverUpToSeqNo);
} catch (Exception e) {
try {
pendingTranslogRecovery.set(true); // just play safe and never allow commits on this see #ensureCanFlush
translogEventListener.onFailure("failed to recover from translog", e);
} catch (Exception inner) {
e.addSuppressed(inner);
}
throw e;
}
}
return opsRecovered;
}

IndexShard Close Thread

The close thread acquires the engineMutex lock at the following point

public void close(String reason, boolean flushEngine, boolean deleted) throws IOException {
synchronized (engineMutex) {
try {
synchronized (mutex) {
changeState(IndexShardState.CLOSED, reason);
}
} finally {
final Engine engine = this.currentEngineReference.getAndSet(null);
try {
if (engine != null && flushEngine) {
engine.flushAndClose();
}
} finally {
// playing safe here and close the engine even if the above succeeds - close can be called multiple times
// Also closing refreshListeners to prevent us from accumulating any more listeners
IOUtils.close(engine, globalCheckpointListeners, refreshListeners, pendingReplicationActions);
if (deleted && engine != null && isPrimaryMode()) {
// Translog Clean up
engine.translogManager().onDelete();
}
indexShardOperationPermits.close();
}
}
}
}

and tries to acquire writeLock on the new engine further in the close flow.

@Override
public void close() throws IOException {
if (isClosed.get() == false) { // don't acquire the write lock if we are already closed
logger.debug("close now acquiring writeLock");
try (ReleasableLock lock = writeLock.acquire()) {
logger.debug("close acquired writeLock");
closeNoLock("api", closedLatch);
}
}
awaitPendingClose();
}

At this point, both threads are in a deadlock state.

Thread Dumps

"opensearch[node_s2][generic][T#3]" ID=4135 BLOCKED on java.lang.Object@3c0fa470 owned by "opensearch[node_s2][indices_shutdown][T#1]" ID=4183
	at app//org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
	- blocked on java.lang.Object@3c0fa470
	at app//org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
	at app//org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
	at app//org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
	at app//org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
	at app//org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
	at app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
	at app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
	at app//org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
	at app//org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
	at app//org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
	at app//org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca36238.run(Unknown Source)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
	at app//org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca11738.accept(Unknown Source)
	at app//org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at app//org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
	at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.1/java.lang.Thread.runWith(Thread.java:1596)
	at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
	Locked synchronizers:
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@28675992
	- java.util.concurrent.ThreadPoolExecutor$Worker@1197c3a3
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@57ac517a
 20) Thread[id=4183, name=opensearch[node_s2][indices_shutdown][T#1], state=WAITING, group=TGRP-RemoteStoreRestoreIT]
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
        at java.base/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
        at org.opensearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:69)
        at org.opensearch.index.engine.Engine.close(Engine.java:2070)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:81)
        at org.opensearch.index.shard.IndexShard$11.close(IndexShard.java:4785)
        at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:2059)
        at org.opensearch.index.shard.IndexShard.close(IndexShard.java:2003)
        at org.opensearch.index.IndexService.closeShard(IndexService.java:642)
        at org.opensearch.index.IndexService.removeShard(IndexService.java:618)
        at org.opensearch.index.IndexService.close(IndexService.java:389)
        at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1047)
        at org.opensearch.indices.IndicesService.lambda$doStop$3(IndicesService.java:542)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

Found this issue in the integ test during build - https://build.ci.opensearch.org/job/gradle-check/31461/testReport/junit/org.opensearch.remotestore/RemoteStoreRestoreIT/testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown/

Related component

Other

To Reproduce

Still working on getting repro steps as this seems to be race condition causing the issue. Might be able to repro by slowing down translog recovery and closing the index while recovery is in progress.

Expected behavior

Closing the index while ongoing primary failover should never lead to deadlock

Additional Details

No response

@linuxpi
Copy link
Collaborator Author

linuxpi commented Jan 17, 2024

Found another instance of deadlock while running RemoteStoreRestoreIT.testRTSRestoreWithRefreshedDataPrimaryReplicaDown

"opensearch[node_s2][generic][T#1]" ID=377 BLOCKED on java.lang.Object@7caaa3de owned by "opensearch[node_s2][indices_shutdown][T#1]" ID=431
	at app//org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4755)
	- blocked on java.lang.Object@7caaa3de
	at app//org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5102)
	at app//org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1669)
	at app//org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4677)
	at app//org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4671)
	at app//org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
	at app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
	at app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
	at app//org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1771)
	at app//org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1886)
	at app//org.opensearch.index.engine.InternalEngine$1.onAfterTranslogRecovery(InternalEngine.java:270)
	at app//org.opensearch.index.translog.listener.CompositeTranslogEventListener.onAfterTranslogRecovery(CompositeTranslogEventListener.java:63)
	at app//org.opensearch.index.translog.InternalTranslogManager.recoverFromTranslogInternal(InternalTranslogManager.java:168)
	at app//org.opensearch.index.translog.InternalTranslogManager.recoverFromTranslog(InternalTranslogManager.java:132)
	at app//org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4801)
	at app//org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:718)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f86449e6288.run(Unknown Source)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4066)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4036)
	at app//org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$32(IndexShard.java:3987)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f86449e7178.accept(Unknown Source)
	at app//org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at app//org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913)
	at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.1/java.lang.Thread.runWith(Thread.java:1596)
	at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
	Locked synchronizers:
	- java.util.concurrent.ThreadPoolExecutor$Worker@6cf48ee0
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@538b6f5
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@22bfe0aa

"opensearch[node_s2][indices_shutdown][T#1]" ID=431 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@68898ecb
	at java.base@21.0.1/jdk.internal.misc.Unsafe.park(Native Method)
	- waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@68898ecb
	at java.base@21.0.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
	at java.base@21.0.1/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
	at java.base@21.0.1/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
	at java.base@21.0.1/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
	at app//org.opensearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:69)
	at app//org.opensearch.index.engine.Engine.close(Engine.java:2018)
	at app//org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
	at app//org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
	at app//org.opensearch.common.util.io.IOUtils.close(IOUtils.java:81)
	at app//org.opensearch.index.shard.IndexShard$11.close(IndexShard.java:4772)
	at app//org.opensearch.index.engine.Engine.flushAndClose(Engine.java:2007)
	at app//org.opensearch.index.shard.IndexShard.close(IndexShard.java:1996)
	- locked java.lang.Object@7caaa3de
	at app//org.opensearch.index.IndexService.closeShard(IndexService.java:642)
	at app//org.opensearch.index.IndexService.removeShard(IndexService.java:618)
	- locked org.opensearch.index.IndexService@36641ece
	at app//org.opensearch.index.IndexService.close(IndexService.java:389)
	- locked org.opensearch.index.IndexService@36641ece
	at app//org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1063)
	at app//org.opensearch.indices.IndicesService.lambda$doStop$4(IndicesService.java:559)
	at app//org.opensearch.indices.IndicesService$$Lambda/0x00007f86449def28.run(Unknown Source)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.1/java.lang.Thread.runWith(Thread.java:1596)
	at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
	Locked synchronizers:
	- java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@10659274
	- java.util.concurrent.ThreadPoolExecutor$Worker@2a84dc1b

Test failed with following error

java.lang.Exception: Test abandoned because suite timeout was reached.
	at __randomizedtesting.SeedInfo.seed([61B84451E1E0BDF2]:0)

@peternied peternied added the flaky-test Random test failure that succeeds on second run label Jan 24, 2024
@peternied
Copy link
Member

[Triage - attendees 1 2]
@linuxpi thanks for filing and providing detail on the deadlocking scenario

@peternied peternied added Indexing Indexing, Bulk Indexing and anything related to indexing and removed untriaged labels Jan 24, 2024
@mch2
Copy link
Member

mch2 commented Feb 6, 2024

@linuxpi I have seen this as well but not able to get a repeatable test. In any case I don't think we need the engineMutex while getSegmentInfosSnapshot is invoked on this RO engine. If the engine ref has not yet been updated we can simply throw the AlreadyClosed exception and callers can handle it. In this case that is when recomputing the checkpoint on flush and close which already handles the AlreadyClosedException. We want to still delegate to the newEngineRef if its set so that the refresh newEngineReference.get().refresh("reset_engine"); refers to the reader on this engine instead of fetching from the RO engine.

Will put up a pr to fix.

@linuxpi
Copy link
Collaborator Author

linuxpi commented Feb 8, 2024

Thanks @mch2 . I was able to repro this for multiple flaky tests i was trying to fix. This usually shows up when you run with more than 500 or even 1000 iterations.

For the fix, certainly removing the lock would help and we should do it if the lock is unnecessary here.

But i was thinking if we should also revisit the locking order in IndexShard for various flows to ensure the locking order is correct.

@ankitkala ankitkala added Indexing:Replication Issues and PRs related to core replication framework eg segrep and removed Indexing Indexing, Bulk Indexing and anything related to indexing labels Feb 15, 2024
@andrross andrross removed the Other label Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Indexing:Replication Issues and PRs related to core replication framework eg segrep
Projects
None yet
Development

No branches or pull requests

5 participants