
Trim local translog in peer recovery #44756

Merged: 15 commits merged into elastic:master on Aug 3, 2019

Conversation

dnhatn
Member

@dnhatn dnhatn commented Jul 23, 2019

Today, if an operation-based peer recovery occurs, we don't trim the translog but leave it as is. Some unacknowledged operations existing in the translog of that replica might suddenly reappear when it gets promoted. With this change, we ensure the translog is trimmed above the starting sequence number of phase 2. This change also paves the way for reading the translog forward.
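A minimal, self-contained sketch of the trimming idea (plain Java, not the actual RecoveryTarget change; the Op record, trimAboveStartingSeqNo, and the exact seqNo cut-off are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the trimming idea, not Elasticsearch's Translog API.
// A replica's local translog holds operations keyed by sequence number. During an
// operation-based recovery, the primary resends history from the starting sequence
// number of phase 2, so local entries above that point are superseded and must not
// reappear if the replica is later promoted.
public class TrimTranslogSketch {

    record Op(long seqNo, String description) {}

    // Keep only operations at or below startingSeqNo; anything above it will be
    // (re)delivered by the primary in phase 2 if it belongs in the shard history.
    static List<Op> trimAboveStartingSeqNo(List<Op> translog, long startingSeqNo) {
        List<Op> kept = new ArrayList<>();
        for (Op op : translog) {
            if (op.seqNo() <= startingSeqNo) {
                kept.add(op);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Op> replicaTranslog = List.of(
            new Op(0, "index doc A"),
            new Op(1, "index doc B"),
            new Op(2, "index doc C, never acknowledged"));
        long startingSeqNo = 1; // phase 2 resends operations from here onwards
        System.out.println(trimAboveStartingSeqNo(replicaTranslog, startingSeqNo));
        // -> [Op[seqNo=0, ...], Op[seqNo=1, ...]]; the unacknowledged seqNo 2 is
        //    gone and cannot resurface if this replica is promoted to primary.
    }
}
```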

@dnhatn dnhatn added >enhancement :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 v7.4.0 labels Jul 23, 2019
@dnhatn dnhatn requested a review from ywelsch July 23, 2019 13:34
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member Author

dnhatn commented Aug 1, 2019

@ywelsch Thanks for reviewing. I have reworked this PR to trim the translog using the starting sequence number of phase 2 in the finalize step. Can you have another look?

@dnhatn dnhatn requested a review from ywelsch August 1, 2019 02:05
Contributor

@ywelsch ywelsch left a comment


I've left two more comments for discussion

@dnhatn
Member Author

dnhatn commented Aug 1, 2019

@ywelsch This is ready for another round. Can you please have another look? Thank you.

@dnhatn dnhatn requested a review from ywelsch August 1, 2019 23:26
@dnhatn
Member Author

dnhatn commented Aug 2, 2019

@elasticmachine update branch

@dnhatn
Member Author

dnhatn commented Aug 3, 2019

I prefer to address your comments about reading the translog forward in a follow-up, as we will need to add and adjust some tests. Can you take another look at this PR? Thank you!

@dnhatn dnhatn requested a review from ywelsch August 3, 2019 02:55
Contributor

@ywelsch ywelsch left a comment


LGTM

@dnhatn
Member Author

dnhatn commented Aug 3, 2019

Thanks @ywelsch.

@dnhatn dnhatn merged commit 302d29c into elastic:master Aug 3, 2019
@dnhatn dnhatn deleted the trim-translog branch August 3, 2019 20:17
@dnhatn
Member Author

dnhatn commented Aug 4, 2019

While I was working on a follow-up, I found two cases where reading translog forward can lead to divergence between translog and Lucene:

  1. The primary has sent some translog operations, but it crashes before finalizing the recovery. The global checkpoint on the recovering replica might have advanced, but its translog wasn't trimmed. If that copy retries another peer recovery, it can replay stale translog operations when recovering locally up to the global checkpoint.

  2. Similar to the first scenario, but we don't have any in-sync copy available, so we have to force-allocate the recovering replica as the primary. In this case, that copy can also replay stale translog operations.

We can solve both by trimming the translog earlier in peer recovery. However, as you pointed out, that choice would hurt us in the future. Since, with soft-deletes, we only use the translog in store recovery and local recovery (i.e., recovering a replica locally up to the global checkpoint), it might be okay to continue reading the translog backward. @ywelsch WDYT?
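To make the forward-vs-backward reading point concrete, here is a toy sketch (plain Java, not the real Translog snapshot code; the de-duplication rule is an assumed simplification): when the same sequence number appears twice — a stale local entry plus the copy resent during phase 2 — reading newest-first and de-duplicating by sequence number keeps the fresh copy, while a naive forward reader hits the stale one first and can diverge from Lucene.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only, not Elasticsearch's Translog API.
// The translog list is ordered oldest-to-newest and contains two entries for
// seqNo 5: a stale local write that was never trimmed, and the copy resent by
// the primary during a recovery that failed before the finalize step.
public class TranslogReadOrderSketch {

    record Op(long seqNo, String source) {}

    // Read newest-to-oldest, keeping only the first (newest) entry per seqNo.
    static List<Op> readBackwardDeduplicated(List<Op> translog) {
        Set<Long> seenSeqNos = new HashSet<>();
        List<Op> replayed = new ArrayList<>();
        for (int i = translog.size() - 1; i >= 0; i--) {
            Op op = translog.get(i);
            if (seenSeqNos.add(op.seqNo())) {
                replayed.add(op);
            }
        }
        return replayed;
    }

    public static void main(String[] args) {
        List<Op> translog = List.of(
            new Op(4, "acknowledged local write"),
            new Op(5, "stale, unacknowledged local write"),
            new Op(5, "same seqNo, resent by the primary in phase 2"));
        System.out.println(readBackwardDeduplicated(translog));
        // Backward reading replays the resent copy of seqNo 5. A forward reader
        // that does not trim the stale entry first would apply the old copy and
        // diverge from what the primary (and Lucene) consider the history.
    }
}
```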

@ywelsch
Contributor

ywelsch commented Aug 5, 2019

That's unfortunate 😿. To restate the problem: The ability to read the translog forwards is reestablished at the end of peer recovery, but is violated during the recovery. If the recovery fails mid-way through, the shard is left in a state where reading the translog forwards causes inconsistencies.

I can't think of any workaround that would not be either too complex to implement or have other tricky implications. Initially I wondered whether a notion of uncommitted translog would help. Finalize recovery would then mark the translog generations as committed and opening a translog would discard uncommitted generations. This has other problems though, namely that the persisted local checkpoint shouldn't advance when there are uncommitted translog generations.

I guess we will have to live with backwards reading for now.

dnhatn added a commit that referenced this pull request Aug 5, 2019
testShouldFlushAfterPeerRecovery was added in #28350 to make sure the
flushing loop triggered by afterWriteOperation eventually terminates.
This test relies on the fact that we call afterWriteOperation after
making changes in the translog. In #44756, we roll a new generation in
RecoveryTarget#finalizeRecovery but do not call afterWriteOperation.

Relates #28350
Relates #45073
dnhatn added a commit that referenced this pull request Aug 11, 2019
Today, if an operation-based peer recovery occurs, we don't trim the
translog but leave it as is. Some unacknowledged operations existing in
the translog of that replica might suddenly reappear when it gets promoted.
With this change, we ensure the translog is trimmed above the starting
sequence number of phase 2. This change also paves the way for reading
the translog forward.
dnhatn added a commit that referenced this pull request Aug 11, 2019
testShouldFlushAfterPeerRecovery was added in #28350 to make sure the
flushing loop triggered by afterWriteOperation eventually terminates.
This test relies on the fact that we call afterWriteOperation after
making changes in the translog. In #44756, we roll a new generation in
RecoveryTarget#finalizeRecovery but do not call afterWriteOperation.

Relates #28350
Relates #45073
dnhatn added a commit that referenced this pull request Aug 11, 2019