-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trim local translog in peer recovery #44756
Conversation
Pinging @elastic/es-distributed |
test/framework/src/main/java/org/elasticsearch/index/engine/EngineTestCase.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
Outdated
Show resolved
Hide resolved
@ywelsch Thanks for reviewing. I have reworked this PR to trim translog using the starting sequence number of phase2 in the finalize step. Can you have another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left two more comments for discussion
server/src/main/java/org/elasticsearch/index/translog/MultiSnapshot.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
Outdated
Show resolved
Hide resolved
This reverts commit 85db659.
@ywelsch This is ready for another round. Can you please have another look? Thank you. |
@elasticmachine update branch |
server/src/main/java/org/elasticsearch/index/translog/MultiSnapshot.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/translog/BackwardMultiSnapshot.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/translog/BackwardMultiSnapshot.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/translog/BackwardMultiSnapshot.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/translog/Translog.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/translog/Translog.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/translog/MultiSnapshot.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RecoveryFinalizeRecoveryRequest.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
Outdated
Show resolved
Hide resolved
I prefer to make the translog reading with your comments in a follow-up. We will need to add and adjust some tests. Can you take another look at this PR? Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
server/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
Outdated
Show resolved
Hide resolved
Thanks @ywelsch. |
While I was working on a follow-up, I found two cases where reading translog forward can lead to divergence between translog and Lucene:
We can solve both by trimming translog earlier in peer recovery. However, as you pointed out that choice would hurt us in the future. With soft-deletes, we use translog in store recovery and local recovery (i.e., locally replica up to the global checkpoint), it might be okay to continue reading translog backward. @ywelsch WDYT? |
That's unfortunate 😿. To restate the problem: The ability to read the translog forwards is reestablished at the end of peer recovery, but is violated during the recovery. If the recovery fails mid-way through, the shard is left in a state where reading the translog forwards causes inconsistencies. I can't think of any workaround that would not be either too complex to implement or have other tricky implications. Initially I wondered whether a notion of uncommitted translog would help. Finalize recovery would then mark the translog generations as committed and opening a translog would discard uncommitted generations. This has other problems though, namely that the persisted local checkpoint shouldn't advance when there are uncommitted translog generations. I guess we will have to live with backwards reading for now. |
testShouldFlushAfterPeerRecovery was added #28350 to make sure the flushing loop triggered by afterWriteOperation eventually terminates. This test relies on the fact that we call afterWriteOperation after making changes in translog. In #44756, we roll a new generation in RecoveryTarget#finalizeRecovery but do not call afterWriteOperation. Relates #28350 Relates #45073
Today, if an operation-based peer recovery occurs, we won't trim translog but leave it as is. Some unacknowledged operations existing in translog of that replica might suddenly reappear when it gets promoted. With this change, we ensure trimming translog above the starting sequence number of phase 2. This change can allow us to read translog forward.
testShouldFlushAfterPeerRecovery was added #28350 to make sure the flushing loop triggered by afterWriteOperation eventually terminates. This test relies on the fact that we call afterWriteOperation after making changes in translog. In #44756, we roll a new generation in RecoveryTarget#finalizeRecovery but do not call afterWriteOperation. Relates #28350 Relates #45073
Today, if an operation-based peer recovery occurs, we won't trim translog but leave it as is. Some unacknowledged operations existing in translog of that replica might suddenly reappear when it gets promoted. With this change, we ensure trimming translog above the starting sequence number of phase 2. This change can allow us to read translog forward.