
Trim local translog in peer recovery #44756

Merged: 15 commits merged into elastic:master on Aug 3, 2019

Conversation

dnhatn
Member

@dnhatn dnhatn commented Jul 23, 2019

Today, if an operation-based peer recovery occurs, we don't trim the translog but leave it as is. Some unacknowledged operations existing in the translog of that replica might suddenly reappear when it gets promoted. With this change, we ensure the translog is trimmed above the starting sequence number of phase 2. This change also paves the way for reading the translog forward.
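A minimal, self-contained sketch of the trimming idea (plain Java, not the actual RecoveryTarget change; the Op record, trimAboveStartingSeqNo, and the exact seqNo cut-off are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the trimming idea, not Elasticsearch's Translog API.
// A replica's local translog holds operations keyed by sequence number. During an
// operation-based recovery, the primary resends history from the starting sequence
// number of phase 2, so local entries above that point are superseded and must not
// reappear if the replica is later promoted.
public class TrimTranslogSketch {

    record Op(long seqNo, String description) {}

    // Keep only operations at or below startingSeqNo; anything above it will be
    // (re)delivered by the primary in phase 2 if it belongs in the shard history.
    static List<Op> trimAboveStartingSeqNo(List<Op> translog, long startingSeqNo) {
        List<Op> kept = new ArrayList<>();
        for (Op op : translog) {
            if (op.seqNo() <= startingSeqNo) {
                kept.add(op);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Op> replicaTranslog = List.of(
            new Op(0, "index doc A"),
            new Op(1, "index doc B"),
            new Op(2, "index doc C, never acknowledged"));
        long startingSeqNo = 1; // phase 2 resends operations from here onwards
        System.out.println(trimAboveStartingSeqNo(replicaTranslog, startingSeqNo));
        // -> [Op[seqNo=0, ...], Op[seqNo=1, ...]]; the unacknowledged seqNo 2 is
        //    gone and cannot resurface if this replica is promoted to primary.
    }
}
```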

@dnhatn dnhatn added >enhancement :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 v7.4.0 labels Jul 23, 2019
@dnhatn dnhatn requested a review from ywelsch July 23, 2019 13:34
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member Author

dnhatn commented Aug 1, 2019

@ywelsch Thanks for reviewing. I have reworked this PR to trim the translog using the starting sequence number of phase 2 in the finalize step. Can you have another look?

@dnhatn dnhatn requested a review from ywelsch August 1, 2019 02:05
Contributor

@ywelsch ywelsch left a comment


I've left two more comments for discussion

@dnhatn
Member Author

dnhatn commented Aug 1, 2019

@ywelsch This is ready for another round. Can you please have another look? Thank you.

@dnhatn dnhatn requested a review from ywelsch August 1, 2019 23:26
@dnhatn
Member Author

dnhatn commented Aug 2, 2019

@elasticmachine update branch

@dnhatn
Member Author

dnhatn commented Aug 3, 2019

I prefer to address your comments about reading the translog forward in a follow-up, as we will need to add and adjust some tests. Can you take another look at this PR? Thank you!

@dnhatn dnhatn requested a review from ywelsch August 3, 2019 02:55
Contributor

@ywelsch ywelsch left a comment


LGTM

@dnhatn
Member Author

dnhatn commented Aug 3, 2019

Thanks @ywelsch.

@dnhatn dnhatn merged commit 302d29c into elastic:master Aug 3, 2019
@dnhatn dnhatn deleted the trim-translog branch August 3, 2019 20:17
@dnhatn
Member Author

dnhatn commented Aug 4, 2019

While I was working on a follow-up, I found two cases where reading translog forward can lead to divergence between translog and Lucene:

  1. The primary has sent some translog operations, but it crashes before finalizing the recovery. The global checkpoint on the recovering replica might have advanced, but its translog wasn't trimmed. If that copy retries another peer recovery, it can replay stale translog operations when recovering locally up to the global checkpoint.

  2. Similar to the first scenario, but we don't have any in-sync copy available, so we have to force-allocate the recovering replica as the primary. In this case, that copy can also replay stale translog operations.

We can solve both by trimming the translog earlier in peer recovery. However, as you pointed out, that choice would hurt us in the future. Since, with soft-deletes, we only use the translog in store recovery and local recovery (i.e., recovering a replica locally up to the global checkpoint), it might be okay to continue reading the translog backward. @ywelsch WDYT?
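To make the forward-vs-backward reading point concrete, here is a toy sketch (plain Java, not the real Translog snapshot code; the de-duplication rule is an assumed simplification): when the same sequence number appears twice — a stale local entry plus the copy resent during phase 2 — reading newest-first and de-duplicating by sequence number keeps the fresh copy, while a naive forward reader hits the stale one first and can diverge from Lucene.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only, not Elasticsearch's Translog API.
// The translog list is ordered oldest-to-newest and contains two entries for
// seqNo 5: a stale local write that was never trimmed, and the copy resent by
// the primary during a recovery that failed before the finalize step.
public class TranslogReadOrderSketch {

    record Op(long seqNo, String source) {}

    // Read newest-to-oldest, keeping only the first (newest) entry per seqNo.
    static List<Op> readBackwardDeduplicated(List<Op> translog) {
        Set<Long> seenSeqNos = new HashSet<>();
        List<Op> replayed = new ArrayList<>();
        for (int i = translog.size() - 1; i >= 0; i--) {
            Op op = translog.get(i);
            if (seenSeqNos.add(op.seqNo())) {
                replayed.add(op);
            }
        }
        return replayed;
    }

    public static void main(String[] args) {
        List<Op> translog = List.of(
            new Op(4, "acknowledged local write"),
            new Op(5, "stale, unacknowledged local write"),
            new Op(5, "same seqNo, resent by the primary in phase 2"));
        System.out.println(readBackwardDeduplicated(translog));
        // Backward reading replays the resent copy of seqNo 5. A forward reader
        // that does not trim the stale entry first would apply the old copy and
        // diverge from what the primary (and Lucene) consider the history.
    }
}
```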

@ywelsch
Contributor

ywelsch commented Aug 5, 2019

That's unfortunate 😿. To restate the problem: The ability to read the translog forwards is reestablished at the end of peer recovery, but is violated during the recovery. If the recovery fails mid-way through, the shard is left in a state where reading the translog forwards causes inconsistencies.

I can't think of any workaround that would not be either too complex to implement or have other tricky implications. Initially I wondered whether a notion of uncommitted translog would help. Finalize recovery would then mark the translog generations as committed and opening a translog would discard uncommitted generations. This has other problems though, namely that the persisted local checkpoint shouldn't advance when there are uncommitted translog generations.

I guess we will have to live with backwards reading for now.

dnhatn added a commit that referenced this pull request Aug 5, 2019
testShouldFlushAfterPeerRecovery was added in #28350 to make sure the
flushing loop triggered by afterWriteOperation eventually terminates.
This test relies on the fact that we call afterWriteOperation after
making changes in the translog. In #44756, we roll a new generation in
RecoveryTarget#finalizeRecovery but do not call afterWriteOperation.

Relates #28350
Relates #45073
dnhatn added a commit that referenced this pull request Aug 11, 2019
Today, if an operation-based peer recovery occurs, we don't trim the
translog but leave it as is. Some unacknowledged operations existing in
the translog of that replica might suddenly reappear when it gets promoted.
With this change, we ensure the translog is trimmed above the starting
sequence number of phase 2. This change also paves the way for reading
the translog forward.
dnhatn added a commit that referenced this pull request Aug 11, 2019
testShouldFlushAfterPeerRecovery was added in #28350 to make sure the
flushing loop triggered by afterWriteOperation eventually terminates.
This test relies on the fact that we call afterWriteOperation after
making changes in the translog. In #44756, we roll a new generation in
RecoveryTarget#finalizeRecovery but do not call afterWriteOperation.

Relates #28350
Relates #45073
dnhatn added a commit that referenced this pull request Aug 11, 2019