Use Lucene soft-deletes in peer recovery #30522
```diff
@@ -147,7 +147,7 @@ public RecoveryResponse recoverToTarget() throws IOException {
         }, shardId + " validating recovery target ["+ request.targetAllocationId() + "] registered ", shard, cancellableThreads, logger);
         /*
          * DISCUSS:
-         * There is a major difference when acquiring translog and Lucene retention lock.
+         * There is a major difference between translog and Lucene retention lock.
          * Once translog retention lock is acquired, no translog operations will be trimmed.
          * However, this is not true for Lucene retention lock if there are already pending or scheduled merges before the lock is acquired.
          * The actual problem is the method `isTranslogReadyForSequenceNumberBasedRecovery` can return true but then the snapshot in phase2,
```
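The race described in this comment can be hard to picture. Below is a toy model (not Elasticsearch code; all names are hypothetical) of why a Lucene retention lock does not freeze history the way a translog retention lock does: a merge that was scheduled *before* the lock was acquired captured an older retention decision and may still prune soft-deleted ops after the pre-flight check has seen them.

```java
import java.util.ArrayList;
import java.util.List;

public class RetentionRaceSketch {
    // Minimal stand-in for a Lucene history protected by a retention policy.
    static class LuceneHistory {
        final List<Long> retainedSeqNos = new ArrayList<>();
        boolean locked = false; // the lock stops the retention floor from advancing...

        void add(long seqNo) { retainedSeqNos.add(seqNo); }

        void acquireRetentionLock() { locked = true; }

        // ...but a merge scheduled earlier already captured an older floor and
        // will still drop ops below that floor when it finally runs.
        void runPendingMerge(long floorSeenAtScheduleTime) {
            retainedSeqNos.removeIf(seq -> seq < floorSeenAtScheduleTime);
        }

        boolean hasOp(long seqNo) { return retainedSeqNos.contains(seqNo); }
    }

    static String demo() {
        LuceneHistory history = new LuceneHistory();
        for (long seq = 0; seq < 10; seq++) history.add(seq);

        long floorAtScheduleTime = 5; // a merge was scheduled while the floor was 5

        history.acquireRetentionLock();
        boolean readyCheckSeesOp3 = history.hasOp(3); // pre-flight check passes

        history.runPendingMerge(floorAtScheduleTime); // merge runs under the lock
        boolean snapshotSeesOp3 = history.hasOp(3);   // ...but the op is gone

        return readyCheckSeesOp3 + " " + snapshotSeesOp3;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true false
    }
}
```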
```diff
@@ -157,7 +157,7 @@ public RecoveryResponse recoverToTarget() throws IOException {
          *
          * I see two options for this:
          *
-         * 1. We keep the snapshot which is used in isTranslogReadyForSequenceNumberBasedRecovery, then concat it with a new snapshot
+         * 1. We keep the snapshot which is used in `isTranslogReadyForSequenceNumberBasedRecovery`, then concat it with a new snapshot
          *    in phase2. The combined snapshot is guaranteed to have all required operations.
          *    This requires a new method "reset" in Translog#Snapshot. However, I feel this not a clean solution.
          *
```
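A minimal sketch of option 1 above, with `Translog.Snapshot` modeled as a tiny iterator-like interface: drain the snapshot consumed by the pre-flight check, then a fresh snapshot taken in phase2, so the combined view contains all required operations. Everything here is illustrative, not the real `Translog` API.

```java
import java.util.Iterator;
import java.util.List;

public class ConcatSnapshotSketch {
    // Stand-in for Translog.Snapshot: yields the next op's seq#, or null when exhausted.
    interface Snapshot {
        Long next();
    }

    static Snapshot fromList(List<Long> ops) {
        Iterator<Long> it = ops.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // Drains the first snapshot, then the second. Duplicates are possible and
    // would be de-duplicated by seq# on the target in the real system.
    static Snapshot concat(Snapshot first, Snapshot second) {
        return () -> {
            Long op = first.next();
            return op != null ? op : second.next();
        };
    }

    static String drain(Snapshot snapshot) {
        StringBuilder sb = new StringBuilder();
        for (Long op = snapshot.next(); op != null; op = snapshot.next()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(op);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The pre-flight snapshot saw ops 1-2; the phase2 snapshot sees 2-3.
        Snapshot combined = concat(fromList(List.of(1L, 2L)), fromList(List.of(2L, 3L)));
        System.out.println(drain(combined)); // 1 2 2 3
    }
}
```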
```diff
@@ -166,8 +166,8 @@ public RecoveryResponse recoverToTarget() throws IOException {
         try (Closeable ignored = shard.acquireRetentionLockForPeerRecovery()) {
             final long startingSeqNo;
             final long requiredSeqNoRangeStart;
-            // DISCUSS: Most of cases, we will do sequence-based recovery even though file-based recovery will be better
-            // as we have to replay a large number of operations. Should we add a limit here when making the decision?
+            // DISCUSS: Most of the cases, we will do sequence-based recovery even though the file-based recovery will be better
+            // as we have to replay a large number of operations. Should we add a limit here when making this decision?
             final boolean isSequenceNumberBasedRecovery = request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO &&
                 isTargetSameHistory() && isTranslogReadyForSequenceNumberBasedRecovery();
             if (isSequenceNumberBasedRecovery) {
```
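The limit floated in the DISCUSS comment could look roughly like this: keep the existing preconditions, and additionally fall back to file-based recovery once the number of ops to replay crosses a threshold. This is a sketch only; the threshold constant, method, and parameter names are all assumptions, not the merged implementation.

```java
public class RecoveryDecisionSketch {
    static final long MAX_OPS_FOR_SEQ_NO_RECOVERY = 10_000; // assumed threshold
    static final long UNASSIGNED_SEQ_NO = -2;               // mirrors SequenceNumbers.UNASSIGNED_SEQ_NO

    static boolean useSequenceNumberBasedRecovery(long requestStartingSeqNo,
                                                  long targetLocalCheckpoint,
                                                  long sourceMaxSeqNo,
                                                  boolean sameHistory,
                                                  boolean translogReady) {
        // Existing preconditions from the diff above.
        if (requestStartingSeqNo == UNASSIGNED_SEQ_NO || !sameHistory || !translogReady) {
            return false;
        }
        // The proposed extra clause: replaying too many ops is slower than copying files.
        long opsToReplay = sourceMaxSeqNo - targetLocalCheckpoint;
        return opsToReplay <= MAX_OPS_FOR_SEQ_NO_RECOVERY;
    }

    public static void main(String[] args) {
        // Small gap: seq#-based recovery is cheap.
        System.out.println(useSequenceNumberBasedRecovery(100, 99, 200, true, true));       // true
        // Huge gap: prefer file-based recovery instead.
        System.out.println(useSequenceNumberBasedRecovery(100, 99, 1_000_000, true, true)); // false
    }
}
```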
```diff
@@ -186,7 +186,7 @@ public RecoveryResponse recoverToTarget() throws IOException {

                 // DISCUSS: startingSeqNo = 0;
                 // Operations in translog is limited by 12h or 512MB, but there are no limit on non-deleted documents in Lucene history
-                // `startingSeqNo` might replay many Lucene operations
+                // `startingSeqNo = 0` might replay a large number of operations.
                 startingSeqNo = 0;
                 // but we must have everything above the local checkpoint in the commit
                 requiredSeqNoRangeStart =
```

Review comments on `startingSeqNo = 0;`:

> we might need to get the min-consecutive-seq-id from lucene here since we should have it? it's the min seqId in the policy, no?

> another option is to pass -1 and let the impl decide where to start.

> Here too I think we should just forgo the translog when we shift to lucene driven ops. That means that we need to set this to the local checkpoint in the lucene commit we copy. All previous ops will be part of the commit. The only reason why this is 0 is to create a "proper" history on the target in the form of a translog that is just as big as the local one. With a lucene-backed history, this is not needed.

> I prefer to make this in a follow-up.
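The alternative suggested in the thread, sketched below: instead of `startingSeqNo = 0`, derive the starting point from the local checkpoint recorded in the Lucene commit that is copied to the target, since every op at or below that checkpoint is already contained in the commit. The user-data key name and helper are illustrative assumptions, not the actual implementation.

```java
import java.util.Map;

public class StartingSeqNoSketch {
    // Lucene commits carry user data as string key/value pairs; the local
    // checkpoint is assumed here to be stored under "local_checkpoint".
    static long startingSeqNoFromCommit(Map<String, String> commitUserData) {
        long localCheckpoint = Long.parseLong(commitUserData.get("local_checkpoint"));
        // Ops <= localCheckpoint ship with the file copy; replay starts just above it.
        return localCheckpoint + 1;
    }

    public static void main(String[] args) {
        Map<String, String> userData = Map.of("local_checkpoint", "41");
        System.out.println(startingSeqNoFromCommit(userData)); // 42
    }
}
```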
> wait, if you acquire the snapshot before you check `isSequenceNumberBasedRecovery`, lucene will not trim away anything since you prevented the policy from going forward. It has point-in-time semantics so I don't understand what you mean by stuff gets trimmed. Merges that are in-flight will have seen previous retention limits so that's fine. Pending merges will pick up their retention limits right before they execute, which is also fine? I think the only issue I can see is that `setCheckpointOfSafeCommit` accepts a smaller value than the number of ops we keep around and then things go sideways. Otherwise we should be just fine? I must be missing something.
> @s1monw, I should have explained this better. The problem is that the method `isTranslogReadyForSequenceNumberBasedRecovery` may see operations that are not protected by the retention policy, then decide to go with seq#-based recovery. However, an actual snapshot for sending ops might not have those "unprotected" ops if the searcher is refreshed to the newly merged segment. I have a test to illustrate this scenario.
> I agree with Simon that we should take a different approach than the one we had with the translog, now that we reason about trimming soft deletes in terms of seq# (the translog uses generation size + time). My suggestion is to use the following logic:
>
> I also think we should not pre-flight the requested range like `isTranslogReadyForSequenceNumberBasedRecovery` does. If the numbers indicate that an ops-based recovery is possible, we go for it. We should then verify we managed to bring the target shard to where we need to be (local checkpoint >= the end of the ops range we send) and if not we fail the recovery. IMO we can start to rely on retention semantics and simplify the logic.
>
> Note that this brings another interesting discussion: should we even support translog-based ops recovery? We should talk about it, but I tend to say that we should, but give it a whole different code path than the lucene-based one. We will then use the soft-delete setting to decide which path to follow and nothing else. Specifically: no fallback from lucene ops to translog ops. It's one or the other.
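The "verify instead of pre-flight" idea in the comment above can be sketched as: attempt the ops-based recovery, then fail it if the target's local checkpoint did not reach the end of the required range. Names are hypothetical; this is the shape of the check, not the merged code.

```java
public class VerifyRecoverySketch {
    // Post-hoc verification replacing the pre-flight readiness check: the target
    // must have processed every op up to the end of the range we sent.
    static void verifyTargetCaughtUp(long targetLocalCheckpoint, long requiredSeqNoRangeEnd) {
        if (targetLocalCheckpoint < requiredSeqNoRangeEnd) {
            throw new IllegalStateException("recovery failed: target local checkpoint ["
                + targetLocalCheckpoint + "] is below required seq# [" + requiredSeqNoRangeEnd + "]");
        }
    }

    public static void main(String[] args) {
        verifyTargetCaughtUp(100, 100); // ok: target reached the end of the range
        try {
            verifyTargetCaughtUp(95, 100); // target fell short: recovery is failed
        } catch (IllegalStateException e) {
            System.out.println("failed as expected: " + e.getMessage());
        }
    }
}
```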
> Yeah, I made this change as you suggested. However, we can't use the min retained ops directly as the number of retained operations can be changed - it's a dynamic setting. I used the max_seq_no that has been exposed to the merge policy as the baseline. This value will be baked into the commit and bootstrapped when opening an engine.