Replica starts peer recovery with safe commit #28181
Conversation
Today a replica starts a peer recovery with the last commit. If the last commit is not a safe commit, the replica immediately falls back to the file-based sync, which is more expensive than the sequence-based recovery. This commit modifies the peer recovery on the replica to start with a safe commit. Moreover, we can keep the existing translog on the target if the recovery is sequence-based.
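For context, a safe commit here is the most recent index commit whose maximum sequence number is at or below the shard's knowledge of the global checkpoint. Below is a minimal, self-contained sketch of that selection rule; the `CommitInfo` type and the fallback to the oldest commit are illustrative assumptions, not the actual CombinedDeletionPolicy code.

```java
import java.util.List;

// Illustrative sketch only: in Elasticsearch the max sequence number is read from the
// Lucene commit user data; this stand-in type just carries the relevant values.
final class SafeCommitSketch {

    static final class CommitInfo {
        final long generation;
        final long maxSeqNo;

        CommitInfo(long generation, long maxSeqNo) {
            this.generation = generation;
            this.maxSeqNo = maxSeqNo;
        }
    }

    /**
     * Returns the newest commit whose maxSeqNo is at or below the global checkpoint,
     * i.e. the "safe commit" a replica should start peer recovery from. If no commit
     * qualifies, the oldest commit is returned (an assumption made for this sketch).
     * Commits are expected to be ordered from oldest to newest.
     */
    static CommitInfo selectSafeCommit(List<CommitInfo> commits, long globalCheckpoint) {
        CommitInfo safe = commits.get(0);
        for (CommitInfo commit : commits) {
            if (commit.maxSeqNo <= globalCheckpoint) {
                safe = commit; // keep overwriting: the last qualifying commit is the newest
            }
        }
        return safe;
    }
}
```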
Thx Nhat. I left some initial feedback
import java.io.IOException;

final class RecoveryOpenSeqBasedEngineRequest extends TransportRequest {
I've slept on this (as promised :)) and I prefer we go back to how you had it, with a boolean in RecoveryPrepareForTranslogOperationsRequest. The reason is that I want to do some refactoring to simplify how the engine is created, and I expect this to change in a way that makes the boolean unneeded and leaves only one message. I'd rather not have to deal with two messages and another layer of BWC. I think we should call the boolean "deleteLocalTranslog".
+1 on not having a separate message here.
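For reference, the single-message shape being discussed would roughly look like the sketch below. Field names match the later diff; the real class extends TransportRequest and handles wire serialization plus the BWC defaulting for older nodes, which are omitted here.

```java
// Simplified stand-in for RecoveryPrepareForTranslogOperationsRequest with the proposed
// boolean; serialization/BWC handling is intentionally left out of this sketch.
final class PrepareForTranslogOperationsRequestSketch {
    private final long recoveryId;
    private final String shardId;              // the real field is a ShardId
    private final int totalTranslogOps;
    private final boolean deleteLocalTranslog; // later renamed to createNewTranslog

    PrepareForTranslogOperationsRequestSketch(long recoveryId, String shardId,
                                              int totalTranslogOps, boolean deleteLocalTranslog) {
        this.recoveryId = recoveryId;
        this.shardId = shardId;
        this.totalTranslogOps = totalTranslogOps;
        this.deleteLocalTranslog = deleteLocalTranslog;
    }

    long recoveryId() {
        return recoveryId;
    }

    String shardId() {
        return shardId;
    }

    int totalTranslogOps() {
        return totalTranslogOps;
    }

    boolean deleteLocalTranslog() {
        return deleteLocalTranslog;
    }
}
```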
@@ -188,7 +188,7 @@ public RecoveryResponse recoverToTarget() throws IOException {
runUnderPrimaryPermit(() -> shard.initiateTracking(request.targetAllocationId()));

try {
prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
we can roll back all these naming changes if we keep the old message (and the boolean)
translogLocation.set(writeTranslog(replica.shardId(), translogUUID, translog.currentFileGeneration(), maxSeqNo));

// commit is good, global checkpoint is above max
we lost the extra check that advancing the global checkpoint advances the commit in use.
@@ -241,4 +246,38 @@ public void testPeerRecoveryPersistGlobalCheckpoint() throws Exception {
assertThat(replica.getTranslog().getLastSyncedGlobalCheckpoint(), equalTo(numDocs - 1));
}
}

public void testSequenceBasedRecoveryKeepsTranslog() throws Exception { |
can you double check that we have a test that makes sure all forms of recovery remove unneeded ops above the global checkpoint? If we don't, do you think we can add it now?
Yes. We have RecoveryDuringReplicationTests#testRecoveryAfterPrimaryPromotion. Previously this test was expected to execute a file-based sync, but now it will execute a seq-based recovery.
We only do sequence-number-based recovery in the case where a safe commit exists (whose definition is based on the local knowledge of the global checkpoint). How about locally replaying the translog up to the (local knowledge of the) global checkpoint and then requesting seq-num recovery only from the (local knowledge of the) global checkpoint onwards? This might allow sequence-number-based recovery in more cases and would also require fewer translog operations to be shipped over the wire.
That's where we're heading indeed, but it will require more follow-ups.
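To make that direction concrete, here is an outline of the suggested flow; every name below is hypothetical, this is not the actual RecoverySourceHandler or IndexShard API.

```java
// Purely illustrative outline of the "replay locally, then recover from the global
// checkpoint onwards" idea discussed above; all names are made up.
final class LocalReplayRecoveryOutline {

    interface ReplicaShard {
        long localKnowledgeOfGlobalCheckpoint();
        void replayLocalTranslogUpTo(long seqNo);
    }

    /**
     * Replays the replica's own translog up to its local knowledge of the global
     * checkpoint and returns the first sequence number that still has to be shipped
     * from the primary.
     */
    static long prepareReplicaAndComputeStartingSeqNo(ReplicaShard replica) {
        final long localGcp = replica.localKnowledgeOfGlobalCheckpoint();
        replica.replayLocalTranslogUpTo(localGcp); // reuse local ops instead of discarding them
        return localGcp + 1;                       // only ops above this point go over the wire
    }
}
```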
# Conflicts:
#   core/src/test/java/org/elasticsearch/indices/recovery/RecoveryTests.java
I left some very minor comments. It looks good. I think we can now also add assertions in IndexShard post recovery that we have a "safe" commit if the index is new enough, and that the global checkpoint is <= the local checkpoint, etc.?
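The assertions being suggested might look roughly like the standalone sketch below; the accessor-style parameters and the version guard are assumptions, not the actual IndexShard code.

```java
// Standalone sketch of the suggested post-recovery invariants.
final class PostRecoveryInvariantsSketch {

    static void assertInvariants(long maxSeqNoOfSafeCommit,
                                 long globalCheckpoint,
                                 long localCheckpoint,
                                 boolean indexCreatedOnOrAfter6_2) {
        if (indexCreatedOnOrAfter6_2) {
            // The commit used as the recovery starting point should be a safe commit.
            assert maxSeqNoOfSafeCommit <= globalCheckpoint
                : "no safe commit after recovery: max_seq_no=" + maxSeqNoOfSafeCommit
                    + " > global_checkpoint=" + globalCheckpoint;
        }
        // The global checkpoint should never be ahead of the local checkpoint.
        assert globalCheckpoint <= localCheckpoint
            : "global_checkpoint=" + globalCheckpoint + " > local_checkpoint=" + localCheckpoint;
    }
}
```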
private final long recoveryId;
private final ShardId shardId;
private final int totalTranslogOps;
private final boolean deleteLocalTranslog;
I know I came up with the name and I'm sorry for changing my mind. I would propose createNewTranslog. Better, no?
@@ -188,7 +188,9 @@ public RecoveryResponse recoverToTarget() throws IOException {
runUnderPrimaryPermit(() -> shard.initiateTracking(request.targetAllocationId()));

try {
prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
// For a sequence based recovery, the target can keep its local translog
prepareTargetForTranslog(isSequenceNumberBasedRecoveryPossible == false,
nit: shall we rename isSequenceNumberBasedRecoveryPossible to isSequenceNumberBasedRecovery?
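Either way, the relationship between the two flags stays the same; here is a tiny sketch of the call-site logic (the surrounding recovery handler is elided, and the helper below is only for illustration).

```java
// The target only needs a brand-new translog when an ops-based recovery is impossible;
// for a sequence-number-based recovery it keeps its local translog (and trims it later).
final class CreateNewTranslogFlagSketch {
    static boolean createNewTranslog(boolean isSequenceNumberBasedRecovery) {
        return isSequenceNumberBasedRecovery == false;
    }
}
```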
state().getTranslog().totalOperations(totalTranslogOps);
// TODO: take the local checkpoint from store as global checkpoint, once we know it's safe |
the TODO is still relevant, no?
I pushed it back as a note since it's a valid TODO.
// As a replica keeps a safe commit, the file-based recovery only happens if the required translog
// for the sequence based recovery are not fully retained and extra documents were added to the primary.
boolean expectSeqNoRecovery = (moreDocs == 0 || frequently());
why not have a random boolean here instead of frequently?
We aim to always have a safe index once recovery is done. This invariant does not hold if the translog is manually truncated by users, because the truncate-translog CLI resets the global checkpoint to unassigned. This commit assigns the max_seqno of the last commit to the global checkpoint when truncating the translog. Relates elastic#28181
Thx Nhat.
@@ -365,6 +365,7 @@ private void ensureRefCount() {
public void prepareForTranslogOperations(boolean createNewTranslog, int totalTranslogOps) throws IOException {
state().getTranslog().totalOperations(totalTranslogOps);
if (createNewTranslog) {
// TODO: take the local checkpoint from store as global checkpoint, once we know it's safe |
please add when you expect it to be safe (what version)
We aim to always have a safe index once recovery is done. This invariant does not hold if the translog is manually truncated by users, because the truncate-translog CLI resets the global checkpoint to unassigned. This commit assigns the global checkpoint to the max_seqno of the last commit when truncating the translog. We can only do this safely because the truncate-translog command also generates a new history UUID for that shard; with a new history UUID, sequence-based recovery between that shard and other old shards is disabled. Relates #28181
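Conceptually, the CLI change amounts to seeding the freshly created translog's global checkpoint from the last commit instead of leaving it unassigned. A standalone sketch is below; the commit user data key and the unassigned sentinel follow Elasticsearch conventions, while all surrounding file handling is elided.

```java
import java.util.Map;

// Standalone sketch of how the truncate-translog CLI can seed the global checkpoint of the
// freshly created (empty) translog from the last commit instead of leaving it unassigned.
final class TruncateTranslogGlobalCheckpointSketch {
    static final long UNASSIGNED_SEQ_NO = -2L;          // Elasticsearch's "unassigned" sentinel
    static final String MAX_SEQ_NO_KEY = "max_seq_no";  // key in the Lucene commit user data

    static long globalCheckpointForTruncatedTranslog(Map<String, String> lastCommitUserData) {
        final String maxSeqNo = lastCommitUserData.get(MAX_SEQ_NO_KEY);
        // Seeding from the last commit is only safe because the CLI also assigns a new history
        // UUID, which disables sequence-based recovery between this shard and old copies.
        return maxSeqNo == null ? UNASSIGNED_SEQ_NO : Long.parseLong(maxSeqNo);
    }
}
```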
We introduced a new option `createNewTranslog` in elastic#28181. However, that parameter was named deleteLocalTranslog in other places. This commit makes the naming consistent across these places. Relates elastic#28181
* master:
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
The previous backport was not correct. Relates #28181
* master: (59 commits)
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  AwaitsFix #testRecoveryAfterPrimaryPromotion
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (elastic#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  ...
As a replica always keeps a safe commit and starts peer recovery with that commit, file-based recovery only happens if new operations are added to the primary and the required translog is not fully retained. In the test, we tried to produce this condition by flushing a new commit in order to trim the whole translog. However, if the new global checkpoint is not persisted yet, we keep two commits and do not trim the translog. This commit tightens the file-based condition in the test by waiting for the global checkpoint to be persisted properly on the new primary before flushing. Closes #28209 Relates #28181
* master: (74 commits)
  Update version of TaskInfo header serialization after backport
  TEST: Tightens file-based condition in peer-recovery
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  AwaitsFix #testRecoveryAfterPrimaryPromotion
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  ...
* master: (21 commits)
  [GEO] Add WKT Support to GeoBoundingBoxQueryBuilder
  Painless: Add whitelist extensions (elastic#28161)
  Fix daitch_mokotoff phonetic filter to use the dedicated Lucene filter (elastic#28225)
  Avoid doing redundant work when checking for self references. (elastic#26927)
  Fix casts in HotThreads. (elastic#27578)
  Ignore the `-snapshot` suffix when comparing the Lucene version in the build and the docs. (elastic#27927)
  Allow update of `eager_global_ordinals` on `_parent`. (elastic#28014)
  Fix NPE on composite aggregation with sub-aggregations that need scores (elastic#28129)
  `MockTcpTransport` to connect asynchronously (elastic#28203)
  Fix synonym phrase query expansion for cross_fields parsing (elastic#28045)
  Introduce elasticsearch-core jar (elastic#28191)
  elastic#28218: Update the Lucene version for 6.2.0 after backport
  upgrade to lucene 7.2.1 (elastic#28218)
  [Docs] Fix an error in painless-types.asciidoc (elastic#28221)
  Adds metadata to rewritten aggregations (elastic#28185)
  Update version of TaskInfo header serialization after backport
  TEST: Tightens file-based condition in peer-recovery
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  ...
Keeping unsafe commits when opening an engine can be problematic because these commits are not safe at recovery time but can suddenly become safe in the future. The following issues can happen if unsafe commits are kept onInit:

1. A replica can use an unsafe commit in peer recovery. This happens when a replica with a safe commit c1 (max_seqno=1) and an unsafe commit c2 (max_seqno=2) recovers from a primary with c1 (max_seqno=1). If a new document (seqno=2) is added without flushing, the global checkpoint advances to 2; if the replica then recovers again, it will use the unsafe commit c2 (max_seqno=2 <= gcp=2) as the starting commit for sequence-based recovery even though c2 contains a stale operation, and the document (with seqno=2) will not be replicated to the replica.

2. The minimum translog generation for recovery can go backwards in peer recovery. This happens when a replica has a safe commit c1 (local_checkpoint=1, recovery_translog_gen=1) and an unsafe commit c2 (local_checkpoint=2, recovery_translog_gen=2). The replica recovers from a primary, keeps c2 as the last commit, then sets last_translog_gen to 2. Flushing a new commit on the replica will cause an exception because the new last commit c3 will have recovery_translog_gen=1. The recovery translog generation of a commit is calculated based on the current local checkpoint: the local checkpoint of c3 is 1 while the local checkpoint of c2 is 2.

3. A commit without a translog can be used for recovery. An old index, which was created before multiple commits were introduced (v6.2), may not have a safe commit. If that index has a snapshotted commit without a translog and an unsafe commit, the policy can consider the snapshotted commit as a safe commit for recovery even though that commit does not have a translog.

These issues can be avoided if the combined deletion policy keeps only the starting commit onInit. Relates #27804 Relates #28181
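The fix boils down to the deletion policy deleting, when the engine is opened, every commit other than the chosen starting (safe) commit. Below is a simplified standalone sketch; `Commit` is a stand-in for Lucene's IndexCommit, and this is not the real CombinedDeletionPolicy code.

```java
import java.util.List;

// Simplified sketch of "keep only the starting commit onInit"; in Elasticsearch the
// maxSeqNo would be read from the commit user data and delete() maps to IndexCommit#delete.
final class KeepOnlyStartingCommitSketch {

    interface Commit {
        long maxSeqNo();
        void delete();
    }

    /**
     * Deletes every commit except the chosen starting commit when the engine is opened.
     * Commits are assumed ordered from oldest to newest; if no commit is safe, the oldest
     * one is kept (an assumption made for this sketch).
     */
    static void onInit(List<? extends Commit> commits, long globalCheckpoint) {
        Commit starting = commits.get(0);
        for (Commit commit : commits) {
            if (commit.maxSeqNo() <= globalCheckpoint) {
                starting = commit;        // the newest commit that is still safe
            }
        }
        for (Commit commit : commits) {
            if (commit != starting) {
                commit.delete();          // unsafe or superseded commits must not survive
            }
        }
    }
}
```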
* compile-with-jdk-9: (56 commits)
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (elastic#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  Add scroll parameter to _reindex API (elastic#28041)
  Include all sentences smaller than fragment_size in the unified highlighter (elastic#28132)
  Modifies the JavaAPI docs related to AggregationBuilder
  [Docs] Improvements in script-fields.asciidoc (elastic#28174)
  ...
Today a replica starts a peer recovery with the last commit. If the last commit is not a safe commit, the replica immediately falls back to the file-based sync, which is more expensive than the sequence-based recovery. This commit modifies the peer recovery on the replica to start with a safe commit. Moreover, we can keep the existing translog on the target if the recovery is sequence-based.

Relates #10708