Replica starts peer recovery with safe commit #28181
Conversation
Today a replica starts a peer recovery with the last commit. If the last commit is not a safe commit, the replica immediately falls back to the file-based sync, which is more expensive than the sequence-based recovery. This commit modifies the peer recovery on the replica to start with a safe commit. Moreover, we can keep the existing translog on the target if the recovery is sequence-based.
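For context, a safe commit here is the most recent index commit whose maximum sequence number is at or below the shard's knowledge of the global checkpoint. Below is a minimal, self-contained sketch of that selection rule; the `CommitInfo` type and the fallback to the oldest commit are illustrative assumptions, not the actual CombinedDeletionPolicy code.

```java
import java.util.List;

// Illustrative sketch only: in Elasticsearch the max sequence number is read from the
// Lucene commit user data; this stand-in type just carries the relevant values.
final class SafeCommitSketch {

    static final class CommitInfo {
        final long generation;
        final long maxSeqNo;

        CommitInfo(long generation, long maxSeqNo) {
            this.generation = generation;
            this.maxSeqNo = maxSeqNo;
        }
    }

    /**
     * Returns the newest commit whose maxSeqNo is at or below the global checkpoint,
     * i.e. the "safe commit" a replica should start peer recovery from. If no commit
     * qualifies, the oldest commit is returned (an assumption made for this sketch).
     * Commits are expected to be ordered from oldest to newest.
     */
    static CommitInfo selectSafeCommit(List<CommitInfo> commits, long globalCheckpoint) {
        CommitInfo safe = commits.get(0);
        for (CommitInfo commit : commits) {
            if (commit.maxSeqNo <= globalCheckpoint) {
                safe = commit; // keep overwriting: the last qualifying commit is the newest
            }
        }
        return safe;
    }
}
```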
Thx Nhat. I left some initial feedback
import java.io.IOException;

final class RecoveryOpenSeqBasedEngineRequest extends TransportRequest {
I've slept on this (as promised :)) and I prefer we go back to how you had it, with a boolean in RecoveryPrepareForTranslogOperationsRequest. The reason is that I want to do some refactoring to simplify how the engine is created, and I expect this to change in a way that makes the boolean unneeded and leaves only one message. I'd rather not have to deal with two messages and another layer of BWC. I think we should call the boolean "deleteLocalTranslog".
+1 on not having a separate message here.
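For reference, the single-message shape being discussed would roughly look like the sketch below. Field names match the later diff; the real class extends TransportRequest and handles wire serialization plus the BWC defaulting for older nodes, which are omitted here.

```java
// Simplified stand-in for RecoveryPrepareForTranslogOperationsRequest with the proposed
// boolean; serialization/BWC handling is intentionally left out of this sketch.
final class PrepareForTranslogOperationsRequestSketch {
    private final long recoveryId;
    private final String shardId;              // the real field is a ShardId
    private final int totalTranslogOps;
    private final boolean deleteLocalTranslog; // later renamed to createNewTranslog

    PrepareForTranslogOperationsRequestSketch(long recoveryId, String shardId,
                                              int totalTranslogOps, boolean deleteLocalTranslog) {
        this.recoveryId = recoveryId;
        this.shardId = shardId;
        this.totalTranslogOps = totalTranslogOps;
        this.deleteLocalTranslog = deleteLocalTranslog;
    }

    long recoveryId() {
        return recoveryId;
    }

    String shardId() {
        return shardId;
    }

    int totalTranslogOps() {
        return totalTranslogOps;
    }

    boolean deleteLocalTranslog() {
        return deleteLocalTranslog;
    }
}
```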
@@ -188,7 +188,7 @@ public RecoveryResponse recoverToTarget() throws IOException {
runUnderPrimaryPermit(() -> shard.initiateTracking(request.targetAllocationId()));

try {
prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
we can roll back all these naming changes if we keep the old message (and the boolean)
translogLocation.set(writeTranslog(replica.shardId(), translogUUID, translog.currentFileGeneration(), maxSeqNo));

// commit is good, global checkpoint is above max
we lost the extra check that advancing the global checkpoint advances the commit in use.
@@ -241,4 +246,38 @@ public void testPeerRecoveryPersistGlobalCheckpoint() throws Exception {
assertThat(replica.getTranslog().getLastSyncedGlobalCheckpoint(), equalTo(numDocs - 1));
}
}

public void testSequenceBasedRecoveryKeepsTranslog() throws Exception { |
can you double check that we have a test that makes sure all forms of recovery remove unneeded ops above the global checkpoint? If we don't, do you think we can add it now?
Yes. We have RecoveryDuringReplicationTests#testRecoveryAfterPrimaryPromotion. Previously this test was expected to execute a file-based sync, but now it will execute a seq-based recovery.
We only do sequence-number-based recovery in the case where a safe commit exists (whose definition is based on the local knowledge of the global checkpoint). How about locally replaying the translog up to the (local knowledge of the) global checkpoint and then requesting seq-num recovery only from the (local knowledge of the) global checkpoint onwards? This might allow sequence-number-based recovery in more cases and would also require fewer translog operations to be shipped over the wire.
That's where we're heading indeed, but it will require more follow-ups.
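To make that direction concrete, here is an outline of the suggested flow; every name below is hypothetical, this is not the actual RecoverySourceHandler or IndexShard API.

```java
// Purely illustrative outline of the "replay locally, then recover from the global
// checkpoint onwards" idea discussed above; all names are made up.
final class LocalReplayRecoveryOutline {

    interface ReplicaShard {
        long localKnowledgeOfGlobalCheckpoint();
        void replayLocalTranslogUpTo(long seqNo);
    }

    /**
     * Replays the replica's own translog up to its local knowledge of the global
     * checkpoint and returns the first sequence number that still has to be shipped
     * from the primary.
     */
    static long prepareReplicaAndComputeStartingSeqNo(ReplicaShard replica) {
        final long localGcp = replica.localKnowledgeOfGlobalCheckpoint();
        replica.replayLocalTranslogUpTo(localGcp); // reuse local ops instead of discarding them
        return localGcp + 1;                       // only ops above this point go over the wire
    }
}
```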
# Conflicts:
#   core/src/test/java/org/elasticsearch/indices/recovery/RecoveryTests.java
I left some very minor comments. It looks good. I think we can now also add assertions in IndexShard post recovery that we have a "safe" commit if the index is new enough, and that the global checkpoint is <= the local checkpoint, etc.?
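The assertions being suggested might look roughly like the standalone sketch below; the accessor-style parameters and the version guard are assumptions, not the actual IndexShard code.

```java
// Standalone sketch of the suggested post-recovery invariants.
final class PostRecoveryInvariantsSketch {

    static void assertInvariants(long maxSeqNoOfSafeCommit,
                                 long globalCheckpoint,
                                 long localCheckpoint,
                                 boolean indexCreatedOnOrAfter6_2) {
        if (indexCreatedOnOrAfter6_2) {
            // The commit used as the recovery starting point should be a safe commit.
            assert maxSeqNoOfSafeCommit <= globalCheckpoint
                : "no safe commit after recovery: max_seq_no=" + maxSeqNoOfSafeCommit
                    + " > global_checkpoint=" + globalCheckpoint;
        }
        // The global checkpoint should never be ahead of the local checkpoint.
        assert globalCheckpoint <= localCheckpoint
            : "global_checkpoint=" + globalCheckpoint + " > local_checkpoint=" + localCheckpoint;
    }
}
```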
private final long recoveryId;
private final ShardId shardId;
private final int totalTranslogOps;
private final boolean deleteLocalTranslog;
I know I came up with the name and I'm sorry for changing my mind. I would propose createNewTranslog. Better, no?
@@ -188,7 +188,9 @@ public RecoveryResponse recoverToTarget() throws IOException {
runUnderPrimaryPermit(() -> shard.initiateTracking(request.targetAllocationId()));

try {
prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
// For a sequence based recovery, the target can keep its local translog
prepareTargetForTranslog(isSequenceNumberBasedRecoveryPossible == false,
nit: shall we rename isSequenceNumberBasedRecoveryPossible to isSequenceNumberBasedRecovery?
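Either way, the relationship between the two flags stays the same; here is a tiny sketch of the call-site logic (the surrounding recovery handler is elided, and the helper below is only for illustration).

```java
// The target only needs a brand-new translog when an ops-based recovery is impossible;
// for a sequence-number-based recovery it keeps its local translog (and trims it later).
final class CreateNewTranslogFlagSketch {
    static boolean createNewTranslog(boolean isSequenceNumberBasedRecovery) {
        return isSequenceNumberBasedRecovery == false;
    }
}
```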
state().getTranslog().totalOperations(totalTranslogOps);
// TODO: take the local checkpoint from store as global checkpoint, once we know it's safe |
the TODO is still relevant, no?
I pushed it back as a note since it's a valid TODO.
// As a replica keeps a safe commit, the file-based recovery only happens if the required translog
// for the sequence based recovery are not fully retained and extra documents were added to the primary.
boolean expectSeqNoRecovery = (moreDocs == 0 || frequently());
why not have a random boolean here instead of frequently?
We aim to always have a safe index once recovery is done. This invariant does not hold if the translog is manually truncated by users, because the truncate-translog CLI resets the global checkpoint to unassigned. This commit assigns the max_seqno of the last commit to the global checkpoint when truncating the translog. Relates elastic#28181
Thx Nhat.
@@ -365,6 +365,7 @@ private void ensureRefCount() {
public void prepareForTranslogOperations(boolean createNewTranslog, int totalTranslogOps) throws IOException {
state().getTranslog().totalOperations(totalTranslogOps);
if (createNewTranslog) {
// TODO: take the local checkpoint from store as global checkpoint, once we know it's safe |
please add when you expect it to be safe (what version)
We aim to always have a safe index once recovery is done. This invariant does not hold if the translog is manually truncated by users, because the truncate-translog CLI resets the global checkpoint to unassigned. This commit assigns the global checkpoint to the max_seqno of the last commit when truncating the translog. We can only do this safely because the truncate-translog command also generates a new history UUID for that shard; with a new history UUID, sequence-based recovery between that shard and other old shards is disabled. Relates #28181
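Conceptually, the CLI change amounts to seeding the freshly created translog's global checkpoint from the last commit instead of leaving it unassigned. A standalone sketch is below; the commit user data key and the unassigned sentinel follow Elasticsearch conventions, while all surrounding file handling is elided.

```java
import java.util.Map;

// Standalone sketch of how the truncate-translog CLI can seed the global checkpoint of the
// freshly created (empty) translog from the last commit instead of leaving it unassigned.
final class TruncateTranslogGlobalCheckpointSketch {
    static final long UNASSIGNED_SEQ_NO = -2L;          // Elasticsearch's "unassigned" sentinel
    static final String MAX_SEQ_NO_KEY = "max_seq_no";  // key in the Lucene commit user data

    static long globalCheckpointForTruncatedTranslog(Map<String, String> lastCommitUserData) {
        final String maxSeqNo = lastCommitUserData.get(MAX_SEQ_NO_KEY);
        // Seeding from the last commit is only safe because the CLI also assigns a new history
        // UUID, which disables sequence-based recovery between this shard and old copies.
        return maxSeqNo == null ? UNASSIGNED_SEQ_NO : Long.parseLong(maxSeqNo);
    }
}
```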
We introduced a new option `createNewTranslog` in elastic#28181. However, that parameter was named deleteLocalTranslog in other places. This commit makes the naming consistent across these places. Relates elastic#28181
* master:
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
The previous backport was not correct. Relates #28181
* master: (59 commits)
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  AwaitsFix #testRecoveryAfterPrimaryPromotion
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (elastic#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  ...
As a replica always keeps a safe commit and starts peer recovery with that commit, file-based recovery only happens if new operations are added to the primary and the required translog is not fully retained. In the test, we tried to produce this condition by flushing a new commit in order to trim the whole translog. However, if the new global checkpoint is not persisted yet, we keep two commits and do not trim the translog. This commit tightens the file-based condition in the test by waiting for the global checkpoint to be persisted properly on the new primary before flushing. Closes #28209 Relates #28181
* master: (74 commits)
  Update version of TaskInfo header serialization after backport
  TEST: Tightens file-based condition in peer-recovery
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  AwaitsFix #testRecoveryAfterPrimaryPromotion
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  ...
* master: (21 commits)
  [GEO] Add WKT Support to GeoBoundingBoxQueryBuilder
  Painless: Add whitelist extensions (elastic#28161)
  Fix daitch_mokotoff phonetic filter to use the dedicated Lucene filter (elastic#28225)
  Avoid doing redundant work when checking for self references. (elastic#26927)
  Fix casts in HotThreads. (elastic#27578)
  Ignore the `-snapshot` suffix when comparing the Lucene version in the build and the docs. (elastic#27927)
  Allow update of `eager_global_ordinals` on `_parent`. (elastic#28014)
  Fix NPE on composite aggregation with sub-aggregations that need scores (elastic#28129)
  `MockTcpTransport` to connect asynchronously (elastic#28203)
  Fix synonym phrase query expansion for cross_fields parsing (elastic#28045)
  Introduce elasticsearch-core jar (elastic#28191)
  elastic#28218: Update the Lucene version for 6.2.0 after backport
  upgrade to lucene 7.2.1 (elastic#28218)
  [Docs] Fix an error in painless-types.asciidoc (elastic#28221)
  Adds metadata to rewritten aggregations (elastic#28185)
  Update version of TaskInfo header serialization after backport
  TEST: Tightens file-based condition in peer-recovery
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  ...
Keeping unsafe commits when opening an engine can be problematic because these commits are not safe at recovery time but can suddenly become safe in the future. The following issues can happen if unsafe commits are kept onInit:

1. A replica can use an unsafe commit in peer recovery. This happens when a replica with a safe commit c1 (max_seqno=1) and an unsafe commit c2 (max_seqno=2) recovers from a primary with c1 (max_seqno=1). If a new document (seqno=2) is added without flushing, the global checkpoint advances to 2; if the replica then recovers again, it will use the unsafe commit c2 (max_seqno=2 <= gcp=2) as the starting commit for sequence-based recovery even though c2 contains a stale operation, and the document (with seqno=2) will not be replicated to the replica.

2. The minimum translog generation for recovery can go backwards in peer recovery. This happens when a replica has a safe commit c1 (local_checkpoint=1, recovery_translog_gen=1) and an unsafe commit c2 (local_checkpoint=2, recovery_translog_gen=2). The replica recovers from a primary, keeps c2 as the last commit, then sets last_translog_gen to 2. Flushing a new commit on the replica will cause an exception because the new last commit c3 will have recovery_translog_gen=1. The recovery translog generation of a commit is calculated based on the current local checkpoint: the local checkpoint of c3 is 1 while the local checkpoint of c2 is 2.

3. A commit without a translog can be used for recovery. An old index, which was created before multiple commits were introduced (v6.2), may not have a safe commit. If that index has a snapshotted commit without a translog and an unsafe commit, the policy can consider the snapshotted commit as a safe commit for recovery even though that commit does not have a translog.

These issues can be avoided if the combined deletion policy keeps only the starting commit onInit. Relates #27804 Relates #28181
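The fix boils down to the deletion policy deleting, when the engine is opened, every commit other than the chosen starting (safe) commit. Below is a simplified standalone sketch; `Commit` is a stand-in for Lucene's IndexCommit, and this is not the real CombinedDeletionPolicy code.

```java
import java.util.List;

// Simplified sketch of "keep only the starting commit onInit"; in Elasticsearch the
// maxSeqNo would be read from the commit user data and delete() maps to IndexCommit#delete.
final class KeepOnlyStartingCommitSketch {

    interface Commit {
        long maxSeqNo();
        void delete();
    }

    /**
     * Deletes every commit except the chosen starting commit when the engine is opened.
     * Commits are assumed ordered from oldest to newest; if no commit is safe, the oldest
     * one is kept (an assumption made for this sketch).
     */
    static void onInit(List<? extends Commit> commits, long globalCheckpoint) {
        Commit starting = commits.get(0);
        for (Commit commit : commits) {
            if (commit.maxSeqNo() <= globalCheckpoint) {
                starting = commit;        // the newest commit that is still safe
            }
        }
        for (Commit commit : commits) {
            if (commit != starting) {
                commit.delete();          // unsafe or superseded commits must not survive
            }
        }
    }
}
```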
* compile-with-jdk-9: (56 commits)
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (elastic#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  Add scroll parameter to _reindex API (elastic#28041)
  Include all sentences smaller than fragment_size in the unified highlighter (elastic#28132)
  Modifies the JavaAPI docs related to AggregationBuilder
  [Docs] Improvements in script-fields.asciidoc (elastic#28174)
  ...
Today a replica starts a peer recovery with the last commit. If the last commit is not a safe commit, the replica immediately falls back to the file-based sync, which is more expensive than the sequence-based recovery. This commit modifies the peer recovery on the replica to start with a safe commit. Moreover, we can keep the existing translog on the target if the recovery is sequence-based.

Relates #10708