Sequence number based replica allocation #46959
Conversation
Pinging @elastic/es-distributed
I think this direction looks good. I have left a few initial comments inline, but my main concern is staleness: the info we have from the primary about leases can be stale. As soon as a node with a replica dies, we reach out to all nodes, including the primary, and read the info, and then cache it until a shard with the same shard-id is started. Given the index.recovery.file_based_threshold setting, the staleness may become significant in much less time than the default 12h lease expiration.
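To make the staleness window concrete, here is a minimal sketch of the caching behavior described above (all names are hypothetical, not the actual Elasticsearch implementation): the lease snapshot is captured once at fetch time and is only dropped when a shard with the same shard-id starts, so nothing bounds its age while the allocator keeps consulting it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of the staleness concern: the per-shard lease
// snapshot is cached at fetch time and invalidated only when a shard with the
// same shard-id is started -- there is no time-based expiry, so the snapshot
// can become stale long before the default 12h lease expiration suggests.
final class LeaseInfoCache {
    record LeaseSnapshot(Map<String, Long> retainingSeqNoByNode, long fetchedAtMillis) {}

    private final Map<String, LeaseSnapshot> byShardId = new ConcurrentHashMap<>();

    LeaseSnapshot get(String shardId) {
        return byShardId.get(shardId);
    }

    void onFetched(String shardId, Map<String, Long> retainingSeqNoByNode, long nowMillis) {
        byShardId.put(shardId, new LeaseSnapshot(Map.copyOf(retainingSeqNoByNode), nowMillis));
    }

    // The only invalidation point: a shard with the same shard-id starting.
    void onShardStarted(String shardId) {
        byShardId.remove(shardId);
    }
}
```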
[Resolved review comments on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java and server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java]
Great stuff @dnhatn, thanks. I left some points for discussion.
[Resolved review comments on server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java and server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java]
Today, we don't clear the shard info of the primary shard when a new node joins; we might then risk making replica allocation decisions based on stale information from the primary. The serious problem is that we can cancel a current recovery that is more advanced than the copy on the new node, due to the old info we have from the primary. With this change, we ensure the shard info from the primary is not older than that of any node when allocating replicas. Relates #46959 This work was done by Henning in #42518. Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
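A minimal sketch of the invariant that commit enforces, under assumed names (this is not the actual ReplicaShardAllocator code): the cached response from the fetch round must cover every data node currently in the cluster, otherwise it is treated as stale.

```java
import java.util.Set;

// Hypothetical sketch: shard-store info fetched from the nodes is usable for
// allocation decisions only if no data node has joined since the fetch, i.e.
// the primary's info is not older than that of any node we might allocate to.
final class FetchedShardInfo {
    private final Set<String> nodesAnsweredFetch; // node ids that responded

    FetchedShardInfo(Set<String> nodesAnsweredFetch) {
        this.nodesAnsweredFetch = Set.copyOf(nodesAnsweredFetch);
    }

    /** Returns true if the cached info still covers all current data nodes. */
    boolean coversAllDataNodes(Set<String> currentDataNodeIds) {
        return nodesAnsweredFetch.containsAll(currentDataNodeIds);
    }
}
```

If the check fails, the allocator would re-fetch rather than risk cancelling a recovery that is more advanced than the copy on the newly joined node.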
@henningandersen @DaveCTurner Thank you for reviewing. I have addressed your comments and suggestions in 17bfb34. Would you please take another look?
I've taken a look to see how this works and left mainly smaller comments. Good stuff.
[Resolved review comments on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java, server/src/main/java/org/elasticsearch/indices/store/TransportNodesListShardStoreMetaData.java, and server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java]
Thanks @dnhatn, this is looking good. I left a number of smaller comments to address or comment on.
[Resolved review comments on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java, server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java, server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorTests.java, and server/src/test/java/org/elasticsearch/index/store/StoreTests.java]
@ywelsch @henningandersen Thank you for another helpful review. I have responded to or addressed your comments. Would you please take another look?
Although the failure is from a newly introduced test, I think this PR is still ready for another round of review. I am investigating the test failure.
Thanks Nhat!
[Resolved review comment on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java]
LGTM.
Thanks @dnhatn
Thanks @dnhatn and apologies for the delayed review. I left a few questions, but no blockers.
[Resolved review comments on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java and server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java]
@DaveCTurner Thanks for looking. I have addressed your comments. Would you please take another look?
LGTM thanks @dnhatn
@henningandersen @DaveCTurner @ywelsch Thank you very much for your helpful reviews.
With this change, shard allocation prefers allocating replicas on a node that already has a copy of the shard that is as close as possible to the primary, so that bringing the new replica in sync with the primary is as cheap as possible. Furthermore, if we find a copy that is identical to the primary, we cancel an ongoing recovery, because a copy identical to the primary needs no work to recover as a replica. With this improvement, we no longer need to perform a synced flush before a rolling upgrade or full cluster restart. Closes #46318
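Roughly, the selection described above amounts to the following (a simplified, hypothetical sketch; the real logic lives in ReplicaShardAllocator): among nodes that already hold a copy, prefer one identical to the primary, since it needs no recovery work, and otherwise the one requiring the least data to be brought in sync.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical candidate: a node that already holds a copy of the shard.
record Candidate(String nodeId, boolean identicalToPrimary, long bytesToRecover) {}

final class ReplicaCandidateSelector {
    /** Prefer a copy identical to the primary (noop recovery), else the cheapest match. */
    static Optional<Candidate> pickBest(List<Candidate> candidates) {
        return candidates.stream()
            .min(Comparator
                .comparing((Candidate c) -> !c.identicalToPrimary()) // identical copies sort first
                .thenComparingLong(Candidate::bytesToRecover));      // then least data to transfer
    }
}
```

An ongoing recovery would then be cancelled when this selection finds a strictly better, identical copy elsewhere.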
…50351) Today, the replica allocator uses peer recovery retention leases to select the best-matched copies when allocating replicas of indices with soft-deletes. We can employ this mechanism for indices without soft-deletes because the retaining sequence number of a PRRL is the persisted global checkpoint (plus one) of that copy. If the primary and replica have the same retaining sequence number, then we should be able to perform a noop recovery. The reason is that we must be retaining translog up to the local checkpoint of the safe commit, which is at most the global checkpoint of either copy. The only limitation is that we might not cancel ongoing file-based recoveries with PRRLs for noop recoveries: we can't make the translog retention policy comply with PRRLs. We also have this problem with soft-deletes if a PRRL is about to expire. Relates #45136 Relates #46959
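Under assumed names, the core check this commit describes could look like the sketch below (not the real API): a copy's PRRL retains operations from its persisted global checkpoint plus one, so equal retaining sequence numbers on the primary's and replica's leases imply the replica is already caught up and a noop recovery suffices.

```java
// Hypothetical sketch of the noop-recovery condition. A peer recovery
// retention lease (PRRL) retains operations from retainingSeqNo onwards,
// where retainingSeqNo == persisted global checkpoint + 1 of that copy.
record RetentionLease(String ownerNodeId, long retainingSeqNo) {}

final class NoopRecoveryCheck {
    /**
     * Equal retaining sequence numbers mean both copies have persisted the
     * same global checkpoint, so no operations need to be replayed.
     */
    static boolean canPerformNoopRecovery(RetentionLease primaryLease, RetentionLease replicaLease) {
        return primaryLease.retainingSeqNo() == replicaLease.retainingSeqNo();
    }
}
```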
This change prefers allocating replicas on nodes where an operation-based recovery is possible or where the sync_id matches, to reduce recovery time. With this improvement, we no longer need to perform a synced_flush in a rolling upgrade or full cluster restart.
I started with an implementation where I used the persisted global checkpoint from replicas and peer recovery retention leases from primaries to make decisions. However, I was not happy with the extension capturing the persisted global checkpoint (see https://github.com/elastic/elasticsearch/compare/master...dnhatn:replica-allocator-with-gcp?expand=1#diff-275151cc4a5cdf942f310a219e86a403R485).
We don't need the global checkpoint to make decisions for open indices. Having a peer recovery retention lease alone is enough to guarantee an operation-based recovery, since we share the persisted global checkpoint between copies. I decided to implement this without the global checkpoint, as sketched below.
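The sketch (hypothetical names, not the real API): a candidate node qualifies if the primary retains a PRRL for it, which guarantees an operation-based recovery, or if its copy's sync_id matches the primary's.

```java
import java.util.Set;

// Hypothetical sketch: a node can receive the replica cheaply if either an
// operation-based recovery is possible (the primary holds a PRRL for the node,
// so all operations above the shared persisted global checkpoint are retained)
// or the existing copy has the same sync_id as the primary.
final class CheapRecoveryCheck {
    static boolean canRecoverCheaply(String nodeId,
                                     Set<String> nodesWithPeerRecoveryLease,
                                     String primarySyncId,
                                     String candidateSyncId) {
        boolean operationBased = nodesWithPeerRecoveryLease.contains(nodeId);
        boolean syncIdMatch = primarySyncId != null && primarySyncId.equals(candidateSyncId);
        return operationBased || syncIdMatch;
    }
}
```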
I prefer to support closed/frozen indices in a follow-up after we agree on the approach (i.e., using the global checkpoint or the last commit).
Closes #46318