Sequence number based replica allocation #46959
Conversation
Pinging @elastic/es-distributed
I think this direction looks good. I have left a few initial comments inline, but my main concern is staleness: the info we have from the primary about leases can be stale. As soon as a node with a replica dies, we reach out to all nodes, including the primary, and read the info, and then cache it until a shard with the same shard-id is started. Given the index.recovery.file_based_threshold setting, the staleness may become significant in much less time than the default 12h lease expiration.
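To make the staleness window concrete, here is a minimal sketch of the caching behavior described above (all names are hypothetical, not the actual Elasticsearch implementation): the lease snapshot is captured once at fetch time and is only dropped when a shard with the same shard-id starts, so nothing bounds its age while the allocator keeps consulting it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of the staleness concern: the per-shard lease
// snapshot is cached at fetch time and invalidated only when a shard with the
// same shard-id is started -- there is no time-based expiry, so the snapshot
// can become stale long before the default 12h lease expiration suggests.
final class LeaseInfoCache {
    record LeaseSnapshot(Map<String, Long> retainingSeqNoByNode, long fetchedAtMillis) {}

    private final Map<String, LeaseSnapshot> byShardId = new ConcurrentHashMap<>();

    LeaseSnapshot get(String shardId) {
        return byShardId.get(shardId);
    }

    void onFetched(String shardId, Map<String, Long> retainingSeqNoByNode, long nowMillis) {
        byShardId.put(shardId, new LeaseSnapshot(Map.copyOf(retainingSeqNoByNode), nowMillis));
    }

    // The only invalidation point: a shard with the same shard-id starting.
    void onShardStarted(String shardId) {
        byShardId.remove(shardId);
    }
}
```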
[Resolved review comments on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java and server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java]
Great stuff @dnhatn, thanks. I left some points for discussion.
[Resolved review comments on server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java and server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java]
Today, we don't clear the shard info of the primary shard when a new node joins; we might then risk making replica allocation decisions based on stale information from the primary. The serious problem is that we can cancel a current recovery that is more advanced than the copy on the new node, due to the old info we have from the primary. With this change, we ensure the shard info from the primary is not older than that of any node when allocating replicas. Relates #46959 This work was done by Henning in #42518. Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
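A minimal sketch of the invariant that commit enforces, under assumed names (this is not the actual ReplicaShardAllocator code): the cached response from the fetch round must cover every data node currently in the cluster, otherwise it is treated as stale.

```java
import java.util.Set;

// Hypothetical sketch: shard-store info fetched from the nodes is usable for
// allocation decisions only if no data node has joined since the fetch, i.e.
// the primary's info is not older than that of any node we might allocate to.
final class FetchedShardInfo {
    private final Set<String> nodesAnsweredFetch; // node ids that responded

    FetchedShardInfo(Set<String> nodesAnsweredFetch) {
        this.nodesAnsweredFetch = Set.copyOf(nodesAnsweredFetch);
    }

    /** Returns true if the cached info still covers all current data nodes. */
    boolean coversAllDataNodes(Set<String> currentDataNodeIds) {
        return nodesAnsweredFetch.containsAll(currentDataNodeIds);
    }
}
```

If the check fails, the allocator would re-fetch rather than risk cancelling a recovery that is more advanced than the copy on the newly joined node.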
@henningandersen @DaveCTurner Thank you for reviewing. I have addressed your comments and suggestions in 17bfb34. Would you please take another look?
I've taken a look to see how this works and left mainly smaller comments. Good stuff.
[Resolved review comments on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java, server/src/main/java/org/elasticsearch/indices/store/TransportNodesListShardStoreMetaData.java, and server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java]
Thanks @dnhatn, this is looking good. I left a number of smaller comments to address or comment on.
[Resolved review comments on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java, server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java, server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorTests.java, and server/src/test/java/org/elasticsearch/index/store/StoreTests.java]
@ywelsch @henningandersen Thank you for another helpful review. I have responded to or addressed your comments. Would you please take another look?
Although the failure is from a newly introduced test, I think this PR is still ready for another round of review. I am investigating the test failure.
Thanks Nhat!
[Resolved review comment on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java]
LGTM.
Thanks @dnhatn
Thanks @dnhatn and apologies for the delayed review. I left a few questions, but no blockers.
[Resolved review comments on server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java and server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java]
@DaveCTurner Thanks for looking. I have addressed your comments. Would you please take another look?
LGTM thanks @dnhatn
@henningandersen @DaveCTurner @ywelsch Thank you very much for your helpful reviews.
With this change, shard allocation prefers allocating replicas on a node that already has a copy of the shard that is as close as possible to the primary, so that bringing the new replica in sync with the primary is as cheap as possible. Furthermore, if we find a copy that is identical to the primary, we cancel an ongoing recovery, because a copy identical to the primary needs no work to recover as a replica. With this improvement, we no longer need to perform a synced flush before a rolling upgrade or full cluster restart. Closes #46318
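Roughly, the selection described above amounts to the following (a simplified, hypothetical sketch; the real logic lives in ReplicaShardAllocator): among nodes that already hold a copy, prefer one identical to the primary, since it needs no recovery work, and otherwise the one requiring the least data to be brought in sync.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical candidate: a node that already holds a copy of the shard.
record Candidate(String nodeId, boolean identicalToPrimary, long bytesToRecover) {}

final class ReplicaCandidateSelector {
    /** Prefer a copy identical to the primary (noop recovery), else the cheapest match. */
    static Optional<Candidate> pickBest(List<Candidate> candidates) {
        return candidates.stream()
            .min(Comparator
                .comparing((Candidate c) -> !c.identicalToPrimary()) // identical copies sort first
                .thenComparingLong(Candidate::bytesToRecover));      // then least data to transfer
    }
}
```

An ongoing recovery would then be cancelled when this selection finds a strictly better, identical copy elsewhere.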
…50351) Today, the replica allocator uses peer recovery retention leases to select the best-matched copies when allocating replicas of indices with soft-deletes. We can employ this mechanism for indices without soft-deletes because the retaining sequence number of a PRRL is the persisted global checkpoint (plus one) of that copy. If the primary and replica have the same retaining sequence number, then we should be able to perform a noop recovery. The reason is that we must be retaining translog up to the local checkpoint of the safe commit, which is at most the global checkpoint of either copy. The only limitation is that we might not cancel ongoing file-based recoveries with PRRLs for noop recoveries: we can't make the translog retention policy comply with PRRLs. We also have this problem with soft-deletes if a PRRL is about to expire. Relates #45136 Relates #46959
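Under assumed names, the core check this commit describes could look like the sketch below (not the real API): a copy's PRRL retains operations from its persisted global checkpoint plus one, so equal retaining sequence numbers on the primary's and replica's leases imply the replica is already caught up and a noop recovery suffices.

```java
// Hypothetical sketch of the noop-recovery condition. A peer recovery
// retention lease (PRRL) retains operations from retainingSeqNo onwards,
// where retainingSeqNo == persisted global checkpoint + 1 of that copy.
record RetentionLease(String ownerNodeId, long retainingSeqNo) {}

final class NoopRecoveryCheck {
    /**
     * Equal retaining sequence numbers mean both copies have persisted the
     * same global checkpoint, so no operations need to be replayed.
     */
    static boolean canPerformNoopRecovery(RetentionLease primaryLease, RetentionLease replicaLease) {
        return primaryLease.retainingSeqNo() == replicaLease.retainingSeqNo();
    }
}
```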
This change prefers allocating replicas on nodes where an operation-based recovery is possible or where the sync_id matches, to reduce recovery time. With this improvement, we no longer need to perform a synced_flush in a rolling upgrade or full cluster restart.
I started with an implementation where I used the persisted global checkpoint from replicas and peer recovery retention leases from primaries to make decisions. However, I was not happy with the extension capturing the persisted global checkpoint (see https://github.com/elastic/elasticsearch/compare/master...dnhatn:replica-allocator-with-gcp?expand=1#diff-275151cc4a5cdf942f310a219e86a403R485).
We don't need the global checkpoint to make decisions for open indices. Having a peer recovery retention lease alone is enough to guarantee an operation-based recovery, since we share the persisted global checkpoint between copies. I decided to implement this without the global checkpoint, as sketched below.
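The sketch (hypothetical names, not the real API): a candidate node qualifies if the primary retains a PRRL for it, which guarantees an operation-based recovery, or if its copy's sync_id matches the primary's.

```java
import java.util.Set;

// Hypothetical sketch: a node can receive the replica cheaply if either an
// operation-based recovery is possible (the primary holds a PRRL for the node,
// so all operations above the shared persisted global checkpoint are retained)
// or the existing copy has the same sync_id as the primary.
final class CheapRecoveryCheck {
    static boolean canRecoverCheaply(String nodeId,
                                     Set<String> nodesWithPeerRecoveryLease,
                                     String primarySyncId,
                                     String candidateSyncId) {
        boolean operationBased = nodesWithPeerRecoveryLease.contains(nodeId);
        boolean syncIdMatch = primarySyncId != null && primarySyncId.equals(candidateSyncId);
        return operationBased || syncIdMatch;
    }
}
```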
I prefer to support closed/frozen indices in a follow-up after we agree on the approach (i.e., using the global checkpoint or the last commit).
Closes #46318