Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 2.x] Limit RW separation to remote store enabled clusters and update recovery flow #16999

Merged
merged 1 commit into from
Jan 10, 2025

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Jan 10, 2025

Description

Manual backport of #16760 - Only conflict was in IndexShard.java because TranslogManager does not exist on 2.x.

Related Issues

Resolves #15952

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ery flow (opensearch-project#16760)

* Update search only replica recovery flow

This PR includes multiple changes to search replica recovery.
1. Change search only replica copies to recover as empty store instead of PEER. This will run a store recovery that syncs segments from remote store directly and eliminate any primary communication.
2. Remove search replicas from the in-sync allocation ID set and update routing table to exclude them from allAllocationIds.  This ensures primaries aren't tracking or validating the routing table for any search replica's presence.
3. Change search replica validation to require remote store.  There are versions of the above changes that are still possible with primary based node-node replication, but I don't think they are worth making  at this time.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* more coverage

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* add changelog entry

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* add assertions that Search Replicas are not in the in-sync id set nor the AllAllocationIds set in the routing table

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* update async task to only run if the FF is enabled and we are a remote store cluster.

This check had previously only checked for segrep

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* clean up max shards logic

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* remove search replicas from check during renewPeerRecoveryRetentionLeases

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Revert "update async task to only run if the FF is enabled and we are a remote store cluster."

reverting this, we already check for remote store earlier.

This reverts commit 48ca1a3.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Add more tests for failover case

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Update remotestore restore logic and add test ensuring we can restore only writers when red

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Fix Search replicas to honor node level recovery limits

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Fix translog UUID mismatch on existing store recovery.

This commit adds PR feedback and recovery tests post node restart.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Fix spotless

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Fix bug with remote restore and add more tests

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

---------

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
(cherry picked from commit 8191de8)
Copy link
Contributor

❕ Gradle check result for 3b04475: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 74.19355% with 16 lines in your changes missing coverage. Please review.

Project coverage is 71.92%. Comparing base (afb2f94) to head (3b04475).
Report is 6 commits behind head on 2.x.

Files with missing lines Patch % Lines
...in/java/org/opensearch/index/shard/IndexShard.java 44.44% 2 Missing and 3 partials ⚠️
...search/cluster/routing/IndexShardRoutingTable.java 42.85% 2 Missing and 2 partials ⚠️
...llocation/decider/ThrottlingAllocationDecider.java 84.61% 0 Missing and 2 partials ⚠️
...a/org/opensearch/cluster/routing/ShardRouting.java 66.66% 0 Missing and 1 partial ⚠️
...uster/routing/allocation/IndexMetadataUpdater.java 75.00% 0 Missing and 1 partial ⚠️
...org/opensearch/index/seqno/ReplicationTracker.java 0.00% 0 Missing and 1 partial ⚠️
...a/org/opensearch/index/shard/ReplicationGroup.java 50.00% 0 Missing and 1 partial ⚠️
...java/org/opensearch/index/shard/StoreRecovery.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                2.x   #16999      +/-   ##
============================================
+ Coverage     71.79%   71.92%   +0.12%     
- Complexity    65454    65599     +145     
============================================
  Files          5316     5318       +2     
  Lines        305529   305739     +210     
  Branches      44509    44580      +71     
============================================
+ Hits         219359   219902     +543     
+ Misses        67902    67504     -398     
- Partials      18268    18333      +65     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@msfroh msfroh merged commit 6fddac6 into opensearch-project:2.x Jan 10, 2025
62 of 63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants