Remove compounding retries within PrimaryShardReplicationSource #12043

Merged
merged 1 commit into from
Jan 30, 2024

@mch2 mch2 commented Jan 26, 2024

Description

This change fixes and unmutes the segrep bwc test testIndexingWithSegRep.

This change removes the retries inside PrimaryShardReplicationSource and relies on retries in a single place, at the start of replication. That retry happens in SegmentReplicationTargetService's processLatestReceivedCheckpoint, which runs after a replication attempt fails or succeeds. The timeout on the removed retries is the cause of the flaky failures in the segment replication bwc test within IndexingIT, which can occur on node disconnect. The retries persist for over a minute against the same primary node after it has been relocated or shut down, which causes the test to time out.
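
To illustrate why retrying at two layers is costly, here is a minimal sketch in plain Java. It is not the OpenSearch implementation: `RetrySketch`, `withRetries`, and the attempt counts are hypothetical, and the real code is asynchronous and driven by timeouts rather than fixed counts. It only shows that when an inner and an outer layer both retry against an unreachable primary, the number of transport calls multiplies, whereas a single retry point keeps it linear.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in OpenSearch.
final class RetrySketch {

    /** Calls the action up to maxAttempts times and returns true on the first success. */
    static boolean withRetries(int maxAttempts, Supplier<Boolean> action) {
        for (int i = 0; i < maxAttempts; i++) {
            if (action.get()) {
                return true;
            }
        }
        return false;
    }

    /** Counts calls to an unreachable primary when both the outer and inner layers retry. */
    static int compoundedAttempts(int outer, int inner) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated transport call that always fails, as if the primary had shut down.
        Supplier<Boolean> transportCall = () -> {
            calls.incrementAndGet();
            return false;
        };
        withRetries(outer, () -> withRetries(inner, transportCall));
        return calls.get();
    }

    /** Counts calls when only the outer layer retries and the source makes one attempt. */
    static int singleLayerAttempts(int outer) {
        AtomicInteger calls = new AtomicInteger();
        Supplier<Boolean> transportCall = () -> {
            calls.incrementAndGet();
            return false;
        };
        withRetries(outer, transportCall);
        return calls.get();
    }
}
```

With 3 outer and 4 inner attempts, `compoundedAttempts(3, 4)` makes 12 calls, while `singleLayerAttempts(3)` makes 3.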

This change also simplifies the cancellation flow on the target service before the shard is closed. Previously we would "request" a cancel, which did not remove the target from the ongoing replications collection until a cancellation failure was thrown. The transport calls from PrimaryShardReplicationSource are no longer wrapped in CancellableThreads by the client, so a call to "cancel" will not throw. Instead, we now immediately remove the target and decref/close it.
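
The simplified cancellation flow can be sketched roughly as follows. This is a hedged illustration rather than the code in this PR: `CancelSketch`, `Target`, and the integer shard key are hypothetical stand-ins, and the real target holds files and other resources instead of a boolean flag. The point is that cancel removes the target from the ongoing collection and drops its reference in one step, without waiting for a cancellation failure to propagate.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names exist in OpenSearch.
final class CancelSketch {

    /** Stand-in for a ref-counted replication target. */
    static final class Target {
        private final AtomicInteger refCount = new AtomicInteger(1);
        private volatile boolean closed = false;

        /** Drops one reference and closes the target when the last one is released. */
        void decRef() {
            if (refCount.decrementAndGet() == 0) {
                closed = true;
            }
        }

        boolean isClosed() {
            return closed;
        }
    }

    // Ongoing replications keyed by a (hypothetical) shard id.
    private final Map<Integer, Target> ongoing = new ConcurrentHashMap<>();

    void start(int shardId, Target target) {
        ongoing.put(shardId, target);
    }

    /** Removes the target immediately and releases the collection's reference. */
    boolean cancel(int shardId) {
        Target removed = ongoing.remove(shardId);
        if (removed == null) {
            return false; // nothing was in flight for this shard
        }
        removed.decRef();
        return true;
    }

    boolean isOngoing(int shardId) {
        return ongoing.containsKey(shardId);
    }
}
```

Because `ConcurrentHashMap.remove` returns the previous value atomically, only one caller can observe the target and release it, so a second `cancel` for the same shard is a no-op in this sketch.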

Related Issues

Resolves #7679

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented Jan 26, 2024

Compatibility status:

Checks if related components are compatible with change 44e03b5

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/alerting.git]

✅ Gradle check result for c1108e3: SUCCESS

codecov bot commented Jan 26, 2024

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (6012504) 71.28% compared to head (44e03b5) 71.47%.
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...s/replication/SegmentReplicationTargetService.java | 66.66% | 1 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12043      +/-   ##
============================================
+ Coverage     71.28%   71.47%   +0.19%     
- Complexity    59414    59542     +128     
============================================
  Files          4925     4925              
  Lines        279479   279472       -7     
  Branches      40635    40636       +1     
============================================
+ Hits         199226   199759     +533     
+ Misses        63731    63119     -612     
- Partials      16522    16594      +72     

@github-actions github-actions bot added bug Something isn't working distributed framework flaky-test Random test failure that succeeds on second run labels Jan 29, 2024
@mch2 mch2 marked this pull request as ready for review January 29, 2024 18:31

❌ Gradle check result for 085d9c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

mch2 commented Jan 29, 2024

> ❌ Gradle check result for 085d9c6: FAILURE
>
> Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

#11974

❌ Gradle check result for f64e85f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

dreamer-89 commented Jan 29, 2024

> ❌ Gradle check result for f64e85f: FAILURE
>
> Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

The Gradle build failed because of a single test. Maybe this branch is missing a rebase against main?
https://build.ci.opensearch.org/job/gradle-check/32810/testReport/org.opensearch.qa.verify_version_constants/VerifyVersionConstantsIT/testLuceneVersionConstant/

java.lang.AssertionError: 
Expected: <9.9.2>
     but: was <9.9.1>

❕ Gradle check result for 094139d: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

mch2 commented Jan 29, 2024

testPrimaryRelocationWhileIndexing

#9191

This change removes retries within PrimaryShardReplicationSource and relies on retries in one place at the start of replication.
This is done within SegmentReplicationTargetService's processLatestReceivedCheckpoint after a failure/success occurs.
The timeout on these retries is the cause of flaky failures from SegmentReplication's bwc test within IndexingIT, that can occur
on node disconnect.  The retries will persist for over ~1m to the same primary node that has been relocated/shut down and cause the test to timeout.

This change also includes simplifications to the cancellation flow on the target service before the shard is closed.
Previously we "request" a cancel that does not remove the target from the ongoing replications collection until a cancellation failure is thrown.
The transport calls from PrimaryShardReplicationSource are no longer wrapped in CancellableThreads by the client so a call to "cancel" will not throw.
Instead we now immediately remove the target and decref/close it.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

mch2 commented Jan 29, 2024

Apologies for losing the history here; I force-pushed to rebase and fix the DCO check.

✅ Gradle check result for 44e03b5: SUCCESS

@mch2 mch2 merged commit 11644d5 into opensearch-project:main Jan 30, 2024
30 checks passed
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this pull request Mar 1, 2024
…search-project#12043)

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
…search-project#12043)

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Mar 20, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 20, 2024
Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
(cherry picked from commit 11644d5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dblock pushed a commit that referenced this pull request Mar 20, 2024
…) (#12800)

(cherry picked from commit 11644d5)

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…search-project#12043)

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Labels
backport 2.x Backport to 2.x branch bug Something isn't working distributed framework flaky-test Random test failure that succeeds on second run skip-changelog
Development

Successfully merging this pull request may close these issues.

[BUG] org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep test failure
3 participants