Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for missing ShardReplicationTasks on new nodes #497

Merged
merged 1 commit into from
Aug 26, 2022
Merged

Fix for missing ShardReplicationTasks on new nodes #497

merged 1 commit into from
Aug 26, 2022

Conversation

ankitkala
Copy link
Member

@ankitkala ankitkala commented Aug 24, 2022

Signed-off-by: Ankit Kala ankikala@amazon.com

Description

In current implementation of IndexReplicationTask, if the task gets killed and spawned up on a new node, it tries to figure out if ShardReplicationTask for all the shards are running or not. If not, it tries to create those again. This logic is broken as of now as instead of checking for ShardReplicationTask for current index, it check for all the ShardReplicationTask on the cluster. This change fixes this by filtering the tasks specific to the current index.

Ideally we should be adding IT for this case but haven't as simulating the usecase would be very difficult.

Testing done

Simulating this code path would be tricky locally as we need to stop the ShardReplicationTasks and then immediately kill the IndexReplicationTask.
For local testing, I've verified that the filtering logic that is added here works as expected and only filters tasks related to the current index.

Issues Resolved

482

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov-commenter
Copy link

codecov-commenter commented Aug 24, 2022

Codecov Report

Merging #497 (587d9b3) into main (a859596) will decrease coverage by 0.61%.
The diff coverage is 69.44%.

❗ Current head 587d9b3 differs from pull request most recent head 7a8fd08. Consider uploading reports for the commit 7a8fd08 to get more accurate results

@@             Coverage Diff              @@
##               main     #497      +/-   ##
============================================
- Coverage     75.16%   74.55%   -0.62%     
  Complexity     1007     1007              
============================================
  Files           141      141              
  Lines          4579     4587       +8     
  Branches        506      506              
============================================
- Hits           3442     3420      -22     
- Misses          823      850      +27     
- Partials        314      317       +3     
Impacted Files Coverage Δ
...action/stop/TransportStopIndexReplicationAction.kt 60.75% <50.00%> (-13.91%) ⬇️
...rch/replication/task/index/IndexReplicationTask.kt 70.87% <75.00%> (-0.20%) ⬇️
...cation/action/changes/TransportGetChangesAction.kt 63.49% <0.00%> (-25.40%) ⬇️
...earch/replication/metadata/UpdateIndexBlockTask.kt 74.19% <0.00%> (-6.46%) ⬇️
...rch/replication/task/shard/ShardReplicationTask.kt 75.18% <0.00%> (-0.73%) ⬇️
...ication/metadata/store/ReplicationMetadataStore.kt 68.88% <0.00%> (+1.48%) ⬆️
...ation/task/shard/ShardReplicationChangesTracker.kt 63.63% <0.00%> (+3.03%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@ankitkala ankitkala marked this pull request as ready for review August 24, 2022 07:49
@ankitkala ankitkala requested a review from a team August 24, 2022 07:49
@gbbafna
Copy link
Collaborator

gbbafna commented Aug 24, 2022

One thing I am not clear is why the missing shard tasks where not spawned . We do keep calling startMissingShardTasks in the MONITORING state : https://github.com/opensearch-project/cross-cluster-replication/blob/main/src/main/kotlin/org/opensearch/replication/task/index/IndexReplicationTask.kt#L210

This fix is needed for sure, but it won't fix the current problem

@ankitkala
Copy link
Member Author

One thing I am not clear is why the missing shard tasks where not spawned . We do keep calling startMissingShardTasks in the MONITORING state : https://github.com/opensearch-project/cross-cluster-replication/blob/main/src/main/kotlin/org/opensearch/replication/task/index/IndexReplicationTask.kt#L210

This fix is needed for sure, but it won't fix the current problem

That is because these ShardReplicationTasks were not in failed state. cc: @soosinha

@saikaranam-amazon
Copy link
Member

This fix is needed for sure, but it won't fix the current problem

Yes, there are two places, where we need to make this fix.
For both startShardTaks and MissingShardTasks (and consolidate logic).

For startShardTasks, the issue is seen when index replication task get initialises on the new node and If the replication tasks are not executing, It won't start again.

.collect(Collectors.toList())

if (runningShardTasks.size == 0) {
if (runningShardTasksForIndex.size == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic will work when they are no shard tasks running for the index. This will not cover the case when there are some (not all) shard tasks running.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.
Based on the offline discussion, I've unified the logic to start new and missing ShardReplicationTask

@ankitkala
Copy link
Member Author

The changes to add IT for this case is not added in this PR. In the meanwhile, I'll explore on how we can add test to simulate this scenario.

Copy link
Collaborator

@gbbafna gbbafna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM . can we write UTs for same, since integ tests are not possible ?

@ankitkala ankitkala requested a review from krishna-ggk August 24, 2022 12:04
soosinha
soosinha previously approved these changes Aug 25, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>
@ankitkala
Copy link
Member Author

LGTM . can we write UTs for same, since integ tests are not possible ?

My bad. DIdn't realise that we have configured the unit test for IndexReplicationTask. Added the unit test to cover the new method.

Copy link
Collaborator

@gbbafna gbbafna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LTGM .

@ankitkala ankitkala merged commit 805f686 into opensearch-project:main Aug 26, 2022
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)

Co-authored-by: Ankit Kala <ankikala@amazon.com>
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)

Co-authored-by: Ankit Kala <ankikala@amazon.com>
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)

Co-authored-by: Ankit Kala <ankikala@amazon.com>
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)

Co-authored-by: Ankit Kala <ankikala@amazon.com>
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)

Co-authored-by: Ankit Kala <ankikala@amazon.com>
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)

Co-authored-by: Ankit Kala <ankikala@amazon.com>
ankitkala added a commit that referenced this pull request Aug 26, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit 805f686)

Co-authored-by: Ankit Kala <ankikala@amazon.com>
ankitkala added a commit that referenced this pull request Aug 29, 2022
Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>

Signed-off-by: Ankit Kala <ankikala@amazon.com>
@ankitkala ankitkala deleted the missing_srt branch April 5, 2023 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants