Fix for missing ShardReplicationTasks on new nodes #497
Codecov Report
@@ Coverage Diff @@
## main #497 +/- ##
============================================
- Coverage 75.16% 74.55% -0.62%
Complexity 1007 1007
============================================
Files 141 141
Lines 4579 4587 +8
Branches 506 506
============================================
- Hits 3442 3420 -22
- Misses 823 850 +27
- Partials 314 317 +3
One thing I am not clear on is why the missing shard tasks were not spawned. We do keep calling … This fix is needed for sure, but it won't fix the current problem.
That is because these ShardReplicationTasks were not in a failed state. cc: @soosinha
Yes, there are two places where we need to make this fix. For startShardTasks, the issue is seen when the index replication task gets initialised on the new node: if the shard replication tasks are not executing, they won't be started again.
.collect(Collectors.toList())

- if (runningShardTasks.size == 0) {
+ if (runningShardTasksForIndex.size == 0) {
This logic will work when there are no shard tasks running for the index. It will not cover the case where some (but not all) shard tasks are running.
Correct.
Based on the offline discussion, I've unified the logic that starts new and missing ShardReplicationTasks.
An IT for this case is not added in this PR. In the meantime, I'll explore how we can add a test that simulates this scenario.
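To illustrate the unified logic mentioned above, here is a minimal sketch of treating "new" and "missing" shard tasks the same way: take every shard of the index and start a task for each one that has no running task. `RunningShardTask` and `shardsToStart` are hypothetical names used only for illustration; they are not the plugin's actual types, and the real change may be structured differently.

```kotlin
// Hypothetical stand-in for a running shard replication task.
// The real plugin inspects persistent task metadata instead.
data class RunningShardTask(val index: String, val shardId: Int)

// Returns the shard ids of `index` that have no running task.
// An empty `running` list (fresh start) and a partially populated one
// (some tasks missing) are handled by the same code path.
fun shardsToStart(index: String, shardCount: Int, running: List<RunningShardTask>): List<Int> {
    // Only tasks that belong to this index count as "running" for it.
    val runningShardIds = running
        .filter { it.index == index }
        .map { it.shardId }
        .toSet()
    // Every shard without a running task needs one started.
    return (0 until shardCount).filter { it !in runningShardIds }
}
```

With this shape, the earlier `size == 0` check is no longer needed: when all shard tasks are running the result is empty and nothing is started.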
LGTM. Can we write UTs for the same, since integ tests are not possible?
Signed-off-by: Ankit Kala <ankikala@amazon.com>
My bad. Didn't realise that we have configured the unit test for …
LGTM.
Signed-off-by: Ankit Kala <ankikala@amazon.com> (cherry picked from commit 805f686)
Signed-off-by: Ankit Kala ankikala@amazon.com
Description
In the current implementation of IndexReplicationTask, if the task gets killed and respawned on a new node, it tries to figure out whether a ShardReplicationTask is running for every shard. If not, it tries to create the missing ones again. This logic is currently broken: instead of checking for the ShardReplicationTasks of the current index, it checks for all the ShardReplicationTasks on the cluster. This change fixes that by filtering the tasks specific to the current index.
Ideally we should add an IT for this case, but we haven't, as simulating the use case would be very difficult.
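As a rough sketch of the bug described above (not the plugin's actual code), the difference comes down to whether the "is anything running?" check looks at every shard task on the cluster or only at those of the current index. `ShardTaskInfo`, `needsRestartClusterWide` and `needsRestartForIndex` are hypothetical names introduced for illustration.

```kotlin
// Hypothetical stand-in for the shard task metadata visible on the cluster.
data class ShardTaskInfo(val index: String, val shardId: Int)

// Broken check: any shard task on the cluster, even one for another index,
// makes it look as if this index's tasks are already running.
fun needsRestartClusterWide(allTasks: List<ShardTaskInfo>): Boolean =
    allTasks.isEmpty()

// Fixed check: only tasks that belong to the current index are considered.
fun needsRestartForIndex(allTasks: List<ShardTaskInfo>, index: String): Boolean {
    val runningShardTasksForIndex = allTasks.filter { it.index == index }
    return runningShardTasksForIndex.isEmpty()
}
```

For example, if the only running task belongs to `other-index`, the cluster-wide check reports that nothing needs restarting for `my-index`, while the per-index check correctly reports that it does.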
Testing done
Simulating this code path would be tricky locally as we need to stop the ShardReplicationTasks and then immediately kill the IndexReplicationTask.
For local testing, I've verified that the filtering logic that is added here works as expected and only filters tasks related to the current index.
Issues Resolved
482
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.