Introduce startup-silent-period mechanism to avoid partial assignments #247
PTAL @jerqi
Codecov Report
@@             Coverage Diff              @@
##             master     #247      +/-   ##
============================================
+ Coverage     59.16%   59.18%   +0.02%
- Complexity     1340     1343       +3
============================================
  Files           163      163
  Lines          8810     8837      +27
  Branches        833      835       +2
============================================
+ Hits           5212     5230      +18
- Misses         3332     3339       +7
- Partials        266      268       +2
Is |
No such config in YARN. Do you have any ideas?
It's OK for Uniffle. I don't have any other good ideas.
Got it. I will rename it to safemode.
I mean that startup-silent-period is OK.
OK. Do you have any other advice? @jerqi
.key("rss.coordinator.startup-silent-period.enabled")
.booleanType()
.defaultValue(false)
.withDescription("Enable the startup-silent-period to reject the assignment requests "
Should we add more description to explain why we shouldn't use true as the default value?
Done
Updated.
LGTM, thanks @zuston. Waiting for CI.
CI failed. https://github.com/apache/incubator-uniffle/actions/runs/3216468616/jobs/5258387024 #244 Rerun it.
An unrelated flaky test failed. I will rerun it.
What changes were proposed in this pull request?
Introduce startup-silent-period mechanism to avoid partial assignments
Why are the changes needed?
When a coordinator's conf is changed and the coordinator is restarted, it accepts client getAssignment requests immediately, but it serves those requests based on only the shuffle-servers that have registered so far. As a result, some jobs get fewer shuffle-servers than they require, which slows them down.
I think we should make the coordinator wait for more than one shuffle-server heartbeat interval before serving clients. While the coordinator is out of service, client requests will fall back to the slave coordinator.
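The idea can be illustrated with a minimal sketch. This is not the code in this PR: the class names `SilentPeriodGate` and `Coordinator`, the `requestWithFallback` helper, and the 10-second heartbeat and 20-second silent period are all hypothetical, chosen only to show a coordinator rejecting assignment requests until the silent period has elapsed while the client falls back to another coordinator.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.LongSupplier;

public class StartupSilentPeriodSketch {

    /** Hypothetical coordinator-side gate; not Uniffle's actual class. */
    static final class SilentPeriodGate {
        private final boolean enabled;
        private final long silentPeriodMillis;
        private final LongSupplier clock;
        private final long startMillis;

        SilentPeriodGate(boolean enabled, long silentPeriodMillis, LongSupplier clock) {
            this.enabled = enabled;
            this.silentPeriodMillis = silentPeriodMillis;
            this.clock = clock;
            this.startMillis = clock.getAsLong(); // time of coordinator startup
        }

        /** True once the silent period has elapsed, or always if the feature is disabled. */
        boolean isServing() {
            return !enabled || clock.getAsLong() - startMillis >= silentPeriodMillis;
        }
    }

    /** Hypothetical coordinator that answers assignment requests only when its gate is open. */
    static final class Coordinator {
        private final String name;
        private final SilentPeriodGate gate;

        Coordinator(String name, SilentPeriodGate gate) {
            this.name = name;
            this.gate = gate;
        }

        Optional<String> getAssignment(String appId) {
            if (!gate.isServing()) {
                return Optional.empty(); // reject: shuffle-servers may not have all registered yet
            }
            return Optional.of(name + " serves " + appId);
        }
    }

    /** Hypothetical client-side fallback: try coordinators in order until one serves. */
    static Optional<String> requestWithFallback(List<Coordinator> coordinators, String appId) {
        for (Coordinator coordinator : coordinators) {
            Optional<String> result = coordinator.getAssignment(appId);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        long[] now = {0L};                      // fake clock so the example is deterministic
        LongSupplier clock = () -> now[0];
        long heartbeatMillis = 10_000L;         // assumed shuffle-server heartbeat interval
        long silentMillis = 2 * heartbeatMillis; // longer than one heartbeat interval

        Coordinator master = new Coordinator("master", new SilentPeriodGate(true, silentMillis, clock));
        Coordinator slave = new Coordinator("slave", new SilentPeriodGate(false, silentMillis, clock));
        List<Coordinator> coordinators = List.of(master, slave);

        // Right after the master restarts, the request falls back to the slave.
        System.out.println(requestWithFallback(coordinators, "app-1").orElse("no coordinator"));

        // After the silent period, the master serves again.
        now[0] = silentMillis;
        System.out.println(requestWithFallback(coordinators, "app-1").orElse("no coordinator"));
    }
}
```

Injecting the clock keeps the gate testable without sleeping. In this sketch the gate defaults to open when the feature is disabled, which mirrors the `false` default of `rss.coordinator.startup-silent-period.enabled` discussed in the review.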
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
UTs