[Ray Autoscaler] Ray autoscaler does not scale up effectively and fast #45373
Comments
There are a bunch of these; we should look into it further... kicking this back to re-triage.
I also found a similar problem, and the speed of the Ray scheduler seemed to be affected.
@Moonquakes It seems you have a different issue, about the scheduler rather than the autoscaler. Do you mind creating a new GH ticket discussing your issue, with a repro?
I was able to reproduce this with the following setup: 60-CPU nodes, with 40 tasks fitting per node, so you should get 100 nodes scaled up immediately. Instead, upscaling happens slowly in chunks. The pending tasks do show up in …
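A rough sketch of that kind of setup (illustrative only; the task function, the 1.5-CPU request, the sleep duration, and the 4000-task count are assumptions chosen so that roughly 40 tasks fit on a 60-CPU node, not the reporter's actual script):

```python
import time

import ray

ray.init(address="auto")  # assumes an autoscaling cluster of 60-CPU worker nodes

@ray.remote(num_cpus=1.5)  # ~40 such tasks fit on a 60-CPU node
def work():
    time.sleep(60)  # keep tasks pending/running long enough to observe scaling
    return 1

# 4000 queued tasks should, ideally, make the autoscaler request ~100 nodes at once;
# instead, upscaling was observed to happen slowly, in chunks.
ray.get([work.remote() for _ in range(4000)])
```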
One thing I find interesting is that, in the set-up I described in my last comment, I get an initial scale-up of exactly 25 nodes each time, corresponding to exactly 1000 tasks being processed.
Ok, I can definitely confirm this pattern:
Ok, I think what I'm observing is the effect of `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = 1000`. But that limit has been there for many years, so it's weird that upgrading to a recent version would suddenly slow things down. It might be nice to make that limit configurable via an env variable.
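That cap also lines up with the initial scale-up of exactly 25 nodes reported above: if only the first 1000 pending tasks are reported to the autoscaler, the initial request is roughly 1000 divided by the number of tasks that fit on one node. A quick back-of-the-envelope check using the numbers from the reproduction above:

```python
import math

reported_tasks = 1000   # default AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE
tasks_per_node = 40     # from the repro above: ~40 tasks fit on a 60-CPU node
print(math.ceil(reported_tasks / tasks_per_node))  # -> 25 nodes per upscaling chunk
```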
#50176 provides a workaround.
…able (#50176)

## Why are these changes needed?

This change makes `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` configurable. Power users may wish to submit more than 1000 tasks at once and have the autoscaler respond by immediately scaling up the requisite number of nodes. To make this happen, `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` must be increased beyond the 1000 cap; otherwise, the demand from most tasks is ignored and upscaling is slow.

## Related issue number

The limited `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` causes the issue experienced in #45373. This PR provides a workaround: after merging, if a user wants, say, 10k tasks to trigger quick upscaling, the user can increase `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` past 10k.

Tested experimentally by increasing `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` to 100k and submitting 10k tasks; upscaling happened smoothly.

Signed-off-by: Dmitri Gekhtman <dmitri.gekhtman@getcruise.com>
Co-authored-by: Dmitri Gekhtman <dmitri.gekhtman@getcruise.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
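With that change merged, a cluster operator hitting this issue can raise the cap before starting the head node. A minimal sketch (the value 100000 is illustrative, and this assumes the variable must be visible to the head-node process that runs the autoscaler):

```sh
# Illustrative value: report demand for up to ~100k pending tasks to the autoscaler.
export AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE=100000
ray start --head
```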
What happened + What you expected to happen
We run Ray jobs in production. Right after upgrading Ray from 2.3.0 to 2.20.0, we saw a significant increase in job latency. Upon investigation, we found that the autoscaler wasn't spinning up new nodes even when the majority of the tasks were queued waiting to be scheduled, which increased overall latency. We only schedule by memory, and these jobs weren't using the full memory they requested; however, we expect the autoscaler to still spin up new nodes to serve the demand. This issue does not occur with the SPREAD scheduling strategy, however (not sure why!).
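For context, memory-only scheduling and the SPREAD variant look roughly like this in Ray's task API (a minimal sketch; the 8 GiB memory request, `num_cpus=0`, the task body, and the task count are assumptions, not the production job):

```python
import ray

ray.init(address="auto")

# The job schedules purely by memory (specified in bytes); CPU is not part of the request.
@ray.remote(memory=8 * 1024**3, num_cpus=0)
def process(batch):
    return len(batch)

# With the default scheduling strategy, queued tasks did not trigger upscaling.
refs = [process.remote(list(range(100))) for _ in range(5000)]

# With SPREAD, the reporter did not observe the problem:
# refs = [process.options(scheduling_strategy="SPREAD").remote(list(range(100)))
#         for _ in range(5000)]

ray.get(refs)
```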
Versions / Dependencies
Ray: 2.20.0
OS: Ubuntu 20.04
Python: 3.10
Reproduction script
You might need to install deltacat with `pip3 install deltacat`. Each worker has 31 CPUs, 220 GB memory, and 10000 max_tasks.
Issue Severity
None