[Ray Autoscaler] Ray autoscaler does not scale up effectively and fast #45373

Open

raghumdani opened this issue May 16, 2024 · 8 comments

Assignees: jjyao
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), P2 (Important issue, but not time-critical), performance

Comments

@raghumdani commented May 16, 2024

What happened + What you expected to happen

We run Ray jobs in production. Right after upgrading Ray from 2.3.0 to 2.20.0, we saw a significant increase in job latency. Upon investigation, we found that the autoscaler wasn't spinning up new nodes even when the majority of tasks were queued waiting to be scheduled, which increased overall latency. We only schedule by memory, and these jobs weren't using the full memory they requested; however, we expect the autoscaler to still spin up new nodes to serve the demand. This issue does not occur with the SPREAD scheduling strategy, however (not sure why!).

Versions / Dependencies

Ray: 2.20.0
OS: Ubuntu 20.04
Python: 3.10

Reproduction script

You may need to install deltacat via pip3 install deltacat. Each worker node has 31 CPUs, 220 GB of memory, and a custom max_tasks resource of 10000.

from deltacat.utils.ray_utils.concurrency import (
    invoke_parallel,
)
import numpy as np
import ray
import time

ray.init(address='auto')

@ray.remote
def i_will_return_after_t_secs(t):
    # Allocate ~8 GB (10 arrays of ~0.8 GB each), well below the ~23 GB each
    # task requests, then hold onto it while sleeping for t seconds.
    all_arrs = []
    for _ in range(10):
        all_arrs.append(np.random.rand(100000000))  # ~0.8 GB per array

    time.sleep(t)

def options_provider(*args, **kwargs):
    # Every task requests ~23 GB of memory, a tiny CPU fraction, and a sliver
    # of the custom max_tasks resource.
    return {"num_cpus": 0.01, "memory": 23387571131, "resources": {"max_tasks": 0.001}}

# 14000 tasks, each sleeping for 120 seconds.
items_to_run = [120 for _ in range(14000)]

start = time.monotonic()
print(f"Now starting: {start}")

pending = invoke_parallel(
    items_to_run,
    i_will_return_after_t_secs,
    max_parallelism=4096,
    options_provider=options_provider,
)

end_invoke = time.monotonic()
print(f"Now ended invoke: {end_invoke - start}")

ray.get(pending)

end = time.monotonic()
print(f"Complete: {end - start}")

# Ray 2.3.0 scaled up to 460 nodes at peak, while 2.20.0 used at most 364 nodes.

Issue Severity

None

raghumdani added the bug and triage labels on May 16, 2024
anyscalesam added the performance, core-autoscaler, and core labels on May 16, 2024
rynewang added the @external-author-action-required and P1 labels and removed the triage label on May 20, 2024
anyscalesam added the triage label and removed the P1 and @external-author-action-required labels on Oct 12, 2024
@anyscalesam (Contributor)

There are a bunch of these we should look into further... kicking this back for re-triage.

@Moonquakes

I also ran into a similar problem, and the speed of the Ray scheduler seemed to be affected.
In my scenario there were 1000 nodes, each with two CPUs, and a total of 10,000 tasks, each requiring two CPUs. I found that by default the resources of these 1000 nodes were sometimes not fully utilized, with only about 800 nodes in use. When I turned on SPREAD scheduling, there was no such problem. Is there any difference in the underlying implementation of these two scheduling strategies? cc @anyscalesam @raghumdani

@jjyao (Collaborator) commented Oct 28, 2024

@Moonquakes It seems you have a different issue, about the scheduler rather than the autoscaler. Do you mind creating a new GitHub ticket discussing your issue, with a repro?

jjyao self-assigned this on Oct 28, 2024
jjyao added the P1 and P2 labels and removed the triage and P1 labels on Oct 28, 2024
@DmitriGekhtman (Contributor) commented Dec 8, 2024

I was able to reproduce this with the following setup:

60-CPU nodes
min_workers=0
Large upscaling speed
4000 1.5-CPU tasks, each sleeps for 10 minutes

40 tasks should fit per node, so you should get 100 nodes scaled up immediately.

Instead, upscaling happens slowly in chunks.
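
For concreteness, a minimal sketch of that setup (the task name is hypothetical; the node type, min_workers, and upscaling speed live in the cluster config and are not shown):

import ray
import time

ray.init(address="auto")

# Hypothetical repro task: 1.5 CPUs, sleeps for 10 minutes.
@ray.remote(num_cpus=1.5)
def sleep_task():
    time.sleep(600)

# 4000 tasks x 1.5 CPUs = 6000 CPUs of demand, i.e. 100 60-CPU nodes expected up front.
refs = [sleep_task.remote() for _ in range(4000)]
ray.get(refs)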

The pending tasks do show up in ray status, so it is surprising that the resource demand scheduler does not translate them into the appropriate node shapes.
This seems like an issue in the resource demand scheduler Python code; I can probably look into solving this one (as opposed to #36926, which is a C++ Ray Core problem that I have no idea about).

@DmitriGekhtman (Contributor)

One thing I find interesting is that, in the set-up I described in my last comment, I get an initial scale up of exactly 25 nodes each time, corresponding to exactly 1000 tasks being processed.
I wonder if there's some internal cap at play here.

@DmitriGekhtman (Contributor) commented Dec 8, 2024

Ok, I can definitely confirm this pattern:
4000 1.5-CPU tasks submitted -> Initial scale-up to 25 60-CPU nodes, corresponding to 1000 tasks.
2000 3.0-CPU tasks submitted -> Initial scale-up to 50 60-CPU nodes, corresponding to 1000 tasks.
1000 6.0-CPU tasks submitted -> Initial scale-up to 100 60-CPU nodes, corresponding to 1000 tasks.
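
The numbers are consistent with only the first 1000 pending tasks being counted toward demand; a quick back-of-the-envelope check (the 60-CPU node size and the 1000-entry cap are taken from the comments above):

import math

def initial_nodes(num_tasks, cpus_per_task, node_cpus=60, demand_cap=1000):
    # Assume only the first `demand_cap` pending tasks contribute to the
    # resource demand the autoscaler acts on.
    considered = min(num_tasks, demand_cap)
    return math.ceil(considered * cpus_per_task / node_cpus)

print(initial_nodes(4000, 1.5))  # 25
print(initial_nodes(2000, 3.0))  # 50
print(initial_nodes(1000, 6.0))  # 100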

@DmitriGekhtman (Contributor)

Ok, I think what I'm observing is AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = 1000.

But that limit has been there for many years, so it's odd that upgrading to a recent version would suddenly slow things down.

It might be nice to make that limit configurable via an environment variable.
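
A minimal sketch of what making it configurable could look like, mirroring the pattern used for other autoscaler constants (the env_integer helper is written out here for illustration; this is not the actual patch):

import os

def env_integer(key, default):
    # Read an integer override from the environment, falling back to the default.
    return int(os.environ[key]) if key in os.environ else default

AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = env_integer(
    "AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE", 1000
)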

@DmitriGekhtman (Contributor)

#50176 provides a workaround

pcmoritz added a commit that referenced this issue Feb 9, 2025
…able (#50176)

## Why are these changes needed?

This change makes `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE`
configurable.

Power users may wish to submit more than 1000 tasks at once and have the
autoscaler respond by immediately scaling up the requisite number of
nodes.

To make this happen, `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` must
be increased beyond the 1000 cap; otherwise, the demand from most tasks
is ignored and upscaling is slow.

## Related issue number

The cap on `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` causes the issue experienced in #45373.

This PR provides a workaround.
After merging this PR, if a user wants, say, 10k tasks to trigger quick
upscaling, then the user can increase
`AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` past 10k.
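
For instance, assuming the override is read from an environment variable of the same name (and noting it must be set in the environment of the process that runs the autoscaler, i.e. on the head node before Ray starts, not in the driver script):

import os

# Illustration only: set before the autoscaler/monitor process starts on the head node.
os.environ["AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE"] = "100000"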

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

I tested it experimentally by increasing
`AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` to 100k and submitting 10k
tasks; upscaling happened smoothly.

---------

Signed-off-by: Dmitri Gekhtman <dmitri.gekhtman@getcruise.com>
Co-authored-by: Dmitri Gekhtman <dmitri.gekhtman@getcruise.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>