[Autoscaler] Make AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE configurable (#50176) · ray-project/ray@ad5db71

## Why are these changes needed?

This change makes `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE`
configurable.

Power users may wish to submit more than 1000 tasks at once and have the
autoscaler respond by immediately scaling up the requisite number of
nodes.

To make this happen, `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` must
be raised above its default cap of 1000; otherwise, the demand from most
tasks is ignored and upscaling is slow.
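
To illustrate the effect of the cap, here is a toy model rather than the autoscaler's actual scheduling code. It assumes the cap simply hides every demand past the first `max_demand_vector_size` entries; the function name, the 1-CPU tasks, and the 8-CPU nodes are all hypothetical.

```python
import math


def nodes_needed(num_tasks, cpus_per_task, cpus_per_node, max_demand_vector_size):
    """Toy estimate of how many nodes a single scaling pass would request.

    Assumption: demands beyond `max_demand_vector_size` are invisible
    to the scheduler during that pass.
    """
    visible_tasks = min(num_tasks, max_demand_vector_size)
    tasks_per_node = cpus_per_node // cpus_per_task
    return math.ceil(visible_tasks / tasks_per_node)


# 10k one-CPU tasks on hypothetical 8-CPU nodes.
capped = nodes_needed(10_000, 1, 8, max_demand_vector_size=1_000)
raised = nodes_needed(10_000, 1, 8, max_demand_vector_size=100_000)
print(capped, raised)
```

Under these assumptions the capped pass asks for only a tenth of the nodes the workload needs, which is the slow upscaling described above.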

## Related issue number

The fixed cap on `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` causes the
issue reported in #45373.

This PR provides a workaround.
After it is merged, a user who wants, say, 10k tasks to trigger quick
upscaling can set the `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE`
environment variable to a value above 10k.
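
A minimal sketch of applying the workaround from Python. It assumes the variable must be present in the environment of the process that runs the autoscaler, since the constant is read when `constants.py` is imported, and that the head node is launched with `ray start --head`. The helper names `autoscaler_env` and `start_head_node` are hypothetical.

```python
import os
import subprocess

ENV_KEY = "AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE"


def autoscaler_env(max_demand_vector_size, base_env=None):
    """Return a copy of the environment with the cap overridden."""
    env = dict(os.environ if base_env is None else base_env)
    # Environment variables are strings; the constant is parsed as an int.
    env[ENV_KEY] = str(int(max_demand_vector_size))
    return env


def start_head_node(max_demand_vector_size):
    """Launch a head node whose child processes inherit the override.

    Assumption: the autoscaler is started by this command and inherits
    its environment. Not called here; adapt to your own deployment.
    """
    subprocess.run(
        ["ray", "start", "--head"],
        env=autoscaler_env(max_demand_vector_size),
        check=True,
    )
```

Setting the variable after the autoscaler process has started is unlikely to have any effect, because the value is read once at import time.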

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
  
I tested it experimentally by increasing
`AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` to 100k and submitting 10k
tasks; upscaling happened smoothly.

---------

Signed-off-by: Dmitri Gekhtman <dmitri.gekhtman@getcruise.com>
Co-authored-by: Dmitri Gekhtman <dmitri.gekhtman@getcruise.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>

3 people authored Feb 9, 2025

1 parent b880e96 commit ad5db71

python/ray/autoscaler/_private/constants.py

@@ -81,7 +81,9 @@ def env_integer(key, default):
     # The maximum allowed resource demand vector size to guarantee the resource
     # demand scheduler bin packing algorithm takes a reasonable amount of time
     # to run.
-    AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = 1000
+    AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = env_integer(
+        "AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE", 1000
+    )
     # Port that autoscaler prometheus metrics will be exported to
     AUTOSCALER_METRIC_PORT = env_integer("AUTOSCALER_METRIC_PORT", 44217)
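
The diff relies on the existing `env_integer(key, default)` helper, whose body is not shown in this commit. As a rough sketch of the behaviour the change depends on, and not Ray's actual implementation (which may handle additional cases), a helper like the hypothetical `env_integer_sketch` below returns the default when the variable is unset and an integer parsed from the environment otherwise.

```python
import os


def env_integer_sketch(key, default):
    """Hypothetical stand-in for an env-backed integer constant."""
    raw = os.environ.get(key)
    if raw is None:
        # Variable not set: fall back to the hard-coded default.
        return default
    # Raises ValueError if the value is not a valid integer literal.
    return int(raw)


KEY = "AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE"

os.environ.pop(KEY, None)
default_value = env_integer_sketch(KEY, 1000)

os.environ[KEY] = "100000"
overridden_value = env_integer_sketch(KEY, 1000)

# Leave the process environment as we found it.
os.environ.pop(KEY, None)
```

With this pattern, existing deployments that never set the variable keep the previous limit of 1000.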
