[Autoscaler] Make AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE configurable (#50176)

## Why are these changes needed?

This change makes `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE`
configurable.

Power users may wish to submit more than 1000 tasks at once and have the
autoscaler respond by immediately scaling up the requisite number of
nodes.

To make this happen, `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` must
be increased beyond the 1000 cap; otherwise, the demand from most tasks
is ignored and upscaling is slow.
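
The effect of the cap can be illustrated with a minimal sketch. This is not Ray's actual code: `clip_demand_vector` is a hypothetical helper, and the assumption is that demands beyond the cap are simply not visible to the scheduler in a given pass.

```python
from typing import Dict, List

# One pending task's resource request, e.g. {"CPU": 1.0}.
ResourceDemand = Dict[str, float]


def clip_demand_vector(
    demands: List[ResourceDemand], max_size: int
) -> List[ResourceDemand]:
    """Hypothetical helper: keep at most `max_size` demands.

    Assumption for illustration only; the real autoscaler may clip
    or aggregate the demand vector differently.
    """
    return demands[:max_size]


# 10,000 pending tasks, each asking for one CPU.
pending = [{"CPU": 1.0} for _ in range(10_000)]

# With the default cap, only 1,000 demands are considered.
visible_default = clip_demand_vector(pending, 1000)

# With a cap above the number of pending tasks, all of them are considered.
visible_raised = clip_demand_vector(pending, 100_000)

print(len(visible_default), len(visible_raised))
```

Under this assumption, a cap of 1000 hides 9,000 of the 10,000 demands, which matches the slow upscaling described above.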

## Related issue number

The fixed cap on `AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` causes the issue
reported in #45373.

This PR provides a workaround.
After merging this PR, if a user wants, say, 10k tasks to trigger quick
upscaling, then the user can increase
`AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` past 10k.
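
As a rough sketch of how the override behaves, the snippet below uses a stand-in for `env_integer`. The stand-in is an assumption about the helper's behavior (read an integer from the environment, fall back to the default) and may differ from the real implementation in `constants.py`; the value `20000` is only an example.

```python
import os


def env_integer(key: str, default: int) -> int:
    """Stand-in for the helper used in constants.py (assumed behavior):
    return the environment variable as an int, or `default` if unset."""
    raw = os.environ.get(key)
    if raw is None:
        return default
    return int(raw)


KEY = "AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE"

# Without the variable set, the previous hard-coded value is used.
os.environ.pop(KEY, None)
default_cap = env_integer(KEY, 1000)

# With the variable set, the cap follows the environment.
os.environ[KEY] = "20000"
raised_cap = env_integer(KEY, 1000)

print(default_cap, raised_cap)
```

Because the constant is assigned at module level, it is evaluated when the constants module is imported. The variable therefore presumably needs to be set in the environment of the autoscaler process before that process starts; setting it only in a driver script is unlikely to have any effect.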

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in `doc/source/tune/api/` under the
  corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
  few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
  
I tested it experimentally by increasing
`AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE` to 100k and submitting 10k
tasks; upscaling happened smoothly.

---------

Signed-off-by: Dmitri Gekhtman <dmitri.gekhtman@getcruise.com>
Co-authored-by: Dmitri Gekhtman <dmitri.gekhtman@getcruise.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
3 people authored Feb 9, 2025
1 parent b880e96 commit ad5db71
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion python/ray/autoscaler/_private/constants.py
@@ -81,7 +81,9 @@ def env_integer(key, default):
 # The maximum allowed resource demand vector size to guarantee the resource
 # demand scheduler bin packing algorithm takes a reasonable amount of time
 # to run.
-AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = 1000
+AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = env_integer(
+    "AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE", 1000
+)
 
 # Port that autoscaler prometheus metrics will be exported to
 AUTOSCALER_METRIC_PORT = env_integer("AUTOSCALER_METRIC_PORT", 44217)
