Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tune] Cancel pg.ready() task for pending trials that end up reusing an actor #35748

Merged
merged 3 commits into from
May 25, 2023

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented May 24, 2023

Why are these changes needed?

This PR cancels pg.ready() tasks that are pending node assignment forever, in the case that trials are reusing actors. For Tune experiments with many trials, this clutters the user's dashboard with a bunch of these pg.ready() tasks that don't go away. The trials corresponding to these requests have already been assigned different actors, so there is no point keeping the futures around.

Context

Currently, the PlacementGroupResourceManager uses the pg.ready() future to determine when a placement group is ready for an actor to be created with it.

Let's say we're on a cluster with 10 CPUs, each trial needs 2 CPUs, and we want to run 100 trials. Tune will schedule 5 actors, and Tune will also have some number of pending trials that also try to allocate a placement group (the default is 16 pending in this case I believe, see the TUNE_MAX_PENDING_TRIALS_PG env var).

In the beginning, the state will be:

  • 5 trials with 2 CPUs each.
  • 16 trials, associated with 16 pending pg.ready() futures.

Then, when the first trial finishes, one of the 16 pending trials will end up reusing the actor of the finished trial.
The resources will not be relinquished from that actor, so the new trial's pg.ready() will never finish, and we've also lost the reference.

@jjyao Question: Why isn't it garbage collected automatically? I think it's because pg.ready() calls some global function that always has at least 1 reference: bundle_reservation_check_func.

So, the pending futures will just keep accumulating as trials keep reusing actors. This leads to the dashboard looking like:

Screen Shot 2023-04-27 at 1 25 09 PM

After this change, there will only ever be ~16 pending futures, which are just the PENDING trials that will eventually reuse an actor and cancel the future.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

justinvyu added 3 commits May 24, 2023 13:47
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@krfricke krfricke merged commit fa0d563 into ray-project:master May 25, 2023
@justinvyu justinvyu deleted the tune/cancel_pg_task branch May 25, 2023 17:12
@jjyao
Copy link
Collaborator

jjyao commented May 25, 2023

Question: Why isn't it garbage collected automatically?

Ray currently doesn't automatically cancel the task when the return object is GCed to support fire and forget.

scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
…g an actor (ray-project#35748)

This PR cancels `pg.ready()` tasks that are pending node assignment forever, in the case that trials are reusing actors. For Tune experiments with many trials, this clutters the user's dashboard with a bunch of these `pg.ready()` tasks that don't go away. The trials corresponding to these requests have already been assigned different actors, so there is no point keeping the futures around.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…g an actor (ray-project#35748)

This PR cancels `pg.ready()` tasks that are pending node assignment forever, in the case that trials are reusing actors. For Tune experiments with many trials, this clutters the user's dashboard with a bunch of these `pg.ready()` tasks that don't go away. The trials corresponding to these requests have already been assigned different actors, so there is no point keeping the futures around.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants