[Tune] Cancel pg.ready() task for pending trials that end up reusing an actor #35748
Why are these changes needed?
This PR cancels pg.ready() tasks that would otherwise stay pending node assignment forever when trials reuse actors. For Tune experiments with many trials, these pg.ready() tasks clutter the user's dashboard and never go away. The trials corresponding to these requests have already been assigned different actors, so there is no point keeping the futures around.
Context
Currently, the PlacementGroupResourceManager uses the pg.ready() future to determine when a placement group is ready for an actor to be created with it.
Let's say we're on a cluster with 10 CPUs, each trial needs 2 CPUs, and we want to run 100 trials. Tune will schedule 5 actors, and it will also keep some number of pending trials that each try to allocate a placement group (I believe the default is 16 pending in this case; see the TUNE_MAX_PENDING_TRIALS_PG env var).
In the beginning, the state will be: 5 running actors, plus the pending trials, each with an outstanding pg.ready() future.
Then, when the first trial finishes, one of the 16 pending trials will end up reusing the actor of the finished trial.
The resources will not be relinquished from that actor, so the new trial's pg.ready() will never finish, and we've also lost the reference.
@jjyao Question: Why isn't it garbage collected automatically? I think it's because pg.ready() calls some global function that always has at least 1 reference: bundle_reservation_check_func.
So, the pending futures will just keep accumulating as trials keep reusing actors. This is what leads to the cluttered dashboard described above.
After this change, there will only ever be ~16 pending futures, which are just the PENDING trials that will eventually reuse an actor and cancel the future.
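The difference between the two behaviours can be sketched with a toy model in plain Python. This is not Ray or Tune code: the simulate function, its parameters, and the set of "outstanding" futures are all hypothetical stand-ins, and the model assumes that every pending trial eventually starts by reusing the actor of a finished trial. The numbers are the ones from the example above (10 CPUs, 2 CPUs per trial, 100 trials, 16 pending).

```python
from collections import deque


def simulate(cancel_on_reuse, num_trials=100, cluster_cpus=10,
             cpus_per_trial=2, max_pending=16):
    """Toy model only: returns (peak, final) counts of outstanding futures.

    Assumption: the first trials start on fresh actors and their futures
    resolve; every later trial starts by reusing a finished trial's actor.
    """
    running_slots = cluster_cpus // cpus_per_trial  # 5 in the example
    outstanding = set()   # trial ids whose toy "ready" future is unresolved
    pending = deque()     # trials waiting for resources
    next_trial = running_slots
    peak = 0

    while True:
        # Keep the pending queue topped up; each pending trial requests a
        # placement group and therefore holds one outstanding future.
        while len(pending) < max_pending and next_trial < num_trials:
            pending.append(next_trial)
            outstanding.add(next_trial)
            next_trial += 1
        peak = max(peak, len(outstanding))
        if not pending:
            break
        # A running trial finishes and the oldest pending trial reuses its
        # actor, so that trial's own future can never resolve.
        trial = pending.popleft()
        if cancel_on_reuse:
            outstanding.discard(trial)  # models cancelling the future

    return peak, len(outstanding)


print(simulate(cancel_on_reuse=False))  # (95, 95)
print(simulate(cancel_on_reuse=True))   # (16, 0)
```

In this model the uncancelled case ends with one leftover future per reused actor, while the cancelling case never holds more than the pending-trial limit.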
Related issue number
Checks
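I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.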