-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[air/tune] Internal resource management 2 - Ray Tune to use new Ray AIR resource manager #30016
Merged
krfricke
merged 94 commits into
ray-project:master
from
krfricke:tune/refactor-pg-manager
Dec 20, 2022
Merged
[air/tune] Internal resource management 2 - Ray Tune to use new Ray AIR resource manager #30016
krfricke
merged 94 commits into
ray-project:master
from
krfricke:tune/refactor-pg-manager
Dec 20, 2022
+1,064
−993
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
commit 7175d9e68136db1cab1c3e1565fae0a02b9eff10 Author: Kai Fricke <kai@anyscale.com> Date: Mon Oct 3 10:13:19 2022 -0700 Testing Signed-off-by: Kai Fricke <kai@anyscale.com> commit 2dc3bdee204c7d950b0f6c2603cb03f418e207eb Author: Kai Fricke <kai@anyscale.com> Date: Mon Oct 3 08:35:51 2022 -0700 Revert tune.py changes Signed-off-by: Kai Fricke <kai@anyscale.com> commit 4498111 Merge: eb6ede0 7167f9c Author: Kai Fricke <kai@anyscale.com> Date: Mon Oct 3 08:34:14 2022 -0700 Merge branch 'master' of https://github.com/ray-project/ray into air/execution/prototype commit eb6ede0 Author: Kai Fricke <kai@anyscale.com> Date: Mon Oct 3 08:33:32 2022 -0700 Move to experimental subpackage Signed-off-by: Kai Fricke <kai@anyscale.com> commit 8c81be4 Author: Kai Fricke <kai@anyscale.com> Date: Sun Oct 2 17:45:04 2022 +0100 Recover trial Signed-off-by: Kai Fricke <kai@anyscale.com> commit 5289a87 Author: Kai Fricke <kai@anyscale.com> Date: Sun Oct 2 17:07:07 2022 +0100 Fix tests, add step tests Signed-off-by: Kai Fricke <kai@anyscale.com> commit 50abcb2 Author: Kai Fricke <kai@anyscale.com> Date: Sun Oct 2 16:39:31 2022 +0100 tune.run Signed-off-by: Kai Fricke <kai@anyscale.com> commit acdea1a Author: Kai Fricke <kai@anyscale.com> Date: Sun Oct 2 16:08:27 2022 +0100 Legacy interface cont Signed-off-by: Kai Fricke <kai@anyscale.com> commit b2080df Author: Kai Fricke <kai@anyscale.com> Date: Fri Sep 30 18:00:08 2022 +0100 Update resources, legacy wrapper Signed-off-by: Kai Fricke <kai@anyscale.com> commit 2768b58 Author: Kai Fricke <kai@anyscale.com> Date: Fri Sep 30 16:44:26 2022 +0100 test numerical error Signed-off-by: Kai Fricke <kai@anyscale.com> commit 1e177c8 Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 18:50:12 2022 +0100 Fix HEBO test Signed-off-by: Kai Fricke <kai@anyscale.com> commit 5f6e75e Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 15:23:41 2022 +0100 cont Signed-off-by: Kai Fricke <kai@anyscale.com> commit 7715595 Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 14:53:26 2022 +0100 Max pending trials Signed-off-by: Kai Fricke <kai@anyscale.com> commit 5638109 Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 14:08:13 2022 +0100 Overstep Signed-off-by: Kai Fricke <kai@anyscale.com> commit 9ee6455 Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 14:07:01 2022 +0100 Tune controller result handling Signed-off-by: Kai Fricke <kai@anyscale.com> commit fcb8557 Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 13:03:15 2022 +0100 Tune controller resources management tests Signed-off-by: Kai Fricke <kai@anyscale.com> commit ae757fe Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 12:12:48 2022 +0100 Add docs to unit tests Signed-off-by: Kai Fricke <kai@anyscale.com> commit 10d1760 Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 11:46:05 2022 +0100 Delete execution_pull Signed-off-by: Kai Fricke <kai@anyscale.com> commit 99f7ddd Merge: ccf8b02 f5143a9 Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 29 11:45:49 2022 +0100 Merge branch 'master' into air/execution/prototype commit ccf8b02 Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 27 16:07:31 2022 +0100 PG manager tests Signed-off-by: Kai Fricke <kai@anyscale.com> commit 2e74eaa Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 27 15:32:09 2022 +0100 PG manager Signed-off-by: Kai Fricke <kai@anyscale.com> commit c4eee47 Merge: 01020dd d456aa2 Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 27 11:43:59 2022 +0100 Merge remote-tracking branch 'upstream/master' into air/execution/prototype commit 01020dd Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 27 08:55:50 2022 +0100 PG WIP Signed-off-by: Kai Fricke <kai@anyscale.com> commit b1f3cd2 Author: Kai Fricke <kai@anyscale.com> Date: Mon Sep 26 15:07:45 2022 +0100 Add fixed resource manager test Signed-off-by: Kai Fricke <kai@anyscale.com> commit 2a5d089 Merge: ddbb7ef 9f1ea30 Author: Kai Fricke <kai@anyscale.com> Date: Mon Sep 26 14:55:38 2022 +0100 Merge remote-tracking branch 'upstream/master' into air/execution/prototype commit ddbb7ef Merge: b685a44 6530635 Author: Kai Fricke <kai@anyscale.com> Date: Thu Sep 22 17:07:02 2022 +0100 Merge branch 'master' into air/execution/prototype commit b685a44 Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 20 13:04:32 2022 +0100 K fold CV Signed-off-by: Kai Fricke <kai@anyscale.com> commit 7fbbeed Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 20 10:55:49 2022 +0100 Resources Signed-off-by: Kai Fricke <kai@anyscale.com> commit 0173489 Merge: 66fe655 15883cd Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 20 10:46:25 2022 +0100 Merge remote-tracking branch 'upstream/master' into air/execution/prototype commit 66fe655 Merge: 8359ebd 805d5dc Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 20 09:54:19 2022 +0100 Merge remote-tracking branch 'upstream/master' into air/execution/prototype commit 8359ebd Merge: 1e9b3f9 7280ef4 Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 20 09:54:13 2022 +0100 Merge branch 'master' into air/execution/prototype commit 1e9b3f9 Author: Kai Fricke <kai@anyscale.com> Date: Fri Sep 9 19:11:50 2022 +0100 num samples, benchmark Signed-off-by: Kai Fricke <kai@anyscale.com> commit 11331eb Author: Kai Fricke <kai@anyscale.com> Date: Fri Sep 9 16:49:29 2022 +0100 uuid Signed-off-by: Kai Fricke <kai@anyscale.com> commit bdbff3b Author: Kai Fricke <kai@anyscale.com> Date: Fri Sep 9 16:38:06 2022 +0100 Simple split controller Signed-off-by: Kai Fricke <kai@anyscale.com> commit a0b7815 Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 6 14:36:22 2022 +0100 Train example working with sync futures Signed-off-by: Kai Fricke <kai@anyscale.com> commit aedd0dc Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 6 12:53:59 2022 +0100 Make pausing work Signed-off-by: Kai Fricke <kai@anyscale.com> commit 8155508 Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 6 12:47:18 2022 +0100 Fix execution for tune Signed-off-by: Kai Fricke <kai@anyscale.com> commit 860ffe5 Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 6 12:24:48 2022 +0100 Push-based package Signed-off-by: Kai Fricke <kai@anyscale.com> commit dac1637 Author: Kai Fricke <kai@anyscale.com> Date: Tue Sep 6 11:11:08 2022 +0100 Separate pull-based package Signed-off-by: Kai Fricke <kai@anyscale.com> commit 15ddea1 Author: Kai Fricke <kai@anyscale.com> Date: Fri Sep 2 12:30:37 2022 +0100 tune checkpointing works Signed-off-by: Kai Fricke <kai@anyscale.com> commit 1358832 Merge: b7c46d8 8692eb6 Author: Kai Fricke <kai@anyscale.com> Date: Fri Sep 2 11:37:56 2022 +0100 Merge branch 'master' into air/execution/prototype commit b7c46d8 Merge: 859cb49 ed29291 Author: Kai Fricke <kai@anyscale.com> Date: Wed Aug 31 14:36:18 2022 -0700 Merge remote-tracking branch 'upstream/master' into air/execution/prototype commit 859cb49 Author: Kai Fricke <kai@anyscale.com> Date: Wed Aug 31 14:19:21 2022 -0700 train sequentially working Signed-off-by: Kai Fricke <kai@anyscale.com> commit 7498377 Author: Kai Fricke <kai@anyscale.com> Date: Wed Aug 31 11:37:37 2022 -0700 train wip Signed-off-by: Kai Fricke <kai@anyscale.com> commit f95b103 Author: Kai Fricke <kai@anyscale.com> Date: Wed Aug 31 10:50:13 2022 -0700 Update API Signed-off-by: Kai Fricke <kai@anyscale.com> commit 0effb2a Author: Kai Fricke <kai@anyscale.com> Date: Wed Aug 31 08:43:58 2022 -0700 typed futures Signed-off-by: Kai Fricke <kai@anyscale.com> commit 68e90dc Author: Kai Fricke <kai@anyscale.com> Date: Wed Aug 31 07:53:06 2022 -0700 tune update Signed-off-by: Kai Fricke <kai@anyscale.com> commit 515228e Author: Kai Fricke <kai@anyscale.com> Date: Wed Aug 31 06:37:39 2022 -0700 tune update Signed-off-by: Kai Fricke <kai@anyscale.com> commit ffef078 Author: Kai Fricke <kai@anyscale.com> Date: Tue Aug 30 17:24:32 2022 -0700 tune.run with progress report Signed-off-by: Kai Fricke <kai@anyscale.com> commit 5d7ace0 Author: Kai Fricke <kai@anyscale.com> Date: Tue Aug 30 17:17:34 2022 -0700 tune example running Signed-off-by: Kai Fricke <kai@anyscale.com> commit 3f44ac3 Author: Kai Fricke <kai@anyscale.com> Date: Tue Aug 30 15:48:38 2022 -0700 resource manager, tune example Signed-off-by: Kai Fricke <kai@anyscale.com> commit 5160de5 Author: Kai Fricke <kai@anyscale.com> Date: Tue Aug 30 14:20:30 2022 -0700 tune controller + result Signed-off-by: Kai Fricke <kai@anyscale.com> commit 3bd2184 Author: Kai Fricke <kai@anyscale.com> Date: Tue Aug 30 12:51:12 2022 -0700 initial structure Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts: # python/ray/tune/tune.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts: # python/ray/tune/tests/test_sample.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
7 tasks
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
krfricke
added a commit
that referenced
this pull request
Dec 20, 2022
…implementation (#30777) Prerequisite to #30016 This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in #30016. Specifically, this PR - Adds the main resource manager abstractions - Renames (and moves) PlacementGroupFactory to ResourceRequest - Adds implementations and tests for a placement group based manager and a budget based manager Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
…manager # Conflicts: # python/ray/tune/execution/placement_groups.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
AmeerHajAli
pushed a commit
that referenced
this pull request
Jan 12, 2023
…implementation (#30777) Prerequisite to #30016 This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in #30016. Specifically, this PR - Adds the main resource manager abstractions - Renames (and moves) PlacementGroupFactory to ResourceRequest - Adds implementations and tests for a placement group based manager and a budget based manager Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
AmeerHajAli
pushed a commit
that referenced
this pull request
Jan 12, 2023
…IR resource manager (#30016) Includes/depends on #30777 TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager. Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings: - The tight coupling on the manager side (e.g. PG manager keeps track of trials) prevents re-usability - The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based) - It's hard to test independently. Tests for the resource management require a simulated tune setup. To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central `ray.air.execution.resources` subpackage. The resource management has a simple API that works with `ResourceRequest`s and `AllocatedResources` to manage requested and assigned resources, respectively. The actual resource management can then be anything - per default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in. The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager. Signed-off-by: Kai Fricke <kai@anyscale.com>
tamohannes
pushed a commit
to ju2ez/ray
that referenced
this pull request
Jan 25, 2023
…implementation (ray-project#30777) Prerequisite to ray-project#30016 This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in ray-project#30016. Specifically, this PR - Adds the main resource manager abstractions - Renames (and moves) PlacementGroupFactory to ResourceRequest - Adds implementations and tests for a placement group based manager and a budget based manager Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
tamohannes
pushed a commit
to ju2ez/ray
that referenced
this pull request
Jan 25, 2023
…IR resource manager (ray-project#30016) Includes/depends on ray-project#30777 TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager. Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings: - The tight coupling on the manager side (e.g. PG manager keeps track of trials) prevents re-usability - The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based) - It's hard to test independently. Tests for the resource management require a simulated tune setup. To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central `ray.air.execution.resources` subpackage. The resource management has a simple API that works with `ResourceRequest`s and `AllocatedResources` to manage requested and assigned resources, respectively. The actual resource management can then be anything - per default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in. The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
This was referenced Jan 30, 2023
krfricke
added a commit
that referenced
this pull request
Jan 31, 2023
…r lookup (#32087) In #30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This lead to a performance regression first identified in #31337. After manual profiling, the regression seems to come from `RayTrialExecutor._count_staged_resources`. We have 1000 staged trials, and this function is called on every step, executing a linear scan through all trials. This PR fixes this performance bottleneck by keeping state of the resource counter instead of dynamically recreating it every time. This is simple as we can just add/subtract the resources whenever we add/remove from the `RayTrialExecutor._staged_trials` set. Manual testing confirmed this improves the runtime of `tune_scalability_result_throughput_cluster` from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor. Signed-off-by: Kai Fricke <kai@anyscale.com>
edoakes
pushed a commit
to edoakes/ray
that referenced
this pull request
Mar 22, 2023
…r lookup (ray-project#32087) In ray-project#30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This lead to a performance regression first identified in ray-project#31337. After manual profiling, the regression seems to come from `RayTrialExecutor._count_staged_resources`. We have 1000 staged trials, and this function is called on every step, executing a linear scan through all trials. This PR fixes this performance bottleneck by keeping state of the resource counter instead of dynamically recreating it every time. This is simple as we can just add/subtract the resources whenever we add/remove from the `RayTrialExecutor._staged_trials` set. Manual testing confirmed this improves the runtime of `tune_scalability_result_throughput_cluster` from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Why are these changes needed?
Includes/depends on #30777
TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager.
Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings:
To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central
ray.air.execution.resources
subpackage. The resource management has a simple API that works withResourceRequest
s andAllocatedResources
to manage requested and assigned resources, respectively. The actual resource management can then be anything - per default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in.The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.