Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air/tune] Internal resource management 2 - Ray Tune to use new Ray AIR resource manager #30016

Merged
merged 94 commits into from
Dec 20, 2022

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Nov 4, 2022

Why are these changes needed?

Includes/depends on #30777

TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager.

Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings:

  • The tight coupling on the manager side (e.g. PG manager keeps track of trials) prevents re-usability
  • The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based)
  • It's hard to test independently. Tests for the resource management require a simulated tune setup.

To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central ray.air.execution.resources subpackage. The resource management has a simple API that works with ResourceRequests and AllocatedResources to manage requested and assigned resources, respectively. The actual resource management can then be anything - per default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in.

The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Sorry, something went wrong.

Kai Fricke added 17 commits October 3, 2022 10:23
commit 7175d9e68136db1cab1c3e1565fae0a02b9eff10
Author: Kai Fricke <kai@anyscale.com>
Date:   Mon Oct 3 10:13:19 2022 -0700

    Testing

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 2dc3bdee204c7d950b0f6c2603cb03f418e207eb
Author: Kai Fricke <kai@anyscale.com>
Date:   Mon Oct 3 08:35:51 2022 -0700

    Revert tune.py changes

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 4498111
Merge: eb6ede0 7167f9c
Author: Kai Fricke <kai@anyscale.com>
Date:   Mon Oct 3 08:34:14 2022 -0700

    Merge branch 'master' of https://github.com/ray-project/ray into air/execution/prototype

commit eb6ede0
Author: Kai Fricke <kai@anyscale.com>
Date:   Mon Oct 3 08:33:32 2022 -0700

    Move to experimental subpackage

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 8c81be4
Author: Kai Fricke <kai@anyscale.com>
Date:   Sun Oct 2 17:45:04 2022 +0100

    Recover trial

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 5289a87
Author: Kai Fricke <kai@anyscale.com>
Date:   Sun Oct 2 17:07:07 2022 +0100

    Fix tests, add step tests

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 50abcb2
Author: Kai Fricke <kai@anyscale.com>
Date:   Sun Oct 2 16:39:31 2022 +0100

    tune.run

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit acdea1a
Author: Kai Fricke <kai@anyscale.com>
Date:   Sun Oct 2 16:08:27 2022 +0100

    Legacy interface cont

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit b2080df
Author: Kai Fricke <kai@anyscale.com>
Date:   Fri Sep 30 18:00:08 2022 +0100

    Update resources, legacy wrapper

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 2768b58
Author: Kai Fricke <kai@anyscale.com>
Date:   Fri Sep 30 16:44:26 2022 +0100

    test numerical error

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 1e177c8
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 18:50:12 2022 +0100

    Fix HEBO test

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 5f6e75e
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 15:23:41 2022 +0100

    cont

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 7715595
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 14:53:26 2022 +0100

    Max pending trials

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 5638109
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 14:08:13 2022 +0100

    Overstep

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 9ee6455
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 14:07:01 2022 +0100

    Tune controller result handling

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit fcb8557
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 13:03:15 2022 +0100

    Tune controller resources management tests

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit ae757fe
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 12:12:48 2022 +0100

    Add docs to unit tests

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 10d1760
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 11:46:05 2022 +0100

    Delete execution_pull

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 99f7ddd
Merge: ccf8b02 f5143a9
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 29 11:45:49 2022 +0100

    Merge branch 'master' into air/execution/prototype

commit ccf8b02
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 27 16:07:31 2022 +0100

    PG manager tests

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 2e74eaa
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 27 15:32:09 2022 +0100

    PG manager

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit c4eee47
Merge: 01020dd d456aa2
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 27 11:43:59 2022 +0100

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit 01020dd
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 27 08:55:50 2022 +0100

    PG WIP

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit b1f3cd2
Author: Kai Fricke <kai@anyscale.com>
Date:   Mon Sep 26 15:07:45 2022 +0100

    Add fixed resource manager test

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 2a5d089
Merge: ddbb7ef 9f1ea30
Author: Kai Fricke <kai@anyscale.com>
Date:   Mon Sep 26 14:55:38 2022 +0100

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit ddbb7ef
Merge: b685a44 6530635
Author: Kai Fricke <kai@anyscale.com>
Date:   Thu Sep 22 17:07:02 2022 +0100

    Merge branch 'master' into air/execution/prototype

commit b685a44
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 20 13:04:32 2022 +0100

    K fold CV

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 7fbbeed
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 20 10:55:49 2022 +0100

    Resources

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 0173489
Merge: 66fe655 15883cd
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 20 10:46:25 2022 +0100

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit 66fe655
Merge: 8359ebd 805d5dc
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 20 09:54:19 2022 +0100

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit 8359ebd
Merge: 1e9b3f9 7280ef4
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 20 09:54:13 2022 +0100

    Merge branch 'master' into air/execution/prototype

commit 1e9b3f9
Author: Kai Fricke <kai@anyscale.com>
Date:   Fri Sep 9 19:11:50 2022 +0100

    num samples, benchmark

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 11331eb
Author: Kai Fricke <kai@anyscale.com>
Date:   Fri Sep 9 16:49:29 2022 +0100

    uuid

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit bdbff3b
Author: Kai Fricke <kai@anyscale.com>
Date:   Fri Sep 9 16:38:06 2022 +0100

    Simple split controller

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit a0b7815
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 6 14:36:22 2022 +0100

    Train example working with sync futures

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit aedd0dc
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 6 12:53:59 2022 +0100

    Make pausing work

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 8155508
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 6 12:47:18 2022 +0100

    Fix execution for tune

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 860ffe5
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 6 12:24:48 2022 +0100

    Push-based package

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit dac1637
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Sep 6 11:11:08 2022 +0100

    Separate pull-based package

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 15ddea1
Author: Kai Fricke <kai@anyscale.com>
Date:   Fri Sep 2 12:30:37 2022 +0100

    tune checkpointing works

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 1358832
Merge: b7c46d8 8692eb6
Author: Kai Fricke <kai@anyscale.com>
Date:   Fri Sep 2 11:37:56 2022 +0100

    Merge branch 'master' into air/execution/prototype

commit b7c46d8
Merge: 859cb49 ed29291
Author: Kai Fricke <kai@anyscale.com>
Date:   Wed Aug 31 14:36:18 2022 -0700

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit 859cb49
Author: Kai Fricke <kai@anyscale.com>
Date:   Wed Aug 31 14:19:21 2022 -0700

    train sequentially working

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 7498377
Author: Kai Fricke <kai@anyscale.com>
Date:   Wed Aug 31 11:37:37 2022 -0700

    train wip

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit f95b103
Author: Kai Fricke <kai@anyscale.com>
Date:   Wed Aug 31 10:50:13 2022 -0700

    Update API

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 0effb2a
Author: Kai Fricke <kai@anyscale.com>
Date:   Wed Aug 31 08:43:58 2022 -0700

    typed futures

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 68e90dc
Author: Kai Fricke <kai@anyscale.com>
Date:   Wed Aug 31 07:53:06 2022 -0700

    tune update

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 515228e
Author: Kai Fricke <kai@anyscale.com>
Date:   Wed Aug 31 06:37:39 2022 -0700

    tune update

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit ffef078
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Aug 30 17:24:32 2022 -0700

    tune.run with progress report

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 5d7ace0
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Aug 30 17:17:34 2022 -0700

    tune example running

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 3f44ac3
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Aug 30 15:48:38 2022 -0700

    resource manager, tune example

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 5160de5
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Aug 30 14:20:30 2022 -0700

    tune controller + result

    Signed-off-by: Kai Fricke <kai@anyscale.com>

commit 3bd2184
Author: Kai Fricke <kai@anyscale.com>
Date:   Tue Aug 30 12:51:12 2022 -0700

    initial structure

    Signed-off-by: Kai Fricke <kai@anyscale.com>

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts:
#	python/ray/tune/tests/test_sample.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Kai Fricke added 12 commits November 4, 2022 11:14
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
krfricke and others added 6 commits December 20, 2022 09:04

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
krfricke added a commit that referenced this pull request Dec 20, 2022

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
…implementation (#30777)

Prerequisite to #30016

This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in #30016.

Specifically, this PR
- Adds the main resource manager abstractions
- Renames (and moves) PlacementGroupFactory to ResourceRequest
- Adds implementations and tests for a placement group based manager and a budget based manager

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Kai Fricke added 3 commits December 20, 2022 11:17
…manager

# Conflicts:
#	python/ray/tune/execution/placement_groups.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke merged commit 1510fb2 into ray-project:master Dec 20, 2022
@krfricke krfricke deleted the tune/refactor-pg-manager branch December 20, 2022 15:54
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…implementation (#30777)

Prerequisite to #30016

This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in #30016.

Specifically, this PR
- Adds the main resource manager abstractions
- Renames (and moves) PlacementGroupFactory to ResourceRequest
- Adds implementations and tests for a placement group based manager and a budget based manager

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…IR resource manager (#30016)

Includes/depends on #30777

TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager.

Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings:
- The tight coupling on the manager side (e.g. PG manager keeps track of trials) prevents re-usability
- The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based)
- It's hard to test independently. Tests for the resource management require a simulated tune setup.

To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central `ray.air.execution.resources` subpackage. The resource management has a simple API that works with `ResourceRequest`s and `AllocatedResources` to manage requested and assigned resources, respectively. The actual resource management can then be anything - per default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in.

The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager.

Signed-off-by: Kai Fricke <kai@anyscale.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…implementation (ray-project#30777)

Prerequisite to ray-project#30016

This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in ray-project#30016.

Specifically, this PR
- Adds the main resource manager abstractions
- Renames (and moves) PlacementGroupFactory to ResourceRequest
- Adds implementations and tests for a placement group based manager and a budget based manager

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…IR resource manager (ray-project#30016)

Includes/depends on ray-project#30777

TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager.

Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings:
- The tight coupling on the manager side (e.g. PG manager keeps track of trials) prevents re-usability
- The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based)
- It's hard to test independently. Tests for the resource management require a simulated tune setup.

To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central `ray.air.execution.resources` subpackage. The resource management has a simple API that works with `ResourceRequest`s and `AllocatedResources` to manage requested and assigned resources, respectively. The actual resource management can then be anything - per default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in.

The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
krfricke added a commit that referenced this pull request Jan 31, 2023

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
…r lookup (#32087)

In #30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This lead to a performance regression first identified in #31337.

After manual profiling, the regression seems to come from `RayTrialExecutor._count_staged_resources`. We have 1000 staged trials, and this function is called on every step, executing a linear scan through all trials.

This PR fixes this performance bottleneck by keeping state of the resource counter instead of dynamically recreating it every time. This is simple as we can just add/subtract the resources whenever we add/remove from the `RayTrialExecutor._staged_trials` set.

Manual testing confirmed this improves the runtime of `tune_scalability_result_throughput_cluster` from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor.

Signed-off-by: Kai Fricke <kai@anyscale.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…r lookup (ray-project#32087)

In ray-project#30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This lead to a performance regression first identified in ray-project#31337.

After manual profiling, the regression seems to come from `RayTrialExecutor._count_staged_resources`. We have 1000 staged trials, and this function is called on every step, executing a linear scan through all trials.

This PR fixes this performance bottleneck by keeping state of the resource counter instead of dynamically recreating it every time. This is simple as we can just add/subtract the resources whenever we add/remove from the `RayTrialExecutor._staged_trials` set.

Manual testing confirmed this improves the runtime of `tune_scalability_result_throughput_cluster` from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants