[AIR] Add node rank and local world size info to session #29919

Merged: 21 commits, Nov 3, 2022

Contributor

@ilee300a ilee300a commented Nov 1, 2022

Signed-off-by: ilee300a <ilee300@anyscale.com>

Pass node_rank and local_world_size information through Ray Train so that it is accessible via session.get_node_rank() and session.get_local_world_size() in the training loop.
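
For readers unfamiliar with these terms, here is a minimal sketch of how node rank and local world size relate to a worker's world rank. It is not the code from this PR: `WorkerRanks`, `assign_ranks`, and the rule that nodes are numbered in order of first appearance are assumptions made for illustration, and hostnames stand in for whatever node identifier the backend executor actually uses.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class WorkerRanks(NamedTuple):
    """Hypothetical container for the rank info one worker would see."""
    world_rank: int
    local_rank: int
    node_rank: int
    local_world_size: int


def assign_ranks(hostnames: List[str]) -> List[WorkerRanks]:
    """Derive per-worker rank info from hostnames listed in world-rank order."""
    workers_per_host = Counter(hostnames)   # hostname -> local world size
    node_ranks: Dict[str, int] = {}         # hostname -> node rank
    next_local_rank: Dict[str, int] = {}    # hostname -> next unused local rank

    result = []
    for world_rank, host in enumerate(hostnames):
        # Assumption: nodes are numbered in order of first appearance.
        node_rank = node_ranks.setdefault(host, len(node_ranks))
        local_rank = next_local_rank.get(host, 0)
        next_local_rank[host] = local_rank + 1
        result.append(
            WorkerRanks(world_rank, local_rank, node_rank, workers_per_host[host])
        )
    return result


# Three workers: two placed on "host-a", one on "host-b".
for ranks in assign_ranks(["host-a", "host-b", "host-a"]):
    print(ranks)
```

With this layout, the two workers on "host-a" would see a local world size of 2 and node rank 0, while the single worker on "host-b" would see a local world size of 1 and node rank 1.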

Note: tests still need to be added.

Related PRs:
#29812: this information is needed for resumed training and some other Mosaic library functionality.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Member

@jiaodong jiaodong left a comment

PR LGTM. Leaving it to Amog to double-check the args added to session and the backend executor (e.g., are there any places where we use *args, **kwargs that could be affected by this?)
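
As a sketch of the concern raised here (hypothetical function names, not the actual session or backend executor signatures): when new parameters are inserted into the middle of a signature, a caller that forwards positional arguments through `*args` can end up binding old values to the new parameters without any error.

```python
def init_session_new(world_rank, local_rank, node_rank=0, world_size=1, local_world_size=1):
    """Hypothetical signature after inserting node_rank before world_size.

    The assumed old signature was (world_rank, local_rank, world_size).
    """
    return {
        "world_rank": world_rank,
        "local_rank": local_rank,
        "node_rank": node_rank,
        "world_size": world_size,
        "local_world_size": local_world_size,
    }


def start_worker(init_fn, *args, **kwargs):
    """Hypothetical wrapper that forwards its arguments unchanged."""
    return init_fn(*args, **kwargs)


# A positional call written against the old signature (world_size=4) ...
stale = start_worker(init_session_new, 1, 1, 4)
# ... now binds 4 to node_rank and leaves world_size at its default.
print(stale["node_rank"], stale["world_size"])  # 4 1

# Keyword arguments keep each value attached to the intended parameter.
safe = start_worker(
    init_session_new,
    world_rank=1,
    local_rank=1,
    node_rank=0,
    world_size=4,
    local_world_size=2,
)
print(safe["node_rank"], safe["world_size"])  # 0 4
```

Passing the values by keyword, as in the last call, avoids this class of breakage.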

@@ -53,6 +56,20 @@ def ray_4_node_4_cpu():
    cluster.shutdown()


@pytest.fixture
def ray_2_node_2_cpu():
Member

Can these be moved to conftest.py? All you need to do is add deps = [":train_lib", ":conftest"] to the corresponding pytest build rule: https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/train/BUILD

Contributor Author

This fixture is only needed for the two specific tests I have added.
If we move this to conftest.py then shouldn't we also move other cluster setup fixtures to the conftest.py file?

Member

In the past we had a bad pattern of redefining a bunch of fixtures in each test file; some of those are already being addressed. If this fixture is only used in this file, I'm fine leaving it as-is, but please keep this context in mind when you come across other fixtures like the ones above. We might still have a few hanging around that are defined more than once.

Contributor Author

The fixtures have been moved to conftest.py, and conftest has been added as a BUILD dependency.

python/ray/train/_internal/backend_executor.py (outdated, resolved)
python/ray/train/tests/test_backend.py (outdated, resolved)
python/ray/train/tests/test_backend.py (outdated, resolved)
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
python/ray/air/session.py (outdated, resolved)
python/ray/air/session.py (outdated, resolved)
python/ray/train/_internal/backend_executor.py (outdated, resolved)
python/ray/air/session.py (outdated, resolved)
Signed-off-by: ilee300a <ilee300@anyscale.com>
python/ray/train/tests/test_backend.py (outdated, resolved)
hostname=str(i % 2),
gpu_ids=[str(i % 2)],
)
)
Contributor

Nice, this is very clever!

python/ray/train/tests/test_backend.py (outdated, resolved)
Signed-off-by: ilee300a <ilee300@anyscale.com>
python/ray/air/session.py (outdated, resolved)
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
@ilee300a ilee300a requested review from jiaodong and amogkam and removed request for jiaodong November 2, 2022 21:12
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
@amogkam amogkam changed the title from "[AIR] pipeline node rank and local world size info to session" to "[AIR] Add node rank and local world size info to session" Nov 2, 2022
@amogkam amogkam self-assigned this Nov 2, 2022
python/ray/train/tests/test_backend.py (outdated, resolved)
python/ray/train/tests/test_backend.py (outdated, resolved)
python/ray/train/tests/test_backend.py (outdated, resolved)
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Contributor

@amogkam amogkam left a comment

Thanks @ilee300a, lgtm! Just some typos in the error message. Let's fix those and please ping again once CI passes!

python/ray/air/session.py (outdated, resolved)
python/ray/air/session.py (outdated, resolved)
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
@amogkam amogkam merged commit 79a3bb3 into ray-project:master Nov 3, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…#29919)

Pipeline node_rank and local_world_size information in ray training so that they are accessible via session.get_node_rank() and session.get_local_world_size() in the training loop.

Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>