[AIR] Add node rank and local world size info to session #29919
Conversation
PR LGTM, leaving it to Amog to double-check the args added to session and the backend executor (e.g., do we have any places that use *args, **kwargs that could be impacted by this?)
@@ -53,6 +56,20 @@ def ray_4_node_4_cpu():
    cluster.shutdown()


@pytest.fixture
def ray_2_node_2_cpu():
Can these be moved to conftest.py? All you need is to add deps = [":train_lib", ":conftest"] to the corresponding py_test target in https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/train/BUILD
This fixture is only needed for the two specific tests I added.
If we move this to conftest.py, then shouldn't we also move the other cluster setup fixtures to the conftest.py file?
In the past we had a bad pattern of redefining a bunch of fixtures in each test file; some of those are already being addressed. If this fixture is only used in this file, I'm fine with leaving it as-is, but please keep this context in mind when you come across other fixtures like the ones above. We might still have a few hanging around that are defined more than once.
Fixtures have been moved to conftest.py, and conftest is added as a BUILD dependency.
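For reference, a rough sketch of what the moved fixture could look like in python/ray/train/tests/conftest.py. The cluster shape (2 nodes, 2 CPUs each) is inferred from the fixture name and the neighboring ray_4_node_4_cpu fixture, so treat the exact parameters as assumptions:

```python
# Hypothetical conftest.py sketch; mirrors the existing ray_4_node_4_cpu pattern.
import pytest

import ray
from ray.cluster_utils import Cluster


@pytest.fixture
def ray_2_node_2_cpu():
    # Start a local multi-node cluster: 2 nodes with 2 CPUs each.
    cluster = Cluster()
    for _ in range(2):
        cluster.add_node(num_cpus=2)

    ray.init(address=cluster.address)

    yield

    ray.shutdown()
    cluster.shutdown()
```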
        hostname=str(i % 2),
        gpu_ids=[str(i % 2)],
    )
)
Nice, this is very clever!
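For context, here is a minimal, self-contained sketch (plain Python, not Ray's actual internals) of why alternating fake hostnames works: workers that report the same hostname are treated as living on the same node, so node rank and local world size can be derived by grouping on hostname. The function name and the grouping order are illustrative assumptions.

```python
# Illustrative only: derive per-worker node_rank / local_rank / local_world_size
# by grouping workers on their reported hostname, mirroring hostname=str(i % 2).
from collections import defaultdict


def compute_node_info(hostnames):
    """Map world_rank -> (node_rank, local_rank, local_world_size)."""
    node_to_workers = defaultdict(list)
    for world_rank, host in enumerate(hostnames):
        node_to_workers[host].append(world_rank)

    # Assign node ranks in order of first appearance of each hostname.
    node_rank_of_host = {host: i for i, host in enumerate(node_to_workers)}

    info = {}
    for host, workers in node_to_workers.items():
        for local_rank, world_rank in enumerate(workers):
            info[world_rank] = (node_rank_of_host[host], local_rank, len(workers))
    return info


# Four workers alternating between two fake hostnames "0" and "1":
# workers 0 and 2 share one node, workers 1 and 3 share the other.
print(compute_node_info([str(i % 2) for i in range(4)]))
# {0: (0, 0, 2), 2: (0, 1, 2), 1: (1, 0, 2), 3: (1, 1, 2)}
```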
Thanks @ilee300a, lgtm! Just some typos in the error message. Let's fix those and please ping again once CI passes!
…#29919) Pipeline node_rank and local_world_size information in ray training so that they are accessible via session.get_node_rank() and session.get_local_world_size() in the training loop. Signed-off-by: ilee300a <ilee300@anyscale.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: ilee300a ilee300@anyscale.com
Pipeline node_rank and local_world_size information in Ray training so that they are accessible via session.get_node_rank() and session.get_local_world_size() in the training loop.

Note: Need to add tests.
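A minimal usage sketch of the new accessors; the TorchTrainer and the num_workers value below are illustrative choices for the example, not part of this PR:

```python
# Illustrative only: shows where the new session accessors are called inside
# a Ray Train training loop.
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # New in this PR: rank of the node this worker runs on, and the number
    # of workers co-located on that node.
    node_rank = session.get_node_rank()
    local_world_size = session.get_local_world_size()
    print(f"node_rank={node_rank}, local_world_size={local_world_size}")


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4),
)
trainer.fit()
```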
Related PRs:
#29812 -- this information is needed for resumed training and some other Mosaic library functionality.
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.