[air] Fix behavior of multi-node checkpointing without an external `storage_path` to hard-fail #37543
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -93,7 +98,8 @@
 "`RunConfig(storage_path='/mnt/path/to/nfs_storage')`\n"
 "See this Github issue for more details on transitioning to cloud storage/NFS "
 "as well as an explanation on why this functionality is "
-"being removed: https://github.com/ray-project/ray/issues/37177\n\n"
+"being removed: https://github.com/ray-project/ray/issues/37177\n"
+"If you are already using NFS, you can ignore this warning message.\n\n"
Can this ever happen? If they are already using storage_path, they shouldn't see this message, right?
I do log a warning message when running with multiple nodes without cloud storage, and it'll show up even if NFS is used. This is because I don't have a good way of checking if storage_path is a local directory or NFS.
Is that ok? This warning message will only show up once, and it basically just serves as a deprecation reminder.
Sounds good to me :)
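The trade-off in this thread can be sketched in a few lines of Python. This is a minimal illustration, not the code from the PR: `has_uri_scheme`, `maybe_warn_once`, and the `_already_warned` flag are hypothetical names, and the warning text is shortened. It shows why an NFS mount still triggers the reminder: a mounted path has no URI scheme, so it cannot be told apart from a node-local directory by inspecting the string.

```python
import warnings
from typing import Optional
from urllib.parse import urlparse

# Hypothetical module-level flag so the reminder is emitted at most once.
_already_warned = False


def has_uri_scheme(storage_path: Optional[str]) -> bool:
    """Return True if the path carries a URI scheme such as s3:// or gs://."""
    if not storage_path:
        return False
    scheme = urlparse(storage_path).scheme
    # A one-letter "scheme" is most likely a Windows drive letter (C:\...).
    return len(scheme) > 1


def maybe_warn_once(storage_path: Optional[str], num_nodes: int) -> bool:
    """Emit the reminder once for multi-node runs without a URI storage path.

    Returns True if this call emitted the warning.
    """
    global _already_warned
    if _already_warned or num_nodes <= 1 or has_uri_scheme(storage_path):
        return False
    _already_warned = True
    warnings.warn(
        "Multi-node run without cloud storage detected. "
        "If you are already using NFS, you can ignore this warning message."
    )
    return True
```

Under these assumptions, `maybe_warn_once("/mnt/path/to/nfs_storage", 2)` warns on the first call only, while a path such as `s3://bucket/experiments` never warns.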
@justinvyu This is great! Can you also create a cherry-pick PR into the 2.6.0 release branch? :)
…torage_path` to hard-fail (ray-project#37543) Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…torage_path` to hard-fail (ray-project#37543) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>
…torage_path` to hard-fail (ray-project#37543) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
…torage_path` to hard-fail (ray-project#37543) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
This PR builds on #37142 to fix the behavior of running a multi-node experiment without NFS/cloud storage. The correct behavior is to fail, but previously Tune caught the error and only logged it.
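The difference between the two behaviors can be sketched as follows. This is an illustrative sketch under stated assumptions, not Tune's internals: `MissingExternalStorageError`, `check_storage`, `run_log_only`, and `run_hard_fail` are hypothetical names, and the check is simplified to "multi-node with no storage path at all".

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class MissingExternalStorageError(RuntimeError):
    """Hypothetical error: multi-node run with no NFS/cloud storage_path."""


def check_storage(storage_path: Optional[str], num_nodes: int) -> None:
    # Hypothetical check: a multi-node run needs storage every node can reach.
    if num_nodes > 1 and storage_path is None:
        raise MissingExternalStorageError(
            "Multi-node checkpointing requires an external storage_path "
            "(NFS or cloud storage)."
        )


def run_log_only(storage_path: Optional[str], num_nodes: int) -> bool:
    """Earlier behavior as described: the error is caught and only logged."""
    try:
        check_storage(storage_path, num_nodes)
    except MissingExternalStorageError as exc:
        logger.error("Checkpoint handling failed: %s", exc)
        return False  # the experiment keeps going despite the failure
    return True


def run_hard_fail(storage_path: Optional[str], num_nodes: int) -> bool:
    """Intended behavior: let the error propagate so the run fails."""
    check_storage(storage_path, num_nodes)
    return True
```

In the log-only variant the caller has to notice an error line in the logs, whereas the hard-fail variant stops the run at the point where the misconfiguration is detected.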
Testing
Manual testing
Verified outputs/errors are as expected for:
Automated testing of all combinations
This PR was tested with this script on a cluster
Related issue number
Checks
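I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
I've run `scripts/format.sh` to lint the changes in this PR.
I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.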