[Tune] Resuming to a different node IP than the original one doesn't work #28468
Comments
Confirmed this is an issue. The issue is actually in the post-result syncing: The next checkpoint will be saved before the next result is returned. The syncer looks up the last reported node IP, which still points to the old IP. I have a working fix for this in progress now, but it will only be available in Ray 2.1 (or maybe 2.0.1). In the meantime, you can do this as a workaround:
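(The maintainer's actual workaround snippet is not preserved in this capture of the thread. Below is a minimal sketch, assuming Ray 2.0's `tune.Callback` API, of a callback in the spirit of that suggestion: it overwrites the stale node IP in each reported result so the post-result syncer targets the node the trial is actually running on. The exact hook and result key the maintainer used are assumptions, not their verified code.)

```python
import ray
from ray.tune import Callback


class OverrideNodeIpCallback(Callback):
    """Sketch (assumed): replace the stale node IP in reported results with
    the IP of the node the trial is actually running on."""

    def on_trial_result(self, iteration, trials, trial, result, **info):
        current_ip = ray.util.get_node_ip_address()
        # "node_ip" is the auto-filled result key that the syncer and the
        # CLIReporter read; overwrite it with the current node's IP.
        result["node_ip"] = current_ip
        trial.last_result["node_ip"] = current_ip
```

A workaround along these lines would be passed to the resumed run via `tune.run(..., callbacks=[OverrideNodeIpCallback()])`.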
Thanks for the quick reply and happy to hear that a fix is already on its way!
Apologies, I didn't clean up my local test results directory properly. Can you try again with this callback? Thanks!
Thanks a lot! The workaround works. I really appreciate the fast replies.
What happened + What you expected to happen
Describing the bug
I first start a Tune run in a Docker container with a certain IP address. I then resume this run from a container with a different IP address. It resumes at the correct iteration, but after one iteration I get the error
Error: No available node types can fulfill resource request {'node:172.18.0.2': 0.01}. Add suitable node types to this cluster to resolve this issue.
It then emits this error roughly every 30 seconds and makes no further progress.
Expected behavior
I expect the run to simply continue; resuming on a different node IP should not be an issue.
Useful information
When using the CLIReporter to also output the node IP, the resumed run still prints the old node IP. However, it restores (resumes) on the new node IP. The old IP is 172.18.0.2 and the new one is 172.18.0.3:
(PPO pid=515) 2022-09-13 04:24:37,581 INFO trainable.py:669 -- Restored on 172.18.0.3 from checkpoint: /tmp/checkpoint_tmp_md7ga0os
(PPO pid=515) 2022-09-13 04:24:37,581 INFO trainable.py:677 -- Current state after restoring: {'_iteration': 8, '_timesteps_total': None, '_time_total': 60.78257393836975, '_episodes_total': 443}
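(For context, surfacing the node IP in the console output can be done roughly like this; a sketch, not the configuration from the original run, and the extra metric columns are assumptions.)

```python
from ray.tune import CLIReporter

# Sketch: show the auto-filled "node_ip" result key next to the usual metrics
# so an IP change after resuming is visible in the console output
# (pass this as progress_reporter= to tune.run).
reporter = CLIReporter(
    metric_columns=["node_ip", "training_iteration", "episode_reward_mean"]
)
```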
For background: the run ends up on a different node IP when Azure preempts the node (it is a spot instance) and the run is resumed later. At that point the IP has changed.
Versions / Dependencies
Docker container: rayproject/ray-ml:2.0.0-cpu (pulled 13 September)
This image uses Python 3.7 and Ray 2.0.0.
Reproduction script
The reproduction needs multiple steps, so I will explain step by step what needs to be done to reproduce the error.
docker pull rayproject/ray-ml:2.0.0-cpu
docker network create --subnet=172.18.0.0/16 ray_test_network
Note: it is important that the path {path_to_folder_where_you_want_to_store_the_results_locally} is the same, so that the run resumes from the checkpoint of the previous Docker run.
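(The original docker run commands and training script are not included above. The following is a minimal sketch, not the reporter's actual script, of a Tune run whose results directory is shared between both containers so a later run can resume from it; the experiment name, mount point, environment, and stopping criteria are assumptions.)

```python
from ray import tune

# Sketch: write results and checkpoints to a fixed path that both containers
# mount at {path_to_folder_where_you_want_to_store_the_results_locally}, and
# resume automatically if an experiment already exists there.
tune.run(
    "PPO",
    name="resume_test",
    local_dir="/results",            # the mounted results folder
    config={"env": "CartPole-v1"},
    stop={"training_iteration": 20},
    checkpoint_freq=1,
    resume="AUTO",                   # first run starts fresh, second run resumes
)
```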
Issue Severity
High: It blocks me from completing my task.