
[Tune] Resuming to different node ip than original one doesn't work #28468

Closed
RaymondKoopmanschap opened this issue Sep 13, 2022 · 4 comments · Fixed by #28470
Assignees: krfricke
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks)

@RaymondKoopmanschap

What happened + What you expected to happen

Describing the bug
I first start a Tune run in a Docker container with a certain IP address and then resume that run from a container with a different IP address. The run resumes at the correct iteration, but after one iteration I get the error: Error: No available node types can fulfill resource request {'node:172.18.0.2': 0.01}. Add suitable node types to this cluster to resolve this issue.
The error is then printed every 30 seconds or so and the run makes no further progress.
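
For context, the node:<ip> entry in that request is the per-node resource Ray automatically creates for every node, keyed by the node's IP, so after the restart only the new node's entry exists. A quick way to inspect which node resources the cluster actually has (a sketch; run inside the second container while the resumed job is up):

import ray

# Connect to the Ray instance started by the resumed Tune run.
ray.init(address="auto")

# Ray creates a resource "node:<ip>" for every node in the cluster. After the
# restart only "node:172.18.0.3" should show up here, so a request for
# "node:172.18.0.2" can never be fulfilled.
print({k: v for k, v in ray.cluster_resources().items() if k.startswith("node:")})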

Expected behavior
I expect it will just continue with the run and doing this on a different ip should not be an issue.

Useful information
When using the CLIReporter to also output the node IP, the resumed run still prints the old node IP, even though it restores (resumes) on the new one. The old IP is 172.18.0.2 and the new one is 172.18.0.3:
(PPO pid=515) 2022-09-13 04:24:37,581 INFO trainable.py:669 -- Restored on 172.18.0.3 from checkpoint: /tmp/checkpoint_tmp_md7ga0os
(PPO pid=515) 2022-09-13 04:24:37,581 INFO trainable.py:677 -- Current state after restoring: {'_iteration': 8, '_timesteps_total': None, '_time_total': 60.78257393836975, '_episodes_total': 443}
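
The stale value can also be seen directly on disk: Tune appends each result to the trial's result.json, which includes the node_ip field. A sketch for inspecting it (the experiment path and trial directory layout are assumed from the default local setup used below; adjust to wherever the results directory is mounted):

import json
from pathlib import Path

# Assumed default experiment directory; trials live in subdirectories,
# each with a result.json containing one JSON result per line.
experiment_dir = Path("~/ray_results/PPO").expanduser()

for result_file in experiment_dir.glob("*/result.json"):
    lines = result_file.read_text().splitlines()
    if lines:
        last_result = json.loads(lines[-1])
        print(result_file.parent.name, "->", last_result.get("node_ip"))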

For some background: the situation where the run restarts on a different node IP occurs when Azure preempts the node (it is a spot instance) and the job resumes later, at which point the IP has changed.

Versions / Dependencies

Docker container: rayproject/ray-ml:2.0.0-cpu (pulled 13 September). This image uses Python 3.7 and Ray 2.0.0.
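
To double-check the versions from inside the container (just a sanity check, nothing specific to this issue):

import sys
import ray

# Prints the Ray and Python versions the container actually runs.
print("Ray:", ray.__version__, "| Python:", sys.version.split()[0])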

Reproduction script

The reproduction requires multiple steps, so I will explain step by step what needs to be done to reproduce the error.

  1. Pull the ray docker container: docker pull rayproject/ray-ml:2.0.0-cpu
  2. Create a subnetwork: docker network create --subnet=172.18.0.0/16 ray_test_network
  3. Copy the code below, name it rllib_simple.py, and put it in some folder.
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

num_iterations = 20
checkpoint_frequency = 1
config = PPOConfig().environment(env="CartPole-v0").framework(framework="torch").rollouts(batch_mode="complete_episodes")

tune.run("PPO",
         resume="AUTO",
         config=config.to_dict(),
         stop={"training_iteration": num_iterations},
         verbose=1,
         progress_reporter=tune.CLIReporter(metric_columns=["episode_reward_mean", "episode_len_mean", "episodes_this_iter", "training_iteration", "timesteps_total", "node_ip"]),
         checkpoint_freq=checkpoint_frequency,
         keep_checkpoints_num=None,
         checkpoint_at_end=True)
  4. Run a Docker container with the command below in PowerShell or some other CLI. Let it run for, say, 10 iterations, then kill it with Ctrl+C.
docker run -i -t --rm `
    --name rllib_simple `
    --net ray_test_network --ip 172.18.0.2 `
    -v {path_to_folder_you_put_the_rllib_simple.py_file}:/home/developer/workspace `
    -v '{path_to_folder_where_you_want_to_store_the_results_locally}:/home/ray/ray_results/' `
    -w /home/developer/workspace `
    --shm-size=2gb `
    rayproject/ray-ml:2.0.0-cpu `
    bash -c 'python -u rllib_simple.py'
  5. Run the second Docker container with the command below in PowerShell or some other CLI. It is nearly the same command, except the IP has changed (a small sanity check follows the commands below). After one iteration it gives the error; wait roughly 30 seconds for it to show up.
    Note: it is important that {path_to_folder_where_you_want_to_store_the_results_locally} is the same, so the run resumes from the checkpoints of the previous Docker run.
docker run -i -t --rm `
    --name rllib_simple `
    --net ray_test_network --ip 172.18.0.3 `
    -v {path_to_folder_you_put_the_rllib_simple.py_file}:/home/developer/workspace `
    -v '{path_to_folder_where_you_want_to_store_the_results_locally}:/home/ray/ray_results/' `
    -w /home/developer/workspace `
    --shm-size=2gb `
    rayproject/ray-ml:2.0.0-cpu `
    bash -c 'python -u rllib_simple.py'
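
As the sanity check mentioned in step 5: before rerunning the script, you can confirm that Ray inside the second container really picks up the new address (a sketch; run it inside the container, with the IPs assumed as above):

import ray

# Start a throwaway local Ray instance just to see which node IP it registers;
# this should print 172.18.0.3 inside the second container.
ray.init()
print([node["NodeManagerAddress"] for node in ray.nodes()])
ray.shutdown()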

Issue Severity

High: It blocks me from completing my task.

@RaymondKoopmanschap added the bug and triage labels on Sep 13, 2022
@krfricke added the P1 label and removed the triage label on Sep 13, 2022
@krfricke self-assigned this on Sep 13, 2022
@krfricke (Contributor) commented Sep 13, 2022

Confirmed, this is an issue. The problem is actually in the post-result syncing: the next checkpoint is saved before the next result is returned, and the syncer looks up the last reported node IP, which still points to the old one. I have a working fix for this in progress, but it will only be available in Ray 2.1 (or maybe 2.0.1).

In the meantime, you can do this as a workaround:

from ray import tune
from ray.tune.result import NODE_IP


class ClearNodeIpCallback(tune.Callback):
    def on_trial_start(self, iteration, trials, trial):
        # Drop the stale node IP that was restored into the trial's last result.
        trial.last_result.pop(NODE_IP)


tune.run(
    # ...
    callbacks=[ClearNodeIpCallback()]
)

@krfricke changed the title from "[Core] [Tune] Resuming to different node ip than original one doesn't work" to "[Tune] Resuming to different node ip than original one doesn't work" on Sep 13, 2022
@RaymondKoopmanschap (Author)

Thanks for the quick reply and happy to hear that a fix is already on its way!
The workaround doesn't seem to work, however. I made sure it is executed by adding a print statement. This is the code I now use for rllib_simple.py:

from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.result import NODE_IP


num_iterations = 20
checkpoint_frequency = 1
config = PPOConfig().environment(env="CartPole-v0").framework(framework="torch").rollouts(batch_mode="complete_episodes")


class ClearNodeIpCallback(tune.Callback):
    def on_trial_start(self, iteration, trials, trial):
        print('THIS IS EXECUTED')
        trial.last_result.pop(NODE_IP)


tune.run("PPO",
         resume="AUTO",
         config=config.to_dict(),
         callbacks=[ClearNodeIpCallback()],
         stop={"training_iteration": num_iterations},
         verbose=1,
         progress_reporter=tune.CLIReporter(metric_columns=["episode_reward_mean", "episode_len_mean", "episodes_this_iter", "training_iteration", "timesteps_total", "node_ip"]),
         checkpoint_freq=checkpoint_frequency,
         keep_checkpoints_num=None,
         checkpoint_at_end=True)

@krfricke (Contributor)

Apologies, I didn't clean up my local test results directory properly.

Can you try again with this callback:

from ray import tune
from ray.tune.result import NODE_IP


class ClearNodeIpCallback(tune.Callback):
    def __init__(self):
        self.reset_trials = set()

    def on_trial_result(self, iteration, trials, trial, result):
        # Clear the stale node IP once per trial, on the first result that
        # comes in after restoring.
        if trial.trial_id not in self.reset_trials:
            trial.last_result.pop(NODE_IP, None)
            self.reset_trials.add(trial.trial_id)

Thanks!

@RaymondKoopmanschap (Author)

Thanks a lot! The workaround works. I really appreciate the fast replies.
