
[Tune] Resuming to different node ip than original one doesn't work #28468

Closed
RaymondKoopmanschap opened this issue Sep 13, 2022 · 4 comments · Fixed by #28470
Assignees: krfricke
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks)

@RaymondKoopmanschap

What happened + What you expected to happen

Describing the bug
I first start a Tune run in a Docker container with a certain IP address and then resume that run from a container with a different IP address. The run resumes at the correct iteration, but after one iteration I get the error: Error: No available node types can fulfill resource request {'node:172.18.0.2': 0.01}. Add suitable node types to this cluster to resolve this issue.
The error is then printed every 30 seconds or so and the run makes no further progress.
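
For context, the node:<ip> entry in that request is the per-node resource Ray automatically creates for every node, keyed by the node's IP, so after the restart only the new node's entry exists. A quick way to inspect which node resources the cluster actually has (a sketch; run inside the second container while the resumed job is up):

import ray

# Connect to the Ray instance started by the resumed Tune run.
ray.init(address="auto")

# Ray creates a resource "node:<ip>" for every node in the cluster. After the
# restart only "node:172.18.0.3" should show up here, so a request for
# "node:172.18.0.2" can never be fulfilled.
print({k: v for k, v in ray.cluster_resources().items() if k.startswith("node:")})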

Expected behavior
I expect it will just continue with the run and doing this on a different ip should not be an issue.

Useful information
When using the CLIReporter to also output the node IP, the resumed run still prints the old node IP, even though it restores (resumes) on the new one. The old IP is 172.18.0.2 and the new one is 172.18.0.3:
(PPO pid=515) 2022-09-13 04:24:37,581 INFO trainable.py:669 -- Restored on 172.18.0.3 from checkpoint: /tmp/checkpoint_tmp_md7ga0os
(PPO pid=515) 2022-09-13 04:24:37,581 INFO trainable.py:677 -- Current state after restoring: {'_iteration': 8, '_timesteps_total': None, '_time_total': 60.78257393836975, '_episodes_total': 443}
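
The stale value can also be seen directly on disk: Tune appends each result to the trial's result.json, which includes the node_ip field. A sketch for inspecting it (the experiment path and trial directory layout are assumed from the default local setup used below; adjust to wherever the results directory is mounted):

import json
from pathlib import Path

# Assumed default experiment directory; trials live in subdirectories,
# each with a result.json containing one JSON result per line.
experiment_dir = Path("~/ray_results/PPO").expanduser()

for result_file in experiment_dir.glob("*/result.json"):
    lines = result_file.read_text().splitlines()
    if lines:
        last_result = json.loads(lines[-1])
        print(result_file.parent.name, "->", last_result.get("node_ip"))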

For some background: the situation where the run restarts on a different node IP occurs when Azure preempts the node (it is a spot instance) and the job resumes later, at which point the IP has changed.

Versions / Dependencies

Docker container: rayproject/ray-ml:2.0.0-cpu (pulled 13 September). This image uses Python 3.7 and Ray 2.0.0.
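
To double-check the versions from inside the container (just a sanity check, nothing specific to this issue):

import sys
import ray

# Prints the Ray and Python versions the container actually runs.
print("Ray:", ray.__version__, "| Python:", sys.version.split()[0])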

Reproduction script

The reproduction requires multiple steps, so I will explain step by step what needs to be done to reproduce the error.

  1. Pull the ray docker container: docker pull rayproject/ray-ml:2.0.0-cpu
  2. Create a subnetwork: docker network create --subnet=172.18.0.0/16 ray_test_network
  3. Copy the code below, name it rllib_simple.py, and put it in some folder.
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

num_iterations = 20
checkpoint_frequency = 1
config = PPOConfig().environment(env="CartPole-v0").framework(framework="torch").rollouts(batch_mode="complete_episodes")

tune.run("PPO",
         resume="AUTO",
         config=config.to_dict(),
         stop={"training_iteration": num_iterations},
         verbose=1,
         progress_reporter=tune.CLIReporter(metric_columns=["episode_reward_mean", "episode_len_mean", "episodes_this_iter", "training_iteration", "timesteps_total", "node_ip"]),
         checkpoint_freq=checkpoint_frequency,
         keep_checkpoints_num=None,
         checkpoint_at_end=True)
  4. Run a Docker container with the command below in PowerShell or some other CLI. Let it run for, say, 10 iterations, then kill it with Ctrl+C.
docker run -i -t --rm `
    --name rllib_simple `
    --net ray_test_network --ip 172.18.0.2 `
    -v {path_to_folder_you_put_the_rllib_simple.py_file}:/home/developer/workspace `
    -v '{path_to_folder_where_you_want_to_store_the_results_locally}:/home/ray/ray_results/' `
    -w /home/developer/workspace `
    --shm-size=2gb `
    rayproject/ray-ml:2.0.0-cpu `
    bash -c 'python -u rllib_simple.py'
  5. Run the second Docker container with the command below in PowerShell or some other CLI. It is nearly the same command, except the IP has changed (a small sanity check follows the commands below). After one iteration it gives the error; wait roughly 30 seconds for it to show up.
    Note: it is important that {path_to_folder_where_you_want_to_store_the_results_locally} is the same, so the run resumes from the checkpoints of the previous Docker run.
docker run -i -t --rm `
    --name rllib_simple `
    --net ray_test_network --ip 172.18.0.3 `
    -v {path_to_folder_you_put_the_rllib_simple.py_file}:/home/developer/workspace `
    -v '{path_to_folder_where_you_want_to_store_the_results_locally}:/home/ray/ray_results/' `
    -w /home/developer/workspace `
    --shm-size=2gb `
    rayproject/ray-ml:2.0.0-cpu `
    bash -c 'python -u rllib_simple.py'
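
As the sanity check mentioned in step 5: before rerunning the script, you can confirm that Ray inside the second container really picks up the new address (a sketch; run it inside the container, with the IPs assumed as above):

import ray

# Start a throwaway local Ray instance just to see which node IP it registers;
# this should print 172.18.0.3 inside the second container.
ray.init()
print([node["NodeManagerAddress"] for node in ray.nodes()])
ray.shutdown()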

Issue Severity

High: It blocks me from completing my task.

@RaymondKoopmanschap added the bug and triage labels on Sep 13, 2022
@krfricke added the P1 label and removed the triage label on Sep 13, 2022
@krfricke self-assigned this on Sep 13, 2022
@krfricke (Contributor) commented Sep 13, 2022

Confirmed, this is an issue. The problem is actually in the post-result syncing: the next checkpoint is saved before the next result is returned, and the syncer looks up the last reported node IP, which still points to the old one. I have a working fix for this in progress, but it will only be available in Ray 2.1 (or maybe 2.0.1).

In the meantime, you can do this as a workaround:

from ray import tune
from ray.tune.result import NODE_IP


class ClearNodeIpCallback(tune.Callback):
    def on_trial_start(self, iteration, trials, trial):
        # Drop the stale node IP that was restored into the trial's last result.
        trial.last_result.pop(NODE_IP)


tune.run(
    # ...
    callbacks=[ClearNodeIpCallback()]
)

@krfricke changed the title from "[Core] [Tune] Resuming to different node ip than original one doesn't work" to "[Tune] Resuming to different node ip than original one doesn't work" on Sep 13, 2022
@RaymondKoopmanschap (Author)

Thanks for the quick reply and happy to hear that a fix is already on its way!
The workaround doesn't seem to work, however. I made sure it is executed by adding a print statement. This is the code I now use for rllib_simple.py:

from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.result import NODE_IP


num_iterations = 20
checkpoint_frequency = 1
config = PPOConfig().environment(env="CartPole-v0").framework(framework="torch").rollouts(batch_mode="complete_episodes")


class ClearNodeIpCallback(tune.Callback):
    def on_trial_start(self, iteration, trials, trial):
        print('THIS IS EXECUTED')
        trial.last_result.pop(NODE_IP)


tune.run("PPO",
         resume="AUTO",
         config=config.to_dict(),
         callbacks=[ClearNodeIpCallback()],
         stop={"training_iteration": num_iterations},
         verbose=1,
         progress_reporter=tune.CLIReporter(metric_columns=["episode_reward_mean", "episode_len_mean", "episodes_this_iter", "training_iteration", "timesteps_total", "node_ip"]),
         checkpoint_freq=checkpoint_frequency,
         keep_checkpoints_num=None,
         checkpoint_at_end=True)

@krfricke (Contributor)

Apologies, I didn't clean up my local test results directory properly.

Can you try again with this callback:

from ray import tune
from ray.tune.result import NODE_IP


class ClearNodeIpCallback(tune.Callback):
    def __init__(self):
        self.reset_trials = set()

    def on_trial_result(self, iteration, trials, trial, result):
        # Clear the stale node IP once per trial, on the first result that
        # comes in after restoring.
        if trial.trial_id not in self.reset_trials:
            trial.last_result.pop(NODE_IP, None)
            self.reset_trials.add(trial.trial_id)

Thanks!

@RaymondKoopmanschap (Author)

Thanks a lot! The workaround works. I really appreciate the fast replies.
