[tune] Raylet crashing if zero resource capacity is used #4870

Closed
willgroves opened this issue May 26, 2019 · 8 comments · Fixed by #4945

Comments

@willgroves

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): OSX 10.12 and Ubuntu 16.04 (can reproduce on both)
  • Ray installed from (source or binary): binary
  • Ray version: 0.7.0
  • Python version: 3.5, 3.6, 3.7 (can reproduce on all)
  • Exact command to reproduce: see the minimal source code snippet below.

Describe the problem

I am attempting to use a Ray Actor (functioning as a data reader/cache singleton) inside a function that is being optimized with Ray Tune. After the optimization process runs for about 10 minutes, it always fails with a "Raylet connection closed" exception and quits the optimization process.

e.g.

Exception: [RayletClient] Raylet connection closed.

2019-05-25 22:23:57,216	ERROR worker.py:1678 -- The node with client ID 13f2e72398ffd893090f7fd31bae0c1409755432 has been marked dead because the monitor has missed too many heartbeats from it.

The error seems to be related to the duration of the optimization task rather than the CPU load, as the minimal example below shows. The error still appears even with various changes to the task (running on OSX or Ubuntu, changing the Python version, setting reuse_actors to False or True, changing the number of CPUs passed to ray.init). Minimal code to reproduce the error is provided below. Is this a bug in heartbeat generation for Ray Actors?

A functioning workaround was to remove the Ray Actor and instead create a function that uses a filesystem lock file to implement the singleton. Are Ray Actors supported as part of a Ray Tune optimization process?
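
For reference, a minimal sketch of such a lock-file workaround might look like the following (this is an approximation of the workaround described above, not the actual code used; it assumes the same lockfile package and data-read logic as the actor below, and the trainable would call this plain function directly instead of going through the actor):

import time
from lockfile import LockFile

def get_data_for_item(config):
    # Plain function replacing the DataService actor; a filesystem lock
    # serializes access to the shared data source across trials.
    lock = LockFile("/tmp/test.lockb")
    lock.acquire()
    try:
        time.sleep(2)  # stand-in for the actual data read
        return config.get('identifier', "") + "whatever"
    finally:
        lock.release()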

Source code / logs

Minimal source code to reproduce the error (takes about 10 minutes to hit the error):

import ray
import time
from lockfile import LockFile
import numpy as np
import os  # needed by _save for os.path.join / os.makedirs
import pickle
from ray.tune import Trainable, run, Experiment, sample_from
from ray.tune.schedulers import HyperBandScheduler, AsyncHyperBandScheduler

@ray.remote(resources={'datasource': 1})
class DataService(object):
    def __init__(self):
        pass
    
    def get_data_for_item(self, config):
        # A filesystem lock ensures only one caller reads the data source at a time.
        lock = LockFile("/tmp/test.lockb")
        print(lock.path, 'trying to acquire.')
        lock.acquire()
        print(lock.path, 'is locked.')
        time.sleep(2)
        thing = config.get('identifier',"")+"whatever"
        lock.release()
        print(lock.path, 'is released.')
        return thing

class MyTrainableClass(Trainable):

    def _setup(self, config):
        self.config = config
        self.timestep = 0
    
    def _train(self):
        self.timestep += 1
        print("what is params",self.config)
        val = ray.get(self.config['remote_service'].get_data_for_item.remote(self.config))
        time.sleep(20)
        self.result = val
        print("result",self.config['identifier'],val)
        return {'objective': len(val), 'done': True}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint",str(self.timestep))
        if not os.path.exists(os.path.dirname(path)):
            os.makedirs(os.path.dirname(path))
        with open(path, "wb") as f:
            pickle.dump(self.result,f)
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path,'rb') as f:
            self.result = pickle.load(f)  # restore what _save wrote
            self.timestep = 1


ray.shutdown()
ray.init(num_cpus=7, resources={"datasource": 1}, object_store_memory=200000000, log_to_driver=True)
remote_h = DataService.remote()
print("remote_h", remote_h)
hyperband = AsyncHyperBandScheduler(
    time_attr="training_iteration",
    reward_attr="objective",
    max_t=100)
exp = Experiment(
    name="asynchyperband_class_testx",
    run=MyTrainableClass,
    num_samples=150,
    stop={"training_iteration": 150},
    config={
        'identifier' : ray.tune.sample_from(lambda x: np.random.choice(['good','better','best','longest'])),
        'useless' : ray.tune.sample_from(lambda x: np.random.choice([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])),
        #'window' : ray.tune.sample_from(lambda x: np.random.choice([40,50,60,70,80,90,100,110])),
        #'window2' : ray.tune.sample_from(lambda x: np.random.choice([40,50,60,70,80,90,100,110])),
        #'thresh' : ray.tune.sample_from(lambda x: np.random.choice(list(range(20,70,2)))),
        #'thresh2' : ray.tune.sample_from(lambda x: np.random.choice(list(range(20,60,2)))),
        'remote_service' : remote_h 
    }
    )
trials = run(exp, scheduler=hyperband, verbose=1, reuse_actors=True)
del remote_h
ray.shutdown()

Observed Exception (OSX 10.12, Python 3.6.0, Ray 0.7.0 from pip):

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-4-38f977638a63> in <module>()
     24     }
     25     )
---> 26 trials = run(exp,scheduler=hyperband,verbose=1, reuse_actors=True)
     27 del remote_h
     28 ray.shutdown()

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/tune/tune.py in run(run_or_experiment, name, stop, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, loggers, sync_function, checkpoint_freq, checkpoint_at_end, export_formats, max_failures, restore, search_alg, scheduler, with_server, server_port, verbose, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial)
    245     last_debug = 0
    246     while not runner.is_finished():
--> 247         runner.step()
    248         if time.time() - last_debug > DEBUG_PRINT_INTERVAL:
    249             if verbose:

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/tune/trial_runner.py in step(self)
    274                 self.trial_executor.start_trial(next_trial)
    275         elif self.trial_executor.get_running_trials():
--> 276             self._process_events()  # blocking
    277         else:
    278             for trial in self._trials:

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/tune/trial_runner.py in _process_events(self)
    438 
    439     def _process_events(self):
--> 440         trial = self.trial_executor.get_next_available_trial()  # blocking
    441         with warn_if_slow("process_trial"):
    442             self._process_trial(trial)

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py in get_next_available_trial(self)
    290         # See https://github.com/ray-project/ray/issues/4211 for details.
    291         start = time.time()
--> 292         [result_id], _ = ray.wait(shuffled_results)
    293         wait_time = time.time() - start
    294         if wait_time > NONTRIVIAL_WAIT_TIME_THRESHOLD_S:

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/worker.py in wait(object_ids, num_returns, timeout)
   2322             timeout_milliseconds,
   2323             False,
-> 2324             worker.current_task_id,
   2325         )
   2326         return ready_ids, remaining_ids

python/ray/_raylet.pyx in ray._raylet.RayletClient.wait()

python/ray/_raylet.pyx in ray._raylet.check_status()

Exception: [RayletClient] Raylet connection closed.

2019-05-25 22:23:57,216	ERROR worker.py:1678 -- The node with client ID 13f2e72398ffd893090f7fd31bae0c1409755432 has been marked dead because the monitor has missed too many heartbeats from it.

raylet_monitor.err file:

WARNING: Logging before InitGoogleLogging() is written to STDERR
W0525 22:23:57.210012 3839128512 monitor.cc:49] Client timed out: 13f2e72398ffd893090f7fd31bae0c1409755432

raylet.err file:

WARNING: Logging before InitGoogleLogging() is written to STDERR
W0525 22:13:38.203748 3839128512 stats.h:50] Failed to create the Prometheus exporter. This doesn't affect anything except stats. Caused by: null context when constructing CivetServer. Possible problem binding to port.
W0525 22:14:23.657073 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:25.661929 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:27.668455 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:29.674670 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:31.684083 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:33.697531 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:47.302045 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:49.309584 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:51.318904 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:53.325412 3839128512 node_manager.cc:1726] Resources oversubscribed: {}
W0525 22:14:56.674257 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:58.680490 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:15:10.775121 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:15:12.781167 3839128512 node_manager.cc:1726] Resources oversubscribed: {}
...
W0525 22:22:57.761237 3839128512 node_manager.cc:1726] Resources oversubscribed: {}
W0525 22:23:21.603353 3839128512 node_manager.cc:1726] Resources oversubscribed: {}
F0525 22:23:25.221491 3839128512 scheduling_resources.cc:171]  Check failed: resource_capacity_.count(resource_label) == 1 Attempt to acquire unknown resource: CPU
*** Check failure stack trace: ***
*** Aborted at 1558837405 (unix time) try "date -d @1558837405" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fffdbf5ed42) received by PID 85294 (TID 0x7fffe4d473c0) stack trace: ***
    @     0x7fffdc03fb3a _sigtramp
    @        0x10ac0f551 (unknown)
    @     0x7fffdbec4420 abort
    @        0x105e1d899 google::logging_fail()
    @        0x105e1c5d6 google::LogMessage::SendToLog()
    @        0x105e1ccc5 google::LogMessage::Flush()
    @        0x105e1cda2 google::LogMessage::~LogMessage()
    @        0x105e16835 ray::RayLog::~RayLog()
    @        0x105d42a8c ray::raylet::ResourceSet::SubtractResourcesStrict()
    @        0x105d1368b ray::raylet::NodeManager::HandleTaskUnblocked()
    @        0x105d10748 ray::raylet::NodeManager::ProcessClientMessage()
    @        0x105d3331c std::__1::__function::__func<>::operator()()
    @        0x105e081fc ray::ClientConnection<>::ProcessMessage()
    @        0x105e101b8 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    @        0x105cecff3 boost::asio::detail::scheduler::do_run_one()
    @        0x105ceca42 boost::asio::detail::scheduler::run()
    @        0x105cdf561 main
    @     0x7fffdbe30235 start
    @               0x10 (unknown)
@richardliaw
Contributor

Looks like there's some issue with resource oversubscription that becomes more and more likely over time? Can you try num_cpus=3 to see if this reproduces faster?

@willgroves
Author

There appears to be a relationship between the num_cpus setting and the time before the error is encountered. If the setting is low enough, the error does not appear.

Here are some tests based on the sample code, run on a 2-core (4 virtual cores with hyperthreading) MacBook Pro.

  • num_cpus=3: ran 32 minutes, no error
  • num_cpus=4: raylet connection closed exception after 7 minutes
  • num_cpus=5: raylet connection closed exception after 6 minutes
  • num_cpus=6: raylet connection closed exception after 2.5 minutes
  • num_cpus=7: raylet connection closed exception after 3 minutes
  • num_cpus=12: raylet connection closed exception after 1 minute 35 seconds
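
Presumably only the num_cpus argument to ray.init in the script above was changed between these runs, e.g.:

ray.init(num_cpus=4, resources={"datasource": 1},
         object_store_memory=200000000, log_to_driver=True)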

@richardliaw
Contributor

Hm, looks like some resource bookkeeping bug? cc @stephanie-wang @robertnishihara @pcmoritz

@pcmoritz pcmoritz changed the title [tune] Raylet connection closed Exception in Ray Tune optimization when trainable calls a Ray Actor. [tune] Raylet crashing if zero resource capacity is used May 29, 2019
@pcmoritz
Contributor

pcmoritz commented May 29, 2019

This is potentially related to 0f42f87 but I'm not super sure. Let me try to reproduce it.

@pcmoritz
Contributor

@willgroves So I can't reproduce the bug with the script you provided (I ran it a couple of times). Is the reproduction fairly deterministic or does the problem happen rarely?

I can reproduce a related but different error: #4892

@richardliaw
Contributor

@pcmoritz did you try setting num_cpus to 20?

@robertnishihara
Collaborator

robertnishihara commented Jun 7, 2019

Ok, I can reproduce the crash using the original script (with num_cpus=12).

EDIT: Not consistently, unfortunately, but it does happen sometimes.

@robertnishihara
Collaborator

#4945 should fix this; however, I still don't completely understand the issue because I haven't been able to produce a minimal example that triggers it.
