[tune] Raylet crashing if zero resource capacity is used #4870

Closed
willgroves opened this issue May 26, 2019 · 8 comments · Fixed by #4945

Comments

@willgroves

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): OSX 10.12 and Ubuntu 16.04 (can reproduce on both)
  • Ray installed from (source or binary): binary
  • Ray version: 0.7.0
  • Python version: 3.5, 3.6, 3.7 (can reproduce on all)
  • Exact command to reproduce: see the minimal source code snippet below.

Describe the problem

I am attempting to use a Ray Actor (functioning as a data reader/cache singleton) inside a function that is being optimized with Ray Tune. After the optimization process runs for about 10 minutes, it always fails with a "Raylet connection closed" exception and quits the optimization process.

e.g.

Exception: [RayletClient] Raylet connection closed.

2019-05-25 22:23:57,216	ERROR worker.py:1678 -- The node with client ID 13f2e72398ffd893090f7fd31bae0c1409755432 has been marked dead because the monitor has missed too many heartbeats from it.

The error seems to be related to the duration of the optimization task rather than the CPU load, as the minimal example below shows. The error still appears even with various changes to the task (running on OSX or Ubuntu, changing the Python version, setting reuse_actors to False or True, changing the number of CPUs passed to ray.init). Minimal code to reproduce the error is provided below. Is this a bug in heartbeat generation for Ray Actors?

A functioning workaround was to remove the Ray Actor and instead create a function that uses a filesystem lock file to implement the singleton. Are Ray Actors supported as part of a Ray Tune optimization process?
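
For reference, a minimal sketch of such a lock-file workaround might look like the following (this is an approximation of the workaround described above, not the actual code used; it assumes the same lockfile package and data-read logic as the actor below, and the trainable would call this plain function directly instead of going through the actor):

import time
from lockfile import LockFile

def get_data_for_item(config):
    # Plain function replacing the DataService actor; a filesystem lock
    # serializes access to the shared data source across trials.
    lock = LockFile("/tmp/test.lockb")
    lock.acquire()
    try:
        time.sleep(2)  # stand-in for the actual data read
        return config.get('identifier', "") + "whatever"
    finally:
        lock.release()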

Source code / logs

Minimal source code to reproduce the error (takes about 10 minutes to hit the error):

import ray
import time
from lockfile import LockFile
import numpy as np
import os  # needed by _save for os.path.join / os.makedirs
import pickle
from ray.tune import Trainable, run, Experiment, sample_from
from ray.tune.schedulers import HyperBandScheduler, AsyncHyperBandScheduler

@ray.remote(resources={'datasource': 1})
class DataService(object):
    def __init__(self):
        pass
    
    def get_data_for_item(self, config):
        # A filesystem lock ensures only one caller reads the data source at a time.
        lock = LockFile("/tmp/test.lockb")
        print(lock.path, 'trying to acquire.')
        lock.acquire()
        print(lock.path, 'is locked.')
        time.sleep(2)
        thing = config.get('identifier',"")+"whatever"
        lock.release()
        print(lock.path, 'is released.')
        return thing

class MyTrainableClass(Trainable):

    def _setup(self, config):
        self.config = config
        self.timestep = 0
    
    def _train(self):
        self.timestep += 1
        print("what is params",self.config)
        val = ray.get(self.config['remote_service'].get_data_for_item.remote(self.config))
        time.sleep(20)
        self.result = val
        print("result",self.config['identifier'],val)
        return {'objective': len(val), 'done': True}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint",str(self.timestep))
        if not os.path.exists(os.path.dirname(path)):
            os.makedirs(os.path.dirname(path))
        with open(path, "wb") as f:
            pickle.dump(self.result,f)
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path,'rb') as f:
            self.result = pickle.load(f)  # restore what _save wrote
            self.timestep = 1


ray.shutdown()
ray.init(num_cpus=7, resources={"datasource": 1}, object_store_memory=200000000, log_to_driver=True)
remote_h = DataService.remote()
print("remote_h", remote_h)
hyperband = AsyncHyperBandScheduler(
    time_attr="training_iteration",
    reward_attr="objective",
    max_t=100)
exp = Experiment(
    name="asynchyperband_class_testx",
    run=MyTrainableClass,
    num_samples=150,
    stop={"training_iteration": 150},
    config={
        'identifier' : ray.tune.sample_from(lambda x: np.random.choice(['good','better','best','longest'])),
        'useless' : ray.tune.sample_from(lambda x: np.random.choice([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])),
        #'window' : ray.tune.sample_from(lambda x: np.random.choice([40,50,60,70,80,90,100,110])),
        #'window2' : ray.tune.sample_from(lambda x: np.random.choice([40,50,60,70,80,90,100,110])),
        #'thresh' : ray.tune.sample_from(lambda x: np.random.choice(list(range(20,70,2)))),
        #'thresh2' : ray.tune.sample_from(lambda x: np.random.choice(list(range(20,60,2)))),
        'remote_service' : remote_h 
    }
    )
trials = run(exp, scheduler=hyperband, verbose=1, reuse_actors=True)
del remote_h
ray.shutdown()

Observed Exception (OSX 10.12, Python 3.6.0, Ray 0.7.0 from pip):

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-4-38f977638a63> in <module>()
     24     }
     25     )
---> 26 trials = run(exp,scheduler=hyperband,verbose=1, reuse_actors=True)
     27 del remote_h
     28 ray.shutdown()

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/tune/tune.py in run(run_or_experiment, name, stop, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, loggers, sync_function, checkpoint_freq, checkpoint_at_end, export_formats, max_failures, restore, search_alg, scheduler, with_server, server_port, verbose, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial)
    245     last_debug = 0
    246     while not runner.is_finished():
--> 247         runner.step()
    248         if time.time() - last_debug > DEBUG_PRINT_INTERVAL:
    249             if verbose:

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/tune/trial_runner.py in step(self)
    274                 self.trial_executor.start_trial(next_trial)
    275         elif self.trial_executor.get_running_trials():
--> 276             self._process_events()  # blocking
    277         else:
    278             for trial in self._trials:

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/tune/trial_runner.py in _process_events(self)
    438 
    439     def _process_events(self):
--> 440         trial = self.trial_executor.get_next_available_trial()  # blocking
    441         with warn_if_slow("process_trial"):
    442             self._process_trial(trial)

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py in get_next_available_trial(self)
    290         # See https://github.com/ray-project/ray/issues/4211 for details.
    291         start = time.time()
--> 292         [result_id], _ = ray.wait(shuffled_results)
    293         wait_time = time.time() - start
    294         if wait_time > NONTRIVIAL_WAIT_TIME_THRESHOLD_S:

/Users/wgroves/venv/tensorflow/lib/python3.6/site-packages/ray/worker.py in wait(object_ids, num_returns, timeout)
   2322             timeout_milliseconds,
   2323             False,
-> 2324             worker.current_task_id,
   2325         )
   2326         return ready_ids, remaining_ids

python/ray/_raylet.pyx in ray._raylet.RayletClient.wait()

python/ray/_raylet.pyx in ray._raylet.check_status()

Exception: [RayletClient] Raylet connection closed.

2019-05-25 22:23:57,216	ERROR worker.py:1678 -- The node with client ID 13f2e72398ffd893090f7fd31bae0c1409755432 has been marked dead because the monitor has missed too many heartbeats from it.

raylet_monitor.err file:

WARNING: Logging before InitGoogleLogging() is written to STDERR
W0525 22:23:57.210012 3839128512 monitor.cc:49] Client timed out: 13f2e72398ffd893090f7fd31bae0c1409755432

raylet.err file:

WARNING: Logging before InitGoogleLogging() is written to STDERR
W0525 22:13:38.203748 3839128512 stats.h:50] Failed to create the Prometheus exporter. This doesn't affect anything except stats. Caused by: null context when constructing CivetServer. Possible problem binding to port.
W0525 22:14:23.657073 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:25.661929 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:27.668455 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:29.674670 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:31.684083 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:33.697531 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:47.302045 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:49.309584 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:51.318904 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:53.325412 3839128512 node_manager.cc:1726] Resources oversubscribed: {}
W0525 22:14:56.674257 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:14:58.680490 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:15:10.775121 3839128512 node_manager.cc:1726] Resources oversubscribed: {CPU,0.000000}
W0525 22:15:12.781167 3839128512 node_manager.cc:1726] Resources oversubscribed: {}
...
W0525 22:22:57.761237 3839128512 node_manager.cc:1726] Resources oversubscribed: {}
W0525 22:23:21.603353 3839128512 node_manager.cc:1726] Resources oversubscribed: {}
F0525 22:23:25.221491 3839128512 scheduling_resources.cc:171]  Check failed: resource_capacity_.count(resource_label) == 1 Attempt to acquire unknown resource: CPU
*** Check failure stack trace: ***
*** Aborted at 1558837405 (unix time) try "date -d @1558837405" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fffdbf5ed42) received by PID 85294 (TID 0x7fffe4d473c0) stack trace: ***
    @     0x7fffdc03fb3a _sigtramp
    @        0x10ac0f551 (unknown)
    @     0x7fffdbec4420 abort
    @        0x105e1d899 google::logging_fail()
    @        0x105e1c5d6 google::LogMessage::SendToLog()
    @        0x105e1ccc5 google::LogMessage::Flush()
    @        0x105e1cda2 google::LogMessage::~LogMessage()
    @        0x105e16835 ray::RayLog::~RayLog()
    @        0x105d42a8c ray::raylet::ResourceSet::SubtractResourcesStrict()
    @        0x105d1368b ray::raylet::NodeManager::HandleTaskUnblocked()
    @        0x105d10748 ray::raylet::NodeManager::ProcessClientMessage()
    @        0x105d3331c std::__1::__function::__func<>::operator()()
    @        0x105e081fc ray::ClientConnection<>::ProcessMessage()
    @        0x105e101b8 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    @        0x105cecff3 boost::asio::detail::scheduler::do_run_one()
    @        0x105ceca42 boost::asio::detail::scheduler::run()
    @        0x105cdf561 main
    @     0x7fffdbe30235 start
    @               0x10 (unknown)
@richardliaw
Contributor

Looks like there's some issue with resource oversubscription that becomes more and more likely over time? Can you try num_cpus=3 to see if this reproduces faster?

@willgroves
Author

There appears to be a relationship between the num_cpus setting and the time before the error is encountered. If the setting is low enough, the error does not appear.

Here are some tests based on the sample code, run on a 2-core (4 virtual cores with hyperthreading) MacBook Pro.

  • num_cpus=3: ran 32 minutes, no error
  • num_cpus=4: raylet connection closed exception after 7 minutes
  • num_cpus=5: raylet connection closed exception after 6 minutes
  • num_cpus=6: raylet connection closed exception after 2.5 minutes
  • num_cpus=7: raylet connection closed exception after 3 minutes
  • num_cpus=12: raylet connection closed exception after 1 minute 35 seconds
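
Presumably only the num_cpus argument to ray.init in the script above was changed between these runs, e.g.:

ray.init(num_cpus=4, resources={"datasource": 1},
         object_store_memory=200000000, log_to_driver=True)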

@richardliaw
Contributor

Hm, looks like some resource bookkeeping bug? cc @stephanie-wang @robertnishihara @pcmoritz

@pcmoritz pcmoritz changed the title [tune] Raylet connection closed Exception in Ray Tune optimization when trainable calls a Ray Actor. [tune] Raylet crashing if zero resource capacity is used May 29, 2019
@pcmoritz
Contributor

pcmoritz commented May 29, 2019

This is potentially related to 0f42f87 but I'm not super sure. Let me try to reproduce it.

@pcmoritz
Contributor

@willgroves So I can't reproduce the bug with the script you provided (I ran it a couple of times). Is the reproduction fairly deterministic or does the problem happen rarely?

I can reproduce a related but different error: #4892

@richardliaw
Contributor

@pcmoritz did you try setting num_cpus to 20?

@robertnishihara
Collaborator

robertnishihara commented Jun 7, 2019

Ok, I can reproduce the crash using the original script (with num_cpus=12).

EDIT: Not consistently, unfortunately, but it does happen sometimes.

@robertnishihara
Collaborator

#4945 should fix this; however, I still don't completely understand the issue because I haven't been able to produce a minimal example that triggers it.
