[tune] Raylet crashing if zero resource capacity is used #4870
Comments
Looks like there's some issue with resource subscription that becomes more and more likely over time? Can you try
There appears to be a relationship between the num_cpus setting and the time before the error is encountered. If the setting is low enough, the error does not appear. Here are some tests based on the sample code, run on a 2-core MacBook Pro with hyperthreading (4 virtual cores).
hm looks like some resource bookkeeping bug? cc @stephanie-wang @robertnishihara @pcmoritz
Raylet connection closed
Exception in Ray Tune optimization when trainable calls a Ray Actor.
This is potentially related to 0f42f87 but I'm not super sure. Let me try to reproduce it.
@willgroves So I can't reproduce the bug with the script you provided (I ran it a couple of times). Is the reproduction fairly deterministic or does the problem happen rarely? I can reproduce a related but different error: #4892
@pcmoritz did you try setting num_cpus to 20?
Ok, I can reproduce the crash using the original script (with EDIT: Not consistently unfortunately, but sometimes.
#4945 should fix this, however, I still don't completely understand the issue because I haven't been able to produce a minimal example that triggers the issue.
System information
Describe the problem
I am attempting to use a Ray Actor (functioning as a data reader/cache singleton) inside of a function that is being optimized using Ray Tune. After the optimization process runs for about 10 minutes, it always fails with a Raylet connection closed exception and quits the optimization process.
e.g.
The error seems to be related to the duration of the optimization task, not the CPU load, as is evident from the minimal example (see below). Even with various changes to the task (run on OSX, run on Ubuntu, change the Python version, set reuse actors to false/true, change the number of CPUs in init), the error still appears. Minimal code to reproduce the error is provided below. Is this a bug in the heartbeat generation for Ray Actors?
A functioning workaround was to remove the Ray Actor and instead create a function that uses a filesystem lock file to create the singleton. Are Ray Actors supported as part of a Ray Tune optimization process?
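For anyone who wants to try the same workaround, here is a minimal sketch of what a lock-file-guarded singleton could look like. This is not the code from the original project: `expensive_load`, `get_cached_data`, and the cache/lock paths are hypothetical names, the JSON cache format is an assumption, and `fcntl.flock` limits it to POSIX systems (which covers the OSX and Ubuntu setups mentioned above).

```python
import fcntl
import json
import os
import tempfile

# Hypothetical locations; any path visible to every worker process would do.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "tune_data_cache.json")
LOCK_PATH = CACHE_PATH + ".lock"


def expensive_load():
    """Stand-in for the real data reader (hypothetical)."""
    return {"values": [1, 2, 3]}


def get_cached_data(loader=expensive_load, cache_path=CACHE_PATH, lock_path=LOCK_PATH):
    """Return the cached data, building it at most once under an exclusive file lock."""
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(cache_path):
                # Another process (or an earlier call) already built the cache.
                with open(cache_path) as f:
                    return json.load(f)
            data = loader()
            # Write to a temporary file and rename it so a partial cache is never visible.
            tmp_path = cache_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(data, f)
            os.replace(tmp_path, cache_path)
            return data
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Each trial would call `get_cached_data()` at the start of the trainable instead of calling a method on the actor; the first caller pays the load cost and later callers read the cached file.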
Source code / logs
Minimal source code to reproduce the error (takes about 10 minutes to hit the error):
Observed Exception (OSX 10.12, Python 3.6.0, Ray 0.7.0 from pip):
raylet_monitor.err file:
raylet.err file: