What happened + What you expected to happen
When a Ray cluster is shut down on exit of a context manager, it is not always fully shut down if the context manager exits with an exception.
In our CI, this can leave a broken state where a Ray cluster appears to be running (but is not functional), and tests fail on ray.init(num_cpus=n) because a Ray cluster already exists.
This behavior was uncovered in #35004. See the reproduction script below.
Versions / Dependencies
master
Reproduction script
import contextlib

import ray
from ray.cluster_utils import Cluster


@contextlib.contextmanager
def ray_start_2_node_cluster(num_cpus_per_node: int, num_gpus_per_node: int):
    cluster = Cluster()
    for _ in range(2):
        cluster.add_node(num_cpus=num_cpus_per_node, num_gpus=num_gpus_per_node)
    ray.init(address=cluster.address)
    yield
    ray.shutdown()
    cluster.shutdown()


with ray_start_2_node_cluster(2, 1):
    raise RuntimeError
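For the repro as written, one possible mitigation (not part of the original report) is to move the teardown into a try/finally block so that ray.shutdown() and cluster.shutdown() still run when the with-block raises. A minimal sketch, assuming the same Cluster utility as above:

import contextlib

import ray
from ray.cluster_utils import Cluster


@contextlib.contextmanager
def ray_start_2_node_cluster(num_cpus_per_node: int, num_gpus_per_node: int):
    cluster = Cluster()
    try:
        for _ in range(2):
            cluster.add_node(
                num_cpus=num_cpus_per_node, num_gpus=num_gpus_per_node
            )
        ray.init(address=cluster.address)
        yield
    finally:
        # Runs even if the body of the with-block raises, so the cluster
        # is torn down instead of lingering in a half-alive state.
        ray.shutdown()
        cluster.shutdown()

Whether this also covers the partial-shutdown case described above would still need to be verified in CI.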
A subsequent ray status yields:
(ray) kai@192:~/coding/ > ray status
Traceback (most recent call last):
File "/Users/kai/coding/ray/python/ray/_private/gcs_utils.py", line 123, in check_health
resp = stub.CheckAlive(req, timeout=timeout)
File "/Users/kai/.pyenv/versions/3.8.16/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/Users/kai/.pyenv/versions/3.8.16/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-05-03T14:35:01.329305+01:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-05-03T14:35:01.329302+01:00", grpc_status:14}]}"
>
Ray cluster is not found at 127.0.0.1:63032
Issue Severity
Medium: It is a significant difficulty but I can work around it.