[core] Ray shutdown in a context manager is not fully executed on error #35005

Open
krfricke opened this issue May 3, 2023 · 0 comments
Labels
core (Issues that should be addressed in Ray Core)
P1 (Issue that should be fixed within a few weeks)

krfricke (Contributor) commented May 3, 2023

What happened + What you expected to happen

When a Ray cluster is shut down on exit of a context manager, it is not always fully shut down if the context manager is exited with an exception.

In our CI, this can leave a broken state where a Ray cluster appears to be running (but isn't), and subsequent tests fail on ray.init(num_cpus=n) because a Ray cluster seemingly already exists.

This behavior was uncovered in #35004; see the reproduction script below.

Versions / Dependencies

master

Reproduction script

import contextlib
import ray
from ray.cluster_utils import Cluster


@contextlib.contextmanager
def ray_start_2_node_cluster(num_cpus_per_node: int, num_gpus_per_node: int):
    cluster = Cluster()
    for _ in range(2):
        cluster.add_node(num_cpus=num_cpus_per_node, num_gpus=num_gpus_per_node)

    ray.init(address=cluster.address)

    yield

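    # NOTE: If the body of the `with` block raises, execution never reaches
    # the shutdown calls below, so the cluster is left running.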
    ray.shutdown()
    cluster.shutdown()


with ray_start_2_node_cluster(2, 1):
    raise RuntimeError

A subsequent ray status yields:

(ray) kai@192:~/coding/ > ray status
Traceback (most recent call last):
  File "/Users/kai/coding/ray/python/ray/_private/gcs_utils.py", line 123, in check_health
    resp = stub.CheckAlive(req, timeout=timeout)
  File "/Users/kai/.pyenv/versions/3.8.16/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/kai/.pyenv/versions/3.8.16/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-05-03T14:35:01.329305+01:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-05-03T14:35:01.329302+01:00", grpc_status:14}]}"
>
Ray cluster is not found at 127.0.0.1:63032

Issue Severity

Medium: It is a significant difficulty but I can work around it.
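
For reference, one possible workaround (a minimal sketch based on the reproduction script above, not a fix confirmed in this issue) is to move the shutdown calls into a try/finally block so they run even when the body of the with block raises:

import contextlib
import ray
from ray.cluster_utils import Cluster


@contextlib.contextmanager
def ray_start_2_node_cluster(num_cpus_per_node: int, num_gpus_per_node: int):
    cluster = Cluster()
    for _ in range(2):
        cluster.add_node(num_cpus=num_cpus_per_node, num_gpus=num_gpus_per_node)

    ray.init(address=cluster.address)

    try:
        yield
    finally:
        # Runs on both normal exit and exception, so the cluster is always torn down.
        ray.shutdown()
        cluster.shutdown()

With the shutdown calls in the finally block, they are executed regardless of whether the with block body raises.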

krfricke added the bug and triage labels on May 3, 2023
matthewdeng added the core label on May 3, 2023
rkooo567 added the flaky-tracker and P1 labels and removed the bug and triage labels on May 10, 2023
anyscalesam removed the flaky-tracker label on Jan 10, 2024