[core] Ray shutdown in a context manager is not fully executed on error #35005

Open
krfricke opened this issue May 3, 2023 · 0 comments
Labels
core (Issues that should be addressed in Ray Core)
P1 (Issue that should be fixed within a few weeks)

krfricke (Contributor) commented May 3, 2023

What happened + What you expected to happen

When a Ray cluster is shut down on exit of a context manager, it is not always fully shut down if the context manager is exited with an exception.

In our CI, this can leave a broken state where a Ray cluster appears to be running (but isn't), and subsequent tests fail on ray.init(num_cpus=n) because a Ray cluster seemingly already exists.

This behavior was uncovered in #35004; see the reproduction script below.

Versions / Dependencies

master

Reproduction script

import contextlib
import ray
from ray.cluster_utils import Cluster


@contextlib.contextmanager
def ray_start_2_node_cluster(num_cpus_per_node: int, num_gpus_per_node: int):
    cluster = Cluster()
    for _ in range(2):
        cluster.add_node(num_cpus=num_cpus_per_node, num_gpus=num_gpus_per_node)

    ray.init(address=cluster.address)

    yield

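    # NOTE: If the body of the `with` block raises, execution never reaches
    # the shutdown calls below, so the cluster is left running.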
    ray.shutdown()
    cluster.shutdown()


with ray_start_2_node_cluster(2, 1):
    raise RuntimeError

A subsequent ray status yields:

(ray) kai@192:~/coding/ > ray status
Traceback (most recent call last):
  File "/Users/kai/coding/ray/python/ray/_private/gcs_utils.py", line 123, in check_health
    resp = stub.CheckAlive(req, timeout=timeout)
  File "/Users/kai/.pyenv/versions/3.8.16/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/kai/.pyenv/versions/3.8.16/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-05-03T14:35:01.329305+01:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-05-03T14:35:01.329302+01:00", grpc_status:14}]}"
>
Ray cluster is not found at 127.0.0.1:63032

Issue Severity

Medium: It is a significant difficulty but I can work around it.
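
For reference, one possible workaround (a minimal sketch based on the reproduction script above, not a fix confirmed in this issue) is to move the shutdown calls into a try/finally block so they run even when the body of the with block raises:

import contextlib
import ray
from ray.cluster_utils import Cluster


@contextlib.contextmanager
def ray_start_2_node_cluster(num_cpus_per_node: int, num_gpus_per_node: int):
    cluster = Cluster()
    for _ in range(2):
        cluster.add_node(num_cpus=num_cpus_per_node, num_gpus=num_gpus_per_node)

    ray.init(address=cluster.address)

    try:
        yield
    finally:
        # Runs on both normal exit and exception, so the cluster is always torn down.
        ray.shutdown()
        cluster.shutdown()

With the shutdown calls in the finally block, they are executed regardless of whether the with block body raises.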

krfricke added the bug and triage labels on May 3, 2023
matthewdeng added the core label on May 3, 2023
rkooo567 added the flaky-tracker and P1 labels and removed the bug and triage labels on May 10, 2023
anyscalesam removed the flaky-tracker label on Jan 10, 2024