[Workflow] Workflow fails if python script calling it exits #29253

Closed
SebastianMorawiec opened this issue Oct 12, 2022 · 1 comment
Labels
bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P1 (Issue that should be fixed within a few weeks)

Comments


SebastianMorawiec commented Oct 12, 2022

What happened + What you expected to happen

When I create and run an async workflow, I expect it to keep running in the background, because the docs state:
The lifetime of the workflow is not coupled with the driver. If the driver exits, the workflow will continue running in the background of the cluster.
However, when the Python script spawns a workflow and exits, the workflow fails. The logs contain something like this:
The object's owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (/tmp/ray/session_latest/logs/*05000000ffffffffffffffffffffffffffffffffffffffffffffffff* at IP address 127.0.0.1) for more information about the Python worker failure.
The workflow manager is started as detached, so as I understand it, it should be kept alive.
Even more confusing, the "Jobs" view in the Ray Dashboard shows the job as successful.

I'm attaching a reproduction script. Calling it twice within 30 seconds shows that the first submitted workflow failed (on the second run you can comment out the create_workflow() call to avoid submitting new workflows).
Before running the script for the first time, I started a local cluster with:
ray start --head --temp-dir="$HOME/ray" --storage="$HOME/ray_storage"
It doesn't matter whether I call ray.init with the explicit local address (commented out) or with 'auto'.
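When the script is run a second time, the output of ray.workflow.list_all() is easier to read if the workflows are grouped by status. The following is a minimal sketch, not part of the original report: `group_by_status` is a hypothetical helper, and it assumes the input is a sequence of (workflow_id, status) pairs where the status is either a plain string or an enum member with a `.value` attribute. The ids and statuses in the example are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_status(pairs: Iterable[Tuple[str, object]]) -> Dict[str, List[str]]:
    """Group workflow ids by status name.

    `pairs` is assumed to look like the output of ray.workflow.list_all():
    an iterable of (workflow_id, status) tuples. The status may be an enum
    member or a plain string; both are reduced to a string key.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for workflow_id, status in pairs:
        # Enum members expose .value; plain strings are used as-is.
        key = str(getattr(status, "value", status))
        grouped[key].append(workflow_id)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical ids and statuses, for illustration only.
    sample = [("workflow-a", "FAILED"), ("workflow-b", "RUNNING")]
    for status, ids in group_by_status(sample).items():
        print(status, ids)
```

In the reproduction script, the list returned by ray.workflow.list_all() could be passed to this helper in place of the sample data.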

Versions / Dependencies

Plain Ray 2.0 on an M1 Mac.

Reproduction script

Hint: If I remove time.sleep(20) from the multiply function, the workflow completes successfully.


import ray
import ray.workflow
import time


@ray.remote
def multiply(num: int):
    time.sleep(20)
    return 2*num


@ray.remote
def addify(num: int):
    time.sleep(5)
    return 2+num


def create_workflow():
    multiplied = multiply.bind(10)
    final_value = addify.bind(multiplied)
    return ray.workflow.run_async(final_value)


if __name__ == "__main__":
    # ray.init(address="127.0.0.1:6379")
    ray.init(address='auto')
    ray.workflow.init()
    create_workflow()
    print(ray.workflow.list_all())
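Since the failure only appears when the driver exits while a task is still running, one possible stopgap on an affected version is to keep the driver alive until the workflow leaves the running state. The sketch below is an assumption-laden illustration, not a fix and not something tested against this bug: `wait_until_terminal` is a hypothetical helper, the status lookup is injected as a callable, and the set of terminal status names is assumed rather than taken from the report.

```python
import time
from typing import Callable, Collection


def wait_until_terminal(
    get_status: Callable[[str], object],
    workflow_id: str,
    # Assumed status names; adjust to whatever the installed Ray version reports.
    terminal: Collection[str] = ("SUCCESSFUL", "FAILED", "CANCELED", "RESUMABLE"),
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
) -> str:
    """Poll get_status(workflow_id) until it returns a terminal status name.

    Returns the terminal status name, or raises TimeoutError if the
    workflow is still not terminal once timeout_s seconds have passed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(workflow_id)
        # Accept either an enum member (with .value) or a plain string.
        name = str(getattr(status, "value", status))
        if name in terminal:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"workflow {workflow_id!r} still {name!r} after {timeout_s}s"
            )
        time.sleep(poll_s)
```

In the reproduction script this could be used by passing an explicit workflow_id to ray.workflow.run_async and then calling wait_until_terminal(ray.workflow.get_status, workflow_id) before the script ends. Doing so ties the workflow to the driver's lifetime again, so it only sidesteps the problem described above.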

Issue Severity

High: It blocks me from completing my task.

@SebastianMorawiec SebastianMorawiec added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 12, 2022
@SebastianMorawiec SebastianMorawiec changed the title [Workflow] Workflow fails if python script calling it exists [Workflow] Workflow fails if python script calling it exits Oct 12, 2022
@hora-anyscale hora-anyscale added the core Issues that should be addressed in Ray Core label Nov 4, 2022
@hora-anyscale hora-anyscale added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 14, 2022
fishbone (Contributor) commented Dec 13, 2022

Fixed in #29092; the fix should be delivered in Ray 2.1.
Closing.
