[workflow] Fix the object loss due to driver exit issues. #29092

Merged 7 commits on Oct 6, 2022

Contributor

@fishbone fishbone commented Oct 5, 2022

Why are these changes needed?

When the workflow runs in driver mode, the owner of each object ref is the driver, so the objects become unavailable as soon as the driver exits. This happens when we run with run_async.

This PR fixes this by passing the manager actor as the owner of the objects.
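
For readers unfamiliar with the call, here is a minimal sketch of what "passing the manager actor as the owner" could look like. It relies on the private `_owner` keyword of `ray.put` that is quoted later in this conversation. The helper name `put_owned_by` and its `put` parameter are hypothetical and exist only so the sketch can be exercised without a running cluster; this is not the code from the PR.

```python
from typing import Any, Callable, Optional


def put_owned_by(
    value: Any,
    owner: Any,
    put: Optional[Callable[..., Any]] = None,
) -> Any:
    """Store `value` with `owner` (an actor handle) as the object's owner.

    `put` is a hypothetical injection point for testing; when it is omitted,
    the real `ray.put` is used.
    """
    if put is None:
        import ray  # imported lazily so the sketch loads without Ray installed

        put = ray.put
    # Without `_owner`, the calling process (here, the driver) owns the object.
    return put(value, _owner=owner)
```

In the workflow code, the owner would presumably be the handle returned by `workflow_access.get_management_actor()`, as in the diff hunk shown further down.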

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Member

@suquark suquark left a comment


Member

suquark commented Oct 5, 2022

Also could you verify if this works under client mode?

@@ -168,7 +168,10 @@ def _node_visitor(node: Any) -> Any:
flattened_args = _SerializationContextPreservingWrapper(
flattened_args
)
input_placeholder: ray.ObjectRef = ray.put(flattened_args)
workflow_manager = workflow_access.get_management_actor()
Member

we should get the workflow actor manager at the beginning of the function instead, so we won't get the actors repeatedly for every object
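
To illustrate the point of this suggestion, here is a small sketch that compares looking up the manager for every object with looking it up once at the top of the function. `CountingLookup`, `fake_put`, and both function names are hypothetical stand-ins; the real lookup is `workflow_access.get_management_actor()`, and nothing here touches Ray.

```python
from typing import Any, Callable, Iterable, List, Tuple


class CountingLookup:
    """Hypothetical stand-in for the actor lookup; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        return "manager-handle"


def fake_put(value: Any, owner: Any) -> Tuple[Any, Any]:
    """Hypothetical stand-in for ray.put(value, _owner=owner)."""
    return (value, owner)


def put_all_lookup_per_object(
    items: Iterable[Any], get_manager: Callable[[], Any]
) -> List[Any]:
    refs = []
    for item in items:
        manager = get_manager()  # repeated for every object
        refs.append(fake_put(item, manager))
    return refs


def put_all_lookup_hoisted(
    items: Iterable[Any], get_manager: Callable[[], Any]
) -> List[Any]:
    manager = get_manager()  # fetched once, at the beginning of the function
    return [fake_put(item, manager) for item in items]


per_object = CountingLookup()
hoisted = CountingLookup()
refs_a = put_all_lookup_per_object(["a", "b", "c"], per_object)
refs_b = put_all_lookup_hoisted(["a", "b", "c"], hoisted)
print(per_object.calls, hoisted.calls)  # prints "3 1"
```

Both variants produce the same refs; only the number of lookups differs.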

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Contributor Author

fishbone commented Oct 6, 2022

@suquark the ray client is supported.

@ckw017 could you take a look at the ray client updates?

Member

@ckw017 ckw017 left a comment

Client part looks good

@@ -114,6 +114,8 @@ message PutRequest {
int32 total_chunks = 4;
// Total size in bytes of the data being put
int64 total_size = 5;
// The owner of the put
bytes _owner_id = 6;
Member

nit: any reason why this needs to be prefixed with underscore?

Contributor Author

I'm just following the non-client-mode API:

https://github.com/ray-project/ray/blob/master/python/ray/_private/worker.py#L2314

where it's ray.put(_owner=xxx). But since this is protobuf, I think I can change it to a name without the underscore. Up to you.

Member

@ckw017 ckw017 Oct 6, 2022

If you can drop the underscore from the protobuf definition, that would be nice. I think there's sometimes wonky behavior when a field name is prefixed with a non-alphabetical character. You can leave it the same elsewhere.
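
A rough sketch of the naming split discussed in this thread: the Python-facing keyword keeps its leading underscore to mirror `ray.put(_owner=...)`, while the wire field gets a plain name. `PutRequestSketch` is a hypothetical dataclass stand-in, not the generated protobuf class; the `data` field and the `build_put_request` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PutRequestSketch:
    """Hypothetical stand-in for the PutRequest message, not the generated class."""

    data: bytes              # assumed payload field; not shown in the diff above
    total_chunks: int = 1
    total_size: int = 0
    owner_id: bytes = b""    # plain name on the wire, without the underscore


def build_put_request(
    data: bytes, _owner_id: Optional[bytes] = None
) -> PutRequestSketch:
    # The Python keyword keeps its underscore; the message field does not.
    return PutRequestSketch(
        data=data,
        total_size=len(data),
        owner_id=_owner_id or b"",
    )
```

An empty `owner_id` here stands for "no explicit owner", which is how an unset bytes field would read on the receiving side.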

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Member

@suquark suquark left a comment

LGTM.

  with disable_client_hook():
-     objectref = ray.put(obj)
+     objectref = ray.put(obj, _owner=owner)
Member

Please add a comment explaining why _owner is used here.

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@fishbone fishbone merged commit 69ece85 into ray-project:master Oct 6, 2022
maxpumperla pushed a commit that referenced this pull request Oct 7, 2022

SebastianMorawiec commented Nov 12, 2022

Hey @iycheng - sorry for jumping in, but as I understand it, this change should fix the issue I reported in #29253, am I right?

Contributor Author

fishbone commented Dec 13, 2022

@SebastianMorawiec sorry, I missed your message. This should fix the issue you reported. The root cause is ownership: in Ray, whoever creates an object owns it, and here that is the driver. When the owner dies (the driver exits), the object is lost.

The fix is to make the manager the owner, so the object stays available until no one is using it.
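
The ownership rule described above can be sketched with a toy model. This is a deliberately simplified illustration in plain Python, not Ray's implementation: `ToyObjectStore` and the process names are hypothetical, and reference counting is not modeled.

```python
from typing import Any, Dict, Optional, Tuple


class ToyObjectStore:
    """Toy model: every object records an owner and is lost when that owner exits."""

    def __init__(self) -> None:
        self._objects: Dict[int, Tuple[str, Any]] = {}  # ref -> (owner, value)
        self._next_ref = 0

    def put(self, value: Any, creator: str, owner: Optional[str] = None) -> int:
        # By default the creator owns the object; `owner` overrides that.
        actual_owner = creator if owner is None else owner
        ref = self._next_ref
        self._next_ref += 1
        self._objects[ref] = (actual_owner, value)
        return ref

    def exit(self, process: str) -> None:
        # Objects owned by the exiting process go away with it.
        self._objects = {
            ref: (owner, value)
            for ref, (owner, value) in self._objects.items()
            if owner != process
        }

    def get(self, ref: int) -> Any:
        if ref not in self._objects:
            raise LookupError("object lost: its owner has exited")
        return self._objects[ref][1]


store = ToyObjectStore()
driver_owned = store.put("workflow input", creator="driver")
manager_owned = store.put("workflow input", creator="driver", owner="manager")
store.exit("driver")  # the driver exits while the workflow is still running
```

After the driver exits, `store.get(manager_owned)` still returns the value, while `store.get(driver_owned)` raises `LookupError`, which mirrors the object loss this PR addresses.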

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…t#29092)


Signed-off-by: Weichen Xu <weichen.xu@databricks.com>