[workflow] Fix the object loss due to driver exit issues. #29092
Conversation
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
We should get the workflow manager actor at the beginning of the function instead, so we don't fetch the actor repeatedly for every object.
Also, could you verify whether this works under client mode?
@@ -168,7 +168,10 @@ def _node_visitor(node: Any) -> Any:
     flattened_args = _SerializationContextPreservingWrapper(
         flattened_args
     )
     input_placeholder: ray.ObjectRef = ray.put(flattened_args)
+    workflow_manager = workflow_access.get_management_actor()
Client part looks good
src/ray/protobuf/ray_client.proto
@@ -114,6 +114,8 @@ message PutRequest {
   int32 total_chunks = 4;
   // Total size in bytes of the data being put
   int64 total_size = 5;
+  // The owner of the put
+  bytes _owner_id = 6;
nit: any reason why this needs to be prefixed with underscore?
I'm just following the non-client-mode API:
https://github.com/ray-project/ray/blob/master/python/ray/_private/worker.py#L2314
which is `ray.put(_owner=xxx)`. But since this is protobuf, I think I can change it to the name without the underscore. Up to you.
If you can drop the underscore from the protobuf definition, that would be nice; I think there's some wonky behavior sometimes when a field name starts with a non-alphabetical character. You can leave it the same everywhere else.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
LGTM.
 with disable_client_hook():
-    objectref = ray.put(obj)
+    objectref = ray.put(obj, _owner=owner)
Add a comment explaining why `_owner` is used here.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
When the workflow runs in driver mode, the owner of the object ref is the driver. So when the driver exits, the objects are no longer available. This happens when we run with `run_async`. This PR fixes this by passing the manager actor as the owner of the objects.
Hey @iycheng - sorry for jumping in, but do I understand correctly that this change would fix the issue I've reported: #29253?
@SebastianMorawiec sorry I missed your message. This should fix your reported issue. The root cause is ownership: in Ray, whoever creates an object owns it, and here that is the driver. When the owner dies (the driver exits), the object is lost. The fix is to make the manager actor the owner, so the object stays available until no one is using it anymore.
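The ownership rule described above can be illustrated with a toy model. This is plain Python, not Ray's actual implementation: the `ObjectStore` class, its methods, and the `"driver"`/`"manager"` owner names are all hypothetical, chosen only to show why reassigning ownership to a long-lived manager keeps objects alive after the driver exits.

```python
class ObjectStore:
    """Toy model of Ray's ownership rule: an object's lifetime is
    bounded by its owner's lifetime (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}  # ref -> (owner, value)

    def put(self, ref, value, owner):
        # Register an object under the given owner.
        self._objects[ref] = (owner, value)

    def owner_exited(self, owner):
        # When an owner dies, every object it owns is lost.
        self._objects = {
            ref: (o, v)
            for ref, (o, v) in self._objects.items()
            if o != owner
        }

    def get(self, ref):
        entry = self._objects.get(ref)
        return None if entry is None else entry[1]


store = ObjectStore()

# Without the fix: the driver creates (and therefore owns) the object.
store.put("ref1", "workflow-arg", owner="driver")
# With the fix: ownership is assigned to the long-lived manager actor.
store.put("ref2", "workflow-arg", owner="manager")

# The driver exits right after kicking off the workflow with run_async.
store.owner_exited("driver")

print(store.get("ref1"))  # None: lost together with the driver
print(store.get("ref2"))  # workflow-arg: still owned by the manager
```

In the real PR, the same idea is expressed by calling `ray.put(..., _owner=workflow_manager)` so the workflow management actor, rather than the driver, owns the stored arguments.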
…t#29092)

When the workflow runs in driver mode, the owner of the object ref is the driver. So when the driver exits, the objects are no longer available. This happens when we run with `run_async`. This PR fixed this by passing the manager actor as the owner of the objects.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Why are these changes needed?
When the workflow runs in driver mode, the owner of the object ref is the driver. So when the driver exits, the objects are no longer available. This happens when we run with `run_async`. This PR fixes this by passing the manager actor as the owner of the objects.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.