[workflow] Fix the object loss due to driver exit issues. #29092
Conversation
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
We should get the workflow manager actor at the beginning of the function instead, so we don't fetch the actor repeatedly for every object.
Also, could you verify whether this works under client mode?
@@ -168,7 +168,10 @@ def _node_visitor(node: Any) -> Any:
     flattened_args = _SerializationContextPreservingWrapper(
         flattened_args
     )
     input_placeholder: ray.ObjectRef = ray.put(flattened_args)
+    workflow_manager = workflow_access.get_management_actor()
Client part looks good
src/ray/protobuf/ray_client.proto
@@ -114,6 +114,8 @@ message PutRequest {
   int32 total_chunks = 4;
   // Total size in bytes of the data being put
   int64 total_size = 5;
+  // The owner of the put
+  bytes _owner_id = 6;
nit: any reason why this needs to be prefixed with underscore?
I'm just following the non-client-mode API:
https://github.com/ray-project/ray/blob/master/python/ray/_private/worker.py#L2314
which is `ray.put(_owner=xxx)`. But since this is protobuf, I think I can change it to the name without the underscore. Up to you.
If you can drop the underscore from the protobuf definition, that would be nice; I think there's some wonky behavior sometimes when a field name starts with a non-alphabetical character. You can leave it the same everywhere else.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
LGTM.
 with disable_client_hook():
-    objectref = ray.put(obj)
+    objectref = ray.put(obj, _owner=owner)
Add a comment explaining why `_owner` is used here.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
When the workflow runs in driver mode, the owner of the object ref is the driver. So when the driver exits, the objects are no longer available. This happens when we run with `run_async`. This PR fixes this by passing the manager actor as the owner of the objects.
Hey @iycheng - sorry for jumping in, but do I understand correctly that this change would fix the issue I've reported: #29253?
@SebastianMorawiec sorry I missed your message. This should fix your reported issue. The root cause is ownership: in Ray, whoever creates an object owns it, and here that is the driver. When the owner dies (the driver exits), the object is lost. The fix is to make the manager actor the owner, so the object stays available until no one is using it anymore.
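The ownership rule described above can be illustrated with a toy model. This is plain Python, not Ray's actual implementation: the `ObjectStore` class, its methods, and the `"driver"`/`"manager"` owner names are all hypothetical, chosen only to show why reassigning ownership to a long-lived manager keeps objects alive after the driver exits.

```python
class ObjectStore:
    """Toy model of Ray's ownership rule: an object's lifetime is
    bounded by its owner's lifetime (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}  # ref -> (owner, value)

    def put(self, ref, value, owner):
        # Register an object under the given owner.
        self._objects[ref] = (owner, value)

    def owner_exited(self, owner):
        # When an owner dies, every object it owns is lost.
        self._objects = {
            ref: (o, v)
            for ref, (o, v) in self._objects.items()
            if o != owner
        }

    def get(self, ref):
        entry = self._objects.get(ref)
        return None if entry is None else entry[1]


store = ObjectStore()

# Without the fix: the driver creates (and therefore owns) the object.
store.put("ref1", "workflow-arg", owner="driver")
# With the fix: ownership is assigned to the long-lived manager actor.
store.put("ref2", "workflow-arg", owner="manager")

# The driver exits right after kicking off the workflow with run_async.
store.owner_exited("driver")

print(store.get("ref1"))  # None: lost together with the driver
print(store.get("ref2"))  # workflow-arg: still owned by the manager
```

In the real PR, the same idea is expressed by calling `ray.put(..., _owner=workflow_manager)` so the workflow management actor, rather than the driver, owns the stored arguments.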
…t#29092)

When the workflow runs in driver mode, the owner of the object ref is the driver. So when the driver exits, the objects are no longer available. This happens when we run with `run_async`. This PR fixed this by passing the manager actor as the owner of the objects.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Why are these changes needed?
When the workflow runs in driver mode, the owner of the object ref is the driver. So when the driver exits, the objects are no longer available. This happens when we run with `run_async`. This PR fixes this by passing the manager actor as the owner of the objects.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.