Use detached lifetime for stats actor #25271

Merged
merged 18 commits into ray-project:master on Aug 12, 2022

Conversation

jianoaix
Contributor

@jianoaix jianoaix commented May 28, 2022

Why are these changes needed?

The actor handle held by the Ray client becomes dangling if the Ray cluster is shut down, and in that case trying to get the actor again results in a crash. This happened to a real user and blocked them from making progress.

This change makes the stats actor detached, and instead of keeping a handle, we access it via its name. This way the actor is re-created if the cluster gets restarted.
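
For illustration, here is a minimal sketch of the get-or-create pattern this change moves to. The class body, actor name, and namespace below are assumptions, not the exact values used in ray.data:

    import ray

    @ray.remote(num_cpus=0)
    class _StatsActor:
        # Illustrative stand-in for the real stats actor.
        def __init__(self):
            self.metadata = {}

    def _get_or_create_stats_actor():
        # A detached, named actor in a fixed namespace survives driver exits
        # and is simply re-created on demand after a cluster restart.
        return _StatsActor.options(
            name="datasets_stats_actor",   # assumed name, not the exact one used
            namespace="dataset_stats",     # fixed namespace so only one actor exists per cluster
            get_if_exists=True,            # return the existing actor if already created
            lifetime="detached",           # not tied to the creating driver's lifetime
        ).remote()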

Related issue number

Closes #25237

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

-# Actor handle, job id the actor was created for.
-_stats_actor = [None, None]
+# Actor handle, job id, client id the actor was created for.
+_stats_actor = [None, None, None]


def _get_or_create_stats_actor():
Contributor

Hmm, I don't remember why we did it this way. Would it be better to use actor.options(get_if_exists=True, name="_dataset_stats_actor").remote() instead to get-or-create the actor?

Contributor Author

I thought about changing to this option (getting by name) as well, but it doesn't seem to meet the requirement.
The requirement here is that this actor is reused across datasets created by the same process. For the get-by-name approach:

  • It may work for the same dataset, but after it's completed and the actor refcount goes to zero, datasets created later will not be able to get that same actor (it has to be re-created).
  • Alternatively, we may create a detached actor, but in that case it will be shared even across different driver processes.

Keeping a reference here and clearing it upon a new connection or a new driver is essentially what this PR does to serve this requirement (a rough sketch follows below).
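
For context, a rough sketch of that cache-and-clear scheme, reusing the illustrative _StatsActor from the earlier sketch; the real helper also tracks the Ray client connection, which is omitted here for brevity:

    import ray

    # Module-level cache: [actor handle, job id the actor was created for].
    _stats_actor = [None, None]

    def _get_or_create_stats_actor():
        global _stats_actor
        job_id = ray.get_runtime_context().job_id
        if _stats_actor[0] is None or _stats_actor[1] != job_id:
            # New driver (or first use): drop any stale handle and create a fresh actor.
            _stats_actor = [_StatsActor.remote(), job_id]
        return _stats_actor[0]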

Contributor

Hm, sharing across drivers should be ok, right? That should be an uncommon case anyway.

Contributor Author

If that's not a concern, then getting by name is indeed simpler.

Contributor

So this stats actor will never be cleaned up for the lifetime of the Ray cluster, and the read stats for each Dataset will never get cleaned up within the actor, which is a bit of a leaky lifecycle. This seems fine for now to unblock the Ray Client use case, but we should probably open a P2 to improve the stats actor lifecycle, or eliminate the stats actor all together, if possible.

Contributor Author

My previous code stayed quite close to the existing approach. I think the feedback here is that it's a bit complicated and that using a detached actor isn't a concern. It may make sense to have a new kind of lifetime between refcounted and detached, e.g. a per-job-lifetime actor.

@ericl ericl added the @author-action-required label May 28, 2022
@jianoaix jianoaix removed the @author-action-required label May 31, 2022
@jianoaix jianoaix changed the title Null out dangling stats actor handle held at Ray client Use detached lifetime for stats actor May 31, 2022
Contributor

@ericl ericl left a comment

Ah also need to set the namespace, otherwise you'll leak one actor per job.

@ericl ericl added the @author-action-required label May 31, 2022
@clarkzinzow
Contributor

@ericl @jianoaix So we can assume that the get_if_exists=True path is ~as cheap as the global actor handle cache since the core worker should cache actor handle fetches, right? It looks as if that path should only involve an RPC on the first fetch for a given worker process.

@ericl
Contributor

ericl commented May 31, 2022

Hmm that's a good question, we should validate this for sure prior to merging.

@jianoaix
Contributor Author

jianoaix commented Jun 2, 2022

Compared to the data loading in a read task, one RPC seems like a small cost?

Do we have a test to measure the impact of this?

@ericl
Contributor

ericl commented Jun 2, 2022

Simple enough to run a trivial dataset workload with small blocks before/after this PR. Maybe something like a 10000-block range + map_batches?

@jianoaix
Contributor Author

jianoaix commented Jun 2, 2022

Tried a simple test like this:

    import time

    import ray.data

    total_time = 0
    for _ in range(16):
        start_time = time.time()
        ds = ray.data.range(100000, parallelism=10000)
        ds.map_batches(lambda x: x)
        total_time += time.time() - start_time  # accumulate across runs
    print("mean time:", total_time / 16)

On a local cluster with 8 nodes and 1 cpu/node:

from ray.cluster_utils import Cluster


def build_cluster(num_nodes, num_cpus):
    cluster = Cluster()
    for _ in range(num_nodes):
        cluster.add_node(num_cpus=num_cpus)
    cluster.wait_for_nodes()
    return cluster


cluster = build_cluster(8, 1)

With this PR: mean time: 2.742537647485733
Without this PR: mean time: 2.706667184829712
The difference is 1.33%, which is small given that the blocks are very small (just 10 ints each). But the nodes are all on a laptop, so the RPC might be cheaper than in a real cluster.

Contributor

@ericl ericl left a comment

Good enough for me!

@clarkzinzow
Contributor

clarkzinzow commented Jun 2, 2022

Also, small nit - this docstring comment on the actor scope is not accurate anymore:

This actor is shared across all datasets created by the same process.

@jianoaix
Contributor Author

jianoaix commented Jun 2, 2022

Microbenchmark:

    import time

    import ray.data.impl.stats

    start_time = time.time()
    for _ in range(1000):
        ah = ray.data.impl.stats._get_or_create_stats_actor()
    print("mean time to get:", (time.time() - start_time) / 1000)

Before: 1.4783143997192383e-05 (sec)
After: 0.0005355322360992432 (sec)
Diff: 36x increase

@jianoaix jianoaix removed the @author-action-required label Aug 2, 2022
@jianoaix jianoaix added the @author-action-required and tests-ok labels and removed the @author-action-required label Aug 3, 2022
@jianoaix
Contributor Author

jianoaix commented Aug 3, 2022

I was looking at the test failures, but it turned out they were already on the flaky test list. So the PR is ready to review/merge.

@jianoaix
Contributor Author

jianoaix commented Aug 3, 2022

@ericl @clarkzinzow ptal, thanks

Contributor

@clarkzinzow clarkzinzow left a comment

LGTM, although I am a bit concerned about the leaky lifecycle of a long-lived detached actor. This stats actor will live for the lifetime of the cluster, and read stats won't be cleaned up for the lifetime of the cluster.

This could be done as a follow-up (and shouldn't block merging this), but what do you think about adding best-effort clean-up of these read stats when DatasetStats is destructed?

@ray.remote(num_cpus=0)
class _StatsActor:
    # ...

    def clear(self, stats_uuid: str):
        self.metadata.pop(stats_uuid, None)
        self.last_time.pop(stats_uuid, None)
        self.start_time.pop(stats_uuid, None)

class DatasetStats:
    def __del__(self):
        if self.needs_stats_actor:
            self.stats_actor.clear.remote(self.stats_uuid)

@jianoaix
Contributor Author

jianoaix commented Aug 4, 2022

Not sure it'll work: DatasetStats, as a Python/local object, can have multiple instances across the cluster, so we can't clean up the actor's entry on each destruction.

@clarkzinzow
Contributor

@jianoaix Ah good point, yeah I forgot that stats can be sent around to other tasks. It doesn't seem like there's a good garbage collection point at the moment. 🤔

@@ -502,7 +504,9 @@ def _submit_task(
         self, task_idx: int
     ) -> Tuple[ObjectRef[MaybeBlockPartition], ObjectRef[BlockPartitionMetadata]]:
         """Submit the task with index task_idx."""
-        stats_actor = _get_or_create_stats_actor()
+        if self._stats_actor is None:
Contributor

Do we need the null check, since it is already using get_if_exists=True?

Contributor Author

This is actually the cached actor handle in this class, which was initialized to None, so it's None on the first call here.
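
For illustration, a hedged sketch of that per-instance caching pattern; the class and method body here are placeholders, not the real ray.data code:

    class LazyBlockListSketch:
        # Placeholder for the real class that submits read tasks.
        def __init__(self):
            self._stats_actor = None  # cached handle, None until first use

        def _submit_task(self, task_idx: int):
            if self._stats_actor is None:
                # Get-or-create the named actor once, then reuse the handle so the
                # lookup RPC is not paid on every task submission.
                self._stats_actor = _get_or_create_stats_actor()
            stats_actor = self._stats_actor
            # ... submit the read task, passing stats_actor so it can record stats ...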

@scv119
Contributor

scv119 commented Aug 11, 2022

Do we need to cherry-pick this to the 2.0.0 release branch?

@jianoaix
Contributor Author

Do we need to cherry-pick this to the 2.0.0 release branch?

We haven't heard of any issues other than from the user who originally reported this (holding a Ray client and shutting down/restarting the cluster over multiple trials), so we probably do not need to pick it.

@jianoaix
Contributor Author

Synced to head and CI passed, @clarkzinzow

@clarkzinzow clarkzinzow merged commit b1cad0a into ray-project:master Aug 12, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
@ericl
Contributor

ericl commented Oct 11, 2022 via email

Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Successfully merging this pull request may close these issues.

[Core] Dangling actor handle held at Ray client
7 participants