Use detached lifetime for stats actor #25271
Conversation
python/ray/data/impl/stats.py (outdated diff)
- # Actor handle, job id the actor was created for.
- _stats_actor = [None, None]
+ # Actor handle, job id, client id the actor was created for.
+ _stats_actor = [None, None, None]

  def _get_or_create_stats_actor():
Hmm, I don't remember why we did it this way. Would it be better to use actor.options(get_if_exists=True, name="_dataset_stats_actor").remote() instead to get-or-create the actor?
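For reference, the get-or-create-by-name pattern being suggested would look roughly like this; the actor body is simplified and the name is just whatever we settle on:

import ray

@ray.remote(num_cpus=0)
class _StatsActor:
    def __init__(self):
        self.metadata = {}

# get_if_exists=True returns a handle to the named actor if it is already
# alive and creates it otherwise, so callers don't have to cache and
# invalidate the handle themselves.
stats_actor = _StatsActor.options(
    name="_dataset_stats_actor", get_if_exists=True
).remote()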
I thought about changing to this option (getting by name) as well, but it doesn't seem to meet the requirement.
The requirement here is that this actor is reused across datasets created by the same process. For the get-by-name approach:
- It may work within the same dataset, but after that dataset completes and the actor refcount drops to zero, datasets created later will not be able to get that same actor (they have to recreate it).
- Alternatively, we could create a detached actor, but in that case it would be shared even across different driver processes.
Keeping a reference here and clearing it upon a new connection or a new driver is essentially what this PR does to serve this requirement.
Hm, sharing across drivers should be OK, right? That should be an uncommon case anyway.
If that's not a concern, then getting by name is indeed simpler.
So this stats actor will never be cleaned up for the lifetime of the Ray cluster, and the read stats for each Dataset will never get cleaned up within the actor, which is a bit of a leaky lifecycle. This seems fine for now to unblock the Ray Client use case, but we should probably open a P2 to improve the stats actor lifecycle, or eliminate the stats actor altogether, if possible.
My previous code stayed quite close to the existing approach. I think the feedback here is that it's a bit complicated, and that using a detached actor isn't a concern. It may make sense to have a new kind of lifetime between refcounted and detached, e.g. a per-job-lifetime actor.
Ah, we also need to set the namespace; otherwise you'll leak one actor per job.
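Something along these lines, reusing the _StatsActor class from stats.py; the name and namespace strings here are placeholders, not necessarily what the PR uses:

from ray.data.impl.stats import _StatsActor

# Pinning the detached actor to a fixed namespace means every job (and every
# Ray client connection) resolves the same actor instead of leaking one per
# job, since named actors are scoped to a namespace.
stats_actor = _StatsActor.options(
    name="datasets_stats_actor",           # placeholder name
    namespace="_dataset_stats_namespace",  # placeholder namespace
    lifetime="detached",
    get_if_exists=True,
).remote()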
Hmm, that's a good question; we should definitely validate this prior to merging.
Compared to the data loading of a read task, one RPC seems like a small cost? Do we have a test to measure the impact of this?
It's simple enough to run a trivial dataset workload with small blocks before/after this PR. Maybe something like a 10000-block range + map_batches?
Tried a simple test like this:
On a local cluster with 8 nodes and 1 CPU/node:
With this PR: mean time: 2.742537647485733
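For context, a sketch of the kind of workload being discussed (10000 small blocks plus a trivial map_batches pass); the exact script isn't shown above, so the sizes and calls below are assumptions:

import time

import ray

ray.init(address="auto")  # attach to the running cluster

start = time.time()
# Many tiny blocks so that per-task overhead (including the extra stats
# actor RPC) dominates over actual data loading.
ds = ray.data.range(10000, parallelism=10000).map_batches(lambda b: b)
ds.count()  # force full execution in case the transforms are lazy
print("mean time:", time.time() - start)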
Good enough for me!
Small nit: this docstring comment on the actor scope is not accurate anymore: ray/python/ray/data/impl/stats.py, line 85 in 3c9bd66.
Microbenchmark:
Before: 1.4783143997192383e-05 (sec)
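A micro-benchmark along these lines could reproduce the per-call number above; whether _get_or_create_stats_actor remains the right entry point after this PR is an assumption:

import timeit

import ray
from ray.data.impl.stats import _get_or_create_stats_actor

ray.init()
_get_or_create_stats_actor()  # pay the one-time actor creation up front

n = 10000
# Steady-state per-call cost of the lookup path, in seconds.
print(timeit.timeit(_get_or_create_stats_actor, number=n) / n, "(sec)")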
I was looking at the test failures, but it turned out they were already on the flaky test list. So the PR is ready to review/merge.
@ericl @clarkzinzow ptal, thanks
LGTM, although I am a bit concerned about the leaky lifecycle of a long-lived detached actor. This stats actor will live for the lifetime of the cluster, and read stats won't be cleaned up for the lifetime of the cluster.
This could be done as a follow-up (and shouldn't block merging this), but what do you think about adding best-effort clean-up of these read stats when DatasetStats is destructed?
import ray

@ray.remote(num_cpus=0)
class _StatsActor:
    # ...
    def clear(self, stats_uuid: str):
        # Best-effort removal of everything recorded under this dataset's
        # stats UUID.
        self.metadata.pop(stats_uuid, None)
        self.last_time.pop(stats_uuid, None)
        self.start_time.pop(stats_uuid, None)


class DatasetStats:
    def __del__(self):
        # Fire-and-forget clean-up of the actor-side entry when the local
        # stats object is garbage collected.
        if self.needs_stats_actor:
            self.stats_actor.clear.remote(self.stats_uuid)
Not sure it'll work: DatasetStats, as a Python/local object, can have multiple instances across the cluster, so this cannot clean up the entry in the actor on each destruction.
@jianoaix Ah good point, yeah I forgot that stats can be sent around to other tasks. It doesn't seem like there's a good garbage collection point at the moment. 🤔
@@ -502,7 +504,9 @@ def _submit_task(
      self, task_idx: int
  ) -> Tuple[ObjectRef[MaybeBlockPartition], ObjectRef[BlockPartitionMetadata]]:
      """Submit the task with index task_idx."""
-     stats_actor = _get_or_create_stats_actor()
+     if self._stats_actor is None:
Do we need the null check, since it is already using get_if_exists=True?
This is actually the cached actor handle in this class, which was initialized to None, so it's None on the first call here.
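In other words, the caching pattern is roughly the following; the host class name here is assumed, and _get_or_create_stats_actor is the helper from stats.py:

from ray.data.impl.stats import _get_or_create_stats_actor

class LazyBlockList:  # host class name assumed
    def __init__(self):
        # Cached handle, filled in lazily on the first task submission.
        self._stats_actor = None

    def _submit_task(self, task_idx: int):
        # get_if_exists=True already makes the lookup idempotent, but
        # caching the handle here skips the named-actor lookup (one extra
        # RPC) for every task after the first.
        if self._stats_actor is None:
            self._stats_actor = _get_or_create_stats_actor()
        ...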
Do we need to cherry-pick this to the 2.0.0 release branch?
We haven't heard of any issues other than from the user who originally reported this (holding a Ray client, and shutting down/restarting the cluster across multiple trials), so we probably do not need to pick it.
Synced to head and CI passed, @clarkzinzow
The actor handle held at the Ray client will become dangling if the Ray cluster is shut down, and in that case, if the user tries to get the actor again, it will result in a crash. This happened to a real user and blocked them from making progress. This change makes the stats actor detached, and instead of keeping a handle, we access it via its name. This way we make sure to re-create this actor if the cluster gets restarted.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
What if we just have a FIFO queue of stats? Like the most recent 10000 Dataset stats, which should suffice for almost everyone, but safeguard against any worst-case OOMs, like if you're running datasets in a while loop or something.
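A rough sketch of that idea, using the field names from the clear() snippet above; the cap and the record_start method name are assumptions:

from collections import OrderedDict

import ray

MAX_RECORDED_DATASETS = 10000  # assumed cap, per the suggestion above

@ray.remote(num_cpus=0)
class _StatsActor:
    def __init__(self):
        # Insertion-ordered so the oldest dataset's stats are evicted first.
        self.metadata = OrderedDict()
        self.last_time = {}
        self.start_time = {}

    def record_start(self, stats_uuid):
        self.metadata[stats_uuid] = {}
        self.start_time[stats_uuid] = {}
        # Bound memory by evicting the oldest entries, e.g. when datasets
        # are created in a while loop.
        while len(self.metadata) > MAX_RECORDED_DATASETS:
            evicted, _ = self.metadata.popitem(last=False)
            self.last_time.pop(evicted, None)
            self.start_time.pop(evicted, None)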
Why are these changes needed?
The actor handle held at the Ray client will become dangling if the Ray cluster is shut down, and in that case, if the user tries to get the actor again, it will result in a crash. This happened to a real user and blocked them from making progress.
This change makes the stats actor detached, and instead of keeping a handle, we access it via its name. This way we make sure to re-create this actor if the cluster gets restarted.
Related issue number
Closes #25237
Checks
- I've run scripts/format.sh to lint the changes in this PR.