[core] autoscaling_shuffle_1tb_1000_partitions failing on uncaught actor death #28709

Closed · rickyyx opened this issue on Sep 22, 2022 · 6 comments
Labels: bug (Something that is supposed to be working, but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order)

rickyyx (Contributor) commented on Sep 22, 2022

What happened + What you expected to happen

Failure run

The stack trace seems to point to an OOM kill that wasn't handled?

Map Progress.:  93%|█████████▎| 928/1000 [06:04<00:38,  1.86it/s]
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ray/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 358, in <module>
    main()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 353, in main
    use_wait=args.use_wait,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 308, in run
    tracker=tracker,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 218, in simple_shuffle
    render_progress_bar(tracker, input_num_partitions, output_num_partitions)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 142, in render_progress_bar
    new_num_map, new_num_reduce = ray.get(tracker.get_progress.remote())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2281, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: _StatusTracker
        actor_id: 5339ebb17b84c8655938f02301000000
        pid: 470
        namespace: f1ba0d60-c2d6-416f-8410-3a00a96d5596
        ip: 172.31.68.215
The actor is dead because its worker process has died. Worker exit type: USER_ERROR Worker exit detail: System memory low at node with IP 172.31.68.215. Used memory (104.32GB) / total capacity (109.80GB) (0.950123) exceeds threshold 0.95, killing latest task with name _StatusTracker.__init__() and actor ID 5339ebb17b84c8655938f02301000000 to avoid running out of memory.
This may indicate a memory leak in a task or actor, or that too many tasks are running in parallel.
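
If the tracker only feeds the progress bar, one possible mitigation (a minimal sketch, not the current shuffle.py code; get_progress_safely is a hypothetical helper) would be for the polling side to treat the tracker dying as non-fatal:

import ray
from ray.exceptions import GetTimeoutError, RayActorError

# Hypothetical helper: poll the tracker actor for progress, but treat the
# tracker being killed (e.g. by the memory monitor) as non-fatal so the
# shuffle itself can keep running with a stale progress value.
def get_progress_safely(tracker, last_progress=(0, 0)):
    try:
        return ray.get(tracker.get_progress.remote(), timeout=5)
    except (RayActorError, GetTimeoutError):
        # Tracker is gone or unresponsive; fall back to the last known
        # value instead of propagating the error and failing the run.
        return last_progress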

Versions / Dependencies

Last success: 7ba3788
First failure: b878fd2

b878fd2 [air] Use self-hosted mirror for CIFAR10 dataset (#28480)
8f4f4d1 [doc] Fix tune stopper doctests (#28531)
21140c2 (upstream/aviv-gpu-buffer) [doc/tune] fix tune stopper attribute name (#28517)
c6e7156 [autoscaler][observability] Experimental verbose mode (#28392)
9ff23cd [Telemetry][Kuberentes] Distinguish Kubernetes deployment stacks (#28490)
c91b4a7 [AIR] Make PathPartitionScheme a dataclass (#28390)
10996bd Add imports to object-spilling.rst Python code (#28507)
70153f2 [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)
ab036ed [ci] Increase timeout on test_metrics (#28508)
fd89891 [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)
7ba3788 Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)

Reproduction script

https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_Bmk5Ftx5tgxs5eiQX55Q6QSF?command-history-section=command_history

Issue Severity

No response

@rickyyx added the bug, P0, and core labels on Sep 22, 2022
@rickyyx added this to the Core Nightly/CI Regressions milestone on Sep 22, 2022
rickyyx (Contributor, Author) commented on Sep 22, 2022

Same test failure (seems to have slightly different causes): #28563

cc @stephanie-wang @wuisawesome @clarng

clarng (Contributor) commented on Sep 22, 2022

Will a _StatusTracker failure cause the whole workload to fail?

rickyyx (Contributor, Author) commented on Sep 22, 2022

Looks like that's the case for now; not sure about the best way to handle it yet.
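
One option (just a sketch, assuming the tracker's interface; the real _StatusTracker in ray.experimental.shuffle may be declared differently) would be to make the tracker restartable, so the memory monitor killing its worker doesn't abort the whole run:

import ray

# Sketch only: marking the tracker restartable lets Ray recreate it after
# its worker is killed, instead of surfacing a RayActorError to callers.
# Counters reset on restart, so the progress bar could jump backwards.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class StatusTracker:
    def __init__(self):
        self.num_map = 0
        self.num_reduce = 0

    def map_done(self):  # assumed method names, for illustration only
        self.num_map += 1

    def reduce_done(self):
        self.num_reduce += 1

    def get_progress(self):
        return self.num_map, self.num_reduce

The counter reset on restart is cosmetic here; the point is only that a restartable tracker wouldn't fail the workload.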

clarng (Contributor) commented on Sep 22, 2022

Yeah, looks like it's used to render the progress bar.

Trying to see if I can get the logs to understand why it was chosen to be killed.

rickyyx (Contributor, Author) commented on Sep 22, 2022

Assigning to you for now - feel free to triage and re-assign back.

rickyyx (Contributor, Author) commented on Sep 28, 2022

Is this one due to a non-retriable actor being killed?

scv119 closed this as completed on Oct 5, 2022