A stacktrace seems to point to an OOM kill that wasn't handled:
Map Progress.: 93%|█████████▎| 928/1000 [06:04<00:38, 1.86it/s]
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ray/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 358, in <module>
    main()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 353, in main
    use_wait=args.use_wait,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 308, in run
    tracker=tracker,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 218, in simple_shuffle
    render_progress_bar(tracker, input_num_partitions, output_num_partitions)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/shuffle.py", line 142, in render_progress_bar
    new_num_map, new_num_reduce = ray.get(tracker.get_progress.remote())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2281, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: _StatusTracker
    actor_id: 5339ebb17b84c8655938f02301000000
    pid: 470
    namespace: f1ba0d60-c2d6-416f-8410-3a00a96d5596
    ip: 172.31.68.215
The actor is dead because its worker process has died. Worker exit type: USER_ERROR Worker exit detail: System memory low at node with IP 172.31.68.215. Used memory (104.32GB) / total capacity (109.80GB) (0.950123) exceeds threshold 0.95, killing latest task with name _StatusTracker.__init__() and actor ID 5339ebb17b84c8655938f02301000000 to avoid running out of memory.
This may indicate a memory leak in a task or actor, or that too many tasks are running in parallel.
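For reference, a minimal sketch of how the progress-bar polling could tolerate the _StatusTracker actor being OOM-killed instead of letting RayActorError propagate and fail the run. Only tracker and get_progress are taken from the traceback above; everything else is illustrative, not the release test's actual code.

import ray
from ray.exceptions import RayActorError

def poll_progress(tracker, last=(0, 0)):
    """Return (num_map, num_reduce); fall back to the last known values
    if the status-tracker actor has died (e.g. OOM-killed)."""
    try:
        return ray.get(tracker.get_progress.remote())
    except RayActorError:
        # The tracker only drives the progress bar, so keep reporting the
        # last known numbers rather than crashing the whole shuffle.
        # (Another option, also an assumption rather than what the test does,
        # is to create the actor with ray.remote(max_restarts=...) so Ray
        # restarts it automatically.)
        return last

Separately, the 0.95 kill threshold in the log comes from the memory monitor; in more recent Ray releases it is configurable (e.g. via the RAY_memory_usage_threshold environment variable), though the exact knob in this nightly build may differ.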
rickyyx added the bug, P0, and core labels and the Core Nightly/CI Regressions milestone on Sep 22, 2022.
What happened + What you expected to happen
Failure run
A stacktrace seems to point to an OOM kill that wasn't handled (see the traceback above).
Versions / Dependencies
Last success: 7ba3788
First failure: b878fd2
b878fd2 [air] Use self-hosted mirror for CIFAR10 dataset (#28480)
8f4f4d1 [doc] Fix tune stopper doctests (#28531)
21140c2 (upstream/aviv-gpu-buffer) [doc/tune] fix tune stopper attribute name (#28517)
c6e7156 [autoscaler][observability] Experimental verbose mode (#28392)
9ff23cd [Telemetry][Kuberentes] Distinguish Kubernetes deployment stacks (#28490)
c91b4a7 [AIR] Make PathPartitionScheme a dataclass (#28390)
10996bd Add imports to object-spilling.rst Python code (#28507)
70153f2 [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)
ab036ed [ci] Increase timeout on test_metrics (#28508)
fd89891 [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)
7ba3788 Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)
Reproduction script
https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_Bmk5Ftx5tgxs5eiQX55Q6QSF?command-history-section=command_history
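The link above is an internal Anyscale session. As a rough local approximation (an assumption, not the test's actual driver), the traceback shows the workload is the ray.experimental.shuffle module executed as __main__, which can be driven along these lines:

# Hedged local-repro sketch: run the experimental shuffle module the way the
# release test appears to (equivalent to `python -m ray.experimental.shuffle`),
# relying on the module's default argparse values. The actual release test
# likely passes different partition counts/sizes and flags not shown here.
import runpy
import sys

sys.argv = ["shuffle"]  # let the module's argparse defaults apply
runpy.run_module("ray.experimental.shuffle", run_name="__main__")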
Issue Severity
No response