[core][release] dask_on_ray_large_scale_test_no_spilling failed with RayActorError on low memory

rickyyx opened this issue on Sep 26, 2022 · 0 comments

Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order), triage (Needs triage (eg: priority, bug/not-bug, and owning component))
What happened + What you expected to happen

The task below has been failing because the memory monitor actor is killed by the OOM killer:

dask_on_ray_large_scale_test_no_spilling (smoke)
dask_on_ray_large_scale_test_spilling (smoke) was failing as well, but seems to have a successful run now

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: monitor_memory_usage.<locals>.MemoryMonitorActor
actor_id: 5b7297472c9a580c171cc15602000000
pid: 1412
namespace: 94d6d15c-2534-45cf-b560-2d7cdbfac41b
ip: 172.31.41.244
The actor is dead because its worker process has died. Worker exit type: USER_ERROR Worker exit detail: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.31.41.244, ID: 208ebdd0b89f655e905f8ab3993f4a1c7ee671e57d81df38aa3164eb) where the task was running was 215.90GB / 219.60GB (0.983172), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a3f9f9127a61b300cfafe252b4c38ef4ad7d154106d15df55b95100e) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.41.244`. To see the logs of the worker, use `ray logs worker-a3f9f9127a61b300cfafe252b4c38ef4ad7d154106d15df55b95100e*out -ip 172.31.41.244`.
Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the eviction threshold, set the environment variable `RAY_memory_usage_threshold_fraction` when starting Ray. To disable worker eviction, set the environment variable `RAY_memory_monitor_interval_ms` to zero.
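For context, the class_name in the traceback (monitor_memory_usage.<locals>.MemoryMonitorActor) suggests the actor is created inside a monitor_memory_usage helper in the release-test harness. A minimal, illustrative sketch of that kind of periodic memory-sampling actor (assuming psutil; this is not the actual Ray utility) looks roughly like this:

import ray
import psutil


def monitor_memory_usage():
    # Illustrative sketch only: the actor class is nested inside a function,
    # which is why the traceback shows monitor_memory_usage.<locals>.MemoryMonitorActor.
    @ray.remote(num_cpus=0)
    class MemoryMonitorActor:
        def __init__(self):
            self.peak_used_bytes = 0

        def sample(self):
            # Record the node's current memory usage and track the peak seen so far.
            used = psutil.virtual_memory().used
            self.peak_used_bytes = max(self.peak_used_bytes, used)
            return used

        def get_peak(self):
            return self.peak_used_bytes

    # Assumes ray.init() has already been called on the cluster under test.
    return MemoryMonitorActor.remote()

Because such an actor runs on the same node as the workload, the node's memory monitor treats it like any other worker, so once usage crosses the 0.95 threshold it can be selected as the most recently scheduled task and killed, which is what the error above reports.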
Interestingly, the non-smoke tests seem to be passing, so this might be a configuration issue.
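If the kill threshold itself is what needs adjusting, the error text names two knobs. A hedged example of setting them for a locally started cluster (assuming, as with other RAY_* system settings, that they are read from the environment when the node starts; the values below are placeholders):

import os
import ray

# Variable names come from the error above; values here are placeholders.
os.environ["RAY_memory_usage_threshold_fraction"] = "0.98"  # raise the eviction threshold
# os.environ["RAY_memory_monitor_interval_ms"] = "0"        # or disable worker eviction entirely

ray.init()  # on a multi-node cluster, set these where `ray start` runs instead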
Versions / Dependencies
NOTE: non-smoke tests have been passing.
Last success: f6ae7ee
First failure: a47adb9
a47adb9 [RLlib] before_sub_environment_reset becomes on_episode_created(). (#28600)
d4e2e99 [Datasets] Add metadata override and inference in Dataset.to_dask(). (#28625)
8e8ab34 Handle starting worker throttling inside worker pool (#28551)
2527ffa [docs] Add basic parallel execution guide for Tune and cleanup order of guides (#28677)
fb7472f Remove RAY_RAYLET_NODE_ID (#28715)
93f911e Add API latency and call counts metrics to dashboard APIs (#28279)
66aae4c [Release Test] Make sure to delete all EBS volumes (#28707)
697df80 [Serve] [Docs] Remove incorrect output (#28708)
d8c9aa7 [docs] configurable ecosystem gallery (#28662)
42874e1 [RLlib] Atari gym environments now require ale-py. (#28703)
b7f0346 [AIR] Maintain dtype info in LightGBMPredictor (#28673)
f6ae7ee [tune] Test background syncer serialization (#28699)
Reproduction script
NA
Issue Severity
No response
rickyyx added the bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), triage (Needs triage (eg: priority, bug/not-bug, and owning component)), and core (Issues that should be addressed in Ray Core) labels on Sep 26, 2022

rickyyx added this to the Core Nightly/CI Regressions milestone on Sep 26, 2022

rickyyx changed the title from "[core][release] dask_on_ray_large_scale_test_no_spilling/spilling failed with RayActorError on low memory" to "[core][release] dask_on_ray_large_scale_test_no_spilling failed with RayActorError on low memory" on Sep 26, 2022