
[core][release] dask_on_ray_large_scale_test_no_spilling failed with RayActorError on low memory #28778

Closed
rickyyx opened this issue Sep 26, 2022 · 0 comments
Labels
bug: Something that is supposed to be working, but isn't
core: Issues that should be addressed in Ray Core
P0: Issues that should be fixed in short order
triage: Needs triage (e.g. priority, bug/not-bug, and owning component)

Comments


rickyyx commented Sep 26, 2022

What happened + What you expected to happen

The task below has been failing because the memory monitor actor is being killed by the OOM killer:

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: monitor_memory_usage.<locals>.MemoryMonitorActor
        actor_id: 5b7297472c9a580c171cc15602000000
        pid: 1412
        namespace: 94d6d15c-2534-45cf-b560-2d7cdbfac41b
        ip: 172.31.41.244
The actor is dead because its worker process has died. Worker exit type: USER_ERROR Worker exit detail: Task was killed due to the node running low on memory.

Memory on the node (IP: 172.31.41.244, ID: 208ebdd0b89f655e905f8ab3993f4a1c7ee671e57d81df38aa3164eb) where the task was running was 215.90GB / 219.60GB (0.983172), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a3f9f9127a61b300cfafe252b4c38ef4ad7d154106d15df55b95100e) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.41.244`. To see the logs of the worker, use `ray logs worker-a3f9f9127a61b300cfafe252b4c38ef4ad7d154106d15df55b95100e*out -ip 172.31.41.244`.

Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the eviction threshold, set the environment variable `RAY_memory_usage_threshold_fraction` when starting Ray. To disable worker eviction, set the environment variable `RAY_memory_monitor_interval_ms` to zero.
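For reference, the environment variables named in the error message are set before starting Ray; a minimal sketch (the variable names are taken verbatim from the log above, and the values are illustrative, not a recommended fix):

```shell
# Illustrative only: raise the kill threshold above the default 0.95
# mentioned in the error message.
export RAY_memory_usage_threshold_fraction=0.98

# Or disable the memory monitor entirely by setting its poll interval to 0,
# as the message suggests.
export RAY_memory_monitor_interval_ms=0

ray start --head
```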

Interestingly, the non-smoke tests seem to be passing, so this might be a configuration issue.
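For context, the killed actor is a memory-usage monitor. A rough, hypothetical sketch of the kind of sampling such a monitor does (this is not Ray's actual `MemoryMonitorActor`; it simply reads `/proc/meminfo` on Linux):

```python
# Hypothetical sketch of a node memory-usage check, assuming Linux /proc/meminfo.
def memory_usage_fraction():
    """Return used/total memory as a fraction in [0, 1]."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return (total - available) / total

# Mirrors the 0.95 threshold quoted in the error message above.
THRESHOLD = 0.95

if memory_usage_fraction() > THRESHOLD:
    print("node is low on memory; Ray's memory monitor would start killing workers")
```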

Versions / Dependencies

NOTE: non-smoke tests have been passing.
Last success: f6ae7ee
First failure: a47adb9

a47adb9 [RLlib] before_sub_environment_reset becomes on_episode_created(). (#28600)
d4e2e99 [Datasets] Add metadata override and inference in Dataset.to_dask(). (#28625)
8e8ab34 Handle starting worker throttling inside worker pool (#28551)
2527ffa [docs] Add basic parallel execution guide for Tune and cleanup order of guides (#28677)
fb7472f Remove RAY_RAYLET_NODE_ID (#28715)
93f911e Add API latency and call counts metrics to dashboard APIs (#28279)
66aae4c [Release Test] Make sure to delete all EBS volumes (#28707)
697df80 [Serve] [Docs] Remove incorrect output (#28708)
d8c9aa7 [docs] configurable ecosystem gallery (#28662)
42874e1 [RLlib] Atari gym environments now require ale-py. (#28703)
b7f0346 [AIR] Maintain dtype info in LightGBMPredictor (#28673)
f6ae7ee [tune] Test background syncer serialization (#28699)

Reproduction script

NA

Issue Severity

No response

@rickyyx rickyyx added bug Something that is supposed to be working; but isn't P0 Issues that should be fixed in short order triage Needs triage (eg: priority, bug/not-bug, and owning component) core Issues that should be addressed in Ray Core labels Sep 26, 2022
@rickyyx rickyyx added this to the Core Nightly/CI Regressions milestone Sep 26, 2022
@rickyyx rickyyx changed the title [core][release] dask_on_ray_large_scale_test_no_spilling/spilling failed with RayActorError on low memory [core][release] dask_on_ray_large_scale_test_no_spilling failed with RayActorError on low memory Sep 26, 2022
@rickyyx rickyyx closed this as completed Oct 3, 2022