
[Cluster] Ray running inside Docker on a cloud VM loses GPU access after a few hours #46552

Closed
lobo1586 opened this issue Jul 10, 2024 · 2 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@lobo1586

What happened + What you expected to happen

I believe I am currently affected by NVIDIA/nvidia-docker#1671.

We deployed Ray Serve to a cloud compute VM using cluster_config.yaml. After running for a couple of hours, the Docker container suddenly lost access to the host GPU.

This thread appears to describe some workarounds: NVIDIA/nvidia-docker#1730

However, I am not sure how to integrate any of those solutions into the ray up / cluster_config.yaml process.

Versions / Dependencies

Our container base image is rayproject/ray-ml:2.24.0-py310-gpu.
Our GCP boot disk source image is projects/ml-images/global/images/c0-deeplearning-common-cu118-v20240613-debian-11-py310.
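
For context, a minimal sketch of where these two images would sit in a GCP cluster_config.yaml, assuming the layout of Ray's GCP example configs; the node type name and machine type below are illustrative placeholders, not values from our actual config:

```yaml
# Sketch only: node type name and machineType are illustrative placeholders.
docker:
    image: rayproject/ray-ml:2.24.0-py310-gpu   # container base image
    container_name: ray_container

available_node_types:
    ray_head_gpu:                               # illustrative node type name
        node_config:
            machineType: n1-standard-8          # illustrative machine type
            disks:
              - boot: true
                autoDelete: true
                initializeParams:
                    # GCP boot disk source image listed above
                    sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu118-v20240613-debian-11-py310
```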

Reproduction script

See the linked NVIDIA issue.

Issue Severity

High: It blocks me from completing my task.

@lobo1586 added the bug and triage labels on Jul 10, 2024
@lobo1586 (Author)

OK, it looks like this can be fixed using NVIDIA's posted workaround mentioned in NVIDIA/nvidia-docker#1730. However, that is a one-time manual fix per machine, and I am not sure how feasible it is when there are multiple worker nodes with GPUs.

@anyscalesam added the core label on Jul 15, 2024
@rynewang (Contributor)

If I read this correctly, the workaround involves running a script on each node. Can you try putting that command into cluster_config.yaml via the initialization commands? https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#initialization-commands

Generally speaking, we expect NVIDIA to fix the issue for good, so that no extra workaround commands are needed.
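
For illustration, a minimal sketch of that suggestion, assuming the workaround chosen from NVIDIA/nvidia-docker#1730 has been packaged into a script; the script path below is a placeholder, not a real file:

```yaml
# cluster_config.yaml (fragment)
# initialization_commands run on every node (head and workers) outside the
# container, before Docker is started, so a per-node host-level fix can go here.
initialization_commands:
    - sudo bash /path/to/nvidia-gpu-workaround.sh   # placeholder path for the chosen workaround
```

Because initialization commands run on every node that ray up brings up, this would also cover the case of multiple GPU worker nodes mentioned above.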
