lobo1586 opened this issue on Jul 10, 2024 · 2 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), triage (Needs triage (eg: priority, bug/not-bug, and owning component))
What happened + What you expected to happen
I believe I am currently affected by NVIDIA/nvidia-docker#1671. We deployed Ray Serve to a cloud compute VM using cluster_config.yaml; after running for a couple of hours, the Docker container suddenly lost access to the host GPU.
This post seems to describe some workarounds: NVIDIA/nvidia-docker#1730. However, I am not sure how to integrate any of those solutions into the ray up cluster_config.yaml process.
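For reference, the part of cluster_config.yaml where I assume a node-level workaround would have to hook in looks roughly like the sketch below. The commands are placeholders rather than an actual fix from the NVIDIA thread; as far as I understand, initialization_commands run on the host VM before the container is started, while setup_commands run inside the Ray container.

```yaml
# Sketch only: where per-node commands live in a Ray cluster_config.yaml.
# The placeholder echo stands in for whichever workaround is chosen from
# NVIDIA/nvidia-docker#1730.

docker:
    image: "rayproject/ray-ml:2.24.0-py310-gpu"
    container_name: "ray_container"

# Run on the host VM (head and every worker) before the container is started.
initialization_commands:
    - echo "apply the host-level GPU/cgroup workaround here"

# Run inside the Ray container on every node after it comes up.
setup_commands:
    - nvidia-smi  # sanity check that the container still sees the GPUs
```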
Versions / Dependencies
Our container base image is rayproject/ray-ml:2.24.0-py310-gpu.
Our GCP boot disk source image is projects/ml-images/global/images/c0-deeplearning-common-cu118-v20240613-debian-11-py310.
Reproduction script
See the NVIDIA issue linked above.
Issue Severity
High: It blocks me from completing my task.
lobo1586 added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Jul 10, 2024
OK, it looks like this can be fixed using NVIDIA's posted workaround mentioned in NVIDIA/nvidia-docker#1730. However, that is a one-time manual fix per machine, and I am not sure how feasible it is when there are multiple worker nodes with GPUs.
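If I understand the cluster YAML correctly, ray up applies docker.run_options, initialization_commands, and setup_commands to every node it launches, including GPU workers the autoscaler adds later, so the fix should not have to be applied by hand on each machine. A rough sketch, assuming the workaround we end up with is the "pass the NVIDIA device nodes to the container explicitly" variant (the device list is a guess and depends on the GPUs actually attached to each VM):

```yaml
# Sketch: wiring a per-container workaround into every node via docker.run_options.
# Device paths are illustrative; adjust them to the devices present on the VM.
docker:
    image: "rayproject/ray-ml:2.24.0-py310-gpu"
    container_name: "ray_container"
    run_options:
        - --device=/dev/nvidiactl
        - --device=/dev/nvidia-uvm
        - --device=/dev/nvidia-uvm-tools
        - --device=/dev/nvidia0
```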