GPU Checkpointing: Can't save pma with non-MemoryFile of type *nvproxy.frontendFDMemmapFile #10478
Comments
This is a cuda-checkpoint issue: NVIDIA/cuda-checkpoint#4.
You are correct. I just re-wrote my PyTorch program in CUDA, and …
Update: I managed to get the …
FYI I ran the reproducer and got a different error. This is the same error from https://modal-public-assets.s3.amazonaws.com/gpu_ckpt_logs.zip. It seems that the … Side note: need to fix some things in the Dockerfile: …
To properly close FDs during checkpointing, you would need to iterate all FDTables during checkpointing to find nvproxy FDs (via type-assertion) and release/remove them. Given that we can't reasonably expect applications to continue working correctly after silently closing some of their FDs, we probably wouldn't want this in mainline runsc.
It seems like NVIDIA is aware of this issue, and is working on a fix. Until then, I'll use this temporary patch in my prototyping. Thank you for your help! I'm glad that this isn't a gVisor issue after all.
A friendly reminder that this issue had no activity for 120 days.
This issue has been closed due to lack of activity.
Description
Overview
Hi, I'm with modal.com. We are interested in using a combination of `cuda-checkpoint` and `runsc checkpoint` in order to snapshot GPUs within gVisor. The `cuda-checkpoint` utility freezes a CUDA process and copies the GPU state into CPU memory. We have managed to successfully run `cuda-checkpoint` from within a gVisor container. Ideally, we would then run `runsc checkpoint` (this is where the error lies). In principle, running the gVisor checkpointer after running the CUDA checkpointer will checkpoint the GPU memory, as `cuda-checkpoint` moves the GPU memory into the CPU, which is then saved by `runsc checkpoint`.
Current Thinking
We currently believe the cause of this error is that gVisor keeps the GPU device files open across checkpointing, which prevents the checkpoint from succeeding because device files are left open. However, since gVisor doesn't need access to the GPU during a checkpoint, we believe it should not hold the GPU devices at that point.
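One way to observe this from the host is to check which NVIDIA device files the sandbox still holds open after `cuda-checkpoint` has run. This is only a diagnostic sketch; the `runsc-sandbox` process name and the device paths are assumptions about a typical nvproxy setup:

```sh
# List NVIDIA device FDs still held open by the gVisor sandbox process.
# If these remain after cuda-checkpoint has toggled the workload, the
# device is still "acquired" from the checkpointer's point of view.
for pid in $(pgrep -f runsc-sandbox); do
  echo "sandbox pid: $pid"
  ls -l /proc/$pid/fd 2>/dev/null | grep -i nvidia
done
```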
Potential Solution
If this is indeed the source of the issue, then we would be content with a fix/patch that doesn't acquire the GPU devices and makes it our job, as the user, to keep track of mounting GPU devices on restore. If there is a way to make gVisor relinquish control of the GPU before checkpointing, that would also be desirable.
cc: @luiscape @thundergolfer
Steps to reproduce
Dockerfile:
Then follow the standard steps to create an OCI bundle (a sketch follows).
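A minimal sketch of that sequence, assuming the Dockerfile above is built as a Docker image tagged `nvtest` (the image tag and paths are assumptions, not from the original report):

```sh
# Build the image and export its filesystem as the OCI bundle rootfs.
docker build -t nvtest .
mkdir -p bundle/rootfs
docker export "$(docker create nvtest)" | tar -xf - -C bundle/rootfs

# Generate a template runtime spec; edit config.json so that "args"
# runs python3 /app/main.py instead of the default command.
cd bundle
runsc spec
```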
Run with:

`sudo runsc -nvproxy -nvproxy-driver-version '550.54.14' -nvproxy-docker run nvtest`
Then run `cuda-checkpoint` in the container (assuming the pid of `python3 /app/main.py` is `1`):
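The invocation would look roughly like the following (a sketch; it assumes the `cuda-checkpoint` binary is available inside the container and that the target pid is `1` as above):

```sh
# Suspend CUDA work in pid 1 and copy its GPU state into CPU memory.
# Running the same command again toggles the process back onto the GPU.
sudo runsc exec nvtest cuda-checkpoint --toggle --pid 1
```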
Up to this point, everything works (running `nvidia-smi` shows no GPU processes).

Now, the gVisor checkpoint:
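A sketch of the checkpoint step (the image path is an assumption; `nvtest` is the container id from the run command above):

```sh
# Save the container's state; the checkpoint image is written to the
# directory given by --image-path.
sudo runsc checkpoint --image-path=/tmp/nvtest-ckpt nvtest
```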
This should trigger the error.
runsc version
docker version (if using docker)
uname
Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
N/A
runsc debug logs (if available)