GPU Checkpointing: Can't save pma with non-MemoryFile of type *nvproxy.frontendFDMemmapFile #10478

Closed
mattnappo opened this issue May 23, 2024 · 7 comments
Labels
area: gpu (Issue related to sandboxed GPU access), auto-closed, stale-issue (This issue has not been updated in 120 days.), type: bug (Something isn't working)

Comments

@mattnappo
Contributor

mattnappo commented May 23, 2024

Description

Overview
Hi, I'm with modal.com. We would like to combine cuda-checkpoint with runsc checkpoint to snapshot GPU workloads running in gVisor. The cuda-checkpoint utility freezes a CUDA process and copies its GPU state into host (CPU) memory. We have successfully run cuda-checkpoint from within a gVisor container. The next step would be to run runsc checkpoint, and this is where the error occurs. In principle, running the gVisor checkpointer after cuda-checkpoint should capture the GPU state, because cuda-checkpoint has already moved it into host memory, which runsc checkpoint then saves.

Current Thinking
We currently believe the error occurs because gVisor acquires the GPU devices before checkpointing and still holds them at checkpoint time, so the open device files prevent the checkpoint from succeeding. Since gVisor doesn't need access to the GPU during a checkpoint, we believe it should not hold the GPU device.

Potential Solution
If this is indeed the source of the issue, we would be content with a fix or patch that doesn't acquire the GPU devices and leaves it to the user (us) to keep track of mounting GPU devices on restore. A way to make gVisor relinquish control of the GPU before checkpointing would also be desirable.


cc: @luiscape @thundergolfer

Steps to reproduce

Dockerfile:

FROM nvidia/cuda:12.4.1-devel-ubuntu20.04

WORKDIR /app

RUN apt-get update && \
    apt-get install -y \
        wget \
        python3-pip \
        python3-dev

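# wget does not set the executable bit, so mark the binary executable explicitly.
RUN wget -O /bin/cuda-checkpoint https://github.com/NVIDIA/cuda-checkpoint/raw/main/bin/x86_64_Linux/cuda-checkpoint && \
    chmod +x /bin/cuda-checkpoint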

RUN python3 -m pip install --upgrade pip

# PyTorch for Linux CUDA 12.1
RUN pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121

COPY <<EOF main.py
import torch, time, sys
if not torch.cuda.is_available():
    print("cuda is not available")
    sys.exit(-1)

counter = torch.tensor(0, device="cuda")
while True:
    print(counter)
    counter += 1
    time.sleep(1)
EOF

ENTRYPOINT ["python3", "/app/main.py"]

Then follow the steps here to create an OCI bundle.

Run with:

sudo runsc -nvproxy -nvproxy-driver-version '550.54.14' -nvproxy-docker run nvtest

Then run cuda-checkpoint in the container (assuming pid of python3 /app/main.py is 1):

sudo runsc exec nvtest sh -c 'cuda-checkpoint --toggle --pid 1'

Up to this point, everything works (running nvidia-smi shows no GPU processes).
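For anyone scripting the "no GPU processes" check rather than eyeballing nvidia-smi, here is a minimal sketch. It assumes nvidia-smi is on the PATH and relies on its `--query-compute-apps` CSV output. The helper names are made up for illustration and are not part of the original report.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse `--query-compute-apps=pid,process_name --format=csv,noheader`
    output into a list of (pid, process_name) tuples."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only on the first comma; the process name is the remainder.
        pid, name = (part.strip() for part in line.split(",", 1))
        apps.append((int(pid), name))
    return apps


def gpu_processes():
    """Run nvidia-smi (assumed to be installed) and return the compute apps."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_compute_apps(out)
```

After a successful `cuda-checkpoint --toggle`, `gpu_processes()` would be expected to return an empty list, matching the observation above.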

Now, the gVisor checkpoint:

sudo runsc checkpoint -leave-running -image-path image/ nvtest

This should trigger the error.
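The three commands above can be collected in one place. This is only a sketch: the flags are copied verbatim from the steps above, while the function name and its defaults are illustrative. The script prints the commands rather than running them, since `runsc run` stays in the foreground and the later steps need a second shell.

```python
import shlex


def repro_commands(container="nvtest", driver_version="550.54.14",
                   pid=1, image_path="image/"):
    """Return the reproduction steps as argv lists, in the order they are run."""
    run = ["sudo", "runsc", "-nvproxy",
           "-nvproxy-driver-version", driver_version,
           "-nvproxy-docker", "run", container]
    # Toggle the CUDA state of the target process from inside the sandbox.
    toggle = ["sudo", "runsc", "exec", container, "sh", "-c",
              f"cuda-checkpoint --toggle --pid {pid}"]
    checkpoint = ["sudo", "runsc", "checkpoint", "-leave-running",
                  "-image-path", image_path, container]
    return [run, toggle, checkpoint]


if __name__ == "__main__":
    for argv in repro_commands():
        print(shlex.join(argv))
```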

runsc version

runsc version release-20240513.0
spec: 1.1.0-rc.1

CUDA driver version: 550.54.14

docker version (if using docker)

N/A

uname

Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu x86_64 GNU/Linux

kubectl (if using Kubernetes)

N/A

repo state (if built from source)

N/A

runsc debug logs (if available)

Full logs: https://modal-public-assets.s3.amazonaws.com/gpu_ckpt_logs.zip

The main stack trace is in gvisor_logs/runsc.log.20240522-215727.751628.checkpoint.txt

Partial stack trace:
+ sudo runsc -debug -debug-log gvisor_logs/ checkpoint -image-path image/ -leave-running 8b769be0-488b-4a84-8b80-6e861da7d5b7
checkpoint failed: checkpointing container "8b769be0-488b-4a84-8b80-6e861da7d5b7": encoding error: Can't save pma with non-MemoryFile of type *nvproxy.frontendFDMemmapFile:
goroutine 656 [running]:
gvisor.dev/gvisor/pkg/state.safely.func1()
	pkg/state/state.go:309 +0x179
panic({0x10a7e60?, 0xc00007d450?})
	GOROOT/src/runtime/panic.go:770 +0x132
gvisor.dev/gvisor/pkg/sentry/mm.(*pma).saveFile(0x4d47ed?)
	pkg/sentry/mm/save_restore.go:144 +0xf9
gvisor.dev/gvisor/pkg/sentry/mm.(*pma).StateSave(0xc000b6dea0, {{0xc000604708?, 0xc0008f4f18?}})
	bazel-out/k8-fastbuild/bin/pkg/sentry/mm/mm_state_autogen.go:373 +0x30
gvisor.dev/gvisor/pkg/state.(*encodeState).encodeStruct(0xc000604708, {0x123c4e0, 0xc000b6dea0, 0x199}, 0xc0008f6a40)
	pkg/state/encode.go:537 +0x5a9
gvisor.dev/gvisor/pkg/state.(*encodeState).encodeObject(0xc000604708, {0x123c4e0?, 0xc000b6dea0?, 0x30?}, 0x0, 0xc0008f6a40)
	pkg/state/encode.go:734 +0x5e5
gvisor.dev/gvisor/pkg/state.(*objectEncoder).save(0x11eecc0?, 0xc000b6dea0?, {0x123c4e0?, 0xc000b6dea0?, 0x11a8be0?})
	pkg/state/encode.go:478 +0x8c
gvisor.dev/gvisor/pkg/state.Sink.Save({{0xc000604708, 0xc0008f4ee8}}, 0x2, {0x11eecc0?, 0xc000b6dea0?})
	pkg/state/state.go:160 +0xa5
gvisor.dev/gvisor/pkg/sentry/mm.(*pmaFlatSegment).StateSave(0xc000b6de90, {{0xc000604708?, 0xc0008f4ee8?}})
	bazel-out/k8-fastbuild/bin/pkg/sentry/mm/mm_state_autogen.go:488 +0x90
gvisor.dev/gvisor/pkg/state.(*encodeState).encodeStruct(0xc000604708, {0x11a8ca0, 0xc000b6de90, 0x199}, 0xc0001775e8)
	pkg/state/encode.go:537 +0x5a9
gvisor.dev/gvisor/pkg/state.(*encodeState).encodeObject(0xc000604708, {0x11a8ca0?, 0xc000b6de90?, 0x7?}, 0x1, 0xc0001775e8)
	pkg/state/encode.go:734 +0x5e5
gvisor.dev/gvisor/pkg/state.(*encodeState).encodeArray(0xc000604708, {0xc0005aeaf0?, 0xc000b6a000?, 0x199?}, 0xc0006fde20)
	pkg/state/encode.go:551 +0x110
gvisor.dev/gvisor/pkg/state.(*encodeState).encodeObject(0xc000604708, {0xc0005aeaf0?, 0xc000b6a000?, 0x30?}, 0x0, 0xc0006fde20)
	pkg/state/encode.go:715 +0x3bb
gvisor.dev/gvisor/pkg/state.(*encodeState).Save.func2()
	pkg/state/encode.go:771 +0x8e
gvisor.dev/gvisor/pkg/state.safely(0xc000604708?)
	pkg/state/state.go:322 +0x57
gvisor.dev/gvisor/pkg/state.(*encodeState).Save(0xc000604708, {0x12dbc20?, 0xc00032d888?, 0x0?})
	pkg/state/encode.go:764 +0x21e
gvisor.dev/gvisor/pkg/state.Save.func1()
	pkg/state/state.go:104 +0x98
gvisor.dev/gvisor/pkg/state.safely(0x0?)
	pkg/state/state.go:322 +0x57
gvisor.dev/gvisor/pkg/state.Save({0x7efe9a75fa98, 0xc000306160}, {0x7efe59672138, 0xc00128a580}, {0x12dc920, 0xc00032d888})
	pkg/state/state.go:103 +0x1d3
gvisor.dev/gvisor/pkg/sentry/kernel.(*Kernel).SaveTo(0xc00032d888, {0x154ae98, 0xc000306160}, {0x7efe59672138, 0xc00128a580}, 0x0, 0x0, {0x18?})
	pkg/sentry/kernel/kernel.go:646 +0x7d9
gvisor.dev/gvisor/pkg/sentry/state.SaveOpts.Save({{0x152b560, 0xc000effda0}, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc0001acf30, {0x0}, 0xc0010e2960, ...}, ...)
	pkg/sentry/state/state.go:102 +0x285
gvisor.dev/gvisor/pkg/sentry/control.(*State).Save(0xc0001a74e0, 0xc00001e3c0, 0x12f8365?)
	pkg/sentry/control/state.go:113 +0x425
gvisor.dev/gvisor/runsc/boot.(*Loader).save(0xc0004bc488, 0xc00001e3c0)
	runsc/boot/restore.go:327 +0x11f
gvisor.dev/gvisor/runsc/boot.(*containerManager).Checkpoint(0xc0004e9b60, 0xc00001e3c0, 0x0?)
	runsc/boot/controller.go:426 +0x58
reflect.Value.call({0xc0002014a0?, 0xc000098140?, 0xc000357c70?}, {0x12ec8aa, 0x4}, {0xc000357eb0, 0x3, 0xc000357ca0?})
	GOROOT/src/reflect/value.go:596 +0xce5
reflect.Value.Call({0xc0002014a0?, 0xc000098140?, 0xa0?}, {0xc000357eb0?, 0xc00001e3c0?, 0x16?})
	GOROOT/src/reflect/value.go:380 +0xb9
gvisor.dev/gvisor/pkg/urpc.(*Server).handleOne(0xc000108a50, 0xc00010cea0)
	pkg/urpc/urpc.go:338 +0x63b
gvisor.dev/gvisor/pkg/urpc.(*Server).handleRegistered(...)
	pkg/urpc/urpc.go:433
gvisor.dev/gvisor/pkg/urpc.(*Server).StartHandling.func1()
	pkg/urpc/urpc.go:453 +0x76
created by gvisor.dev/gvisor/pkg/urpc.(*Server).StartHandling in goroutine 50
	pkg/urpc/urpc.go:451 +0x6b
@mattnappo mattnappo added the "type: bug" label May 23, 2024
@ayushr2
Collaborator

ayushr2 commented May 23, 2024

This is a cuda-checkpoint issue: NVIDIA/cuda-checkpoint#4.

@ayushr2 ayushr2 added the "area: gpu" label May 23, 2024
@mattnappo
Contributor Author

mattnappo commented May 23, 2024

You are correct. I just re-wrote my PyTorch program in CUDA, and runsc checkpoint worked. Until NVIDIA fixes this, could you advise how I could temporarily patch gVisor to close the FDs, similar to what was done here? Is this a simple task?

@mattnappo
Contributor Author

mattnappo commented May 23, 2024

Update: I managed to get the torch example working by changing the panics to warnings in this file. This is obviously very hacky, and I wonder if there is a way to manually release the *nvproxy?
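To make the shape of that hack concrete, here is a toy model in Python. It is not gVisor code: the real logic is Go in pkg/sentry/mm/save_restore.go, and every class and function name below is a hypothetical stand-in. It only illustrates the difference between failing hard on an unsupported backing file and logging a warning and skipping it.

```python
import logging

log = logging.getLogger("toy-checkpoint")


class MemoryFile:
    """Stand-in for a backing file the saver knows how to serialize."""


class DeviceMemmapFile:
    """Hypothetical stand-in for a device-backed mapping (the nvproxy case)."""


def save_pma_file(file, strict=True):
    """Return "saved" for a MemoryFile. For anything else, raise when
    strict is true (the 'panic' behaviour) or warn and skip otherwise."""
    if isinstance(file, MemoryFile):
        return "saved"
    msg = f"Can't save pma with non-MemoryFile of type {type(file).__name__}"
    if strict:
        raise RuntimeError(msg)
    log.warning(msg)
    return None
```

With `strict=False` the unsupported mapping is silently left out of the checkpoint image, which is why the approach is only suitable for prototyping.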

@ayushr2
Collaborator

ayushr2 commented May 23, 2024

FYI, I ran the reproducer and got a different error. It is the same error that appears in the logs at https://modal-public-assets.s3.amazonaws.com/gpu_ckpt_logs.zip:

W0522 21:57:27.787759 25674 util.go:64] FATAL ERROR: checkpoint failed: checkpointing container "d83b8fa4-08bd-4d69-9a6f-5c3c28e98856": encoding error: can't save with live nvproxy clients:

It seems that the Can't save pma with non-MemoryFile of type *nvproxy.frontendFDMemmapFile error pasted above is from a different run where cuda-checkpoint was not run on PID 1.
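Since the two failures are easy to mix up when scanning logs, a small classifier can help. This is a sketch: the two message strings are copied from this thread, while the function name and the returned labels are illustrative.

```python
# Error substrings as they appear in the runsc output quoted in this thread.
PMA_ERROR = "Can't save pma with non-MemoryFile of type"
LIVE_CLIENTS_ERROR = "can't save with live nvproxy clients"


def classify_checkpoint_error(log_text):
    """Return a short label for which checkpoint failure a log contains."""
    if LIVE_CLIENTS_ERROR in log_text:
        # What the reproducer produced here after cuda-checkpoint was run.
        return "live-nvproxy-clients"
    if PMA_ERROR in log_text:
        # Seen in a run where cuda-checkpoint apparently was not run on PID 1.
        return "non-memoryfile-pma"
    return "unknown"
```

For example, feeding it the FATAL ERROR line quoted above would yield `"live-nvproxy-clients"`.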

Side note: need to fix some things in the Dockerfile:

  • Install wget
  • Import sys in main.py

"could you advise how I could temporarily patch gVisor to close the FDs?"

To properly close FDs during checkpointing, you would need to iterate all FDTables during checkpointing to find nvproxy FDs (via type-assertion) and release/remove them. Given that we can't reasonably expect applications to continue working correctly after silently closing some of their FDs, we probably wouldn't want this in mainline runsc.
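The approach described above can be modelled in a few lines. This is a toy sketch in Python, not gVisor code: the real implementation would be Go inside the sentry, and `FDTable`, `NvproxyFile`, and `release_nvproxy_fds` here are hypothetical stand-ins. `isinstance` plays the role of the Go type assertion.

```python
from dataclasses import dataclass, field


class FileDescription:
    """Base type for anything an FD can point at (toy model)."""


class RegularFile(FileDescription):
    pass


class NvproxyFile(FileDescription):
    """Hypothetical stand-in for an nvproxy-backed file description."""


@dataclass
class FDTable:
    files: dict = field(default_factory=dict)  # fd number -> FileDescription


def release_nvproxy_fds(tables):
    """Remove every NvproxyFile from every table.
    Returns (table_index, fd) pairs for the entries that were removed."""
    removed = []
    for index, table in enumerate(tables):
        # Collect matches first so the dict is not mutated while iterating.
        doomed = [fd for fd, f in table.files.items()
                  if isinstance(f, NvproxyFile)]
        for fd in doomed:
            del table.files[fd]
            removed.append((index, fd))
    return removed
```

The returned list makes the caveat above visible: each removed entry is an FD the application still believes is open.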

@mattnappo
Contributor Author

It seems like NVIDIA is aware of this issue, and is working on a fix. Until then, I'll use this temporary patch in my prototyping. Thank you for your help! I'm glad that this isn't a gVisor issue after all.


A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the "stale-issue" label Sep 26, 2024

This issue has been closed due to lack of activity.

@github-actions github-actions bot closed this as not planned Dec 25, 2024