
Failed to checkpoint dump container using GPU - Unable to connect a transport socket: Permission denied #19

Open
ezerk opened this issue Nov 14, 2024 · 6 comments

Comments


ezerk commented Nov 14, 2024

Hi,
My end goal is to checkpoint containers that use the GPU during the CI phase and later deploy them with Kubernetes, in order to reduce pod warm-up time.

So far I have experimented locally and was able to dump a GPU process (even a relatively complex one, using the workaround suggested in #4),

but when I run the app as a container, it fails at an early stage of the dump.

Steps to reproduce:

Take the example code counter.cu provided in this repo and wrap it in a container using this Dockerfile:

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder
COPY counter.cu .
RUN nvcc counter.cu -o /tmp/counter

FROM nvidia/cuda:12.4.1-base-ubuntu22.04 AS main
WORKDIR /app
COPY --from=builder /tmp/counter /app/counter
EXPOSE 10000
ENTRYPOINT ["/app/counter"]

I'm using Podman since it provides the option to checkpoint to an image, which I hope to use later on.

  • Clone this repo and place the Dockerfile under the src folder
  • Build a local image: sudo podman build -t counter .
  • Run the image: sudo podman run -p 10000:10000/udp --gpus=all --name=counter counter
  • Running nvidia-smi (on the host machine) shows the PID as expected:
PID: 192578 /app/counter
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             32W /   70W |     103MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    192578      C   /app/counter                                  100MiB |
+-----------------------------------------------------------------------------------------+
  • Run sudo cuda-checkpoint --toggle --pid $PID - it completes successfully
  • Validate with nvidia-smi that the process is offloaded from the GPU
Output shows `No running processes found`, as expected:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             33W /   70W |       1MiB /  15360MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • Trying to checkpoint with Podman, sudo podman container checkpoint counter --create-image=counter_checkpoint, fails (using other flags provided by podman checkpoint resulted in the same error):
CRIU checkpointing failed -52.  Please check CRIU logfile .. 

662 (00.007387) Error (criu/mount.c:757): mnt: 637:./usr/lib/firmware/nvidia/550.127.05/gsp_tu10x.bin doesn't have a proper root mount
663 (00.007403) net: Unlock network
664 (00.007407) Running network-unlock scripts
665 (00.024091) Unfreezing tasks into 1
666 (00.024103)     Unseizing 192578 into 1
667 (00.024161) Error (criu/cr-dump.c:2111): Dumping FAILED.
  • Trying to use the underlying criu command directly, sudo criu dump --shell-job --images-dir dump --external 'mnt[]:sm' -vvvv -o dump2.log, I can get some progress (thanks to the --external 'mnt[]:sm' flag) but it still fails, this time with:
(00.017847) Putting tsock into pid 192578
(00.018191) Error (compel/src/lib/infect.c:713): Unable to connect a transport socket: Permission denied
(00.018200) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018202) Error (compel/src/lib/ptrace.c:96): Can't poke 192578 @ 0x55887f472000 from 0x7ffc2ad59a58 sized 8
(00.018204) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018206) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.018207) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 192578)
(00.018209) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection
(00.018215) Error (criu/cr-dump.c:1610): Can't infect (pid: 192578) with parasite
(00.018276) net: Unlock network
(00.018279) Running network-unlock scripts
(00.034182) Unfreezing tasks into 1
(00.034194)     Unseizing 192578 into 1
(00.034200) Error (compel/src/lib/infect.c:418): Unable to detach from 192578: No such process
(00.034239) Error (criu/cr-dump.c:2111): Dumping FAILED.
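
For reference, the toggle-then-dump sequence from the steps above can be sketched as a small helper script. The function names are mine, and the `counter` container name comes from this reproduction; only `podman inspect` and `cuda-checkpoint` themselves are real tools:

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps above; assumes a running Podman
# container named "counter" and cuda-checkpoint available on the PATH.
set -euo pipefail

# Resolve the container's main PID as seen from the host.
container_pid() {
  sudo podman inspect --format '{{.State.Pid}}' "$1"
}

# Suspend CUDA state for the container's main process before dumping.
toggle_gpu() {
  sudo cuda-checkpoint --toggle --pid "$(container_pid "$1")"
}

# Usage (not run here):
#   toggle_gpu counter
#   nvidia-smi   # should report "No running processes found"
#   sudo podman container checkpoint counter --create-image=counter_checkpoint
```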

Note: the same error is reproduced both with the RPM-installed CRIU (criu-3.19-1.el9.x86_64) and with CRIU v4 compiled from the latest commit on the criu-dev branch.

Complete CRIU log files:

Spec
criu v3.19 (rpm version)
$ criu_ORIG --version
Version: 3.19
$ sudo criu_ORIG check --all
Warn  (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
criu v4 compiled version
$ criu --version
Version: 4.0
GitID: v4.0-23-gf6baf8143
$ sudo criu check --all
Warn  (criu/cr-check.c:1348): Nftables based locking requires libnftables and set concatenations support
Error (criu/cr-check.c:1553): unmatched dev:ino 0:38:9 (expected 0:39:9)
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
CentOS host details:
$ uname -mor
5.14.0-522.el9.x86_64 x86_64 GNU/Linux

$ cat /etc/system-release
CentOS Stream release 9

rst0git commented Nov 14, 2024

Driver Version: 550.127.05

@ezerk Would you be able to update your driver version to 555 or 560?

The following readme file provides more information:
https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/cuda


ezerk commented Nov 14, 2024

@rst0git - thanks for the super-fast reply - much appreciated!
I will give it a try and report back.


ezerk commented Nov 18, 2024

Upgraded the driver on the host twice, to 555 and then to 560 - still getting the same error (tried with both CRIU v3.19 and v4).

See driver details:

nvidia-driver.x86_64 3:555.42.06-1.el9 @cuda-rhel9-x86_64
dnf module install nvidia-driver:555-open

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+

I've also tried running images based on nvidia/cuda:12.6.2-base-ubuntu24.04 and on nvidia/cuda:12.5.0-base-ubuntu24.04.

nvidia-driver.x86_64 3:560.35.03-1.el9 @cuda-rhel9-x86_64
dnf module install nvidia-driver:560-open

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+


rst0git commented Nov 18, 2024

(00.018191) Error (compel/src/lib/infect.c:713): Unable to connect a transport socket: Permission denied
(00.018200) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018202) Error (compel/src/lib/ptrace.c:96): Can't poke 192578 @ 0x55887f472000 from 0x7ffc2ad59a58 sized 8
(00.018204) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018206) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.018207) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 192578)
(00.018209) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection

The error above is unrelated. It occurs because SELinux is enabled and prevents the CRIU parasite code from writing to the log file descriptor.

trying to use the underlying criu command directly

I would not recommend this approach because specifying all required CRIU options would be very challenging.

@ezerk Would you be able to try the following with CRIU v4.0 and CUDA plugin?

mkdir -p /etc/criu
echo -e "tcp-established\nghost-limit=100M\ntimeout=300" | sudo tee /etc/criu/runc.conf
sed -i 's/#runtime = "crun"/runtime = "runc"/' /usr/share/containers/containers.conf

sudo podman run -d --name cuda-counter --device nvidia.com/gpu=all --security-opt=label=disable \
        quay.io/radostin/cuda-counter

sudo podman logs -l
sudo podman container checkpoint -l -e /tmp/test.tar
podman rm -f cuda-counter
sudo podman container restore -i /tmp/test.tar
sudo podman logs -l
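
Before running the sequence above, it may help to verify the two preconditions it relies on: runc as the active OCI runtime, and SELinux not blocking the parasite injection. This is a hedged sketch; the check_* helper names are mine, but `podman info` and `getenforce` are real commands:

```shell
#!/usr/bin/env bash
# Sanity checks for the preconditions of the checkpoint sequence above:
# Podman must use runc, and an enforcing SELinux requires running the
# container with --security-opt=label=disable.
set -euo pipefail

check_runtime() {
  # podman info reports the active OCI runtime by name.
  sudo podman info --format '{{.Host.OCIRuntime.Name}}'
}

check_selinux() {
  # "Enforcing" means labeling must be disabled for the container.
  getenforce 2>/dev/null || echo "selinux tools not installed"
}

# Usage (not run here):
#   [ "$(check_runtime)" = runc ] || echo "set runtime = \"runc\" in containers.conf"
#   check_selinux
```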


ezerk commented Nov 18, 2024

It worked!
Changing runtime = "runc" in /usr/share/containers/containers.conf did the trick.
This flag is also mandatory: podman run --security-opt=label=disable.

The changes to /etc/criu/runc.conf do not seem to be mandatory.

I will report back later about my experience with more complex applications.
Many thanks


ezerk commented Nov 19, 2024

I created a simple Dockerfile for comfyanonymous/ComfyUI based on nvidia/cuda:12.6.2-base-ubuntu24.04 for testing.

Note: this app uses PyTorch, so I also compiled CRIU with this workaround - #4 (comment).

I was able to checkpoint and restore successfully on the same host.
I still don't have a proper benchmark comparing container startup+warmup vs. restore+cuda-checkpoint toggle, but it seems to have an advantage.
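
A rough comparison could be sketched like this. The container name and archive path are taken from the commands earlier in this thread; note that this measures only wall-clock time until podman returns, not actual application warm-up, which is app-specific:

```shell
#!/usr/bin/env bash
# Rough timing sketch: cold start vs. restore from a checkpoint archive.
# Measures wall-clock time until podman returns, not real warm-up.
set -euo pipefail

time_cold_start() {
  local t0 t1
  t0=$(date +%s%N)
  sudo podman run -d --name cuda-counter --device nvidia.com/gpu=all \
    --security-opt=label=disable quay.io/radostin/cuda-counter >/dev/null
  t1=$(date +%s%N)
  echo "cold start: $(( (t1 - t0) / 1000000 )) ms"
}

time_restore() {
  local t0 t1
  t0=$(date +%s%N)
  sudo podman container restore -i /tmp/test.tar >/dev/null
  t1=$(date +%s%N)
  echo "restore: $(( (t1 - t0) / 1000000 )) ms"
}

# Usage (not run here): time_cold_start; checkpoint; podman rm -f; time_restore
```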
