
Failed to checkpoint dump container using GPU - Unable to connect a transport socket: Permission denied #19

Open
ezerk opened this issue Nov 14, 2024 · 6 comments

Comments


ezerk commented Nov 14, 2024

Hi,
My end goal is to checkpoint containers that use the GPU during the CI phase and later deploy them with Kubernetes, in order to reduce pod warm-up time.

So far I have experimented locally and was able to dump a GPU process (even a relatively complex one, using the workaround suggested in #4),

but when I run the app as a container, it fails at an early stage of the dump.

Steps to reproduce:

Take the example code counter.cu provided in this repo and wrap it in a container using this Dockerfile:

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder
COPY counter.cu .
RUN nvcc counter.cu -o /tmp/counter

FROM nvidia/cuda:12.4.1-base-ubuntu22.04 AS main
WORKDIR /app
COPY --from=builder /tmp/counter /app/counter
EXPOSE 10000
ENTRYPOINT ["/app/counter"]

I'm using Podman since it provides the option to checkpoint to an image, which I hope to use later on.

  • Clone this repo and place the Dockerfile under the src folder
  • Build a local image: sudo podman build -t counter .
  • Run the image: sudo podman run -p 10000:10000/udp --gpus=all --name=counter counter
  • Running nvidia-smi (on the host machine) shows the PID as expected:
PID: 192578 /app/counter
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             32W /   70W |     103MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    192578      C   /app/counter                                  100MiB |
+-----------------------------------------------------------------------------------------+
  • Run sudo cuda-checkpoint --toggle --pid $PID - it completes successfully
  • Validate with nvidia-smi that the process is offloaded from the GPU
Output shows `No running processes found`, as expected:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             33W /   70W |       1MiB /  15360MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • Trying to checkpoint with Podman, sudo podman container checkpoint counter --create-image=counter_checkpoint, fails (using other flags provided by podman checkpoint resulted in the same error):
CRIU checkpointing failed -52.  Please check CRIU logfile .. 

662 (00.007387) Error (criu/mount.c:757): mnt: 637:./usr/lib/firmware/nvidia/550.127.05/gsp_tu10x.bin doesn't have a proper root mount
663 (00.007403) net: Unlock network
664 (00.007407) Running network-unlock scripts
665 (00.024091) Unfreezing tasks into 1
666 (00.024103)     Unseizing 192578 into 1
667 (00.024161) Error (criu/cr-dump.c:2111): Dumping FAILED.
  • Trying to use the underlying criu command directly, sudo criu dump --shell-job --images-dir dump --external 'mnt[]:sm' -vvvv -o dump2.log, I can get some progress (thanks to the --external 'mnt[]:sm' flag) but it still fails, this time with:
(00.017847) Putting tsock into pid 192578
(00.018191) Error (compel/src/lib/infect.c:713): Unable to connect a transport socket: Permission denied
(00.018200) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018202) Error (compel/src/lib/ptrace.c:96): Can't poke 192578 @ 0x55887f472000 from 0x7ffc2ad59a58 sized 8
(00.018204) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018206) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.018207) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 192578)
(00.018209) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection
(00.018215) Error (criu/cr-dump.c:1610): Can't infect (pid: 192578) with parasite
(00.018276) net: Unlock network
(00.018279) Running network-unlock scripts
(00.034182) Unfreezing tasks into 1
(00.034194)     Unseizing 192578 into 1
(00.034200) Error (compel/src/lib/infect.c:418): Unable to detach from 192578: No such process
(00.034239) Error (criu/cr-dump.c:2111): Dumping FAILED.
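
For reference, the toggle-then-dump sequence from the steps above can be sketched as a small helper script. The function names are mine, and the `counter` container name comes from this reproduction; only `podman inspect` and `cuda-checkpoint` themselves are real tools:

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps above; assumes a running Podman
# container named "counter" and cuda-checkpoint available on the PATH.
set -euo pipefail

# Resolve the container's main PID as seen from the host.
container_pid() {
  sudo podman inspect --format '{{.State.Pid}}' "$1"
}

# Suspend CUDA state for the container's main process before dumping.
toggle_gpu() {
  sudo cuda-checkpoint --toggle --pid "$(container_pid "$1")"
}

# Usage (not run here):
#   toggle_gpu counter
#   nvidia-smi   # should report "No running processes found"
#   sudo podman container checkpoint counter --create-image=counter_checkpoint
```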

Note: the same error is reproduced both with the RPM-installed CRIU (criu-3.19-1.el9.x86_64) and with CRIU v4 compiled from the latest commit on the criu-dev branch.

Complete CRIU log files:

Spec
criu v3.19 (rpm version)
$ criu_ORIG --version
Version: 3.19
$ sudo criu_ORIG check --all
Warn  (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
criu v4 compiled version
$ criu --version
Version: 4.0
GitID: v4.0-23-gf6baf8143
$ sudo criu check --all
Warn  (criu/cr-check.c:1348): Nftables based locking requires libnftables and set concatenations support
Error (criu/cr-check.c:1553): unmatched dev:ino 0:38:9 (expected 0:39:9)
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
CentOS host details:
$ uname -mor
5.14.0-522.el9.x86_64 x86_64 GNU/Linux

$ cat /etc/system-release
CentOS Stream release 9

rst0git commented Nov 14, 2024

Driver Version: 550.127.05

@ezerk Would you be able to update your driver version to 555 or 560?

The following readme file provides more information:
https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/cuda


ezerk commented Nov 14, 2024

@rst0git - thanks for the super-fast reply - much appreciated!
I will give it a try and report back.


ezerk commented Nov 18, 2024

Upgraded the driver on the host twice, to 555 and then to 560 - still getting the same error (tried with both CRIU v3.19 and v4).

See driver details:

nvidia-driver.x86_64 3:555.42.06-1.el9 @cuda-rhel9-x86_64
dnf module install nvidia-driver:555-open

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+

I've also tried running images based on nvidia/cuda:12.6.2-base-ubuntu24.04 and on nvidia/cuda:12.5.0-base-ubuntu24.04.

nvidia-driver.x86_64 3:560.35.03-1.el9 @cuda-rhel9-x86_64
dnf module install nvidia-driver:560-open

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+


rst0git commented Nov 18, 2024

(00.018191) Error (compel/src/lib/infect.c:713): Unable to connect a transport socket: Permission denied
(00.018200) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018202) Error (compel/src/lib/ptrace.c:96): Can't poke 192578 @ 0x55887f472000 from 0x7ffc2ad59a58 sized 8
(00.018204) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018206) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.018207) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 192578)
(00.018209) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection

The error above is unrelated. It occurs because SELinux is enabled and prevents the CRIU parasite code from writing to the log file descriptor.

trying to use the underlying criu command directly

I would not recommend this approach because specifying all required CRIU options would be very challenging.

@ezerk Would you be able to try the following with CRIU v4.0 and CUDA plugin?

mkdir -p /etc/criu
echo -e "tcp-established\nghost-limit=100M\ntimeout=300" | sudo tee /etc/criu/runc.conf
sed -i 's/#runtime = "crun"/runtime = "runc"/' /usr/share/containers/containers.conf

sudo podman run -d --name cuda-counter --device nvidia.com/gpu=all --security-opt=label=disable \
        quay.io/radostin/cuda-counter

sudo podman logs -l
sudo podman container checkpoint -l -e /tmp/test.tar
podman rm -f cuda-counter
sudo podman container restore -i /tmp/test.tar
sudo podman logs -l
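
Before running the sequence above, it may help to verify the two preconditions it relies on: runc as the active OCI runtime, and SELinux not blocking the parasite injection. This is a hedged sketch; the check_* helper names are mine, but `podman info` and `getenforce` are real commands:

```shell
#!/usr/bin/env bash
# Sanity checks for the preconditions of the checkpoint sequence above:
# Podman must use runc, and an enforcing SELinux requires running the
# container with --security-opt=label=disable.
set -euo pipefail

check_runtime() {
  # podman info reports the active OCI runtime by name.
  sudo podman info --format '{{.Host.OCIRuntime.Name}}'
}

check_selinux() {
  # "Enforcing" means labeling must be disabled for the container.
  getenforce 2>/dev/null || echo "selinux tools not installed"
}

# Usage (not run here):
#   [ "$(check_runtime)" = runc ] || echo "set runtime = \"runc\" in containers.conf"
#   check_selinux
```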


ezerk commented Nov 18, 2024

It worked!
Changing runtime = "runc" in /usr/share/containers/containers.conf did the trick.
This flag is also mandatory: podman run --security-opt=label=disable.

The changes to /etc/criu/runc.conf do not seem to be mandatory.

I will report back later about my experience with more complex applications.
Many thanks


ezerk commented Nov 19, 2024

I created a simple Dockerfile for comfyanonymous/ComfyUI based on nvidia/cuda:12.6.2-base-ubuntu24.04 for testing.

Note: this app uses PyTorch, so I also compiled CRIU with this workaround - #4 (comment).

I was able to checkpoint and restore successfully on the same host.
I still don't have a proper benchmark comparing container startup+warmup vs. restore+cuda-checkpoint toggle, but it seems to have an advantage.
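
A rough comparison could be sketched like this. The container name and archive path are taken from the commands earlier in this thread; note that this measures only wall-clock time until podman returns, not actual application warm-up, which is app-specific:

```shell
#!/usr/bin/env bash
# Rough timing sketch: cold start vs. restore from a checkpoint archive.
# Measures wall-clock time until podman returns, not real warm-up.
set -euo pipefail

time_cold_start() {
  local t0 t1
  t0=$(date +%s%N)
  sudo podman run -d --name cuda-counter --device nvidia.com/gpu=all \
    --security-opt=label=disable quay.io/radostin/cuda-counter >/dev/null
  t1=$(date +%s%N)
  echo "cold start: $(( (t1 - t0) / 1000000 )) ms"
}

time_restore() {
  local t0 t1
  t0=$(date +%s%N)
  sudo podman container restore -i /tmp/test.tar >/dev/null
  t1=$(date +%s%N)
  echo "restore: $(( (t1 - t0) / 1000000 )) ms"
}

# Usage (not run here): time_cold_start; checkpoint; podman rm -f; time_restore
```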
