Failed to checkpoint dump container using GPU - Unable to connect a transport socket: Permission denied #19
Comments
@ezerk Would you be able to update your driver version to 555 or 560? The following readme file provides more information:
@rst0git, thanks for the super fast reply, much appreciated!
I upgraded the driver on the host twice, to 555 and to 560, and I am still getting the same error (tried with both criu v3.19 and v4). See driver details.
The error above is unrelated. It occurs because SELinux is enabled and prevents the CRIU parasite code from writing to the log file descriptor.
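To check whether SELinux is in the way on a given host, a minimal sketch along these lines may help. It assumes the standard `getenforce` tool from the SELinux utilities is installed; the helper name `selinux_mode` is made up for illustration.

```sh
#!/bin/sh
# Report the current SELinux mode, or "unavailable" when the SELinux
# tools are not installed on this host.
# (selinux_mode is a hypothetical helper name, not part of CRIU or podman.)
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce    # prints Enforcing, Permissive, or Disabled
    else
        echo "unavailable"
    fi
}

mode=$(selinux_mode)
echo "SELinux mode: $mode"

if [ "$mode" = "Enforcing" ]; then
    # Diagnostic hint only; nothing is changed by this script.
    echo "SELinux is enforcing; 'sudo setenforce 0' switches to permissive until the next reboot"
fi
```

Switching to permissive mode is only a way to confirm the diagnosis; it is not meant as a recommendation for a permanent setup.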
I would not recommend this approach because specifying all of the required CRIU options would be very challenging. @ezerk Would you be able to try the following with CRIU v4.0 and the CUDA plugin?
It worked! changes to ... I will post an update later about my experience with more complex applications.
I created a simple Dockerfile for comfyanonymous/ComfyUI based on ... Note: this is an app using PyTorch, so I also compiled criu with this workaround: #4 (comment). I was able to checkpoint and restore successfully on the same host.
Hi,
My end goal is to checkpoint containers that use a GPU during the CI phase and later deploy them using Kubernetes, in order to reduce pod warm-up time.
So far I have experimented locally and was able to dump a GPU process (even a relatively complex one, using the workaround suggested in #4),
but when I run the app as a container, the dump fails at an early stage.
Steps to reproduce:
1. Follow the example code provided in this repo (counter.cu) and wrap it in a container using this Dockerfile. (I'm using podman since it provides the option to checkpoint to an image, which I hope to use later on.)
2. From the `src` folder: `sudo podman build -t counter .`
3. `sudo podman run -p 10000:10000/udp --gpus=all --name=counter counter`
4. `nvidia-smi` (on the host machine) shows the PID as expected. PID: 192578 /app/counter
5. `sudo cuda-checkpoint --toggle --pid $PID` completes successfully.
6. `nvidia-smi` confirms that the process is offloaded from the GPU. Output: `No running processes found`, as expected.
7. `sudo podman container checkpoint counter --create-image=counter_chechpoint` fails. (Using other flags provided by podman checkpoint resulted in the same error.)
8. Using the `criu` command directly, `sudo criu dump --shell-job --images-dir dump --external 'mnt[]:sm' -vvvv -o dump2.log`, I can get some progress (due to the `--external 'mnt[]:sm'` flag), but it still fails, this time with:

Note: the same error is reproduced with both the RPM-installed criu (criu-3.19-1.el9.x86_64) and criu v4 compiled from the latest commit on the criu-dev branch.
complete criu log files:
Spec
criu v3.19 (rpm version)
$ sudo criu_ORIG check --all
Warn (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
criu v4 compiled version
$ sudo criu check --all
Warn (criu/cr-check.c:1348): Nftables based locking requires libnftables and set concatenations support
Error (criu/cr-check.c:1553): unmatched dev:ino 0:38:9 (expected 0:39:9)
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
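The two `criu check --all` outputs above differ only in the extra `Error` line from the v4 build. A small filter like the following could surface such lines when comparing builds. It is a sketch: `criu_check_errors` is a hypothetical helper, and the sample input is copied from the v4 output above rather than produced by running criu.

```sh
#!/bin/sh
# Print only the "Error (...)" lines from criu check output read on stdin.
# Returns success even when there are no error lines.
criu_check_errors() {
    grep '^Error (' || true
}

# Demo using the v4 output quoted above as sample input.
criu_check_errors <<'EOF'
Warn (criu/cr-check.c:1348): Nftables based locking requires libnftables and set concatenations support
Error (criu/cr-check.c:1553): unmatched dev:ino 0:38:9 (expected 0:39:9)
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
EOF
```

On a real host the same filter would be fed from the command itself, for example `sudo criu check --all 2>&1 | criu_check_errors`.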
CentOS host details: