Support for multiple NVIDIA GPUs #901

Open
chris-gputrader opened this issue Jan 17, 2025 · 3 comments

Comments

@chris-gputrader

I'm encountering an issue when using the sysbox runtime with containers that require NVIDIA GPU access on a system with multiple GPUs. The setup works on a single-GPU machine but fails when deployed on a multi-GPU machine.

The container should have access to all or specific GPUs as defined in the Docker Compose file, with GPU devices and drivers properly passed through by the sysbox runtime.

When deploying a container on the multi-GPU system, the following error occurs:

Failed to deploy a stack: compose up operation failed: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: container_linux.go:439: starting container process caused: process_linux.go:608: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/e5409caee5c762014641d9a3fa7981fc960b3c2309980dda0e6b5d87b096a649/merged/proc/driver/nvidia: no such file or directory: unknown
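When triaging this kind of report, it can help to pull the failing mount target out of the long daemon error. The following is a minimal sketch, not part of Sysbox or the NVIDIA tooling: `failed_mount_target` is a hypothetical helper, the regular expression is only based on the error text shown above, and it assumes Docker's overlay2 layout, where the container rootfs ends in `/merged`. The `<id>` in the sample stands in for the layer hash.

```python
import re

# Matches the path reported by nvidia-container-cli in errors of the form
# "mount operation failed: <path>: no such file or directory".
_MOUNT_ERR = re.compile(r"mount operation failed: (\S+): no such file or directory")


def failed_mount_target(stderr: str):
    """Return (container_rootfs, path_inside_container), or None if no match."""
    match = _MOUNT_ERR.search(stderr)
    if match is None:
        return None
    full_path = match.group(1)
    # Assumption: the overlay2 driver exposes the container rootfs at ".../merged".
    rootfs, sep, inner = full_path.partition("/merged")
    if not sep:
        # Unknown layout: return the whole path and no in-container part.
        return full_path, ""
    return rootfs + sep, inner


err = (
    "nvidia-container-cli: mount error: mount operation failed: "
    "/var/lib/docker/overlay2/<id>/merged/proc/driver/nvidia: "
    "no such file or directory: unknown"
)
print(failed_mount_target(err))
```

For the error above, the in-container part is `/proc/driver/nvidia`, which is the mount point the hook could not find inside the container's rootfs.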

@ctalledo
Member

Hi @chris-gputrader, thanks for reporting.

What distro and kernel are you on?

lsb_release -a
uname -a

Also, can you provide the sysbox-mgr log (journalctl -u sysbox-mgr), particularly the first ~10 lines after it starts? They look like this:

Jan 24 21:57:17 lenovo systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Jan 24 21:57:17 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Starting ..."
Jan 24 21:57:17 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Sysbox data root: /var/lib/sysbox"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Shiftfs module found in kernel: no"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Shiftfs works properly: no"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Shiftfs-on-overlayfs works properly: no"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="ID-mapped mounts supported by kernel: yes"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Overlayfs on ID-mapped mounts supported by kernel: yes"
...

I want to see if shiftfs and/or ID-mapped mounts are working.
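The yes/no lines in that log can be collected programmatically, which is handy when checking several hosts. This is a minimal sketch, assuming only the `msg="<name>: yes|no"` format visible in the log excerpt above; `capability_flags` is a hypothetical helper, not a Sysbox API.

```python
import re

# Matches sysbox-mgr lines such as: msg="Shiftfs works properly: no"
_FLAG = re.compile(r'msg="([^"]+): (yes|no)"')


def capability_flags(log_text: str) -> dict:
    """Map each yes/no capability message in a sysbox-mgr log to a bool."""
    flags = {}
    for line in log_text.splitlines():
        match = _FLAG.search(line)
        if match:
            # If the log covers several restarts, the latest value wins.
            flags[match.group(1)] = match.group(2) == "yes"
    return flags


sample = '''\
time="2025-01-24 21:57:17" level=info msg="Sysbox data root: /var/lib/sysbox"
time="2025-01-24 21:57:17" level=info msg="Shiftfs module found in kernel: no"
time="2025-01-24 21:57:17" level=info msg="Shiftfs works properly: no"
time="2025-01-24 21:57:17" level=info msg="Shiftfs-on-overlayfs works properly: no"
time="2025-01-24 21:57:17" level=info msg="ID-mapped mounts supported by kernel: yes"
time="2025-01-24 21:57:17" level=info msg="Overlayfs on ID-mapped mounts supported by kernel: yes"
'''

for name, value in capability_flags(sample).items():
    print(f"{name}: {value}")
```

Lines that the journalctl pager has cut off (ending in `>`) do not match the pattern and are skipped, so it is safer to capture the log with `journalctl -u sysbox-mgr --no-pager`.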

Thanks!

@JerFree

JerFree commented Jan 25, 2025

I had the same issue.

root@ecs-19330674:~#  docker run  --gpus all --runtime=sysbox-runc --rm -it --hostname=syscont nestybox/ubuntu-bionic-systemd 
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: container_linux.go:439: starting container process caused: process_linux.go:608: container init caused: Running hook #0:: error running hook: '
nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/6e296a8ed6217cbc604509faa29a9ee7533f6ec5e71fda80632dea66c09f808f/merged/proc/driver/nvidia: no such file or directory: unknown.

root@ecs-19330674:~# journalctl -u sysbox-mgr
Jan 26 01:10:19 ecs-19330674 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Starting ..."
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Sysbox data root: /var/lib/sysbox"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Shiftfs module found in kernel: yes"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Shiftfs works properly: no"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Shiftfs-on-overlayfs works properly: yes"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="ID-mapped mounts supported by kernel: yes"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Overlayfs on ID-mapped mounts supported by kernel: no"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Operating in system container mode."
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Relaxed read-only mode disabled."
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Inner container image preloading enabled."
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Listening on /run/sysbox/sysmgr.sock"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Ready ..."
Jan 26 01:10:20 ecs-19330674 systemd[1]: Started sysbox-mgr (part of the Sysbox container runtime).
Jan 26 01:12:17 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:12:17" level=info msg="registered new container c00103f71e59"
Jan 26 01:13:53 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:13:53" level=info msg="unregistered container c00103f71e59"
Jan 26 01:13:53 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:13:53" level=info msg="released resources for container c00103f71e59"
Jan 26 01:14:29 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:14:29" level=info msg="registered new container a9b940e80df7"
Jan 26 01:14:30 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:14:30" level=info msg="unregistered container a9b940e80df7"
Jan 26 01:14:30 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:14:30" level=info msg="released resources for container a9b940e80df7"
Jan 26 01:15:43 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:15:43" level=info msg="registered new container 33d34a93d64c"
Jan 26 01:27:47 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:27:47" level=info msg="unregistered container 33d34a93d64c"
Jan 26 01:27:48 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:27:48" level=info msg="released resources for container 33d34a93d64c"
Jan 26 04:58:26 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 04:58:26" level=info msg="registered new container 029902e180f1"
Jan 26 04:59:40 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 04:59:40" level=info msg="registered new container 01db8efcb24d"
Jan 26 05:00:14 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 05:00:14" level=info msg="registered new container 3e1560640e8a"
Jan 26 05:00:15 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 05:00:15" level=info msg="unregistered container 3e1560640e8a"
Jan 26 05:00:15 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 05:00:15" level=info msg="released resources for container 3e1560640e8a"

Thanks.

@chris-gputrader
Author

Here is the information you requested:

bootstrap@dsmnva100dgx0297:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.5 LTS
Release:        22.04
Codename:       jammy
bootstrap@dsmnva100dgx0297:~$ uname -a
Linux dsmnva100dgx0297 5.15.0-1071-nvidia #72-Ubuntu SMP Thu Jan 16 00:47:54 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Feb 14 17:12:26 dsmnva100dgx0297 systemd[1]: Stopping sysbox-mgr (part of the Sysbox container runtime)...
Feb 14 17:12:26 dsmnva100dgx0297 sysbox-mgr[3476255]: time="2025-02-14 17:12:26" level=info msg="Caught OS signal: terminated"
Feb 14 17:12:26 dsmnva100dgx0297 sysbox-mgr[3476255]: time="2025-02-14 17:12:26" level=info msg="Stopping (gracefully) ..."
Feb 14 17:12:26 dsmnva100dgx0297 sysbox-mgr[3476255]: time="2025-02-14 17:12:26" level=warning msg="The following containers are active and will s>
Feb 14 17:12:26 dsmnva100dgx0297 sysbox-mgr[3476255]: time="2025-02-14 17:12:26" level=warning msg="container id: 8094839157d1"
Feb 14 17:12:26 dsmnva100dgx0297 sysbox-mgr[3476255]: time="2025-02-14 17:12:26" level=error msg="failed to remove bind mounts over orig rootfs: i>
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3476255]: time="2025-02-14 17:12:28" level=info msg=Stopped.
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3476255]: time="2025-02-14 17:12:28" level=info msg=Exiting.
Feb 14 17:12:28 dsmnva100dgx0297 systemd[1]: sysbox-mgr.service: Deactivated successfully.
Feb 14 17:12:28 dsmnva100dgx0297 systemd[1]: Stopped sysbox-mgr (part of the Sysbox container runtime).
Feb 14 17:12:28 dsmnva100dgx0297 systemd[1]: sysbox-mgr.service: Consumed 58.754s CPU time.
Feb 14 17:12:28 dsmnva100dgx0297 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Starting ..."
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Sysbox data root: /var/lib/sysbox"
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Shiftfs module found in kernel: yes"
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Shiftfs works properly: no"
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Shiftfs-on-overlayfs works properly: yes"
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="ID-mapped mounts supported by kernel: yes"
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Overlayfs on ID-mapped mounts supported by kernel>
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Operating in system container mode."
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Relaxed read-only mode disabled."
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Inner container image preloading enabled."
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Listening on /run/sysbox/sysmgr.sock"
Feb 14 17:12:28 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:28" level=info msg="Ready ..."
Feb 14 17:12:28 dsmnva100dgx0297 systemd[1]: Started sysbox-mgr (part of the Sysbox container runtime).
Feb 14 17:12:36 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:36" level=info msg="registered new container 73ee23b60227"
Feb 14 17:12:44 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:44" level=info msg="unregistered container 73ee23b60227"
Feb 14 17:12:46 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:12:46" level=info msg="released resources for container 73ee23b60227"
Feb 14 17:18:49 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:18:49" level=info msg="registered new container fc8d6e805cbb"
Feb 14 17:18:49 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:18:49" level=info msg="unregistered container fc8d6e805cbb"
Feb 14 17:18:49 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:18:49" level=info msg="released resources for container fc8d6e805cbb"
Feb 14 17:20:34 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:20:34" level=info msg="registered new container 9c1ae3dcbb15"
Feb 14 17:21:44 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:21:44" level=info msg="unregistered container 9c1ae3dcbb15"
Feb 14 17:21:45 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:21:45" level=info msg="released resources for container 9c1ae3dcbb15"
Feb 14 17:22:03 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:22:03" level=info msg="registered new container 8a552937731e"
Feb 14 17:22:04 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:22:04" level=info msg="unregistered container 8a552937731e"
Feb 14 17:22:05 dsmnva100dgx0297 sysbox-mgr[3570320]: time="2025-02-14 17:22:05" level=info msg="released resources for container 8a552937731e"

The strange thing is that on some machines sysbox with NVIDIA passthrough seems to work fine, while on others it doesn't. I can't figure out the cause.
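One way to narrow this down is to diff the capability flags that sysbox-mgr reports on each machine. The sketch below is only an illustration: `differing_flags` is a hypothetical helper, and the two dictionaries are copied from the complete startup logs posted earlier in this thread (hosts `lenovo` and `ecs-19330674`). A difference here does not by itself identify the cause of the mount error; it just shows where the hosts diverge.

```python
def differing_flags(host_a: dict, host_b: dict) -> dict:
    """Return {flag: (value_on_a, value_on_b)} for flags that differ.

    A flag missing on one side is reported with None for that side.
    """
    diff = {}
    for name in sorted(set(host_a) | set(host_b)):
        a, b = host_a.get(name), host_b.get(name)
        if a != b:
            diff[name] = (a, b)
    return diff


# Values taken from the sysbox-mgr startup lines quoted in this thread.
lenovo = {
    "Shiftfs module found in kernel": False,
    "Shiftfs works properly": False,
    "Shiftfs-on-overlayfs works properly": False,
    "ID-mapped mounts supported by kernel": True,
    "Overlayfs on ID-mapped mounts supported by kernel": True,
}
ecs = {
    "Shiftfs module found in kernel": True,
    "Shiftfs works properly": False,
    "Shiftfs-on-overlayfs works properly": True,
    "ID-mapped mounts supported by kernel": True,
    "Overlayfs on ID-mapped mounts supported by kernel": False,
}

for name, (a, b) in differing_flags(lenovo, ecs).items():
    print(f"{name}: lenovo={a} ecs={b}")
```

Running the same comparison between a host where GPU passthrough works and one where it fails would show whether the two differ in shiftfs or ID-mapped mount support at all.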
