docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed #1648
Comments
@zyr-NULL to confirm whether this is a regression or not, would you be able to repeat the test after downgrading to …?
@elezar I made …
Sorry, I should have made it clearer. Could you also downgrade …?
@elezar this is the same error:
What happens when you run …? Also, this warning in the log seems suspicious:
Is there something unconventional about how you are running docker on your host? For example, rootless, in a snap sandbox, etc.
@klueska
I run docker in root mode and I used …
If it's taking that long to get a result from nvidia-smi on the host, then I could understand why the RPC might time out in the nvidia-container-cli when trying to get a result from the driver RPC call. I'm assuming this is happening because (1) you are not running the nvidia-persistenced daemon and (2) your GPUs are not in persistence mode. Both (1) and (2) achieve the same thing, but (1) is the preferred method to keep the GPU driver alive even when no clients are attached. Try enabling one of these methods and report back.
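For anyone landing here later, a minimal sketch of the two options mentioned above (assuming a systemd-based distro where the driver package ships an nvidia-persistenced unit; the service name may differ on your system):

```
# Option 1 (preferred): run the persistence daemon so the driver stays initialized
sudo systemctl enable --now nvidia-persistenced

# Option 2: enable persistence mode directly on all GPUs
# (legacy mechanism; does not survive a reboot unless re-applied)
sudo nvidia-smi -pm 1

# Verify: the persistence_mode query should now report "Enabled"
nvidia-smi --query-gpu=persistence_mode --format=csv
```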
@klueska Thanks, my problem has been solved by enabling persistence mode.
@zyr-NULL given your comment above I am closing this issue. Please reopen if there is still a problem.
@klueska Yeah dude thanks =)
I met this problem when I used a shell to run …; after I added …
I tried using …
Here are the logs:
Any ideas @klueska? 🥲 Here's the error by the way:
I have the same problem as @aurelien-m when I run …
I met this problem too. My GPU is an A30 and the GPU driver is 525.85.12.
@fighterhit this may be related to changes in the GSP firmware paths and should be addressed in the v1.13.0-rc.1 release. Would you be able to give it a try? |
Hi @elezar , yes I can, but do I need to restart the node or the containers on the node? It may need to be deployed in our production environment. Or does this problem exist in previous versions too? I could also accept a lower version :).
Updating the … If this is the firmware issue that I think it is, then a lower version would not work. What is the output of …?
If there is a single …
@elezar The output is:
It seems that there is no such …
Hi @elezar , do you have any more suggestions? Thanks! |
The path is /lib/firmware/… |
Thanks @klueska 😅, there are indeed two …
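For reference, a quick way to list the GSP firmware files shipped with the installed driver. The exact directory layout varies by driver packaging, so treat the path below as an assumption:

```
# On 525-series drivers there are typically two GSP firmware blobs
# (e.g. gsp_ga10x.bin and gsp_tu10x.bin) under a versioned directory.
ls -l /lib/firmware/nvidia/*/gsp*.bin
```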
Hi @klueska , can the solution provided by the …?
The new RC adds support for detecting multiple GSP firmware files, which is required for container support to work correctly on the 525 driver. The persistenced issue is still relevant but this is a new one related to the latest NVIDIA driver. |
Thanks @klueska , how can I install this version?
Does the persistenced issue only appear on the latest driver (525.85.12)? Can I solve it by downgrading to a certain driver version? |
Persistenced is always needed, but the firmware issue could be "resolved" by downgrading. That said, I'd recommend updating the nvidia-container-toolkit to ensure compatibility with all future drivers. Since the latest toolkit is still an RC it is not yet in our stable apt repo. You will need to configure our experimental repo to get access to it. Instructions here: …
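The linked instructions are not reproduced above; at the time they amounted to roughly the following for apt-based distros. This is a sketch based on the libnvidia-container repo layout discussed later in this thread, and the URLs and exact steps may have changed since:

```
# Derive the distribution string (e.g. debian11, ubuntu22.04)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)

# Add the repository signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

# Configure the *experimental* apt repo (needed while the fix is only in an RC)
curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
```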
Thank you for your help @klueska. What confuses me is that in the past, our GPU clusters (1080Ti, 2080Ti, 3090, A30, A100) have not turned on persistence mode and there was no such problem; only the A30 nodes (with driver 525.85.12) have this problem, and none of the other GPU nodes (1080Ti, 2080Ti, 3090, with driver 525.78.01) do.
Hi @klueska @elezar , when I configure the experimental repo I get the following error and can't install the …
@fighterhit what distribution are you using? Please ensure that the …
He said he's on debian11.
@elezar My distribution is …
Hi @klueska @elezar , I tested the …
Maybe …
For debian, yes, it should be … This file gets installed under … The correct config file should have been selected automatically based on your distribution; were you not able to install the debian11 one directly?
Yes @klueska , I failed to install the latest version using the experimental repo (#1648 (comment)), so I followed @elezar's advice and set the distro to …
Sorry about that. I was making an assumption about the distribution you are using. You can install the …
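A sketch of what that manual override looks like in practice, assuming the same experimental repo as above and debian11 as the target. This mirrors what the surrounding comments describe rather than an officially documented step:

```
# Pin the distro string instead of relying on auto-detection from /etc/os-release
distribution=debian11
curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```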
Thanks @elezar ! I have tried it and it works fine. It may be better if …
@fighterhit I have just double-checked our repo configuration and the issue is that for …. I have created a link / redirect for this now and the official instructions should work as expected. (It may take about 30 minutes for the changes to reflect in the repo though.)
Thanks for your confirmation, it works now. @elezar |
I am trying to install the NVIDIA container toolkit on Amazon Linux 2. I created a new EC2 instance and followed the instructions on this page, but I am running into the issue below. I get the same error when I try this on an Ubuntu instance as well.
@elezar @klueska please advise how to fix this issue; I'd appreciate your input. Thanks!
Hi @elezar @klueska , unfortunately I have used the latest toolkit but this problem reappeared. I think this may be related to the driver, and I asked the driver community for help but got no further reply (https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446, https://forums.developer.nvidia.com/t/timeout-waiting-for-rpc-from-gsp/244789). Could you communicate with the driver team about this issue? Some users in the community have also encountered the same problem. Thanks!
I was getting similar errors trying to run … I noticed that … I then ran across the following forum post, which led me to think that perhaps my drivers had somehow been disabled/corrupted by a recent system and/or kernel upgrade, so I decided to reinstall (and upgrade) the nvidia driver using the PPA, and …
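A sketch of that driver reinstall path on Ubuntu, assuming the graphics-drivers PPA and a 525-series driver. The package version here is an example; use whatever ubuntu-drivers recommends for your GPU:

```
# Add the PPA that carries newer NVIDIA driver builds
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

# See which driver version is recommended for this GPU
ubuntu-drivers devices

# Reinstall/upgrade the driver (525 is an example version), then reboot
sudo apt-get install --reinstall nvidia-driver-525
sudo reboot
```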
I was having the same issue on Ubuntu Server 22.04 and docker-compose. Reinstalling docker with apt (instead of snap) solved my problem. |
Uninstalling docker desktop and installing using apt worked for me |
For some reason, reinstalling docker helped me as well, by executing the following (Ubuntu 22.04):
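The exact commands were not captured above; here is a minimal sketch of the snap-to-apt reinstall that the last few comments describe. Package names follow Docker's official apt instructions, so treat them as an assumption about what was actually run:

```
# Remove the snap-packaged docker, which is the usual culprit in these reports
sudo snap remove docker

# Remove any distro-packaged leftovers
sudo apt-get remove -y docker docker.io containerd runc

# Install Docker Engine from Docker's own apt repository
# (repository setup per https://docs.docker.com/engine/install/ubuntu/)
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Re-run the failing command to confirm the runtime hook now works
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```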
@bkocis Thanks, that worked! I uninstalled docker with the official docker instructions, then reinstalled from the same link and it's now all working again. |
In my case, this kind of issue happened when another docker was installed through snap.
Issue or feature description
When I use docker to create a container, I get this error:
Steps to reproduce the issue
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: driver rpc error: timed out: unknown.
sudo docker run hello-world
Here is some information:
Kernel version
Linux gpu-server 4.15.0-187-generic #198-Ubuntu SMP Tue Jun 14 03:23:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Driver information