This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed #1648

Closed
zyr-NULL opened this issue Jun 21, 2022 · 45 comments

Comments

@zyr-NULL

zyr-NULL commented Jun 21, 2022

Issue or feature description

When I use Docker to create a container, I get this error:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: timed out: unknown.

Steps to reproduce the issue

  • When I executed the following command:
    sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
  • I got the following error:
    docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: driver rpc error: timed out: unknown.
  • But when I executed the following command, it produced the expected output:
    sudo docker run hello-world

Here is some information:

  • Some nvidia-container information
gpu-server@gpu-server:~$ nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0621 07:07:51.735875 4789 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0621 07:07:51.735941 4789 nvc.c:350] using root /
I0621 07:07:51.735947 4789 nvc.c:351] using ldcache /etc/ld.so.cache
I0621 07:07:51.735963 4789 nvc.c:352] using unprivileged user 1000:1000
I0621 07:07:51.735984 4789 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0621 07:07:51.736064 4789 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0621 07:07:51.739205 4791 nvc.c:273] failed to set inheritable capabilities
W0621 07:07:51.739329 4791 nvc.c:274] skipping kernel modules load due to failure
I0621 07:07:51.739807 4793 rpc.c:71] starting driver rpc service
W0621 07:08:16.774958 4789 rpc.c:121] terminating driver rpc service (forced)
I0621 07:08:20.481845 4789 rpc.c:135] driver rpc service terminated with signal 15
nvidia-container-cli: initialization error: driver rpc error: timed out
I0621 07:08:20.481972 4789 nvc.c:434] shutting down library context
  • Kernel version
    Linux gpu-server 4.15.0-187-generic #198-Ubuntu SMP Tue Jun 14 03:23:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Driver information

==============NVSMI LOG==============

Timestamp                                 : Tue Jun 21 07:13:57 2022
Driver Version                            : 515.48.07
CUDA Version                              : 11.7

Attached GPUs                             : 4
GPU 00000000:01:00.0
    Product Name                          : NVIDIA A100-SXM4-40GB
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1561221014674
    GPU UUID                              : GPU-b67da01e-feba-d839-62c5-2773d4e963f0
    Minor Number                          : 0
    VBIOS Version                         : 92.00.19.00.13
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : 692-2G506-0202-002
    Module ID                             : 3
    Inforom Version
        Image Version                     : G506.0202.00.02
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 515.48.07
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x20B010DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x144E10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 40960 MiB
        Reserved                          : 571 MiB
        Used                              : 0 MiB
        Free                              : 40388 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 1 MiB
        Free                              : 65535 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 32 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 34 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 54.92 W
        Power Limit                       : 400.00 W
        Default Power Limit               : 400.00 W
        Enforced Power Limit              : 400.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 400.00 W
    Clocks
        Graphics                          : 1080 MHz
        SM                                : 1080 MHz
        Memory                            : 1215 MHz
        Video                             : 975 MHz
    Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1215 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 731.250 mV
    Processes                             : None

GPU 00000000:41:00.0
    Product Name                          : NVIDIA A100-SXM4-40GB
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1561221014888
    GPU UUID                              : GPU-6ca82e47-c63a-1bea-38ad-d3af9e1dc26b
    Minor Number                          : 1
    VBIOS Version                         : 92.00.19.00.13
    MultiGPU Board                        : No
    Board ID                              : 0x4100
    GPU Part Number                       : 692-2G506-0202-002
    Module ID                             : 1
    Inforom Version
        Image Version                     : G506.0202.00.02
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 515.48.07
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x41
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x20B010DE
        Bus Id                            : 00000000:41:00.0
        Sub System Id                     : 0x144E10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 40960 MiB
        Reserved                          : 571 MiB
        Used                              : 0 MiB
        Free                              : 40388 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 1 MiB
        Free                              : 65535 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 30 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 40 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 57.45 W
        Power Limit                       : 400.00 W
        Default Power Limit               : 400.00 W
        Enforced Power Limit              : 400.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 400.00 W
    Clocks
        Graphics                          : 915 MHz
        SM                                : 915 MHz
        Memory                            : 1215 MHz
        Video                             : 780 MHz
    Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1215 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 700.000 mV
    Processes                             : None

GPU 00000000:81:00.0
    Product Name                          : NVIDIA A100-SXM4-40GB
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1561221015040
    GPU UUID                              : GPU-7e4b55d2-75fc-8ab5-e212-09e69e84704b
    Minor Number                          : 2
    VBIOS Version                         : 92.00.19.00.13
    MultiGPU Board                        : No
    Board ID                              : 0x8100
    GPU Part Number                       : 692-2G506-0202-002
    Module ID                             : 2
    Inforom Version
        Image Version                     : G506.0202.00.02
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 515.48.07
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x81
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x20B010DE
        Bus Id                            : 00000000:81:00.0
        Sub System Id                     : 0x144E10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 40960 MiB
        Reserved                          : 571 MiB
        Used                              : 0 MiB
        Free                              : 40388 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 1 MiB
        Free                              : 65535 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 32 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 33 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 54.65 W
        Power Limit                       : 400.00 W
        Default Power Limit               : 400.00 W
        Enforced Power Limit              : 400.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 400.00 W
    Clocks
        Graphics                          : 1080 MHz
        SM                                : 1080 MHz
        Memory                            : 1215 MHz
        Video                             : 975 MHz
    Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1215 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 712.500 mV
    Processes                             : None

GPU 00000000:C1:00.0
    Product Name                          : NVIDIA A100-SXM4-40GB
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1561221014695
    GPU UUID                              : GPU-66ba085a-5496-d204-d4da-aa9f112d3fd8
    Minor Number                          : 3
    VBIOS Version                         : 92.00.19.00.13
    MultiGPU Board                        : No
    Board ID                              : 0xc100
    GPU Part Number                       : 692-2G506-0202-002
    Module ID                             : 0
    Inforom Version
        Image Version                     : G506.0202.00.02
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 515.48.07
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xC1
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x20B010DE
        Bus Id                            : 00000000:C1:00.0
        Sub System Id                     : 0x144E10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 40960 MiB
        Reserved                          : 571 MiB
        Used                              : 0 MiB
        Free                              : 40388 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 1 MiB
        Free                              : 65535 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 30 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 38 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 59.01 W
        Power Limit                       : 400.00 W
        Default Power Limit               : 400.00 W
        Enforced Power Limit              : 400.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 400.00 W
    Clocks
        Graphics                          : 1080 MHz
        SM                                : 1080 MHz
        Memory                            : 1215 MHz
        Video                             : 975 MHz
    Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1215 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 737.500 mV
    Processes                             : None

  • docker version
Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:02:56 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.11
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.9
  Git commit:       847da18
  Built:            Thu Nov 18 00:35:16 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.6
  GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 nvidia:
  Version:          1.1.2
  GitCommit:        v1.1.2-0-ga916309
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • NVIDIA container library version
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • NVIDIA packages version
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                 Version                 Architecture            Description
+++-====================================-=======================-=======================-=============================================================================
ii  libnvidia-container-tools            1.10.0-1                amd64                   NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64           1.10.0-1                amd64                   NVIDIA container runtime library
un  nvidia-container-runtime             <none>                  <none>                  (no description available)
un  nvidia-container-runtime-hook        <none>                  <none>                  (no description available)
ii  nvidia-container-toolkit             1.10.0-1                amd64                   NVIDIA container runtime hook
un  nvidia-docker                        <none>                  <none>                  (no description available)
ii  nvidia-docker2                       2.7.0-1                 all                     nvidia-docker CLI wrapper
@elezar
Member

elezar commented Jun 21, 2022

@zyr-NULL to confirm whether this is a regression or not, would you be able to repeat the test after downgrading to nvidia-container-toolkit=1.9.0-1?
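
A minimal sketch of that downgrade test, assuming the 1.9.0-1 toolkit package is still available in the configured apt repository:

sudo apt-get install -y --allow-downgrades nvidia-container-toolkit=1.9.0-1    # pin the runtime hook to the older version
sudo systemctl restart docker                                                  # restart the daemon so the change takes effect
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi # repeat the failing command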

@zyr-NULL
Author

@elezar I downgraded to nvidia-container-toolkit=1.9.0-1, but the output remains the same.

  • Details are as follows
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                 Version                 Architecture            Description
+++-====================================-=======================-=======================-=============================================================================
ii  libnvidia-container-tools            1.10.0-1                amd64                   NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64           1.10.0-1                amd64                   NVIDIA container runtime library
un  nvidia-container-runtime             <none>                  <none>                  (no description available)
un  nvidia-container-runtime-hook        <none>                  <none>                  (no description available)
ii  nvidia-container-toolkit             1.9.0-1                 amd64                   NVIDIA container runtime hook
un  nvidia-docker                        <none>                  <none>                  (no description available)
ii  nvidia-docker2                       2.7.0-1                 all                     nvidia-docker CLI wrapper
gpu-server@gpu-server:~$ sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver rpc error: timed out: unknown.

@elezar
Member

elezar commented Jun 21, 2022

Sorry, I should have made it clearer. Could you also downgrade libnvidia-container-tools and libnvidia-container1 and repeat the test?
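
A sketch extending the previous command so the library packages are pinned as well (version numbers taken from this thread):

sudo apt-get install -y --allow-downgrades \
    libnvidia-container1=1.9.0-1 \
    libnvidia-container-tools=1.9.0-1 \
    nvidia-container-toolkit=1.9.0-1
sudo systemctl restart docker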

@zyr-NULL
Author

@elezar It's the same error:

gpu-server@gpu-server:~$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                 Version                 Architecture            Description
+++-====================================-=======================-=======================-=============================================================================
ii  libnvidia-container-tools            1.9.0-1                 amd64                   NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64           1.9.0-1                 amd64                   NVIDIA container runtime library
un  nvidia-container-runtime             <none>                  <none>                  (no description available)
un  nvidia-container-runtime-hook        <none>                  <none>                  (no description available)
ii  nvidia-container-toolkit             1.9.0-1                 amd64                   NVIDIA container runtime hook
un  nvidia-docker                        <none>                  <none>                  (no description available)
ii  nvidia-docker2                       2.7.0-1                 all                     nvidia-docker CLI wrapper
gpu-server@gpu-server:~$ systemctl restart docker 
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to restart 'docker.service'.
Authenticating as: gpu-server
Password: 
==== AUTHENTICATION COMPLETE ===
gpu-server@gpu-server:~$ sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver rpc error: timed out: unknown.

@klueska
Contributor

klueska commented Jun 21, 2022

What happens when you run nvidia-smi directly on the host?

Also, this warning in the log seems suspicious:

W0621 07:07:51.739205 4791 nvc.c:273] failed to set inheritable capabilities

Is there something unconventional about how you are running docker on your host? For example, rootless, in a snap sandbox, etc.
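
For reference, a few quick checks that can reveal a rootless or snap-based Docker installation (output formats may vary between Docker versions):

snap list docker 2>/dev/null                 # non-empty output means Docker was installed via snap
docker info --format '{{.SecurityOptions}}'  # "name=rootless" here indicates rootless mode
ps -o user= -p "$(pidof dockerd)"            # which user the Docker daemon runs as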

@zyr-NULL
Author

@klueska
When I run nvidia-smi on the host, it takes about two minutes to return the expected results:

gpu-server@gpu-server:~$ nvidia-smi
Tue Jun 21 13:47:51 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   32C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   30C    P0    53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   31C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I run Docker as root. I used sudo gpasswd -a gpu-server docker and then ran Docker; in addition, I did not install Docker with snap.

@klueska
Contributor

klueska commented Jun 21, 2022

If it’s taking that long to get a result from nvidia-smi on the host then I could understand why the RPC might time out in the nvidia-container-cli when trying to get a result from the driver RPC call.

I'm assuming this is happening because (1) you are not running the nvidia-persistenced daemon and (2) your GPUs are not in persistence mode.

Both (1) and (2) achieve the same thing, but (1) is the preferred method to keep the GPU driver alive even when no clients are attached.

Try enabling one of these methods and report back.
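
A minimal sketch of the two options, assuming the driver's bundled nvidia-persistenced binary is installed on the host:

# Option 1 (preferred): run the persistence daemon
sudo nvidia-persistenced --verbose

# Option 2: enable persistence mode on all GPUs directly
sudo nvidia-smi -pm 1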

@zyr-NULL
Author

@klueska Thanks, my problem has been solved by enabling persistence mode.

@elezar
Member

elezar commented Jun 30, 2022

@zyr-NULL given your comment above I am closing this issue. Please reopen if there is still a problem.

@alkavan

alkavan commented Jul 18, 2022

@klueska Yeah dude thanks =)
That's exactly it, running nvidia-persistenced as root without any additional parameters solved it.
You can confirm with nvidia-smi that the GPU units load faster.
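
If the driver installation also shipped a systemd unit for the daemon (this depends on how the driver was packaged), it can be kept running across reboots, for example:

sudo systemctl enable --now nvidia-persistenced         # start now and on every boot
nvidia-smi --query-gpu=persistence_mode --format=csv    # should report Enabled for each GPU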

@LukeLIN-web

I hit this problem when running the command from a shell script; after I added #!/bin/bash at the top of the script, the error disappeared.
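
For illustration, a minimal wrapper script with an explicit interpreter line (the docker command is the one from this thread); without the shebang the script may be run by a different shell than intended:

#!/bin/bash
# run the GPU test container from an explicitly bash-interpreted script
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi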

@montmejat

If it’s taking that long to get a result from nvidia-smi on the host then I could understand why the RPC might time out in the nvidia-container-cli when trying to get a result from the driver RPC call.

I'm assuming this is happening because (1) you are not running the nvidia-persistenced daemon and (2) your GPUs are not in persistence mode.

Both (1) and (2) achieve the same thing, but (1) is the preferred method to keep the GPU driver alive even when no clients are attached.

Try enabling one of these methods and report back.

I tried using nvidia-smi -i 0 -pm ENABLED but no luck. I'm having the same error as @zyr-NULL. I then tried using:

# nvidia-persistenced --user aurelien
nvidia-persistenced failed to initialize. Check syslog for more details.

Here are the logs:

# cat /var/log/syslog
Feb  6 14:15:46 pop-os nvidia-persistenced: device 0000:01:00.0 - persistence mode disabled.
Feb  6 14:15:46 pop-os nvidia-persistenced: device 0000:01:00.0 - NUMA memory offlined.
Feb  6 14:16:06 pop-os nvidia-persistenced: Failed to change ownership of /var/run/nvidia-persistenced: Operation not permitted
Feb  6 14:16:06 pop-os nvidia-persistenced: Shutdown (16491)
Feb  6 14:16:38 pop-os nvidia-persistenced: Failed to open PID file: Permission denied
Feb  6 14:16:38 pop-os nvidia-persistenced: Shutdown (16516)
Feb  6 14:16:46 pop-os rtkit-daemon[1589]: Supervising 8 threads of 5 processes of 1 users.
Feb  6 14:16:48 pop-os rtkit-daemon[1589]: message repeated 5 times: [ Supervising 8 threads of 5 processes of 1 users.]
Feb  6 14:17:01 pop-os nvidia-persistenced: device 0000:01:00.0 - persistence mode enabled.
Feb  6 14:17:01 pop-os nvidia-persistenced: device 0000:01:00.0 - NUMA memory onlined.
Feb  6 14:17:01 pop-os CRON[16652]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb  6 14:17:03 pop-os nvidia-persistenced: Failed to open PID file: Permission denied
Feb  6 14:17:03 pop-os nvidia-persistenced: Shutdown (16656)
Feb  6 14:18:03 pop-os rtkit-daemon[1589]: Supervising 8 threads of 5 processes of 1 users.
Feb  6 14:18:04 pop-os rtkit-daemon[1589]: message repeated 3 times: [ Supervising 8 threads of 5 processes of 1 users.]
Feb  6 14:18:33 pop-os nvidia-persistenced: Failed to open PID file: Permission denied
Feb  6 14:18:33 pop-os nvidia-persistenced: Shutdown (16799)
Feb  6 14:18:38 pop-os rtkit-daemon[1589]: Supervising 8 threads of 5 processes of 1 users.
Feb  6 14:18:53 pop-os rtkit-daemon[1589]: message repeated 7 times: [ Supervising 8 threads of 5 processes of 1 users.]
Feb  6 14:19:17 pop-os nvidia-persistenced: Failed to open PID file: Permission denied
Feb  6 14:19:17 pop-os nvidia-persistenced: Shutdown (16994)
Feb  6 14:20:04 pop-os rtkit-daemon[1589]: Supervising 8 threads of 5 processes of 1 users.
Feb  6 14:20:07 pop-os rtkit-daemon[1589]: message repeated 5 times: [ Supervising 8 threads of 5 processes of 1 users.]
Feb  6 14:20:56 pop-os systemd[1]: Starting Cleanup of Temporary Directories...
Feb  6 14:20:56 pop-os systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Feb  6 14:20:56 pop-os systemd[1]: Finished Cleanup of Temporary Directories.
Feb  6 14:21:03 pop-os rtkit-daemon[1589]: Supervising 8 threads of 5 processes of 1 users.
Feb  6 14:21:07 pop-os rtkit-daemon[1589]: message repeated 3 times: [ Supervising 8 threads of 5 processes of 1 users.]
Feb  6 14:21:15 pop-os systemd[1]: Reloading.
Feb  6 14:23:27 pop-os chronyd[1007]: NTS-KE session with 52.10.183.132:4460 (oregon.time.system76.com) timed out
Feb  6 14:23:33 pop-os chronyd[1007]: NTS-KE session with 18.228.202.30:4460 (brazil.time.system76.com) timed out
Feb  6 14:23:33 pop-os chronyd[1007]: NTS-KE session with 15.237.97.214:4460 (paris.time.system76.com) timed out
Feb  6 14:23:34 pop-os chronyd[1007]: NTS-KE session with 3.134.129.152:4460 (ohio.time.system76.com) timed out
Feb  6 14:23:36 pop-os chronyd[1007]: NTS-KE session with 3.220.42.39:4460 (virginia.time.system76.com) timed out
Feb  6 14:24:21 pop-os rtkit-daemon[1589]: Supervising 8 threads of 5 processes of 1 users.
Feb  6 14:24:21 pop-os rtkit-daemon[1589]: Supervising 8 threads of 5 processes of 1 users.
Feb  6 14:25:04 pop-os nvidia-persistenced: Failed to open PID file: Permission denied
Feb  6 14:25:04 pop-os nvidia-persistenced: Shutdown (17372)

Any ideas @klueska? 🥲

Here's the error by the way:

$ sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
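
Note that this last error ("cgroup subsystem devices not found") is different from the RPC timeout above. On hosts that only mount cgroup v2 it is commonly worked around in one of two ways; this is a sketch of reported workarounds, not verified on Pop!_OS specifically:

# Option 1: let Docker/runc manage device cgroups instead of libnvidia-container;
# edit /etc/nvidia-container-runtime/config.toml and, under [nvidia-container-cli], set:
#     no-cgroups = true
# (with this setting you may also need to pass the /dev/nvidia* devices to the container explicitly)
sudo systemctl restart docker

# Option 2: upgrade to a libnvidia-container release with cgroup v2 support,
# or boot the host with the legacy hierarchy (kernel parameter, reboot required):
#     systemd.unified_cgroup_hierarchy=0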

@Metrora

Metrora commented Feb 8, 2023

I have the same problem as @aurelien-m when I run:

$ sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

@fighterhit

fighterhit commented Feb 23, 2023

I hit this problem too. My GPU is an A30 and the GPU driver is 525.85.12.

 State:      Terminated
      Reason:   StartError
      Message:  failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: nvml error: unknown error: unknown
[Tue Feb 21 21:10:41 2023] NVRM: Xid (PCI:0000:22:00): 119, pid=3660230, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).

@elezar
Member

elezar commented Feb 23, 2023

@fighterhit this may be related to changes in the GSP firmware paths and should be addressed in the v1.13.0-rc.1 release. Would you be able to give it a try?

@fighterhit

fighterhit commented Feb 23, 2023

@fighterhit this may be related to changes in the GSP firmware paths and should be addressed in the v1.13.0-rc.1 release. Would you be able to give it a try?

Hi @elezar, yes I can, but do I need to restart the node or the containers on it? It may need to be deployed in our production environment.

Or does this problem also exist in previous versions? I could accept a lower version too :).

@elezar
Member

elezar commented Feb 23, 2023

Updating the nvidia-container-toolkit should not require a restart. The components of the NVIDIA container stack are only invoked when containers are created and would not affect running containers.

If this is the firmware issue that I think it is, then a lower version would not work.

What is the output of

ls -al /firmware/nvidia/525.85*/*

If there is a single gsp.bin file there, then a v1.13 release is not required and any NVIDIA Container Toolkit version after v1.8.1 should work. If there are multiple gsp_*.bin files there then v1.13.0-rc.1 would be required to address this.

@fighterhit

fighterhit commented Feb 23, 2023

Updating the nvidia-container-toolkit should not require a restart. The components of the NVIDIA container stack are only invoked when containers are created and would not affect running containers.

If this is the firmware issue that I think it is, then a lower version would not work.

What is the output of

ls -al /firmware/nvidia/525.85*/*

If there is a single gsp.bin file there, then a v1.13 release is not required and any NVIDIA Container Toolkit version after v1.8.1 should work. If there are multiple gsp_*.bin files there then v1.13.0-rc.1 would be required to address this.

@elezar The output is:

ls: cannot access '/firmware/nvidia/525.85*/*': No such file or directory

It seems there is no such /firmware directory on any of our GPU nodes. FYI, our driver was installed with NVIDIA-Linux-x86_64-525.85.12.run.

@fighterhit

Updating the nvidia-container-toolkit should not require a restart. The components of the NVIDIA container stack are only invoked when containers are created and would not affect running containers.
If this is the firmware issue that I think it is, then a lower version would not work.
What is the output of

ls -al /firmware/nvidia/525.85*/*

If there is a single gsp.bin file there, then a v1.13 release is not required and any NVIDIA Container Toolkit version after v1.8.1 should work. If there are multiple gsp_*.bin files there then v1.13.0-rc.1 would be required to address this.

@elezar The output is:

ls: cannot access '/firmware/nvidia/525.85*/*': No such file or directory

It seems there is no such /firmware directory on any of our GPU nodes. FYI, our driver was installed with NVIDIA-Linux-x86_64-525.85.12.run.

Hi @elezar , do you have any more suggestions? Thanks!

@klueska
Contributor

klueska commented Feb 24, 2023

The path is /lib/firmware/…

@fighterhit

The path is /lib/firmware/…

Thanks @klueska 😅, there are indeed two gsp_*.bin files in this directory.

-r--r--r-- 1 root root 25M Feb 18 11:34 /lib/firmware/nvidia/525.85.12/gsp_ad10x.bin
-r--r--r-- 1 root root 36M Feb 18 11:34 /lib/firmware/nvidia/525.85.12/gsp_tu10x.bin

@fighterhit

Hi @klueska, can this problem be solved by nvidia-container-toolkit v1.13.0-rc.1, or by the persistence-mode approach you mentioned before? I'm not sure whether I also have to install this latest version, since I have already turned on persistence mode as you suggested.

@klueska
Contributor

klueska commented Feb 24, 2023

The new RC adds support for detecting multiple GSP firmware files, which is required for container support to work correctly on the 525 driver.

The persistenced issue is still relevant but this is a new one related to the latest NVIDIA driver.

@fighterhit

Thanks @klueska, how can I install v1.13.0-rc.1? I searched with apt-cache show nvidia-container-toolkit but there seems to be no package for this version.

The persistenced issue is still relevant but this is a new one related to the latest NVIDIA driver.

Does the persistenced issue only appear on the latest driver (525.85.12)? Can I solve it by downgrading to a certain driver version?

@klueska
Contributor

klueska commented Feb 24, 2023

Persistenced is always needed, but the firmware issue could be „resolved“ by downgrading.

That said, I’d recommend updating the nvidia-container-toolkit to ensure compatibility with all future drivers.

Since the latest toolkit is still an RC it is not yet in our stable apt repo. You will need to configure our experimental repo to get access to it.

Instructions here:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit
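
Once the experimental repo is configured, something along these lines should surface the RC (the exact version string below is an assumption; use whatever apt actually lists):

sudo apt-get update
apt-cache madison nvidia-container-toolkit                        # list available versions, release candidates included
sudo apt-get install -y nvidia-container-toolkit=1.13.0~rc.1-1    # hypothetical version string; copy the one shown above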

@fighterhit

fighterhit commented Feb 24, 2023

Thank you for your help @klueska. What confuses me is that, in the past, our GPU clusters (1080Ti, 2080Ti, 3090, A30, A100) never had persistence mode enabled and did not show this problem. Only the A30 nodes (with driver 525.85.12) are affected; none of the other GPU node types (1080Ti, 2080Ti, 3090, with driver 525.78.01) have this problem.

@fighterhit

fighterhit commented Feb 27, 2023

Persistenced is always needed, but the firmware issue could be „resolved“ by downgrading.

That said, I’d recommend updating the nvidia-container-toolkit to ensure compatibility with all future drivers.

Since the latest toolkit is still an RC it is not yet in our stable apt repo. You will need to configure our experimental repo to get access to it.

Instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit

Hi @klueska @elezar, when I configure the experimental repo I get the following error and can't install v1.13.0-rc.1. My system distribution is Debian 11.

root@xxx:/etc/apt/sources.list.d# distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Unsupported distribution!
# Check https://nvidia.github.io/libnvidia-container

@elezar
Member

elezar commented Feb 27, 2023

Persistenced is always needed, but the firmware issue could be „resolved“ by downgrading.
That said, I’d recommend updating the nvidia-container-toolkit to ensure compatibility with all future drivers.
Since the latest toolkit is still an RC it is not yet in our stable apt repo. You will need to configure our experimental repo to get access to it.
Instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit

Hi @klueska @elezar, when I configure the experimental repo I get the following error and can't install v1.13.0-rc.1. My system distribution is debian11.

root@xxx:/etc/apt/sources.list.d# distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Unsupported distribution!
# Check https://nvidia.github.io/libnvidia-container

@fighterhit what distribution are you using? Please ensure that the distribution variable is set to ubuntu18.04 and try again.
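
For illustration, a sketch of the same repo setup with the distribution variable pinned to ubuntu18.04 instead of the auto-detected value (the exact RC version string shown by apt may differ; check it before installing):

distribution=ubuntu18.04 \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
apt-cache madison nvidia-container-toolkit   # list available versions, including the 1.13.0 RC
sudo apt-get install -y nvidia-container-toolkit   # or pin the exact RC version shown above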

@klueska
Contributor

klueska commented Feb 27, 2023

He said he’s on debian11.

@fighterhit

Persistenced is always needed, but the firmware issue could be „resolved“ by downgrading.
That said, I’d recommend updating the nvidia-container-toolkit to ensure compatibility with all future drivers.
Since the latest toolkit is still an RC it is not yet in our stable apt repo. You will need to configure our experimental repo to get access to it.
Instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit

Hi @klueska @elezar, when I configure the experimental repo I get the following error and can't install v1.13.0-rc.1. My system distribution is debian11.

root@xxx:/etc/apt/sources.list.d# distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Unsupported distribution!
# Check https://nvidia.github.io/libnvidia-container

@fighterhit what distribution are you using? Please ensure that the distribution variable is set to ubuntu18.04 and try again.

@elezar My distribution is Debian GNU/Linux 11 (bullseye).

@fighterhit

fighterhit commented Feb 27, 2023

Hi @klueska @elezar, I tested v1.13.0-rc.1 but it still reports some errors:

  Warning  Failed     12s (x3 over 64s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown
  Warning  BackOff  11s (x3 over 44s)  kubelet  Back-off restarting failed container

Maybe /sbin/ldconfig.real should be /sbin/ldconfig?

@klueska
Contributor

klueska commented Feb 27, 2023

For debian, yes, it should be @/sbin/ldconfig (without the .real), as seen here:
https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/config/config.toml.debian#L15

This file gets installed under /etc/nvidia-container-runtime/config.toml.

The correct config file should have been selected automatically based on your distribution. Were you not able to install the debian11 one directly?

@fighterhit

fighterhit commented Feb 27, 2023

For debian, yes, it should be @/sbin/ldconfig (without the .real), as seen here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/config/config.toml.debian#L15

This file gets installed under /etc/nvidia-container-runtime/config.toml.

The correct config file should have been selected automatically based on your distribution. Were you not able to install the debian11 one directly?

Yes @klueska, I failed to install the latest version using the experimental repo (#1648 (comment)), so I followed @elezar's advice and set the distro to ubuntu18.04. Can I manually modify /etc/nvidia-container-runtime/config.toml to make it work? I'm not sure if this will have any other effects.

@elezar
Member

elezar commented Feb 27, 2023

For debian, yes, it should be @/sbin/ldconfig (without the .real), as seen here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/config/config.toml.debian#L15
This file gets installed under /etc/nvidia-container-runtime/config.toml.
The correct config file should have been selected automatically based on your distribution. Were you not able to install the debian11 one directly?

Yes @klueska, I failed to install the latest version using the experimental repo (#1648 (comment)), so I followed @elezar's advice and set the distro to ubuntu18.04. Can I manually modify /etc/nvidia-container-runtime/config.toml to make it work? I'm not sure if this will have any other effects.

Sorry about that. I was making an assumption about the distribution you are using. You can install the ubuntu18.04 package and then update the /etc/nvidia-container-runtime/config.toml value after the fact. This should have no other effects.
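
A rough sketch of that post-install edit, assuming the only difference from the stock Ubuntu config is the ldconfig entry (back up the file first):

sudo cp /etc/nvidia-container-runtime/config.toml /etc/nvidia-container-runtime/config.toml.bak
sudo sed -i 's|ldconfig = "@/sbin/ldconfig.real"|ldconfig = "@/sbin/ldconfig"|' /etc/nvidia-container-runtime/config.toml
grep ldconfig /etc/nvidia-container-runtime/config.toml   # should now print: ldconfig = "@/sbin/ldconfig"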

@fighterhit

For debian, yes, it should be @/sbin/ldconfig (without the .real), as seen here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/config/config.toml.debian#L15
This file gets installed under /etc/nvidia-container-runtime/config.toml.
The correct config file should have been selected automatically based on your distribution. Were you not able to install the debian11 one directly?

Yes @klueska, I failed to install the latest version using the experimental repo (#1648 (comment)), so I followed @elezar's advice and set the distro to ubuntu18.04. Can I manually modify /etc/nvidia-container-runtime/config.toml to make it work? I'm not sure if this will have any other effects.

Sorry about that. I was making an assumption about the distribution you are using. You can install the ubuntu18.04 package and then update the /etc/nvidia-container-runtime/config.toml value after the fact. This should have no other effects.

Thanks @elezar! I have tried it and it works fine. It would be even better if debian11 were supported directly by the experimental repo.

@elezar
Member

elezar commented Feb 27, 2023

@fighterhit I have just double-checked our repo configuration and the issue is that for debian10 (to which debian11 redirects) the repo only had a libnvidia-container-experimental.list and no libnvidia-container.list file which is referred to by the instructions.

I have created a link / redirect for this now and the official instructions should work as expected. (It may take about 30 minutes for the changes to reflect in the repo though).

@fighterhit

@fighterhit I have just double-checked our repo configuration and the issue is that for debian10 (to which debian11 redirects) the repo only had a libnvidia-container-experimental.list and no libnvidia-container.list file which is referred to by the instructions.

I have created a link / redirect for this now and the official instructions should work as expected. (It may take about 30 minutes for the changes to reflect in the repo though).

Thanks for your confirmation, it works now. @elezar

@vikramelango

vikramelango commented Mar 7, 2023

I am trying to install the NVIDIA Container Toolkit on Amazon Linux 2. I created a new EC2 instance, followed the instructions on this page, and am running into the issue below. I got the same error when I tried this on an Ubuntu instance as well.

nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled 

@elezar @klueska please advise how to fix this issue; I appreciate your input. Thanks
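
That error usually means the driver's user-space libraries are not visible on the host at all (rather than a toolkit problem). A quick sanity check, assuming a standard driver install, might be:

nvidia-smi                               # should print the driver/GPU table on the host itself
ldconfig -p | grep libnvidia-ml          # should list libnvidia-ml.so.1 if the driver libraries are in the ld cache
find /usr/lib* -name 'libnvidia-ml.so.1' 2>/dev/null   # location varies by distro (e.g. /usr/lib64 on Amazon Linux)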

@fighterhit

fighterhit commented Mar 7, 2023

Hi @elezar @klueska, unfortunately I am on the latest toolkit but this problem has reappeared. I think it may be related to the driver; I asked the driver community for help but got no further reply (https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446, https://forums.developer.nvidia.com/t/timeout-waiting-for-rpc-from-gsp/244789). Could you communicate with the driver team about this issue? Some users in the community have encountered the same problem. Thanks!

[Mon Mar  6 21:09:45 2023] NVRM: GPU at PCI:0000:61:00: GPU-81f5d81e-7906-c145-3def-82e281d7b260
[Mon Mar  6 21:09:45 2023] NVRM: GPU Board Serial Number: 1322621149674
[Mon Mar  6 21:09:45 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977720, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00b0 0x0).
[Mon Mar  6 21:09:45 2023] CPU: 15 PID: 977720 Comm: nvidia-smi Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Mon Mar  6 21:09:45 2023] Hardware name: Inspur NF5468A5/YZMB-02382-101, BIOS 4.02.12 01/28/2022
[Mon Mar  6 21:09:45 2023] Call Trace:
[Mon Mar  6 21:09:45 2023]  dump_stack+0x6b/0x83
[Mon Mar  6 21:09:45 2023]  _nv011231rm+0x39d/0x470 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv011168rm+0x62/0x2e0 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv040022rm+0xdb/0x140 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv041148rm+0x2ce/0x3a0 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv015451rm+0x788/0x800 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv039541rm+0xac/0xe0 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv041150rm+0xac/0x140 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv041149rm+0x37a/0x4d0 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv039443rm+0xc9/0x150 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv039444rm+0x42/0x70 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? rm_cleanup_file_private+0x128/0x180 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv000554rm+0x49/0x60 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? _nv000694rm+0x7fb/0xc80 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Mon Mar  6 21:09:45 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Mon Mar  6 21:09:45 2023]  ? do_syscall_64+0x33/0x80
[Mon Mar  6 21:09:45 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Mon Mar  6 21:10:30 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977720, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0050 0x0).
[Mon Mar  6 21:11:15 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977720, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0040 0x0).
[Mon Mar  6 21:12:00 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Mon Mar  6 21:12:45 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Mon Mar  6 21:13:30 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Mon Mar  6 21:14:15 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Mon Mar  6 21:15:00 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Mon Mar  6 21:15:45 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Mon Mar  6 21:16:30 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Mon Mar  6 21:17:15 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=905052, name=python, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801702 0x4).
[Mon Mar  6 21:18:00 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Mon Mar  6 21:18:45 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Mon Mar  6 21:19:30 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=980026, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Mon Mar  6 21:20:15 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:21:00 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=980026, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Mon Mar  6 21:21:45 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=980026, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Mon Mar  6 21:22:08 2023] INFO: task python:976158 blocked for more than 120 seconds.
[Mon Mar  6 21:22:08 2023]       Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Mon Mar  6 21:22:08 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Mar  6 21:22:08 2023] task:python          state:D stack:    0 pid:976158 ppid:835062 flags:0x00000000
[Mon Mar  6 21:22:08 2023] Call Trace:
[Mon Mar  6 21:22:08 2023]  __schedule+0x282/0x880
[Mon Mar  6 21:22:08 2023]  schedule+0x46/0xb0
[Mon Mar  6 21:22:08 2023]  rwsem_down_write_slowpath+0x246/0x4d0
[Mon Mar  6 21:22:08 2023]  os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Mon Mar  6 21:22:08 2023]  _nv038505rm+0xc/0x30 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv041182rm+0x45/0xd0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv041127rm+0x142/0x2b0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039416rm+0x5b/0x90 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039416rm+0x31/0x90 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv012688rm+0x1d/0x30 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039431rm+0xb0/0xb0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv012710rm+0x54/0x70 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv011426rm+0xc4/0x120 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv000659rm+0x63/0x70 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv000582rm+0x2c/0x40 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv000694rm+0x86c/0xc80 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Mon Mar  6 21:22:08 2023]  ? do_syscall_64+0x33/0x80
[Mon Mar  6 21:22:08 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Mon Mar  6 21:22:08 2023] INFO: task nvidia-smi:993195 blocked for more than 120 seconds.
[Mon Mar  6 21:22:08 2023]       Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Mon Mar  6 21:22:08 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Mar  6 21:22:08 2023] task:nvidia-smi      state:D stack:    0 pid:993195 ppid: 58577 flags:0x00000004
[Mon Mar  6 21:22:08 2023] Call Trace:
[Mon Mar  6 21:22:08 2023]  __schedule+0x282/0x880
[Mon Mar  6 21:22:08 2023]  schedule+0x46/0xb0
[Mon Mar  6 21:22:08 2023]  rwsem_down_write_slowpath+0x246/0x4d0
[Mon Mar  6 21:22:08 2023]  os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Mon Mar  6 21:22:08 2023]  _nv038505rm+0xc/0x30 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv041182rm+0x45/0xd0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv041127rm+0x142/0x2b0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039416rm+0x5b/0x90 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv039416rm+0x31/0x90 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv000560rm+0x59/0x70 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv000560rm+0x33/0x70 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? _nv000694rm+0x4ae/0xc80 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Mon Mar  6 21:22:08 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Mon Mar  6 21:22:08 2023]  ? do_syscall_64+0x33/0x80
[Mon Mar  6 21:22:08 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Mon Mar  6 21:22:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Mon Mar  6 21:23:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:24:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:24:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:25:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:26:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Mon Mar  6 21:27:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Mon Mar  6 21:27:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Mon Mar  6 21:28:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Mon Mar  6 21:29:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Mon Mar  6 21:30:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Mon Mar  6 21:30:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Mon Mar  6 21:31:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Mon Mar  6 21:32:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=980026, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Mon Mar  6 21:33:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Mon Mar  6 21:33:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Mon Mar  6 21:34:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:35:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Mon Mar  6 21:36:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Mon Mar  6 21:36:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Mon Mar  6 21:37:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=905052, name=python, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801702 0x4).
[Mon Mar  6 21:38:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:39:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:39:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00b0 0x0).
[Mon Mar  6 21:40:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:41:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0050 0x0).
[Mon Mar  6 21:42:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:42:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Mon Mar  6 21:43:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Mon Mar  6 21:44:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Mon Mar  6 21:45:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Mon Mar  6 21:45:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977772, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0040 0x0).
[Mon Mar  6 21:46:18 2023] INFO: task python:976158 blocked for more than 120 seconds.
[Mon Mar  6 21:46:18 2023]       Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Mon Mar  6 21:46:18 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Mar  6 21:46:18 2023] task:python          state:D stack:    0 pid:976158 ppid:835062 flags:0x00000000
[Mon Mar  6 21:46:18 2023] Call Trace:
[Mon Mar  6 21:46:18 2023]  __schedule+0x282/0x880
[Mon Mar  6 21:46:18 2023]  schedule+0x46/0xb0
[Mon Mar  6 21:46:18 2023]  rwsem_down_write_slowpath+0x246/0x4d0
[Mon Mar  6 21:46:18 2023]  os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Mon Mar  6 21:46:18 2023]  _nv038505rm+0xc/0x30 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv041182rm+0x45/0xd0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv041127rm+0x142/0x2b0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv039416rm+0x5b/0x90 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv039416rm+0x31/0x90 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv012688rm+0x1d/0x30 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? task_numa_fault+0x2a3/0xb70
[Mon Mar  6 21:46:18 2023]  ? _nv039431rm+0xb0/0xb0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv012710rm+0x54/0x70 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv011426rm+0xc4/0x120 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv000659rm+0x63/0x70 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv000582rm+0x2c/0x40 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv000694rm+0x86c/0xc80 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Mon Mar  6 21:46:18 2023]  ? do_syscall_64+0x33/0x80
[Mon Mar  6 21:46:18 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Mon Mar  6 21:46:18 2023] INFO: task nvidia-smi:980026 blocked for more than 120 seconds.
[Mon Mar  6 21:46:18 2023]       Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Mon Mar  6 21:46:18 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Mar  6 21:46:18 2023] task:nvidia-smi      state:D stack:    0 pid:980026 ppid:980025 flags:0x00000000
[Mon Mar  6 21:46:18 2023] Call Trace:
[Mon Mar  6 21:46:18 2023]  __schedule+0x282/0x880
[Mon Mar  6 21:46:18 2023]  ? psi_task_change+0x88/0xd0
[Mon Mar  6 21:46:18 2023]  schedule+0x46/0xb0
[Mon Mar  6 21:46:18 2023]  rwsem_down_read_slowpath+0x18e/0x500
[Mon Mar  6 21:46:18 2023]  os_acquire_rwlock_read+0x31/0x40 [nvidia]
[Mon Mar  6 21:46:18 2023]  _nv038503rm+0xc/0x30 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv039453rm+0x64/0x1d0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv041182rm+0x45/0xd0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv041133rm+0xfd/0x2b0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv012728rm+0x59a/0x690 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv039431rm+0x53/0xb0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv011404rm+0x52/0xa0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? _nv000694rm+0x5ae/0xc80 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Mon Mar  6 21:46:18 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Mon Mar  6 21:46:18 2023]  ? do_syscall_64+0x33/0x80
[Mon Mar  6 21:46:18 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Mon Mar  6 21:46:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Mon Mar  6 21:47:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Mon Mar  6 21:48:01 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Mon Mar  6 21:48:46 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Mon Mar  6 21:49:31 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Mon Mar  6 21:50:16 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=980026, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Mon Mar  6 21:51:02 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=980026, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Mon Mar  6 21:51:47 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=980026, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Mon Mar  6 21:52:32 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Mon Mar  6 21:53:17 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=977823, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Mon Mar  6 21:54:02 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=980026, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
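
For anyone hitting the same Xid 119 pattern (Timeout waiting for RPC from GSP), a hedged sketch of checking whether GSP firmware is in use and, on the proprietary driver and GPUs where GSP is optional, falling back to the non-GSP path via a module parameter. The open kernel modules require GSP, so this only applies to the closed driver; verify NVreg_EnableGpuFirmware against your driver's README before relying on it:

nvidia-smi -q | grep -i 'GSP Firmware'                      # a version string here means GSP firmware is active
echo 'options nvidia NVreg_EnableGpuFirmware=0' | sudo tee /etc/modprobe.d/nvidia-gsp.conf
sudo update-initramfs -u                                    # Debian/Ubuntu; rebuild the initramfs so the option takes effect
sudo reboot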

@lemketron

lemketron commented May 11, 2023

I was getting similar errors trying to run nvidia-smi after updating my Ubuntu 22.04 system.

I noticed that nvidia-persistenced was not enabled (per sudo systemctl status nvidia-persistenced), so I tried to enable it with sudo systemctl enable nvidia-persistenced, but that failed as well.

I then ran across the following forum post, which led me to think that my drivers had somehow been disabled or corrupted by a recent system and/or kernel upgrade, so I decided to reinstall (and upgrade) the NVIDIA driver using the PPA, and nvidia-smi is working again. Hope this helps someone else...

https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-make-sure-that-the-latest-nvidia-driver-is-installed-and-running/197141/2
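
A rough sketch of that PPA-based reinstall on Ubuntu 22.04 (the driver package name below is only an example; install whatever ubuntu-drivers recommends for your GPU):

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
ubuntu-drivers devices                               # shows the recommended nvidia-driver-XXX package
sudo apt-get install --reinstall nvidia-driver-535   # example version only
sudo reboot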

@Anvilondre

I was having the same issue on Ubuntu Server 22.04 and docker-compose. Reinstalling docker with apt (instead of snap) solved my problem.
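
A minimal sketch of that swap, assuming Docker was originally installed via snap and you want the upstream apt packages (see Docker's official install docs for the full repository setup):

sudo snap remove --purge docker
# add Docker's apt repository per https://docs.docker.com/engine/install/ubuntu/ , then:
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo nvidia-ctk runtime configure --runtime=docker   # re-register the NVIDIA runtime (toolkit >= 1.12 ships nvidia-ctk)
sudo systemctl restart docker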

@Haseeb707

Haseeb707 commented Oct 24, 2023

Uninstalling Docker Desktop and installing Docker with apt worked for me, as mentioned in NVIDIA/nvidia-container-toolkit#229.

@bkocis

bkocis commented Oct 30, 2023

For some reason, reinstalling Docker helped me as well, by executing (Ubuntu 22.04):
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

@JosephKuchar

@bkocis Thanks, that worked! I uninstalled Docker following the official Docker instructions, then reinstalled from the same link, and it's all working again.

@han9rykim

In my case, this kind of issue happened when another Docker was installed through snap. Removing that Docker with
sudo snap remove --purge docker
fixed the problem.
