cuda: check for gpu instead of /dev/nvidiactl
Checking for `/dev/nvidiactl` to determine whether the CUDA plugin can be
used is unreliable because in some cases the default path for driver
installation is different [1]. This patch changes the logic to check
whether a GPU device is available in `/proc/driver/nvidia/gpus/`. This is
a more accurate indicator, and the subsequent check for the `--action`
option confirms whether the NVIDIA driver supports checkpoint/restore.

[1] https://github.com/NVIDIA/gpu-operator

Fixes: #2509

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
rst0git committed Nov 8, 2024
1 parent 216d804 commit 3d103d0
Showing 1 changed file with 13 additions and 2 deletions.
15 changes: 13 additions & 2 deletions plugins/cuda/cuda_plugin.c
@@ -470,6 +470,17 @@ int cuda_plugin_resume_devices_late(int pid)
 }
 CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late)
 
+static bool has_nvidia_gpu(void)
+{
+	const char *gpu_path = "/proc/driver/nvidia/gpus/";
+	struct stat sb;
+
+	if (stat(gpu_path, &sb) != 0)
+		return false;
+
+	return S_ISDIR(sb.st_mode);
+}
+
 int cuda_plugin_init(int stage)
 {
 	int ret;
@@ -481,8 +492,8 @@ int cuda_plugin_init(int stage)
 		}
 	}
 
-	if (!fault_injected(FI_PLUGIN_CUDA_FORCE_ENABLE) && access("/dev/nvidiactl", F_OK)) {
-		pr_info("/dev/nvidiactl doesn't exist. The CUDA plugin is disabled.\n");
+	if (!fault_injected(FI_PLUGIN_CUDA_FORCE_ENABLE) && !has_nvidia_gpu()) {
+		pr_info("No GPU device found; CUDA plugin is disabled\n");
 		plugin_disabled = true;
 		return 0;
 	}
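
For readers who want to try the detection logic outside of CRIU, here is a minimal standalone sketch. `dir_exists()` applies the same `stat()` + `S_ISDIR()` test as `has_nvidia_gpu()` above, but takes the path as a parameter so it can be exercised on any directory. `count_dir_entries()` is a hypothetical stricter variant that is not part of this commit: it counts the entries in the directory, on the assumption that the driver exposes one entry per GPU under `/proc/driver/nvidia/gpus/`. Both function names are illustrative only.

```c
#include <dirent.h>
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/* Same test as has_nvidia_gpu() in the patch, with the path as a parameter. */
static bool dir_exists(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) != 0)
		return false;

	return S_ISDIR(sb.st_mode);
}

/*
 * Hypothetical stricter variant (not in the patch): count the entries in
 * the directory, skipping "." and "..". Returns -1 if the directory
 * cannot be opened.
 */
static int count_dir_entries(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	int count = 0;

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		count++;
	}

	closedir(dir);
	return count;
}
```

Calling `dir_exists("/proc/driver/nvidia/gpus/")` mirrors the check added by this commit. Whether an entry count is a better signal than directory existence depends on driver behaviour that this commit does not describe, so treat the second helper as an experiment rather than a recommendation.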