Fix grafana dashboard cannot display properly in vGPU cluster #240

Levi080513 · 2024-01-29T02:25:51Z

Test

Create a Kubernetes cluster with a vGPU configured on one node.

kc get node hw-sks-test-vgpu-vgpunode-8jwnn -oyaml | yq '.status.allocatable'
cpu: "8"
ephemeral-storage: "57976119610"
hugepages-2Mi: "0"
memory: 15968092Ki
nvidia.com/gpu: "1"
pods: "110"

kc exec -ti nvidia-driver-daemonset-4.18.0-477.27.1.el8.8-rocky8.8-x7qj8 -n sks-system-nvidia-gpu -- nvidia-smi
Mon Jan 29 10:20:21 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:0A.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     12MiB /  4096MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    243152      C   /app/gpu_burn                      12MiB |
+-----------------------------------------------------------------------------+
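The `nvidia-smi` output above reports `N/A` for temperature, fan and power on the GRID vGPU, which is presumably why a dashboard keyed on a temperature metric shows nothing for this node. A minimal sketch of how one might check which DCGM metrics actually have samples, given the text a Prometheus exporter serves. The sample payload, the `vgpu-node` hostname and the helper name `metrics_with_samples` are hypothetical and were not captured from the test cluster:

```python
def metrics_with_samples(exposition_text):
    """Return the set of metric names that have at least one sample line."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (no labels).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.add(line[:end])
    return names


# Hypothetical exporter output for a vGPU node (illustrative values only).
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="vgpu-node"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="vgpu-node"} 4084
"""

present = metrics_with_samples(SAMPLE)
for name in ("DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_FREE"):
    print(name, "present" if name in present else "missing")
```

Running the same check against the real exporter output on a vGPU node would show which metric is safe to use for dashboard variables.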

Before fixing

After fixing

…name)
* Change PromQL queries to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)

Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>

…name)
* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>

…name)
* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
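The commit messages above describe selecting systems by `Hostname` and keying on `DCGM_FI_DEV_FB_FREE` rather than `DCGM_FI_DEV_GPU_TEMP`. A rough sketch of what such a Grafana template variable and a matching panel query could look like, generated from Python. This is not the actual dashboard JSON from either PR: the helper names, the `avg` aggregation and the `(Hostname, gpu)` grouping are assumptions for illustration, and only a subset of the variable fields is shown.

```python
import json


def hostname_variable(metric="DCGM_FI_DEV_FB_FREE", label="Hostname"):
    """Build a Grafana query-variable fragment listing values of `label`.

    `label_values(metric, label)` is the Prometheus data source's variable
    query syntax; the remaining fields are a minimal, assumed subset.
    """
    return {
        "name": label,
        "type": "query",
        "query": f"label_values({metric}, {label})",
        "multi": True,
        "includeAll": True,
    }


def promql_by_host(metric, agg="avg"):
    """Build a PromQL expression filtered by the $Hostname variable.

    Aggregating by (Hostname, gpu) is an assumed way to fold several
    series of one physical GPU into a single line per GPU.
    """
    return f'{agg} by (Hostname, gpu) ({metric}{{Hostname=~"$Hostname"}})'


print(json.dumps(hostname_variable(), indent=2))
print(promql_by_host("DCGM_FI_PROF_GR_ENGINE_ACTIVE"))
```

Keying the variable on a metric that vGPU nodes do report means those nodes appear in the host selector at all, which is the behaviour the PR title is after.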

Fix grafana dashboard cannot display properly in GPU cluster

e458834

frittentheke mentioned this pull request Jul 8, 2024

[dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname) #355

Open
