Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix grafana dashboard cannot display properly in vGPU cluster #240

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Levi080513
Copy link

Fix #236

Test

Create a k8s cluster with vGPU configured on one node.

kc get node hw-sks-test-vgpu-vgpunode-8jwnn -oyaml | yq '.status.allocatable'
cpu: "8"
ephemeral-storage: "57976119610"
hugepages-2Mi: "0"
memory: 15968092Ki
nvidia.com/gpu: "1"
pods: "110"

kc exec -ti nvidia-driver-daemonset-4.18.0-477.27.1.el8.8-rocky8.8-x7qj8 -n sks-system-nvidia-gpu -- nvidia-smi
Mon Jan 29 10:20:21 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:0A.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     12MiB /  4096MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    243152      C   /app/gpu_burn                      12MiB |
+-----------------------------------------------------------------------------+

Before fixing
image

After fixing
image

image

frittentheke added a commit to frittentheke/dcgm-exporter that referenced this pull request Jul 8, 2024
…name)

* Change PromQL queries to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this pull request Jul 8, 2024
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this pull request Jul 8, 2024
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NVIDIA DCGM Exporter Dashboard does not work in vGPU cluster
1 participant