What happened?
When deployed from release-0.7.10 or release-0.7.11 (each with or without DCGM), Kepler under-reports GPU power usage compared with the DCGM exporter, which exposes DCGM_FI_DEV_POWER_USAGE, the GPU power usage in watts.
With release-0.7.10, Kepler uses NVML.
With release-0.7.10-dcgm, Kepler uses DCGM.
With release-0.7.11-dcgm, Kepler uses DCGM.
What did you expect to happen?
Kepler shouldn't report a lower value than the DCGM exporter does.
How can we reproduce it (as minimally and precisely as possible)?
Deploy Kepler, using one of the versions mentioned above, on a cluster where it has access to the GPU.
Stress the GPU nodes so that both exporters report a non-trivial power value.
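Once both exporters are reporting, the two readings have to be brought to the same unit before they can be compared. The sketch below is one way to do that, not the method used in this report. It assumes that Kepler's GPU metric is a cumulative energy counter in joules and that the DCGM exporter value is a gauge in watts, as DCGM_FI_DEV_POWER_USAGE is described above. The function names and the sample numbers are hypothetical and are not measurements from this cluster.

```python
from typing import Sequence, Tuple

# One scraped sample: (unix timestamp in seconds, metric value).
Sample = Tuple[float, float]


def average_watts_from_joules(samples: Sequence[Sample]) -> float:
    """Average power over the window covered by a cumulative joules counter.

    Assumes the counter did not reset inside the window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_start, e_start), (t_end, e_end) = samples[0], samples[-1]
    if t_end <= t_start:
        raise ValueError("timestamps must be increasing")
    return (e_end - e_start) / (t_end - t_start)


def average_watts_from_gauge(samples: Sequence[Sample]) -> float:
    """Plain mean of a power gauge that is already expressed in watts."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(value for _, value in samples) / len(samples)


def relative_gap(kepler_watts: float, dcgm_watts: float) -> float:
    """Signed gap relative to DCGM; negative means Kepler reports less."""
    if dcgm_watts == 0:
        raise ValueError("DCGM reading is zero; relative gap is undefined")
    return (kepler_watts - dcgm_watts) / dcgm_watts


if __name__ == "__main__":
    # Hypothetical samples taken 30 s apart, for illustration only.
    kepler_joules = [(0.0, 1000.0), (30.0, 4000.0), (60.0, 7000.0)]
    dcgm_watts = [(0.0, 148.0), (30.0, 150.0), (60.0, 152.0)]

    kepler_avg = average_watts_from_joules(kepler_joules)
    dcgm_avg = average_watts_from_gauge(dcgm_watts)
    print(f"Kepler: {kepler_avg:.1f} W, DCGM: {dcgm_avg:.1f} W, "
          f"gap: {relative_gap(kepler_avg, dcgm_avg):.1%}")
```

Comparing averages over the same time window, rather than single scrapes, reduces the effect of the two exporters sampling at different moments.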
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Kepler deployment config
On Kubernetes:
$ KEPLER_NAMESPACE=kepler
# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here
# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
For standalone:
put your Kepler command argument here
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
@rootfs @sunya-ch @marceloamaral Could anyone provide insights on how Kepler's GPU metrics were validated against DCGM/NVML? Additionally, were there any alternative validation methods used when these features were initially introduced in Kepler?
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version