Kepler underreporting power usage for gpu_joules_total #1926

Open
vprashar2929 opened this issue Feb 17, 2025 · 1 comment
Labels
kind/bug report bug issue

Comments

@vprashar2929
Collaborator

What happened?

When deployed from release-0.7.10 or release-0.7.11 (with or without DCGM), Kepler underreports GPU power usage compared with the DCGM exporter, which exposes DCGM_FI_DEV_POWER_USAGE, the GPU power draw in watts.
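
Because Kepler reports a cumulative joule counter while DCGM_FI_DEV_POWER_USAGE is an instantaneous reading in watts, the counter has to be converted to average power before the two can be compared. A minimal sketch of that conversion, assuming two readings of the counter taken a known number of seconds apart (the function name and the sample values are hypothetical, not taken from Kepler):

```python
def average_watts(joules_start: float, joules_end: float, seconds: float) -> float:
    """Average power over a window, from two readings of a cumulative joule counter."""
    if seconds <= 0:
        raise ValueError("window length must be positive")
    delta = joules_end - joules_start
    if delta < 0:
        # A cumulative counter should never decrease; a drop usually means a reset.
        raise ValueError("counter decreased between readings")
    return delta / seconds  # 1 watt = 1 joule per second


# Hypothetical readings: 1000 J and 4000 J, taken 60 s apart.
print(average_watts(1000.0, 4000.0, 60.0))  # 50.0
```

The result is an average over the window, so it is only comparable to the DCGM gauge averaged over the same window.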

With release-0.7.10, Kepler uses NVML:

Image

Image

With release-0.7.10-dcgm, Kepler uses DCGM:

Image

With release-0.7.11-dcgm, Kepler uses DCGM:

Image

What did you expect to happen?

Kepler shouldn't report a lower value than the DCGM exporter does.

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy Kepler, using one of the versions listed above, on a cluster where it has access to the GPU
  2. Stress the GPU nodes so that the GPU power metrics report non-zero values
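
Once both exporters have data for the same time window, the size of the gap can be estimated by integrating the DCGM power samples into joules and dividing Kepler's counter increase by that total. A rough sketch, assuming the samples have already been collected as (timestamp in seconds, watts) pairs; the helper names and all numbers below are hypothetical, not measurements from this issue:

```python
def joules_from_power_samples(samples):
    """Integrate (timestamp_seconds, watts) samples using the trapezoidal rule."""
    total = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        total += (w0 + w1) / 2.0 * (t1 - t0)
    return total


def reported_fraction(kepler_joules_delta, reference_joules):
    """Fraction of the reference energy that Kepler accounted for (1.0 = full agreement)."""
    if reference_joules <= 0:
        raise ValueError("reference energy must be positive")
    return kepler_joules_delta / reference_joules


# Hypothetical DCGM power samples over one minute.
dcgm_samples = [(0, 100.0), (30, 200.0), (60, 200.0)]
reference = joules_from_power_samples(dcgm_samples)
print(reference)  # 10500.0

# Hypothetical increase of Kepler's joule counter over the same minute.
print(reported_fraction(8400.0, reference))
```

A fraction well below 1.0 over a sustained load would match the underreporting described here; the trapezoidal rule is only an approximation, so short windows or sparse samples can distort the result.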

Anything else we need to know?

No response

Kepler image tag

release-0.7.10, release-0.7.10-dcgm, release-0.7.11, release-0.7.11-dcgm

Kubernetes version

$ kubectl version
# paste output here

Cloud provider or bare metal

VM

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Kepler deployment config

On Kubernetes:

$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}

For standalone:

put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@vprashar2929 vprashar2929 added the kind/bug report bug issue label Feb 17, 2025
@vprashar2929
Collaborator Author

@rootfs @sunya-ch @marceloamaral Could anyone provide insights on how Kepler's GPU metrics were validated against DCGM/NVML? Additionally, were there any alternative validation methods used when these features were initially introduced in Kepler?
