feat: k3s pod dashboard #60

Merged
merged 3 commits into from
Oct 4, 2024


Reanmachine
Contributor

Following #57, this PR adds dashboards for the k3s metrics produced by TrueNAS' metrics exporter. The dashboard gives an overview of the k3s cluster as a whole, plus a collapsible, repeating section for each pod.

The CPU metric turned out to be the instantaneous CPU time, in ns, for a given second. This makes the metric a bit tricky to work with, since it does not play nicely with Grafana/Prometheus rate intervals, but each value can be converted on its own by dividing it by 1bn (the number of ns in a second).

image

This dashboard gives an overview of the k3s cluster as a whole and a collapsible
and repeatable section for each pod.

The cpu metric was identified to be instantaneous cpu time in ns for a given
second. This makes the metric a bit tricky to work with as it does not play
nicely with Grafana/Prometheus rate intervals, but each value can be converted
on its own by dividing it by 1bn.
},
"editorMode": "code",
"exemplar": false,
"expr": "k3s_pod_cpu{instance=~\"$instance\"} / 1000000000",
Contributor Author


@Supporterino For your consideration: the pod CPU value does seem to be in ns, but it appears to be an instantaneous measure for the current second at the time of submission.

Since it's a gauge and not a counter, we can't collect the changes over the Grafana `$__rate_interval`, so instead we're just dividing the value by the number of ns in a second to get the instantaneous CPU usage as a fraction of one second (i.e. a percentage).

I don't have a huge variety of workloads to stress this value locally, but I did upload a bunch of pictures to Immich to trigger the ML container and was able to capture examples of the CPU hitting about 80%.

This seems to be the clearest and most reasonable measure from what I've seen.
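
For anyone who wants to check the arithmetic outside Grafana, here is a minimal sketch of the same conversion in Python. It assumes, as described above, that each sample is the CPU time in ns consumed during a single one-second window. The function names and sample values are made up for illustration and are not part of the exporter or the dashboard.

```python
# Hypothetical helper names; only the "divide by 1e9" step comes from the dashboard query.
NS_PER_SECOND = 1_000_000_000


def cpu_ns_to_fraction(cpu_ns: float) -> float:
    """Convert ns of CPU time used in a one-second window to a fraction of one core."""
    return cpu_ns / NS_PER_SECOND


def cpu_ns_to_percent(cpu_ns: float) -> float:
    """Same conversion, expressed as a percentage of one core."""
    return cpu_ns_to_fraction(cpu_ns) * 100.0


# Illustrative gauge samples in ns (not real measurements).
samples_ns = [250_000_000, 800_000_000]

for ns in samples_ns:
    print(f"{ns} ns -> {cpu_ns_to_fraction(ns):.2f} ({cpu_ns_to_percent(ns):.0f}%)")
```

Under that assumption, a sample of 800,000,000 ns maps to 0.8, which matches the roughly 80% seen in the capture. Whether a pod using more than one core reports values above 1bn is not something verified here.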

image

This change makes the variable values refresh on time range change so
old pods don't show up anymore.
@Supporterino
Owner

@Reanmachine The dashboard looks good to me. Just one thing: could you rename the CPU graphs to "usage", since that is what you are converting the value to?

This fix standardizes the names to `cpu usage` as that's the measurement we're showing. Also
noticed the cpu gauge had the old calculation, so aligned it with the others and added the truenas tag.
@Supporterino
Owner

LGTM. Thank you for your submission.

@Supporterino Supporterino merged commit 6c205a3 into Supporterino:main Oct 4, 2024
4 checks passed