[kube-prometheus-stack] API server metrics broken after upgrade #5274

krankkkk · 2025-02-05T13:02:34Z

Describe the bug

We upgraded kube-prometheus-stack from 62.6.0 to 68.4.5, and since then the API server availability metric (apiserver_request:availability30d) has been reporting completely implausible values.

What's your helm version?

Deployed via ArgoCD, which internally uses Helm 3.15.4.

What's your kubectl version?

Irrelevant

Which chart?

kube-prometheus-stack

What's the chart version?

68.4.5

What happened?

We observed apiserver_request:availability30d go from 99.999% to well above 100%: apiserver_request:availability30d{verb="all"} is currently at 1.6425321904704488 (about 164%) on one cluster and 2.225346243637766 (about 223%) on another.
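A quick way to see which clusters and verbs are affected is to filter the recorded series for values above 1, since an availability ratio should never exceed 1. This is a minimal sketch, assuming the recording rule is still evaluated under the same name on the upgraded chart:

```promql
# Availability is a ratio, so any sample above 1 is implausible.
apiserver_request:availability30d > 1
```

Graphing the unfiltered apiserver_request:availability30d over a range that spans the upgrade should show the same jump.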

In the Prometheus UI, we can pinpoint the exact time we started the upgrade.

If we roll back the upgrade, the metrics return to normal.

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

No response

Enter the command that you execute and failing/misfunctioning.

No command necessary

Anything else we need to know?

Cluster version is v1.29.2.


OOub · 2025-02-07T09:59:20Z

Same issue here.

KihyeokK · 2025-02-07T16:07:21Z

Getting the same issue after upgrading to version 68.1.1. The API Server dashboard seems off, with values going way above 100%.

Also getting KubeAPIErrorBudgetBurn alerts after the upgrade, just like in this issue. It seems burnrate queries such as apiserver_request:burnrate1d may have problems too.

abohatyrenko · 2025-02-07T18:41:59Z

Same here with the latest version of kube-prometheus-stack (69.2.0).

boettluSICKAG · 2025-02-17T13:19:18Z

Getting the same issue on the latest chart version. The recording rule apiserver_request:availability30d{verb="read"} is not working, seemingly because the expression sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le=~"30(\\.0)?",scope="cluster",verb=~"LIST|GET"}) returns no series. It looks like the le=30 bucket is missing.
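To check whether the le=30 bucket is really missing, the bucket labels present on the series can be listed directly. This is a sketch for the Prometheus UI, with each query run separately. The second query assumes the raw histogram is exposed as apiserver_request_sli_duration_seconds_bucket, a name inferred from the recording rule above rather than confirmed in this thread:

```promql
# Which le buckets exist on the recorded 30d increase for cluster-scoped reads?
count by (le) (
  cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{scope="cluster", verb=~"LIST|GET"}
)

# Which le buckets does the API server expose on the raw histogram? (assumed metric name)
count by (le) (
  apiserver_request_sli_duration_seconds_bucket{verb=~"LIST|GET"}
)
```

If neither result contains le="30" or le="30.0", the selector quoted above matches nothing, which would be consistent with the read availability rule returning no usable data.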

krankkkk added the bug label Feb 5, 2025

zeritti changed the title ~~[prometheus-kube-stack] API server metrics broken after upgrade~~ [kube-prometheus-stack] API server metrics broken after upgrade Feb 9, 2025
