Prometheus alert for pod crashlooping #21394

chaodaiG · 2021-03-16T15:59:50Z

Add an alert that catches pod crashlooping.

chaodaiG · 2021-03-16T16:00:50Z

The alert is currently set to be pretty sensitive: any pod restart would trigger it. Will adjust later if we feel it's too loud.

chaodaiG · 2021-03-16T16:01:00Z

/cc @alvaroaleman

k8s-ci-robot · 2021-03-16T16:04:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, chaodaiG

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~config/prow/cluster/monitoring/OWNERS~~ [chaodaiG]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cjwagner · 2021-03-16T18:06:20Z

config/prow/cluster/monitoring/mixins/prometheus/prow_alerts.libsonnet

+        rules: [
+          {
+            alert: 'prow-pod-crashlooping',
+            expr: 'rate(kube_pod_container_status_restarts_total{namespace=~"default|prow-monitoring",job="kube-state-metrics"}[5m]) * 60 * 5 > 0',


FYI for the future: Prometheus has a convenient increase(<metric>[<duration>]) function that can be used instead of rate(<metric>[<duration>]) * <duration in seconds>. It handles breaks in monotonicity just like rate().
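For illustration, here is a sketch of how the rule's expression might look with increase(), assuming the same metric, label selector, and 5m window as in the diff above. Only the two fields shown in the diff are reproduced. This is not the change made in this PR.

```jsonnet
{
  alert: 'prow-pod-crashlooping',
  // Assumption: same selector and window as the rule in the diff.
  // increase(...[5m]) stands in for rate(...[5m]) * 60 * 5.
  expr: 'increase(kube_pod_container_status_restarts_total{namespace=~"default|prow-monitoring",job="kube-state-metrics"}[5m]) > 0',
}
```

With the `> 0` threshold the two forms should fire on the same condition. The increase() form drops the hand-written seconds multiplier, so the window only has to be changed in one place.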

Prometheus alert for pod crashlooping

9772301

k8s-ci-robot requested review from cblecker, cjwagner and stevekuznetsov March 16, 2021 16:00

k8s-ci-robot requested a review from alvaroaleman March 16, 2021 16:01

alvaroaleman approved these changes Mar 16, 2021

k8s-ci-robot assigned alvaroaleman Mar 16, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 16, 2021

k8s-ci-robot merged commit ce346f4 into kubernetes:master Mar 16, 2021

k8s-ci-robot added this to the v1.21 milestone Mar 16, 2021

cjwagner reviewed Mar 16, 2021

chaodaiG mentioned this pull request Mar 16, 2021

presubmits were not triggering for kubernetes/kubernetes #21090

Closed

chaodaiG deleted the alert-pod-crashlooping branch March 16, 2021 18:42
