Prometheus stops firing but alerts stay active #815

Closed
mxinden opened this issue May 22, 2017 · 14 comments

@mxinden
Member

mxinden commented May 22, 2017

This issue was reported by Jack and Julius here:

When the metrics behind my test alerts go back to normal, Prometheus stops firing, but the alert stays active in AM. The PagerDuty incident is not resolved. Yikes.

Can confirm. A simple "up == 0" alert sticks around in Alertmanager (current master) after it gets resolved in Prometheus, and also doesn't disappear after 5 minutes (the resolve timeout). It stays in the UI and it keeps sending "firing" webhook notifications.

@juliusv
Member

juliusv commented May 23, 2017

Prometheus built from current master:

./prometheus -alertmanager.url=http://localhost:9093/
INFO[0000] Starting prometheus (version=1.6.2, branch=master, revision=5405a4724fbc7307786a65c176168a8948c0cbbf)  source="main.go:88"
INFO[0000] Build context (go=go1.8, user=julius@julius-samsung, date=20170523-13:45:49)  source="main.go:89"

Alertmanager built from current master:

./alertmanager
INFO[0000] Starting alertmanager (version=0.7.0-rc.0, branch=master, revision=6be3fbcb68af2ceb4d6bf77049a3390a1d6ad32f)  source="main.go:109"
INFO[0000] Build context (go=go1.8, user=julius@julius-samsung, date=20170523-13:47:45)  source="main.go:110"

Prometheus config:

global:
  scrape_interval: 5s

rule_files:
  - "prometheus.rules"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

prometheus.rules file:

ALERT SomethingIsDown IF up == 0

Alertmanager config:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

  receiver: webhook

receivers:
- name: 'webhook'
  webhook_configs:
    - url: 'http://localhost:12345/'

Steps to reproduce:

1. Start Prometheus without the node exporter running
2. Wait for the alert to fire - it appears in the AM UI and notifications are sent to the webhook (a listener started with netcat)
3. Bring up the node exporter

After the last step, the alert clears in Prometheus but stays visible in the Alertmanager UI even after 20 minutes. The webhook notification keeps getting re-sent endlessly, but now with a resolved status...

POST / HTTP/1.1
Host: localhost:12345
User-Agent: Go-http-client/1.1
Content-Length: 602
Content-Type: application/json
Accept-Encoding: gzip

{"receiver":"webhook","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"SomethingIsDown","instance":"localhost:9100","job":"node"},"annotations":{},"startsAt":"2017-05-23T15:53:37.804+02:00","endsAt":"2017-05-23T16:00:37.809+02:00","generatorURL":"http://julius-samsung:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=0"}],"groupLabels":{"alertname":"SomethingIsDown"},"commonLabels":{"alertname":"SomethingIsDown","instance":"localhost:9100","job":"node"},"commonAnnotations":{},"externalURL":"http://julius-samsung:9093","version":"4","groupKey":"{}:{alertname=\"SomethingIsDown\"}"}

@fabxc
Contributor

fabxc commented May 23, 2017

The display issue is theoretically unrelated as we simply forgot to filter resolved alerts in the new API endpoint. So they remain visible until they are garbage collected.

The re-sending might be related to the fixes we made for 0.6 already. If it's only occurring in 0.7 I'd be surprised. Any chance you can verify that?

@fabxc
Contributor

fabxc commented May 26, 2017

@juliusv @mxinden any further insights on this?

@juliusv
Member

juliusv commented May 26, 2017

Checking again now with 0.6.

@juliusv
Member

juliusv commented May 26, 2017

So part of the weird behavior I got earlier from master was that I used curl to receive the webhook requests, which doesn't send back an HTTP status response, so those requests just timed out and were probably retried for that reason.

Now I'm using a small Go web server that just responds with 200, and I'm still seeing weird (but different) behavior.
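
For reference, a receiver of that kind can be as small as the following sketch; this is an illustrative listener that just logs the body and answers 200, not necessarily the exact server used here:

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Log the notification body so firing/resolved messages are visible.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("webhook notification: %s", body)
		// An explicit 200 tells Alertmanager the notification was delivered,
		// so it is not retried the way the timed-out curl requests were.
		w.WriteHeader(http.StatusOK)
	})
	// Matches the URL configured for the webhook receiver above.
	log.Fatal(http.ListenAndServe(":12345", nil))
}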

This is from current master:

  • when the alert starts firing in Prometheus, it appears in AM
  • 30 seconds later (group_wait: 30s), a firing webhook request gets sent out by the AM
  • when I resolve the alert in Prometheus, it remains visible in the AM UI, and only exactly 5 minutes after the firing notification got sent, does a resolved notification get sent out (I assume that is even intentional because group_interval: 5m also applies when all alerts in a group are resolved?)

So in current master the only thing that's clearly a bug is the UI! And that is also the only behavioral difference from 0.6.2: in 0.6.2, the alert disappears from the UI immediately after being resolved in Prometheus.

@brian-brazil
Contributor

(I assume that is even intentional because group_interval: 5m also applies when all alerts in a group are resolved?)

That's correct, otherwise we'd spam users as each individual resolution came in. Resolved notifications need to obey group_interval.

@fabxc
Contributor

fabxc commented May 26, 2017 via email

@brian-brazil
Contributor

We did have bugs fixed in 0.6 around resolved notifications, so it could be those.

@juliusv
Member

juliusv commented May 26, 2017

#820 indeed fixes the UI issues for me. So from my side, I cannot reproduce this issue anymore. Should we close it and see if someone else reports similar issues again?

@fabxc
Contributor

fabxc commented May 26, 2017 via email

@juliusv juliusv closed this as completed May 26, 2017
@damomurf

damomurf commented May 31, 2017

@juliusv I'm seeing the repeating "resolved" notification firing into Slack right now, with two different Prometheus installations. What information can I give you to try and nail this down?

Once the alert resolves, I see nothing in either Prometheus or Alertmanager UIs.

The following is the Alertmanager log in debug mode for the repeating notifications.

time="2017-05-31T06:40:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:40:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:40:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:40:38Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:08Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:43:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:43:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" ```

@damomurf

I've just tried with an alertmanager built from master and observe the same behaviour. The offending alert, once resolved, notifies "resolved" every repeat_interval.

Route and Receivers config:

route:
  group_by: ['account', 'region', 'environment', 'alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
  receiver: infra

receivers:
- name: 'infra'
  slack_configs:
  - send_resolved: true
    channel: "channel"

hh pushed a commit to ii/alertmanager that referenced this issue Feb 9, 2018
@bala0409

bala0409 commented Dec 1, 2019

I want to get an alert in Slack or email when a Kubernetes pod goes down.

@roidelapluie
Member

roidelapluie commented Dec 1, 2019

@bala0409 Thanks for your interest. It looks as if this is actually a question about usage and not development.

To make your question, and all replies, easier to find, we suggest you move this over to our user mailing list, which you can also search. If you prefer more interactive help, join our IRC channel, #prometheus on irc.freenode.net. Please be aware that our IRC channel has no logs, is not searchable, and that people might not answer quickly if they are busy or asleep. If in doubt, you should choose the mailing list.

If you think this is not purely a support question, feel free to comment in here or take the underlying issues to our developer mailing list.
