Prometheus stops firing but alerts stay active #815

Closed
mxinden opened this issue May 22, 2017 · 14 comments

@mxinden
Member

mxinden commented May 22, 2017

This issue was reported by Jack and Julius here:

When the metrics behind my test alerts go back to normal, Prometheus stops firing, but the alert stays active in AM. The PagerDuty incident is not resolved. Yikes.

Can confirm. A simple "up == 0" alert sticks around in Alertmanager (current master) after it gets resolved in Prometheus, and also doesn't disappear after 5 minutes (the resolve timeout). It stays in the UI and it keeps sending "firing" webhook notifications.

@juliusv
Member

juliusv commented May 23, 2017

Prometheus built from current master:

./prometheus -alertmanager.url=http://localhost:9093/
INFO[0000] Starting prometheus (version=1.6.2, branch=master, revision=5405a4724fbc7307786a65c176168a8948c0cbbf)  source="main.go:88"
INFO[0000] Build context (go=go1.8, user=julius@julius-samsung, date=20170523-13:45:49)  source="main.go:89"

Alertmanager built from current master:

./alertmanager
INFO[0000] Starting alertmanager (version=0.7.0-rc.0, branch=master, revision=6be3fbcb68af2ceb4d6bf77049a3390a1d6ad32f)  source="main.go:109"
INFO[0000] Build context (go=go1.8, user=julius@julius-samsung, date=20170523-13:47:45)  source="main.go:110"

Prometheus config:

global:
  scrape_interval: 5s

rule_files:
  - "prometheus.rules"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

prometheus.rules file:

ALERT SomethingIsDown IF up == 0

Alertmanager config:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

  receiver: webhook

receivers:
- name: 'webhook'
  webhook_configs:
    - url: 'http://localhost:12345/'

Steps to reproduce:

1. Start Prometheus without the node exporter running
2. Wait for the alert to fire - it appears in the AM UI and notifications are sent to the webhook (a listener started with netcat)
3. Bring up the node exporter

After the last step, the alert clears in Prometheus but stays visible in the Alertmanager UI even after 20 minutes. The webhook notification keeps getting re-sent endlessly, but now with a resolved status...

POST / HTTP/1.1
Host: localhost:12345
User-Agent: Go-http-client/1.1
Content-Length: 602
Content-Type: application/json
Accept-Encoding: gzip

{"receiver":"webhook","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"SomethingIsDown","instance":"localhost:9100","job":"node"},"annotations":{},"startsAt":"2017-05-23T15:53:37.804+02:00","endsAt":"2017-05-23T16:00:37.809+02:00","generatorURL":"http://julius-samsung:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=0"}],"groupLabels":{"alertname":"SomethingIsDown"},"commonLabels":{"alertname":"SomethingIsDown","instance":"localhost:9100","job":"node"},"commonAnnotations":{},"externalURL":"http://julius-samsung:9093","version":"4","groupKey":"{}:{alertname=\"SomethingIsDown\"}"}

@fabxc
Contributor

fabxc commented May 23, 2017

The display issue is theoretically unrelated as we simply forgot to filter resolved alerts in the new API endpoint. So they remain visible until they are garbage collected.

The re-sending might be related to the fixes we made for 0.6 already. If it's only occurring in 0.7 I'd be surprised. Any chance you can verify that?

@fabxc
Contributor

fabxc commented May 26, 2017

@juliusv @mxinden any further insights on this?

@juliusv
Member

juliusv commented May 26, 2017

Checking again now with 0.6.

@juliusv
Member

juliusv commented May 26, 2017

So part of the weird behavior I got earlier from master was that I used curl to receive the webhook requests, which doesn't send back an HTTP status response, so those requests just timed out and were probably retried for that reason.

Now I'm using a small Go web server that just responds with 200, and I'm still seeing weird (but different) behavior.
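
For reference, a receiver of that kind can be as small as the following sketch; this is an illustrative listener that just logs the body and answers 200, not necessarily the exact server used here:

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Log the notification body so firing/resolved messages are visible.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("webhook notification: %s", body)
		// An explicit 200 tells Alertmanager the notification was delivered,
		// so it is not retried the way the timed-out curl requests were.
		w.WriteHeader(http.StatusOK)
	})
	// Matches the URL configured for the webhook receiver above.
	log.Fatal(http.ListenAndServe(":12345", nil))
}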

This is from current master:

  • when the alert starts firing in Prometheus, it appears in AM
  • 30 seconds later (group_wait: 30s), a firing webhook request gets sent out by the AM
  • when I resolve the alert in Prometheus, it remains visible in the AM UI, and only exactly 5 minutes after the firing notification got sent, does a resolved notification get sent out (I assume that is even intentional because group_interval: 5m also applies when all alerts in a group are resolved?)

So in current master the only thing that's clearly a bug is the UI! And that is also the only behavioral difference from 0.6.2: in 0.6.2, the alert disappears from the UI immediately after being resolved in Prometheus.

@brian-brazil
Contributor

(I assume that is even intentional because group_interval: 5m also applies when all alerts in a group are resolved?)

That's correct, otherwise we'd spam users as each individual resolution came in. Resolved notifications need to obey group_interval.

@fabxc
Contributor

fabxc commented May 26, 2017 via email

@brian-brazil
Contributor

We did have bugs fixed in 0.6 around resolved notifications, so it could be those.

@juliusv
Member

juliusv commented May 26, 2017

#820 indeed fixes the UI issues for me. So from my side, I cannot reproduce this issue anymore. Should we close it and see if someone else reports similar issues again?

@fabxc
Contributor

fabxc commented May 26, 2017 via email

@juliusv juliusv closed this as completed May 26, 2017
@damomurf

damomurf commented May 31, 2017

@juliusv I'm seeing the repeating "resolved" notification firing into Slack right now, with two different Prometheus installations. What information can I give you to try and nail this down?

Once the alert resolves, I see nothing in either Prometheus or Alertmanager UIs.

The following is the Alertmanager log in debug mode for the repeating notifications.

time="2017-05-31T06:40:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:40:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:40:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:40:38Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:08Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:43:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:43:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" ```

@damomurf

I've just tried with an alertmanager built from master and observe the same behaviour. The offending alert, once resolved, notifies "resolved" every repeat_interval.

Route and Receivers config:

route:
  group_by: ['account', 'region', 'environment', 'alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
  receiver: infra

receivers:
- name: 'infra'
  slack_configs:
  - send_resolved: true
    channel: "channel"

hh pushed a commit to ii/alertmanager that referenced this issue Feb 9, 2018
@bala0409

bala0409 commented Dec 1, 2019

I want to get an alert in Slack or email when a Kubernetes pod goes down.

@roidelapluie
Member

roidelapluie commented Dec 1, 2019

@bala0409 Thanks for your interest. It looks as if this is actually a question about usage and not development.

To make your question, and all replies, easier to find, we suggest you move this over to our user mailing list, which you can also search. If you prefer more interactive help, join our IRC channel, #prometheus on irc.freenode.net. Please be aware that our IRC channel has no logs, is not searchable, and that people might not answer quickly if they are busy or asleep. If in doubt, you should choose the mailing list.

If you think this is not purely a support question, feel free to comment in here or take the underlying issues to our developer mailing list.
