Repeated "resolved" notifications firing #844

Closed
damomurf opened this issue Jun 5, 2017 · 18 comments

damomurf commented Jun 5, 2017

I'm seeing repeated "resolved" notifications firing into notification endpoints from the same alert rule, with two different Prometheus installations. The alerts seem to repeat 5 times after the initial notification and then stop. It seems to affect every version of Alertmanager right back to 0.5.0; the latest version that doesn't exhibit the problem is 0.4.2.

I'm assuming this relates to #815.

I've also tried with an Alertmanager built from master and observed the same behaviour. The offending alert, once resolved, seems to notify on every repeat_interval. The example config below was an attempt to replicate the problem with much smaller timeframes.

Once the alert really resolves in Prometheus, I see nothing in either Prometheus or Alertmanager UIs.

The following is the Alertmanager log in debug mode for the repeating notifications.

time="2017-05-31T06:40:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:40:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:40:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:40:38Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:08Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:43:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:43:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 

Route and Receivers config:

route:
  group_by: ['account', 'region', 'environment', 'alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
  receiver: infra

receivers:
- name: 'infra'
  slack_configs:
  - send_resolved: true
    channel: "channel"
mxinden added this to the v0.7 milestone Jun 6, 2017

fabxc commented Jun 6, 2017

A note on the configuration in general: repeat_interval should generally be quite long in a sane setup, on the order of 3 to 48 hours. You will be notified within the group interval should anything change about the state of alerts. The repeat interval only covers unchanged state; anything in the minute range is typically far too noisy for this.
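
For illustration, a sketch of what a route block with intervals in those ranges might look like (the values are placeholders, not a recommendation for any specific setup):

route:
  group_by: ['account', 'region', 'environment', 'alertname']
  group_wait: 30s       # wait before the first notification for a new group
  group_interval: 5m    # notify again when the set of alerts in the group changes
  repeat_interval: 4h   # re-send only for unchanged, still-firing state
  receiver: infra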

Can you provide your Prometheus configuration as well, in particular the rule evaluation interval and the alerting rule in question?


fabxc commented Jun 6, 2017

You are sending from two Prometheus servers, right? Do they have different external labels?


damomurf commented Jun 6, 2017

@fabxc - Sorry, I wasn't very clear on a couple of those things.

The Alertmanager config above was just so that I could replicate the problem without waiting for hours. The config I normally run with is:

 group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

In reference to two servers, I was highlighting that I have two completely independent, separate Prometheus/Alertmanager configurations that end up triggering the same scenario. There's no transmission of metrics between them, no shared Alertmanager, or anything like that. The common factor is that they're using the same exporter and alert rules.

I'm not sure how it's relevant, but the exporter in question that triggers the scenario is https://github.com/Braedon/prometheus-es-exporter (version 0.3.0).

I'm running with Prometheus configs of:

  scrape_interval:     15s
  evaluation_interval: 30s   # 15s on the second server

The target is scraped via a standard file_sd_config target that just configures a few labels.
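
For context, a hypothetical file_sd_config stanza of that shape (the job name, file path, and label names are illustrative assumptions, not the actual setup):

scrape_configs:
  - job_name: 'elasticsearch'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/elasticsearch.json'

with the JSON target file attaching labels such as account, region, and environment to each target.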

The alert rule is like so:

ALERT ElasticSearchClusterStatus
  IF es_cluster_health_status{job="elasticsearch"} != 0
  FOR 1m
  ANNOTATIONS {
    description = "Elasticsearch cluster {{ $labels.account }} - {{ $labels.region }} - {{ $labels.environment }} health is no longer Green. {{ $value }}",
    account = "{{ $labels.account }}",
    region = "{{ $labels.region }}",
    environment = "{{ $labels.environment }}"
  }

I hope that's useful information.


fabxc commented Jun 6, 2017

So you have two Prometheus servers but they are not sending to the same Alertmanager cluster?
The logs indicate the same alert coming in twice in short sequence. Do they have any external labels set?
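
For context, external labels are set in the Prometheus global block and are what distinguish otherwise identical alerts coming from different servers; a hypothetical example (the label name and value are placeholders):

global:
  external_labels:
    replica: 'prometheus-1'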

Really having trouble reproducing this in any way.

The logs you've shown above seem to be truncated in the front. Can you provide the full thing?
The complete configs for both Prometheus and Alertmanager would be best. For example, the indentation of group_wait above seems off, but I don't know whether that's just a copy-and-paste error.


damomurf commented Jun 7, 2017

I'll do some significant digging to establish whether there are two sources of the alerts, though I can't see how there could be. If not, I'll reproduce the issue and include the full config and logs.


mxinden commented Jun 7, 2017

@damomurf Thanks for your help! I will move this issue to the v0.8 milestone. As we are having difficulty reproducing this bug, and as it is not a mission-critical issue, we will release v0.7 without a fix for now. Don't get me wrong: we are still very much interested in fixing this issue.

mxinden modified the milestones: v0.8, v0.7 Jun 7, 2017

sandersaares commented Jun 15, 2017

I am also experiencing repeated "resolved" notifications with version=0.5.1, branch=master, revision=0ea1cac51e6a620ec09d053f0484b97932b5c902.

Anonymized configuration snippets follow.

Alertmanager configuration:

route:
  group_by: ['alertname', 'service']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'slack-devops'

receivers:
- name: 'slack-devops'
  slack_configs:
  - api_url: "xxx"
    channel: "#devops"
    send_resolved: true
    text: "{{ range .Alerts }}{{ .Labels.job }} {{ end }}\n{{ range .CommonAnnotations.SortedPairs }}{{ .Name }} = {{ .Value }}\n{{ end }}"

Prometheus alerts configuration:

ALERT web_endpoint_is_down
    IF probe_success < 1.0
    FOR 5m
    ANNOTATIONS {
        description = "xxx"
    }

Prometheus monitoring configuration:

global:
  scrape_interval: 1m
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'xxx'
    metrics_path: /probe
    params:
      module: ['basic_http_get']
      target: ['https://xxx']
    static_configs:
      - targets: ['zzz:9115']

  - job_name: 'yyy'
    metrics_path: /probe
    params:
      module: ['basic_http_get']
      target: ['https://yyy']
    static_configs:
      - targets: ['zzz:9115']

rule_files:
  - "/etc/prometheus/alerts"

I do not currently have high-detail logs but will try to get some. There is nothing relevant in the "normal" logs.


mxinden commented Jun 15, 2017

@sandersaares Are you able to test out v0.7.1 in your environment and try to reproduce the issue?


sandersaares commented Jun 15, 2017

Yes, doing that right now. Is there anything special I should do to get the most debug info out of it? (beyond just log.level)

@sandersaares

Meanwhile, one thing I noticed was that the "job" label collection was different in each of the repeated "resolved" notifications. Here are the ones it sent (together with all the firing notifications). You can see some sort of pattern here but I can't quite understand what it means.

[FIRING:1] web_endpoint_is_down (zzz:9115 website1)
website1 
description = xxx

[FIRING:3] web_endpoint_is_down (zzz:9115)
website1 website3 website2 
description = xxx

[FIRING:2] web_endpoint_is_down (zzz:9115)
website4 website5 website2 website1 website3
description = xxx

[13:47] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website2 website1 website3 website4 website5
description = xxx

[13:48] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website3 website4 website5 website2
description = xxx

[13:48] 
[RESOLVED] web_endpoint_is_down (zzz:9115 website1)
website1
description = xxx

[13:49] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website4 website5 website2 website1 website3
description = xxx

[13:50] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website2 website1 website3
description = xxx

[13:50] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website4 website5
description = xxx

[RESOLVED] web_endpoint_is_down (zzz:9115)
website3 website4 website5 website2 website1 
description = xxx

[RESOLVED] web_endpoint_is_down (zzz:9115)
website3 website4 website5 website2 
description = xxx

@sandersaares

I am unable to reproduce with 0.7.1.

However, the behavior was also not consistently problematic with 0.5.1 - I first saw it yesterday and have been using the software for quite some time now. I will monitor the behavior and report back if I have any more info to add - perhaps there are some hidden factors that affect the situation.


mxinden commented Jun 15, 2017

@sandersaares Thanks for reporting. Let us know if you are facing any further issues!

@damomurf Any news from your side? Are you still facing these issues with v0.7.1?

@damomurf

@mxinden I'll deploy v0.7.1 first thing in my morning and definitely report back.


damomurf commented Jun 16, 2017

I'll need to leave it for at least 24 hours to get a read on some regular (ugh, yes - regular) alerts that were triggering the issue. However, initial indications are that v0.7.1 looks better.

@damomurf

After running v0.7.1 for several days, it definitely seems like the repeated "resolved" notification situation has largely been solved. I recall seeing one instance of a repeat, but I believe there could have been other reasons for that.

v0.7.1 shows "Error: Timeout exceeded" on the Silences page in the UI for me, but I'll raise that as another issue.


mxinden commented Jun 19, 2017

@damomurf Thanks for testing this out on v0.7.1! I will close this for now, as it seems your issue has been resolved. Please feel free to reopen if it occurs again.

mxinden closed this as completed Jun 19, 2017

dIoMeDs commented Aug 22, 2017

Hi all.
I have the same problem with repeated "resolved" notifications firing.

Tried on v0.8.0 and also v0.7.1.

Testing with a simple config:

global:
  slack_api_url: "https://hooks.slack.com/services/qqqwwweee"

route:
  receiver: slack-alert
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 3m

  # All alerts in a notification have the same value for these labels.
#  group_by: [alertname, datacenter, app]

receivers:
- name: "slack-alert"
  slack_configs:
  - channel: "#alerts"
    text: "<!channel> \ndescription: {{ .CommonAnnotations.description }}\nsummary: {{ .CommonAnnotations.summary }}"
    send_resolved: true

and alert.rules:

# Alert Rules

ALERT disk_used_percent
  IF disk_used_percent > 20
  FOR 5s
  LABELS { severity="critical" }
  ANNOTATIONS {
   summary = "Disk",
   description = "{{ $labels.instance }} has critical disk usage (current value: {{ $value }})",
   }


mxinden commented Aug 23, 2017

@dIoMeDs according to @damomurf this has been fixed since v0.7.1. Please open up a new issue with more details on how we can reproduce your issue. Preferably with the help of the webhook and the send-alerts script.
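
For reference, a minimal webhook receiver sketch for capturing exactly what Alertmanager sends (the URL is a placeholder for whatever local HTTP listener you run to log request bodies):

receivers:
- name: 'debug-webhook'
  webhook_configs:
  - url: 'http://localhost:5001/'
    send_resolved: true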
