Repeated "resolved" notifications firing #844

Closed
damomurf opened this issue Jun 5, 2017 · 18 comments

damomurf commented Jun 5, 2017

I'm seeing repeated "resolved" notifications firing into notification endpoints from the same alert rule, with two different Prometheus installations. The alerts seem to repeat 5 times after the initial notification and then stop. It seems to affect every version of Alertmanager right back to 0.5.0; the latest version that doesn't exhibit the problem is 0.4.2.

I'm assuming this relates to #815.

I've also tried with an Alertmanager built from master and observed the same behaviour. The offending alert, once resolved, seems to notify on every repeat_interval. The example config below was an attempt to replicate the problem with much smaller timeframes.

Once the alert really resolves in Prometheus, I see nothing in either Prometheus or Alertmanager UIs.

The following is the Alertmanager log in debug mode for the repeating notifications.

time="2017-05-31T06:40:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:40:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:40:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:40:38Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:41:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:41:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:08Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:42:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:42:37Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:43:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:07Z" level=debug msg="flushing [JobDown[d36f98e][resolved]]" aggrGroup={}:{alertname="JobDown"} source="dispatch.go:425" 
time="2017-05-31T06:43:07Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 
time="2017-05-31T06:43:37Z" level=debug msg="Received alert" alert=JobDown[d36f98e][resolved] component=dispatcher source="dispatch.go:183" 

Route and Receivers config:

route:
  group_by: ['account', 'region', 'environment', 'alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
  receiver: infra

receivers:
- name: 'infra'
  slack_configs:
  - send_resolved: true
    channel: "channel"
mxinden added this to the v0.7 milestone Jun 6, 2017

fabxc commented Jun 6, 2017

A note on the configuration in general: repeat_interval should generally be quite long in a sane setup, on the order of 3 to 48 hours. You will be notified within the group interval should anything change about the state of alerts. The repeat interval only covers unchanged state; anything in the minute range is typically far too noisy for this.
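
For illustration, a sketch of what a route block with intervals in those ranges might look like (the values are placeholders, not a recommendation for any specific setup):

route:
  group_by: ['account', 'region', 'environment', 'alertname']
  group_wait: 30s       # wait before the first notification for a new group
  group_interval: 5m    # notify again when the set of alerts in the group changes
  repeat_interval: 4h   # re-send only for unchanged, still-firing state
  receiver: infra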

Can you provide your Prometheus configuration as well, in particular the rule evaluation interval and the alerting rule in question?


fabxc commented Jun 6, 2017

You are sending from two Prometheus servers, right? Do they have different external labels?


damomurf commented Jun 6, 2017

@fabxc - Sorry, I wasn't very clear on a couple of those things.

The Alertmanager config above was just so that I could replicate the problem without waiting for hours. The config I normally run with is:

 group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

In reference to two servers, I was highlighting that I have two completely independent, separate Prometheus/Alertmanager configurations that end up triggering the same scenario. There's no transmission of metrics between them, no shared Alertmanager, or anything like that. The common factor is that they're using the same exporter and alert rules.

I'm not sure how it's relevant, but the exporter in question that triggers the scenario is https://github.com/Braedon/prometheus-es-exporter (version 0.3.0).

I'm running with Prometheus configs of:

  scrape_interval:     15s
  evaluation_interval: 30s   # 15s on the second server

The target is scraped via a standard file_sd_config target that just configures a few labels.
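
For context, a hypothetical file_sd_config stanza of that shape (the job name, file path, and label names are illustrative assumptions, not the actual setup):

scrape_configs:
  - job_name: 'elasticsearch'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/elasticsearch.json'

with the JSON target file attaching labels such as account, region, and environment to each target.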

The alert rule is like so:

ALERT ElasticSearchClusterStatus
  IF es_cluster_health_status{job="elasticsearch"} != 0
  FOR 1m
  ANNOTATIONS {
    description = "Elasticsearch cluster {{ $labels.account }} - {{ $labels.region }} - {{ $labels.environment }} health is no longer Green. {{ $value }}",
    account = "{{ $labels.account }}",
    region = "{{ $labels.region }}",
    environment = "{{ $labels.environment }}"
  }

I hope that's useful information.


fabxc commented Jun 6, 2017

So you have two Prometheus servers but they are not sending to the same Alertmanager cluster?
The logs indicate the same alert coming in twice in short sequence. Do they have any external labels set?
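
For context, external labels are set in the Prometheus global block and are what distinguish otherwise identical alerts coming from different servers; a hypothetical example (the label name and value are placeholders):

global:
  external_labels:
    replica: 'prometheus-1'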

Really having trouble reproducing this in any way.

The logs you've shown above seem to be truncated in the front. Can you provide the full thing?
The complete configs for both Prometheus and Alertmanager would be best. For example, the indentation of group_wait above seems off, but I don't know whether that's just a copy-and-paste error.


damomurf commented Jun 7, 2017

I'll do some significant digging to establish whether there are two sources of the alerts, though I can't see how there could be. If not, I'll reproduce the issue and include the full config and logs.


mxinden commented Jun 7, 2017

@damomurf Thanks for your help! I will move this issue to the v0.8 milestone. As we are having difficulty reproducing this bug, and as it is not a mission-critical issue, we will release v0.7 without a fix for now. Don't get me wrong: we are still very much interested in fixing this issue.

mxinden modified the milestones: v0.8, v0.7 Jun 7, 2017

sandersaares commented Jun 15, 2017

I am also experiencing repeated "resolved" notifications with version=0.5.1, branch=master, revision=0ea1cac51e6a620ec09d053f0484b97932b5c902.

Anonymized configuration snippets follow.

Alertmanager configuration:

route:
  group_by: ['alertname', 'service']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'slack-devops'

receivers:
- name: 'slack-devops'
  slack_configs:
  - api_url: "xxx"
    channel: "#devops"
    send_resolved: true
    text: "{{ range .Alerts }}{{ .Labels.job }} {{ end }}\n{{ range .CommonAnnotations.SortedPairs }}{{ .Name }} = {{ .Value }}\n{{ end }}"

Prometheus alerts configuration:

ALERT web_endpoint_is_down
    IF probe_success < 1.0
    FOR 5m
    ANNOTATIONS {
        description = "xxx"
    }

Prometheus monitoring configuration:

global:
  scrape_interval: 1m
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'xxx'
    metrics_path: /probe
    params:
      module: ['basic_http_get']
      target: ['https://xxx']
    static_configs:
      - targets: ['zzz:9115']

  - job_name: 'yyy'
    metrics_path: /probe
    params:
      module: ['basic_http_get']
      target: ['https://yyy']
    static_configs:
      - targets: ['zzz:9115']

rule_files:
  - "/etc/prometheus/alerts"

I do not currently have high-detail logs but will try to get some. There is nothing relevant in the "normal" logs.


mxinden commented Jun 15, 2017

@sandersaares Are you able to test out v0.7.1 in your environment and try to reproduce the issue?


sandersaares commented Jun 15, 2017

Yes, doing that right now. Is there anything special I should do to get the most debug info out of it? (beyond just log.level)

@sandersaares

Meanwhile, one thing I noticed was that the "job" label collection was different in each of the repeated "resolved" notifications. Here are the ones it sent (together with all the firing notifications). You can see some sort of pattern here but I can't quite understand what it means.

[FIRING:1] web_endpoint_is_down (zzz:9115 website1)
website1 
description = xxx

[FIRING:3] web_endpoint_is_down (zzz:9115)
website1 website3 website2 
description = xxx

[FIRING:2] web_endpoint_is_down (zzz:9115)
website4 website5 website2 website1 website3
description = xxx

[13:47] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website2 website1 website3 website4 website5
description = xxx

[13:48] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website3 website4 website5 website2
description = xxx

[13:48] 
[RESOLVED] web_endpoint_is_down (zzz:9115 website1)
website1
description = xxx

[13:49] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website4 website5 website2 website1 website3
description = xxx

[13:50] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website2 website1 website3
description = xxx

[13:50] 
[RESOLVED] web_endpoint_is_down (zzz:9115)
website4 website5
description = xxx

[RESOLVED] web_endpoint_is_down (zzz:9115)
website3 website4 website5 website2 website1 
description = xxx

[RESOLVED] web_endpoint_is_down (zzz:9115)
website3 website4 website5 website2 
description = xxx

@sandersaares

I am unable to reproduce with 0.7.1.

However, the behavior was also not consistently problematic with 0.5.1 - I first saw it yesterday and have been using the software for quite some time now. I will monitor the behavior and report back if I have any more info to add - perhaps there are some hidden factors that affect the situation.


mxinden commented Jun 15, 2017

@sandersaares Thanks for reporting. Let us know if you are facing any further issues!

@damomurf Any news from your side? Are you still facing these issues with v0.7.1?

@damomurf

@mxinden I'll deploy v0.7.1 first thing in my morning and definitely report back.


damomurf commented Jun 16, 2017

I'll need to leave it for at least 24 hours to get a read on some regular (ugh, yes - regular) alerts that were triggering the issue. However, initial indications are that v0.7.1 looks better.

@damomurf

After running v0.7.1 for several days, it definitely seems like the repeated "resolved" notification situation has largely been solved. I recall seeing one instance of a repeat, but I believe there could have been other reasons for that.

v0.7.1 shows "Error: Timeout exceeded" on the Silences page in the UI for me, but I'll raise that as another issue.


mxinden commented Jun 19, 2017

@damomurf Thanks for testing this out on v0.7.1! I will close this for now, as it seems your issue has been resolved. Please feel free to reopen if it occurs again.

mxinden closed this as completed Jun 19, 2017

dIoMeDs commented Aug 22, 2017

Hi all.
I have the same problem with repeated "resolved" notifications firing.

Tried on v0.8.0 and also v0.7.1.

Testing with a simple config:

global:
  slack_api_url: "https://hooks.slack.com/services/qqqwwweee"

route:
  receiver: slack-alert
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 3m

  # All alerts in a notification have the same value for these labels.
#  group_by: [alertname, datacenter, app]

receivers:
- name: "slack-alert"
  slack_configs:
  - channel: "#alerts"
    text: "<!channel> \ndescription: {{ .CommonAnnotations.description }}\nsummary: {{ .CommonAnnotations.summary }}"
    send_resolved: true

and alert.rules:

# Alert Rules

ALERT disk_used_percent
  IF disk_used_percent > 20
  FOR 5s
  LABELS { severity="critical" }
  ANNOTATIONS {
   summary = "Disk",
   description = "{{ $labels.instance }} has critical disk usage (current value: {{ $value }})",
   }


mxinden commented Aug 23, 2017

@dIoMeDs according to @damomurf this has been fixed since v0.7.1. Please open up a new issue with more details on how we can reproduce your issue. Preferably with the help of the webhook and the send-alerts script.
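
For reference, a minimal webhook receiver sketch for capturing exactly what Alertmanager sends (the URL is a placeholder for whatever local HTTP listener you run to log request bodies):

receivers:
- name: 'debug-webhook'
  webhook_configs:
  - url: 'http://localhost:5001/'
    send_resolved: true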
