Sometimes, no rendering alerts #812

Closed
gaetanars opened this issue Jul 9, 2019 · 21 comments
Labels
bug Something isn't working

Comments

@gaetanars

We run Karma on top of an Alertmanager high-availability cluster (two or more Alertmanager instances) and filter on labels. Sometimes Karma renders no alerts at all.
In this example, the counter at the top indicates that no alert matches 'pf="sdr"', yet the alerts are visible in Alertmanager.

[Screenshot: 2019-07-09 08_51_23-Greenshot]

[Screenshot: 2019-07-09 08_51_31-Greenshot (2)]

We don't see any errors in the Karma logs:

time="2019-07-09T07:06:46Z" level=info msg="Pulling latest alerts and silences from Alertmanager"
time="2019-07-09T07:06:46Z" level=info msg="[sdr] Collecting alerts and silences"
time="2019-07-09T07:06:46Z" level=info msg="[xymon] Collecting alerts and silences"
time="2019-07-09T07:06:46Z" level=info msg="GET http://alertmanager.xymon:9093/metrics timeout=10s"
time="2019-07-09T07:06:46Z" level=info msg="GET http://alertmanager.prometheus:9093/metrics timeout=10s"
time="2019-07-09T07:06:46Z" level=info msg="Upstream version: 0.18.0"
time="2019-07-09T07:06:46Z" level=info msg="GET http://alertmanager.prometheus:9093/api/v1/status timeout=10s"
time="2019-07-09T07:06:46Z" level=info msg="[sdr] Remote Alertmanager version: 0.18.0"
time="2019-07-09T07:06:46Z" level=info msg="[sdr] Got 3 silences(s) in 764.867µs"
time="2019-07-09T07:06:46Z" level=info msg="[sdr] Detecting JIRA links in silences (3)"
time="2019-07-09T07:06:46Z" level=info msg="Upstream version: 0.17.0"
time="2019-07-09T07:06:46Z" level=info msg="GET http://alertmanager.xymon:9093/api/v1/status timeout=10s"
time="2019-07-09T07:06:46Z" level=info msg="[xymon] Remote Alertmanager version: 0.17.0"
time="2019-07-09T07:06:46Z" level=info msg="[xymon] Got 0 silences(s) in 4.333911ms"
time="2019-07-09T07:06:46Z" level=info msg="[xymon] Detecting JIRA links in silences (0)"
time="2019-07-09T07:06:46Z" level=info msg="[sdr] Got 136 alert group(s) in 67.698158ms"
time="2019-07-09T07:06:46Z" level=info msg="[sdr] Deduplicating alert groups (136)"
time="2019-07-09T07:06:46Z" level=info msg="[sdr] Processing unique alert groups (56)"
time="2019-07-09T07:06:46Z" level=info msg="[sdr] Merging autocomplete data (480)"
time="2019-07-09T07:06:46Z" level=info msg="[xymon] Got 997 alert group(s) in 585.932343ms"
time="2019-07-09T07:06:46Z" level=info msg="[xymon] Deduplicating alert groups (997)"
time="2019-07-09T07:06:46Z" level=info msg="[xymon] Processing unique alert groups (838)"
time="2019-07-09T07:06:46Z" level=info msg="[xymon] Merging autocomplete data (2908)"
time="2019-07-09T07:06:46Z" level=info msg="Pull completed"
time="2019-07-09T06:59:34Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 15.087184ms"
time="2019-07-09T07:00:34Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 18.785712ms"
time="2019-07-09T07:01:35Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 15.025122ms"
time="2019-07-09T07:02:35Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 11.880314ms"
time="2019-07-09T07:03:36Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 16.137345ms"
time="2019-07-09T07:04:37Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 19.860792ms"
time="2019-07-09T07:05:37Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 18.43056ms"
time="2019-07-09T07:06:38Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 21.778087ms"
time="2019-07-09T07:07:38Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 22.852808ms"
time="2019-07-09T07:08:39Z" level=info msg="[10.244.6.0 MIS] <200> GET /alerts.json?q=pf%3Dsdr&q=%40state%3Dactive took 14.763488ms"

Have you encountered this problem?

@prymitive
Owner

That definitely looks weird. Do you have any configuration that would strip out some receivers? Another possibility is that those alerts are very short-lived, so they get missed between karma's pulls from the alertmanager API.

@prymitive prymitive self-assigned this Jul 9, 2019
@gaetanars
Author

gaetanars commented Jul 9, 2019

Yes, we keep only one receiver, because when an alert matches multiple receivers Karma doesn't aggregate them.

receivers:
  keep: ["alertes_pf"]
  strip: []

The Alertmanager API seems to be OK: when we hit this problem, the alerts appear correctly in both the Alertmanager web UI and the API.

@prymitive
Owner

prymitive commented Jul 9, 2019

Receivers aren't aggregated as they all can have different group_by rules, they're effectively different notification channels. You can filter alerts with @receiver=foo.
Is it possible that the alerts you're missing in karma are not routed to the alertes_pf receiver?
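
For reference, a filter such as @receiver=foo is sent as a repeated q query parameter, the same shape as the /alerts.json?q=... requests in the access log earlier in this thread. Below is a minimal Python sketch of building such a URL. The base URL and the helper name karma_alerts_url are assumptions for illustration, not part of karma.

```python
from urllib.parse import urlencode


def karma_alerts_url(base_url, filters):
    """Build an /alerts.json URL with one q= parameter per filter.

    Hypothetical helper; it only mirrors the request shape seen in the logs.
    """
    query = urlencode([("q", f) for f in filters])
    return f"{base_url.rstrip('/')}/alerts.json?{query}"


# Assumed local karma address; adjust to your deployment.
url = karma_alerts_url("http://localhost:8080", ["@receiver=alertes_pf", "@state=active"])
print(url)
```

Each filter is percent-encoded separately, so "@" becomes %40 and "=" becomes %3D, as in the logged requests.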

@gaetanars
Author

The alerts are correctly routed to alertes_pf, because some of the time these same alerts are rendered correctly in Karma.
It looks like a flapping problem: sometimes everything is OK, sometimes the alerts disappear from Karma.

@prymitive
Owner

Can you share your alertmanager & karma config?

@gaetanars
Author

Alertmanager config

global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: alertes_pf
  group_by:
  - alertname
  - pf
  - severity
  routes:
  - receiver: alerta
    match_re:
      alertname: ^(.*)$
    continue: true
  - receiver: "null"
    match:
      alertname: DeadMansSwitch
  - receiver: cachethq-receiver
    match_re:
      component: ^(API_K8S|AlertManager|Prometheus)$
    continue: true
  - receiver: alertes_pf
    match_re:
      alertname: ^(.*)$
    continue: true
    repeat_interval: 1d
  - receiver: "null"
    match:
      transition: "no"
  - receiver: amdb
    group_by:
    - pf
    - period
    match:
      period: hno
      severity: critical
      support: "yes"
    continue: true
    repeat_interval: 5m
  - receiver: dspm_sc
    match:
      support: "yes"
    repeat_interval: 1d
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
inhibit_rules:
- source_match:
    severity: critical
  target_match:
    severity: warning
  equal:
  - alertname
  - instance
receivers:
- name: "null"
- name: alerta
  webhook_configs:
  [...]
- name: amdb
  webhook_configs:
  [...]
- name: cachethq-receiver
  webhook_configs:
  [...]
- name: alertes_pf
  slack_configs:
  [...]
- name: dspm_sc
  slack_configs:
  [...]
templates:
- /etc/alertmanager/templates/*.tmpl

Karma config:

alertmanager:
  interval: 60s
  servers:
    - name: sdr
      uri: http://alertmanager.prometheus:9093
      timeout: 10s
      proxy: true
    - name: xymon
      uri: http://alertmanager.xymon:9093
      timeout: 10s
      proxy: true
annotations:
  default:
    hidden: false
  hidden: []
  visible: []
custom:
  css: /themes/dark.css
  js: /etc/karma/custom.js
debug: false
filters:
  default:
    - "@state=active"
    - "support=yes"
grid:
  sorting:
    order: label
    reverse: false
    label: severity
    customValues:
      labels:
        severity:
          critical: 1
          warning: 2
          info: 3
labels:
  color:
    static:
      - cluster
      - instance
    custom:
      severity:
        - value: unknown
          color: "#8340a4"
        - value: info
          color: "#87c4e0"
        - value: warning
          color: "#fee047"
        - value: critical
          color: "#ff220c"
  keep: []
  strip: []
listen:
  address: "0.0.0.0"
  port: 8080
  prefix: /
log:
  config: false
  level: info
jira:
  - regex: (SSUP|SSC)-[0-9]+
    uri: JIRA_URL
receivers:
  keep: ["alertes_pf"]
  strip: []

@prymitive
Owner

Thanks, I'll try to see if I can reproduce this using that.
Can you see any errors or warnings in logs?

@gaetanars
Author

No warnings and no errors; every collection run by Karma looks good.

time="2019-07-09T09:01:46Z" level=info msg="[sdr] Got 139 alert group(s) in 89.703602ms"
time="2019-07-09T09:02:51Z" level=info msg="[sdr] Got 139 alert group(s) in 73.4441ms"
time="2019-07-09T09:03:46Z" level=info msg="[sdr] Got 139 alert group(s) in 74.817976ms"
time="2019-07-09T09:04:46Z" level=info msg="[sdr] Got 136 alert group(s) in 78.472535ms"
time="2019-07-09T09:05:46Z" level=info msg="[sdr] Got 136 alert group(s) in 93.536433ms"
time="2019-07-09T09:06:46Z" level=info msg="[sdr] Got 136 alert group(s) in 81.947709ms"
time="2019-07-09T09:07:46Z" level=info msg="[sdr] Got 136 alert group(s) in 71.114653ms"
time="2019-07-09T09:08:46Z" level=info msg="[sdr] Got 136 alert group(s) in 85.245316ms"
time="2019-07-09T09:09:46Z" level=info msg="[sdr] Got 136 alert group(s) in 120.289729ms"
time="2019-07-09T09:10:46Z" level=info msg="[sdr] Got 136 alert group(s) in 104.496622ms"
time="2019-07-09T09:11:46Z" level=info msg="[sdr] Got 138 alert group(s) in 75.564017ms"
time="2019-07-09T09:12:46Z" level=info msg="[sdr] Got 138 alert group(s) in 73.578361ms"
time="2019-07-09T09:13:46Z" level=info msg="[sdr] Got 136 alert group(s) in 90.662448ms"
time="2019-07-09T09:14:46Z" level=info msg="[sdr] Got 136 alert group(s) in 100.384113ms"
time="2019-07-09T09:15:46Z" level=info msg="[sdr] Got 136 alert group(s) in 80.619634ms"
time="2019-07-09T09:16:46Z" level=info msg="[sdr] Got 136 alert group(s) in 95.633335ms"
time="2019-07-09T09:17:46Z" level=info msg="[sdr] Got 136 alert group(s) in 74.227122ms"
time="2019-07-09T09:18:46Z" level=info msg="[sdr] Got 136 alert group(s) in 74.422998ms"
time="2019-07-09T09:19:46Z" level=info msg="[sdr] Got 136 alert group(s) in 89.802107ms"
time="2019-07-09T09:20:46Z" level=info msg="[sdr] Got 136 alert group(s) in 85.787169ms"

@prymitive
Owner

Could you try alertmanager v0.18.0?
https://github.com/prometheus/alertmanager/releases/tag/v0.18.0
The release notes mention "[BUGFIX] [api] Fix /api/v2/alerts with multiple receivers. #1948", and I wonder whether that makes a difference here, but I'm only guessing.

@gaetanars
Author

We already have 0.18.0

time="2019-07-09T07:06:46Z" level=info msg="Upstream version: 0.18.0"

@prymitive
Owner

Awesome, I'll get back here once I have some time for debugging, thanks.

@prymitive prymitive added the bug Something isn't working label Jul 9, 2019
@prymitive
Owner

prymitive commented Jul 10, 2019

I think I can see the same issue (or a similar one) with a local demo setup: some receivers seem to flap between all alerts and no alerts. That will make debugging easier.

Steps to reproduce:

  1. Open http://karma-demo.herokuapp.com/?q=%40receiver%3Dby-name
  2. Wait

Alerts will flap between lots and none, which shouldn't happen.
This is either a bug in the alertmanager 0.17+ support code or something deeper in karma.

@prymitive
Owner

The issue seems to be with alertmanager rather than karma, at least as far as receivers are concerned:

$ curl -s localhost:9093/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -c
      3   "name": "by-cluster"
     16   "name": "by-cluster-service"
     26   "name": "by-name"
$ curl -s localhost:9093/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -c
      3   "name": "by-cluster"
     16   "name": "by-cluster-service"
     26   "name": "default"
$ curl -s localhost:9093/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -c
      3   "name": "by-cluster"
     14   "name": "by-cluster-service"
     22   "name": "default"
$ curl -s localhost:9093/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -c
      3   "name": "by-cluster"
     16   "name": "by-cluster-service"
     26   "name": "default"
$ curl -s localhost:9093/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -c
      3   "name": "by-cluster"
     16   "name": "by-cluster-service"
     26   "name": "by-name"
$ curl -s localhost:9093/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -c
      3   "name": "by-cluster"
     16   "name": "by-cluster-service"
     26   "name": "default"
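
The same per-receiver tally can be computed from the parsed JSON, which is handy when comparing many responses. This is a minimal Python sketch that assumes each group object has a receiver object with a name field, as the jq output above suggests. The sample data is hand-written, not taken from a real Alertmanager.

```python
import json
from collections import Counter


def count_receivers(groups):
    """Count alert groups per receiver name, like `jq '.[].receiver' | sort | uniq -c`."""
    return Counter(group["receiver"]["name"] for group in groups)


# Hand-written sample shaped like the groups response (only the fields used here).
sample = json.loads("""
[
  {"labels": {"alertname": "A"}, "receiver": {"name": "by-cluster"}},
  {"labels": {"alertname": "B"}, "receiver": {"name": "by-name"}},
  {"labels": {"alertname": "C"}, "receiver": {"name": "by-name"}}
]
""")

for name, count in sorted(count_receivers(sample).items()):
    print(f"{count:7d}   {name}")
```

Running this against successive responses and diffing the counters would show the by-name/default swap seen in the shell output.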

@prymitive
Owner

Raised prometheus/alertmanager#1959

@gaetanars
Author

I don't see this on our Alertmanager: when Karma responds with a blank page, all receivers look fine in the Alertmanager API:

$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"
$ curl -sk https://alertmanager.domain/api/v2/alerts/groups | jq '.[].receiver' | grep name | sort | uniq -
  "name": "alertes_pf"
  "name": "null"

@prymitive
Owner

Based on your config I don't see how you could be affected, so there might be a separate problem here.
But you could try removing

receivers:
  keep: ["alertes_pf"]
  strip: []

from your config.

If it's the same issue I found in the demo setup the way it could affect you is:

  1. you have the same alerts going to different receivers
  2. those different receivers have identical group_by rules
  3. alertmanager incorrectly deduplicates these groups using group labels (that's why group_by matters), which means they are returned for only one of the receivers (which one is random, based on ordering in a slice, I think)

If you stop stripping receivers in karma you should see those alerts all the time, but they will move between receivers. You can filter alerts via @receiver=foo
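
The three conditions above can be modelled with a toy example. The Python sketch below is not Alertmanager's code; it only illustrates how keying deduplication on group labels alone makes the surviving receiver depend on input order, while keying on receiver plus labels keeps both groups. All function names and sample data are made up for illustration.

```python
def labels_key(group):
    """Hashable, order-independent key for a group's labels."""
    return tuple(sorted(group["labels"].items()))


def dedupe_by_labels(groups):
    """Buggy variant: key on group labels only, so receivers collide."""
    seen = {}
    for group in groups:
        seen.setdefault(labels_key(group), group)  # first group wins
    return list(seen.values())


def dedupe_by_receiver_and_labels(groups):
    """Key on (receiver, labels) so each receiver keeps its own group."""
    seen = {}
    for group in groups:
        seen.setdefault((group["receiver"], labels_key(group)), group)
    return list(seen.values())


# Same alert group routed to two receivers with identical group_by.
groups = [
    {"receiver": "by-name", "labels": {"alertname": "HostDown"}},
    {"receiver": "default", "labels": {"alertname": "HostDown"}},
]

print([g["receiver"] for g in dedupe_by_labels(groups)])
print([g["receiver"] for g in dedupe_by_labels(list(reversed(groups)))])
print([g["receiver"] for g in dedupe_by_receiver_and_labels(groups)])
```

In this model the first two calls each keep a single, different receiver depending on input order, which matches the flapping seen with a receiver filter in place.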

@gaetanars
Author

I tested stripping receivers instead of keeping them, and I hit the same issue.
Now I'm testing without any receivers strip or keep, and it seems to be fine.

I see that the same alert sometimes matches a different receiver (alertes_pf or null), but it never matches two receivers at the same time:

[Screenshot: 2019-07-16 10_37_37-karma]

[Screenshot: 2019-07-16 10_36_14-karma]
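
One way to confirm this kind of receiver switching without screenshots is to compare two consecutive responses from /api/v2/alerts/groups. The Python sketch below assumes each group carries labels and a receiver object with a name. The function names and the sample snapshots are hypothetical.

```python
def receivers_by_group(groups):
    """Map each group's label set to the set of receiver names reporting it."""
    result = {}
    for group in groups:
        key = tuple(sorted(group["labels"].items()))
        result.setdefault(key, set()).add(group["receiver"]["name"])
    return result


def flapping_groups(before, after):
    """Return label sets whose receiver set differs between two snapshots."""
    old, new = receivers_by_group(before), receivers_by_group(after)
    return {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }


# Hand-written snapshots; real ones would come from two API calls.
before = [
    {"labels": {"alertname": "HostDown"}, "receiver": {"name": "alertes_pf"}},
    {"labels": {"alertname": "DiskFull"}, "receiver": {"name": "alertes_pf"}},
]
after = [
    {"labels": {"alertname": "HostDown"}, "receiver": {"name": "null"}},
    {"labels": {"alertname": "DiskFull"}, "receiver": {"name": "alertes_pf"}},
]

for key, (old, new) in flapping_groups(before, after).items():
    print(dict(key), sorted(old), "->", sorted(new))
```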

@prymitive
Owner

That does sound like prometheus/alertmanager#1959. Workarounds you could use while we wait for an upstream fix:

  • ensure that every route in alertmanager config uses different group_by
  • downgrade alertmanager to 0.15.x
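
To check the first workaround against a config, the route tree can be walked to find receivers that end up with the same effective group_by. The Python sketch below relies on child routes inheriting group_by from their parent when they don't set it. The route dict is a trimmed, hand-copied version of the config posted earlier (matchers and timers omitted), and the function names are made up for illustration.

```python
from collections import defaultdict


def effective_group_by(route, inherited=()):
    """Yield (receiver, group_by) for a route and its children.

    Children without their own group_by inherit the parent's.
    """
    group_by = tuple(sorted(route.get("group_by", inherited)))
    yield route["receiver"], group_by
    for child in route.get("routes", []):
        yield from effective_group_by(child, group_by)


def receivers_sharing_group_by(route):
    """Map each group_by used by more than one receiver to those receivers."""
    by_grouping = defaultdict(set)
    for receiver, group_by in effective_group_by(route):
        by_grouping[group_by].add(receiver)
    return {g: sorted(r) for g, r in by_grouping.items() if len(r) > 1}


route = {
    "receiver": "alertes_pf",
    "group_by": ["alertname", "pf", "severity"],
    "routes": [
        {"receiver": "alerta"},
        {"receiver": "null"},
        {"receiver": "cachethq-receiver"},
        {"receiver": "alertes_pf"},
        {"receiver": "null"},
        {"receiver": "amdb", "group_by": ["pf", "period"]},
        {"receiver": "dspm_sc"},
    ],
}

for group_by, receivers in receivers_sharing_group_by(route).items():
    print(group_by, "->", receivers)
```

For this config only amdb has a grouping of its own; every other receiver shares the top-level group_by.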

@prymitive
Owner

I'm surprised it goes to null when the config is:

  - receiver: "null"
    match:
      alertname: DeadMansSwitch

@prymitive
Owner

A fix for prometheus/alertmanager#1959 was merged into the alertmanager master branch, so the next release will include it.

@prymitive
Owner

This should be fixed in alertmanager v0.19.0

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 2, 2020