Wrong receiver #1824

Closed
vvchistyakov opened this issue Jun 11, 2020 · 18 comments
Labels
wontfix This will not be worked on

Comments

@vvchistyakov

Hi. I've run into a strange situation.
I see this alert in Karma with an unexpected receiver:
[screenshot]
I expect to see team-kafka-wake-up there.

{
          "annotations": {
            "summary": "K2DWH Lag is growing. Info: group=k2dwh, count=8.172123e+06, location=fr"
          },
          "endsAt": "2020-06-11T09:54:41.353Z",
          "fingerprint": "52034abab298fe32",
          "receivers": [
            {
              "name": "team-kafka-wake-up"
            },
            {
              "name": "team-kafka-wake-up"
            }
          ],
          "startsAt": "2020-06-11T09:45:26.353Z",
          "status": {
            "inhibitedBy": [],
            "silencedBy": [],
            "state": "active"
          },
          "updatedAt": "2020-06-11T09:51:41.451Z",
          "generatorURL": "https://****",
          "labels": {
            "alertname": "k2dwh_lag",
            "consumer_group": "k2dwh",
            "kafka_location": "fr",
            "severity": "critical",
            "team": "Team_Kafka"
          }
        },
@prymitive
Owner

All data comes from the alertmanager API.
Do you see the expected receiver if you add an @receiver=team-kafka-wake-up filter in the UI?
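A minimal sketch of pinning that filter as a karma default, assuming the config uses the filters: default: list:

filters:
  default:
    - team="Team_Kafka"
    # special filters like @receiver use the same syntax as the UI filter bar
    - "@receiver=team-kafka-wake-up"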

@vvchistyakov
Author

Yes, I can see team-kafka-wake-up for a few other alerts.
[screenshot]

@prymitive
Owner

Each alert will appear separately under every receiver it goes to, as each receiver can have a unique grouping configuration. There's nothing unexpected here.
There are configuration options to ignore some receivers if that's needed - see the docs here.
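A minimal sketch of what those options could look like, using the receiver names from this issue:

receivers:
  # only show alerts routed to these receivers
  keep:
    - team-kafka-wake-up
  # or, alternatively, hide alerts routed to specific receivers
  strip:
    - opsgenie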

@vvchistyakov
Author

I see. However, when I tried stripping the opsgenie receiver, or keeping only team-kafka-wake-up, I didn't see any alerts in Karma at all.
At that moment alerts were being delivered via Slack and email through team-kafka-wake-up successfully.

@prymitive
Owner

I'm not sure I follow. You just pasted a team-kafka-wake-up alert screenshot, so clearly karma is getting that from the alertmanager API.

@vvchistyakov
Author

Ok, sorry for the poor description.
I have a lot of receivers in the Alertmanager configuration; the most interesting one for me is team-kafka-wake-up. We have already configured it for Slack and email. When I check the alerts I receive in Slack, I can't see them in Karma under this receiver. The alerts appear under other receivers, but not under team-kafka-wake-up. Only 1-2 alerts appear with the correct receiver.

@prymitive
Owner

Can you see those alerts with the team-kafka-wake-up receiver in the alertmanager UI but not in karma?
You can click on Receiver: All in the alertmanager UI and select a receiver from the dropdown.

@vvchistyakov
Author

vvchistyakov commented Jun 11, 2020

Do you mean this?
[screenshot]

@prymitive
Owner

Yes, the same alert with the same receiver should show up in karma. Is there any mismatch between what the alertmanager UI shows and what the karma UI shows for that receiver?

@vvchistyakov
Author

vvchistyakov commented Jun 11, 2020

Yes. I see this alert only under the opsgenie receiver in Karma =( It's the same for other alerts: Karma doesn't show them under the team-kafka-wake-up receiver.

@prymitive
Owner

Can you share the karma configuration and the alertmanager routing tree? Any errors in the karma logs? Which alertmanager version are you using?

@prymitive
Owner

FYI, there were some bugs fixed in alertmanager 0.19 in case you are running an older version (#812).

@vvchistyakov
Author

Karma v0.63:

---
alertmanager:
  servers:
    - name: fr5
      uri: http://server-1:9093
      timeout: 5m
      proxy: true
      cors:
        credentials: omit
    - name: fr4
      uri: http://server-2:9093
      timeout: 5m
      proxy: true
      cors:
        credentials: omit
alertAcknowledgement:
  enabled: true
  duration: 15m0s
  author: Kafka Karma
  commentPrefix: working on it
filters:
  default:
    - team="Team_Kafka"
karma:
  name: Kafka Karma
annotations:
  default:
    hidden: false
  order:
    - alertname
    - location
grid:
  sorting:
    order: startsAt
    label: locationId
receivers:
  keep:
#    - team-kafka-wake-up
  strip:
#    - opsgenie
labels:
  color:
    static:
      - alertname
    unique:
      - "@receiver"
      - instance
ui:
  alertsPerGroup: 10
  collapseGroups: collapsed
  multiGridLabel: location
log:
  config: true
  level: debug
  format: text
  timestamp: true

Alertmanager v0.17:

route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: blackhole-slack
  routes:
  - group_by: ["..."]
    match_re:
      alertname: ".+"
    match:
      alerttype: "platform_host"
    receiver: opsgenie-platform-host
  - group_by: ["..."]
    match_re:
      alertname: ".+"
    match:
      alerttype: "platform_service"
    receiver: opsgenie-platform-service
  - group_by: ["..."]
    match_re:
      alertname: ".+"
    receiver: opsgenie
    continue: true
  - match:
      team: Team_1
    group_by: ['service', 'location']
    routes:
    - match:
        service_receiver: "true"
      receiver: service-email
      continue: false
    - match:
        severity: critical
      receiver: platform-monitoring
      repeat_interval: 4h
    - match:
        severity: warning
      receiver: platform-notifs
  - match:
      team: Team_2
    receiver: team-2
    group_by: ['service', 'location']
  - match:
      team: Team_3
    receiver: team-3
    group_by: ['job', 'location']
  - match:
      team: Team_4
    receiver: team-4
    group_by: ['job', 'location']
  - match:
      team: Team_5
    receiver: team-5
    group_by: ['job', 'location']
  - match:
      team: Team_Kafka
    receiver: team-kafka
    group_by: ['job', 'location']
    routes:
    - match:
        location: io
      receiver: team-kafka-io
    - match:
        severity: page
      receiver: team-kafka-wake-up
    - match:
        severity: critical
      receiver: team-kafka-wake-up
  - match:
      team: Team_6
      location: io
    receiver: team-6
    group_by: ['service', 'location']
  - match:
      team: Team_7
    receiver: team-7
    group_by: ['service', 'location']
  - match:
      team: Team_8
    receiver: team_8
    group_by: ['service', 'location']
  - match:
      team: Team_9
    receiver: team_9
    group_by: ['service', 'location']
  - match:
      team: Team_10
    receiver: team-10
    group_by: ['service', 'location']
  - match:
      team: Team_11
    receiver: team-11
    group_by: ['service', 'location']
  - match:
      team: Team_12
    receiver: team-12
    group_by: ['job']
    routes:
    - match:
        alert_receiver: internal
      receiver: team-12
    - match:
        alert_receiver: team-12-ss7
      receiver: team-12-ss7
      continue: true
    - receiver: team-12-1
      continue: true
    - receiver: core-12-2
  - match:
      team: Team_13
    receiver: team-13
    group_by: ['job', 'location']
  - match:
      team: Team_14
      type: warn
      location: io
    receiver: team-14-warning
    group_by: ['job', 'location']
  - match:
      team: Team_14
      type: suspend
      location: io
    receiver: team-14-suspend
    group_by: ['job', 'location']
  - match:
      team: Team_14
      type: warn
      location: fr
    receiver: team-14-fr-warning
    group_by: ['job', 'location']
  - match:
      team: Team_14
      type: suspend
      location: fr
    receiver: team-14-fr-suspend
    group_by: ['job', 'location']
  - match:
      team: Team_14
    receiver: team-14
    group_by: ['job', 'location']
  - match:
      team: Team_15
    group_wait: 1s
    receiver: team-15
    repeat_interval: 30m
    group_by: ['job', 'location']
  - match:
      team: Team_16
    receiver: team-16
    group_by: ['service', 'location']
  - match:
      team: Team_17
    receiver: team-17
    repeat_interval: 30m
    group_by: ['job', 'location']
  - match:
      team: Team_18
    receiver: team-18
    repeat_interval: 30m
    group_by: ['service', 'location']
  - match:
      team: Team_19
    receiver: team-19
    repeat_interval: 30m
    group_by: ['service', 'location']
  - match:
      team: Team_20
    receiver: team-20
    group_by: ['job', 'location']
  - match:
      team: Team_21
      severity: critical
    receiver: team-21-mail
    group_by: ['service', 'location']
    repeat_interval: 600m
  - match:
      team: Team_22
    receiver: team-22
    repeat_interval: 30m
    group_by: ['service', 'location']
  - match:
      team: centili-support
    receiver: centili-support
    repeat_interval: 30m
    group_by: ['job']
  - match:
      team: Team_23
    receiver: team-23
    repeat_interval: 30m
    group_by: ['service', 'location']
  - match:
      alertname: prometheus_dead_mans_switch
    receiver: prometheus-heartbeat
    group_wait: 30s
    group_interval: 30s
    repeat_interval: 30s
    group_by: ["alertname"]
  - match:
      team: Team_24
    receiver: team-24
    repeat_interval: 30m
    group_by: ['service', 'location']
  - match:
      team: Team_25
    receiver: team-25-mail
    group_by: ['location']
  - match:
      team: Team_26
    receiver: team-26
    group_by: ['location']
  - match:
      team: Team_27
    receiver: team-27
    repeat_interval: 60m
    group_by: ['alertname', 'instance', 'service', 'location']
  - match:
      team: Team_28
    receiver: team-28
    group_by: ['service', 'location']
  - match:
      team: Team_29
    receiver: Team-29
    repeat_interval: 60m
    group_by: ['alertname', 'instance', 'service', 'location']
    routes:
    - match_re:
        severity: warning|critical
      receiver: core-performance-voice
      continue: true
  - match:
      team: Team_30
    receiver: team-30
    repeat_interval: 8760h
    group_by: ['location']
  - match:
      team: Team_31
    receiver: team-31
    repeat_interval: 60m
    group_by: ['location']
  - match:
      team: Team_32
    receiver: team-32
    repeat_interval: 8737h
    group_by: ['alertname','service', 'location', 'job', 'number']
  - match:
      team: Team_33
    receiver: team-33
    group_by: ['service', 'location']
    repeat_interval: 60m
  - match:
      team: Team_34
    receiver: team-34
    group_by: ['service', 'location']
  - match:
      team: Team_35
    receiver: team-35
    group_by: ['service', 'location']
  - match:
      team: Team_36
    receiver: team-36
    repeat_interval: 30m
    group_by: ['service', 'location']
  - match:
      team: Team_37
    receiver: team-37
    group_by: ['instance', 'service', 'location']
  - match:
      team: Team_38
    receiver: team-38
    group_by: ['service', 'client', 'location']
  - match:
      team: Team_39
    receiver: team-39
    group_by: ['location']

I don't see any errors in the log:

time="2020-06-11T13:30:16Z" level=info msg="[fr4] Upstream version: 0.17.0"
time="2020-06-11T13:30:16Z" level=warning msg="Alertmanager 0.17.0 might return incomplete list of alert groups in the API, please upgrade to >=0.19.0, see https://github.com/prymitive/karma/issues/812"
time="2020-06-11T13:30:16Z" level=info msg="[fr5] Got 66 silences(s) in 6.847856ms"
time="2020-06-11T13:30:16Z" level=info msg="[fr5] Detecting ticket links in silences (66)"
time="2020-06-11T13:30:16Z" level=info msg="[fr4] Got 66 silences(s) in 8.033665ms"
time="2020-06-11T13:30:16Z" level=info msg="[fr4] Detecting ticket links in silences (66)"
time="2020-06-11T13:30:16Z" level=info msg="[fr5] Got 1154 alert group(s) in 280.569968ms"
time="2020-06-11T13:30:16Z" level=info msg="[fr5] Deduplicating alert groups (1154)"
time="2020-06-11T13:30:16Z" level=info msg="[fr5] Processing unique alert groups (1139)"
time="2020-06-11T13:30:16Z" level=info msg="[fr5] Merging autocomplete data (4538)"
time="2020-06-11T13:30:17Z" level=info msg="[fr4] Got 1066 alert group(s) in 885.071458ms"
time="2020-06-11T13:30:17Z" level=info msg="[fr4] Deduplicating alert groups (1066)"
time="2020-06-11T13:30:17Z" level=info msg="[fr4] Processing unique alert groups (1052)"
time="2020-06-11T13:30:17Z" level=info msg="[fr4] Merging autocomplete data (4406)"
time="2020-06-11T13:30:17Z" level=info msg="Pull completed"
time="2020-06-11T13:30:17Z" level=info msg="Done, starting HTTP server"
time="2020-06-11T13:30:17Z" level=info msg="Listening on 0.0.0.0:80"
time="2020-06-11T13:30:22Z" level=debug msg="Compressed 9007 bytes to 2405 bytes (26.70%)"
time="2020-06-11T13:30:22Z" level=info msg="[10.0.0.1 MIS] <200> GET /alerts.json?&gridLabel=location&gridSortReverse=0&sortOrder=&sortLabel=&sortReverse=&q=%40receiver%3Dteam-kafka-wake-up took 26.304313ms"
time="2020-06-11T13:30:25Z" level=debug msg="Compressed 2185257 bytes to 126630 bytes (5.79%)"
time="2020-06-11T13:30:25Z" level=info msg="[10.0.0.1 MIS] <200> GET /alerts.json?&gridLabel=location&gridSortReverse=0&sortOrder=&sortLabel=&sortReverse=& took 405.209777ms"
time="2020-06-11T13:30:35Z" level=debug msg="Compressed 30889 bytes to 3491 bytes (11.30%)"
time="2020-06-11T13:30:35Z" level=info msg="[10.0.0.1 MIS] <200> GET /alerts.json?&gridLabel=location&gridSortReverse=0&sortOrder=&sortLabel=&sortReverse=&q=team%3DTeam_Kafka took 26.481914ms"
time="2020-06-11T13:31:17Z" level=info msg="Pulling latest alerts and silences from Alertmanager"
time="2020-06-11T13:31:17Z" level=info msg="[fr5] Collecting alerts and silences"
time="2020-06-11T13:31:17Z" level=info msg="GET http://server-1:9093/metrics timeout=5m0s"
time="2020-06-11T13:31:17Z" level=info msg="[fr4] Collecting alerts and silences"
time="2020-06-11T13:31:17Z" level=info msg="GET http://server-2:9093/metrics timeout=5m0s"
time="2020-06-11T13:31:17Z" level=info msg="[fr4] Upstream version: 0.17.0"
time="2020-06-11T13:31:17Z" level=warning msg="Alertmanager 0.17.0 might return incomplete list of alert groups in the API, please upgrade to >=0.19.0, see https://github.com/prymitive/karma/issues/812"
time="2020-06-11T13:31:17Z" level=info msg="[fr5] Upstream version: 0.17.0"
time="2020-06-11T13:31:17Z" level=warning msg="Alertmanager 0.17.0 might return incomplete list of alert groups in the API, please upgrade to >=0.19.0, see https://github.com/prymitive/karma/issues/812"
time="2020-06-11T13:31:17Z" level=info msg="[fr4] Got 66 silences(s) in 8.54807ms"
time="2020-06-11T13:31:17Z" level=info msg="[fr4] Detecting ticket links in silences (66)"
time="2020-06-11T13:31:17Z" level=info msg="[fr5] Got 66 silences(s) in 9.957681ms"
time="2020-06-11T13:31:17Z" level=info msg="[fr5] Detecting ticket links in silences (66)"
time="2020-06-11T13:31:17Z" level=info msg="[fr5] Got 1153 alert group(s) in 240.502656ms"
time="2020-06-11T13:31:17Z" level=info msg="[fr5] Deduplicating alert groups (1153)"
time="2020-06-11T13:31:17Z" level=info msg="[fr5] Processing unique alert groups (1138)"
time="2020-06-11T13:31:17Z" level=info msg="[fr5] Merging autocomplete data (4546)"
time="2020-06-11T13:31:18Z" level=info msg="[fr4] Got 1065 alert group(s) in 810.822695ms"
time="2020-06-11T13:31:18Z" level=info msg="[fr4] Deduplicating alert groups (1065)"
time="2020-06-11T13:31:18Z" level=info msg="[fr4] Processing unique alert groups (1051)"
time="2020-06-11T13:31:18Z" level=info msg="[fr4] Merging autocomplete data (4418)"
time="2020-06-11T13:31:18Z" level=info msg="Pull completed"
time="2020-06-11T13:32:17Z" level=info msg="Pulling latest alerts and silences from Alertmanager"
time="2020-06-11T13:32:17Z" level=info msg="[fr5] Collecting alerts and silences"
time="2020-06-11T13:32:17Z" level=info msg="GET http://server-1:9093/metrics timeout=5m0s"
time="2020-06-11T13:32:17Z" level=info msg="[fr4] Collecting alerts and silences"
time="2020-06-11T13:32:17Z" level=info msg="GET http://server-2:9093/metrics timeout=5m0s"
time="2020-06-11T13:32:17Z" level=info msg="[fr4] Upstream version: 0.17.0"
time="2020-06-11T13:32:17Z" level=warning msg="Alertmanager 0.17.0 might return incomplete list of alert groups in the API, please upgrade to >=0.19.0, see https://github.com/prymitive/karma/issues/812"
time="2020-06-11T13:32:17Z" level=info msg="[fr5] Upstream version: 0.17.0"
time="2020-06-11T13:32:17Z" level=warning msg="Alertmanager 0.17.0 might return incomplete list of alert groups in the API, please upgrade to >=0.19.0, see https://github.com/prymitive/karma/issues/812"

@vvchistyakov
Author

Are you saying that my receiver issue depends on the Alertmanager version?

@prymitive
Owner

prymitive commented Jun 11, 2020

time="2020-06-11T13:32:17Z" level=warning msg="Alertmanager 0.17.0 might return incomplete list of alert groups in the API, please upgrade to >=0.19.0, see https://github.com/prymitive/karma/issues/812"

There was a bug in Alertmanager <0.19.0 where the API wouldn't return all receivers, see prometheus/alertmanager#1959.
You might be hitting that issue; try upgrading to 0.20.0.

@vvchistyakov
Author

I see. Since I don't have permission to update Alertmanager in our company, I have to wait =(
I'll let you know once the Alertmanager version is updated.

@stale

stale bot commented Aug 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Aug 15, 2020
@stale stale bot closed this as completed Aug 22, 2020
@github-actions

github-actions bot commented Dec 1, 2020

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 1, 2020