duplicate repeated notifications #1005
Comments
Thanks for working on this! Without having looked at any code to check, your initial hypothesis for repeated notifications not waiting (position*5sec) does sound like it could be the culprit. What is your priority currently between further refining the tooling and trying to solve this (and probably other) bugs? Are you focused on finishing the tooling and then addressing the bugs, or looking at the bugs as they're discovered?
Regarding the tooling, I think it'll go hand in hand with the tests we are developing and the bugs we find. The tooling can be shared as is; I just feel that at this point it's more confusing than helpful, given the rate of change happening to it right now (I literally started the first lines of code a few days ago). I should be able to alternate between developing the tooling and fixing the bugs discovered with it, as in order to find the more complex bugs the tooling probably needs to get better.
@brancz I think this is related to my last comment on issue #964. Prometheus will call Alertmanager for the same issue a couple of times. Suppose that one call was made, the AMs sync it and send only one notification, but Prometheus keeps calling the AMs a few more times, and as you configured the […] Does that make any sense?
The same thing happens when the […]. I should soon be able to dig into this issue further.
@brancz A single Alertmanager without high availability also has the same problem: with repeat_interval: 2m the resolved message is sent 6~7 times, with repeat_interval: 7m the resolved message is sent ~3 times.
@brancz the "duplicate repeated notifications" should because the prometheus send notification to alertmanager continuously for 15 mins. So if the notification send interval less than 15mins, it will send duplicate repeated? and the send interval time is group_interval + repeat_interval? And it's hardcode. It should be better make it configurable. or change the 15mins to 5mins or 10 mins, 15mins is too long in production. |
I also have the same problem with HA! I have 3 receivers: Jira over webhook, Slack and Pushover. The Alertmanager config for both instances is:
Pushover and Slack work fine and don't send duplicates, but the problem is with the webhook. AM1:
AM2:
EDIT: I tried disabling HA and configuring only one Prometheus and one Alertmanager, but the problem is the same!
I've played a bit with ambench and indeed I see the same issue using the master branch. Digging further into the code (and with the help of additional traces), I can explain what happens. Assume 3 Alertmanager instances where am1, am2 and am3 have respectively positions 0, 1 and 2 in the cluster. I've set the […]. The AM configuration is:

```yaml
global:
  resolve_timeout: 60m
route:
  group_by: ['__name__']
  group_wait: 5s
  group_interval: 20s
  repeat_interval: 40s
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:8080/notify'
    send_resolved: false
```
At t=40s, am1 flushes the notification and sends it to the receiver because repeat_interval is over. Almost at the same time, am3 exits from its 10s WaitStage and races with am1 to send the notification too. The same timeline, but with a drawing: the bold vertical bars are the flush calls and the green boxes represent the WaitStage intervals. I've got a couple of ideas to reduce the likelihood of this scenario, but they probably won't cover all the cases.
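To make the collision concrete, here is a minimal sketch (not Alertmanager's actual pipeline code) of the position-based wait described above. It assumes a 5s per-position peer timeout (which is what am3's 10s wait at position 2 implies) and takes am3's flush to have happened around t=30s, since a 10s wait ending near t=40s suggests that.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed peer timeout: position 2 waiting 10s implies 5s per position.
const peerTimeout = 5 * time.Second

type peer struct {
	name     string
	position int // position in the cluster
}

// sendTime is when the peer would actually hit the receiver: its own flush
// time plus the position-based delay of the wait stage.
func (p peer) sendTime(flushAt time.Duration) time.Duration {
	return flushAt + time.Duration(p.position)*peerTimeout
}

func main() {
	am1 := peer{name: "am1", position: 0}
	am3 := peer{name: "am3", position: 2}

	// am1 flushes at t=40s (repeat_interval elapsed); am3 flushed around t=30s.
	t1 := am1.sendTime(40 * time.Second)
	t3 := am3.sendTime(30 * time.Second)
	fmt.Printf("%s sends at t=%s, %s sends at t=%s\n", am1.name, t1, am3.name, t3)

	// Both attempts land at t=40s. Neither instance has yet received the
	// other's notification-log entry via gossip, so both de-duplication
	// checks pass and the webhook receives the notification twice.
}
```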
I had the same problem, but it occurs only when the configuration file is reloaded while the alert is being sent.
While developing some tooling to load test Alertmanager HA clusters, I believe I have found a problem regarding repeated notifications.
The tooling I've built captures the alerts it fires and the resulting notifications as events.
Alert-fired events start the line with `ALERTS`, followed by the timestamp of the event and the Alertmanager instance the alerts were fired against, and finish with a list of alerts represented by their hash. Received notifications start the line with `NOTIFICATION`, followed by the timestamp, the Alertmanager that sent the notification, the group key, a hash of all alerts in the notification, and finally a list of all alerts that are part of the notification. The Alertmanager configuration used is:
The "alerts" I'm sending are just labelsets that I've parsed out of an anonymized set of metrics of a Prometheus metric output, primarily because the dataset existed and is large, but that is why I chose to group by
__name__
.The test that I ran looks like this: a single worker sending a single alert every second and every 5 seconds switches to the next alert, this is run for 1 minute + 10 seconds to capture remaining notifications generated from the last alerts fired.
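For illustration, a rough sketch of that load pattern (this is not the actual ambench code; the instance address and alert names are made up, and it assumes Alertmanager's v1 push endpoint /api/v1/alerts):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// fire posts a single alert with the given name to one Alertmanager
// instance via the v1 alerts endpoint.
func fire(amURL, alertname string) error {
	payload := []map[string]interface{}{
		{"labels": map[string]string{"__name__": alertname, "alertname": alertname}},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	resp, err := http.Post(amURL+"/api/v1/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	const amURL = "http://localhost:9093" // hypothetical instance address

	// Fire one alert per second for a minute, switching to the next
	// alert every 5 seconds.
	for tick := 0; tick < 60; tick++ {
		alertname := fmt.Sprintf("test_alert_%d", tick/5)
		if err := fire(amURL, alertname); err != nil {
			log.Println("fire failed:", err)
		}
		time.Sleep(time.Second)
	}

	// Keep running 10 more seconds to capture the remaining notifications
	// generated from the last alerts fired.
	time.Sleep(10 * time.Second)
}
```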
The result was the following (in order to be able to reference line numbers I've also created a gist):
We can see that the first notifications go out only once, as expected; however, when those notifications get repeated, for example on lines 67 to 69 of the gist, they are sent simultaneously by two different Alertmanagers. This happens multiple times throughout the test; sometimes even all three Alertmanager instances send the notification.
I have yet to validate my suspicion in the code, but I believe this is happening because, while for the initial "group wait" the Alertmanager instances wait an additional (member position in mesh) * 5 seconds, they don't have such behavior for repeated notifications. They race over who manages to send, gossip and de-duplicate first, which results in the notifications sometimes being de-duplicated and sometimes not.
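In code form, the hypothesis is roughly the following. This is a sketch of the suspected behavior, not of the actual code paths; the 5-second peer timeout is the value mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

const peerTimeout = 5 * time.Second

// initialWait is the stagger applied before the first notification:
// each instance waits according to its position in the mesh, so lower
// positions get to send (and gossip the result) first.
func initialWait(meshPosition int) time.Duration {
	return time.Duration(meshPosition) * peerTimeout
}

// repeatWait models the suspected lack of stagger once repeat_interval
// has elapsed: every instance is free to send immediately, so they race.
func repeatWait(_ int) time.Duration {
	return 0
}

func main() {
	for pos := 0; pos < 3; pos++ {
		fmt.Printf("position %d: initial wait %s, repeat wait %s\n",
			pos, initialWait(pos), repeatWait(pos))
	}
}
```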
Let me know what you think and whether anything I said doesn't make any sense 🙂.
I'll be sharing the tooling I'm building at some point; it's very early stage and constantly changing right now, but I'm hoping it can become a useful tool to simulate scenarios for HA clusters.
@stuartnelson3 @fabxc (feel free to tag anyone who might be interested as well)