Repeated "resolved" notifications firing #844
Comments
A note on the configuration in general: in a sane setup the repeat interval should be quite long, on the order of 3 to 48 hours. You will be notified at the group interval whenever anything changes about the state of the alerts; the repeat interval only applies to unchanged state, and anything in the minute range is typically far too noisy for that. Can you provide your Prometheus configuration as well, in particular the rule evaluation interval and the alerting rule in question?
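For illustration only, a route block with intervals in the range described above might look like the following; the values are examples and are not taken from any configuration in this thread.

```yaml
# alertmanager.yml (sketch with illustrative values)
route:
  receiver: default
  group_wait: 30s        # delay before the first notification for a new alert group
  group_interval: 5m     # how soon to notify again when the group's state changes
  repeat_interval: 4h    # how often to re-send notifications for unchanged, still-firing alerts
```

The point being made above is that repeat_interval only covers the unchanged-state case, so values in the minute range mostly produce noise.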
You are sending from two Prometheus servers, right? Do they have different external labels?
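For background, external labels are set in each Prometheus server's own configuration and are what distinguishes alerts from otherwise identical servers once they reach Alertmanager. A minimal sketch, with a placeholder label value:

```yaml
# prometheus.yml on each server (placeholder value; every server would set its own)
global:
  external_labels:
    replica: prometheus-a
```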
@fabxc - Sorry, I wasn't very clear on a couple of those things. The Alertmanager config above was just so that I could replicate the problem without waiting and waiting for hours. The normal config I run with is:
In reference to two servers, I was highlighting that I have two completely independent, separate Prometheus/Alertmanager configurations that end up triggering the same scenario. There's no transmission of metrics between them or sharing of an Alertmanager or anything like that. The common factor between them is that they're using the same exporter and alert rules. I'm not sure how it's relevant, but the exporter in question that triggers the scenario is https://github.com/Braedon/prometheus-es-exporter (version 0.3.0). I'm running with Prometheus configs of:
The target is scraped via a standard file_sd_config target that just configures a few labels. The alert rule is like so:
I hope that's useful information.
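The actual scrape config and rule are not reproduced in this excerpt; purely as an illustration of the kind of setup described above, a file_sd_config job whose target file attaches a few labels might look like this, with all names, paths, and addresses hypothetical:

```yaml
# prometheus.yml (sketch; hypothetical job name and file path)
scrape_configs:
  - job_name: es-exporter
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/es-exporter.yml

# /etc/prometheus/targets/es-exporter.yml (sketch; these labels are attached to every listed target)
- targets: ['es-exporter.example.com:9999']   # hypothetical address and port
  labels:
    environment: production
    cluster: es-main
```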
So you have two Prometheus servers, but they are not sending to the same Alertmanager cluster? I'm really having trouble reproducing this in any way. The logs you've shown above seem to be truncated at the front. Can you provide the full thing?
I'll do some significant digging into establishing whether there are two sources of the alerts, but I can't see how. If not, I'll reproduce and include all of the config and logs.
@damomurf Thanks for your help! I will move this issue to milestone v0.8. As we are having difficulties reproducing this bug, and as it is not a mission-critical issue, we will release v0.7 without it for now. Don't get me wrong: we are still very much interested in fixing this issue.
I am also experiencing repeated "resolved" notifications with version=0.5.1, branch=master, revision=0ea1cac51e6a620ec09d053f0484b97932b5c902. Anonymized configuration snippets follow. Alertmanager configuration:
Prometheus alerts configuration:
Prometheus monitoring configuration:
I do not have high-detail logs currently, but will try to get some. Nothing in the "normal" logs.
@sandersaares Are you able to test out v0.7.1 in your environment and try to reproduce the issue?
Yes, doing that right now. Is there anything special I should do to get the most debug info out of it? (beyond just log.level)
Meanwhile, one thing I noticed was that the set of "job" labels was different in each of the repeated "resolved" notifications. Here are the ones it sent (together with all the firing notifications). You can see some sort of pattern here, but I can't quite understand what it means.
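For context on why the set of job values can differ between notifications: Alertmanager groups alerts into a notification by the labels listed in the route's group_by, so any label that is not part of the grouping key can vary freely within a single notification. A minimal sketch, assuming a grouping that does not include job:

```yaml
# alertmanager.yml (sketch; hypothetical grouping)
route:
  group_by: ['alertname']   # with 'job' omitted here, one notification can
                            # aggregate alerts coming from several different jobs
```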
I am unable to reproduce this with 0.7.1. However, the behavior was also not consistently problematic with 0.5.1: I first saw it yesterday and have been using the software for quite some time now. I will monitor the behavior and report back if I have any more info to add; perhaps there are some hidden factors that affect the situation.
@sandersaares Thanks for reporting. Let us know if you are facing any further issues! @damomurf Any news from your side? Are you still facing these issues with v0.7.1?
@mxinden I'll deploy v0.7.1 first thing in my morning and definitely report back.
I'll need to leave it for at least 24 hours to get a read on some regular (ugh, yes - regular) alerts that were triggering the issue. However, initial indications are that v0.7.1 looks better.
After running this for several days under v0.7.1, it definitely seems like the repeated resolved-notification situation has been largely solved. I recall seeing one instance of a repeat, but I believe there could have been additional reasons for that. v0.7.1 shows "Error: Timeout exceeded" on the Silences page in the UI for me, but I'll raise that as a separate issue.
@damomurf Thanks for testing this out on v0.7.1! I will close this for now, as it seems your issue has been resolved. Please feel free to reopen if it occurs again.
Hi all. I tried this on v0.8.0 and also v0.7.1, testing with a simple config
and alert.rules
@dIoMeDs According to @damomurf, this has been fixed since v0.7.1. Please open a new issue with more details on how we can reproduce your issue, preferably with the help of the webhook and the send-alerts script.
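For anyone reproducing this, pointing a route at a webhook receiver is a convenient way to capture exactly what Alertmanager sends; a minimal sketch, with a placeholder URL for whatever local HTTP listener is used to log the request bodies:

```yaml
# alertmanager.yml (sketch for debugging only)
route:
  receiver: debug-webhook
  repeat_interval: 1m            # deliberately short, only to trigger the behaviour quickly

receivers:
  - name: debug-webhook
    webhook_configs:
      - url: http://127.0.0.1:5001/   # placeholder address for a local listener
```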
I'm seeing repeated "resolved" notifications firing into notification endpoints from the same alert rule, with two different Prometheus installations. The alerts seem to repeat 5 times after the initial notification and then stop. It seems to affect every version of Alertmanager right back to 0.5.0; the latest version that doesn't exhibit the problem is 0.4.2.
I'm assuming this relates to #815.
I've also tried with an Alertmanager built from master and observe the same behaviour. The offending alert, once resolved, seems to notify every repeat_interval. The example config below was an attempt to replicate it with much smaller timeframes.
Once the alert really resolves in Prometheus, I see nothing in either the Prometheus or Alertmanager UIs.
The following is the Alertmanager log in debug mode for the repeating notifications.
Route and Receivers config: