Time of day based alert routing/notification #876

Closed
brian-brazil opened this issue Jun 20, 2017 · 98 comments

Comments

@brian-brazil
Contributor

We've had numerous requests for routing alerts based on the time of day/week. This issue is to track those.

@tyrken

tyrken commented Feb 7, 2018

So, @brian-brazil (@fabxc ?), could you provide any design guidance on how to implement said feature, as I think I'd like to suppress all 'severity: warning' alerts overnight rather than putting loads of time-based rule duplication into my Prometheus alert rules. I really don't want to do that as warnings are still valid & worth warning about if I go looking for current Alerts active - I just don't want to be woken up for them. They also shouldn't seem to resolve every evening & start again in the morning.

After being burnt wasting effort on #709 I want some suggestion up front of what might be accepted from the maintainers.

@ghost

ghost commented Feb 20, 2018

Indeed, it would be nice to see whether it is within the scope of AM. It'd be great to have this feature.

@ghost

ghost commented Mar 31, 2018

I would like the time of day or day of week to influence which receiver an alert is sent to. For example, during daytime/business hours, alerts might go to a Slack channel, whereas at night or on weekends the same alerts might go to PagerDuty, or by e-mail to the current on-call person.

From the Alertmanager perspective, it could be nice to use existing label matching to control routing to different receivers based on datetime.

For example:

- match_re:
    alertname: MyAlert
    day: '(Monday|Tuesday|Wednesday|Thursday|Friday)'
  receiver: slack_appteam
- match_re:
    alertname: MyAlert
    day: '(Saturday|Sunday)'
  receiver: pagerduty_appteam

Something similar for time of day? It's a little trickier, and in routing it would be nice to keep things simple, e.g. match_re on a label like 'time_window: business_hours'. But I'm not sure how to get that meaningful label in there from the Alertmanager perspective without some sort of relabeling within Alertmanager itself, and Prometheus passing along an alert date/time. I'm a Prometheus/Alertmanager newbie, so apologies in advance if I'm missing something obvious here.

An approach that generates meaningful date/time labels on the alerts means those labels could also be used in inhibition, as per @tyrken's request to inhibit warnings for some or all alerts overnight.

One of my main drivers for this is to not introduce time based rule duplication in all my prometheus alerts, as that feels cumbersome.

@jmangs

jmangs commented Apr 5, 2018

The ability to subdue events based on the day and/or time would be really useful. I don't know what the best approach is here, but for Sensu they follow a re-usable pattern tied to the handler (receiver): subdue-attributes

@etiennechabert

+1: Just here to say it would be a lovely feature to be able to sleep well during the night and receive some alerts only during business hours/days.

@stuartnelson3
Contributor

Adding https://golang.org/pkg/time/#Weekday for future reference. This could be implemented as a pipeline step that filters based on a defined day/time range. All times would be done in UTC.

@danielmotaleite

This is an important missing feature! Either enable alerts during some time ranges, or allow recurring silence rules.

Is there any workaround to silence staging/test/QA alerts during the night, but still receive them during the day?

@simonpasquier
Member

@danielmotaleite I have a solution for this based on inhibition rules that doesn't require any change to AlertManager. I'll post the blog post address here once it is out.

@fessmage

fessmage commented Oct 3, 2018

Alertmanager definitely needs a way to set silence hours in the config file, with label targeting like in inhibit_rules.

@charles-mims-sp

@simonpasquier - waiting for that blog link!

@blanet

blanet commented Nov 9, 2018

@simonpasquier I am thirsty for this. 👍

@luphaz

luphaz commented Nov 12, 2018

Hello
Here is a starting point that we have used to silence alerts outside of office hours:

vector(1)
and on()
(
      6 < hour(vector(time()))
  and hour(vector(time())) < 19
  and 0 < day_of_week(vector(time()))
  and day_of_week(vector(time())) < 6
)

Of course it's GMT-based, so it takes into account neither summer/winter time nor bank holidays.

Hope it helps others.

You should just have to replace vector(1) with the prometheus expression/aggregation you need
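
As an illustration, the expression above could be attached to an alerting rule roughly as follows. This is a sketch rather than something posted in the thread: the file name, group name, alert name, threshold, and `for` duration are hypothetical, and `node_load5` stands in for whatever expression is being guarded. `hour()` and `day_of_week()` called without an argument evaluate at the current time in UTC, which is equivalent to the longer `hour(vector(time()))` form.

```yaml
# rules.yml (hypothetical file name): fire only 07:00-18:59 UTC, Monday to Friday.
groups:
  - name: office_hours_example        # hypothetical group name
    rules:
      - alert: HighLoadOfficeHours    # hypothetical alert name
        # The first line of expr is the real condition; the `and on()`
        # clauses drop every result outside the time window.
        expr: |
          node_load5 > 8
          and on() (6 < hour() and hour() < 19)
          and on() (0 < day_of_week() and day_of_week() < 6)
        for: 5m
        labels:
          severity: warning
```

An alert built this way stops firing, and is reported as resolved, as soon as the time window closes, even if the underlying condition still holds.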

@roidelapluie
Member

roidelapluie commented Nov 13, 2018

https://gist.github.com/roidelapluie/8c67e9c8fb18b310a4a90cb92a23056b

Our solution, with GMT and days off.

Then you do:

vector(1) and on() business_hour

That takes holidays into consideration.

@roidelapluie
Member

PS: about daily_saving_time_belgium: yes it works.
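
The gist is not reproduced in this thread. A much simpler sketch of the same idea, assuming fixed UTC hours and no holiday or daylight-saving handling, might record a `business_hour` series as shown below. The group name and the 09:00-17:00, Monday-to-Friday window are assumptions made for illustration, not the gist's actual rules.

```yaml
groups:
  - name: calendar_example   # hypothetical group name
    rules:
      # The series exists (with value 1) Monday-Friday 09:00-16:59 UTC
      # and is absent at all other times.
      - record: business_hour
        expr: |
          vector(1)
          and on() (hour() >= 9 and hour() < 17)
          and on() (day_of_week() >= 1 and day_of_week() <= 5)
```

Any alert expression can then be restricted to that window by appending `and on() business_hour`.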

@Tom-Fawcett
Contributor

I've written a blog post on how I solved my use case - link

@roidelapluie
Member

@Tom-Fawcett this is so great!

@ghost

ghost commented Dec 12, 2018

@roidelapluie
One gotcha of the pure recording-rule-based approach appears to be that a currently firing alert stops firing when the time-of-day boundary is crossed. It is then marked as resolved and potentially triggers a resolved notification, which can confuse responders and anyone looking back at the history in the TSDB.

Have you encountered this? If there's no good workaround, I plan to try the approach @Tom-Fawcett wrote up. It seems like it would avoid that particular issue.

@roidelapluie
Member

Yes we have switched to inhibition now!! so much easier!! :)
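
The blog posts mentioned above are not reproduced here, so the following is only a sketch of how an inhibition-based setup might look, not the exact solution anyone in the thread used. It assumes a hypothetical helper alert named `QuietHours` that Prometheus fires outside working hours; the receiver names and the 09:00-17:00 UTC window are likewise assumptions.

```yaml
# alertmanager.yml (fragment). Assumes Prometheus has a helper alert such as:
#   alert: QuietHours
#   expr: vector(1) unless on() (hour() >= 9 and hour() < 17)
route:
  receiver: team-receiver
  routes:
    # The helper alert exists only to drive the inhibition,
    # so it is routed to a receiver that sends nothing.
    - match:
        alertname: QuietHours
      receiver: "null"

receivers:
  - name: team-receiver   # notification configs omitted in this sketch
  - name: "null"          # no configs, so nothing is sent

inhibit_rules:
  # While QuietHours is firing, suppress notifications for warnings.
  - source_match:
      alertname: QuietHours
    target_match:
      severity: warning
```

Because the warning alerts themselves keep firing in Prometheus, they do not appear to resolve at the time boundary; only their notifications are suppressed.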

@michael-doubez

michael-doubez commented Dec 26, 2018

I was considering the development of a calendar exporter. It would produce simple on/off status based on calendar rules.

It would be easier to handle specific cases (multiple time zones, reception rules, non-Gregorian calendars) and any number of integrations could be considered.

IMHO it would be an elegant solution, but at the cost of database space for dummy metrics.
What do you think?

@roidelapluie
Member

The main problem is that with such a thing, if it is down, Prometheus will fire many alerts. Maybe we could write a binary/script that generates files suitable for alerting rules, because that would be more reliable.

@michael-doubez

Good point. I guess the same code able to generate metrics would be able to generate such a recording rule (in simple cases). Or, it could send the corresponding inhibition requests.

In my line of work (exchange market access for financial institutions), we have a lot of checks related to calendar, across multiple timezones. So it wouldn't be limited to alert inhibition, we also expect events to occur within a specific time frame.

@danielmotaleite

danielmotaleite commented Mar 15, 2019

@roidelapluie @michael-doubez just like any other exporter, you should have redundant instances running in different zones, so if one fails, you still get data from the other.

The idea of an exporter outputting data and time-based rules is actually not bad, but developing one with enough features may be tricky! :)

@zgfh
Contributor

zgfh commented Jul 5, 2019

I am thirsty for this too. 👍

@roidelapluie
Member

Timezones are not supported in Go on Windows.

@tyrken

tyrken commented Oct 9, 2020

Alertmanager uses Golang 1.14, but 1.15 has an option to embed the timezone data - does that work on Windows? See https://golang.org/doc/go1.15#time/tzdata

@brian-brazil
Contributor Author

Not really, it was out of date when it was added and hasn't been updated since. Even if it were promptly updated, you could still easily be talking about a year for an update to propagate out given Go and AM release cycles, which is far too long - ignoring all the other problems with embedding data, such as being forced to upgrade.

@benridley
Contributor

Yeah proper timezone support will have to wait until Go parses the OS provided timezone files on Windows. There's an open issue for this in Go, so hopefully there's progress soon. We can always add the feature relatively easily as soon as support is added.

@zerkms

zerkms commented Oct 9, 2020

But isn't having some TZ support still better than no support whatsoever?
Timezones don't change frequently in the majority of regions.

@roidelapluie
Member

If we release today, it will be out of sync for Europe in about 6 months.

@zerkms

zerkms commented Oct 9, 2020

@roidelapluie is a change in timezones for the whole of Europe scheduled for next April?

@roidelapluie
Member

It seems like they moved it to 2022, but yes, it should be the end of DST here.

@brian-brazil
Contributor Author

There are countries where you often get zero notice of a change, and more generally timezone changes happen more frequently than you'd think. Canada is in the middle of one, for example (the relevant law hadn't passed yet, but they were planning on it last I looked).

@benridley
Contributor

benridley commented Oct 13, 2020

Pull request here: #2393

@hartfordfive

hartfordfive commented Nov 13, 2020

Glad to see there's a PR open for this feature, although it seems it's only for muting alerts during specific time periods. Are there any updates on allowing different alerting routes depending on the day of week and/or time? Something like what was described in this comment would be great.

@benridley
Contributor

Hi @hartfordfive, this was initially discussed in the design draft, but changing routing between time periods was determined to be too problematic because there's a lot of important behaviour tied to routes. For example, when time intervals change, should a flurry of new alerts fire to the newly active route? Should a flurry of resolved alerts be sent to the old one?

For this reason, routing remains static and muting is applied to them in the above design. However you should be able to achieve most of the same outcomes by muting routes and using the continue option to specify alternate paths for the same alert.
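
A sketch of what that might look like for the weekday/weekend case requested earlier in this thread: two sibling routes match the same alert, each is muted while the other should be active, and `continue: true` on the first lets evaluation carry on to the second. The `mute_time_intervals` syntax follows the configs posted in this thread; the interval names are hypothetical, the receiver names are borrowed from the earlier example, and the intervals are assumed to be evaluated in UTC.

```yaml
mute_time_intervals:
  - name: weekdays                    # hypothetical interval name
    time_intervals:
      - weekdays: ['monday:friday']
  - name: weekends                    # hypothetical interval name
    time_intervals:
      - weekdays: ['saturday', 'sunday']

route:
  receiver: slack_appteam
  routes:
    # Weekday path: muted at weekends. `continue: true` lets the alert
    # also be evaluated against the next sibling route.
    - match:
        alertname: MyAlert
      receiver: slack_appteam
      mute_time_intervals:
        - weekends
      continue: true
    # Weekend path: muted Monday to Friday.
    - match:
        alertname: MyAlert
      receiver: pagerduty_appteam
      mute_time_intervals:
        - weekdays

receivers:
  - name: slack_appteam       # notification configs omitted in this sketch
  - name: pagerduty_appteam
```

Both routes always match the alert; muting only suppresses the notifications of whichever route is inside its muted interval.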

@licsuj27

licsuj27 commented Dec 1, 2020

Hi,

I was wondering if anyone can help me. I followed this post and the design document and, if I understood correctly, I should set the below in alertmanager.yml:

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: "09:00"
            end_time: "17:00"

I started with this to test; however, every time I restart Alertmanager, I get the error below:

msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n line 1: field mute_time_intervals not found in type config.plain

Am I missing something?
Following this doc : https://docs.google.com/document/d/1pf-rPDQUGJUHazyr5vanTO6ft3loNZO9UoVpvhShFtA/edit

@swoga

swoga commented Dec 1, 2020

Hi @justin27c,
this feature is not part of any official release yet. You need to compile Alertmanager yourself with PR #2393, but I don't know how stable it is.

@Zhang21

Zhang21 commented Dec 29, 2020

Silencing alerts during hours 0-6 local time:

# hour() returns UTC.
# My local time is CST (UTC+8), so 00:00-06:59 local is 16:00-22:59 UTC.
node_load5 > 8 and ON() (hour() < 16 or hour() > 22)

@beorn7 beorn7 closed this as completed Mar 1, 2021
@beorn7
Member

beorn7 commented Mar 1, 2021

Implemented in #2393

@bimmerkiev

bimmerkiev commented May 14, 2021

Hi all.
I'm trying to use this feature with release 0.22.0-rc.1, but I'm having no luck with it.
Here is my alertmanager.yml:

#
# Ansible managed
#

global:
  resolve_timeout: 3m

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        - start_time: "09:00"
          end_time: "11:00"

templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: pagerduty
  pagerduty_configs:
  - client_url: http://х.х.х.х:9093/
    description: '{{ if .CommonAnnotations.summary }}{{ .CommonAnnotations.summary
      }}{{ end }}'
    routing_key: ххх
    severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower
      }}{{ end }}'

route:
  group_by:
  - alertname
  - cluster
  - service
  - env
  group_interval: 5m
  group_wait: 30s
  receiver: pagerduty
  repeat_interval: 4h
  routes:
  - group_wait: 10s
    match:
      severity: loww
      time_intervals:
        - business_hours
      receiver: pagerduty

But with that config Alertmanager won't start.
Part of the log:

level=error ts=2021-05-14T13:38:06.667Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n  line 44: cannot unmarshal !!seq into string"

Can anyone help me?
Thanks

@licsuj27

@bimmerkiev - can you paste the whole log file to check which is line 44?

@bimmerkiev

@justin27c sure. fixed.
It's about
" - business_hours"

@licsuj27

Can you change the below?

     time_intervals:
        - business_hours
      receiver: pagerduty
to 
mute_time_intervals:
     - business_hours
  receiver: pagerduty

@bimmerkiev

Already tried that - no luck. The same error

@bimmerkiev

If I try this:

  routes:
  - group_wait: 10s
    match:
      severity: loww
    mute_time_intervals:
        - business_hours
      receiver: pagerduty

I'm receiving:

level=error ts=2021-05-14T13:59:24.090Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: line 43: did not find expected key"

If like this:

  routes:
  - group_wait: 10s
    match:
      severity: loww
      mute_time_intervals:
        - business_hours
      receiver: pagerduty

The result:

level=error ts=2021-05-14T14:00:06.573Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n  line 43: cannot unmarshal !!seq into string"

@licsuj27

licsuj27 commented May 14, 2021

I have the config below and it is working fine:

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        - start_time: '18:00'
          end_time: '22:00'

route:
  group_by: ['alertname']
  group_wait: 30s
  routes:
  - match:
      severity: low
    receiver: team-receiver
    mute_time_intervals:
      - business_hours

@benridley
Contributor

Hi @bimmerkiev, you were almost there with this version:

  routes:
  - group_wait: 10s
    match:
      severity: loww
    mute_time_intervals:
        - business_hours
      receiver: pagerduty

The problem is that receiver was indented too far, which confused Alertmanager: it should sit at the same level as mute_time_intervals, group_wait, etc. This config should be OK:

  routes:
  - group_wait: 10s
    match:
      severity: loww
    mute_time_intervals:
      - business_hours
    receiver: pagerduty

@bimmerkiev

I really appreciate your help.
It works now.

@michaelrodas

Thank you all for your ideas. This is the expression I came up with to trigger the alert only during working hours on weekdays: vector(1) and on() (day_of_week() > 0 and day_of_week() < 6) and on() (hour() > 8 and hour() < 18) where 'vector(1)' is the query for the metric I'm using.
Hope it's useful to you, best regards
