Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to fix non unique GroupKeys? #3817

Open
grobinson-grafana opened this issue Apr 24, 2024 · 7 comments
Open

How to fix non unique GroupKeys? #3817

grobinson-grafana opened this issue Apr 24, 2024 · 7 comments
Labels

Comments

@grobinson-grafana
Copy link
Contributor

grobinson-grafana commented Apr 24, 2024

Background

GroupKey is a string that is derived from the matchers in the route (including any parent routes) and the labels in the group_by of the matching route.

There are a number of cases where GroupKey is not unique and its possible for two (or more) different groups to have the same GroupKey. The following YAML shows a configuration containing two routes that create groups with the same GroupKey:

receivers:
  - name: test1
  - name: test2
route:
  receiver: test1
  routes:
    - receiver: test1
      matchers:
        - foo=bar
      continue: true
    - receiver: test2
      matchers:
        - foo=bar
      mute_time_intervals:
        - name: weekends
      continue: true

The reason a user might have such a configuration is to mute notifications on the weekends, but still send webhooks to an issue tracker.

Problems

This creates a number of problems:

  1. In the case the receiver is also the same, then groups from these routes also share notification logs.
  2. If we want to use the GroupKey to mark groups (e.g. mute status) (Muted alerts are not suppressed in API #3513) then active groups can be incorrectly marked as muted.

Solutions

I've been thinking on the issue of the GroupKey being non-unique for a little while, and if its possible to make the GroupKey both stable and unique without adding new fields to the configuration file.

What do I mean by stable? I mean that the GroupKey should not depend on the position of the route relative to its siblings. This is the reason why we cannot use the RouteID when calculating the GroupKey, because as soon as the route is moved higher or lower in the configuration file the group's notification log is invalidated causing repeat notifications.

One option we have is to use a non cryptographic hash function (for example fnv) to calculate a hash using extra metadata from the route that is not included in the GroupKey at present. For example:

  1. Receiver name
  2. Mute time intervals
  3. Active time intervals
  4. Child routes

This would change how GroupKey is calculated from being a top-down path of routes from parent to child (i.e. route0/route1/route...N/groupLabels) to a bottom-up Merkle tree where the parents are derived from the hash of their children.

This means that when either the receiver, matchers, group by labels or active or mute time intervals are changed in a child route, the notification logs for the parent upwards are invalidated. This might be undesirable. The other issue I see with this solution is observability. Without adequate tools, it will be very hard to see which group is flushed in the logs just from its hash.

So perhaps it's better to just add an (optional) name to routes? If the name is absent, then Alertmanager uses the current mechanism (which can be non unique in some cases). If a unique GroupKey is required, then a name can be set on the route to discriminate between them.

grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this issue Apr 25, 2024
This commit adds (optional) names to routes to fix prometheus#3817.
In the case where a user has two routes with the same
receiver, matchers and group_by, a name can be used to
ensure their groups have unique group keys and avoid using
the same nflog.

Signed-off-by: George Robinson <george.robinson@grafana.com>
@grobinson-grafana
Copy link
Contributor Author

grobinson-grafana commented Apr 29, 2024

Another option is to add the receiver name to the RouteKey. This would be the least invasive change and fixes all but one case of non unique GroupKeys where the receiver name, matchers and group by are the same. In this case, the aggregation groups also share nflogs, even if some of the routes have active or mute time intervals.

Here is an example of how this could look for the following configuration file:

receivers:
  - name: test1
  - name: test2
route:
  receiver: test1
  routes:
    - receiver: test1
      matchers:
        - foo=bar
      continue: true
    - receiver: test2
      matchers:
        - foo=bar
      mute_time_intervals:
        - name: weekends
      continue: true

Without the change:

{}/{foo="bar"}:{}
{}/{foo="bar"}:{}

With the change:

(recv="test1",matchers={})/(recv=test1,matchers={foo="bar"}):{}
(recv="test1",matchers={})/(recv=test2,matchers={foo="bar"}):{}

This could also be shortened to something like:

(test1,{})/(test1,{foo="bar"}):{}
(test1,{})/(test2,{foo="bar"}):{}

If we choose this option, we would also want to change the nflog interface to remove r *pb.Receiver as the receiver name is now included in gkey: We still need this as other metadata is required from *pb.Receiver.

func (l *Log) Log(r *pb.Receiver, gkey string, firingAlerts, resolvedAlerts []uint64, expiry time.Duration) error {

grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this issue Apr 29, 2024
This commit adds the receiver name to the route key to reduce the
chances of having non unique group keys (prometheus#3817). Like the previous
version, it does not guarantee the group key is unique, however
it does make it collisions less likely to occur.

Signed-off-by: George Robinson <george.robinson@grafana.com>
@zecke
Copy link
Contributor

zecke commented May 4, 2024

Any chance the example needs to have a "continue" to match the second route?

@grobinson-grafana
Copy link
Contributor Author

Yes it does :) I missed that in the example!

@zecke
Copy link
Contributor

zecke commented May 5, 2024

Yes it does :) I missed that in the example!

What if we don't attempt to fix this without requiring extra configuration? What if a future version of AM refuses to accept a config where both GroupKey and Receiver are identical for two siblings? What is the users intention in creating such config?

We could offer a user a way to differentiate these two routes (extraGroupKey: newRoute)

@grobinson-grafana
Copy link
Contributor Author

What if we don't attempt to fix this without requiring extra configuration? What if a future version of AM refuses to accept a config where both GroupKey and Receiver are identical for two siblings?

This is something that I've considered too. I like it a lot because it means we don't need to add extra configuration, but I'm also concerned about breaking configurations.

We could offer a user a way to differentiate these two routes

If we choose this option I propose adding a "name" field to the route, similar to what we have for receivers.

@KA-ROM
Copy link

KA-ROM commented Jun 28, 2024

Can we add a deduplication key instead? The default Pagerduty receiver has it. Something similar seems easier to create than keeping track of the intricacies of how hashed values from other fields track over time/user changes on configuration.

@grobinson-grafana
Copy link
Contributor Author

Can we add a deduplication key instead? The default Pagerduty receiver has it. Something similar seems easier to create than keeping track of the intricacies of how hashed values from other fields track over time/user changes on configuration.

Yeah that's one of the options (named routes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants