Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(metric_alerts): Fix bug where transitioning from critical -> warning can still trigger a resolve action #29345

Merged
merged 1 commit into from
Oct 15, 2021

Conversation

wedamija
Copy link
Member

Was investigating #27161, although I suspect that it's
already resolved since it's from back in July and we pushed out a fix in August.

I happened to figure out the error we see on our own alert in #team-engineering though. Just to
cover this issue again:

  • Metric goes above the critical threshold, so we fire a critical alert
  • Metric goes below critical but remains above warning. This should result in firing the warning
    threshold again, but actually it triggers a resolve.

There are two parts to this issue.

  1. Our metric alert is misconfigured, and doesn't have an action set up for warning. So warning
    thresholds will literally never fire or do anything. I have fixed this.
  2. If the critical/warning actions are different, then the de-duping that we have in place won't
    work. In our case, since critical has an action and warning doesn't, we end up just using the
    critical action, which is tied to the resolve. This results in us showing a resolve notification.
    But this could also happen if the user set up a critical action on slack and a warning action on email.

To fix this, when we transition from critical -> warning I only include the incident trigger
associated with the active warning trigger in fired_incident_triggers, and leave the critical
trigger out. This means that if someone configures an alert with both warning and critical triggers,
but fails to set up an action on the warning trigger, then the only time the alert will fire is when
the critical alert initially fires.

I think this is most correct and realistically it doesn't make sense to set up a threshold without
an action. We should probably prevent alerts from being created like this in the UI.

@wedamija wedamija requested review from a team and ahmedetefy October 15, 2021 00:59
…ing can still trigger a resolve action

Was investigating #27161, although I suspect that it's
already resolved since it's from back in July and we pushed out a fix in Auguest.

I happened to figure out the error we see on our own alert in #team-engineering though. Just to
cover this issue again:
 - Metric goes above the critical threshold, so we fire a critical alert
 - Metric goes below critical but remains above warning. This should result in firing the warning
   threshold again, but actually it triggers a resolve.

There are two parts to this issue.
1. Our metric alert is misconfigured, and doesn't have an action set up for warning. So warning
thresholds will literally never fire or do anything. I have fixed this.
2. If the critical/warning actions are different, then the de-duping that we have in place won't
work. In our case, since critical has an action and warning doesn't, we end up just using the
critical action, which is tied to the resolve. This results in us showing a resolve notification.

To fix this, when we transition from critical -> warning I only include the incident trigger
associated with the active warning trigger in `fired_incident_triggers`, and leave the critical
trigger out. This means that if someone configures an alert with both warning and critical triggers,
but fails to set up an action on the warning trigger, then the only time the alert will fire is when
the critical alert initially fires.

I think this is most correct and realistically it doesn't make sense to set up a threshold without
an action. We should probably prevent alerts from being created like this in the UI.
@@ -167,8 +165,7 @@ def find_and_fire_active_warning_trigger(self, alert_operator, aggregation_value
and alert_operator(aggregation_value, current_trigger.alert_threshold)
):
metrics.incr("incidents.alert_rules.threshold", tags={"type": "alert"})
active_warning_it = self.trigger_alert_threshold(current_trigger, aggregation_value)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaict we don't actually need to do this. All this logic already happens when we resolve the critical trigger. We just want to find the active warning trigger if available and return it so that we can use it to perform any actions

Copy link
Contributor

@ahmedetefy ahmedetefy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@wedamija wedamija merged commit 654d319 into master Oct 15, 2021
@wedamija wedamija deleted the danf/metric_alerts_fix_critical_resolve branch October 15, 2021 17:48
@github-actions github-actions bot locked and limited conversation to collaborators Oct 31, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants