modify check_escalation_finished_task task #1266

joeyorlando · 2023-02-01T11:53:24Z

What this PR does

This PR:

modifies the check_escalation_finished_task celery task to:
- do stricter escalation validation based on the alert group's
  escalation snapshot (see the audit_alert_group_escalation method in engine/apps/alerts/tasks/check_escalation_finished.py for the validation logic)
- use a read-only database for querying alert-groups if one is configured, otherwise use the "default" one
- ping a configurable heartbeat (new env var ALERT_GROUP_ESCALATION_AUDITOR_CELERY_TASK_HEARTBEAT_URL added)
- increase the task frequency from every 10 to every 13 minutes (this can be configured via an env variable)
- adds public documentation on how to configure this auditor task
modifies the local celery startup command to properly take into consideration all celery related env vars (similar to the ones we use in engine/celery_with_exporter.sh; this made it easier to enable celery beat locally for testing)
removes the following code:
- removes references to AlertGroup.estimate_escalation_finish_time and
  marks the model field as deprecated using the django-deprecate-fields library. This field was only used for the previous version of this validation task
- EscalationSnapshotMixin.calculate_eta_for_finish_escalation was only used to calculate the value for AlertGroup.estimate_escalation_finish_time
- calculate_escalation_finish_time celery task

Which issue(s) this PR fixes

https://github.com/grafana/oncall-private/issues/1558

Checklist

Tests updated
Documentation added
CHANGELOG.md updated

Matvey-Kuk

Now this task is failing in case of problems. We rely on it to blow our rabbit which should trigger monitoring. I wish this task to start explicitly sending alerts about problems to our OnCall instance. Also would be nice to make it check with heartbeat to make sure it's not failing silently. Also it's pretty heavy so retrying it is dangerous. Even 10 retrying tasks could damage DB.

Could you please check with @iskhakov about other probable improvements?

iskhakov

Here are a few thoughts on this task

engine/apps/alerts/tasks/check_escalation_finished.py

engine/apps/alerts/escalation_snapshot/snapshot_classes/escalation_snapshot.py

joeyorlando · 2023-02-02T19:02:52Z

engine/apps/alerts/escalation_snapshot/escalation_snapshot_mixin.py

@@ -87,8 +84,7 @@ def build_raw_escalation_snapshot(self) -> dict:
            'next_step_eta': '2021-10-18T10:28:28.890369Z
        }
        """
-
-        escalation_snapshot = None
+        data = {}


this means we will no longer be setting AlertGroup.raw_escalation_snapshot to None. Instead:

>>> from apps.alerts.escalation_snapshot.snapshot_classes import EscalationSnapshot >>> EscalationSnapshot.serializer({}).data {'channel_filter_snapshot': None, 'escalation_chain_snapshot': None, 'last_active_escalation_policy_order': None, 'escalation_policies_snapshots': [], 'slack_channel_id': None, 'pause_escalation': False, 'next_step_eta': None}

Please check if there will be no issues in places where we check raw_escalation_snapshot or escalation_snapshot since it also won't be None.
For example, we may encounter issues in escalate alert group and incident log builder (not sure, need to check)

I've added a suite of tests for the EscalationSnapshotMixin class (commit), so all should be good there.

But it seems there are a few spots that would need to be further refactored:

apps/alerts/incident_log_builder/incident_log_builder.py

apps/alerts/tasks/escalate_alert_group.py

apps/alerts/tasks/notify_all.py

apps/alerts/tasks/notify_group.py

The main thing is how to properly refactor the checks like:

if escalation_snapshot is not None: # do something

@Ferril would you mind giving some guidance on this (or hoping in to help refactor these few spots? 😄 )

In the case incident_log_builder.py:
we can check if there are any escalation policies in escalation snapshot instead. By this checking we avoid unnecessary requests to db.

In escalate_alert_group.py:
if we remove this check, escalation should go like there are just no escalation policies. I'm not sure if we need to make any distinction between "no escalation policies" and "no escalation chain". If it is important, we can check if escalation_chain in the snapshot is not None. The difference is in additional log and is_escalation_finished flag.

In cases
apps/alerts/tasks/notify_all.py
apps/alerts/tasks/notify_group.py
it looks like this check doesn't make any sense now because we work with escalation snapshot like it is not None a few lines above in the same task 🤔

for the incident_log_builder.py case, this should be addressed in this commit.

If I understand your comments correctly on the other three files, those should be safe?

joeyorlando · 2023-02-02T19:03:14Z

engine/apps/alerts/escalation_snapshot/serializers/escalation_snapshot.py

+    channel_filter_snapshot = ChannelFilterSnapshotSerializer(allow_null=True, default=None)
+    escalation_chain_snapshot = EscalationChainSnapshotSerializer(allow_null=True, default=None)
    last_active_escalation_policy_order = serializers.IntegerField(allow_null=True, default=None)
-    escalation_policies_snapshots = EscalationPolicySnapshotSerializer(many=True)
-    slack_channel_id = serializers.CharField(allow_null=True)
+    escalation_policies_snapshots = EscalationPolicySnapshotSerializer(many=True, default=list)
+    slack_channel_id = serializers.CharField(allow_null=True, default=None)


related to this change

engine/apps/alerts/tasks/check_escalation_finished.py

joeyorlando · 2023-02-03T11:55:09Z

engine/apps/alerts/tests/test_escalation_snapshot.py

-@pytest.fixture()
-def escalation_snapshot_test_setup(
-    make_organization_and_user,
-    make_user_for_organization,
-    make_alert_receive_channel,
-    make_channel_filter,
-    make_escalation_chain,
-    make_escalation_policy,
-    make_alert_group,
-):
-    organization, user_1 = make_organization_and_user()
-    user_2 = make_user_for_organization(organization)
-
-    alert_receive_channel = make_alert_receive_channel(organization)
-
-    escalation_chain = make_escalation_chain(organization)
-    channel_filter = make_channel_filter(
-        alert_receive_channel,
-        escalation_chain=escalation_chain,
-        notification_backends={"BACKEND": {"channel_id": "abc123"}},
-    )
-
-    notify_to_multiple_users_step = make_escalation_policy(
-        escalation_chain=channel_filter.escalation_chain,
-        escalation_policy_step=EscalationPolicy.STEP_NOTIFY_MULTIPLE_USERS,
-    )
-    notify_to_multiple_users_step.notify_to_users_queue.set([user_1, user_2])
-    wait_step = make_escalation_policy(
-        escalation_chain=channel_filter.escalation_chain,
-        escalation_policy_step=EscalationPolicy.STEP_WAIT,
-        wait_delay=EscalationPolicy.FIFTEEN_MINUTES,
-    )
-    # random time for test
-    from_time = datetime.time(10, 30)
-    to_time = datetime.time(18, 45)
-    notify_if_time_step = make_escalation_policy(
-        escalation_chain=channel_filter.escalation_chain,
-        escalation_policy_step=EscalationPolicy.STEP_NOTIFY_IF_TIME,
-        from_time=from_time,
-        to_time=to_time,
-    )
-
-    alert_group = make_alert_group(alert_receive_channel, channel_filter=channel_filter)
-    alert_group.raw_escalation_snapshot = alert_group.build_raw_escalation_snapshot()
-    alert_group.save()
-    return alert_group, notify_to_multiple_users_step, wait_step, notify_if_time_step
-
-


this was moved to apps/alerts/tests/conftest.py so that it could be used by some other tests within that directory

joeyorlando · 2023-02-03T11:56:16Z

engine/apps/alerts/tests/conftest.py

+@pytest.fixture()
+def escalation_snapshot_test_setup(
+    make_organization_and_user,
+    make_user_for_organization,
+    make_alert_receive_channel,
+    make_channel_filter,
+    make_escalation_chain,
+    make_escalation_policy,
+    make_alert_group,
+):
+    organization, user_1 = make_organization_and_user()
+    user_2 = make_user_for_organization(organization)
+
+    alert_receive_channel = make_alert_receive_channel(organization)
+
+    escalation_chain = make_escalation_chain(organization)
+    channel_filter = make_channel_filter(
+        alert_receive_channel,
+        escalation_chain=escalation_chain,
+        notification_backends={"BACKEND": {"channel_id": "abc123"}},
+    )
+
+    notify_to_multiple_users_step = make_escalation_policy(
+        escalation_chain=channel_filter.escalation_chain,
+        escalation_policy_step=EscalationPolicy.STEP_NOTIFY_MULTIPLE_USERS,
+    )
+    notify_to_multiple_users_step.notify_to_users_queue.set([user_1, user_2])
+    wait_step = make_escalation_policy(
+        escalation_chain=channel_filter.escalation_chain,
+        escalation_policy_step=EscalationPolicy.STEP_WAIT,
+        wait_delay=EscalationPolicy.FIFTEEN_MINUTES,
+    )
+    # random time for test
+    from_time = datetime.time(10, 30)
+    to_time = datetime.time(18, 45)
+    notify_if_time_step = make_escalation_policy(
+        escalation_chain=channel_filter.escalation_chain,
+        escalation_policy_step=EscalationPolicy.STEP_NOTIFY_IF_TIME,
+        from_time=from_time,
+        to_time=to_time,
+    )
+
+    alert_group = make_alert_group(alert_receive_channel, channel_filter=channel_filter)
+    alert_group.raw_escalation_snapshot = alert_group.build_raw_escalation_snapshot()
+    alert_group.save()
+    return alert_group, notify_to_multiple_users_step, wait_step, notify_if_time_step


this was moved from engine/apps/alerts/tests/test_escalation_snapshot.py so that it could be used by some other tests within that directory

engine/apps/alerts/tasks/check_escalation_finished.py

joeyorlando · 2023-02-03T11:57:40Z

engine/apps/alerts/tasks/check_escalation_finished.py

+    for executed_escalation_policy_snapshot in executed_escalation_policy_snapshots:
+        escalation_policy_id = executed_escalation_policy_snapshot.id
+
+        # TODO: is it valid to only check for the finished log record type here?


question here

We create this log if all escalation steps were executed or escalation chain didn't have any steps. As far as I know we don't create it only on step Resolve incident automatically, but in this case alert group would have status resolved, so it doesn't count.

got it 👍

so it should be okay to just check for this log record type? I wasn't sure if I might also want to look for the "pending" type

What do you mean by "pending type"?

I think for all successfully finished escalations this log should exist.
But there can be a case, when escalation was finished, this log was created, and then user started escalation again (for example, resolve/unresolve). Seems like in this case it is possible to have the log record TYPE_ESCALATION_FINISHED and unfinished escalation at the same time 🤔

What do you mean by "pending type"?

ah sorry, ignore this. I thought there was a AlertGroupLogRecord.TYPE_ESCALATION_PENDING log record type, doesn't look like there is.

But there can be a case, when escalation was finished, this log was created, and then user started escalation again (for example, resolve/unresolve). Seems like in this case it is possible to have the log record TYPE_ESCALATION_FINISHED and unfinished escalation at the same time

do you have any suggestions for the query in has_finished_log_record method on how to handle this edge case?

I don't have any suggestions how to do it in one query 😞 The main thing is that we should check if escalation wasn't triggered again after this log, like there are no TYPE_ESCALATION_TRIGGERED or TYPE_ESCALATION_FAILED logs after TYPE_ESCALATION_FINISHED log

I don’t think any of the alert groups being audited by this task will have a TYPE_ESCALATION_FINISHED associated with them because the initial task query, filters for alert groups where is_escalation_finished=False, so I assume we’d never have this log record in that case?

Yes, you are right

engine/apps/alerts/tasks/check_escalation_finished.py

Ferril · 2023-03-14T09:51:44Z

engine/apps/alerts/escalation_snapshot/snapshot_classes/escalation_policy_snapshot.py

+        return AlertGroupLogRecord.objects.filter(
+            escalation_policy_id=self.id,
+            alert_group_id=alert_group_id,
+            type=AlertGroupLogRecord.TYPE_ESCALATION_FINISHED,


This type of logs doesn't have escalation_policy_id because we create it when there are no more escalation steps left

good catch 👍 this should be addressed in this commit.

I'll instead check that each executed escalation policy step has an AlertGroupLogRecord.TYPE_ESCALATION_TRIGGERED associated with it.

- use read-only database for its alert group query - do stricter escalation validation based on the alert group's escalation snapshot - ping a configurable heartbeat

to properly take into consideration all celery related env vars

- EscalationSnapshotMixin.calculate_eta_for_finish_escalation - calculate_escalation_finish_time celery task - removes references to AlertGroup.estimate_escalation_finish_time and marks the model field as deprecated

Matvey-Kuk · 2023-02-06T09:57:25Z

CHANGELOG.md

@@ -25,6 +25,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Improve alerts and alert group endpoint response time in internal API with caching ([1261](https://github.com/grafana/oncall/pull/1261))
 - Optimize alert and alert group public api endpoints and add filter by id ([1274](https://github.com/grafana/oncall/pull/1274)
 - Added Coming Soon for iOS on Mobile App screen
+- Modified `check_escalation_finished_task` celery task to use read-only databases for its query, if one is defined +
+  make the validation logic stricter + ping a configurable heartbeat on successful completion of this task


Matvey-Kuk · 2023-02-06T09:57:59Z

docs/sources/open-source/_index.md

@@ -234,3 +234,23 @@ For Grafana OnCall OSS, the mobile app QR code includes an authentication token
 Your Grafana OnCall OSS instance should be reachable from the same network as your mobile device, preferably from the internet.

 For more information, see [Grafana OnCall mobile app]({{< relref "../mobile-app" >}})
+
+## Alert Group Escalation Auditor


May be add comments about how to use logs to check if auditor works good?

good suggestion! addressed in this commit

understand the log output from this task

joeyorlando added the pr:no public docs Added to a PR that does not require public documentation updates label Feb 1, 2023

joeyorlando requested a review from a team February 1, 2023 11:53

Ferril approved these changes Feb 1, 2023

View reviewed changes

Matvey-Kuk suggested changes Feb 1, 2023

View reviewed changes

joeyorlando removed the pr:no public docs Added to a PR that does not require public documentation updates label Feb 1, 2023

joeyorlando requested review from a team, Matvey-Kuk and iskhakov February 1, 2023 15:41

iskhakov reviewed Feb 2, 2023

View reviewed changes

engine/apps/alerts/tasks/check_escalation_finished.py Outdated Show resolved Hide resolved

engine/apps/alerts/tasks/check_escalation_finished.py Show resolved Hide resolved

engine/apps/alerts/tasks/check_escalation_finished.py Show resolved Hide resolved

joeyorlando force-pushed the jorlando/refactor-check-escalation-finished-task branch 3 times, most recently from ec6a803 to 54af14a Compare February 2, 2023 18:27

joeyorlando commented Feb 2, 2023

View reviewed changes

engine/apps/alerts/escalation_snapshot/snapshot_classes/escalation_snapshot.py Outdated Show resolved Hide resolved

joeyorlando commented Feb 2, 2023

View reviewed changes

engine/apps/alerts/escalation_snapshot/snapshot_classes/escalation_snapshot.py Outdated Show resolved Hide resolved

joeyorlando commented Feb 2, 2023

View reviewed changes