Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modify check_escalation_finished_task task #1266

Merged
merged 14 commits into from
Mar 17, 2023

Conversation

joeyorlando
Copy link
Contributor

@joeyorlando joeyorlando commented Feb 1, 2023

What this PR does

This PR:

  • modifies the check_escalation_finished_task celery task to:
    • do stricter escalation validation based on the alert group's
      escalation snapshot (see the audit_alert_group_escalation method in engine/apps/alerts/tasks/check_escalation_finished.py for the validation logic)
    • use a read-only database for querying alert-groups if one is configured, otherwise use the "default" one
    • ping a configurable heartbeat (new env var ALERT_GROUP_ESCALATION_AUDITOR_CELERY_TASK_HEARTBEAT_URL added)
    • increase the task frequency from every 10 to every 13 minutes (this can be configured via an env variable)
    • adds public documentation on how to configure this auditor task
  • modifies the local celery startup command to properly take into consideration all celery related env vars (similar to the ones we use in engine/celery_with_exporter.sh; this made it easier to enable celery beat locally for testing)
  • removes the following code:
    • removes references to AlertGroup.estimate_escalation_finish_time and
      marks the model field as deprecated using the django-deprecate-fields library. This field was only used for the previous version of this validation task
    • EscalationSnapshotMixin.calculate_eta_for_finish_escalation was only used to calculate the value for AlertGroup.estimate_escalation_finish_time
    • calculate_escalation_finish_time celery task

Which issue(s) this PR fixes

https://github.com/grafana/oncall-private/issues/1558

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated

@joeyorlando joeyorlando added the pr:no public docs Added to a PR that does not require public documentation updates label Feb 1, 2023
@joeyorlando joeyorlando requested a review from a team February 1, 2023 11:53
Copy link
Contributor

@Matvey-Kuk Matvey-Kuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this task is failing in case of problems. We rely on it to blow our rabbit which should trigger monitoring. I wish this task to start explicitly sending alerts about problems to our OnCall instance. Also would be nice to make it check with heartbeat to make sure it's not failing silently. Also it's pretty heavy so retrying it is dangerous. Even 10 retrying tasks could damage DB.

Could you please check with @iskhakov about other probable improvements?

@joeyorlando joeyorlando removed the pr:no public docs Added to a PR that does not require public documentation updates label Feb 1, 2023
@joeyorlando joeyorlando requested review from a team, Matvey-Kuk and iskhakov February 1, 2023 15:41
Copy link
Contributor

@iskhakov iskhakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few thoughts on this task

@joeyorlando joeyorlando force-pushed the jorlando/refactor-check-escalation-finished-task branch 3 times, most recently from ec6a803 to 54af14a Compare February 2, 2023 18:27
@@ -87,8 +84,7 @@ def build_raw_escalation_snapshot(self) -> dict:
'next_step_eta': '2021-10-18T10:28:28.890369Z
}
"""

escalation_snapshot = None
data = {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means we will no longer be setting AlertGroup.raw_escalation_snapshot to None. Instead:

>>> from apps.alerts.escalation_snapshot.snapshot_classes import EscalationSnapshot
>>> EscalationSnapshot.serializer({}).data
{'channel_filter_snapshot': None, 'escalation_chain_snapshot': None, 'last_active_escalation_policy_order': None, 'escalation_policies_snapshots': [], 'slack_channel_id': None, 'pause_escalation': False, 'next_step_eta': None}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if there will be no issues in places where we check raw_escalation_snapshot or escalation_snapshot since it also won't be None.
For example, we may encounter issues in escalate alert group and incident log builder (not sure, need to check)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a suite of tests for the EscalationSnapshotMixin class (commit), so all should be good there.

But it seems there are a few spots that would need to be further refactored:

The main thing is how to properly refactor the checks like:

if escalation_snapshot is not None:
   # do something

@Ferril would you mind giving some guidance on this (or hoping in to help refactor these few spots? 😄 )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case incident_log_builder.py:
we can check if there are any escalation policies in escalation snapshot instead. By this checking we avoid unnecessary requests to db.

In escalate_alert_group.py:
if we remove this check, escalation should go like there are just no escalation policies. I'm not sure if we need to make any distinction between "no escalation policies" and "no escalation chain". If it is important, we can check if escalation_chain in the snapshot is not None. The difference is in additional log and is_escalation_finished flag.

In cases
apps/alerts/tasks/notify_all.py
apps/alerts/tasks/notify_group.py
it looks like this check doesn't make any sense now because we work with escalation snapshot like it is not None a few lines above in the same task 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the incident_log_builder.py case, this should be addressed in this commit.

If I understand your comments correctly on the other three files, those should be safe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right 👍

Comment on lines +11 to +15
channel_filter_snapshot = ChannelFilterSnapshotSerializer(allow_null=True, default=None)
escalation_chain_snapshot = EscalationChainSnapshotSerializer(allow_null=True, default=None)
last_active_escalation_policy_order = serializers.IntegerField(allow_null=True, default=None)
escalation_policies_snapshots = EscalationPolicySnapshotSerializer(many=True)
slack_channel_id = serializers.CharField(allow_null=True)
escalation_policies_snapshots = EscalationPolicySnapshotSerializer(many=True, default=list)
slack_channel_id = serializers.CharField(allow_null=True, default=None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to this change

@joeyorlando joeyorlando force-pushed the jorlando/refactor-check-escalation-finished-task branch from 8f352a1 to 4e14fc6 Compare February 3, 2023 11:52
Comment on lines -14 to -61
@pytest.fixture()
def escalation_snapshot_test_setup(
make_organization_and_user,
make_user_for_organization,
make_alert_receive_channel,
make_channel_filter,
make_escalation_chain,
make_escalation_policy,
make_alert_group,
):
organization, user_1 = make_organization_and_user()
user_2 = make_user_for_organization(organization)

alert_receive_channel = make_alert_receive_channel(organization)

escalation_chain = make_escalation_chain(organization)
channel_filter = make_channel_filter(
alert_receive_channel,
escalation_chain=escalation_chain,
notification_backends={"BACKEND": {"channel_id": "abc123"}},
)

notify_to_multiple_users_step = make_escalation_policy(
escalation_chain=channel_filter.escalation_chain,
escalation_policy_step=EscalationPolicy.STEP_NOTIFY_MULTIPLE_USERS,
)
notify_to_multiple_users_step.notify_to_users_queue.set([user_1, user_2])
wait_step = make_escalation_policy(
escalation_chain=channel_filter.escalation_chain,
escalation_policy_step=EscalationPolicy.STEP_WAIT,
wait_delay=EscalationPolicy.FIFTEEN_MINUTES,
)
# random time for test
from_time = datetime.time(10, 30)
to_time = datetime.time(18, 45)
notify_if_time_step = make_escalation_policy(
escalation_chain=channel_filter.escalation_chain,
escalation_policy_step=EscalationPolicy.STEP_NOTIFY_IF_TIME,
from_time=from_time,
to_time=to_time,
)

alert_group = make_alert_group(alert_receive_channel, channel_filter=channel_filter)
alert_group.raw_escalation_snapshot = alert_group.build_raw_escalation_snapshot()
alert_group.save()
return alert_group, notify_to_multiple_users_step, wait_step, notify_if_time_step


Copy link
Contributor Author

@joeyorlando joeyorlando Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was moved to apps/alerts/tests/conftest.py so that it could be used by some other tests within that directory

Comment on lines +17 to +62
@pytest.fixture()
def escalation_snapshot_test_setup(
make_organization_and_user,
make_user_for_organization,
make_alert_receive_channel,
make_channel_filter,
make_escalation_chain,
make_escalation_policy,
make_alert_group,
):
organization, user_1 = make_organization_and_user()
user_2 = make_user_for_organization(organization)

alert_receive_channel = make_alert_receive_channel(organization)

escalation_chain = make_escalation_chain(organization)
channel_filter = make_channel_filter(
alert_receive_channel,
escalation_chain=escalation_chain,
notification_backends={"BACKEND": {"channel_id": "abc123"}},
)

notify_to_multiple_users_step = make_escalation_policy(
escalation_chain=channel_filter.escalation_chain,
escalation_policy_step=EscalationPolicy.STEP_NOTIFY_MULTIPLE_USERS,
)
notify_to_multiple_users_step.notify_to_users_queue.set([user_1, user_2])
wait_step = make_escalation_policy(
escalation_chain=channel_filter.escalation_chain,
escalation_policy_step=EscalationPolicy.STEP_WAIT,
wait_delay=EscalationPolicy.FIFTEEN_MINUTES,
)
# random time for test
from_time = datetime.time(10, 30)
to_time = datetime.time(18, 45)
notify_if_time_step = make_escalation_policy(
escalation_chain=channel_filter.escalation_chain,
escalation_policy_step=EscalationPolicy.STEP_NOTIFY_IF_TIME,
from_time=from_time,
to_time=to_time,
)

alert_group = make_alert_group(alert_receive_channel, channel_filter=channel_filter)
alert_group.raw_escalation_snapshot = alert_group.build_raw_escalation_snapshot()
alert_group.save()
return alert_group, notify_to_multiple_users_step, wait_step, notify_if_time_step
Copy link
Contributor Author

@joeyorlando joeyorlando Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was moved from engine/apps/alerts/tests/test_escalation_snapshot.py so that it could be used by some other tests within that directory

for executed_escalation_policy_snapshot in executed_escalation_policy_snapshots:
escalation_policy_id = executed_escalation_policy_snapshot.id

# TODO: is it valid to only check for the finished log record type here?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create this log if all escalation steps were executed or escalation chain didn't have any steps. As far as I know we don't create it only on step Resolve incident automatically, but in this case alert group would have status resolved, so it doesn't count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it 👍

so it should be okay to just check for this log record type? I wasn't sure if I might also want to look for the "pending" type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "pending type"?

I think for all successfully finished escalations this log should exist.
But there can be a case, when escalation was finished, this log was created, and then user started escalation again (for example, resolve/unresolve). Seems like in this case it is possible to have the log record TYPE_ESCALATION_FINISHED and unfinished escalation at the same time 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "pending type"?

ah sorry, ignore this. I thought there was a AlertGroupLogRecord.TYPE_ESCALATION_PENDING log record type, doesn't look like there is.

But there can be a case, when escalation was finished, this log was created, and then user started escalation again (for example, resolve/unresolve). Seems like in this case it is possible to have the log record TYPE_ESCALATION_FINISHED and unfinished escalation at the same time

do you have any suggestions for the query in has_finished_log_record method on how to handle this edge case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any suggestions how to do it in one query 😞 The main thing is that we should check if escalation wasn't triggered again after this log, like there are no TYPE_ESCALATION_TRIGGERED or TYPE_ESCALATION_FAILED logs after TYPE_ESCALATION_FINISHED log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think any of the alert groups being audited by this task will have a TYPE_ESCALATION_FINISHED associated with them because the initial task query, filters for alert groups where is_escalation_finished=False, so I assume we’d never have this log record in that case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right

@joeyorlando joeyorlando force-pushed the jorlando/refactor-check-escalation-finished-task branch from 4e14fc6 to 56bfa43 Compare February 3, 2023 11:58
@joeyorlando joeyorlando requested a review from Ferril February 3, 2023 11:59
@joeyorlando joeyorlando force-pushed the jorlando/refactor-check-escalation-finished-task branch from 4439206 to ead899f Compare February 3, 2023 12:02
@joeyorlando joeyorlando force-pushed the jorlando/refactor-check-escalation-finished-task branch from be8fd3d to bd7b693 Compare February 23, 2023 10:31
return AlertGroupLogRecord.objects.filter(
escalation_policy_id=self.id,
alert_group_id=alert_group_id,
type=AlertGroupLogRecord.TYPE_ESCALATION_FINISHED,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type of logs doesn't have escalation_policy_id because we create it when there are no more escalation steps left

Copy link
Contributor Author

@joeyorlando joeyorlando Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch 👍 this should be addressed in this commit.

I'll instead check that each executed escalation policy step has an AlertGroupLogRecord.TYPE_ESCALATION_TRIGGERED associated with it.

@joeyorlando joeyorlando force-pushed the jorlando/refactor-check-escalation-finished-task branch 2 times, most recently from 15506f8 to c78e45f Compare March 15, 2023 07:04
- use read-only database for its alert group query
- do stricter escalation validation based on the alert group's
escalation snapshot
- ping a configurable heartbeat
to properly take into consideration
all celery related env vars
- EscalationSnapshotMixin.calculate_eta_for_finish_escalation
- calculate_escalation_finish_time celery task
- removes references to AlertGroup.estimate_escalation_finish_time and
marks the model field as deprecated
@joeyorlando joeyorlando force-pushed the jorlando/refactor-check-escalation-finished-task branch from 07e8446 to 6e8ebd9 Compare March 17, 2023 09:46
@joeyorlando joeyorlando enabled auto-merge March 17, 2023 09:47
CHANGELOG.md Outdated
@@ -25,6 +25,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Improve alerts and alert group endpoint response time in internal API with caching ([1261](https://github.com/grafana/oncall/pull/1261))
- Optimize alert and alert group public api endpoints and add filter by id ([1274](https://github.com/grafana/oncall/pull/1274)
- Added Coming Soon for iOS on Mobile App screen
- Modified `check_escalation_finished_task` celery task to use read-only databases for its query, if one is defined +
make the validation logic stricter + ping a configurable heartbeat on successful completion of this task
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@@ -234,3 +234,23 @@ For Grafana OnCall OSS, the mobile app QR code includes an authentication token
Your Grafana OnCall OSS instance should be reachable from the same network as your mobile device, preferably from the internet.

For more information, see [Grafana OnCall mobile app]({{< relref "../mobile-app" >}})

## Alert Group Escalation Auditor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be add comments about how to use logs to check if auditor works good?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good suggestion! addressed in this commit

@joeyorlando joeyorlando disabled auto-merge March 17, 2023 09:50
understand the log output from this task
@joeyorlando joeyorlando enabled auto-merge March 17, 2023 09:54
@joeyorlando joeyorlando disabled auto-merge March 17, 2023 10:00
@joeyorlando joeyorlando enabled auto-merge March 17, 2023 10:02
@joeyorlando joeyorlando added this pull request to the merge queue Mar 17, 2023
Merged via the queue into dev with commit 4d655df Mar 17, 2023
@joeyorlando joeyorlando deleted the jorlando/refactor-check-escalation-finished-task branch March 17, 2023 10:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants