modify check_escalation_finished_task task #1266
Conversation
Right now this task just fails when there are problems, and we rely on it blowing up our RabbitMQ, which should trigger monitoring. I'd like this task to start explicitly sending alerts about problems to our OnCall instance. It would also be nice to have it ping a heartbeat to make sure it isn't failing silently. Also, it's pretty heavy, so retrying it is dangerous: even 10 retrying tasks could damage the DB.
Could you please check with @iskhakov about other possible improvements?
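For illustration, here is a rough sketch of the kind of heartbeat ping being asked for. The setting name matches the env var this PR eventually introduces, but the helper itself is hypothetical, not the actual OnCall implementation:

```python
import requests
from django.conf import settings


def ping_escalation_auditor_heartbeat() -> None:
    # Sketch only: ping a configured heartbeat URL after a successful run so that
    # monitoring notices when the task stops completing, instead of relying on
    # failed tasks piling up in RabbitMQ.
    heartbeat_url = getattr(settings, "ALERT_GROUP_ESCALATION_AUDITOR_CELERY_TASK_HEARTBEAT_URL", None)
    if not heartbeat_url:
        return
    try:
        requests.get(heartbeat_url, timeout=5)
    except requests.RequestException:
        # A heartbeat failure should not fail the task itself.
        pass
```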
Here are a few thoughts on this task
```diff
@@ -87,8 +84,7 @@ def build_raw_escalation_snapshot(self) -> dict:
            'next_step_eta': '2021-10-18T10:28:28.890369Z
        }
        """

        escalation_snapshot = None
        data = {}
```
this means we will no longer be setting `AlertGroup.raw_escalation_snapshot` to `None`. Instead:

```python
>>> from apps.alerts.escalation_snapshot.snapshot_classes import EscalationSnapshot
>>> EscalationSnapshot.serializer({}).data
{'channel_filter_snapshot': None, 'escalation_chain_snapshot': None, 'last_active_escalation_policy_order': None, 'escalation_policies_snapshots': [], 'slack_channel_id': None, 'pause_escalation': False, 'next_step_eta': None}
```
Please check that there will be no issues in places where we check `raw_escalation_snapshot` or `escalation_snapshot`, since it also won't be `None`.
For example, we may run into issues in escalate alert group and in the incident log builder (not sure, need to check).
I've added a suite of tests for the `EscalationSnapshotMixin` class (commit), so all should be good there.
But it seems there are a few spots that would need to be further refactored:
- apps/alerts/incident_log_builder/incident_log_builder.py
- apps/alerts/tasks/escalate_alert_group.py
- apps/alerts/tasks/notify_all.py
- apps/alerts/tasks/notify_group.py

The main thing is how to properly refactor checks like:

```python
if escalation_snapshot is not None:
    # do something
```

@Ferril would you mind giving some guidance on this (or hopping in to help refactor these few spots? 😄)
In the case of `incident_log_builder.py`:
we can check whether there are any escalation policies in the escalation snapshot instead. By checking this we avoid unnecessary requests to the DB.

In `escalate_alert_group.py`:
if we remove this check, escalation should proceed as if there are simply no escalation policies. I'm not sure whether we need to make any distinction between "no escalation policies" and "no escalation chain". If it is important, we can check whether `escalation_chain` in the snapshot is not `None`. The difference is an additional log and the `is_escalation_finished` flag.

For `apps/alerts/tasks/notify_all.py` and `apps/alerts/tasks/notify_group.py`, it looks like this check doesn't make sense anymore, because we work with the escalation snapshot as if it is not `None` a few lines above in the same task 🤔
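A rough sketch of the refactor being suggested here. The stub class and helper names are illustrative; only the field names come from the serializer output shown earlier in this thread:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EscalationSnapshotStub:
    # Illustrative stand-in for the real EscalationSnapshot object.
    escalation_chain_snapshot: Optional[dict] = None
    escalation_policies_snapshots: List[dict] = field(default_factory=list)


def should_build_escalation_plan(snapshot: EscalationSnapshotStub) -> bool:
    # Before: callers guarded with `if escalation_snapshot is not None:`.
    # After this change the snapshot is always present, so check its contents
    # instead, which also avoids unnecessary DB requests in the log builder.
    return bool(snapshot.escalation_policies_snapshots)


def has_escalation_chain(snapshot: EscalationSnapshotStub) -> bool:
    # Distinguishes "no escalation chain" from "no escalation policies", if that matters.
    return snapshot.escalation_chain_snapshot is not None
```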
For the `incident_log_builder.py` case, this should be addressed in this commit.
If I understand your comments correctly on the other three files, those should be safe?
Right 👍
```diff
 channel_filter_snapshot = ChannelFilterSnapshotSerializer(allow_null=True, default=None)
 escalation_chain_snapshot = EscalationChainSnapshotSerializer(allow_null=True, default=None)
 last_active_escalation_policy_order = serializers.IntegerField(allow_null=True, default=None)
-escalation_policies_snapshots = EscalationPolicySnapshotSerializer(many=True)
-slack_channel_id = serializers.CharField(allow_null=True)
+escalation_policies_snapshots = EscalationPolicySnapshotSerializer(many=True, default=list)
+slack_channel_id = serializers.CharField(allow_null=True, default=None)
```
related to this change
```python
@pytest.fixture()
def escalation_snapshot_test_setup(
    make_organization_and_user,
    make_user_for_organization,
    make_alert_receive_channel,
    make_channel_filter,
    make_escalation_chain,
    make_escalation_policy,
    make_alert_group,
):
    organization, user_1 = make_organization_and_user()
    user_2 = make_user_for_organization(organization)

    alert_receive_channel = make_alert_receive_channel(organization)

    escalation_chain = make_escalation_chain(organization)
    channel_filter = make_channel_filter(
        alert_receive_channel,
        escalation_chain=escalation_chain,
        notification_backends={"BACKEND": {"channel_id": "abc123"}},
    )

    notify_to_multiple_users_step = make_escalation_policy(
        escalation_chain=channel_filter.escalation_chain,
        escalation_policy_step=EscalationPolicy.STEP_NOTIFY_MULTIPLE_USERS,
    )
    notify_to_multiple_users_step.notify_to_users_queue.set([user_1, user_2])
    wait_step = make_escalation_policy(
        escalation_chain=channel_filter.escalation_chain,
        escalation_policy_step=EscalationPolicy.STEP_WAIT,
        wait_delay=EscalationPolicy.FIFTEEN_MINUTES,
    )
    # random time for test
    from_time = datetime.time(10, 30)
    to_time = datetime.time(18, 45)
    notify_if_time_step = make_escalation_policy(
        escalation_chain=channel_filter.escalation_chain,
        escalation_policy_step=EscalationPolicy.STEP_NOTIFY_IF_TIME,
        from_time=from_time,
        to_time=to_time,
    )

    alert_group = make_alert_group(alert_receive_channel, channel_filter=channel_filter)
    alert_group.raw_escalation_snapshot = alert_group.build_raw_escalation_snapshot()
    alert_group.save()
    return alert_group, notify_to_multiple_users_step, wait_step, notify_if_time_step
```
this was moved to `apps/alerts/tests/conftest.py` so that it could be used by some other tests within that directory
(Same `escalation_snapshot_test_setup` fixture as shown above, added in its new location.)
this was moved from `engine/apps/alerts/tests/test_escalation_snapshot.py` so that it could be used by some other tests within that directory
```python
for executed_escalation_policy_snapshot in executed_escalation_policy_snapshots:
    escalation_policy_id = executed_escalation_policy_snapshot.id

    # TODO: is it valid to only check for the finished log record type here?
```
question here
We create this log if all escalation steps were executed or the escalation chain didn't have any steps. As far as I know, the only step for which we don't create it is `Resolve incident automatically`, but in that case the alert group would have status `resolved`, so it doesn't count.
got it 👍
so it should be okay to just check for this log record type? I wasn't sure whether I should also look for the "pending" type
What do you mean by "pending type"?
I think this log should exist for all successfully finished escalations.
But there can be a case where escalation was finished, this log was created, and then the user started escalation again (for example, resolve/unresolve). It seems that in this case it is possible to have the `TYPE_ESCALATION_FINISHED` log record and an unfinished escalation at the same time 🤔
> What do you mean by "pending type"?

ah sorry, ignore this. I thought there was an `AlertGroupLogRecord.TYPE_ESCALATION_PENDING` log record type; doesn't look like there is.

> But there can be a case where escalation was finished, this log was created, and then the user started escalation again (for example, resolve/unresolve). It seems that in this case it is possible to have the `TYPE_ESCALATION_FINISHED` log record and an unfinished escalation at the same time

do you have any suggestions for the query in the `has_finished_log_record` method on how to handle this edge case?
I don't have any suggestions for how to do it in one query 😞 The main thing is that we should check that escalation wasn't triggered again after this log, i.e. that there are no `TYPE_ESCALATION_TRIGGERED` or `TYPE_ESCALATION_FAILED` logs after the `TYPE_ESCALATION_FINISHED` log.
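A rough sketch of one way such a check could look (not code from this PR); it assumes the log record model exposes a `created_at` timestamp and a `log_records` related name, which may differ in the actual code:

```python
from apps.alerts.models import AlertGroupLogRecord


def escalation_restarted_after_finish(alert_group) -> bool:
    # Sketch only: was escalation (re)triggered after the most recent "finished" log?
    last_finished = (
        alert_group.log_records.filter(type=AlertGroupLogRecord.TYPE_ESCALATION_FINISHED)
        .order_by("created_at")
        .last()
    )
    if last_finished is None:
        return False
    return alert_group.log_records.filter(
        type__in=[
            AlertGroupLogRecord.TYPE_ESCALATION_TRIGGERED,
            AlertGroupLogRecord.TYPE_ESCALATION_FAILED,
        ],
        created_at__gt=last_finished.created_at,
    ).exists()
```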
I don't think any of the alert groups being audited by this task will have a `TYPE_ESCALATION_FINISHED` log record associated with them, because the initial task query filters for alert groups where `is_escalation_finished=False`, so I assume we'd never have this log record in that case?
Yes, you are right
```python
return AlertGroupLogRecord.objects.filter(
    escalation_policy_id=self.id,
    alert_group_id=alert_group_id,
    type=AlertGroupLogRecord.TYPE_ESCALATION_FINISHED,
```
This type of log doesn't have an `escalation_policy_id`, because we create it when there are no more escalation steps left.
good catch 👍 this should be addressed in this commit.
I'll instead check that each executed escalation policy step has an `AlertGroupLogRecord.TYPE_ESCALATION_TRIGGERED` log record associated with it.
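A rough sketch of that per-policy check, reusing the field names from the query snippet above (the import path and function name are assumptions):

```python
from apps.alerts.models import AlertGroupLogRecord


def policy_step_was_triggered(executed_escalation_policy_snapshot, alert_group_id) -> bool:
    # Sketch only: every executed escalation policy should have left a
    # "triggered" log record behind for the audited alert group.
    return AlertGroupLogRecord.objects.filter(
        escalation_policy_id=executed_escalation_policy_snapshot.id,
        alert_group_id=alert_group_id,
        type=AlertGroupLogRecord.TYPE_ESCALATION_TRIGGERED,
    ).exists()
```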
- modify `check_escalation_finished_task` to: use a read-only database for its alert group query; do stricter escalation validation based on the alert group's escalation snapshot; ping a configurable heartbeat
- update `engine/celery_with_exporter.sh` to properly take into consideration all celery-related env vars
- remove `EscalationSnapshotMixin.calculate_eta_for_finish_escalation` and the `calculate_escalation_finish_time` celery task, and remove references to `AlertGroup.estimate_escalation_finish_time`, marking the model field as deprecated
CHANGELOG.md
```diff
@@ -25,6 +25,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Improve alerts and alert group endpoint response time in internal API with caching ([1261](https://github.com/grafana/oncall/pull/1261))
 - Optimize alert and alert group public api endpoints and add filter by id ([1274](https://github.com/grafana/oncall/pull/1274)
 - Added Coming Soon for iOS on Mobile App screen
+- Modified `check_escalation_finished_task` celery task to use read-only databases for its query, if one is defined +
+  make the validation logic stricter + ping a configurable heartbeat on successful completion of this task
```
❤️
```diff
@@ -234,3 +234,23 @@ For Grafana OnCall OSS, the mobile app QR code includes an authentication token
 Your Grafana OnCall OSS instance should be reachable from the same network as your mobile device, preferably from the internet.

 For more information, see [Grafana OnCall mobile app]({{< relref "../mobile-app" >}})
+
+## Alert Group Escalation Auditor
```
❤️
Maybe add comments about how to use logs to check that the auditor is working well?
good suggestion! addressed in this commit, which documents how to understand the log output from this task
What this PR does
This PR:
- modifies the `check_escalation_finished_task` celery task to:
  - use a read-only database for its alert group query, if one is defined
  - perform stricter validation based on the alert group's escalation snapshot (see the `audit_alert_group_escalation` method in `engine/apps/alerts/tasks/check_escalation_finished.py` for the validation logic)
  - ping a configurable heartbeat on successful completion (`ALERT_GROUP_ESCALATION_AUDITOR_CELERY_TASK_HEARTBEAT_URL` added)
- updates `engine/celery_with_exporter.sh` to take all celery-related env vars into consideration; this made it easier to enable `celery beat` locally for testing
- removes references to `AlertGroup.estimate_escalation_finish_time` and marks the model field as deprecated using the `django-deprecate-fields` library (see the sketch below); this field was only used for the previous version of this validation task
- removes `EscalationSnapshotMixin.calculate_eta_for_finish_escalation`, which was only used to calculate the value for `AlertGroup.estimate_escalation_finish_time`
- removes the `calculate_escalation_finish_time` celery task
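For reference, a minimal sketch of how a model field is typically marked as deprecated with `django-deprecate-fields` (the wrapped field below is illustrative, not the exact diff from this PR):

```python
from django.db import models
from deprecate_fields import deprecate_field


class AlertGroup(models.Model):
    # Wrapping the field keeps existing code and migrations working while
    # signalling that the column is unused and can be dropped later.
    estimate_escalation_finish_time = deprecate_field(models.DateTimeField(null=True, default=None))
```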
Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/1558
Checklist
- `CHANGELOG.md` updated