Rule SO is updated twice after running and is subject to OCC issues #135846

Closed
mikecote opened this issue Jul 6, 2022 · 3 comments · Fixed by #136148
Labels
  • Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework)
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

mikecote (Contributor) commented Jul 6, 2022

While working on #135663, I noticed Kibana was randomly crashing with 409 errors, which led me to a place in alerting's task runner where we forgot to await an update. After some searching, I saw we recently added a new update to the rule SO here https://github.com/elastic/kibana/blob/main/x-pack/plugins/alerting/server/task_runner/task_runner.ts#L461, which causes a race condition with the existing partial update.

The new update should be merged with the existing update to the rule (here) for a few reasons:

  • The existing update isn't subject to OCC issues. By design, we let the system partially update a rule's attributes without being subject to OCC (while HTTP API requests remain subject to OCC in case the system updated the rule at the same time).
  • It would be more efficient (0-1s faster) to re-use the existing partial update, as it doesn't wait for an Elasticsearch refresh before continuing the run. A rough sketch of what the merged write could look like follows this list.
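
A minimal sketch of what the merged write could look like, assuming the task runner's existing non-OCC partial update is reused; the helper name, attribute names (executionStatus, isSnoozedUntil), and option fields below are illustrative assumptions rather than the actual rule SO shape or API:

```ts
// Sketch only: fold the newly added attribute into the single partial update the
// task runner already performs at the end of a run, instead of issuing a second,
// version-checked update that can race with it.

interface RulePartialAttributes {
  executionStatus?: unknown; // already written by the existing partial update (illustrative)
  isSnoozedUntil?: string | null; // the newly added field, merged into the same write (illustrative)
}

// Stand-in for the task runner's existing partial-update helper; the real
// helper's name and options may differ.
declare function partiallyUpdateRule(
  ruleId: string,
  attributes: RulePartialAttributes,
  options: { refresh: false }
): Promise<void>;

async function persistRunResults(
  ruleId: string,
  executionStatus: unknown,
  isSnoozedUntil: string | null
): Promise<void> {
  // One write per run: no version is passed, so the write is not subject to OCC,
  // and refresh: false means we don't wait for an Elasticsearch refresh.
  await partiallyUpdateRule(ruleId, { executionStatus, isSnoozedUntil }, { refresh: false });
}
```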
mikecote added the Team:ResponseOps and Feature:Alerting/RulesFramework labels on Jul 6, 2022
elasticmachine (Contributor) commented:
Pinging @elastic/response-ops (Team:ResponseOps)

mikecote (Contributor, Author) commented Jul 6, 2022

To clarify OCC: it applies when you pass the version you read back in your update statement => https://www.elastic.co/guide/en/elasticsearch/reference/current/optimistic-concurrency-control.html.

For example, here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/alerting/server/rules_client/rules_client.ts#L2280. Without passing the version, Elasticsearch will overwrite the document even when a conflict occurred (which we don't want when doing full-document updates).
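
As a sketch of the distinction, using a minimal structural stand-in for Kibana's saved objects client (only the pieces used below are modeled; 'alert' is used as the rule's saved object type for illustration):

```ts
// Minimal stand-in for the saved objects client, for illustration only.
interface MinimalSoClient {
  get(type: string, id: string): Promise<{ version?: string; attributes: Record<string, unknown> }>;
  update(
    type: string,
    id: string,
    attributes: Record<string, unknown>,
    options?: { version?: string; refresh?: boolean | 'wait_for' }
  ): Promise<unknown>;
}

// Full-document update with OCC: pass back the version read earlier, so the
// write fails with a 409 conflict if anything else modified the rule since the read.
async function updateRuleWithOcc(soClient: MinimalSoClient, ruleId: string, attributes: Record<string, unknown>) {
  const existing = await soClient.get('alert', ruleId);
  return soClient.update('alert', ruleId, attributes, { version: existing.version });
}

// Partial update without OCC: omit version and the write is last-write-wins,
// which is the behaviour the task runner's end-of-run partial update relies on.
async function partiallyUpdateRuleWithoutOcc(soClient: MinimalSoClient, ruleId: string, attributes: Record<string, unknown>) {
  return soClient.update('alert', ruleId, attributes, { refresh: false });
}
```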

Zacqary (Contributor) commented Jul 7, 2022

Two possible ways to solve this:

  • Create a task that runs every 1m, iterates through all rules, updates their isSnoozedUntil times, and runs clearExpiredSnoozes to garbage collect. Easy to implement, but potentially doesn't scale as more rules get created (see the sketch after this list).
  • Snoozes spawn tasks. Schedule a task on a new snooze's dtstart to set the isSnoozedUntil time, then schedule a task to clear the isSnoozedUntil time when the snooze ends and to either reschedule it (if it's a recurring snooze) or clear it (if it's expired and all recurrences have passed). More difficult to implement, since we'd have to store task IDs and potentially cancel/reschedule them if a snooze gets edited, but it probably scales to an infinite number of rules.
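
For illustration, a rough sketch of what the first option's task body could look like. The rules-client methods and helpers named here (findAllRules, updateSnoozedUntil, clearExpiredSnoozes as typed below, computeSnoozedUntil) are hypothetical stand-ins for whatever the real implementation would use, not the existing API:

```ts
// Hypothetical shapes, for illustration only.
interface SnoozableRule {
  id: string;
  snoozeSchedule: Array<{ rRule: unknown; duration: number }>;
}

interface HypotheticalRulesClient {
  findAllRules(): Promise<SnoozableRule[]>;
  updateSnoozedUntil(ruleId: string, until: Date | null): Promise<void>;
  clearExpiredSnoozes(ruleId: string): Promise<void>;
}

// Hypothetical helper: derive the next "snoozed until" time from the rule's snooze schedule.
declare function computeSnoozedUntil(rule: SnoozableRule, now: Date): Date | null;

// Body of a recurring task scheduled with a 1m interval (option 1).
async function refreshSnoozeStates(rulesClient: HypotheticalRulesClient, now = new Date()): Promise<void> {
  const rules = await rulesClient.findAllRules();
  for (const rule of rules) {
    // Recompute isSnoozedUntil from the rule's snooze schedule...
    const until = computeSnoozedUntil(rule, now);
    await rulesClient.updateSnoozedUntil(rule.id, until);
    // ...and garbage-collect snoozes whose recurrences have all passed.
    await rulesClient.clearExpiredSnoozes(rule.id);
  }
}
```

The per-minute loop over every rule is exactly why the first option may not scale; the second option trades that for the bookkeeping of per-snooze task IDs.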
