Rule SO is updated twice after running and is subject to OCC issues #135846

Closed
mikecote opened this issue Jul 6, 2022 · 3 comments · Fixed by #136148
Labels
  • Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework)
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

mikecote (Contributor) commented Jul 6, 2022

While working on #135663, I noticed Kibana was randomly crashing with 409 errors, which led me to a place in alerting's task runner where we forgot to await an update. After some searching, I saw we recently added a new update to the rule SO here https://github.com/elastic/kibana/blob/main/x-pack/plugins/alerting/server/task_runner/task_runner.ts#L461, which causes a race condition with the existing partial update.

The new update should be merged with the existing update to the rule (here) for a few reasons:

  • The existing update isn't subject to OCC issues. By design, we let the system partially update a rule's attributes without being subject to OCC (while HTTP API requests remain subject to OCC in case the system updated the rule at the same time).
  • It would be more efficient (0-1s faster) to re-use the existing partial update, as it doesn't wait for an Elasticsearch refresh before continuing the run. A rough sketch of what the merged write could look like follows this list.
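
A minimal sketch of what the merged write could look like, assuming the task runner's existing non-OCC partial update is reused; the helper name, attribute names (executionStatus, isSnoozedUntil), and option fields below are illustrative assumptions rather than the actual rule SO shape or API:

```ts
// Sketch only: fold the newly added attribute into the single partial update the
// task runner already performs at the end of a run, instead of issuing a second,
// version-checked update that can race with it.

interface RulePartialAttributes {
  executionStatus?: unknown; // already written by the existing partial update (illustrative)
  isSnoozedUntil?: string | null; // the newly added field, merged into the same write (illustrative)
}

// Stand-in for the task runner's existing partial-update helper; the real
// helper's name and options may differ.
declare function partiallyUpdateRule(
  ruleId: string,
  attributes: RulePartialAttributes,
  options: { refresh: false }
): Promise<void>;

async function persistRunResults(
  ruleId: string,
  executionStatus: unknown,
  isSnoozedUntil: string | null
): Promise<void> {
  // One write per run: no version is passed, so the write is not subject to OCC,
  // and refresh: false means we don't wait for an Elasticsearch refresh.
  await partiallyUpdateRule(ruleId, { executionStatus, isSnoozedUntil }, { refresh: false });
}
```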
mikecote added the Team:ResponseOps and Feature:Alerting/RulesFramework labels on Jul 6, 2022
elasticmachine (Contributor) commented:
Pinging @elastic/response-ops (Team:ResponseOps)

mikecote (Contributor, Author) commented Jul 6, 2022

To clarify OCC: it applies when you pass the version you read back in your update statement => https://www.elastic.co/guide/en/elasticsearch/reference/current/optimistic-concurrency-control.html.

For example, here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/alerting/server/rules_client/rules_client.ts#L2280. Without passing the version, Elasticsearch will overwrite the document even when a conflict occurred (which we don't want when doing full-document updates).
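
As a sketch of the distinction, using a minimal structural stand-in for Kibana's saved objects client (only the pieces used below are modeled; 'alert' is used as the rule's saved object type for illustration):

```ts
// Minimal stand-in for the saved objects client, for illustration only.
interface MinimalSoClient {
  get(type: string, id: string): Promise<{ version?: string; attributes: Record<string, unknown> }>;
  update(
    type: string,
    id: string,
    attributes: Record<string, unknown>,
    options?: { version?: string; refresh?: boolean | 'wait_for' }
  ): Promise<unknown>;
}

// Full-document update with OCC: pass back the version read earlier, so the
// write fails with a 409 conflict if anything else modified the rule since the read.
async function updateRuleWithOcc(soClient: MinimalSoClient, ruleId: string, attributes: Record<string, unknown>) {
  const existing = await soClient.get('alert', ruleId);
  return soClient.update('alert', ruleId, attributes, { version: existing.version });
}

// Partial update without OCC: omit version and the write is last-write-wins,
// which is the behaviour the task runner's end-of-run partial update relies on.
async function partiallyUpdateRuleWithoutOcc(soClient: MinimalSoClient, ruleId: string, attributes: Record<string, unknown>) {
  return soClient.update('alert', ruleId, attributes, { refresh: false });
}
```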

Zacqary (Contributor) commented Jul 7, 2022

Two possible ways to solve this:

  • Create a task that runs every 1m, iterates through all rules, updates their isSnoozedUntil times, and runs clearExpiredSnoozes to garbage collect. Easy to implement, but potentially doesn't scale as more rules get created (see the sketch after this list).
  • Snoozes spawn tasks. Schedule a task on a new snooze's dtstart to set the isSnoozedUntil time, then schedule a task to clear the isSnoozedUntil time when the snooze ends and to either reschedule it (if it's a recurring snooze) or clear it (if it's expired and all recurrences have passed). More difficult to implement, since we'd have to store task IDs and potentially cancel/reschedule them if a snooze gets edited, but it probably scales to an infinite number of rules.
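
For illustration, a rough sketch of what the first option's task body could look like. The rules-client methods and helpers named here (findAllRules, updateSnoozedUntil, clearExpiredSnoozes as typed below, computeSnoozedUntil) are hypothetical stand-ins for whatever the real implementation would use, not the existing API:

```ts
// Hypothetical shapes, for illustration only.
interface SnoozableRule {
  id: string;
  snoozeSchedule: Array<{ rRule: unknown; duration: number }>;
}

interface HypotheticalRulesClient {
  findAllRules(): Promise<SnoozableRule[]>;
  updateSnoozedUntil(ruleId: string, until: Date | null): Promise<void>;
  clearExpiredSnoozes(ruleId: string): Promise<void>;
}

// Hypothetical helper: derive the next "snoozed until" time from the rule's snooze schedule.
declare function computeSnoozedUntil(rule: SnoozableRule, now: Date): Date | null;

// Body of a recurring task scheduled with a 1m interval (option 1).
async function refreshSnoozeStates(rulesClient: HypotheticalRulesClient, now = new Date()): Promise<void> {
  const rules = await rulesClient.findAllRules();
  for (const rule of rules) {
    // Recompute isSnoozedUntil from the rule's snooze schedule...
    const until = computeSnoozedUntil(rule, now);
    await rulesClient.updateSnoozedUntil(rule.id, until);
    // ...and garbage-collect snoozes whose recurrences have all passed.
    await rulesClient.clearExpiredSnoozes(rule.id);
  }
}
```

The per-minute loop over every rule is exactly why the first option may not scale; the second option trades that for the bookkeeping of per-snooze task IDs.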
