[Actions] A build up of failed actions can cause issues when upgrading Kibana #96577

Closed
gmmorris opened this issue Apr 8, 2021 · 4 comments · Fixed by #96971
Assignees: mikecote
Labels: bug, Feature:Actions, Team:ResponseOps, v7.13.0

Comments


gmmorris commented Apr 8, 2021

Kibana version: 7.9 - 7.12

Describe the bug:

Context

When Tasks fail in Task Manager, we keep the failed task SO around (in the .kibana_task_manager index) to aid in investigating the failure.
When Actions enqueue an execution, they schedule a task and create an action_task_params SO, which stores the parameters for the deferred execution.

The bug itself

If an action has been misconfigured on a rule, the Actions plugin will continuously enqueue executions for it, and their underlying tasks will then fail.
The end result is that both the task SO and the action_task_params SO remain in the .kibana_task_manager and .kibana indices indefinitely.

We've seen customers where these orphaned SOs grow into the millions, causing SO migrations to hang the next time they try to upgrade.

Steps to reproduce:

  1. Create a rule that always has active alerts (such as the AlwaysFiring Rule Type in the Alerting example plugin)
  2. Attach an ES Index action with the following invalid JSON:
{
  "": ""
}

This will cause the action execution to fail (we really should make that input invalid 😆 )

  3. Save the rule with a small interval so you can accumulate failures and see them appear in the event log or server log.
  4. You should see orphaned SOs (2 for each failure) start to accumulate; one way to check is sketched below.
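
For anyone reproducing this, one way to watch the failures pile up is to count the failed action tasks directly in the .kibana_task_manager index. The snippet below is only a sketch using the 7.x @elastic/elasticsearch client; the task.status and task.taskType field paths are assumptions about how Task Manager persists its task documents, so adjust them if your mappings differ.

// Sketch: count failed action-execution tasks in .kibana_task_manager.
// Field paths (task.status, task.taskType) are assumptions, not a documented contract.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // point at your cluster

async function countFailedActionTasks(): Promise<number> {
  const { body } = await client.count({
    index: '.kibana_task_manager',
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'failed' } },       // tasks that have failed
            { prefix: { 'task.taskType': 'actions:' } }, // action-execution tasks only
          ],
        },
      },
    },
  });
  return body.count;
}

countFailedActionTasks().then((count) => console.log(`failed action tasks: ${count}`));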

Expected behavior:

I'm not 100% sure what the expected behaviour is, but we shouldn't be causing upgrade failures just for the sake of making task failures easier to debug.

My suggestion is this:
Introduce a cleanup task into the Actions plugin.
This task could run once a day/week and clean up any orphaned action_task_params by collecting all of the tasks that have status:failed and type:action:..., and then mapping to their action_task_params by using the actionTaskParamsId field in the task params.
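
A very rough sketch of what that cleanup pass might look like (not the actual implementation; the saved-objects client is reduced to the one method used here, and the shape of the failed task documents is an assumption based on the description above):

// Sketch of the proposed cleanup, assuming failed action tasks have already been
// fetched (status:failed, taskType actions:*) and that their params JSON carries
// the id of the companion action_task_params saved object.
interface MinimalSoClient {
  delete(type: string, id: string): Promise<unknown>;
}

interface FailedActionTask {
  id: string;                     // id of the failed task saved object
  attributes: { params: string }; // task params, stored as a JSON string
}

export async function cleanupFailedActionTasks(
  soClient: MinimalSoClient,
  failedTasks: FailedActionTask[]
): Promise<void> {
  for (const task of failedTasks) {
    const { actionTaskParamsId } = JSON.parse(task.attributes.params) as {
      actionTaskParamsId?: string;
    };
    if (actionTaskParamsId) {
      // remove the orphaned action_task_params SO from .kibana
      await soClient.delete('action_task_params', actionTaskParamsId);
    }
    // remove the failed task SO from .kibana_task_manager
    await soClient.delete('task', task.id);
  }
}

Running something like this on a daily or weekly schedule, as suggested above, would keep the orphaned documents from accumulating between upgrades.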

One important thing to keep in mind:
This should be a temporary fix, as the work being done in this issue should remove the persisted action SOs and tasks altogether.
This means we can introduce this bug fix in 7.13 (for example) and then remove it once that change is delivered.
That would help us reduce the support burden caused by this bug sooner, without having to wait for the broader fix.

@gmmorris gmmorris added the bug label Apr 8, 2021
@botelastic botelastic bot added the needs-team label Apr 8, 2021
@gmmorris gmmorris added the Feature:Actions, Team:ResponseOps, and v7.13.0 labels Apr 8, 2021
elasticmachine commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@botelastic botelastic bot removed the needs-team label Apr 8, 2021

mikecote commented Apr 8, 2021

I wonder if we should simply make the action task run successfully every time instead of throwing at these two places here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/actions/server/lib/task_runner_factory.ts#L121-L137.

This way we know the event log will capture execution failures for actions and we can let the task and action_task_params documents get cleaned up automatically.
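
Roughly, the idea would be for the task runner to swallow the executor error instead of rethrowing it, something like the sketch below. This is only an illustration of the suggestion, not the real task_runner_factory code; executeAction and logger are hypothetical stand-ins.

// Illustration only: resolve the task run even when the action execution fails,
// relying on the event log to record the failure.
async function run(
  executeAction: () => Promise<{ status: 'ok' | 'error'; message?: string }>,
  logger: { warn(msg: string): void }
): Promise<void> {
  const result = await executeAction();
  if (result.status === 'error') {
    // Previously this is where the runner would throw, leaving the failed task
    // and its action_task_params behind. Logging and returning lets Task Manager
    // treat the run as complete and clean both documents up.
    logger.warn(`action execution failed: ${result.message ?? 'unknown error'}`);
  }
}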


pmuellr commented Apr 12, 2021

I wonder if we should simply make the action task run successfully every time instead of throwing

That seems fine. We'd be missing the actual params, which are not currently reflected in the event log docs, but I don't believe we really ever needed these in practice for diagnostics.

We also will need to "clean up" the failed action task params and task documents for these, at some point, otherwise they'll get copied into every subsequent migrated .kibana index. Maybe we also do a delete by query for these, at Kibana startup?
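
If we go the delete-by-query route, a startup pass could look something like the sketch below, again assuming the 7.x @elastic/elasticsearch client and the same task.status / task.taskType field paths as earlier; the companion action_task_params documents in .kibana would need a similar pass keyed off the ids referenced by those tasks.

// Sketch of a startup cleanup via delete-by-query. The real fix may well go
// through saved-objects APIs rather than hitting the index directly.
import { Client } from '@elastic/elasticsearch';

async function deleteFailedActionTasks(client: Client): Promise<void> {
  await client.deleteByQuery({
    index: '.kibana_task_manager',
    conflicts: 'proceed', // tolerate concurrent updates while cleaning up
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'failed' } },
            { prefix: { 'task.taskType': 'actions:' } },
          ],
        },
      },
    },
  });
}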

@mikecote mikecote self-assigned this Apr 12, 2021
mikecote commented:

That seems fine. We'd be missing the actual params, which are not currently reflected in the event log docs, but I don't believe we really ever needed these in practice for diagnostics.

We also will need to "clean up" the failed action task params and task documents for these, at some point, otherwise they'll get copied into every subsequent migrated .kibana index. Maybe we also do a delete by query for these, at Kibana startup?

Yeah, I'm thinking two things here:

  1. Prevent future executions from failing the task / not cleaning up the action_task_params
  2. Develop a cleanup mechanism for existing failures

@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022