[Actions] A build up of failed actions can cause issues when upgrading Kibana #96577

Closed
gmmorris opened this issue Apr 8, 2021 · 4 comments · Fixed by #96971
Assignees: mikecote
Labels: bug, Feature:Actions, Team:ResponseOps, v7.13.0

Comments


gmmorris commented Apr 8, 2021

Kibana version: 7.9 - 7.12

Describe the bug:

Context

When Tasks fail in Task Manager, we keep the failed task SO around (in the .kibana_task_manager index) to aid in investigating the failure.
When Actions enqueue an execution, they schedule a task and create an action_task_params SO, which stores the parameters for the deferred execution.

The bug itself

If an action has been misconfigured on a rule, the Actions plugin will continuously enqueue executions for it, and their underlying tasks will then fail.
The end result is that both the task SO and the action_task_params SO remain in the .kibana_task_manager and .kibana indices indefinitely.

We've seen customers where these orphaned SOs grow into the millions, causing SO migrations to hang the next time they try to upgrade.

Steps to reproduce:

  1. Create a rule that always has active alerts (such as the AlwaysFiring Rule Type in the Alerting example plugin)
  2. Attach an ES Index action with the following invalid JSON:
{
  "": ""
}

This will cause the action execution to fail (we really should make that input invalid 😆 )

  3. Save the rule with a small interval so you can accumulate failures and see them appear in the event log or server log.
  4. You should see orphaned SOs (2 for each failure) start to accumulate; one way to check is sketched below.
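
For anyone reproducing this, one way to watch the failures pile up is to count the failed action tasks directly in the .kibana_task_manager index. The snippet below is only a sketch using the 7.x @elastic/elasticsearch client; the task.status and task.taskType field paths are assumptions about how Task Manager persists its task documents, so adjust them if your mappings differ.

// Sketch: count failed action-execution tasks in .kibana_task_manager.
// Field paths (task.status, task.taskType) are assumptions, not a documented contract.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // point at your cluster

async function countFailedActionTasks(): Promise<number> {
  const { body } = await client.count({
    index: '.kibana_task_manager',
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'failed' } },       // tasks that have failed
            { prefix: { 'task.taskType': 'actions:' } }, // action-execution tasks only
          ],
        },
      },
    },
  });
  return body.count;
}

countFailedActionTasks().then((count) => console.log(`failed action tasks: ${count}`));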

Expected behavior:

I'm not 100% sure what the expected behaviour is, but we shouldn't be causing upgrade failures just for the sake of making task failures easier to debug.

My suggestion is this:
Introduce a cleanup task into the Actions plugin.
This task could run once a day/week and clean up any orphaned action_task_params by collecting all of the tasks that have status:failed and type:action:..., and then mapping to their action_task_params by using the actionTaskParamsId field in the task params.
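
A very rough sketch of what that cleanup pass might look like (not the actual implementation; the saved-objects client is reduced to the one method used here, and the shape of the failed task documents is an assumption based on the description above):

// Sketch of the proposed cleanup, assuming failed action tasks have already been
// fetched (status:failed, taskType actions:*) and that their params JSON carries
// the id of the companion action_task_params saved object.
interface MinimalSoClient {
  delete(type: string, id: string): Promise<unknown>;
}

interface FailedActionTask {
  id: string;                     // id of the failed task saved object
  attributes: { params: string }; // task params, stored as a JSON string
}

export async function cleanupFailedActionTasks(
  soClient: MinimalSoClient,
  failedTasks: FailedActionTask[]
): Promise<void> {
  for (const task of failedTasks) {
    const { actionTaskParamsId } = JSON.parse(task.attributes.params) as {
      actionTaskParamsId?: string;
    };
    if (actionTaskParamsId) {
      // remove the orphaned action_task_params SO from .kibana
      await soClient.delete('action_task_params', actionTaskParamsId);
    }
    // remove the failed task SO from .kibana_task_manager
    await soClient.delete('task', task.id);
  }
}

Running something like this on a daily or weekly schedule, as suggested above, would keep the orphaned documents from accumulating between upgrades.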

One important thing to keep in mind:
This should be a temporary fix, as the work being done in this issue should remove the persisted action SOs and tasks altogether.
This means we can introduce this bug fix in 7.13 (for example) and then remove it once that change is delivered.
That would help us reduce the support burden caused by this bug sooner, without having to wait for the broader fix.

@gmmorris gmmorris added the bug label Apr 8, 2021
@botelastic botelastic bot added the needs-team label Apr 8, 2021
@gmmorris gmmorris added the Feature:Actions, Team:ResponseOps, and v7.13.0 labels Apr 8, 2021
elasticmachine commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@botelastic botelastic bot removed the needs-team label Apr 8, 2021

mikecote commented Apr 8, 2021

I wonder if we should simply make the action task run successfully every time instead of throwing at these two places here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/actions/server/lib/task_runner_factory.ts#L121-L137.

This way we know the event log will capture execution failures for actions and we can let the task and action_task_params documents get cleaned up automatically.
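
Roughly, the idea would be for the task runner to swallow the executor error instead of rethrowing it, something like the sketch below. This is only an illustration of the suggestion, not the real task_runner_factory code; executeAction and logger are hypothetical stand-ins.

// Illustration only: resolve the task run even when the action execution fails,
// relying on the event log to record the failure.
async function run(
  executeAction: () => Promise<{ status: 'ok' | 'error'; message?: string }>,
  logger: { warn(msg: string): void }
): Promise<void> {
  const result = await executeAction();
  if (result.status === 'error') {
    // Previously this is where the runner would throw, leaving the failed task
    // and its action_task_params behind. Logging and returning lets Task Manager
    // treat the run as complete and clean both documents up.
    logger.warn(`action execution failed: ${result.message ?? 'unknown error'}`);
  }
}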


pmuellr commented Apr 12, 2021

I wonder if we should simply make the action task run successfully every time instead of throwing

That seems fine. We'd be missing the actual params, which are not currently reflected in the event log docs, but I don't believe we really ever needed these in practice for diagnostics.

We also will need to "clean up" the failed action task params and task documents for these, at some point, otherwise they'll get copied into every subsequent migrated .kibana index. Maybe we also do a delete by query for these, at Kibana startup?
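
If we go the delete-by-query route, a startup pass could look something like the sketch below, again assuming the 7.x @elastic/elasticsearch client and the same task.status / task.taskType field paths as earlier; the companion action_task_params documents in .kibana would need a similar pass keyed off the ids referenced by those tasks.

// Sketch of a startup cleanup via delete-by-query. The real fix may well go
// through saved-objects APIs rather than hitting the index directly.
import { Client } from '@elastic/elasticsearch';

async function deleteFailedActionTasks(client: Client): Promise<void> {
  await client.deleteByQuery({
    index: '.kibana_task_manager',
    conflicts: 'proceed', // tolerate concurrent updates while cleaning up
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'failed' } },
            { prefix: { 'task.taskType': 'actions:' } },
          ],
        },
      },
    },
  });
}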

@mikecote mikecote self-assigned this Apr 12, 2021
mikecote commented:

That seems fine. We'd be missing the actual params, which are not currently reflected in the event log docs, but I don't believe we really ever needed these in practice for diagnostics.

We also will need to "clean up" the failed action task params and task documents for these, at some point, otherwise they'll get copied into every subsequent migrated .kibana index. Maybe we also do a delete by query for these, at Kibana startup?

Yeah, I'm thinking two things here:

  1. Prevent future executions from failing the task / not cleaning up the action_task_params
  2. Develop a cleanup mechanism for existing failures

@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022