[Actions] A build up of failed actions can cause issues when upgrading Kibana #96577
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
I wonder if we should simply make the action task run successfully every time instead of throwing in the two places here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/actions/server/lib/task_runner_factory.ts#L121-L137. This way we know the event log will capture execution failures for actions and we can let the task and action_task_params documents get cleaned up automatically.
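As a purely illustrative sketch of that idea: catch execution failures, record them in the event log, and let the task complete rather than rethrow. The names `executeAction` and `eventLogger` (and the event shape) are placeholders, not the actual code in `task_runner_factory.ts`:

```typescript
// Hypothetical stand-ins so the sketch compiles; these are not Kibana's real APIs.
type ActionResult = { status: 'ok' | 'error'; message?: string };
declare function executeAction(params: Record<string, unknown>): Promise<ActionResult>;
declare const eventLogger: { logEvent(event: Record<string, unknown>): void };

// Sketch: record action execution failures in the event log and let the task
// complete, instead of throwing. A task that completes "successfully" lets
// Task Manager remove its task SO, and the associated action_task_params SO
// can then be cleaned up with it.
async function runActionTask(taskParams: Record<string, unknown>): Promise<void> {
  try {
    const result = await executeAction(taskParams);
    if (result.status === 'error') {
      eventLogger.logEvent({
        event: { action: 'execute', outcome: 'failure' },
        message: result.message,
      });
    }
  } catch (err) {
    // Unexpected errors get the same treatment: log and return normally rather
    // than rethrowing, which is what currently leaves a failed task document behind.
    eventLogger.logEvent({
      event: { action: 'execute', outcome: 'failure' },
      message: err instanceof Error ? err.message : String(err),
    });
  }
}
```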
That seems fine. We'd be missing the actual params, which are not currently reflected in the event log docs, but I don't believe we've ever really needed those in practice for diagnostics. We will also need to "clean up" the failed action task params and task documents for these at some point, otherwise they'll get copied into every subsequent migrated index.
Yeah, I'm thinking two things here:
Kibana version: 7.9 - 7.12
Describe the bug:
Context

When Tasks fail in Task Manager we keep the failed task SO around (in the `.kibana_task_manager` index) to aid in investigating the failure. When Actions enqueue an execution they schedule a task and create an `action_task_params` SO which stores the parameters for the deferred execution.

The bug itself

If an action has been misconfigured on a rule, what might happen is that the Actions plugin will continuously enqueue these actions, and their underlying tasks will then fail. The end result is that both the `task` SO and the `action_task_params` SO remain in the `.kibana_task_manager` and `.kibana` indices indefinitely. We've seen some customers where these orphaned SOs grow into the millions, causing SO migrations to hang the next time they try to upgrade.
Steps to reproduce:
This will cause the action execution to fail (we really should make that input invalid 😆).
Expected behavior:
I'm not 100% sure what the expected behaviour is, but we shouldn't be causing upgrade failures just for the sake of making task failures easier to debug.
My suggestion is this:

Introduce a cleanup task into the Actions plugin. This task could run once a day/week and clean up any orphaned `action_task_params` by collecting all of the tasks that have `status:failed` and `type:action:...`, and then mapping them to their `action_task_params` by using the `actionTaskParamsId` field in the task params (a rough sketch of this follows the caveat below).

One important thing to keep in mind:
This should be a temporary fix, as the work being done in this issue should remove the persisted action SOs and tasks altogether. This means we can introduce this bug fix in 7.13 (for example) and then remove it once that change is delivered. That would help us reduce the support burden caused by this bug sooner, without waiting on the broader approach.
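For illustration only, here is a very rough sketch of what that cleanup could look like, assuming the 7.x `@elastic/elasticsearch` client and the document layout described above. The index names, field names, `actions:` task-type prefix, and raw saved-object id format are all assumptions, and a real implementation would go through Task Manager and the saved-objects client rather than deleting raw documents:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

export async function cleanupOrphanedActionTaskParams(): Promise<void> {
  // 1. Collect failed action tasks (status:failed, task type starting with the
  //    actions prefix). The prefix and field names are assumptions.
  const failedTasks = await client.search({
    index: '.kibana_task_manager',
    size: 1000,
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'failed' } },
            { prefix: { 'task.taskType': 'actions:' } },
          ],
        },
      },
    },
  });

  for (const hit of failedTasks.body.hits.hits) {
    // 2. Map each failed task to its action_task_params via the
    //    actionTaskParamsId field stored in the task's params.
    const params = JSON.parse(hit._source?.task?.params ?? '{}');
    const actionTaskParamsId: string | undefined = params.actionTaskParamsId;

    // 3. Delete the orphaned action_task_params SO, then the task document itself.
    if (actionTaskParamsId) {
      await client.delete(
        { index: '.kibana', id: `action_task_params:${actionTaskParamsId}` },
        { ignore: [404] }
      );
    }
    await client.delete(
      { index: '.kibana_task_manager', id: hit._id },
      { ignore: [404] }
    );
  }
}
```

Deleting the `action_task_params` document before its task avoids ever leaving a task that points at missing params; beyond that, the ordering and batch size here are arbitrary choices for the sketch.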