
Catch AllocatedTask registration failures #45300

Merged

Conversation

polyfractal
Contributor

When a persistent task attempts to register an allocated task locally, this creates the Task object and starts tracking it locally. If there is a failure while initializing the task (but after the task is created), this is handled by a catch and subsequent error handling.

But if the task fails to be created because an exception is thrown in the task's ctor, this is uncaught and fails the cluster update thread. The ramification is that a persistent task remains in the cluster state but is unable to create the allocated task, and the exception prevents other tasks "after" the "poisoned" task from starting, because the task initialization loop exits early.

Because the allocated task is never created, the cancellation tools are not able to remove the persistent task and it is stuck as a zombie in the CS.

This commit adds exception handling around the task creation, and attempts to notify the master if there is a failure (so the "poisoned" persistent task can be removed). Even if this notification fails, the exception handling means the rest of the uninitialized tasks can proceed as normal.

Note: I'm not entirely sure if the completion notification is the correct approach, but it looked like the appropriate way to inform the master the persistent task should be removed. Rather unfamiliar with this area of code so open to any and all suggestions :)
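The approach described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual Elasticsearch code: the names `registerOrReport`, the `Supplier`-based ctor, and the failure log are all hypothetical, and the real fix notifies the master via the persistent-task completion mechanism rather than appending to a list.

```java
import java.util.List;
import java.util.function.Supplier;

public class AllocatedTaskRegistration {
    /**
     * Attempts to create an allocated task. If the task's ctor throws,
     * the exception is caught (so the caller's loop can continue) and the
     * failure is reported so the "poisoned" persistent task can be removed
     * from the cluster state instead of lingering as a zombie.
     */
    public static <T> T registerOrReport(Supplier<T> taskCtor,
                                         List<String> failureLog,
                                         String taskId) {
        try {
            return taskCtor.get(); // may throw from the task's ctor
        } catch (RuntimeException e) {
            // Best-effort notification stand-in: even if reporting fails,
            // catching here means subsequent tasks can still be started.
            failureLog.add("task " + taskId + " failed: " + e.getMessage());
            return null;
        }
    }
}
```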

When a persistent task attempts to register an allocated task locally,
this creates the Task object and starts tracking it locally.  If there
is a failure while initializing the task, this is handled by a catch
and subsequent error handling (canceling, unregistering, etc).

But if the task fails to be created because an exception is thrown
in the task's ctor, this is uncaught and fails the cluster update
thread.  The ramification is that a persistent task remains in the
cluster state, but is unable to create the allocated task, and the
exception prevents other tasks "after" the poisoned task from starting
too.

Because the allocated task is never created, the cancellation tools
are not able to remove the persistent task and it is stuck as a
zombie in the CS.

This commit adds exception handling around the task creation,
and attempts to notify the master if there is a failure (so the
persistent task can be removed).  Even if this notification fails,
the exception handling means the rest of the uninitialized tasks
can proceed as normal.
@polyfractal polyfractal added the :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. label Aug 7, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@imotov imotov left a comment


I think we have 3 more issues here:

  1. existence of the poisoned task - basically this shouldn't fail, or it should fail in init and not in register
  2. the loop that iterates over tasks and calls startTask() is not resilient to single task failure, perhaps we need to wrap startTask() into a try/catch so one bad task doesn't prevent all other tasks from being started
  3. we have no robust way of cleaning registered but not started tasks

I think it might make sense to add 2) as part of this PR and address 1) and 3) in follow ups.
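Item 2 above, making the start loop resilient, can be sketched like this. The names `startAll` and `startTask` are illustrative stand-ins for the real loop and method, under the assumption that a per-task try/catch is enough to keep one bad task from blocking the rest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ResilientStartLoop {
    /**
     * Iterates over all task ids and starts each one. A throw from
     * startTask (previously fatal to the whole loop, exiting it early)
     * is caught per-task, so the remaining tasks still get started.
     */
    public static List<String> startAll(List<String> taskIds,
                                        Consumer<String> startTask) {
        List<String> started = new ArrayList<>();
        for (String id : taskIds) {
            try {
                startTask.accept(id);
                started.add(id);
            } catch (RuntimeException e) {
                // Swallow (in the real code: log/report) and continue.
            }
        }
        return started;
    }
}
```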

@polyfractal
Contributor Author

Review comments addressed. I'll open issues for 1) and 3) so we don't lose track of them.

@polyfractal
Copy link
Contributor Author

@elasticmachine update branch

Contributor

@henningandersen henningandersen left a comment


LGTM.

I added a few smaller comments to address but there is no need for another round.

@polyfractal
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@polyfractal
Contributor Author

@elasticmachine update branch

@polyfractal
Contributor Author

@elasticmachine run elasticsearch-ci/2

@polyfractal polyfractal merged commit a7f6fea into elastic:master Aug 15, 2019
polyfractal added a commit that referenced this pull request Aug 15, 2019
@colings86 colings86 added the >bug label Aug 30, 2019
@DaveCTurner
Contributor

@polyfractal should we backport this to 6.8 too? I was just looking at a customer case that I think would have been helped by this.

Labels
>bug :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. v7.4.0 v8.0.0-alpha1