Catch AllocatedTask registration failures #45300

polyfractal · 2019-08-07T19:10:32Z

When a persistent task attempts to register an allocated task locally, this creates the Task object and starts tracking it locally. If there is a failure while initializing the task (but after the task is created), this is handled by a catch and subsequent error handling.

But if the task fails to be created because an exception is thrown in the task's ctor, this is uncaught and fails the cluster update thread. The ramification is that a persistent task remains in the cluster state, but is unable to create the allocated task, and the exception prevents other tasks "after" the "poisoned" task from starting, because the task initialization loop exits early.

Because the allocated task is never created, the cancellation tools are not able to remove the persistent task and it is stuck as a zombie in the CS.

This commit adds exception handling around the task creation, and attempts to notify the master if there is a failure (so the "poisoned" persistent task can be removed). Even if this notification fails, the exception handling means the rest of the uninitialized tasks can proceed as normal.
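As a rough illustration of the change described above, here is a minimal, self-contained sketch. It does not use the real `PersistentTasksNodeService` code: `TaskStartupSketch`, `Task`, `startAll`, and the `notifyMaster` callback are hypothetical names invented for this example. It only shows the shape of the idea: creation is wrapped in a try/catch, a failure is reported, and the loop keeps going even if the report itself fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical stand-in; not the Elasticsearch persistent task API. */
class TaskStartupSketch {

    /** Minimal stand-in for an allocated task. */
    static final class Task {
        final String id;

        Task(String id) {
            this.id = id;
        }
    }

    /**
     * Tries to create one task per id. If the factory throws (the "poisoned"
     * ctor case), the failure is passed to notifyMaster and the loop moves on
     * to the next id. Returns the ids of the tasks that were created.
     */
    static List<String> startAll(List<String> ids,
                                 Function<String, Task> factory,
                                 BiConsumer<String, Exception> notifyMaster) {
        List<String> started = new ArrayList<>();
        for (String id : ids) {
            Task task;
            try {
                task = factory.apply(id);
            } catch (Exception e) {
                try {
                    // Assumption: reporting the failure lets the master remove the task.
                    notifyMaster.accept(id, e);
                } catch (Exception notifyFailure) {
                    // A failed notification must not stop the remaining tasks either.
                    e.addSuppressed(notifyFailure);
                }
                continue;
            }
            started.add(task.id);
        }
        return started;
    }
}
```

With ids `a`, `poisoned`, `c` and a factory that throws only for `poisoned`, `startAll` returns `a` and `c`, and the callback is invoked once for `poisoned`.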

Note: I'm not entirely sure if the completion notification is the correct approach, but it looked like the appropriate way to inform the master the persistent task should be removed. Rather unfamiliar with this area of code so open to any and all suggestions :)

When a persistent task attempts to register an allocated task locally, this creates the Task object and starts tracking it locally. If there is a failure while initializing the task, this is handled by a catch and subsequent error handling (canceling, unregistering, etc). But if the task fails to be created because an exception is thrown in the task's ctor, this is uncaught and fails the cluster update thread. The ramification is that a persistent task remains in the cluster state, but is unable to create the allocated task, and the exception prevents other tasks "after" the poisoned task from starting too. Because the allocated task is never created, the cancellation tools are not able to remove the persistent task and it is stuck as a zombie in the CS. This commit adds exception handling around the task creation, and attempts to notify the master if there is a failure (so the persistent task can be removed). Even if this notification fails, the exception handling means the rest of the uninitialized tasks can proceed as normal.

elasticmachine · 2019-08-07T19:10:34Z

Pinging @elastic/es-distributed

imotov

I think we have 3 more issues here:

1) existence of poison task - basically this shouldn't fail or it should fail in init and not in register
2) the loop that iterates over tasks and calls startTask() is not resilient to single task failure, perhaps we need to wrap startTask() into a try/catch so one bad task doesn't prevent all other tasks from being started
3) we have no robust way of cleaning registered but not started tasks

I think it might make sense to add 2) as part of this PR and address 1) and 3) in follow ups.
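To make point 1) concrete, here is a small sketch of "fail in init and not in register". The names (`InitInsteadOfCtorSketch`, `JobTask`, `tryInit`, the `interval` setting) are made up for illustration and are not taken from the Elasticsearch code base. The constructor only stores its arguments, so object creation cannot fail on bad configuration, and the fallible work moves to `init()`, where a failure can be caught while the caller still holds a reference to the task.

```java
import java.util.Map;

/** Hypothetical task type; illustrates moving fallible setup out of the ctor. */
class InitInsteadOfCtorSketch {

    static final class JobTask {
        private final String id;
        private final Map<String, String> config;
        private int interval = -1; // stays -1 until init() succeeds

        // The ctor only assigns fields, so it does not throw on bad config.
        JobTask(String id, Map<String, String> config) {
            this.id = id;
            this.config = config;
        }

        // Fallible setup lives here instead of in the ctor.
        void init() {
            String raw = config.get("interval");
            if (raw == null) {
                throw new IllegalArgumentException("missing interval for task [" + id + "]");
            }
            interval = Integer.parseInt(raw);
        }

        int interval() {
            return interval;
        }
    }

    /**
     * Returns true if init() succeeded. On failure the caller still has the
     * task object, so it can cancel or unregister it.
     */
    static boolean tryInit(JobTask task) {
        try {
            task.init();
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```

The trade-off in this sketch is that a task object can exist in a not-yet-initialized state, so callers have to treat `init()` as a required second step.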

server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java

polyfractal · 2019-08-12T15:26:19Z

Review comments addressed. I'll open issues for 1) and 3) so we don't lose track of them.

polyfractal · 2019-08-12T16:24:45Z

@elasticmachine update branch

henningandersen

LGTM.

I added a few smaller comments to address but there is no need for another round.

server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java

server/src/test/java/org/elasticsearch/persistent/PersistentTasksNodeServiceTests.java

polyfractal · 2019-08-15T13:19:23Z

@elasticmachine run elasticsearch-ci/bwc

polyfractal · 2019-08-15T14:42:48Z

@elasticmachine update branch

polyfractal · 2019-08-15T17:33:36Z

@elasticmachine run elasticsearch-ci/2

DaveCTurner · 2020-07-11T11:00:23Z

@polyfractal should we backport this to 6.8 too? I was just looking at a customer case that I think would have been helped by this.

polyfractal added the :Distributed Coordination/Task Management label Aug 7, 2019

checkstyle

5c1be1e

polyfractal mentioned this pull request Aug 7, 2019

New rollup jobs are invisible and unusable #45247

Closed

ywelsch requested review from henningandersen and imotov August 8, 2019 06:49

imotov reviewed Aug 8, 2019

server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java

Address review comments

6e293be

Merge branch 'master' into more_robust_persistent_task_startup

58418e3

henningandersen approved these changes Aug 13, 2019

Address review comments

5c6ca8e

Merge branch 'master' into more_robust_persistent_task_startup

1dd8f6a

polyfractal added v7.4.0 v8.0.0 labels Aug 15, 2019

polyfractal merged commit a7f6fea into elastic:master Aug 15, 2019

colings86 added the >bug label Aug 30, 2019

This was referenced Sep 3, 2019

Refactor AllocatedPersistentTask#init(), move rollup logic out of ctor #46288

Merged

Refactor AllocatedPersistentTask#init(), move rollup logic out of ctor (Redux) #46444

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
