-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Catch AllocatedTask registration failures #45300
Catch AllocatedTask registration failures #45300
Conversation
When a persistent task attempts to register an allocated task locally, this creates the Task object and starts tracking it locally. If there is a failure while initializing the task, this is handled by a catch and subsequent error handling (canceling, unregistering, etc). But if the task fails to be created because an exception is thrown in the tasks ctor, this is uncaught and fails the cluster update thread. The ramification is that a persistent task remains in the cluster state, but is unable to create the allocated task, and the exception prevents other tasks "after" the poisoned task from starting too. Because the allocated task is never created, the cancellation tools are not able to remove the persistent task and it is stuck as a zombie in the CS. This commit adds exception handling around the task creation, and attempts to notify the master if there is a failure (so the persistent task can be removed). Even if this notification fails, the exception handling means the rest of the uninitialized tasks can proceed as normal.
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have 3 more issues here
- existence of poison task - basically this shouldn't fail or it should fail in init and not in register
- the loop that iterates over tasks and calls startTask() is not resilient to single task failure, perhaps we need to wrap startTask() into a try/catch so one bad task doesn't prevent all other tasks from being started
- we have no robust way of cleaning registered but not started tasks
I think it might make sense to add 2) as part of this PR and address 1) and 3) in follow ups.
server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java
Outdated
Show resolved
Hide resolved
Review comments addressed. I'll open issues for 1) and 3) so we don't lose track of them. |
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I added a few smaller comments to address but there is no need for another round.
server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/persistent/PersistentTasksNodeServiceTests.java
Outdated
Show resolved
Hide resolved
@elasticmachine run elasticsearch-ci/bwc |
@elasticmachine update branch |
@elasticmachine run elasticsearch-ci/2 |
When a persistent task attempts to register an allocated task locally, this creates the Task object and starts tracking it locally. If there is a failure while initializing the task, this is handled by a catch and subsequent error handling (canceling, unregistering, etc). But if the task fails to be created because an exception is thrown in the tasks ctor, this is uncaught and fails the cluster update thread. The ramification is that a persistent task remains in the cluster state, but is unable to create the allocated task, and the exception prevents other tasks "after" the poisoned task from starting too. Because the allocated task is never created, the cancellation tools are not able to remove the persistent task and it is stuck as a zombie in the CS. This commit adds exception handling around the task creation, and attempts to notify the master if there is a failure (so the persistent task can be removed). Even if this notification fails, the exception handling means the rest of the uninitialized tasks can proceed as normal.
@polyfractal should we backport this to 6.8 too? I was just looking at a customer case that I think would have been helped by this. |
When a persistent task attempts to register an allocated task locally, this creates the Task object and starts tracking it locally. If there is a failure while initializing the task (but after the task is created), this is handled by a catch and subsequent error handling.
But if the task fails to be created because an exception is thrown in the task's ctor, this is uncaught and fails the cluster update thread. The ramification is that a persistent task remains in the cluster state, but is unable to create the allocated task, and the exception prevents other tasks "after" the "poisoned" task from because the task initialization loop exits early.
Because the allocated task is never created, the cancellation tools are not able to remove the persistent task and it is stuck as a zombie in the CS.
This commit adds exception handling around the task creation, and attempts to notify the master if there is a failure (so the "poisoned" persistent task can be removed). Even if this notification fails, the exception handling means the rest of the uninitialized tasks can proceed as normal.
Note: I'm not entirely sure if the completion notification is the correct approach, but it looked like the appropriate way to inform the master the persistent task should be removed. Rather unfamiliar with this area of code so open to any and all suggestions :)