
[ResponseOps] Errors during marking tasks as running are not shown in metrics #191300

Merged

Conversation

@doakalexi (Contributor) commented Aug 26, 2024

Resolves #184171

Summary

Errors are not shown in metrics when Elasticsearch returns an error during the markAsRunning operation in Task Manager (the step that changes a task's status from claiming to running). This PR updates Task Manager to throw the error instead of only logging it.
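
As a minimal, self-contained sketch of the shape of the change (an illustration only, not Kibana's actual code; the real diff is discussed in the review thread below): the failure handler still records the error, but the error is now re-thrown so the polling cycle can report it.

    // Illustration only: handleFailure stands in for handleFailureOfMarkAsRunning.
    async function markAsRunningSketch(
      markAsRunning: () => Promise<void>,
      handleFailure: (err: Error) => void
    ): Promise<void> {
      return markAsRunning().catch((err) => {
        handleFailure(err); // previously the error was only handled/logged here
        throw err; // new behavior: re-throw so the failure surfaces in the claim metrics
      });
    }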

Checklist

To verify

  1. Create an Always Firing rule.
  2. Put the code below in the try block of the TaskStore.bulkUpdate method to mimic an error during markAsRunning:
      const isMarkAsRunning = docs.some(
        (doc) =>
          doc.taskType === 'alerting:example.always-firing' &&
          doc.status === 'running' &&
          doc.retryAt !== null
      );
      if (isMarkAsRunning) {
        throw SavedObjectsErrorHelpers.decorateEsUnavailableError(new Error('test'));
      }
  3. Verify that when the above error is thrown, it is reflected in the metrics endpoint results (see the example request after this list).
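
As a rough way to check step 3, here is a hedged sketch of querying the metrics endpoint from a Node/TypeScript script. The /api/task_manager/metrics path, the reset query parameter, and the task_claim counters are assumptions based on Kibana's experimental Task Manager metrics endpoint; the host and credentials are placeholders.

    // Assumed endpoint and response shape; adjust host, credentials, and field paths as needed.
    async function checkTaskClaimMetrics(): Promise<void> {
      const auth = Buffer.from('elastic:changeme').toString('base64'); // placeholder dev credentials
      const res = await fetch('http://localhost:5601/api/task_manager/metrics?reset=false', {
        headers: { Authorization: `Basic ${auth}` },
      });
      const body = await res.json();
      // After the injected error is thrown, the claim failure should be visible here,
      // e.g. the success counter falling behind total in the task_claim metrics.
      console.log(JSON.stringify(body?.metrics?.task_claim, null, 2));
    }

    checkTaskClaimMetrics().catch(console.error);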

@doakalexi (Contributor Author)

/ci

@doakalexi (Contributor Author)

/ci

@doakalexi changed the title from "Show errors from marking a task as running the metrics" to "[ResponseOps] Errors during marking tasks as running are not shown in metrics" Aug 26, 2024
@doakalexi added the release_note:skip, Team:ResponseOps, and v8.16.0 labels Aug 26, 2024
@doakalexi marked this pull request as ready for review August 26, 2024 21:37
@doakalexi requested a review from a team as a code owner August 26, 2024 21:37
@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

@doakalexi requested review from ymao1 and adcoelho August 26, 2024 21:37
-  .catch((err) => this.handleFailureOfMarkAsRunning(taskRunner, err));
+  .catch((err) => {
+    this.handleFailureOfMarkAsRunning(taskRunner, err);
+    throw err;
+  });
Contributor

Throwing the error here will cause the entire polling cycle to error, which will correctly emit a task claim failure metric, but I'm not sure that's the behavior we're looking for. If we claim 10 tasks and 1 of them fails to be marked as running, I think we should still run the other 9, which throwing this error prevents.

Contributor

I guess if the entire bulk update fails then none of the tasks would get updated anyway?

Contributor

cc @mikecote. If we throw this error here, will it have any other downstream effects?

Contributor

@ymao1

I explored a bit in the code and couldn't find any downstream effects.

The error thrown here gets caught and handled in this code:

subject.next(asPollingError<T>(e, PollingErrorType.WorkError));

If there's a way to make the successfully updated tasks run while still reporting the errors in our metrics, that would be great.

Contributor

Can't we just emit an event here?

@ersin-erdal (Contributor) commented Sep 2, 2024

> What would happen if there are requests having errors, calling this code path multiple times? If we emit multiple errors, will there be multiple counts in our metrics of task claims with error statuses?

I didn't get it. You mean something like runSoon calling this path, getting errors, and calling it again?
I think emitting an error per request is expected.

btw, I think emitting in the line Alexi linked makes sense (if I am not missing anything).

Not sure if throwing here breaks the Promise.all and stops the rest of the tasks from being handled. But if we want to throw an error here, we can switch to Promise.allSettled in order to handle tasks separately.
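
For illustration, a minimal sketch of the Promise.allSettled alternative (taskRunners, markTaskAsRunning, and reportClaimFailure are stand-in names, not the actual Task Manager code): every task is attempted, and failures are collected without rejecting the whole batch.

    // Attempt every task; collect failures instead of letting one rejection fail the batch.
    async function markAllAsRunning(
      taskRunners: Array<{ id: string; markTaskAsRunning: () => Promise<void> }>,
      reportClaimFailure: (failedCount: number) => void
    ): Promise<void> {
      const results = await Promise.allSettled(
        taskRunners.map((runner) => runner.markTaskAsRunning())
      );
      const failed = results.filter((r) => r.status === 'rejected').length;
      if (failed > 0) {
        reportClaimFailure(failed); // report once per polling cycle, not once per task
      }
    }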

@mikecote (Contributor) commented Sep 3, 2024

> I think emitting an error per request is expected.

From my understanding, emitting an event per failure to update a task will correlate to one claim cycle failure for each task. So if you fail to update 4 of 10 tasks, you emit 4 events, and the metrics show 4 task claim cycle failures, whereas we only want one record for this task claim cycle? 🤔

Contributor

I meant per runSoon request :)

I think setting a variable like hasError=true in the catch block and emitting a single event after the Promise.all would work.
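
A hedged sketch of that suggestion (handleFailure and emitError are stand-ins for the Task Manager internals): each failure is caught inside the map callback so the Promise.all never rejects, and a single error event is emitted afterwards.

    // Catch per task so Promise.all never rejects; emit one error event for the whole cycle.
    async function markAllAsRunningWithFlag(
      taskRunners: Array<{ markTaskAsRunning: () => Promise<void> }>,
      handleFailure: (err: unknown) => void,
      emitError: (err: Error) => void
    ): Promise<void> {
      let hasError = false;
      await Promise.all(
        taskRunners.map(async (runner) => {
          try {
            await runner.markTaskAsRunning();
          } catch (err) {
            hasError = true; // remember that something failed ...
            handleFailure(err); // ... but let the other tasks proceed
          }
        })
      );
      if (hasError) {
        emitError(new Error('Failed to mark one or more tasks as running'));
      }
    }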

Contributor

Gotcha, I think we're all aligned.

Going back to the original question about running the tasks that did successfully update: @doakalexi, do you know whether those tasks still run with the current approach, or whether we need to change some code?

Contributor Author

I believe that with the way it currently works, the bulk update will sometimes throw an error for the whole batch, and other times it will complete successfully with individual tasks carrying an error field. When the bulk update throws, the error is caught and re-thrown in the catch that I updated in this PR, but I think that is fine because in that case no tasks were successfully updated; they all fail. In the second case, the tasks that were successfully updated will still run, and the failed ones with the error field are handled in the bulkUpdate code. I am not totally sure we need to change the code, but please let me know if I am misunderstanding or wrong.
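
To make the two cases concrete, here is a rough sketch; the result shape is a simplified guess, not the exact TaskStore.bulkUpdate contract.

    // Simplified illustration of the two failure modes described above.
    type BulkUpdateResult = { taskId: string; error?: Error };

    async function markBatchAsRunning(
      bulkUpdate: () => Promise<BulkUpdateResult[]>
    ): Promise<string[]> {
      let results: BulkUpdateResult[];
      try {
        results = await bulkUpdate();
      } catch (err) {
        // Case 1: the whole bulk update threw, so no task was updated; re-throwing
        // (as this PR does) cannot skip any task that was successfully updated.
        console.error('bulk update failed for the whole batch', err);
        throw err;
      }
      // Case 2: the call succeeded, but individual entries may carry an error field;
      // those are handled in the bulkUpdate path, and the remaining tasks still run.
      return results.filter((r) => r.error === undefined).map((r) => r.taskId);
    }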

@doakalexi (Contributor Author) commented Sep 3, 2024

@mikecote helped me with the testing and I wanted to share what we did in case @ymao1 or @ersin-erdal want to verify.

diff --git a/x-pack/plugins/task_manager/server/task_pool/task_pool.ts b/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
index 217b03135f5..2d2028e8e6e 100644
--- a/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
+++ b/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
@@ -137,7 +137,9 @@ export class TaskPool {
       availableCapacity
     );

+    let counter = 0;
     if (tasksToRun.length) {
+      console.log(`*** Mark as running ${tasksToRun.length} task(s)`);
       await Promise.all(
         tasksToRun
           .filter(
@@ -147,6 +149,11 @@ export class TaskPool {
               )
           )
           .map(async (taskRunner) => {
+            if (counter++ % 2 !== 0) {
+              console.log(`*** Going to fail markTaskAsRunning() for ${taskRunner.id}`);
+              throw new Error('oops');
+            }
+            console.log(`*** Going to succeed markTaskAsRunning() for ${taskRunner.id}`);
             // We use taskRunner.taskExecutionId instead of taskRunner.id as key for the task pool map because
             // task cancellation is a non-blocking procedure. We calculate the expiration and immediately remove
             // the task from the task pool. There is a race condition that can occur when a recurring tasks's schedule
diff --git a/x-pack/plugins/task_manager/server/task_running/task_runner.ts b/x-pack/plugins/task_manager/server/task_running/task_runner.ts
index bfcabed9f6e..12812288f5c 100644
--- a/x-pack/plugins/task_manager/server/task_running/task_runner.ts
+++ b/x-pack/plugins/task_manager/server/task_running/task_runner.ts
@@ -372,6 +372,7 @@ export class TaskManagerRunner implements TaskRunner {
         description: 'run task',
       };

+      console.log(`*** Running task ${this.id}`);
       const result = await this.executionContext.withContext(ctx, () =>
         withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run())
       );
       

The output should look something like this:

*** Mark as running 5 task(s)
*** Going to succeed markTaskAsRunning() for endpoint:complete-external-response-actions-1.0.0
*** Going to fail markTaskAsRunning() for apm-source-map-migration-task-id
*** Going to succeed markTaskAsRunning() for Actions-actions_telemetry
*** Going to fail markTaskAsRunning() for Dashboard-dashboard_telemetry
*** Going to succeed markTaskAsRunning() for observabilityAIAssistant:indexQueuedDocumentsTask
[2024-09-03T13:29:24.674-04:00][ERROR][plugins.taskManager] Failed to poll for work: Error: oops
*** Running task endpoint:complete-external-response-actions-1.0.0
*** Running task Actions-actions_telemetry
*** Running task observabilityAIAssistant:indexQueuedDocumentsTask

@ymao1 (Contributor) left a comment

LGTM

@kibana-ci (Collaborator)

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #14 / EditableMarkdown Save button click calls onSaveContent and onChangeEditable when text area value changed

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@doakalexi merged commit 866a6c9 into elastic:main Sep 4, 2024
39 checks passed
@kibanamachine added the backport:skip label Sep 4, 2024
Labels

  • backport:skip: This commit does not require backporting
  • release_note:skip: Skip the PR/issue when compiling release notes
  • Team:ResponseOps: Label for the ResponseOps team (formerly the Cases and Alerting teams)
  • v8.16.0
Development

Successfully merging this pull request may close these issues.

Errors during marking tasks as running are not shown in metrics
7 participants