
[ResponseOps] Errors during marking tasks as running are not shown in metrics #191300

Merged

Conversation

@doakalexi (Contributor) commented Aug 26, 2024

Resolves #184171

Summary

Errors are not shown in metrics when Elasticsearch returns an error during the markAsRunning operation in Task Manager (the step that changes a task's status from claiming to running). This PR updates Task Manager to throw the error instead of only logging it.
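
As a minimal, self-contained sketch of the shape of the change (an illustration only, not Kibana's actual code; the real diff is discussed in the review thread below): the failure handler still records the error, but the error is now re-thrown so the polling cycle can report it.

    // Illustration only: handleFailure stands in for handleFailureOfMarkAsRunning.
    async function markAsRunningSketch(
      markAsRunning: () => Promise<void>,
      handleFailure: (err: Error) => void
    ): Promise<void> {
      return markAsRunning().catch((err) => {
        handleFailure(err); // previously the error was only handled/logged here
        throw err; // new behavior: re-throw so the failure surfaces in the claim metrics
      });
    }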

Checklist

To verify

  1. Create an Always Firing rule.
  2. Put the code below in the try block of the TaskStore.bulkUpdate method to mimic an error during markAsRunning:
      const isMarkAsRunning = docs.some(
        (doc) =>
          doc.taskType === 'alerting:example.always-firing' &&
          doc.status === 'running' &&
          doc.retryAt !== null
      );
      if (isMarkAsRunning) {
        throw SavedObjectsErrorHelpers.decorateEsUnavailableError(new Error('test'));
      }
  3. Verify that when the above error is thrown, it is reflected in the metrics endpoint results (see the example request after this list).
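
As a rough way to check step 3, here is a hedged sketch of querying the metrics endpoint from a Node/TypeScript script. The /api/task_manager/metrics path, the reset query parameter, and the task_claim counters are assumptions based on Kibana's experimental Task Manager metrics endpoint; the host and credentials are placeholders.

    // Assumed endpoint and response shape; adjust host, credentials, and field paths as needed.
    async function checkTaskClaimMetrics(): Promise<void> {
      const auth = Buffer.from('elastic:changeme').toString('base64'); // placeholder dev credentials
      const res = await fetch('http://localhost:5601/api/task_manager/metrics?reset=false', {
        headers: { Authorization: `Basic ${auth}` },
      });
      const body = await res.json();
      // After the injected error is thrown, the claim failure should be visible here,
      // e.g. the success counter falling behind total in the task_claim metrics.
      console.log(JSON.stringify(body?.metrics?.task_claim, null, 2));
    }

    checkTaskClaimMetrics().catch(console.error);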

@doakalexi (Contributor Author)

/ci

@doakalexi (Contributor Author)

/ci

@doakalexi changed the title from "Show errors from marking a task as running the metrics" to "[ResponseOps] Errors during marking tasks as running are not shown in metrics" Aug 26, 2024
@doakalexi added the release_note:skip, Team:ResponseOps, and v8.16.0 labels Aug 26, 2024
@doakalexi marked this pull request as ready for review August 26, 2024 21:37
@doakalexi requested a review from a team as a code owner August 26, 2024 21:37
@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

@doakalexi requested review from ymao1 and adcoelho August 26, 2024 21:37
-  .catch((err) => this.handleFailureOfMarkAsRunning(taskRunner, err));
+  .catch((err) => {
+    this.handleFailureOfMarkAsRunning(taskRunner, err);
+    throw err;
+  });
Contributor

Throwing the error here will cause the entire polling cycle to error, which will correctly emit a task claim failure metric, but I'm not sure that's the behavior we're looking for. If we claim 10 tasks and 1 of them fails to be marked as running, I think we should still run the other 9, which throwing this error prevents.

Contributor

I guess if the entire bulk update fails then none of the tasks would get updated anyway?

Contributor

cc @mikecote. If we throw this error here, will it have any other downstream effects?

Contributor

@ymao1

I explored a bit in the code and couldn't find any downstream effects.

The error thrown here gets caught and handled in this code:

subject.next(asPollingError<T>(e, PollingErrorType.WorkError));

If there's a way to make the successfully updated tasks run while still reporting the errors in our metrics, that would be great.

Contributor

Can't we just emit an event here?

@ersin-erdal (Contributor) commented Sep 2, 2024

> What would happen if there are requests having errors, calling this code path multiple times? If we emit multiple errors, will there be multiple counts in our metrics of task claims with error statuses?

I didn't get it. You mean something like runSoon calling this path, getting errors, and calling it again?
I think emitting an error per request is expected.

btw, I think emitting in the line Alexi linked makes sense (if I am not missing anything).

Not sure if throwing here breaks the Promise.all and stops the rest of the tasks from being handled. But if we want to throw an error here, we can switch to Promise.allSettled in order to handle tasks separately.
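
For illustration, a minimal sketch of the Promise.allSettled alternative (taskRunners, markTaskAsRunning, and reportClaimFailure are stand-in names, not the actual Task Manager code): every task is attempted, and failures are collected without rejecting the whole batch.

    // Attempt every task; collect failures instead of letting one rejection fail the batch.
    async function markAllAsRunning(
      taskRunners: Array<{ id: string; markTaskAsRunning: () => Promise<void> }>,
      reportClaimFailure: (failedCount: number) => void
    ): Promise<void> {
      const results = await Promise.allSettled(
        taskRunners.map((runner) => runner.markTaskAsRunning())
      );
      const failed = results.filter((r) => r.status === 'rejected').length;
      if (failed > 0) {
        reportClaimFailure(failed); // report once per polling cycle, not once per task
      }
    }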

@mikecote (Contributor) commented Sep 3, 2024

> I think emitting an error per request is expected.

From my understanding, emitting an event per failure to update a task will correlate to one claim cycle failure for each task. So if you fail to update 4 of 10 tasks, you emit 4 events, and the metrics show 4 task claim cycle failures, whereas we only want one record for this task claim cycle? 🤔

Contributor

I meant per runSoon request :)

I think setting a variable like hasError=true in the catch block and emitting a single event after the Promise.all would work.
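
A hedged sketch of that suggestion (handleFailure and emitError are stand-ins for the Task Manager internals): each failure is caught inside the map callback so the Promise.all never rejects, and a single error event is emitted afterwards.

    // Catch per task so Promise.all never rejects; emit one error event for the whole cycle.
    async function markAllAsRunningWithFlag(
      taskRunners: Array<{ markTaskAsRunning: () => Promise<void> }>,
      handleFailure: (err: unknown) => void,
      emitError: (err: Error) => void
    ): Promise<void> {
      let hasError = false;
      await Promise.all(
        taskRunners.map(async (runner) => {
          try {
            await runner.markTaskAsRunning();
          } catch (err) {
            hasError = true; // remember that something failed ...
            handleFailure(err); // ... but let the other tasks proceed
          }
        })
      );
      if (hasError) {
        emitError(new Error('Failed to mark one or more tasks as running'));
      }
    }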

Contributor

Gotcha, I think we're all aligned.

Going back to the original question about running the tasks that did successfully update: @doakalexi, do you know whether those tasks still run with the current approach, or whether we need to change some code?

Contributor Author

I believe that with the way it currently works, the bulk update will sometimes throw an error for the whole batch, and other times it will complete successfully with individual tasks carrying an error field. When the bulk update throws, the error is caught and re-thrown in the catch that I updated in this PR, but I think that is fine because in that case no tasks were successfully updated; they all fail. In the second case, the tasks that were successfully updated will still run, and the failed ones with the error field are handled in the bulkUpdate code. I am not totally sure we need to change the code, but please let me know if I am misunderstanding or wrong.
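
To make the two cases concrete, here is a rough sketch; the result shape is a simplified guess, not the exact TaskStore.bulkUpdate contract.

    // Simplified illustration of the two failure modes described above.
    type BulkUpdateResult = { taskId: string; error?: Error };

    async function markBatchAsRunning(
      bulkUpdate: () => Promise<BulkUpdateResult[]>
    ): Promise<string[]> {
      let results: BulkUpdateResult[];
      try {
        results = await bulkUpdate();
      } catch (err) {
        // Case 1: the whole bulk update threw, so no task was updated; re-throwing
        // (as this PR does) cannot skip any task that was successfully updated.
        console.error('bulk update failed for the whole batch', err);
        throw err;
      }
      // Case 2: the call succeeded, but individual entries may carry an error field;
      // those are handled in the bulkUpdate path, and the remaining tasks still run.
      return results.filter((r) => r.error === undefined).map((r) => r.taskId);
    }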

@doakalexi (Contributor Author) commented Sep 3, 2024

@mikecote helped me with the testing and I wanted to share what we did in case @ymao1 or @ersin-erdal want to verify.

diff --git a/x-pack/plugins/task_manager/server/task_pool/task_pool.ts b/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
index 217b03135f5..2d2028e8e6e 100644
--- a/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
+++ b/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
@@ -137,7 +137,9 @@ export class TaskPool {
       availableCapacity
     );

+    let counter = 0;
     if (tasksToRun.length) {
+      console.log(`*** Mark as running ${tasksToRun.length} task(s)`);
       await Promise.all(
         tasksToRun
           .filter(
@@ -147,6 +149,11 @@ export class TaskPool {
               )
           )
           .map(async (taskRunner) => {
+            if (counter++ % 2 !== 0) {
+              console.log(`*** Going to fail markTaskAsRunning() for ${taskRunner.id}`);
+              throw new Error('oops');
+            }
+            console.log(`*** Going to succeed markTaskAsRunning() for ${taskRunner.id}`);
             // We use taskRunner.taskExecutionId instead of taskRunner.id as key for the task pool map because
             // task cancellation is a non-blocking procedure. We calculate the expiration and immediately remove
             // the task from the task pool. There is a race condition that can occur when a recurring tasks's schedule
diff --git a/x-pack/plugins/task_manager/server/task_running/task_runner.ts b/x-pack/plugins/task_manager/server/task_running/task_runner.ts
index bfcabed9f6e..12812288f5c 100644
--- a/x-pack/plugins/task_manager/server/task_running/task_runner.ts
+++ b/x-pack/plugins/task_manager/server/task_running/task_runner.ts
@@ -372,6 +372,7 @@ export class TaskManagerRunner implements TaskRunner {
         description: 'run task',
       };

+      console.log(`*** Running task ${this.id}`);
       const result = await this.executionContext.withContext(ctx, () =>
         withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run())
       );
       

The output should look something like this:

*** Mark as running 5 task(s)
*** Going to succeed markTaskAsRunning() for endpoint:complete-external-response-actions-1.0.0
*** Going to fail markTaskAsRunning() for apm-source-map-migration-task-id
*** Going to succeed markTaskAsRunning() for Actions-actions_telemetry
*** Going to fail markTaskAsRunning() for Dashboard-dashboard_telemetry
*** Going to succeed markTaskAsRunning() for observabilityAIAssistant:indexQueuedDocumentsTask
[2024-09-03T13:29:24.674-04:00][ERROR][plugins.taskManager] Failed to poll for work: Error: oops
*** Running task endpoint:complete-external-response-actions-1.0.0
*** Running task Actions-actions_telemetry
*** Running task observabilityAIAssistant:indexQueuedDocumentsTask

@ymao1 (Contributor) left a comment

LGTM

@kibana-ci (Collaborator)

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #14 / EditableMarkdown Save button click calls onSaveContent and onChangeEditable when text area value changed

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@doakalexi merged commit 866a6c9 into elastic:main Sep 4, 2024
39 checks passed
@kibanamachine added the backport:skip label Sep 4, 2024
Labels

  • backport:skip: This commit does not require backporting
  • release_note:skip: Skip the PR/issue when compiling release notes
  • Team:ResponseOps: Label for the ResponseOps team (formerly the Cases and Alerting teams)
  • v8.16.0
Development

Successfully merging this pull request may close these issues.

Errors during marking tasks as running are not shown in metrics
7 participants