-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ResponseOps] Errors during marking tasks as running are not shown in metrics #191300
[ResponseOps] Errors during marking tasks as running are not shown in metrics #191300
Conversation
/ci |
/ci |
Pinging @elastic/response-ops (Team:ResponseOps) |
.catch((err) => this.handleFailureOfMarkAsRunning(taskRunner, err)); | ||
.catch((err) => { | ||
this.handleFailureOfMarkAsRunning(taskRunner, err); | ||
throw err; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throwing the error here will cause the entire polling cycle to error, which will correctly emit a task claim failure metric but i'm not sure that's the behavior we're looking for? if we claim 10 tasks and 1 of them is failed to be marked as running, I think we should still run the other 9, which I think throwing this error prevents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if the entire bulk update fails then none of the tasks would get updated anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @mikecote. If we throw this error here, will it have any other downstream effects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explored a bit in the code and couldn't find any downstream effects.
The error thrown here gets caught and handled in this code:
subject.next(asPollingError<T>(e, PollingErrorType.WorkError)); |
If there's a way to make the successfully updated tasks run while still reporting the errors in our metrics, that would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just emit an event here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if there are requests having errors, calling this code path multiple times? If we emit multiple errors, will there be multiple counts in our metrics of task claims w/ error statuses?
I didn't get it. You mean something like runSoon
calling this path, getting errors and calling again?
I think emitting error per request is expected.
btw, I think emitting in the line Alexi linked makes sense (If i am not missing anything)
Not sure if throwing here breaks the promise.all
and stops the rest of the tasks being handled. But if we want to throw an error here we can switch to promise.allSettled
in order to handle tasks separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think emitting error per request is expected.
From my understanding, emitting an event per failure to update a task will correlate to one claim cycle failure for each task. So if you fail to update 4/10 tasks, you emit 4 events, and the metrics show 4 task claim cycle failures. Which we only want one record for this task claim cycle? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant per runSoon request :)
I think setting a variable like hasError=true
in the catch block and emitting an event after the promise.all
once would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, I think we're all aligned.
Going to the original question about running the tasks that did successfully update.. @doakalexi do you know with the current approach if those tasks still run? or if we need to change some code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe with the way it works currently, the bulk update will throw an error and sometimes the bulkUpdate will successfully complete and individual tasks will have an error
field. When the bulk update throws an error it will be caught and thrown again in the catch that I updated in this PR, but I think this is fine bc there are no tasks that were successfully updated, they all fail. In the second case, those tasks that successfully update will still run and the failed ones with the error
field are handled in the bulkUpdate code. I am not totally sure we need to change the code, but pls let me know if I am misunderstanding or wrong
@mikecote helped me with the testing and I wanted to share what we did in case @ymao1 or @ersin-erdal want to verify.
The output should look something like this
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]
History
To update your PR or re-run it, just comment with: |
Resolves #184171
Summary
Errors are not shown in metrics when Elasticsearch returns an error during
markAsRunning
(changes status from claiming to running) operation in TaskManager. This PR updates the TaskManager to throw an error instead of just logging it.Checklist
To verify