[RFC] Detecting search queries which have been cancelled but continue to run for long #6953
Comments
How will this interact with the slow log feature? https://opensearch.org/docs/1.2/opensearch/logs/#slow-logs
This is not meant to interact with the slow log feature. As part of this proposal we only want to collect stats (a count) of queries which were attempted for cancellation but still continue to run for a while. There is currently no way to do this. This will help us identify the effectiveness of search backpressure, which internally tries to cancel queries. We might later plan to dump such queries along with their stack traces, but we would not use slow logs for this.
I guess I was asking a leading question. It feels to me that slow queries are the same thing as long running queries. In this proposal you're adding an API that returns information about long running queries, so can it also serve information about what is currently recorded as a slow query (and dumped in the logs)?
I just changed the description to avoid confusion. Yes, slow logs are the same as long running queries based on a defined threshold.
@sgup432 I understand now. I think I would prefer
Yeah, adding Regarding Also, I would be correcting the above stats to include both current and total counts. The total would help us visualize it over time and compare with the previous value.
This would be duplicate information (you can calculate running time from the timestamp), so that's another anti-pattern for an API. Return timestamps and let any UX consuming this data decide how to display it, including showing a duration. Speaking of displaying, getting this timestamp makes it possible to display a correct duration live without querying the API again.
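For illustration, a minimal sketch of the consumer side of this argument, assuming the API returns a wall-clock cancellation timestamp (the field name and value below are hypothetical):

```java
// Minimal client-side sketch: given a wall-clock timestamp returned once by the API,
// a UI can compute and refresh the displayed duration locally, without calling the API again.
import java.time.Duration;
import java.time.Instant;

public class CancelledTaskView {
    public static void main(String[] args) {
        long cancellationTimeMillis = 1_700_000_000_000L; // hypothetical value from the API response
        Instant cancelledAt = Instant.ofEpochMilli(cancellationTimeMillis);

        // Re-computed on every UI refresh tick.
        Duration runningSinceCancellation = Duration.between(cancelledAt, Instant.now());
        System.out.println("Running for " + runningSinceCancellation.toSeconds() + "s since cancellation");
    }
}
```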
Yes, I mean renaming How about a
I am not sure I understand all the kinds of query counts. What kinds are there?
We already have existing search stats under indices stats (e.g. query_total, fetch_total).
So I don't think we should have duplicate search stats apart from this. And regarding updating this section with cancellation stats, that is also not possible, as IndicesStats is maintained in memory by IndicesService and I don't see any way for our service (SearchCancellationMonitoringService) to update these search stats from outside. Let me know if you have more suggestions. But as we are using a separate service to track such queries, we will need to create a new stats section specific to this use case.
Name it explicitly
I feel pretty strongly about returning a timestamp and not a duration for the reasons above, but I am happy to be convinced otherwise if other maintainers feel that it's not important - maybe @nknize or @andrross?
Okay, so that's an implementation problem, not an API design problem. Implementation is obviously important, but it shouldn't be driving our design decisions. In theory, at least. What work do you expect As an example from above - we don't separate "fetch_total" into a "fetch_stats". So what's the reasoning to separate "post_cancellation_stats"?
@sgup432 you should PR what you have, especially if parts of the discussion above are non-controversial (e.g. "Introduction of a new field in Task API to detect time elapsed for cancelled query"), and we can keep discussing in the PR whether that should be a duration or a timestamp - I'm sure we can resolve that quickly either way.
Why not use the timeout? My guess is that it was not documented correctly, but there is a way to have queries time out reliably after a given time. #986
Thanks for putting up this proposal, @sgup432! I agree that task cancellations (and timeouts) are ineffective against non-cooperating code flows that do not check the 'cancelled' status regularly. This is analogous to threads that do not check the 'interrupted' status. I'm more interested in knowing how you plan to achieve 'hard' cancellations.

I like the idea of identifying tasks that were cancelled long ago. Please note that 'cancelled at' should be measured using a wall clock.

Instead of creating a new SearchCancellationMonitoringService, how about we plug this functionality into the existing TaskCancellationService? That way we can generalise this for all types of task cancellations. You can have a long polling interval and still catch long-cancelled tasks by plugging into the task completion hook provided by the task resource tracking framework.
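A minimal sketch of the clock distinction being made here, with hypothetical names (not the actual OpenSearch implementation): a wall-clock timestamp is what you would report as "cancelled at", while a monotonic reading is what you would use to measure elapsed time safely.

```java
// Hypothetical sketch: record both clocks at cancellation time.
public class CancellationTimes {
    private volatile long cancelledAtEpochMillis = -1; // wall clock, for reporting "cancelled at"
    private volatile long cancelledAtNanos = -1;       // monotonic, for measuring elapsed time

    public void onCancelled() {
        cancelledAtEpochMillis = System.currentTimeMillis(); // can jump if the system clock changes
        cancelledAtNanos = System.nanoTime();                // never jumps; only valid for deltas
    }

    // Safe to expose to users/UX: an absolute point in time.
    public long cancelledAt() {
        return cancelledAtEpochMillis;
    }

    // Safe to use internally: elapsed time unaffected by clock adjustments.
    public long nanosSinceCancellation() {
        return cancelledAtNanos < 0 ? -1 : System.nanoTime() - cancelledAtNanos;
    }
}
```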
I agree with both @dblock and @ketanv3 here regarding not including the running time, for all the reasons cited above.
@dblock Yeah, will PR the Task API related changes.
@ketanv3 Thanks for the detailed review! Appreciate your continued eyes on this.
Took a look at TaskCancellationService and also agree on the part about making it generic. It has functionality to cancel tasks when needed. In our case we want to repeatedly detect long-cancelled tasks that have been running for X (10 for now) seconds. I see the below option for achieving this: extend TaskCancellationService by also adding the capability to schedule it every Y seconds to detect long-cancelled tasks running since X seconds. To achieve this we need to inherit from AbstractLifecycleComponent and then schedule it accordingly. We can add a generic task cancellation stats section for this.
We can also add a setting through which we can enable monitoring of different types of tasks in the future. We will keep SearchShardTask for now.
Cons
I am not in favor of this approach for the reason mentioned in the cons. We could, though, add common logic in TaskCancellationService and make SearchCancellationMonitoring (and the rest) extend it. This way each task type can have its own interval and we also achieve code reusability.
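A rough sketch of this shared-base idea, using hypothetical class names and a plain ScheduledExecutorService rather than the actual OpenSearch scheduling infrastructure: common scheduling logic lives in one place, while each task type supplies its own interval and monitoring pass.

```java
// Hypothetical sketch of a shared base for periodic cancellation monitoring.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

abstract class TaskCancellationMonitor {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    /** Polling interval (Y) for this task type. */
    protected abstract long intervalSeconds();

    /** One monitoring pass: find tasks cancelled more than X seconds ago and still running. */
    protected abstract void doMonitor();

    void start() {
        scheduler.scheduleWithFixedDelay(this::doMonitor, intervalSeconds(), intervalSeconds(), TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}

class SearchCancellationMonitoring extends TaskCancellationMonitor {
    @Override
    protected long intervalSeconds() {
        return 5; // Y for search shard tasks
    }

    @Override
    protected void doMonitor() {
        // Filter currently running SearchShardTasks that were cancelled > X seconds ago
        // and update the cancellation stats counters.
    }
}
```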
I am not sure whether this was meant as an extension of the above TaskCancellationService suggestion from your side. But if I understand correctly, you are suggesting hooking into the task completion interface to track long-cancelled tasks. If yes, then I don't think this will be efficient, as some tasks continue to run for an hour or sometimes never complete gracefully, because the OpenSearch process is restarted in such cases to avoid the havoc caused by these long running tasks. So we definitely need a way to schedule a service regularly to keep track of such tasks.
IndicesService collects these stats (query_total, fetch_total, etc.) via a listener-based approach, where after every phase is done a listener is triggered and the stats are collected. This approach will not work very well for our use case, as we need to wait at least X seconds after a task is cancelled, check whether it is still running, and only then collect stats for it. Also, as @ketanv3 mentioned, we can plan to track this for all kinds of tasks and make it more generic. Through this approach, it would make more sense to add a separate stats section. Let me know if this makes sense.
@sgup432, I like the ideas presented overall.
Do we have a use case for monitoring different types of tasks at different intervals? Does that require separate monitoring threads for each 'cancellable' task type? In that case, you can explore hierarchical timing wheels to schedule work at varying intervals while still keeping the monitoring code single-threaded. But again, this is an implementation detail that can be ironed out when the PRs are out.
I was pointing out that we may not have to go with very short monitoring intervals, which would otherwise put strain on the cluster. We can combine a longer interval with plugging into that hook to catch tasks that complete before the monitoring cycle.
@ketanv3 We don't have a use case right now for different intervals for different tasks. But I guess we need to design it in such a way that it accommodates those needs in the future? And thanks, will look into the link you provided. And yes, we can discuss implementation-related details in the PR.
@sgup432 Is there anything pending for this? If not, then let's close this issue. Looks like we have made the changes for the detection part.
@sohami Yeah nothing else pending for this. Closing this out. |
Co-Author: @jainankitk
Problem Statement
As of now in OpenSearch, we have a best-effort mechanism to cancel long running search queries via the task cancel API:
POST _tasks/<task_id>/_cancel
We can refer to this as soft cancellation, as this mechanism works by setting a cancellable flag to true and then requires the search thread to check it at regular intervals. As we have seen in the past, this mechanism doesn't work all the time, as there are cases where a search query continues to run in Lucene tight loops and causes cluster instability.

While we eventually want to solve this problem by doing a "hard" cancellation, we first want to add a mechanism to detect queries which were cancelled but still continue to run for a while. With the introduction of search backpressure, there will be a lot of such cases and we need a mechanism to track them. This will also help us test the effectiveness of existing search cancellation, identify tight loops within OpenSearch/Lucene, and eventually build a case to hard-cancel rogue queries.
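For illustration, a minimal sketch of what cooperative ("soft") cancellation looks like, with hypothetical names (not the actual OpenSearch code): the task only stops if the executing code periodically checks the flag, which tight loops may never do.

```java
// Hypothetical sketch of cooperative cancellation. If the hot loop below never
// calls checkCancelled(), setting the flag has no effect and the query keeps running.
import java.util.concurrent.atomic.AtomicBoolean;

class SoftCancellableWork {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    void cancel() {
        cancelled.set(true); // best effort: only a request, not an interruption
    }

    private void checkCancelled() {
        if (cancelled.get()) {
            throw new RuntimeException("task cancelled"); // cooperative exit point
        }
    }

    void run() {
        while (hasMoreWork()) {
            checkCancelled(); // cooperative check; a tight loop that skips this never stops
            doSomeWork();
        }
    }

    private boolean hasMoreWork() { return true; }
    private void doSomeWork() { /* e.g. scoring documents */ }
}
```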
Solution
There are two parts to the solution: detection and stats around such queries (Part 1), and eventually dumping such queries into logs (Part 2).
We will only focus on Part 1 for now.
Introduction of a new field in Task API to detect time elapsed for cancelled query
We will introduce a new variable, cancellationTime, in CancellableTask. Every time a task is cancelled, we will capture this timestamp, which will later be used to calculate the time elapsed after the task was cancelled. We will also expose this information in the Tasks API by introducing a new field, running_time_since_cancellation_nanos.
This will give us additional information to display for such tasks, especially for search.
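A minimal sketch of this idea, assuming hypothetical method names rather than the actual OpenSearch CancellableTask code:

```java
// Hypothetical sketch of recording the cancellation time on a cancellable task
// and deriving the value exposed as running_time_since_cancellation_nanos.
class CancellableTaskSketch {
    private volatile long cancellationTimeNanos = -1L; // monotonic reading taken at cancellation

    // Called once when the task is cancelled (soft cancellation).
    void onCancelled() {
        if (cancellationTimeNanos < 0) {
            cancellationTimeNanos = System.nanoTime();
        }
    }

    boolean isCancelled() {
        return cancellationTimeNanos >= 0;
    }

    // Value the Tasks API response could surface for a still-running, cancelled task.
    long runningTimeSinceCancellationNanos() {
        return isCancelled() ? System.nanoTime() - cancellationTimeNanos : 0L;
    }
}
```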
Separate monitoring thread to detect such queries
As part of this approach, we will have a monitoring thread (from the generic threadpool) within OpenSearch running every Y seconds and extracting all search tasks which have still been running for X seconds after cancellation. Here Y < X.
In every such run, this monitoring thread will fetch all the currently running search shard tasks using
TaskResourceTrackingService::getResourceAwareTasks.
We will then filter out tasks which have been cancelled and have been running for more than X seconds.
X ~ 10s - Kept at this value so that it is long enough to detect tight loops within OpenSearch/Lucene and short enough to avoid missing long running queries after cancellation. Configurable.
Y ~ 5s - As X is 10s, keeping this at 5 so that we don't check too frequently while keeping enough buffer to incorporate the execution time of detection. Configurable.
Below are the components that will be newly introduced:
SearchCancellationMonitoringService: This will run every Y seconds and fetch all currently running SearchShardTasks, then filter the tasks which have been cancelled and have been running for X seconds. It will store the count of such tasks in a local variable (see the sketch after this list).
SearchCancellationMonitoringStats: Maintains all the stats related to search cancellation from above, exposed under _nodes/stats/search_cancellation.
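A rough sketch of a single monitoring pass and the stats it maintains, with hypothetical names and interfaces (the real implementation would plug into TaskResourceTrackingService and the node stats infrastructure):

```java
// Hypothetical sketch of one monitoring pass: count cancelled-but-still-running search tasks.
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class SearchCancellationMonitoringSketch {

    /** Minimal view of a task; stands in for the real SearchShardTask. */
    interface MonitoredTask {
        long getId();
        boolean isCancelled();
        long runningTimeSinceCancellationNanos();
    }

    private final long thresholdNanos = TimeUnit.SECONDS.toNanos(10); // X, configurable

    // Stats exposed via the new node-stats section: how many such tasks right now,
    // and how many distinct tasks have ever crossed the threshold (to compare over time).
    private final AtomicLong current = new AtomicLong();
    private final AtomicLong total = new AtomicLong();
    private final Set<Long> alreadyCounted = ConcurrentHashMap.newKeySet();

    // One pass, scheduled every Y seconds; `tasks` stands in for the result of
    // TaskResourceTrackingService::getResourceAwareTasks.
    void doRun(Collection<MonitoredTask> tasks) {
        long count = 0;
        for (MonitoredTask task : tasks) {
            if (task.isCancelled() && task.runningTimeSinceCancellationNanos() > thresholdNanos) {
                count++;
                if (alreadyCounted.add(task.getId())) {
                    total.incrementAndGet(); // count each offending task only once in the total
                }
            }
        }
        current.set(count);
    }
}
```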
Using the above approach, we will have some insight into such queries. OpenSearch clients can also set up a custom cron to call the node stats API to monitor such queries if required.
Future plan
We can eventually add logic to dump such queries into a log as well (Part 2). A longer-term plan is to hard-cancel such queries, but that requires a deeper dive, as there seems to be no easy way to do that for now.