Automatic task cancellation mechanism is invalid on coordinating node's reduce phase #70347
Comments
@DaveCTurner Would you please have a look at this issue and help? Thanks
Pinging @elastic/es-distributed (Team:Distributed)
Pinging @elastic/es-search (Team:Search)
Hi @lqbilbo, we have an improvement in 7.7 that cancels search requests more quickly. It would be great if you could give it a try. Thanks very much for your interest in Elasticsearch.
@dnhatn I don't think this improvement addresses the problem. The performance of a single shard is good: the response time is only around 200ms. The main bottleneck is in the reduce phase on the coordinating node rather than on the data nodes, especially the InternalTerms reduce. Taking the query mentioned in this issue as an example, the InternalTerms reduce takes 20s+, which is very slow. We added some debug information and found that it only reduces about 192920 buckets to 96458.
So there are two major problems that we need to solve.
About the first problem: we can add a cancellation checkpoint inside the consumeBucketsAndMaybeBreak method. Whenever a new bucket is created, we also check whether the current task has been cancelled. If checking on every single bucket is too much overhead, we can check once every several buckets. Previously, we compared Elasticsearch with Apache Doris, taking a terms bucket aggregation query as an example. If the response contains more than 60 thousand buckets, Elasticsearch runs until it times out, but Apache Doris handles it easily: a query that builds 507157 buckets on Apache Doris takes only 219ms.
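The checkpoint idea above could look roughly like the following. This is only a minimal sketch and not the actual Elasticsearch code: the class name `CancellableBucketConsumer`, the `CHECK_INTERVAL` value, and the `isCancelled` supplier are all hypothetical stand-ins for whatever the real bucket consumer and search task expose.

```java
import java.util.concurrent.CancellationException;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch of a bucket consumer that also polls for task
 * cancellation. It is NOT the real Elasticsearch bucket consumer.
 */
class CancellableBucketConsumer {
    // Assumed interval: poll the cancellation flag once per this many buckets.
    static final int CHECK_INTERVAL = 1024;

    private final int maxBuckets;
    private final BooleanSupplier isCancelled; // e.g. backed by the search task
    private int count;          // total buckets consumed so far
    private int sinceLastCheck; // buckets consumed since the last poll

    CancellableBucketConsumer(int maxBuckets, BooleanSupplier isCancelled) {
        this.maxBuckets = maxBuckets;
        this.isCancelled = isCancelled;
    }

    /** Called whenever the reduce creates {@code newBuckets} buckets. */
    void accept(int newBuckets) {
        count += newBuckets;
        if (count > maxBuckets) {
            throw new IllegalStateException("too many buckets: " + count);
        }
        sinceLastCheck += newBuckets;
        if (sinceLastCheck >= CHECK_INTERVAL) {
            sinceLastCheck = 0;
            // Amortized check: the flag is read once per CHECK_INTERVAL buckets.
            if (isCancelled.getAsBoolean()) {
                throw new CancellationException("task cancelled during reduce");
            }
        }
    }

    int getCount() {
        return count;
    }
}
```

With an interval like this, a cancelled reduce would stop after at most `CHECK_INTERVAL` further buckets, while the per-bucket cost stays at one addition and one comparison.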
Elasticsearch version: (7.6.2)
Plugins installed: [analysis-ik]
JVM version (openjdk 13.0.2):
OS version (Linux n23-161-209 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u6 (2018-10-08) x86_64 GNU/Linux):
Description of the problem including expected versus actual behavior:
We expect a long-running aggregation search task to be cancelled immediately when the channel is closed (#43332). In practice, when we cancel the client request during the query phase, the task is cancelled as expected: it throws the exception '...Caused by: java.lang.IllegalStateException: Task cancelled before it started: by user request' and the search task stops. But if we cancel the client request during the reduce phase, the log shows 'Received ban for the parent [N78L0bWWQEuSLPvitGDBxw:10714] on the node [N78L0bWWQEuSLPvitGDBxw], reason: [by user request]' while the search task keeps running.
Steps to reproduce:
Please include a minimal but complete recreation of the problem,
including (e.g.) index creation, mappings, settings, query etc. The easier
you make for us to reproduce it, the more likely that somebody will take the
time to look at it.
Provide logs (if relevant): I recorded the logs on the coordinating node.
(1) When I cancel the client request during the query phase, the log is:
The search task is cancelled as soon as I cancel the client request, so I did not record the time.
(2) When I cancel the client request during the reduce phase, the log is:
The process lasts nearly 35 seconds.
(3) If I don't cancel the client request, the response is:
The process lasts nearly 36 seconds. So when I cancel the client request during the reduce phase, it still takes about as long as when I don't cancel the request at all.