
Concurrent Searching #80693

Closed
LifeIsStrange opened this issue Nov 14, 2021 · 5 comments · Fixed by #101230
Assignees
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@LifeIsStrange

LifeIsStrange commented Nov 14, 2021

It looks like Elasticsearch currently lacks support for concurrent searching, and this seems to be a major prospective optimization for reducing latency.

Digression: this is not the topic at hand, but I personally consider the OpenSearch fork a historical tragedy of duplicated work and segregated improvements. What would be even more anti-utilitarian would be for the projects to blindly ignore each other's advances towards making better search.

Lucene's concurrent search APIs are probably among the most important features/optimizations currently not leveraged by Elasticsearch.
The OpenSearch fork has a pull request implementing support for them.

See also the corresponding issue

I believe it would be great for end users if you could take inspiration from the PR and work on this feature.

Great blog post about the Lucene feature:
https://blog.mikemccandless.com/2019/10/concurrent-query-execution-in-apache.html?m=1
See also these performance results:
https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html
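For reference, the Lucene API in question is the IndexSearcher constructor that takes an Executor, which lets the searcher collect segment slices concurrently. A minimal sketch (the pool size and index path are placeholders):

```java
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ConcurrentSearchSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4); // placeholder pool size
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
            // With an executor, the searcher partitions segments into slices
            // and collects them concurrently instead of one segment at a time.
            IndexSearcher searcher = new IndexSearcher(reader, executor);
            TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10);
            System.out.println("total hits: " + hits.totalHits);
        } finally {
            executor.shutdown();
        }
    }
}
```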

@LifeIsStrange LifeIsStrange added >enhancement needs:triage Requires assignment of a team area label labels Nov 14, 2021
@nik9000 nik9000 added :Search/Search Search-related issues that do not fall into other categories team-discuss and removed needs:triage Requires assignment of a team area label labels Nov 18, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Nov 18, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@mayya-sharipova
Contributor

mayya-sharipova commented Nov 18, 2021

This has been discussed multiple times before, and here is a good summary of why we have not done it so far.

In short, we focus parallelization on processing multiple requests concurrently, thus increasing throughput, rather than on parallelizing a single request.

@LifeIsStrange I am curious: what is your use case where you need to improve the latency of a single request?

@LifeIsStrange
Author

LifeIsStrange commented Nov 18, 2021

I am not very qualified to defend the optimization, but maybe @mikemccand could provide some justification?

  1. Improving the latency of a single very complex request seems like a real use case for niche advanced uses (maybe some kinds of scientific computing?). Actually it's not just about single queries: it should benefit any project that has complex queries in relatively low volume (and most websites are in fact likely to have a relatively low number of search queries, which can nevertheless be arbitrarily complex).
  2. Inter-query and intra-query parallelism seem to me like complementary problems, and having parallelism at both levels should in theory yield better latency for common workloads.
  3. Sure, intra-query parallelism might in some cases add overhead, but it could be an opt-in optimization applied only in a fine-grained manner. I don't know if Elasticsearch/Lucene has a performance cost model (a popular feature of e.g. compilers, which keep a time/resource budget for optimizations and estimate the time/complexity of each one). If the complexity of a query could be estimated (not only a priori: the estimate can be refined empirically and dynamically through measurements, like PGO), then intra-query parallelism would be applied only when the complexity score of a query matches a certain pattern or exceeds a provided or computed threshold. You could even attempt speculative optimization (try making a query concurrent; if that does not yield an improvement, fall back to a single thread or adjust the number of threads), like a JIT. See the sketch after this list.
  4. For completeness' sake, an option to disable inter-query parallelism would be nice, or the ability to give precedence to one over the other depending on the user's latency vs. throughput preference.
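To make point 3 concrete, here is a hypothetical sketch of such a cost-threshold gate; the class, method, and threshold value are all illustrative, not an existing Elasticsearch/Lucene API:

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical gate: use intra-query parallelism only when the estimated
// cost of a query justifies the threading overhead.
final class ConcurrencyGate {
    private static final double COST_THRESHOLD = 10_000.0; // illustrative; refine empirically

    private final Executor intraQueryExecutor = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());

    /**
     * Returns the executor when the query looks expensive enough, or null to
     * signal plain single-threaded execution (Lucene's IndexSearcher accepts
     * a null executor and then runs sequentially).
     */
    Executor executorFor(double estimatedQueryCost) {
        return estimatedQueryCost >= COST_THRESHOLD ? intraQueryExecutor : null;
    }
}
```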

Note btw that the cost of parallelism can be much lower than that of Java platform threads, either by leveraging a minimal amount of Kotlin code (using coroutines) or through the soon-to-come Loom green (user-space) threads.
Off-topic: I really think ELK should consider leveraging the Vector API.

Edit:
So I have read the original justification, and:

We have found that combining concurrent requests with parallel segment readers does not offer any performance benefits and actually can hurt performance because of the cost of context switching so to keep the model simple we prefer to concentrate on keeping each search request processed on a single thread but support processing multiple search requests concurrently.

Firstly, this was tested in 2018, and Lucene releases frequently optimize/lower the cost of intra-query parallelism.
Secondly, what was (probably) tested was using both kinds of parallelism to their maximum, instead of the various optimizations I propose: apply intra-query concurrency only based on a query-complexity threshold and a budget of available resources, refine the cost heuristic empirically and dynamically through measurements, and do speculative optimization (or at the very least automatically shrink the number of threads, or remove all intra-query parallelism for a given query, if and only if it doesn't provide a speedup). With those optimizations, the result might be a significant performance improvement.
Moreover, in any case I believe the feature should at least be exposed to the end user as an opt-in, as I am pretty sure there are use cases where it leads to performance improvements (maybe with both kinds of parallelism, or by disabling inter-query parallelism). As a reminder, see this Yelp benchmark.
BTW the current WIP prototype (which does not implement the additional optimizations I mention) shows promising results.

@LifeIsStrange
Author

@javanna I see it has been privately discussed in March; any feedback about what was said?

BTW, as a little update to my original comment:

  1. the OpenSearch implementation has been merged
  2. Java virtual threads are here since the stable Java 19 release (as a feature preview; contrary to an incubator JEP, a preview should be expected to have a mostly stable API). Therefore (although I would prefer Kotlin) you can already test virtual threads as a way to diminish the cost of simultaneous inter- and intra-query parallelism; see the sketch after this list. Note that the existing inter-query parallelism could still use regular threads.
    But as a reminder, as explained in my original comment, even the cost of regular threads is a non-issue if you use a dynamic time/complexity allocation budget, with automatic shrinkage, optionally speculative (growable) threading, and ideally empirical real-time metrics feedback (such as total CPU use vs. throughput/latency evolution). One could even implement a cache of the optimal configuration for a given query. Say you have a recurrent complex query, and after such monitoring it has been determined that using X (virtual or not) threads for intra-query parallelism and Y for inter-query parallelism is the fastest configuration found for that query (e.g. 50% faster than inter-query parallelism alone); the bonus idea would be to cache this configuration (key: query UUID, value: optimal configuration for that query, or hints).
    Then (if the cache has not been invalidated/expired), when the same query is sent in the next hour, we can directly use the optimal configuration rather than wait again for the metrics to rediscover it.
    Note this optimization is less important than the others mentioned.
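A minimal sketch of point 2, assuming a Lucene IndexSearcher and an already-open reader; on Java 19/20 this requires the --enable-preview flag:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;

class VirtualThreadSearchSketch {
    static TopDocs search(IndexReader reader) throws Exception {
        // Each concurrent slice runs on a cheap virtual thread rather than a
        // pooled platform thread (ExecutorService is AutoCloseable since Java 19).
        try (ExecutorService virtualExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
            IndexSearcher searcher = new IndexSearcher(reader, virtualExecutor);
            return searcher.search(new MatchAllDocsQuery(), 10);
        }
    }
}
```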

@javanna
Member

javanna commented Dec 14, 2022

@LifeIsStrange not much to add; this is quite high on the list of things we want to do. The list is long :) We'd prefer not to compromise by applying concurrency only to the query: we want it to cover aggregations as well.

@javanna javanna self-assigned this Apr 26, 2023
javanna added a commit to javanna/elasticsearch that referenced this issue Aug 4, 2023
Elasticsearch has historically performed search sequentially across the segments.
Lucene supports parallelizing search across segments when collecting hits (via collector managers)
as well as when rewriting certain queries (e.g. knn queries).

Elasticsearch is now ready to support concurrency within a single shard
too. Search is already performed using collector managers, and the last
piece that is missing is providing an executor to the index searcher
so that it can offload the concurrent computation to it.

This commit introduces a secondary executor, used exclusively to execute
the concurrent bits of search. The search threads are still the ones that
coordinate the search (where the caller search will originate from), but
the actual work will be offloaded to the newly introduced executor.

We are offloading not only parallel execution but also sequential
execution, to make the workload more predictable, as it would be
surprising to have bits of search executed in either of the two thread pools.
Also, that would introduce the possibility to suddenly run a higher
amount of heavy operations overall (some in the caller thread and some
in the separate threads), which could overload the system as well as
make sizing of thread pools more difficult.

Note that fetch, together with other actions, is still executed in the search thread pool. This commit does not make the search thread pool merely a coordinating-only thread pool; it does so only for the IndexSearcher#search operation itself, which is nevertheless a big portion of the different phases of search API execution.

Given that the searcher blocks waiting for all tasks to be completed,
we take a simple approach of introducing a bounded executor, which
blocks whenever there are no threads to directly execute tasks. This simplifies
handling of thread pool queue and rejections. In fact, we can guarantee
that the secondary thread pool won't reject, and delegate queueing
entirely to the search thread pool which is the entry point for every
search operation anyway. The principle behind this is that if you got a
slot in the search thread pool, you should be able to complete your
search.
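In executor terms, the bounded, blocking behaviour described above amounts to something like the following sketch (an illustration, not the actual Elasticsearch implementation): submissions block on a semaphore until a worker is free, so the pool never queues unboundedly and never rejects.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

final class BlockingBoundedExecutor implements Executor {
    private final ExecutorService delegate;
    private final Semaphore permits;

    BlockingBoundedExecutor(int maxConcurrency) {
        this.delegate = Executors.newFixedThreadPool(maxConcurrency);
        this.permits = new Semaphore(maxConcurrency);
    }

    @Override
    public void execute(Runnable task) {
        permits.acquireUninterruptibly(); // block the caller until a worker is free
        delegate.execute(() -> {
            try {
                task.run();
            } finally {
                permits.release();
            }
        });
    }
}
```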

As part of this commit we are also introducing the ability to cancel
tasks that have not started yet, so that if any task throws an exception,
other tasks are prevented from starting needless computation.
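A sketch of that cancellation pattern (illustrative, not the actual code): the first failure cancels every task that has not started yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class CancelOnFailure {
    static void runAll(ExecutorService executor, List<Runnable> tasks) throws Exception {
        List<Future<?>> futures = new ArrayList<>();
        for (Runnable task : tasks) {
            futures.add(executor.submit(task));
        }
        try {
            for (Future<?> future : futures) {
                future.get(); // propagate the first failure
            }
        } catch (ExecutionException e) {
            // cancel(false): tasks still queued never start; running ones finish
            for (Future<?> future : futures) {
                future.cancel(false);
            }
            throw e;
        }
    }
}
```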

This commit also enables concurrent search execution in the DFS phase, which is going to improve resource usage as well as the performance of knn queries, which benefit from both concurrent rewrite and collection.

We will enable concurrent execution for the query phase in a subsequent
commit.

Relates to elastic#80693
Relates to elastic#90700
javanna added a commit that referenced this issue Aug 10, 2023
This commit enables concurrent search execution in the DFS phase, which is going to improve resource usage as well as performance of knn queries which benefit from both concurrent rewrite and collection.

We will enable concurrent execution for the query phase in a subsequent commit. While this commit does not introduce parallelism for the query phase, it introduces offloading of sequential computation to the newly introduced executor. This is true both for situations where a single slice needs to be searched and for scenarios where a specific request does not support concurrency (currently only the DFS phase does, regardless of the request). Sequential collection is not offloaded only if the request includes aggregations that don't support offloading: composite, nested and cardinality, as their post-collection method must be executed in the same thread as the collection, or we'll trip a Lucene assertion that verifies that doc_values are pulled and consumed from the same thread.

## Technical details

This commit introduces a secondary executor, used exclusively to execute the concurrent bits of search. The search threads are still the ones that coordinate the search (where the caller search will originate from), but the actual work will be offloaded to the newly introduced executor.

We are offloading not only parallel execution but also sequential execution, to make the workload more predictable, as it would be surprising to have bits of search executed in either of the two thread pools. Also, that would introduce the possibility to suddenly run a higher amount of heavy operations overall (some in the caller thread and some in the separate threads), which could overload the system as well as make sizing of thread pools more difficult.

Note that fetch, together with other actions, is still executed in the search thread pool. This commit does not make the search thread pool merely a coordinating-only thread pool; it does so only for the IndexSearcher#search operation itself, which is nevertheless a big portion of the different phases of search API execution.

Given that the searcher blocks waiting for all tasks to be completed, we take a simple approach of introducing a thread pool executor that has the same size as the existing search thread pool but relies on an unbounded queue. This simplifies handling of thread pool queue and rejections. In fact, we'd like to guarantee that the secondary thread pool won't reject, and delegate queuing entirely to the search thread pool which is the entry point for every search operation anyway. The principle behind this is that if you got a slot in the search thread pool, you should be able to complete your search, and rather quickly.
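In executor terms, this amounts to something like the following sketch (illustrative, not the actual Elasticsearch code; the pool size would mirror the search thread pool's):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class SearchWorkerPool {
    // Same thread count as the search pool (sizing here is illustrative), but an
    // unbounded queue, so this secondary pool never rejects: queueing and
    // rejection remain the responsibility of the search thread pool itself.
    static ExecutorService create(int searchPoolSize) {
        return new ThreadPoolExecutor(
                searchPoolSize, searchPoolSize,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }
}
```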

As part of this commit we are also introducing the ability to cancel tasks that have not started yet, so that if any task throws an exception, other tasks are prevented from starting needless computation.

Relates to #80693
Relates to #90700
csoulios pushed a commit to csoulios/elasticsearch that referenced this issue Aug 18, 2023