
Concurrent Searching #80693

Closed
LifeIsStrange opened this issue Nov 14, 2021 · 5 comments · Fixed by #101230
Assignees
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@LifeIsStrange

LifeIsStrange commented Nov 14, 2021

It looks like Elasticsearch currently lacks support for concurrent searching, and this seems to be a major prospective optimization for reducing latency.

Digression: this is not the topic at hand, but I personally consider the OpenSearch fork a historical tragedy of duplicated work and segregated improvements. What would be even more anti-utilitarian would be for the projects to blindly ignore each other's advances towards making better search.

Lucene's concurrent search APIs are probably among the most important features/optimizations currently not leveraged by Elasticsearch.
The OpenSearch fork has a pull request implementing support for them.

See also the corresponding issue

I believe it would be great for end users if you could take inspiration from the PR and work on this feature.

Great blog post about the Lucene feature:
https://blog.mikemccandless.com/2019/10/concurrent-query-execution-in-apache.html?m=1
See also these performance results:
https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html
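For reference, the Lucene API in question is the IndexSearcher constructor that takes an Executor, which lets the searcher collect segment slices concurrently. A minimal sketch (the pool size and index path are placeholders):

```java
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ConcurrentSearchSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4); // placeholder pool size
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
            // With an executor, the searcher partitions segments into slices
            // and collects them concurrently instead of one segment at a time.
            IndexSearcher searcher = new IndexSearcher(reader, executor);
            TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10);
            System.out.println("total hits: " + hits.totalHits);
        } finally {
            executor.shutdown();
        }
    }
}
```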

@LifeIsStrange LifeIsStrange added >enhancement needs:triage Requires assignment of a team area label labels Nov 14, 2021
@nik9000 nik9000 added :Search/Search Search-related issues that do not fall into other categories team-discuss and removed needs:triage Requires assignment of a team area label labels Nov 18, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Nov 18, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@mayya-sharipova
Contributor

mayya-sharipova commented Nov 18, 2021

This has been discussed multiple times before, and here is a good summary of why we have not done it so far.

In short, we focus parallelization on processing multiple requests concurrently, thus increasing throughput, rather than on parallelizing a single request.

@LifeIsStrange I am curious: what is your use case where you need to improve the latency of a single request?

@LifeIsStrange
Author

LifeIsStrange commented Nov 18, 2021

I am not very qualified to defend the optimization, but maybe @mikemccand could provide some justification?

  1. Improving the latency of a single very complex request seems like a real use case for niche advanced uses (maybe some kinds of scientific computing?). Actually it's not just about single queries: it should benefit any project that has complex queries in relatively low volume (and most websites are in fact likely to have a relatively low number of search queries, which can nevertheless be arbitrarily complex).
  2. Inter-query and intra-query parallelism seem to me like complementary problems, and having parallelism at both levels should in theory yield better latency for common workloads.
  3. Sure, intra-query parallelism might in some cases add overhead, but it could be an opt-in optimization applied only in a fine-grained manner. I don't know if Elasticsearch/Lucene has a performance cost model (a popular feature of e.g. compilers, which keep a time/resource budget for optimizations and estimate the time/complexity of each one). If the complexity of a query could be estimated (not only a priori: the estimate can be refined empirically and dynamically through measurements, like PGO), then intra-query parallelism would be applied only when the complexity score of a query matches a certain pattern or exceeds a provided or computed threshold. You could even attempt speculative optimization (try making a query concurrent; if that does not yield an improvement, fall back to a single thread or adjust the number of threads), like a JIT. See the sketch after this list.
  4. For completeness' sake, an option to disable inter-query parallelism would be nice, or the ability to give precedence to one over the other depending on the user's latency vs. throughput preference.
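To make point 3 concrete, here is a hypothetical sketch of such a cost-threshold gate; the class, method, and threshold value are all illustrative, not an existing Elasticsearch/Lucene API:

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical gate: use intra-query parallelism only when the estimated
// cost of a query justifies the threading overhead.
final class ConcurrencyGate {
    private static final double COST_THRESHOLD = 10_000.0; // illustrative; refine empirically

    private final Executor intraQueryExecutor = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());

    /**
     * Returns the executor when the query looks expensive enough, or null to
     * signal plain single-threaded execution (Lucene's IndexSearcher accepts
     * a null executor and then runs sequentially).
     */
    Executor executorFor(double estimatedQueryCost) {
        return estimatedQueryCost >= COST_THRESHOLD ? intraQueryExecutor : null;
    }
}
```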

Note btw that the cost of parallelism can be much lower than that of Java platform threads, either by leveraging a minimal amount of Kotlin code (using coroutines) or through the soon-to-come Loom green (user-space) threads.
Off-topic: I really think ELK should consider leveraging the Vector API.

Edit:
So I have read the original justification, and:

We have found that combining concurrent requests with parallel segment readers does not offer any performance benefits and actually can hurt performance because of the cost of context switching so to keep the model simple we prefer to concentrate on keeping each search request processed on a single thread but support processing multiple search requests concurrently.

Firstly, this was tested in 2018, and Lucene releases frequently optimize/lower the cost of intra-query parallelism.
Secondly, what was (probably) tested was using both kinds of parallelism to their maximum, instead of the various optimizations I propose: apply intra-query concurrency only based on a query-complexity threshold and a budget of available resources, refine the cost heuristic empirically and dynamically through measurements, and do speculative optimization (or at the very least automatically shrink the number of threads, or remove all intra-query parallelism for a given query, if and only if it doesn't provide a speedup). With those optimizations, the result might be a significant performance improvement.
Moreover, in any case I believe the feature should at least be exposed to the end user as an opt-in, as I am pretty sure there are use cases where it leads to performance improvements (maybe with both kinds of parallelism, or by disabling inter-query parallelism). As a reminder, see this Yelp benchmark.
BTW the current WIP prototype (which does not implement the additional optimizations I mention) shows promising results.

@LifeIsStrange
Author

@javanna I see it has been privately discussed in March; any feedback about what was said?

BTW, as a little update to my original comment:

  1. the OpenSearch implementation has been merged
  2. Java virtual threads are here since the stable Java 19 release (as a feature preview; contrary to an incubator JEP, a preview should be expected to have a mostly stable API). Therefore (although I would prefer Kotlin) you can already test virtual threads as a way to diminish the cost of simultaneous inter- and intra-query parallelism; see the sketch after this list. Note that the existing inter-query parallelism could still use regular threads.
    But as a reminder, as explained in my original comment, even the cost of regular threads is a non-issue if you use a dynamic time/complexity allocation budget, with automatic shrinkage, optionally speculative (growable) threading, and ideally empirical real-time metrics feedback (such as total CPU use vs. throughput/latency evolution). One could even implement a cache of the optimal configuration for a given query. Say you have a recurrent complex query, and after such monitoring it has been determined that using X (virtual or not) threads for intra-query parallelism and Y for inter-query parallelism is the fastest configuration found for that query (e.g. 50% faster than inter-query parallelism alone); the bonus idea would be to cache this configuration (key: query UUID, value: optimal configuration for that query, or hints).
    Then (if the cache has not been invalidated/expired), when the same query is sent in the next hour, we can directly use the optimal configuration rather than wait again for the metrics to rediscover it.
    Note this optimization is less important than the others mentioned.
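A minimal sketch of point 2, assuming a Lucene IndexSearcher and an already-open reader; on Java 19/20 this requires the --enable-preview flag:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;

class VirtualThreadSearchSketch {
    static TopDocs search(IndexReader reader) throws Exception {
        // Each concurrent slice runs on a cheap virtual thread rather than a
        // pooled platform thread (ExecutorService is AutoCloseable since Java 19).
        try (ExecutorService virtualExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
            IndexSearcher searcher = new IndexSearcher(reader, virtualExecutor);
            return searcher.search(new MatchAllDocsQuery(), 10);
        }
    }
}
```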

@javanna
Member

javanna commented Dec 14, 2022

@LifeIsStrange not much to add; this is quite high on the list of things we want to do. The list is long :) We'd prefer not to compromise by applying concurrency only to the query: we want it to cover aggregations as well.

@javanna javanna self-assigned this Apr 26, 2023
javanna added a commit to javanna/elasticsearch that referenced this issue Aug 4, 2023
Elasticsearch has historically performed search sequentially across the segments.
Lucene supports parallelizing search across segments when collecting hits (via collector managers)
as well as when rewriting certain queries (e.g. knn queries).

Elasticsearch is now ready to support concurrency within a single shard
too. Search is already performed using collector managers, and the last
piece that is missing is providing an executor to the index searcher
so that it can offload the concurrent computation to it.

This commit introduces a secondary executor, used exclusively to execute
the concurrent bits of search. The search threads are still the ones that
coordinate the search (where the caller search will originate from), but
the actual work will be offloaded to the newly introduced executor.

We are offloading not only parallel execution but also sequential
execution, to make the workload more predictable, as it would be
surprising to have bits of search executed in either of the two thread pools.
Also, that would introduce the possibility to suddenly run a higher
amount of heavy operations overall (some in the caller thread and some
in the separate threads), which could overload the system as well as
make sizing of thread pools more difficult.

Note that fetch, together with other actions, is still executed in the search thread pool. This commit does not make the search thread pool merely a coordinating-only thread pool; it does so only for the IndexSearcher#search operation itself, which is nevertheless a big portion of the different phases of search API execution.

Given that the searcher blocks waiting for all tasks to be completed,
we take a simple approach of introducing a bounded executor, which
blocks whenever there are no threads to directly execute tasks. This simplifies
handling of thread pool queue and rejections. In fact, we can guarantee
that the secondary thread pool won't reject, and delegate queueing
entirely to the search thread pool which is the entry point for every
search operation anyway. The principle behind this is that if you got a
slot in the search thread pool, you should be able to complete your
search.
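In executor terms, the bounded, blocking behaviour described above amounts to something like the following sketch (an illustration, not the actual Elasticsearch implementation): submissions block on a semaphore until a worker is free, so the pool never queues unboundedly and never rejects.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

final class BlockingBoundedExecutor implements Executor {
    private final ExecutorService delegate;
    private final Semaphore permits;

    BlockingBoundedExecutor(int maxConcurrency) {
        this.delegate = Executors.newFixedThreadPool(maxConcurrency);
        this.permits = new Semaphore(maxConcurrency);
    }

    @Override
    public void execute(Runnable task) {
        permits.acquireUninterruptibly(); // block the caller until a worker is free
        delegate.execute(() -> {
            try {
                task.run();
            } finally {
                permits.release();
            }
        });
    }
}
```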

As part of this commit we are also introducing the ability to cancel
tasks that have not started yet, so that if any task throws an exception,
other tasks are prevented from starting needless computation.
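A sketch of that cancellation pattern (illustrative, not the actual code): the first failure cancels every task that has not started yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class CancelOnFailure {
    static void runAll(ExecutorService executor, List<Runnable> tasks) throws Exception {
        List<Future<?>> futures = new ArrayList<>();
        for (Runnable task : tasks) {
            futures.add(executor.submit(task));
        }
        try {
            for (Future<?> future : futures) {
                future.get(); // propagate the first failure
            }
        } catch (ExecutionException e) {
            // cancel(false): tasks still queued never start; running ones finish
            for (Future<?> future : futures) {
                future.cancel(false);
            }
            throw e;
        }
    }
}
```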

This commit also enables concurrent search execution in the DFS phase, which is going to improve resource usage as well as the performance of knn queries, which benefit from both concurrent rewrite and collection.

We will enable concurrent execution for the query phase in a subsequent
commit.

Relates to elastic#80693
Relates to elastic#90700
javanna added a commit that referenced this issue Aug 10, 2023
This commit enables concurrent search execution in the DFS phase, which is going to improve resource usage as well as performance of knn queries which benefit from both concurrent rewrite and collection.

We will enable concurrent execution for the query phase in a subsequent commit. While this commit does not introduce parallelism for the query phase, it introduces offloading of sequential computation to the newly introduced executor. This is true both for situations where a single slice needs to be searched and for scenarios where a specific request does not support concurrency (currently only the DFS phase does, regardless of the request). Sequential collection is not offloaded only if the request includes aggregations that don't support offloading: composite, nested and cardinality, as their post-collection method must be executed in the same thread as the collection, or we'll trip a Lucene assertion that verifies that doc_values are pulled and consumed from the same thread.

## Technical details

This commit introduces a secondary executor, used exclusively to execute the concurrent bits of search. The search threads are still the ones that coordinate the search (where the caller search will originate from), but the actual work will be offloaded to the newly introduced executor.

We are offloading not only parallel execution but also sequential execution, to make the workload more predictable, as it would be surprising to have bits of search executed in either of the two thread pools. Also, that would introduce the possibility to suddenly run a higher amount of heavy operations overall (some in the caller thread and some in the separate threads), which could overload the system as well as make sizing of thread pools more difficult.

Note that fetch, together with other actions, is still executed in the search thread pool. This commit does not make the search thread pool merely a coordinating-only thread pool; it does so only for the IndexSearcher#search operation itself, which is nevertheless a big portion of the different phases of search API execution.

Given that the searcher blocks waiting for all tasks to be completed, we take a simple approach of introducing a thread pool executor that has the same size as the existing search thread pool but relies on an unbounded queue. This simplifies handling of thread pool queue and rejections. In fact, we'd like to guarantee that the secondary thread pool won't reject, and delegate queuing entirely to the search thread pool which is the entry point for every search operation anyway. The principle behind this is that if you got a slot in the search thread pool, you should be able to complete your search, and rather quickly.
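In executor terms, this amounts to something like the following sketch (illustrative, not the actual Elasticsearch code; the pool size would mirror the search thread pool's):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class SearchWorkerPool {
    // Same thread count as the search pool (sizing here is illustrative), but an
    // unbounded queue, so this secondary pool never rejects: queueing and
    // rejection remain the responsibility of the search thread pool itself.
    static ExecutorService create(int searchPoolSize) {
        return new ThreadPoolExecutor(
                searchPoolSize, searchPoolSize,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }
}
```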

As part of this commit we are also introducing the ability to cancel tasks that have not started yet, so that if any task throws an exception, other tasks are prevented from starting needless computation.

Relates to #80693
Relates to #90700
csoulios pushed a commit to csoulios/elasticsearch that referenced this issue Aug 18, 2023