-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize count()
for BooleanQuery disjunction
#12358
Comments
+1 to update the |
I checked the repository. When Tantitvy says it is 2 times faster then this is also unfair: The benchmark does not warm up the JVM. It runs the list of queries in cold, unoptimized state - COLD! Very unfair, sorry. See code:
So basically it runs the queries on a cold JVM so the first queries in the list are all running in interpreted mode, especially when the new vector features or mmapdir are used. Did I oversee something? |
OK, it seems to have some warmup in the python code, but I did not find the number of warmup queries. |
It's called tantivy. We do have a warmup phase.
Yes, you did oversee something. |
I found the python code (see my 2nd comment), but where's the number of warmup queries? |
It is given as an environnement variable. You can find the default value in the Makefile. It is 1 round. 1 round means run the entire benchmark once. Also the results shown are always showing the minimum of the n runs. Not the mean. |
Ok thanks, so it executes all queries once and then starts over in the same JVM? - I hope it does not start a new JVM each round. The default default config of mike's luceneutil benchmark is also not optimal with that and needs tuning (there are some discussions in recent commits about vector support and mmap). Especially with the brand new optimizations in recent Lucene 9.7 commits it gets more and more important to give Lucene enough time to warm itsself up. Hotspot is very slow with applying those optos (that is what Rust does better) and due to the incubation state of those vector optos, the garbage collector has much to do in early phases. So it is important to let run Lucene for a few minutes to let GC clean up the "optimization garbage" till it starts measuring in the same JVM. If you start a new JVM, you have to wait the warmup time again. Unfortunately number of queries is not best in measuring this, it should be a mix of number of queries and execution time to do warmup. The problem of many Lucene tests on the web is that they start after 100 or 200 queries with measuring or restart JVMs. So please excuse my complaint as the pure Java code did not show how it warms up as it was hidden behind the python code. Still the mix of several languages makes it hard to understand. |
Yes it is the same JVM. Also, do you have an idea of what is the best JVM and the best settings we should use. |
@uschindler I think the benchmark is quite fair at this point, from what I can tell -- we've been iterating on that repo working to pull over luceneutil's enwiki corpus + tasks, ensuring the returned top N hits are very nearly identical, ensuring warmup + N iterations and take the fastest run is happening, measuring time on the server not the client, running both on a single segment index (makes results more comparable but not necessarily realistic to production cases), sprinkle 2% deletes on both indices, statically sorting both indices by the same Many thanks to @Tony-X for pushing so hard on this. It does indeed run a single JVM, but I an of the opposite opinion -- we need to run multiple JVM iterations to see how much hotspot noise is hurting. Darned java/hotspot coddling! The Rust results look quite consistent from one run to the next. It's also simple to run -- fork/clone the repo and follow the README steps. I'd love to see results from more diverse hardware than just my home beast boxes lol.
Ha! Java and Python and Rust! (Edit: oh and also Makefile!!) Start at client.py -- I think it's quite understandable if you launch off from there. More so than luceneutil!!
Oh -- sorry -- you mean no capitalization? So |
Ahh -- gotchya -- only if the impl will be constant time is it allowed to run, else OK so it's a bit tricky to add this opto to Lucene ... we need an API change of some sort. It's as if when |
I think @uschindler is referring to the recent changes to let Lucene use Panama (rough SIMD access from way up in javaland) if the user opts in at runtime. We could test this (add We should fix the JVM heap size I suppose -- I'll try that to see if it alters results. |
Hi. It looks like the DoQuery.java code does not do a parallel throughput measurement, but instead it runs all queries in a single thread one after each other with Nanotime before and after (thanks for the fix, Mike). So we measure exactly duration of each query. So we should use also ParallelGC. The default G1GC works better when you hammer a server multithreaded, but if there's only one thread doing queries, ParallelGC is better. Of course a real world benchmark should also measure throughput by hammering a server with hundreds of parallel queries (many more than there are CPU cores) to saturate all CPU cores. Of course in such throughout scenarios I have seen sometimes single queries taking long time, but you need to also look at percentiles then. I know that lucene is very good in throughput measurements. I know this comment goes too far and beyond this issue, but we should really look at other scenarios than measuring the duration a query takes. About vector: this does not apply here because there's no vector search involved. Still with modern Java version like jdk-20 the warmup time in combination with parallelgc is higher due to tiered compilation. Please use java 20 for benchmarks to also see benefits from mmap, especially with indexes optimized to one segment. Also enable parallel GC, although it's not real world, but the benchmark isn't, too. Please do not pass any extra JVM args, except GC and heap size. |
Ahh great catch @uschindler! I just made that change in the benchy.
Right -- we are only measuring single threaded query latency (just like luceneutil). I agree a red-line test (saturate CPU) is also really important -- we have an issue open for this. It's a bit trickier to do... Note that neither tantivy nor Lucene is using more than one thread for each query, nor any OS level concurrency like async IO (I think?), so the "lightly loaded single query latency" metric should be fair to compare across both engines. |
Here is is also missing, that's the code I looked at: https://github.com/Tony-X/search-benchmark-game/blob/4402d42c906830e85d8d79a30ae776f204ade770/engines/lucene-9.5.0/query.sh |
@mikemccand No no. There is no specific capitalization :). I was correcting tantivity -> tantivy. |
too late in evening and on smartphone. Whatever Google decided to be the correct term :-) |
OH! Ha, OK, phew :) |
Oh yeah that does look scary at first, but that is the It's invoked from Lucene's Makefile in that same directory for the |
Small correction as I noticed you are on Lucene 9.5:
|
The goal of the original benchmark is to identify performance headrooms in tantivy. BTW I suspect changing the GC won't have any effect on the min statistics, but it does not hurt to try. I'd love to see a fair bench of multithreaded search throughput / indexing of course. |
The problem is that Java's default GC G1 brings up to 10% decrease in speed due to addition of barriers, but cleans up garbage much faster with shorter pauses. This is fine for multithreaded apps, but not single threaded apps doing wall clock measurements. |
Just caught up on this thread -- the design tenet of the current benchmark game is to measure time taken to do the same work in contention-free environment. As of now I'm still trying to build trust of the benchmarks so thank you for your evaluation and feedbacks @uschindler ! So far I believe there are doing the "same" work as I have chased down a few tokenization issues. Right now the indexes on both side have --
Regarding the JVM here is what we do now
Admittedly we haven't looked into playing with different JVM arguments. @mikemccand thanks for creating Tony-X/search-benchmark-game#37 to explore the heap sizes :) IMO, GC here is less of an issue since we measure the best latency (min) across 10 runs for each query (a slight favor for JVM). The probability that every 10 of 10 run of the same query hit an GC is very tiny. It would be great to share your insights about an optimal JVM setting for this case. |
This is not used in benchmarks. This is more like a debugging tool, e.g. you can run that script and type some queries. |
Just to keep everyone accurate. the original misspelling was "tantitvy" not "tantivity"! :)
|
@uschindler I updated the bench of the search benchmarks with:
|
Hi, thanks for crosschecking. 1 hour warmup is therefor not changing anything. Anyways, I'd use a newer JDK like 20. |
After roughly catching up with the threads, I would like to go back to
So we probably can have two count API,
Does that makes sense? |
I was wondering about a possible alternative approach that would consist of handling this optimization more at the collector level. E.g. we could add a new method to LeafCollector to take a set of doc IDs (pseudo code): void collect(DocIdSet set) {
// default implementation
for (int docID : set) {
collect(docID);
}
} Then void collect(DocIdSet set) {
// total hit count implementation
count += set.size(); // internally uses Long#bitCount rather than iterating bits one by one
} |
I like this idea! Sort of like |
+1, I think this is a better/cleaner approach |
I like it better too but I think it's going to need more work, e.g. we will need to remove the |
Maybe we need a |
`Scorable#docID()` exposes the document that is being collected, which makes it impossible to bulk-collect multiple documents at once. Relates apache#12358
It felt to me like it would be doable to remove |
`Scorable#docID()` exposes the document that is being collected, which makes it impossible to bulk-collect multiple documents at once. Relates #12358
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches. On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits). Relates apache#12358
I opened a proof of concept for the idea that I suggested above at #12415. |
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches. On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits). Relates #12358
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches. On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits). Relates #12358
Description
Context: we (Amazon customer facing product search team, and also AWS) are attempting to understand the amazing performance Tantivy (Rust search engine) has over Lucene, iterating in this GitHub repo. That repo is sort of a merger of Lucene's benchmarking code (luceneutil), including its tasks and
enwiki
corpus, and the open source Tantivy benchmark. Tantivy is impressively fast :)This issue is a spinoff from this fascinating comment by @fulmicoton, creator and maintainer of Tantivy.
Tantivy optimizes
count()
forBooleanQuery
disjunctions much like Lucene'sBooleanScorer
, by scoring in a windowed bitset of N docs at once, and then pop-counting the set bits in each window. This is not technically a sub-linear implementation: it is still linear, but I suspect with a smaller constant factor than the defaultcount()
fallback Lucene implements.Perhaps, for all cases where
BooleanQuery
uses the windowedBooleanScorer
, we could also implement thiscount()
optimization.From my read of Lucene's
BooleanWeight.count
, I don't think Lucene has this optimization? Maybe we should port over Tantivy's optimization? It should make disjunctive counting quite a bit faster?The text was updated successfully, but these errors were encountered: