[DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery #12312
base: main
Conversation
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java
thanks for getting this started! Will be interested to see how the use of … [comment truncated]
Here's what I'm seeing so far in benchmarking. I took a custom benchmarking approach for this, similar to #12151 and other related issues. I did this because, 1) we don't really have benchmark coverage in …

The benchmark indexes geonames data (~12MM records). It includes an ID field and a Country Code field (both postings and doc values for each). The ID field is a primary key. The benchmark tasks break down into:
For each task, there are four runs:
I ran both postings- and docvalues-based approaches since the term dictionary implementations are different. In general, the two approaches demonstrate similar latency characteristics, but the automaton approach is a bit worse on the PK field. I dug in a bit with a profiler, and I think we're just seeing the overhead of building the automaton. I suspect this overhead shows up because PK query processing is so cheap in general, whereas the other tasks are expensive enough to "hide" the overhead.

So, as of now, I don't see any performance benefits to moving to this approach, and possibly some regressions. On the other hand, it would be nice to move to this implementation so we could have codec-dependent intersection techniques, which would help address issues like #12280 (the bloom filter implementation could have a specific intersection implementation that leverages the bloom filter).

I'll try to run some benchmarks on our Amazon product search application next week just to gather some additional data points. Maybe we can learn more about how this technique might behave on another benchmark data set.

Here are the benchmark results (numbers are query time in ms):
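To illustrate the fixed setup cost discussed above: a toy sketch (not Lucene's actual code, and the class name is mine) of leapfrog-style intersection between a sorted set of query terms and a sorted term dictionary, the flavor of work a seek-based `TermInSetQuery` does before touching any postings. The intersection itself is linear in the two sorted inputs, which is why it can be dwarfed by the cost of the postings disjunction on non-PK fields:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedIntersect {
    // Leapfrog merge of two sorted string arrays; returns terms present in both.
    static List<String> intersect(String[] queryTerms, String[] dictionary) {
        List<String> hits = new ArrayList<>();
        int qi = 0, di = 0;
        while (qi < queryTerms.length && di < dictionary.length) {
            int cmp = queryTerms[qi].compareTo(dictionary[di]);
            if (cmp == 0) {
                hits.add(queryTerms[qi]);
                qi++;
                di++;
            } else if (cmp < 0) {
                qi++; // query term absent from the dictionary
            } else {
                di++; // skip dictionary terms below the current query term
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] query = {"DE", "FR", "US", "ZZ"};
        String[] dict = {"AU", "DE", "FR", "GB", "US"};
        System.out.println(intersect(query, dict)); // [DE, FR, US]
    }
}
```

An automaton-based intersection walks the term dictionary the same way conceptually, but first has to build the automaton, which is the per-query overhead the PK benchmark exposes.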
hmm, disappointing. Was hoping to see gains on the terms dictionary since it optimizes … Of course the docvalues impl doesn't optimize …
Hmm... not sure if I've got something set up incorrectly with my JFR settings, but when trying to dig into the other tasks, I can't even get the relevant methods to show up in the profiled calls. I attached a debugger and made sure the call flow is what I expect, but I'm not seeing any of the term intersection logic in what's getting profiled. My theory is that the term intersection is so trivial relative to the actual work of computing the postings disjunction that it's just not showing up in the samples, but maybe there's a setting I'm missing and it's not sampling at a fine enough grain? Not sure. But as far as I can tell, with the non-PK tasks, the term intersection may just be such a trivial part of the query cost that it doesn't matter which approach we take.
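For what it's worth, one knob to try for finer-grained sampling is overriding the execution-sample period in a custom JFR settings file (a sketch, not the settings actually used here; the event names are the standard JDK ones, and the 1 ms period is just an illustrative choice):

```
<!-- my-profile.jfc (sketch): enable method sampling with a shorter period
     than the bundled presets use, to catch short-lived frames. -->
<configuration version="2.0">
  <event name="jdk.ExecutionSample">
    <setting name="enabled">true</setting>
    <setting name="period">1 ms</setting>
  </event>
  <event name="jdk.NativeMethodSample">
    <setting name="enabled">true</setting>
    <setting name="period">1 ms</setting>
  </event>
</configuration>
```

It can then be passed to the benchmark JVM via `-XX:StartFlightRecording=settings=my-profile.jfc,filename=bench.jfr`. Even so, if the intersection frames are genuinely a tiny slice of wall time, they may still not accumulate meaningful samples.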
Description
I started experimenting with #12176 to see if we can get any benefits out of having `TermInSetQuery` extend `AutomatonQuery` instead of `MultiTermQuery`. I'm opening this only as a draft for now so that I can "show my work" alongside some benchmark results as I have them.
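To make the idea concrete, here's a self-contained toy (not the Lucene implementation: Lucene's `DaciukMihovAutomatonBuilder` produces a minimized DAFSA from sorted input, whereas this is a plain trie, and the class names are mine) of precompiling a fixed term set into a deterministic acceptor and then running membership checks against it. The constructor is the per-query construction cost that the PK benchmarks surface:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermSetAcceptor {
    private static final class State {
        final Map<Character, State> next = new HashMap<>();
        boolean accept;
    }

    private final State start = new State();

    // Construction is paid once per query; cost grows with total term length.
    TermSetAcceptor(List<String> terms) {
        for (String term : terms) {
            State s = start;
            for (char c : term.toCharArray()) {
                s = s.next.computeIfAbsent(c, k -> new State());
            }
            s.accept = true;
        }
    }

    // Each lookup walks one transition per character of the candidate term.
    boolean accepts(String term) {
        State s = start;
        for (char c : term.toCharArray()) {
            s = s.next.get(c);
            if (s == null) {
                return false;
            }
        }
        return s.accept;
    }

    public static void main(String[] args) {
        TermSetAcceptor a = new TermSetAcceptor(List.of("DE", "FR", "US"));
        System.out.println(a.accepts("FR")); // true
        System.out.println(a.accepts("GB")); // false
    }
}
```

The appeal of the `AutomatonQuery` route is that the acceptor becomes an opaque object the codec can intersect with its term dictionary however it likes, rather than the query driving term-by-term seeks itself.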