[DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery #12312
base: main
Conversation
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java
thanks for getting this started! Will be interested to see how the use of … [comment truncated]
Here's what I'm seeing so far in benchmarking. I took a custom benchmarking approach for this, similar to #12151 and other related issues. I did this because, 1) we don't really have benchmark coverage in …

The benchmark indexes geonames data (~12MM records). It includes an ID field and a Country Code field (both postings and doc values for each). The ID field is a primary key. The benchmark tasks break down into:
For each task, there are four runs:
I ran both postings- and docvalues-based approaches since the term dictionary implementations are different. In general, the two approaches demonstrate similar latency characteristics, but the automaton approach is a bit worse on the PK field. I dug in a bit with a profiler, and I think we're just seeing the overhead of building the automaton. I suspect this overhead shows up because PK query processing is so cheap in general, whereas the other tasks are expensive enough to "hide" the overhead.

So, as of now, I don't see any performance benefits to moving to this approach, and possibly some regressions. On the other hand, it would be nice to move to this implementation so we could have codec-dependent intersection techniques, which would help address issues like #12280 (the bloom filter implementation could have a specific intersection implementation that leverages the bloom filter).

I'll try to run some benchmarks on our Amazon product search application next week just to gather some additional data points. Maybe we can learn more about how this technique might behave on another benchmark data set.

Here are the benchmark results (numbers are query time in ms):
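To illustrate the fixed setup cost discussed above: a toy sketch (not Lucene's actual code, and the class name is mine) of leapfrog-style intersection between a sorted set of query terms and a sorted term dictionary, the flavor of work a seek-based `TermInSetQuery` does before touching any postings. The intersection itself is linear in the two sorted inputs, which is why it can be dwarfed by the cost of the postings disjunction on non-PK fields:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedIntersect {
    // Leapfrog merge of two sorted string arrays; returns terms present in both.
    static List<String> intersect(String[] queryTerms, String[] dictionary) {
        List<String> hits = new ArrayList<>();
        int qi = 0, di = 0;
        while (qi < queryTerms.length && di < dictionary.length) {
            int cmp = queryTerms[qi].compareTo(dictionary[di]);
            if (cmp == 0) {
                hits.add(queryTerms[qi]);
                qi++;
                di++;
            } else if (cmp < 0) {
                qi++; // query term absent from the dictionary
            } else {
                di++; // skip dictionary terms below the current query term
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] query = {"DE", "FR", "US", "ZZ"};
        String[] dict = {"AU", "DE", "FR", "GB", "US"};
        System.out.println(intersect(query, dict)); // [DE, FR, US]
    }
}
```

An automaton-based intersection walks the term dictionary the same way conceptually, but first has to build the automaton, which is the per-query overhead the PK benchmark exposes.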
hmm, disappointing. Was hoping to see gains on the terms dictionary since it optimizes … Of course the docvalues impl doesn't optimize …
Hmm... not sure if I've got something set up incorrectly with my JFR settings, but when trying to dig into the other tasks, I can't even get the relevant methods to show up in the profiled calls. I attached a debugger and made sure the call flow is what I expect, but I'm not seeing any of the term intersection logic in what's getting profiled. My theory is that the term intersection is so trivial relative to the actual work of computing the postings disjunction that it's just not showing up in the samples, but maybe there's a setting I'm missing and it's not sampling at a fine enough grain? Not sure. But as far as I can tell, with the non-PK tasks, the term intersection may just be such a trivial part of the query cost that it doesn't matter which approach we take.
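For what it's worth, one knob to try for finer-grained sampling is overriding the execution-sample period in a custom JFR settings file (a sketch, not the settings actually used here; the event names are the standard JDK ones, and the 1 ms period is just an illustrative choice):

```
<!-- my-profile.jfc (sketch): enable method sampling with a shorter period
     than the bundled presets use, to catch short-lived frames. -->
<configuration version="2.0">
  <event name="jdk.ExecutionSample">
    <setting name="enabled">true</setting>
    <setting name="period">1 ms</setting>
  </event>
  <event name="jdk.NativeMethodSample">
    <setting name="enabled">true</setting>
    <setting name="period">1 ms</setting>
  </event>
</configuration>
```

It can then be passed to the benchmark JVM via `-XX:StartFlightRecording=settings=my-profile.jfc,filename=bench.jfr`. Even so, if the intersection frames are genuinely a tiny slice of wall time, they may still not accumulate meaningful samples.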
Description
I started experimenting with #12176 to see if we can get any benefits out of having `TermInSetQuery` extend `AutomatonQuery` instead of `MultiTermQuery`. I'm opening this only as a draft for now so that I can "show my work" alongside some benchmark results as I have them.
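To make the idea concrete, here's a self-contained toy (not the Lucene implementation: Lucene's `DaciukMihovAutomatonBuilder` produces a minimized DAFSA from sorted input, whereas this is a plain trie, and the class names are mine) of precompiling a fixed term set into a deterministic acceptor and then running membership checks against it. The constructor is the per-query construction cost that the PK benchmarks surface:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermSetAcceptor {
    private static final class State {
        final Map<Character, State> next = new HashMap<>();
        boolean accept;
    }

    private final State start = new State();

    // Construction is paid once per query; cost grows with total term length.
    TermSetAcceptor(List<String> terms) {
        for (String term : terms) {
            State s = start;
            for (char c : term.toCharArray()) {
                s = s.next.computeIfAbsent(c, k -> new State());
            }
            s.accept = true;
        }
    }

    // Each lookup walks one transition per character of the candidate term.
    boolean accepts(String term) {
        State s = start;
        for (char c : term.toCharArray()) {
            s = s.next.get(c);
            if (s == null) {
                return false;
            }
        }
        return s.accept;
    }

    public static void main(String[] args) {
        TermSetAcceptor a = new TermSetAcceptor(List.of("DE", "FR", "US"));
        System.out.println(a.accepts("FR")); // true
        System.out.println(a.accepts("GB")); // false
    }
}
```

The appeal of the `AutomatonQuery` route is that the acceptor becomes an opaque object the codec can intersect with its term dictionary however it likes, rather than the query driving term-by-term seeks itself.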