TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection #12176

rmuir · 2023-03-01T16:29:42Z

Description

TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary.
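For readers unfamiliar with the term, here is a minimal sketch of what a "ping-pong" intersection looks like. It does not use Lucene: a `TreeSet` stands in for the terms dictionary and `ceiling()` stands in for a seekCeil-style call. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class PingPongSketch {

    /** Returns the query terms present in the dictionary; both inputs share the same sort order. */
    static List<String> intersect(NavigableSet<String> dictionary, List<String> sortedQuery) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < sortedQuery.size()) {
            String wanted = sortedQuery.get(i);
            // "Ping": seek the dictionary to the smallest term >= wanted (stand-in for seekCeil).
            String found = dictionary.ceiling(wanted);
            if (found == null) {
                break; // nothing at or after this term, so no later query term can match either
            }
            if (found.equals(wanted)) {
                hits.add(wanted);
                i++;
            } else {
                // "Pong": skip every query term that sorts before the term the dictionary landed on.
                while (i < sortedQuery.size() && sortedQuery.get(i).compareTo(found) < 0) {
                    i++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> dictionary = new TreeSet<>(List.of("apple", "cherry", "grape", "melon"));
        List<String> query = List.of("banana", "cherry", "fig", "grape", "zucchini");
        System.out.println(intersect(dictionary, query)); // prints [cherry, grape]
    }
}
```

Each seek is an independent lookup here, which is the cost an automaton-driven intersection would try to reduce.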

Instead of a sorted list, it could possibly use a Daciuk-Mihov automaton, which can be built in linear time. The query could then leverage Terms.intersect (e.g. TermInSetQuery could be an AutomatonQuery subclass).

This should give faster intersection of the terms, which is usually the heavy part of this query. For example, the BlockTree terms dictionary has a very efficient Terms.intersect that makes use of the underlying structure.

The annoying part: DaciukMihovAutomatonBuilder currently requires Unicode strings and makes a UTF-32 automaton, which would then be converted to a UTF-8 (binary) automaton via UTF32ToUTF8. But I think TermInSetQuery may allow arbitrary non-Unicode binary strings?
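To illustrate why a Unicode-only builder is a problem for binary terms, the sketch below (plain JDK, no Lucene; names are hypothetical) checks whether a term's bytes survive being decoded to a `String` and re-encoded as UTF-8. Bytes that are not valid UTF-8 get replaced with U+FFFD during decoding, so the original term is lost.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryTermRoundTrip {

    /** True if the bytes are unchanged after decoding to a String and re-encoding as UTF-8. */
    static boolean survivesUnicodeRoundTrip(byte[] term) {
        // new String(byte[], Charset) replaces malformed input with the replacement character.
        String decoded = new String(term, StandardCharsets.UTF_8);
        return Arrays.equals(term, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] text = "lucene".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x01}; // not valid UTF-8
        System.out.println(survivesUnicodeRoundTrip(text));   // prints true
        System.out.println(survivesUnicodeRoundTrip(binary)); // prints false
    }
}
```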

In order to support arbitrary binary terms (and to avoid conversions), the DaciukMihov code would have to be modified to support construction of a binary automaton directly. Probably this is actually simpler?
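As a rough picture of what "a binary automaton built directly" could mean, here is a sketch that builds a deterministic automaton over raw bytes with no Unicode step. Note that this is a plain trie, not the Daciuk-Mihov construction: Daciuk-Mihov would additionally merge equivalent suffix states to keep the automaton minimal. Nothing here is Lucene API; `State`, `build`, and `accepts` are hypothetical names.

```java
import java.util.TreeMap;

public class ByteTrieSketch {

    /** One automaton state; transitions are keyed by unsigned byte value (0..255). */
    static final class State {
        final TreeMap<Integer, State> next = new TreeMap<>();
        boolean accept;
    }

    /** Builds a byte-level deterministic automaton (a trie) accepting exactly the given terms. */
    static State build(byte[][] terms) {
        State root = new State();
        for (byte[] term : terms) {
            State s = root;
            for (byte b : term) {
                // Mask to 0..255 so bytes >= 0x80 are labeled by their unsigned value.
                s = s.next.computeIfAbsent(b & 0xFF, k -> new State());
            }
            s.accept = true;
        }
        return root;
    }

    /** Runs the candidate through the automaton and reports whether it ends in an accept state. */
    static boolean accepts(State root, byte[] candidate) {
        State s = root;
        for (byte b : candidate) {
            s = s.next.get(b & 0xFF);
            if (s == null) {
                return false;
            }
        }
        return s.accept;
    }

    public static void main(String[] args) {
        byte[][] terms = {{(byte) 0xFF, 0x00}, {0x61}}; // first term is not valid UTF-8
        State root = build(terms);
        System.out.println(accepts(root, new byte[] {(byte) 0xFF, 0x00})); // prints true
        System.out.println(accepts(root, new byte[] {(byte) 0xFF}));       // prints false
    }
}
```

Because the labels are bytes rather than code points, terms that are not valid UTF-8 need no special handling.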

This is just an idea to get more performance; it hasn't been tested. Feel free to close the issue if it doesn't work out.


zhaih · 2023-03-08T21:42:13Z

Hey Robert, this is an interesting idea. One of the problems we're facing seems related to it:
we have ~200 terms from several fields and we're trying to do a big disjunction over them. One observation is that seekExact takes quite a big chunk of the time.
So I'm thinking that if this automaton-based approach is faster than sort-and-seek, we can probably gain some performance.
I'll try to look further into this idea.

rmuir · 2023-03-08T22:43:15Z

so, one thing is, Terms.intersect() works across a single field.
and you definitely have to sort before adding terms to DaciukMihov (but then it works in linear time).
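A small sketch of the sorting step for binary terms, assuming the builder expects unsigned byte order (the order Lucene's `BytesRef` comparison uses). It relies only on the JDK's `Arrays.compareUnsigned`; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class UnsignedTermSort {

    /** Returns a copy of the terms sorted lexicographically, treating each byte as 0..255. */
    static byte[][] sortedCopy(byte[][] terms) {
        byte[][] copy = terms.clone();
        Arrays.sort(copy, (a, b) -> Arrays.compareUnsigned(a, b));
        return copy;
    }

    public static void main(String[] args) {
        byte[][] terms = {{(byte) 0x80}, {0x7F}, {0x7F, 0x00}};
        for (byte[] term : sortedCopy(terms)) {
            System.out.println(Arrays.toString(term));
        }
        // Order is 7F, then 7F 00, then 80. A signed comparison would put 0x80 (-128) first.
    }
}
```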

Sounds like you are currently just "blasting" and not using the seekCeil aka "ping-pong intersection" that TermInSetQuery does.

The advantage Terms.intersect has over such a "ping-pong" intersection, is that the terms dictionary implementation can intersect the list of terms faster... without hitting the disk as much. I think it makes better use of blocktree's index structure. IIRC it basically made term intersection for a lot of queries 2x faster than "ping-pong" because of this.

rmuir added the type:task label Mar 1, 2023

rmuir mentioned this issue May 9, 2023

Expose iterator over query terms in TermInSetQuery #12280

Closed

This was referenced May 17, 2023

Minor cleanup and improvements to DaciukMihovAutomatonBuilder #12305

Merged

[DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery #12312

Draft
