Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection #12176

Open
rmuir opened this issue Mar 1, 2023 · 2 comments

Comments

@rmuir
Copy link
Member

rmuir commented Mar 1, 2023

Description

TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary.

Instead of sorted-list, it could possibly use Daciuk Mihov Automaton, which can be built in linear time. Then query could leverage Terms.intersect (e.g. TermInSetQuery could be an AutomatonQuery subclass).

This should give faster intersection of the terms, which is usually the heavy part of this query. For example BlockTree terms dictionary has a very efficient Terms.intersect that makes use of the underlying structure.

The annoying part: DaciukMihovAutomatonBuilder currently requires unicode strings and makes a UTF-32 automaton, which would then be converted to UTF-8 (binary) automaton via UTF32ToUTF8. But I think TermInSetQuery may allow arbitrary non-unicode binary strings?

In order to support arbitrarily binary terms (and to avoid conversions), the DaciukMihov code would have to modified, to support construction of a binary automaton directly. Probably this is actually simpler?

This is just an idea to get more performance, it hasn't been tested. feel free to close the issue if it doesnt work out.

@rmuir rmuir added the type:task label Mar 1, 2023
@zhaih
Copy link
Contributor

zhaih commented Mar 8, 2023

Hey Robert this is an interesting idea, one of the problem we're facing seems related to this idea:
we're having ~200 terms from several fields and we're trying to do a big disjunction over them, one of the observations is that seekExact is taking quite a big chunk of time.
So I'm thinking if this Automaton based approach can be faster than sort and seek, then probably we can have get some performance.
I'll try to look further into this idea.

@rmuir
Copy link
Member Author

rmuir commented Mar 8, 2023

so, one thing is, Terms.intersect() works across a single field.
and you definitely have to sort before adding terms to DaciukMihov (but then it works in linear time).

Sounds like you are currently just "blasting" and not using the seekCeil aka "ping-pong intersection" that TermInSetQuery does.

The advantage Terms.intersect has over such a "ping-pong" intersection, is that the terms dictionary implementation can intersect the list of terms faster... without hitting the disk as much. I think it makes better use of blocktree's index structure. IIRC it basically made term intersection for a lot of queries 2x faster than "ping-pong" because of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants