You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary.
Instead of sorted-list, it could possibly use Daciuk Mihov Automaton, which can be built in linear time. Then query could leverage Terms.intersect (e.g. TermInSetQuery could be an AutomatonQuery subclass).
This should give faster intersection of the terms, which is usually the heavy part of this query. For example BlockTree terms dictionary has a very efficient Terms.intersect that makes use of the underlying structure.
The annoying part: DaciukMihovAutomatonBuilder currently requires unicode strings and makes a UTF-32 automaton, which would then be converted to UTF-8 (binary) automaton via UTF32ToUTF8. But I think TermInSetQuery may allow arbitrary non-unicode binary strings?
In order to support arbitrarily binary terms (and to avoid conversions), the DaciukMihov code would have to modified, to support construction of a binary automaton directly. Probably this is actually simpler?
This is just an idea to get more performance, it hasn't been tested. feel free to close the issue if it doesnt work out.
The text was updated successfully, but these errors were encountered:
Hey Robert this is an interesting idea, one of the problem we're facing seems related to this idea:
we're having ~200 terms from several fields and we're trying to do a big disjunction over them, one of the observations is that seekExact is taking quite a big chunk of time.
So I'm thinking if this Automaton based approach can be faster than sort and seek, then probably we can have get some performance.
I'll try to look further into this idea.
so, one thing is, Terms.intersect() works across a single field.
and you definitely have to sort before adding terms to DaciukMihov (but then it works in linear time).
Sounds like you are currently just "blasting" and not using the seekCeil aka "ping-pong intersection" that TermInSetQuery does.
The advantage Terms.intersect has over such a "ping-pong" intersection, is that the terms dictionary implementation can intersect the list of terms faster... without hitting the disk as much. I think it makes better use of blocktree's index structure. IIRC it basically made term intersection for a lot of queries 2x faster than "ping-pong" because of this.
Description
TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary.
Instead of sorted-list, it could possibly use Daciuk Mihov Automaton, which can be built in linear time. Then query could leverage
Terms.intersect
(e.g. TermInSetQuery could be an AutomatonQuery subclass).This should give faster intersection of the terms, which is usually the heavy part of this query. For example BlockTree terms dictionary has a very efficient
Terms.intersect
that makes use of the underlying structure.The annoying part:
DaciukMihovAutomatonBuilder
currently requires unicode strings and makes a UTF-32 automaton, which would then be converted to UTF-8 (binary) automaton viaUTF32ToUTF8
. But I thinkTermInSetQuery
may allow arbitrary non-unicode binary strings?In order to support arbitrarily binary terms (and to avoid conversions), the DaciukMihov code would have to modified, to support construction of a binary automaton directly. Probably this is actually simpler?
This is just an idea to get more performance, it hasn't been tested. feel free to close the issue if it doesnt work out.
The text was updated successfully, but these errors were encountered: