Modify TermInSetQuery to "self optimize" if doc values are available #12089
Conversation
Thanks for looking at this. I can alter the benchmark from #12087 to test this case; honestly we could even just take that benchmark and index the numeric field as a string instead as a start :)

In the case of numeric fields, we just had a crazy query in the sandbox which is better as e.g. NumericDocValues.newSlowSetQuery. And then we hooked into IntField etc. as newSetQuery with the IndexOrDocValuesQuery. For that one, we had to fix PointInSetQuery to support ScorerSupplier etc. (but TermInSetQuery already has this cost estimation).

I was naively thinking to try the same approach with the DocValuesTermsQuery that is in the sandbox... though I anticipated maybe more trickiness with the inverted index as opposed to points. But maybe IndexOrDocValuesQuery would surprise me again; it's probably worth exploring anyway, and we could compare the approaches. I do like IndexOrDocValuesQuery for solving these problems, and if we can improve it to keep it generic, I'd definitely be in favor of that. But whatever is fastest wins :)

I do think it's important to add fields such as KeywordField and put this best-practice logic behind simple methods so that it is easier on the user.
I modified the benchmark from #12087 to just use StringField instead of IntField. The queries are supposed to be "hard" in that I'm not trying to benchmark what is necessarily typical, but instead target "hard" stuff that is more worst-case (e.g., we shouldn't cause regressions vs. …).
```java
private final String field;
// TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
private final BytesRef[] terms;
```
But this is really bad for performance to be unsorted: it means we do a bunch of random-access lookups in the terms dictionaries (looping over these unsorted terms and doing seekExact, or looping over them and doing lookupOrd) that could instead easily be sequential and more friendly. Given that looking up all the terms is sometimes a very heavy cost for this query, calling Arrays.sort() in the ctor seems like an easy win/no-brainer, as does the prefix-coding to keep RAM usage lower in the worst case.
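As a minimal sketch of the suggestion (using plain `byte[][]` terms rather than Lucene's `BytesRef`, which sorts the same way), sorting once up front with an unsigned byte-wise comparator makes every later terms-dictionary seek move forward instead of jumping around:

```java
import java.util.Arrays;

public class SortedTerms {
  // Sort query terms once (e.g. in the query's constructor) so that
  // subsequent seekExact/lookupOrd calls walk the terms dictionary in order.
  // Arrays.compareUnsigned matches BytesRef's unsigned lexicographic order.
  static byte[][] sortTerms(byte[][] terms) {
    byte[][] copy = terms.clone();
    Arrays.sort(copy, Arrays::compareUnsigned);
    return copy;
  }

  public static void main(String[] args) {
    byte[][] terms = {"zebra".getBytes(), "apple".getBytes(), "mango".getBytes()};
    for (byte[] t : sortTerms(terms)) {
      System.out.println(new String(t)); // apple, mango, zebra
    }
  }
}
```

The one-time `O(n log n)` sort is typically dwarfed by the cost of random dictionary seeks it avoids, though the PK-style counter-example below shows it is not always free.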
I generally agree. I was testing some interesting use-cases, though, with our current TermInSetQuery where we have 10,000+ terms (a PK-type field), and the sorting it does actually came with a significant cost (we got a pretty good win by removing it). But that's a bit of a different use-case, maybe, now that I think about it. We have a bloom filter in place as well, and a good share of the terms aren't actually in the index. So there's that...
OK, yeah, we probably ought to sort here in a general implementation :)
When I think of worst-case, I'm assuming where these term dictionaries don't even fit in RAM. We should really always assume this stuff doesn't fit in RAM :)
You can use this tool when benchmarking to help make sure the index no longer fits in RAM: https://github.com/mikemccand/luceneutil/blob/b48e7f49b19c27367737436214cc1ce7e67ad32c/src/python/ramhog.c
Or you can open up the computer and remove DIMMs.
That's a good point/perspective. I'm convinced. It's easy enough to borrow those ideas from our existing TermInSetQuery, so I'll do that. Thanks.
I think that's probably a good place to start, honestly. I was thinking of introducing this in the sandbox module initially and not actually hooking it into …
```diff
@@ -57,4 +57,14 @@ public DisiWrapper(Scorer scorer) {
     matchCost = 0f;
   }
 }
+
+  public DisiWrapper(DocIdSetIterator iterator) {
```
This change is common to #12055
```java
// complicating matters, we don't know how many of the terms are actually in the
// segment (we assume they all are), and of course, our candidate size is also an
// estimate:
double avgTermAdvances = Math.min(leadCost, (double) t.getSumDocFreq() / t.size());
```
This may be a very poor estimation depending on the use-case. If the field follows a Zipf distribution, this makes some pretty poor assumptions (that all terms have approximately equal postings-list lengths). I think this bit will be challenging to do well.
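A toy illustration (with hypothetical numbers, not from the PR) of why `sumDocFreq / termCount` can mislead: under a Zipf-like distribution, one giant term dominates the sum, so the average wildly overestimates the typical term's postings length.

```java
public class AvgEstimate {
  // The same average the snippet above computes: sumDocFreq / termCount.
  static double avgDocFreq(long[] docFreqs) {
    long sum = 0;
    for (long f : docFreqs) {
      sum += f;
    }
    return (double) sum / docFreqs.length;
  }

  public static void main(String[] args) {
    // Zipf-ish field: one hot term, many rare ones.
    long[] zipfish = {1_000_000, 10, 10, 10, 10};
    // Average is 200008.0 even though 4 of the 5 terms have docFreq 10.
    System.out.println(avgDocFreq(zipfish));
  }
}
```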
I found some time to come back to this and did some more benchmarking. I added a markdown file with some benchmark results in this PR for now, just as a place to put it. It's here. This is all using a benchmark tool I wrote that's heavily based on something @rmuir did over in #12087. I'll figure out a place to upload that benchmark tool shortly for visibility. This is still very much a work in progress, but I thought the benchmark results were interesting and worth sharing out.
Thanks for the work benchmarking! You can rename it to a .txt file and just attach it to a GitHub comment, as one solution. Yeah, this one looked to be trickier at a glance. Simply indexing the same integer values as strings makes the queries quite a bit slower than what I saw with Points+DV on #12087. And that's not even using lots of query terms...
That works! Here's how I benchmarked. One note if you're interested in running this: make sure to shuffle the geonames data prior to running, or you get very skewed results. This is because the file is sorted by country code, so a postings-based approach is heavily favored by the natural index sorting if you index the lines in this order. I'm actually a bit surprised/impressed that our existing …
I wonder, for both of these DV queries (the existing one in the sandbox and the new one here in the sandbox), if they should really be using this. #12087 is using a …
Also, personally I wouldn't target the sandbox at all when attacking this problem. Such queries went into the sandbox before primarily, I think, because they were very tricky to use. There wasn't such a thing as IndexOrDocValuesQuery, and often you'd get worse performance when using them. Nowadays we can expose them more cleanly, via e.g. …
Interesting. I'll experiment with that in this benchmark to see if it makes any real difference (+1 to being more memory-efficient here if possible).
That's fair. I'll target "non-sandbox." I'll work on tightening up the code a bit and adding testing soon.
@rmuir I modified the PR to update the existing …. I think this is ready for a more thorough review now if you (or anyone else) has the time. My plan is to reference this query directly from something like …
Hmm, @gsmiller, I really personally don't like all this complexity and optimizations mixed into …. This is why I like the IndexOrDocValues approach: we can have two simple queries instead of one extremely complicated one.

If it is a must that we create a giant HeroicTermInSetQuery that tries to do anything and everything, then let's name it just that; also make it package-private in …. But I'd really like us to consider the cost in the complexity here.
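A minimal, Lucene-free sketch of the IndexOrDocValuesQuery idea being advocated here (invented names, simplified logic): keep two simple implementations and choose per segment, driving iteration from the index when the postings estimate is cheap relative to the lead iterator, and otherwise verifying candidates via random-access doc values.

```java
public class IndexOrDocValuesChoice {
  // Per-segment choice between two simple query implementations, based on
  // cost estimates, instead of one query that tries to do everything.
  static String choose(long indexQueryCost, long leadCost) {
    // Selective enough to lead iteration? Use the postings-based query.
    // Otherwise, check each of the lead's candidates against doc values.
    return indexQueryCost <= leadCost ? "index" : "docValues";
  }

  public static void main(String[] args) {
    System.out.println(choose(100, 10_000));       // selective filter: index
    System.out.println(choose(1_000_000, 10_000)); // expensive filter: docValues
  }
}
```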
And when I look at the benchmarks, there are cases where IndexOrDV does better, and there are cases where this one does worse. I'm not seeing a "clear winner", so I'd personally lean towards tie-breaking on "lower complexity". We might be able to improve performance in easier ways that don't have a complexity cost, such as replacing the gigantic bitset with a long hash set.
Another way to say it: I see a lot more optimizations and conditionals in the query than I see wins in the benchmark. Which of these truly matter? Can we remove some to simplify the new query?
I'll work up a draft PR trying to work it from the opposite direction (keep the queries split apart) so my suggestions will be more concrete. You did the hard part already and made a benchmark; I will reuse it.
I think some of the benchmarks need to be revisited here anyway.
@rmuir thanks for the feedback. +1 to making sure we're getting the right balance of complexity vs. benefit.

If you strip away some of the complexity, I think the main benefit of the approach I'm proposing here is the ability to term-seek in a somewhat sensible way to come up with a better cost estimate. The challenge with the existing ….

So this shows up with something like …. As for where the proposed implementation is worse than ….

Anyway, back to the point about complexity vs. benefit, I 100% agree that relying on …
Yeah, overall my concern is not so much with your specific query, just that it doesn't scale to many other queries. It is only a matter of time before someone says "hey, we can really speed up many other slow use-cases by adding KeywordField.newPrefixQuery, newWildcardQuery, etc. by using the doc values", and it's true (for selective queries, you could avoid intersection against any large terms dictionary at all and do a couple of per-doc, very-high-cost lookupTerm + runautomaton matches).

So it's an example where I'd like to avoid a mess: it would be better to fix APIs such as MultiTermQuery, IndexOrDocValuesQuery, ScorerSupplier, etc. so that we can make things work cleanly with simpler queries that are easier to test for correctness and maintain. But practically, right now the benchmark is not a very fair comparison, because some of these queries just have obvious, straightforward performance deficiencies, DocValuesTermsQuery especially.
Thanks @rmuir! I'll have a look at the PRs tomorrow. I've got a bad case of jet lag kicking in at the moment and my brain is turning to mush.

In the meantime, I took a pass at simplifying the proposal here in this PR. There was a lot of complexity (as you pointed out) for edge-cases and such. I tried to keep it focused on just the core idea of seeking some terms to get a better cost estimate, in a way that adds minimal complexity. Benchmark results are about the same as before, so it's quite a bit simpler with no real performance difference: TiSBenchResults_Simplified.md.txt

Of course, this still relies on optimizing internally instead of leveraging …. Thanks again for the feedback and related PRs!
Thanks, the patch is super-helpful as it illustrates the real issue. I am looking into it; to me the issue is a problem with ….

I didn't get distracted, just saw related issues on the DV side that are easily fixed: …, and I've still got a few TODOs for these queries remaining. It is one advantage of having the separate queries: it makes it super easy to just benchmark the pure-DV case and find/fix issues.
@rmuir I grabbed your patch for adding a …. The only gap that remains now is when the field is not a PK-style field but the terms being used in the disjunction have a low aggregate cost (relative to the other terms in the field; e.g., …).

Here are updated benchmark results: TiSBenchResults_Simplified_DVSSPatch.md.txt
+1. Maybe there's a way to address this remaining gap by being smarter about the cost function without term-seeking? That would be ideal. I also played around with the idea of a "cost iterator" abstraction on …
That's good that it made progress. I will look more into it tonight. I want to get these patches landed to simplify benchmarking. It's true there is one benchmark where this combined query does better ("Medium Cardinality + Low Cost Country Code Filter Terms"), but there is also one benchmark where it does substantially worse ("Low Cardinality + High Cost Country Code Filter Terms"), so net/net I would say they are equivalent. But I will look into this case to see if we can still do better.
I wonder if we should switch this PR to a new doc-values query that looks up terms on demand rather than eagerly, in order not to pay an O(numTerms) up-front cost per segment, which defeats the idea of using doc values when numTerms > leadCost.
```java
costThreshold = isPrimaryKeyField ? leadCost << 3 : leadCost;

if (termData.size() > costThreshold) {
  // If the number of terms is > the number of candidates, DV should perform better.
```
I wonder if this is right given that the doc-values query still eagerly evaluates all terms against the terms dictionary. For this to work correctly, we'd need a query that looks up terms lazily rather than eagerly?
I'm not sure actually. The up-front term-seeking you refer to is certainly a cost, but it doesn't scale with the number of lead hits. So this can still be cheaper. But also, +1 to the idea of trying out on-demand term seeking for these situations!
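A hedged sketch (names invented, logic simplified from the snippet above) of the segment-level decision under discussion: compare the number of query terms against a threshold derived from the lead iterator's cost, with primary-key-like fields getting a more generous threshold.

```java
public class SelfOptimize {
  // Decide per segment whether to evaluate via doc values or postings.
  static boolean useDocValues(long numQueryTerms, long leadCost, boolean isPrimaryKeyField) {
    // PK-style fields match at most one doc per term, so tolerate 8x more
    // terms (leadCost << 3) before switching away from postings.
    long costThreshold = isPrimaryKeyField ? leadCost << 3 : leadCost;
    // More query terms than lead candidates: doc values should perform better.
    return numQueryTerms > costThreshold;
  }

  public static void main(String[] args) {
    System.out.println(useDocValues(10_000, 100, false)); // true
    System.out.println(useDocValues(500, 100, true));     // false: within the PK threshold of 800
  }
}
```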
```java
final long indexedTermCount = indexTerms.size();
if (indexedTermCount != -1) {
  potentialExtraCost -= indexedTermCount;
}
cost = queryTermsCount + potentialExtraCost;

final boolean isPrimaryKeyField = indexedTermCount != -1 && sumDocFreq == indexedTermCount;
```
Since `terms.size()` is an optional index statistic, maybe check `sumDocFreq == docCount` instead?
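A small sketch of the suggested heuristic (invented names): `terms.size()` may be unavailable (reported as -1), but when every document contributes exactly one term occurrence, `sumDocFreq == docCount` also identifies a primary-key-like field.

```java
public class PkHeuristic {
  // A field is PK-like when each document has exactly one term occurrence,
  // i.e. the total number of postings equals the number of documents.
  static boolean isPrimaryKeyLike(long sumDocFreq, long docCount) {
    return sumDocFreq == docCount;
  }

  public static void main(String[] args) {
    System.out.println(isPrimaryKeyLike(1_000_000, 1_000_000)); // true: one term per doc
    System.out.println(isPrimaryKeyLike(2_500_000, 1_000_000)); // false: repeated or multi-valued terms
  }
}
```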
100%. The issue here is that …
Well, right now I'm not seeing any justification to mix up concerns and pile everything into one query. Currently, overall, IndexOrDocValues has the best performance: it is slightly slower in one case but massively faster in another. Also, there are some problems with the benchmark; I tried to wrestle with it but there is some serious noise/GC/something going on. If I just rearrange the order of the tests, the numbers change dramatically. I am still willing to optimize the one case where IndexOrDocValues might do better, but I honestly don't feel I need to "play defense" any more. I think we should use IndexOrDocValues.
I spent my time here just to prevent code duplication and a mess, nothing more. I've got no skin in the game and couldn't care less about these stupid abusive "joins" that people do. But let's please look at the current situation and say "we are gonna keep it clean" so I don't have to stay up so late anymore to prevent a code mess. There is no advantage to piling all this into one query. Breathe in, breathe out, let it go.
@rmuir thanks for the feedback and for spending time having a look. I'm going to try summarizing where we've landed to make sure we're on the same page. I think we both agree on the following, but please let me know if I'm missing anything.

Given this, I tend to agree that moving away from …
Closing this out for now. I think my last comment provides a pretty good summary of where things landed here. I did experiment a bit with other ways of estimating cost within …
Description
This change alters `TermInSetQuery` to "self-optimize" if there is a parallel doc values field present. The idea is to build on the new `KeywordField` being proposed in #12054, which indexes both postings and DV data. It takes a bit of a different approach compared to `IndexOrDocValuesQuery` by "internally" deciding whether to use postings vs. doc values (at the segment granularity). The idea is to actually start seeking through some of the terms to get a better postings-based cost estimate, which is something we can't easily expose through the typical `ScorerSupplier#cost()` mechanism, since we want to short-circuit with a DV approach if we "accumulate" enough estimated cost.

I've run various task benchmarks using geonames data to convince myself that this approach is generally more performant than trying to use `IndexOrDocValuesQuery` on top of the existing `TermInSetQuery` + `DocValuesTermsQuery`, as well as to show that we shouldn't expect any significant regressions for existing users of `TermInSetQuery`: TiSBenchResults.md.txt. Benchmarks were run with a modified version of @rmuir's benchmarking tool used in #12087, available here: TiSBench.java.txt.