Modify TermInSetQuery to "self optimize" if doc values are available #12089

Conversation

@gsmiller (Contributor) commented Jan 17, 2023

Description

This change alters TermInSetQuery to "self-optimize" if there is a parallel doc value field present. The idea is to build on the new KeywordField being proposed in #12054, which indexes both postings and DV data. It takes a bit of a different approach though as compared to IndexOrDocValuesQuery by "internally" deciding whether to use postings vs. doc values (at the segment granularity). The idea is to actually start seeking through some of the terms to get a better postings-based cost estimate, which is something we can't easily expose through the typical ScoreSupplier#cost() mechanism since we want to short-circuit with a DV approach if we "accumulate" enough estimated cost.
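
To make the idea concrete, here is a minimal, hypothetical sketch of the per-segment decision (not the actual patch; the class, method, and parameter names are illustrative): seek the query terms, accumulate docFreq as a postings-based cost estimate, and bail out to doc values once the estimate exceeds the lead cost.

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class PostingsVsDocValuesSketch {
  /**
   * Returns true if postings look cheaper than doc values for this segment. Terms missing from
   * the segment contribute no cost, which is what field-level statistics alone cannot capture.
   */
  static boolean usePostings(LeafReader reader, String field, BytesRef[] sortedQueryTerms, long leadCost)
      throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) {
      return false; // no postings for this field in this segment: use doc values
    }
    TermsEnum termsEnum = terms.iterator();
    long accumulatedCost = 0;
    for (BytesRef term : sortedQueryTerms) {
      if (termsEnum.seekExact(term)) {
        accumulatedCost += termsEnum.docFreq();
        if (accumulatedCost > leadCost) {
          return false; // estimated postings cost already exceeds the lead cost: prefer doc values
        }
      }
    }
    return true;
  }
}
```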

I've run various task benchmarks using geonames data to convince myself that this approach is generally more performant than trying to use IndexOrDocValuesQuery on top of the existing TermInSetQuery + DocValuesTermsQuery, as well as to show that we shouldn't expect any significant regressions for existing users of TermInSetQuery: TiSBenchResults.md.txt. Benchmarks were run with a modified version of @rmuir's benchmarking tool used in #12087, and is available here: TiSBench.java.txt.

@gsmiller gsmiller marked this pull request as draft January 17, 2023 19:33
@rmuir (Member) commented Jan 17, 2023

Thanks for looking at this. I can alter the benchmark from #12087 to test this case; honestly, we could even just take that benchmark and index the numeric field as a string instead as a start :)

In the case of numeric fields, we just had a crazy query in the sandbox, which is better expressed as e.g. NumericDocValues.newSlowSetQuery. We then hooked it into IntField etc. as newSetQuery with the IndexOrDocValuesQuery. For that one, we had to fix PointInSetQuery to support ScorerSupplier etc. (but TermInSetQuery already has this cost estimation).
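
For reference, the wiring described above looks roughly like this (a sketch using LongPoint and SortedNumericDocValuesField as stand-ins for the IntField/NumericDocValues helpers mentioned; the newSlowSetQuery helper is assumed to be available):

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

final class NumericSetQuerySketch {
  // The field must be indexed with both points and doc values for this to make sense.
  static Query newSetQuery(String field, long... values) {
    Query indexQuery = LongPoint.newSetQuery(field, values);                    // points side
    Query dvQuery = SortedNumericDocValuesField.newSlowSetQuery(field, values); // doc-values side
    // IndexOrDocValuesQuery picks one side per segment based on estimated cost vs. the lead iterator.
    return new IndexOrDocValuesQuery(indexQuery, dvQuery);
  }
}
```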

I was naively thinking to try the same approach with the DocValuesTermsQuery that is in the sandbox... though I anticipated maybe more trickiness with the inverted index as opposed to points. But maybe IndexOrDocValuesQuery would surprise me again; of course it's probably worth exploring anyway, and we could compare the approaches. I do like IndexOrDocValuesQuery for solving these problems, and if we can improve it to keep it generic, I'd definitely be in favor of that. But whatever is fastest wins :)

I do think it's important to add these fields such as KeywordField and put this best-practice logic behind simple methods so that it is easier on the user.

@rmuir (Member) commented Jan 17, 2023

I modified the benchmark from #12087 to just use StringField instead of IntField. The queries are supposed to be "hard" in that I'm not trying to benchmark what is necessarily typical; instead I target "hard" stuff that is more worst-case (e.g. we shouldn't cause regressions vs. the new PointInSetQuery() in the main branch today): StringSetBenchmark.java.txt


private final String field;
// TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
private final BytesRef[] terms;
Member:

but it is really bad for performance for these to be unsorted: it means we do a bunch of random-access lookups in the terms dictionaries (looping over these unsorted terms doing seekExact, looping over them doing lookupOrd) that could instead easily be sequential/more friendly.

Given that looking up all the terms is sometimes a very heavy cost for this query, calling Arrays.sort() in the ctor seems like an easy win/no-brainer. So does prefix-coding to keep RAM usage lower in the worst case.
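
For illustration, a minimal sketch of sorting plus prefix-coding up front (assuming Lucene's PrefixCodedTerms builder, which expects its input in sorted order; the class name is made up):

```java
import java.util.Arrays;
import org.apache.lucene.index.PrefixCodedTerms;
import org.apache.lucene.util.BytesRef;

final class SortedPrefixCodedTermsSketch {
  // Sort once in the ctor so terms-dict seeks become (mostly) sequential, then prefix-code
  // the sorted terms to keep worst-case RAM usage down.
  static PrefixCodedTerms encode(String field, BytesRef[] terms) {
    BytesRef[] sorted = terms.clone();
    Arrays.sort(sorted); // BytesRef sorts in unsigned byte order
    PrefixCodedTerms.Builder builder = new PrefixCodedTerms.Builder();
    BytesRef previous = null;
    for (BytesRef term : sorted) {
      if (previous == null || previous.equals(term) == false) { // skip duplicates
        builder.add(field, term);
      }
      previous = term;
    }
    return builder.finish();
  }
}
```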

Contributor Author:

I generally agree. I was testing some interesting use-cases though with our current TermInSetQuery where we have 10,000+ terms (PK-type field), and the sorting it does actually came with a significant cost (and we got a pretty good win by removing it). But that's a bit of a different use-case maybe now that I think about it. We have a bloom filter in place as well, and a good share of the terms aren't actually in the index. So there's that...

OK, yeah, we probably ought to sort here in a general implementation :)

Member:

When I think of worst-case, I'm assuming where these term dictionaries don't even fit in RAM. We should really always assume this stuff doesn't fit in RAM :)

Member:

you can use this tool when benchmarking to help make sure the index no longer fits in RAM: https://github.com/mikemccand/luceneutil/blob/b48e7f49b19c27367737436214cc1ce7e67ad32c/src/python/ramhog.c

or you can open up the computer and remove DIMMs

Contributor Author:

That's a good point/perspective. I'm convinced. It's easy enough to borrow those ideas from our existing TermInSetQuery, so I'll do that. Thanks.

@gsmiller (Contributor Author):

@rmuir

I was naively thinking to try to the same approach with the DocValuesTermsQuery that is in sandbox...

I think that's probably a good place to start honestly. I was thinking of introducing this in the sandbox module initially and not actually hooking it into KeywordField. Then maybe following up by graduating it and hooking it in? But maybe it's worth doing initially if we can get it right? I dunno.

@@ -57,4 +57,14 @@ public DisiWrapper(Scorer scorer) {
matchCost = 0f;
}
}

public DisiWrapper(DocIdSetIterator iterator) {
@gsmiller (Contributor Author) commented Jan 17, 2023:

This change is common to #12055

// complicating matters, we don't know how many of the terms are actually in the
// segment (we assume they all are), and of course, our candidate size is also an
// estimate:
double avgTermAdvances = Math.min(leadCost, (double) t.getSumDocFreq() / t.size());
Contributor Author:

This may be a very poor estimate depending on the use-case. If the field follows a Zipf distribution, this is making some pretty poor assumptions (that all terms have approximately equal postings-list lengths). I think this bit will be challenging to do well.

@gsmiller (Contributor Author) commented Jan 27, 2023

I found some time to come back to this and did some more benchmarking. I added a markdown file with some benchmark results in this PR for now just as a place to put it. It's here.

This is all using a benchmark tool I wrote that's heavily based on something @rmuir did over in #12087. I'll figure out a place to upload that benchmark tool shortly for visibility.

This is still very much a work in progress, but I thought the benchmark results were interesting and worth sharing out.

@rmuir (Member) commented Jan 27, 2023

thanks for the work benchmarking! you can rename to a .txt file and just attach it to a github comment, as one solution.

yeah, this one looked to be trickier at a glance. simply indexing the same integer values as strings makes the queries quite a bit slower than what I saw with Points+DV on #12087. And not even using lots of query terms...

@gsmiller (Contributor Author):

attach it to a github comment

That works! Here's how I benchmarked. One note if you're interested in running this: make sure to shuffle the geonames data prior to running or you'll get very skewed results. This is because the file is sorted by country code, so a postings-based approach is heavily favored by the natural index sorting if you index the lines in this order.
TiSBench.txt

I'm actually a bit surprised/impressed that our existing IndexOrDocValues functionality works as well as it does across these queries, given how rough of an estimate #cost() is on the current TermInSetQuery. There are some clear cases where term-seeking and using term-level stats in a heuristic helps make better decisions, but I expected the difference to be more pronounced.

@rmuir (Member) commented Jan 28, 2023

I wonder, for both of these DV queries (the existing one in the sandbox and the new one here), whether they should really be using this LongBitSet to mark the matching ordinals found in the terms dict... it seems overkill/wasteful to create a bitset sized over the entire terms dictionary.

#12087 uses an org.apache.lucene.document.LongHashSet for this purpose.
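
The contrast, roughly (a sketch using SortedSetDocValues#lookupTerm; a plain HashSet<Long> stands in for the package-private LongHashSet, which avoids boxing in the real code):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.LongBitSet;

final class MatchingOrdsSketch {
  // Current shape: a bitset sized to the entire terms dictionary, even if only a few terms match.
  static LongBitSet collectWithBitSet(SortedSetDocValues dv, BytesRef[] queryTerms) throws IOException {
    LongBitSet ords = new LongBitSet(dv.getValueCount());
    for (BytesRef term : queryTerms) {
      long ord = dv.lookupTerm(term);
      if (ord >= 0) {
        ords.set(ord);
      }
    }
    return ords;
  }

  // Suggested shape: store only the ordinals that actually matched.
  static Set<Long> collectWithHashSet(SortedSetDocValues dv, BytesRef[] queryTerms) throws IOException {
    Set<Long> ords = new HashSet<>();
    for (BytesRef term : queryTerms) {
      long ord = dv.lookupTerm(term);
      if (ord >= 0) {
        ords.add(ord);
      }
    }
    return ords;
  }
}
```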

@rmuir (Member) commented Jan 28, 2023

also personally i wouldn't target the sandbox at all when attacking this problem. such queries went into the sandbox before, primarily i think because they were very tricky to use. There wasn't such a thing as IndexOrDocValuesQuery and often you'd get worse performance when using them.

nowadays, we can expose them more cleanly, via e.g. KeywordField.newSetQuery

@gsmiller (Contributor Author):

I wonder for both these DV-queries (existing one in sandbox and new one here in sandbox), if they should really be using this LongBitSet to mark the matching ordinals found in the terms dict... it seems overkill/wasteful to create a bitset sized over the entire terms dictionary.

Interesting. I'll experiment with that in this benchmark to see if it makes any real difference (+1 to being more memory efficient here if possible).

also personally i wouldn't target the sandbox at all when attacking this problem

That's fair. I'll target "non-sandbox."

I'll work on tightening up the code a bit and adding testing soon.

@gsmiller gsmiller force-pushed the explore/sandbox-tis-query-no-multi-dv-support branch from 84dcd52 to db2de86 Compare February 4, 2023 14:02
@gsmiller gsmiller changed the title [DRAFT] Explore TermInSet Query that "self optimizes" Explore TermInSet Query that "self optimizes" Feb 4, 2023
@gsmiller gsmiller force-pushed the explore/sandbox-tis-query-no-multi-dv-support branch from db2de86 to d127aa4 Compare February 4, 2023 14:09
@gsmiller gsmiller marked this pull request as ready for review February 4, 2023 14:10
@gsmiller gsmiller changed the title Explore TermInSet Query that "self optimizes" Modify TermInSetQuery to "self optimize" if doc values are available Feb 4, 2023
@gsmiller (Contributor Author) commented Feb 4, 2023

@rmuir I modified the PR to update the existing TermInSetQuery in-place, instead of introducing a new sandbox concept. I've added some test coverage and re-ran all the benchmarks. The updated benchmark code is here: TiSBench.java.txt. The results are here:
TiSBenchResults.md.txt.

I think this is ready for a more thorough review now if you (or anyone else) has the time. My plan is to reference this query directly from something like KeywordField#newSetQuery after #12054 is merged (or could be part of that PR). Based on the benchmarks, I wouldn't expect this change to cause any significant regressions to existing TermInSetQuery users, and it looks more promising than trying to leverage IndexOrDocValues. I'm sure there's room to tweak/improve it further, but my hope is to get something initial merged and then iterate as it makes sense. Thanks for your feedback so far!

@rmuir (Member) commented Feb 4, 2023

Hmm, @gsmiller I really personally don't like all this complexity and optimizations mixed into TermInSetQuery. It becomes almost unmanageable.

This is why I like the IndexOrDocValues approach: we can have two simple queries instead of one extremely complicated one.

If it is a must that we create a giant HeroicTermInSetQuery that tries to do anything and everything, then let's name it just that. Also make it package-private in org.apache.lucene.document so it can be hooked in by KeywordField.

But I'd really like us to consider the cost in the complexity here.

@rmuir (Member) commented Feb 4, 2023

and when i look at the benchmarks, there are cases where IndexOrDV does better, and there are cases where this one does worse. I'm not seeing a "clear winner", so i'd personally lean towards tie-breaking on "lower complexity".

We might be able to improve performance in easier ways that don't have a complexity cost, such as replacing the gigantic bitset with a LongHashSet.

@rmuir (Member) commented Feb 4, 2023

another way to say it: I see a lot more optimizations and conditionals in the query than I see wins in the benchmark?

Which of these truly matter? Can we remove some to simplify the new query?
Is there a key optimization or two that could simply be folded into the existing TermInSetQuery/DocValuesTermsQuery cost() methods so we can improve IndexOrDocValuesQuery?

@rmuir (Member) commented Feb 4, 2023

I'll work up a draft PR trying to work it from the opposite direction (keep the queries split apart) so my suggestions will be more concrete. You did the hard part already and made a benchmark; I will reuse it.

@rmuir (Member) commented Feb 4, 2023

I think some of the benchmarks need to be revisited here anyway. DocValuesTermsQuery is not in great shape, which may lead to bad results from IndexOrDocValuesQuery and cause confusion. I will clean it up.

@gsmiller (Contributor Author) commented Feb 4, 2023

@rmuir thanks for the feedback. +1 to making sure we're getting the right balance of complexity vs. benefit. If you strip away some of the complexity, I think the main benefit of the approach I'm proposing here is the ability to term-seek in a somewhat sensible way to come up with a better cost estimate. The challenge with the existing TermInSetQuery is that it only looks at field-level stats to guess at the disjunction cost. This can lead to a situation where the cost is significantly overestimated, causing IndexOrDocValues to use doc values when postings would be more appropriate. The advantage of optimizing within the query implementation is that it can short-circuit the term seeking when it reaches a point where it believes doc values will be more appropriate. I'm not sure how we'd do this in a sensible way in the existing ScorerSupplier#cost abstraction.

So this shows up with something like "Medium Cardinality + Low Cost Country Code Filter Terms" in the benchmarks, as well as the PK cases. The challenge with the PK case is a bit unique. While ScorerSupplier#cost in TermInSetQuery already handles the PK cost estimate in a sensible way, it misses the fact that most terms will not be present in a given segment (assuming the index is segmented). Again, this is where term-seeking helps provide a tighter cost estimate. Both of these are things we may be able to tweak in the current TermInSet cost approximation so that we don't need a special query implementation.

As for where the proposed implementation is worse than IndexOrDocValues, I think the main culprit is Low Cardinality + High Cost Country Code Filter Terms. This is because I kept the existing rewrite logic to rewrite to a boolean query if there are fewer than 16 terms. So it will always use a postings approach with < 16 terms, even though that's not the right thing in this situation. This is something I was planning to look into more. When I removed this rewrite logic, it performed much better in this case, but I bumped into some unit tests elsewhere that assume the rewrite, and I thought it would help to revise the rewrite logic in a separate pass rather than try to solve for that now.

Anyway, back to the point about complexity vs. benefit, I 100% agree that relying on IndexOrDocValues would be preferable if we can solve for cost over-estimating. I'd pursued this path after some feedback from @jpountz (#11741 (comment)) that it may make sense to take this type of approach, but I'm all for keeping it simple if we can here. I'll keep an eye out for your draft PR. Thanks for the engagement on this!

@rmuir (Member) commented Feb 5, 2023

I fired up 3 related PRs to at least make it more of a fair fight :)

#12127
#12128
#12129

@rmuir (Member) commented Feb 5, 2023

Anyway, back to the point about complexity vs. benefit, I 100% agree that relying on IndexOrDocValues would be preferable if we can solve for cost over-estimating. I'd pursued this path after some feedback from @jpountz (#11741 (comment)) that it may make sense to take this type of approach, but I'm all for keeping it simple if we can here. I'll keep an eye out for your draft PR. Thanks for the engagement on this!

Yeah, overall my concern is not so much with your specific query, just that the approach doesn't scale to many other queries. It is only a matter of time before someone says "hey, we can really speed up many other slow use-cases by adding KeywordField.newPrefixQuery, newWildCardQuery, etc. by using the docvalues", and it's true (for selective queries, you could avoid intersecting against any large terms dict at all and do a couple of very-high-cost per-doc lookupTerm + runautomaton matches). So it's an example where I'd like to avoid a mess: it would be better to fix APIs such as MultiTermQuery, IndexOrDocValuesQuery, ScorerSupplier, etc. so that we can make things work cleanly with simpler queries that are easier to test for correctness and maintain.

But practically, right now the benchmark is not a very fair comparison because some of these queries just have obvious, straightforward performance deficiencies, DocValuesTermsQuery especially.

@gsmiller (Contributor Author) commented Feb 5, 2023

Thanks @rmuir! I'll have a look at the PRs tomorrow. I've got a bad case of jet lag kicking in at the moment and my brain is turning to mush. In the meantime, I took a pass at simplifying the proposal here in this PR. There was a lot of complexity (as you pointed out) for edge-cases and such. I tried to keep it focused on just the core idea of seeking some terms to get a better cost estimate, in a way that adds minimal complexity. Benchmark results are about the same as before, so quite a bit simpler and no real performance difference: TiSBenchResults_Simplified.md.txt

Of course, this still relies on optimizing internally instead of leveraging IndexOrDocValuesQuery, so I recognize it's not getting at the core of your concern. I figured it would be good to at least simplify a bit though so we have something a little cleaner/simpler to discuss.

Thanks again for the feedback and related PRs!

@rmuir (Member) commented Feb 5, 2023

Thanks, the patch is super-helpful as it illustrates the real issue. I am looking into it, to me the issue is a problem with TermInSetQuery ScorerSupplier cost method. Points uses a different approach that works great in all my benchmarks, so it seems like terms/postings might need one too.

I didn't get distracted, just saw related issues on the DV side that are easily fixed: and i've still got a few TODOs for these queries remaining. It is one advantage of having the separate queries: makes it super easy to just benchmark the pure-DV case and find/fix issues.

@gsmiller (Contributor Author) commented Feb 6, 2023

@rmuir I grabbed your patch for adding a ScorerSupplier to DocValuesTermsQuery (#12129) and reran benchmarks. The gap between IndexOrDV and the "self-optimizing" TermInSetQuery has closed with this change. It looks like I was wrong about the way IndexOrDV plans PK-type queries. I thought it was choosing to use doc values based on what I saw in profiler output, but what I was really seeing was the up-front ordinal lookups in DocValuesTermsQuery as a result of not having the ScorerSupplier abstraction. With your patch, that goes away.

The only gap that remains now is when the field is not a PK-style field but the terms being used in the disjunction have a low aggregate cost (relative to the other terms in the field; e.g., Medium Cardinality + Low Cost Country Code Filter Terms). In this case, IndexOrDV is always using doc values (due to the field-level stats used for cost), but—by doing some term-seeking—we could better decide to use postings.

Here are updated benchmark results: TiSBenchResults_Simplified_DVSSPatch.md.txt
(Note that "low cardinality" cases are kind of terrible still because the TiSQuery is being rewritten to a BooleanQuery)

to me the issue is a problem with TermInSetQuery ScorerSupplier cost method

+1. Maybe there's a way to address this remaining gap by being smarter about the cost function without term-seeking? That would be ideal.

I also played around with the idea of a "cost iterator" abstraction on ScorerSupplier as a way to allow something like TermInSetQuery to provide incremental costs to IndexOrDocValuesQuery as it term-seeks. This feels clunky to me, and I'm not proposing it as a "good idea" right now, but I'll share it as another approach. I was able to get comparable benchmark results with this technique, and it still allows IndexOrDocValuesQuery to "own" the decision between postings and doc values: main...gsmiller:lucene:explore/tis-score-supplier-cost-iterator. Benchmark results for this approach are here: TiSBenchResults_SSIterator.md.txt. It feels overly complicated though.

@rmuir (Member) commented Feb 6, 2023

That's good that it made progress. I will look more into it tonight. I want to get these patches landed to simplify benchmarking.

It's true there is one benchmark where this combined query does better ("Medium Cardinality + Low Cost Country Code Filter Terms"), but there is also one benchmark where it does substantially worse ("Low Cardinality + High Cost Country Code Filter Terms").

So net/net I would say they are equivalent. But I will look into this case to see if we can still do better.

@jpountz (Contributor) left a comment:

I wonder if we should switch this PR to a new doc-values query that looks up terms on demand rather than eagerly, in order not to pay an O(numTerms) up-front cost per segment, which defeats the idea of using doc values when numTerms > leadCost.

costThreshold = isPrimaryKeyField ? leadCost << 3 : leadCost;

if (termData.size() > costThreshold) {
// If the number of terms is > the number of candidates, DV should perform better.
Contributor:

I wonder if this is right given that the doc-values query still eagerly evaluates all terms against the terms dictionary. For this to work correctly, we'd need a query that looks up terms lazily rather than eagerly?

Contributor Author:

I'm not sure actually. The up-front term-seeking you refer to is certainly a cost, but it doesn't scale with the number of lead hits. So this can still be cheaper. But also, +1 to the idea of trying out on-demand term seeking for these situations!

final long indexedTermCount = indexTerms.size();
if (indexedTermCount != -1) {
potentialExtraCost -= indexedTermCount;
}
cost = queryTermsCount + potentialExtraCost;

final boolean isPrimaryKeyField = indexedTermCount != -1 && sumDocFreq == indexedTermCount;
Contributor:

Since terms.size() is an optional index statistic, maybe check sumDocFreq == docCount instead?
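
A minimal sketch of the suggested check (getSumDocFreq and getDocCount, unlike size(), are not optional statistics):

```java
import java.io.IOException;
import org.apache.lucene.index.Terms;

final class PrimaryKeyCheckSketch {
  // For a single-valued, unique (PK-like) field, every doc contributes exactly one posting,
  // so sumDocFreq == docCount holds without relying on the optional terms.size() statistic.
  static boolean looksLikePrimaryKey(Terms terms) throws IOException {
    return terms.getSumDocFreq() == terms.getDocCount();
  }
}
```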

@gsmiller (Contributor Author) commented Feb 6, 2023

but there is also one benchmark where it does substantially worse ("Low Cardinality + High Cost Country Code Filter Terms").

100%. The issue here is that TermInSetQuery gets rewritten to a BooleanQuery because there are fewer than 16 terms, so it doesn't have a chance to "self-optimize" to use doc values. We can fix this by not eagerly rewriting to a BooleanQuery, but I held off doing that for now. So this is "easily" fixable I think.
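
For context, a rough sketch of what that eager rewrite amounts to (illustrative only; the real TermInSetQuery#rewrite has more to it): a plain disjunction of TermQuery clauses, which never gets a chance to choose doc values per segment.

```java
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

final class SmallSetRewriteSketch {
  // With only a handful of terms, the query becomes a constant-score boolean disjunction,
  // so the postings-vs-doc-values decision is taken out of the query's hands entirely.
  static Query rewriteSmallSet(String field, List<BytesRef> terms) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (BytesRef term : terms) {
      builder.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
    }
    return new ConstantScoreQuery(builder.build());
  }
}
```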

@rmuir (Member) commented Feb 7, 2023

100%. The issue here is that TermInSetQuery gets rewritten to a BooleanQuery because there are fewer than 16 terms, so it doesn't have a chance to "self-optimize" to use doc values. We can fix this by not eagerly rewriting to a BooleanQuery, but I held off doing that for now. So this is "easily" fixable I think.

Well, right now I'm not seeing any justification to mix up concerns and pile everything into one query. Currently, overall, IndexOrDocValues has the best performance: it is slightly slower in one case but massively faster in another.

Also, there are some problems with the benchmark. I tried to wrestle with it, but there is some serious noise/GC/something going on: if I just rearrange the order of the tests, the numbers change dramatically.

I am still willing to optimize the one case where IndexOrDocValues might do better, but I honestly don't feel I need to "play defense" any more. I think we should use IndexOrDocValues.

@rmuir (Member) commented Feb 7, 2023

I spent my time here just to prevent code duplication and a mess, nothing more. I've got no skin in the game and couldn't care less about these stupid abusive "joins" that people do.

But let's please look at the current situation and say "we are gonna keep it clean" so I don't have to stay up so late anymore to prevent code mess. There is no advantage to piling all this shit in one query. breathe in, breathe out, let it go.

@gsmiller (Contributor Author) commented Feb 7, 2023

@rmuir thanks for the feedback and spending time having a look. I'm going to try summarizing where we've landed to make sure we're on the same page. I think we both agree on the following, but please let me know if I'm missing anything?

  1. Ideally we want to leverage IndexOrDocValues instead of putting custom solutions into queries themselves (e.g., we'd prefer not to have query implementations decide between postings / doc values). This decoupling allows more reuse and separate innovations in the separate queries (e.g., we can independently optimize TermInSetQuery and DocValuesTermsQuery). This also provides more reuse across various use-cases, as opposed to lots of different query implementations having to re-invent this postings vs. doc values logic.
  2. The only inputs IndexOrDocValues has for making a decision are 1) the "lead cost" and 2) the stated "cost" of the wrapped queries' ScorerSuppliers. TermInSetQuery may significantly over-estimate cost in its current implementation, as it only guarantees a cost ceiling by looking at field-level statistics.
  3. "Primary key" type cases work well with IndexOrDocValues since TermInSetQuery cost is able to recognize this situation and provide a better cost estimate.
  4. The case where IndexOrDocValues doesn't do a great job is when the indexed terms cover many documents in general, but the terms used in the specific query cover few.

Given this, I tend to agree that moving away from IndexOrDocValues to only solve one type of scenario (mentioned in #4 above) probably isn't the right decision. To move forward, I'm going to explore other ways to solve this case while relying on the IndexOrDocValues abstraction. I can think of a couple of ways to approach the problem:

  1. Alter the way we compute TermInSetQuery#cost. Instead of providing a ceiling, we could look at some sort of "average" cost (e.g., determine the average number of docs per term and multiply that out by the number of terms in the query). We could also term-seek up to some fixed number of terms up-front to get more information. This obviously adds cost but may be worth it (a rough sketch of the average-based estimate follows this list).
  2. While significantly more complex, if it's not sufficient to use field-level statistics and a single cost value, I may further explore the idea of a "cost iterator" in ScorerSupplier. This still feels too complex to me and I don't really want to go in this direction.
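
A rough sketch of the average-based estimate from option 1 above (illustrative only; clamped to the field's sumDocFreq):

```java
import java.io.IOException;
import org.apache.lucene.index.Terms;

final class AverageCostEstimateSketch {
  // Estimate cost as (average postings length per term) * (number of query terms) rather than
  // a ceiling derived purely from field-level statistics.
  static long estimateCost(Terms terms, int queryTermCount) throws IOException {
    long sumDocFreq = terms.getSumDocFreq();
    long termCount = terms.size(); // may be -1 if the codec doesn't track it
    if (termCount <= 0) {
      return sumDocFreq; // fall back to the ceiling when the statistic is unavailable
    }
    long avgPostingsPerTerm = Math.max(1, sumDocFreq / termCount);
    return Math.min(sumDocFreq, avgPostingsPerTerm * queryTermCount);
  }
}
```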

@gsmiller (Contributor Author):

Closing this out for now. I think my last comment provides a pretty good summary of where things landed here. I did experiment a bit with other ways of estimating cost within TermInSetQuery's score supplier, but couldn't come up with anything useful. The core problem remains that using field-level statistics just isn't granular enough to make the "right decision" in all cases between postings or doc values, so unless we do some term seeking, I don't see a way to improve it. Since there was significant push-back on doing this, I'll close out for now. Thanks again everyone!
