Use the new loadIntoBitSet API to speed up dense conjunctions. #14080
Conversation
Now that loading doc IDs into a bit set is much more efficient thanks to auto-vectorization, it has become tempting to evaluate dense conjunctions by and-ing bit sets.
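A minimal sketch of this idea, under stated assumptions: the window size (4096 here), the class and method names (`DenseConjunctionSketch`, `matchWindow`, `loadWindow`), and the naive per-doc loading loop are all illustrative; the actual scorer would materialize matches through the new `loadIntoBitSet` API rather than the stand-in loop below.

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

class DenseConjunctionSketch {

  static final int WINDOW_SIZE = 4096; // illustrative window size

  private final FixedBitSet windowMatches = new FixedBitSet(WINDOW_SIZE);
  private final FixedBitSet clauseWindowMatches = new FixedBitSet(WINDOW_SIZE);

  /** Computes matches of all clauses within [windowMin, windowMin + WINDOW_SIZE). */
  FixedBitSet matchWindow(List<DocIdSetIterator> clauses, int windowMin) throws IOException {
    int windowMax = windowMin + WINDOW_SIZE;

    // Materialize the first clause's matches for this window.
    loadWindow(clauses.get(0), windowMin, windowMax, windowMatches);

    // AND the remaining clauses in: 64 docs get intersected per long-word operation.
    for (int i = 1; i < clauses.size(); ++i) {
      loadWindow(clauses.get(i), windowMin, windowMax, clauseWindowMatches);
      windowMatches.and(clauseWindowMatches);
    }

    // Set bits are doc IDs relative to windowMin; a real bulk scorer would now replay
    // them through the collector.
    return windowMatches;
  }

  // Naive stand-in for the loadIntoBitSet API: set one bit per matching doc in the window.
  private static void loadWindow(
      DocIdSetIterator it, int windowMin, int windowMax, FixedBitSet bits) throws IOException {
    bits.clear(0, WINDOW_SIZE);
    int doc = it.docID() < windowMin ? it.advance(windowMin) : it.docID();
    for (; doc < windowMax; doc = it.nextDoc()) {
      bits.set(doc - windowMin);
    }
  }
}
```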
wikibigall on my AMD Ryzen 9 3900X:
On my Apple M3:
I also ran all queries from https://tantivy-search.github.io/bench/ and this change was often a big speedup (up to multiple times) and sometimes a small slowdown (< 10%).
I made the heuristic more conservative, results now look like this on my M3 after a few iterations:
private final FixedBitSet windowMatches = new FixedBitSet(WINDOW_SIZE);
private final FixedBitSet clauseWindowMatches = new FixedBitSet(WINDOW_SIZE);
private final DocIdStreamView docIdStreamView = new DocIdStreamView();
One concern I would have is memory, but it doesn't seem like this would use any more memory than our other bulk scorers, as it's restricted to WINDOW_SIZE.
Indeed, it only requires 2*WINDOW_SIZE bits = 1kB.
|| subs.get(Occur.SHOULD).size() <= 1
|| minShouldMatch > 1) {
|| minShouldMatch != 1) {
I don't understand the switch to allow minShouldMatch == 0. I would have thought this meant they aren't applied at all in the skipping logic?
Sorry, it's unrelated to this PR. This method is only called when minShouldMatch >= 1, so minShouldMatch > 1 and minShouldMatch != 1 are equivalent, but I liked that != 1 is more defensive.
Scorer filterScorer;
if (filters.size() == 1) {
filterScorer = filters.iterator().next();
if (scoreMode == ScoreMode.TOP_SCORES) {
Ah, it's possible now to have scoreMode.needsScores() == false. I suppose checking for TOP_SCORES or scoreMode.needsScores() is the same due to the check on line 307. It just took me a bit to connect the two.
You got it right, I'll add a comment.
/**
 * BulkScorer implementation of {@link ConjunctionScorer} that is specialized for dense clauses.
 * Whenever sensible, it intersects clauses by tran
Note to self: this comment got truncated, it seems.
This is great! It makes me wonder if I should try reviving a dense posting encoding I had played around with a while ago, where very-high-frequency terms would be encoded in the index using a bitset. If we had that, we could use those directly. Testing this is tricky because we don't really have such terms in luceneutil, but I believe they occur "in nature" - for example with enumerated values that have a small number of terms.
Nightly benchmarks picked up this change, the bump is pretty cool. :) https://benchmarks.mikemccandless.com/CountAndHighHigh.html I pushed an annotation.
Yes, this sounds like it could be interesting indeed! I wonder if we should make the decision on a per-block basis rather than for the whole postings, in order to benefit postings that are only dense on some specific ranges of the doc ID space (which can happen when using index sorting or recursive graph bisection). Another idea that crossed my mind would consist of storing doc IDs as deltas from the first doc ID in the block in a short[] to further take advantage of SIMD (when applicable).
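A tiny sketch of that last idea, purely illustrative (the class and method names are made up, and this is not Lucene's actual postings format): within a block, each doc ID is stored as an unsigned 16-bit delta from the block's first doc ID, which keeps the data compact and leaves a simple decode loop that is amenable to auto-vectorization.

```java
import java.util.Arrays;

class BlockDeltaSketch {

  /** Encodes a sorted block of doc IDs as 16-bit deltas from the block's first doc ID. */
  static short[] encode(int[] docIds) {
    int base = docIds[0];
    short[] deltas = new short[docIds.length];
    for (int i = 0; i < docIds.length; ++i) {
      int delta = docIds[i] - base;
      assert delta <= 0xFFFF : "block may not span more than 2^16 doc IDs";
      deltas[i] = (short) delta;
    }
    return deltas;
  }

  /** Decodes the block back into absolute doc IDs with a simple, vectorization-friendly loop. */
  static int[] decode(int base, short[] deltas) {
    int[] docIds = new int[deltas.length];
    for (int i = 0; i < deltas.length; ++i) {
      docIds[i] = base + (deltas[i] & 0xFFFF); // read the delta as unsigned
    }
    return docIds;
  }

  public static void main(String[] args) {
    int[] block = {100_000, 100_003, 100_004, 165_000};
    short[] deltas = encode(block);
    System.out.println(Arrays.toString(decode(100_000, deltas)));
    // prints [100000, 100003, 100004, 165000]
  }
}
```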
Wouldn't stop words qualify? E.g. I see that "the", "of" and "not" appear in 77%, 78% and 28% of documents respectively.
Agreed... I had previously started out with a global decision and then eventually had a per-block decision. I saw some speedups in artificial tests, but struggled to get wins in more typical workloads. I think there was some overhead from reading the "type" of a block and then branching to a specialized implementation? Not sure, but there has been a lot of change in that area now, maybe it will work out better.

Right, stop words! I had previously created some new field (Month), but conjunctions including stop words would probably be a good test.
Bit sets can be faster at advancing and more storage-efficient on dense blocks of postings. This is not a new idea, @mkhludnev proposed something similar a long time ago #6116. @msokolov recently brought up (#14080) that such an encoding has become especially appealing with the introduction of the `DocIdSetIterator#loadIntoBitSet` API, and the fact that non-scoring disjunctions and dense conjunctions now take advantage of it. Indeed, if postings are stored in a bit set, `#loadIntoBitSet` would just need to OR the postings bits into the bits that are used as an intermediate representation of matches of the query.
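A minimal sketch of what that could look like, assuming a hypothetical block format in which a dense block of postings is stored directly as a bit set (the class and method names below are made up, and block/window alignment is assumed for simplicity): loading such a block into the query's intermediate bit set reduces to a plain OR.

```java
import org.apache.lucene.util.FixedBitSet;

class BitSetPostingsBlockSketch {

  private final int firstDocId;        // first doc ID covered by this block
  private final FixedBitSet blockBits; // bit i set => doc (firstDocId + i) contains the term

  BitSetPostingsBlockSketch(int firstDocId, FixedBitSet blockBits) {
    this.firstDocId = firstDocId;
    this.blockBits = blockBits;
  }

  /**
   * ORs this block's matches into {@code windowMatches}, whose bit 0 corresponds to
   * doc ID {@code windowMin}. Assumes the block and the window are aligned.
   */
  void loadIntoBitSet(int windowMin, FixedBitSet windowMatches) {
    assert firstDocId == windowMin : "sketch assumes aligned blocks and windows";
    windowMatches.or(blockBits);
  }
}
```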