Encode dense blocks of postings as bit sets. #14133
Conversation
Bit sets can be faster at advancing and more storage-efficient on dense blocks of postings. This is not a new idea, @mkhludnev proposed something similar a long time ago apache#6116. @msokolov recently brought up (apache#14080) that such an encoding has become especially appealing with the introduction of the `DocIdSetIterator#loadIntoBitSet` API, and the fact that non-scoring disjunctions and dense conjunctions now take advantage of it. Indeed, if postings are stored in a bit set, `#loadIntoBitSet` would just need to OR the postings bits into the bits that are used as an intermediate representation of matches of the query.
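As a rough, self-contained sketch of that last point (plain `long[]` words instead of Lucene's `FixedBitSet`, and 64-bit-aligned offsets for simplicity; both are assumptions, not the PR's actual code): OR-ing a bit-set-encoded block into the accumulator that `#intoBitSet` fills reduces to a handful of word-wise ORs.

```java
// Sketch only: `blockBits` holds one full block of postings as a bit set
// (bit i <-> doc blockBase + i), `acc` is the accumulator bit set used as the
// intermediate representation of query matches (bit j <-> doc accBase + j).
// Assumes blockBase - accBase is a multiple of 64; real code has to handle
// arbitrary alignment.
static void orBlockIntoAccumulator(long[] blockBits, int blockBase, long[] acc, int accBase) {
  int wordOffset = (blockBase - accBase) >>> 6;
  for (int i = 0; i < blockBits.length; i++) {
    acc[wordOffset + i] |= blockBits[i];
  }
}
```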
Opening as a draft for now because I would like to change how deleted docs are applied with the … Here is what …
It's worth noting that some tasks that do not use the …
FWIW the test failure is due to a bug in the "slow" logic that gets applied when there are deleted docs, which I hope to remove soon.
…ntroduce `Bits#applyMask`. Most `DocIdSetIterator` implementations can no longer implement `#intoBitSet` efficiently as soon as there are live docs. So this commit removes this argument and instead introduces a new `Bits#applyMask` API that helps clear bits in a bit set when the corresponding doc ID is not live. Relates apache#14133
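Conceptually, the masking step works like the following self-contained sketch (plain arrays instead of Lucene's `Bits`/`FixedBitSet`; names and signature are illustrative, not the committed API): every bit whose doc is not live gets cleared.

```java
// Sketch only: `words` is a bit set of candidate matches where bit i of the
// set corresponds to doc (base + i); `live[doc]` says whether doc is live.
// Clearing the bits of deleted docs is what an applyMask-style helper does.
static void applyLiveDocsMask(boolean[] live, long[] words, int base) {
  for (int w = 0; w < words.length; w++) {
    long bits = words[w];
    while (bits != 0) {
      int i = Long.numberOfTrailingZeros(bits);
      if (live[base + (w << 6) + i] == false) {
        words[w] &= ~(1L << i); // doc is deleted: clear its bit
      }
      bits &= bits - 1; // advance to the next set bit of this word
    }
  }
}
```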
I also ran the benchmark from https://tantivy-search.github.io/bench/ to see if it gives similar feedback. For reference, the …
…ntroduce `Bits#applyMask`. (#14134) Most `DocIdSetIterator` implementations can no longer implement `#intoBitSet` efficiently as soon as there are live docs. So this commit removes this argument and instead introduces a new `Bits#applyMask` API that helps clear bits in a bit set when the corresponding doc ID is not live. Relates #14133
I merged the removal of the …
Thanks, LGTM, I just had a few questions for my education. Great improvements!
for (int l : ints) {
  or |= l;
}
assert or != 0;
Perhaps we could indicate this assumption in the javadoc -- it's not clear to me, at a glance, why this should be true. I guess this is only called with all the postings for a term or something?
forDeltaUtil.decodeAndPrefixSum(bitsPerValue, docInUtil, prevDocID, docBuffer);
encoding = DeltaEncoding.PACKED;
} else if (bitsPerValue == 0) {
  // dense block: 128 one bits
confusing -- since we set the bitset to all zeros?
I'm not sure what is confusing, `docBitSet.set(0, BLOCK_SIZE)` sets `BLOCK_SIZE` bits to true? I refactored a bit, hopefully it is clearer.
ah, sorry, I read it as setting the bits to zero, but that is wrong, thanks
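For readers following along, here is a minimal sketch of the case this thread is about (assuming, as the hunk suggests, that `bitsPerValue == 0` marks a full block whose 128 doc IDs are consecutive): the block decodes to `BLOCK_SIZE` consecutive one bits, not zeros.

```java
// Sketch only, using Lucene's FixedBitSet: a "dense" full block has deltas of
// exactly 1, so its bit-set form is simply BLOCK_SIZE consecutive set bits,
// where bit i represents doc (docBitSetBase + i).
import org.apache.lucene.util.FixedBitSet;

class DenseBlockSketch {
  static final int BLOCK_SIZE = 128;

  static FixedBitSet denseBlock() {
    FixedBitSet docBitSet = new FixedBitSet(BLOCK_SIZE);
    docBitSet.set(0, BLOCK_SIZE); // sets bits [0, BLOCK_SIZE) to true
    return docBitSet;
  }
}
```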
for (int i = 0; i < numLongs - 1; ++i) {
  docCumulativeWordPopCounts[i] = Long.bitCount(docBitSet.getBits()[i]);
}
for (int i = 1; i < numLongs - 1; ++i) {
sneaky!
Indeed. :) I added a comment to make it clearer.
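For context, a self-contained sketch of what the reviewed loops compute, including the "sneaky" last step (this assumes, as the PR does for bit-set blocks, that the block always holds exactly `BLOCK_SIZE` postings); presumably these cumulative counts let a doc's index within the block be recovered cheaply from its bit position.

```java
// Sketch only: cumulative[i] = number of set bits in words[0..i]. Because a
// bit-set block is always full, the last entry is known to be blockSize
// without popcounting the last word -- that is the shortcut under review.
static int[] cumulativeWordPopCounts(long[] words, int blockSize) {
  int[] cumulative = new int[words.length];
  for (int i = 0; i < words.length - 1; ++i) {
    cumulative[i] = Long.bitCount(words[i]);
  }
  for (int i = 1; i < words.length - 1; ++i) {
    cumulative[i] += cumulative[i - 1]; // prefix sum over the per-word counts
  }
  cumulative[words.length - 1] = blockSize;
  return cumulative;
}
```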
  docCumulativeWordPopCounts[i] += docCumulativeWordPopCounts[i - 1];
}
docCumulativeWordPopCounts[numLongs - 1] = BLOCK_SIZE;
assert docCumulativeWordPopCounts[numLongs - 2]
what happens if we have fewer than `BLOCK_SIZE` postings to encode? Do these go in a different encoding?
We only use the bit set encoding for "full" blocks. Tail blocks, which may have fewer than 128 doc IDs to record, keep using the current encoding that stores deltas using group-varint; they never use a bit set.
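To make that concrete, here is an illustrative (not literal) sketch of the writer-side decision; the density threshold and names are assumptions made for the sake of the example.

```java
// Sketch only: tail blocks never use a bit set; full blocks pick between
// packed deltas and a bit set depending on how dense the doc IDs are.
enum BlockEncoding { GROUP_VARINT, PACKED, UNARY }

static final int BLOCK_SIZE = 128;

static BlockEncoding chooseEncoding(int numDocs, int firstDoc, int lastDoc) {
  if (numDocs < BLOCK_SIZE) {
    return BlockEncoding.GROUP_VARINT; // tail block: keep the existing delta encoding
  }
  int range = lastDoc - firstDoc + 1;  // doc ID span covered by the full block
  if (range <= 8 * BLOCK_SIZE) {       // illustrative density threshold
    return BlockEncoding.UNARY;        // dense enough: encode as a bit set
  }
  return BlockEncoding.PACKED;         // sparse full block: packed deltas
}
```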
doc = docBuffer[docBufferUpto];
break;
case UNARY:
  int next = docBitSet.nextSetBit(doc - docBitSetBase + 1);
I wonder if there would be any benefit in maintaining `next` as a member variable that would always be `doc - docBitSetBase`. I guess it would save a single addition here, so maybe not worth it.
  s += i;
  spareBitSet.set(s);
}
level0Output.writeByte((byte) -numBitSetLongs);
I guess this is guaranteed to fit in a byte by limits on `BLOCK_SIZE`?
Indeed, I added a comment.
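Putting the hunks together, the header byte appears to work roughly like the sketch below (an assumption about intent based on the diffs above, not the PR's literal code): the sign of the byte tells the reader which decoder to use, and `numBitSetLongs` is small enough to fit in a signed byte, as discussed above.

```java
// Sketch only: one signed header byte per block disambiguates the encoding on
// the read side. Negative -> bit-set block storing -numBitSetLongs longs,
// zero -> dense block of 128 consecutive docs, positive -> bitsPerValue of
// packed deltas. Names are illustrative.
import java.io.IOException;
import org.apache.lucene.store.DataOutput;

class BlockHeaderSketch {
  static void writeHeader(DataOutput out, boolean bitSet, int numBitSetLongs, int bitsPerValue)
      throws IOException {
    if (bitSet) {
      out.writeByte((byte) -numBitSetLongs); // fits in a byte, as discussed above
    } else {
      out.writeByte((byte) bitsPerValue); // 0 means a fully dense block
    }
  }
}
```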
Looks like one of the checks failed with "> org.apache.lucene.index.CheckIndex$CheckIndexException: Field "vector" has repeated neighbors of node 2424 with value 2450" -- unrelated; sorry, I mean to get these cleared up soon!
Closes #6116