
Encode dense blocks of postings as bit sets. #14133

Merged: 5 commits into apache:main, Jan 14, 2025

Conversation

jpountz (Contributor) commented Jan 13, 2025

Bit sets can be faster at advancing and more storage-efficient on dense blocks of postings. This is not a new idea: @mkhludnev proposed something similar a long time ago in #6116.

@msokolov recently brought up (#14080) that such an encoding has become especially appealing with the introduction of the DocIdSetIterator#loadIntoBitSet API, and the fact that non-scoring disjunctions and dense conjunctions now take advantage of it. Indeed, if postings are stored in a bit set, #loadIntoBitSet would just need to OR the postings bits into the bits that are used as an intermediate representation of matches of the query.

Closes #6116
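
To make the idea concrete, here is a minimal sketch (illustrative names and word-alignment assumption, not the PR's actual code) of how loading a bit-set-encoded block into the query's intermediate bit set can reduce to ORing longs:

// Hypothetical sketch: OR a postings block stored as a bit set into the
// destination bits that represent candidate matches of the query. Assumes the
// block's base doc ID and the destination's base doc ID are both multiples of
// 64, so the copy is a word-wise OR with no shifting.
static void orBlockIntoBitSet(long[] blockBits, int blockBase, long[] destBits, int destBase) {
  int wordOffset = (blockBase - destBase) >> 6; // 64 bits per long
  for (int i = 0; i < blockBits.length; ++i) {
    destBits[wordOffset + i] |= blockBits[i];
  }
}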

jpountz (Contributor, Author) commented Jan 13, 2025

Opening as a draft for now because I would like to change how deleted docs are applied with the #loadIntoBitSet API. As things stand today, a single deleted doc in a segment would completely cancel the speedup.

Here is what luceneutil reports on wikibigall:

                            Task   QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff   p-value
                   TermTitleSort      153.45      (2.6%)      146.17      (2.1%)   -4.7% (  -9% -    0%) 0.000
                         Prefix3      140.30      (4.3%)      134.27      (3.1%)   -4.3% ( -11% -    3%) 0.008
                     OrStopWords       34.73      (7.9%)       33.26      (8.9%)   -4.2% ( -19% -   13%) 0.239
                 FilteredPrefix3      133.93      (4.2%)      128.54      (3.0%)   -4.0% ( -10% -    3%) 0.009
                            Term      487.39      (2.8%)      470.53      (4.9%)   -3.5% ( -10% -    4%) 0.042
                      DismaxTerm      584.95      (2.1%)      568.87      (3.8%)   -2.7% (  -8% -    3%) 0.037
                        Or3Terms      172.20      (4.7%)      167.58      (5.1%)   -2.7% ( -11% -    7%) 0.203
                      OrHighHigh       54.62      (6.3%)       53.22      (4.4%)   -2.6% ( -12% -    8%) 0.270
                        Wildcard       78.76      (3.6%)       77.01      (3.3%)   -2.2% (  -8% -    4%) 0.131
                 AndHighOrMedMed       44.90      (1.0%)       43.91      (1.2%)   -2.2% (  -4% -    0%) 0.000
                          OrMany       19.43      (2.7%)       19.03      (4.4%)   -2.1% (  -8% -    5%) 0.184
                      TermDTSort      286.84      (7.8%)      281.19      (5.9%)   -2.0% ( -14% -   12%) 0.505
                     AndHighHigh       44.74      (1.5%)       43.87      (2.4%)   -2.0% (  -5% -    1%) 0.022
                      OrHighRare      278.82      (6.4%)      273.59      (7.8%)   -1.9% ( -15% -   13%) 0.537
                          Fuzzy1       81.43      (2.6%)       80.06      (2.1%)   -1.7% (  -6% -    3%) 0.101
                    CombinedTerm       31.73      (2.2%)       31.20      (2.5%)   -1.7% (  -6% -    3%) 0.105
                       And3Terms      173.65      (3.4%)      170.79      (3.9%)   -1.6% (  -8% -    5%) 0.289
                  FilteredOrMany       16.75      (1.4%)       16.48      (2.8%)   -1.6% (  -5% -    2%) 0.098
                   TermMonthSort     3380.63      (3.0%)     3329.65      (2.1%)   -1.5% (  -6% -    3%) 0.175
              CombinedOrHighHigh       19.05      (1.8%)       18.77      (1.5%)   -1.5% (  -4% -    1%) 0.036
                    AndStopWords       31.57      (4.1%)       31.11      (6.6%)   -1.5% ( -11% -    9%) 0.532
                          Fuzzy2       76.49      (2.2%)       75.47      (1.8%)   -1.3% (  -5% -    2%) 0.117
                      AndHighMed      128.95      (1.2%)      127.28      (2.9%)   -1.3% (  -5% -    2%) 0.173
              Or2Terms2StopWords      162.86      (4.8%)      161.03      (5.2%)   -1.1% ( -10% -    9%) 0.601
                     CountPhrase        4.18      (1.6%)        4.14      (7.7%)   -1.0% ( -10% -    8%) 0.684
                        PKLookup      278.86      (2.4%)      276.45      (1.5%)   -0.9% (  -4% -    3%) 0.308
                DismaxOrHighHigh      119.43      (4.4%)      118.55      (4.0%)   -0.7% (  -8% -    8%) 0.682
               FilteredAnd3Terms      192.64      (2.2%)      191.75      (2.1%)   -0.5% (  -4% -    3%) 0.619
             And2Terms2StopWords      161.87      (3.3%)      161.43      (3.4%)   -0.3% (  -6% -    6%) 0.848
                FilteredOr3Terms      164.22      (1.5%)      163.83      (1.1%)   -0.2% (  -2% -    2%) 0.673
             CombinedAndHighHigh       15.26      (1.9%)       15.23      (1.9%)   -0.2% (  -4% -    3%) 0.774
                  FilteredIntNRQ      110.08     (12.4%)      109.86     (13.6%)   -0.2% ( -23% -   29%) 0.971
               CombinedOrHighMed       71.97      (1.9%)       71.85      (1.7%)   -0.2% (  -3% -    3%) 0.827
                       CountTerm     9414.23      (5.4%)     9409.07      (4.3%)   -0.1% (  -9% -   10%) 0.979
                          IntNRQ      110.97     (11.7%)      111.34     (13.8%)    0.3% ( -22% -   29%) 0.951
              FilteredAndHighMed      128.34      (2.7%)      129.86      (2.8%)    1.2% (  -4% -    6%) 0.312
              CombinedAndHighMed       55.25      (1.8%)       56.01      (2.0%)    1.4% (  -2% -    5%) 0.086
               FilteredOrHighMed      152.30      (1.4%)      154.67      (1.3%)    1.6% (  -1% -    4%) 0.006
      FilteredOr2Terms2StopWords      146.27      (1.8%)      148.57      (1.2%)    1.6% (  -1% -    4%) 0.016
                       OrHighMed      196.36      (5.1%)      199.53      (3.6%)    1.6% (  -6% -   10%) 0.389
                 DismaxOrHighMed      170.18      (3.3%)      173.26      (2.6%)    1.8% (  -4% -    8%) 0.158
                          Phrase       14.72      (5.4%)       15.09      (5.4%)    2.5% (  -7% -   14%) 0.278
     FilteredAnd2Terms2StopWords      194.56      (1.6%)      200.16      (1.9%)    2.9% (   0% -    6%) 0.000
                    FilteredTerm      154.40      (1.7%)      159.20      (1.7%)    3.1% (   0% -    6%) 0.000
               TermDayOfYearSort      628.53      (4.8%)      657.48      (4.5%)    4.6% (  -4% -   14%) 0.021
              FilteredOrHighHigh       64.07      (1.8%)       67.55      (2.1%)    5.4% (   1% -    9%) 0.000
             CountFilteredPhrase       24.42      (1.8%)       26.13      (2.8%)    7.0% (   2% -   11%) 0.000
             FilteredOrStopWords       43.16      (1.9%)       46.70      (2.5%)    8.2% (   3% -   12%) 0.000
                AndMedOrHighHigh       60.21      (1.7%)       66.11      (2.1%)    9.8% (   5% -   13%) 0.000
             FilteredAndHighHigh       61.91      (1.7%)       68.38      (2.2%)   10.5% (   6% -   14%) 0.000
                  FilteredPhrase       29.38      (1.1%)       32.96      (2.0%)   12.2% (   9% -   15%) 0.000
            FilteredAndStopWords       47.21      (1.6%)       54.89      (2.5%)   16.3% (  12% -   20%) 0.000
                 CountAndHighMed      238.97      (2.3%)      294.13      (2.8%)   23.1% (  17% -   28%) 0.000
          CountFilteredOrHighMed       88.21      (0.9%)      116.76      (0.8%)   32.4% (  30% -   34%) 0.000
         CountFilteredOrHighHigh       71.30      (1.2%)      105.22      (1.2%)   47.6% (  44% -   50%) 0.000
                  CountOrHighMed      189.16      (2.3%)      343.97      (4.2%)   81.8% (  73% -   90%) 0.000
             CountFilteredOrMany       11.05      (2.3%)       24.68      (4.3%)  123.2% ( 113% -  132%) 0.000
                CountAndHighHigh      132.06      (2.4%)      295.79      (5.1%)  124.0% ( 113% -  134%) 0.000
                 CountOrHighHigh      123.04      (2.4%)      279.03      (5.1%)  126.8% ( 116% -  137%) 0.000
                     CountOrMany       11.67      (1.4%)       28.06      (5.9%)  140.4% ( 131% -  149%) 0.000

jpountz (Contributor, Author) commented Jan 13, 2025

It's worth noting that some tasks that do not use the loadIntoBitSet API also report a speedup: FilteredAndStopWords (+16%), FilteredPhrase (+12%), FilteredAndHighHigh (+10%), AndMedOrHighHigh (+10%), FilteredOrStopWords (+8%), CountFilteredPhrase (+7%), FilteredOrHighHigh (+5%), FilteredTerm (+3%), FilteredAnd2Terms2StopWords (+3%).

jpountz (Contributor, Author) commented Jan 13, 2025

FWIW the test failure is due to a bug in the "slow" logic that gets applied when there are deleted docs, which I hope to remove soon.

jpountz added a commit to jpountz/lucene that referenced this pull request Jan 13, 2025
…ntroduce `Bits#applyMask`.

Most `DocIdSetIterator` implementations can no longer implement `#intoBitSet`
efficiently as soon as there are live docs. So this commit removes this argument
and instead introduces a new `Bits#applyMask` API that helps clear bits in a
bit set when the corresponding doc ID is not live.

Relates apache#14133
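
As a rough illustration of the semantics described in this commit message, here is a hedged sketch (java.util.BitSet stands in for Lucene's bit set; this is not the actual #14134 code):

// Sketch of an applyMask-style default method on a Bits-like interface: clear
// every bit whose corresponding doc ID (offset + bit index) is not live.
interface LiveBits {
  boolean get(int index);

  default void applyMask(java.util.BitSet bitSet, int offset) {
    for (int i = bitSet.nextSetBit(0); i >= 0; i = bitSet.nextSetBit(i + 1)) {
      if (get(offset + i) == false) {
        bitSet.clear(i); // doc offset + i is deleted: drop it from the matches
      }
    }
  }
}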
jpountz (Contributor, Author) commented Jan 14, 2025

I also ran the benchmark from https://tantivy-search.github.io/bench/ to see if it gives similar feedback. For reference, "global" queries in this benchmark are conjunctions and disjunctions. I like the results. The TOP_100 collection type mostly sees an improvement to its P99, which maps to queries that include stop words, which can now advance faster thanks to this new bit set encoding.

[Figure: search_bench_top_100, TOP_100 latency percentiles]

The COUNT collection type sees a big improvement to its P90 and a huge improvement to its P99. The combination of vectorized loading of doc IDs into a bit set via #loadIntoBitSet (which the lucene-10.0.0 engine doesn't have either) and this new encoding for terms that have dense postings is helping a lot.

[Figure: search_bench_count, COUNT latency percentiles]

jpountz added a commit that referenced this pull request Jan 14, 2025
…ntroduce `Bits#applyMask`. (#14134)
jpountz marked this pull request as ready for review January 14, 2025 12:18
jpountz (Contributor, Author) commented Jan 14, 2025

I merged the removal of the acceptDocs parameter to intoBitSet, so this is now ready for review.

msokolov (Contributor) left a comment:
Thanks, LGTM, I just had a few questions for my education. Great improvements!

for (int l : ints) {
  or |= l;
}
assert or != 0;
Contributor:
Perhaps we could indicate this assumption in the javadoc -- it's not clear to me, at a glance, why this should be true. I guess this is only called with all the postings for a term or something?

  forDeltaUtil.decodeAndPrefixSum(bitsPerValue, docInUtil, prevDocID, docBuffer);
  encoding = DeltaEncoding.PACKED;
} else if (bitsPerValue == 0) {
  // dense block: 128 one bits
Contributor:
confusing -- since we set the bitset to all zeros?

Contributor Author:
I'm not sure what is confusing: docBitSet.set(0, BLOCK_SIZE) sets BLOCK_SIZE bits to true. I refactored a bit, hopefully it is clearer.

Contributor:
ah, sorry, I read it as set the bits to zero, but that is wrong, thanks

for (int i = 0; i < numLongs - 1; ++i) {
  docCumulativeWordPopCounts[i] = Long.bitCount(docBitSet.getBits()[i]);
}
for (int i = 1; i < numLongs - 1; ++i) {
Contributor:
sneaky!

Contributor Author:
Indeed. :) I added a comment to make it clearer.
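
For readers following along: the cumulative word popcounts answer rank queries in constant time, i.e. given the position of a set bit, they yield the index of that posting within the block without scanning earlier words. A hedged sketch with illustrative names (not the PR's exact code):

// Index of the posting at bitIndex within the block, computed as the number
// of set bits strictly below bitIndex. cumulativePopCounts[w] holds the total
// popcount of words 0..w of the block's bit set.
static int rank(long[] bits, int[] cumulativePopCounts, int bitIndex) {
  int word = bitIndex >> 6;
  int before = word == 0 ? 0 : cumulativePopCounts[word - 1];
  long mask = (1L << (bitIndex & 63)) - 1; // bits below bitIndex within its word
  return before + Long.bitCount(bits[word] & mask);
}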

  docCumulativeWordPopCounts[i] += docCumulativeWordPopCounts[i - 1];
}
docCumulativeWordPopCounts[numLongs - 1] = BLOCK_SIZE;
assert docCumulativeWordPopCounts[numLongs - 2]
Contributor:
what happens if we have fewer than BLOCK_SIZE postings to encode? Do these go in a different encoding?

Contributor Author:
We only use the bit set encoding for "full" blocks. Tail blocks, which may have fewer than 128 doc IDs to record, keep using the current encoding that stores deltas using group-varint; they never use a bit set. A simplified sketch of this decision follows.
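
A hedged sketch of the writer-side decision just described (the density threshold is an assumption for illustration, not the PR's exact condition):

static final int BLOCK_SIZE = 128;

// Only "full" blocks of exactly BLOCK_SIZE docs are candidates for the bit
// set encoding, and only when they are dense enough that the bit set is
// cheaper than packed deltas. Tail blocks keep group-varint deltas.
static boolean useBitSetEncoding(int numDocs, int firstDoc, int lastDoc) {
  if (numDocs != BLOCK_SIZE) {
    return false; // tail block
  }
  int numLongs = (lastDoc - firstDoc) / 64 + 1; // longs the bit set would need
  return numLongs <= 32; // density threshold (assumed for illustration)
}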

    doc = docBuffer[docBufferUpto];
    break;
  case UNARY:
    int next = docBitSet.nextSetBit(doc - docBitSetBase + 1);
Contributor:
I wonder if there would be any benefit in maintaining next as a member variable that would always be doc - docBitSetBase. I guess it would save a single addition here, so maybe not worth it.
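
For context, advancing within a bit-set-encoded block boils down to a nextSetBit call relative to the block's base doc ID. A minimal sketch with illustrative names (java.util.BitSet standing in for Lucene's bit set):

// Next doc after `doc` within a bit-set-encoded block, or -1 if the block is
// exhausted. docBitSetBase is the doc ID that bit 0 of the bit set maps to.
static int nextDocInBlock(java.util.BitSet docBitSet, int docBitSetBase, int doc) {
  int next = docBitSet.nextSetBit(doc - docBitSetBase + 1);
  return next < 0 ? -1 : docBitSetBase + next;
}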

  s += i;
  spareBitSet.set(s);
}
level0Output.writeByte((byte) -numBitSetLongs);
Contributor:
I guess this is guaranteed to fit in a byte by limits on BLOCK_SIZE?

Contributor Author:
Indeed, I added a comment.
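
For readers: the trick under discussion is to use one signed byte as a tag, where a negative value signals the bit set encoding and its magnitude is the number of longs, while non-negative values keep their existing meaning for delta-encoded blocks. A hedged sketch of the reader side (assumed semantics, not the PR's exact code):

static void readBlockHeader(java.io.DataInput in) throws java.io.IOException {
  byte token = in.readByte();
  if (token < 0) {
    int numBitSetLongs = -token; // bit-set-encoded block with this many longs
    // ... read numBitSetLongs longs of bit set data ...
  } else {
    // ... delta-encoded block: token keeps its existing meaning ...
  }
}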

msokolov (Contributor) commented:
Looks like one of the checks failed with "org.apache.lucene.index.CheckIndex$CheckIndexException: Field 'vector' has repeated neighbors of node 2424 with value 2450" -- unrelated; sorry, I mean to get these cleared up soon!

jpountz merged commit 26e5a8d into apache:main on Jan 14, 2025
2 of 5 checks passed
jpountz deleted the bitset_block branch January 14, 2025 18:02
jpountz added this to the 10.2.0 milestone Jan 14, 2025
jpountz added a commit that referenced this pull request Jan 14, 2025
Successfully merging this pull request may close these issues:

bitset codec for off heap filters [LUCENE-5052]