
Encode dense blocks of postings as bit sets. #14133

Merged: 5 commits into apache:main, Jan 14, 2025

Conversation

jpountz (Contributor) commented Jan 13, 2025

Bit sets can be faster at advancing and more storage-efficient on dense blocks of postings. This is not a new idea: @mkhludnev proposed something similar a long time ago in #6116.

@msokolov recently brought up (#14080) that such an encoding has become especially appealing with the introduction of the DocIdSetIterator#loadIntoBitSet API, and the fact that non-scoring disjunctions and dense conjunctions now take advantage of it. Indeed, if postings are stored in a bit set, #loadIntoBitSet would just need to OR the postings bits into the bits that are used as an intermediate representation of matches of the query.

Closes #6116
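
To make the idea concrete, here is a minimal sketch (illustrative names and word-alignment assumption, not the PR's actual code) of how loading a bit-set-encoded block into the query's intermediate bit set can reduce to ORing longs:

// Hypothetical sketch: OR a postings block stored as a bit set into the
// destination bits that represent candidate matches of the query. Assumes the
// block's base doc ID and the destination's base doc ID are both multiples of
// 64, so the copy is a word-wise OR with no shifting.
static void orBlockIntoBitSet(long[] blockBits, int blockBase, long[] destBits, int destBase) {
  int wordOffset = (blockBase - destBase) >> 6; // 64 bits per long
  for (int i = 0; i < blockBits.length; ++i) {
    destBits[wordOffset + i] |= blockBits[i];
  }
}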

jpountz (Contributor, Author) commented Jan 13, 2025

Opening as a draft for now because I would like to change how deleted docs are applied with the #loadIntoBitSet API. As things stand today, a single deleted doc in a segment would completely cancel the speedup.

Here is what luceneutil reports on wikibigall:

                            Task   QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff   p-value
                   TermTitleSort      153.45      (2.6%)      146.17      (2.1%)   -4.7% (  -9% -    0%) 0.000
                         Prefix3      140.30      (4.3%)      134.27      (3.1%)   -4.3% ( -11% -    3%) 0.008
                     OrStopWords       34.73      (7.9%)       33.26      (8.9%)   -4.2% ( -19% -   13%) 0.239
                 FilteredPrefix3      133.93      (4.2%)      128.54      (3.0%)   -4.0% ( -10% -    3%) 0.009
                            Term      487.39      (2.8%)      470.53      (4.9%)   -3.5% ( -10% -    4%) 0.042
                      DismaxTerm      584.95      (2.1%)      568.87      (3.8%)   -2.7% (  -8% -    3%) 0.037
                        Or3Terms      172.20      (4.7%)      167.58      (5.1%)   -2.7% ( -11% -    7%) 0.203
                      OrHighHigh       54.62      (6.3%)       53.22      (4.4%)   -2.6% ( -12% -    8%) 0.270
                        Wildcard       78.76      (3.6%)       77.01      (3.3%)   -2.2% (  -8% -    4%) 0.131
                 AndHighOrMedMed       44.90      (1.0%)       43.91      (1.2%)   -2.2% (  -4% -    0%) 0.000
                          OrMany       19.43      (2.7%)       19.03      (4.4%)   -2.1% (  -8% -    5%) 0.184
                      TermDTSort      286.84      (7.8%)      281.19      (5.9%)   -2.0% ( -14% -   12%) 0.505
                     AndHighHigh       44.74      (1.5%)       43.87      (2.4%)   -2.0% (  -5% -    1%) 0.022
                      OrHighRare      278.82      (6.4%)      273.59      (7.8%)   -1.9% ( -15% -   13%) 0.537
                          Fuzzy1       81.43      (2.6%)       80.06      (2.1%)   -1.7% (  -6% -    3%) 0.101
                    CombinedTerm       31.73      (2.2%)       31.20      (2.5%)   -1.7% (  -6% -    3%) 0.105
                       And3Terms      173.65      (3.4%)      170.79      (3.9%)   -1.6% (  -8% -    5%) 0.289
                  FilteredOrMany       16.75      (1.4%)       16.48      (2.8%)   -1.6% (  -5% -    2%) 0.098
                   TermMonthSort     3380.63      (3.0%)     3329.65      (2.1%)   -1.5% (  -6% -    3%) 0.175
              CombinedOrHighHigh       19.05      (1.8%)       18.77      (1.5%)   -1.5% (  -4% -    1%) 0.036
                    AndStopWords       31.57      (4.1%)       31.11      (6.6%)   -1.5% ( -11% -    9%) 0.532
                          Fuzzy2       76.49      (2.2%)       75.47      (1.8%)   -1.3% (  -5% -    2%) 0.117
                      AndHighMed      128.95      (1.2%)      127.28      (2.9%)   -1.3% (  -5% -    2%) 0.173
              Or2Terms2StopWords      162.86      (4.8%)      161.03      (5.2%)   -1.1% ( -10% -    9%) 0.601
                     CountPhrase        4.18      (1.6%)        4.14      (7.7%)   -1.0% ( -10% -    8%) 0.684
                        PKLookup      278.86      (2.4%)      276.45      (1.5%)   -0.9% (  -4% -    3%) 0.308
                DismaxOrHighHigh      119.43      (4.4%)      118.55      (4.0%)   -0.7% (  -8% -    8%) 0.682
               FilteredAnd3Terms      192.64      (2.2%)      191.75      (2.1%)   -0.5% (  -4% -    3%) 0.619
             And2Terms2StopWords      161.87      (3.3%)      161.43      (3.4%)   -0.3% (  -6% -    6%) 0.848
                FilteredOr3Terms      164.22      (1.5%)      163.83      (1.1%)   -0.2% (  -2% -    2%) 0.673
             CombinedAndHighHigh       15.26      (1.9%)       15.23      (1.9%)   -0.2% (  -4% -    3%) 0.774
                  FilteredIntNRQ      110.08     (12.4%)      109.86     (13.6%)   -0.2% ( -23% -   29%) 0.971
               CombinedOrHighMed       71.97      (1.9%)       71.85      (1.7%)   -0.2% (  -3% -    3%) 0.827
                       CountTerm     9414.23      (5.4%)     9409.07      (4.3%)   -0.1% (  -9% -   10%) 0.979
                          IntNRQ      110.97     (11.7%)      111.34     (13.8%)    0.3% ( -22% -   29%) 0.951
              FilteredAndHighMed      128.34      (2.7%)      129.86      (2.8%)    1.2% (  -4% -    6%) 0.312
              CombinedAndHighMed       55.25      (1.8%)       56.01      (2.0%)    1.4% (  -2% -    5%) 0.086
               FilteredOrHighMed      152.30      (1.4%)      154.67      (1.3%)    1.6% (  -1% -    4%) 0.006
      FilteredOr2Terms2StopWords      146.27      (1.8%)      148.57      (1.2%)    1.6% (  -1% -    4%) 0.016
                       OrHighMed      196.36      (5.1%)      199.53      (3.6%)    1.6% (  -6% -   10%) 0.389
                 DismaxOrHighMed      170.18      (3.3%)      173.26      (2.6%)    1.8% (  -4% -    8%) 0.158
                          Phrase       14.72      (5.4%)       15.09      (5.4%)    2.5% (  -7% -   14%) 0.278
     FilteredAnd2Terms2StopWords      194.56      (1.6%)      200.16      (1.9%)    2.9% (   0% -    6%) 0.000
                    FilteredTerm      154.40      (1.7%)      159.20      (1.7%)    3.1% (   0% -    6%) 0.000
               TermDayOfYearSort      628.53      (4.8%)      657.48      (4.5%)    4.6% (  -4% -   14%) 0.021
              FilteredOrHighHigh       64.07      (1.8%)       67.55      (2.1%)    5.4% (   1% -    9%) 0.000
             CountFilteredPhrase       24.42      (1.8%)       26.13      (2.8%)    7.0% (   2% -   11%) 0.000
             FilteredOrStopWords       43.16      (1.9%)       46.70      (2.5%)    8.2% (   3% -   12%) 0.000
                AndMedOrHighHigh       60.21      (1.7%)       66.11      (2.1%)    9.8% (   5% -   13%) 0.000
             FilteredAndHighHigh       61.91      (1.7%)       68.38      (2.2%)   10.5% (   6% -   14%) 0.000
                  FilteredPhrase       29.38      (1.1%)       32.96      (2.0%)   12.2% (   9% -   15%) 0.000
            FilteredAndStopWords       47.21      (1.6%)       54.89      (2.5%)   16.3% (  12% -   20%) 0.000
                 CountAndHighMed      238.97      (2.3%)      294.13      (2.8%)   23.1% (  17% -   28%) 0.000
          CountFilteredOrHighMed       88.21      (0.9%)      116.76      (0.8%)   32.4% (  30% -   34%) 0.000
         CountFilteredOrHighHigh       71.30      (1.2%)      105.22      (1.2%)   47.6% (  44% -   50%) 0.000
                  CountOrHighMed      189.16      (2.3%)      343.97      (4.2%)   81.8% (  73% -   90%) 0.000
             CountFilteredOrMany       11.05      (2.3%)       24.68      (4.3%)  123.2% ( 113% -  132%) 0.000
                CountAndHighHigh      132.06      (2.4%)      295.79      (5.1%)  124.0% ( 113% -  134%) 0.000
                 CountOrHighHigh      123.04      (2.4%)      279.03      (5.1%)  126.8% ( 116% -  137%) 0.000
                     CountOrMany       11.67      (1.4%)       28.06      (5.9%)  140.4% ( 131% -  149%) 0.000

jpountz (Contributor, Author) commented Jan 13, 2025

It's worth noting that some tasks that do not use the loadIntoBitSet API also report a speedup: FilteredAndStopWords (+16%), FilteredPhrase (+12%), FilteredAndHighHigh (+10%), AndMedOrHighHigh (+10%), FilteredOrStopWords (+8%), CountFilteredPhrase (+7%), FilteredOrHighHigh (+5%), FilteredTerm (+3%), FilteredAnd2Terms2StopWords (+3%).

jpountz (Contributor, Author) commented Jan 13, 2025

FWIW the test failure is due to a bug in the "slow" logic that gets applied when there are deleted docs, which I hope to remove soon.

jpountz added a commit to jpountz/lucene that referenced this pull request Jan 13, 2025
…ntroduce `Bits#applyMask`.

Most `DocIdSetIterator` implementations can no longer implement `#intoBitSet`
efficiently as soon as there are live docs. So this commit removes this argument
and instead introduces a new `Bits#applyMask` API that helps clear bits in a
bit set when the corresponding doc ID is not live.

Relates apache#14133
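
As a rough illustration of the semantics described in this commit message, here is a hedged sketch (java.util.BitSet stands in for Lucene's bit set; this is not the actual #14134 code):

// Sketch of an applyMask-style default method on a Bits-like interface: clear
// every bit whose corresponding doc ID (offset + bit index) is not live.
interface LiveBits {
  boolean get(int index);

  default void applyMask(java.util.BitSet bitSet, int offset) {
    for (int i = bitSet.nextSetBit(0); i >= 0; i = bitSet.nextSetBit(i + 1)) {
      if (get(offset + i) == false) {
        bitSet.clear(i); // doc offset + i is deleted: drop it from the matches
      }
    }
  }
}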
jpountz (Contributor, Author) commented Jan 14, 2025

I also ran the benchmark from https://tantivy-search.github.io/bench/ to see if it gives similar feedback. For reference, "global" queries in this benchmark are conjunctions and disjunctions. I like the results. The TOP_100 collection type mostly sees an improvement to its P99, which maps to queries that include stop words, which can now advance faster thanks to this new bit set encoding.

[Figure: search_bench_top_100, TOP_100 latency percentiles]

The COUNT collection type sees a big improvement to its P90 and a huge improvement to its P99. The combination of vectorized loading of doc IDs into a bit set via #loadIntoBitSet (which the lucene-10.0.0 engine doesn't have either) and this new encoding for terms that have dense postings is helping a lot.

[Figure: search_bench_count, COUNT latency percentiles]

jpountz added a commit that referenced this pull request Jan 14, 2025
…ntroduce `Bits#applyMask`. (#14134)
jpountz marked this pull request as ready for review January 14, 2025 12:18
jpountz (Contributor, Author) commented Jan 14, 2025

I merged the removal of the acceptDocs parameter to intoBitSet, so this is now ready for review.

msokolov (Contributor) left a comment:
Thanks, LGTM, I just had a few questions for my education. Great improvements!

for (int l : ints) {
  or |= l;
}
assert or != 0;
Contributor:
Perhaps we could indicate this assumption in the javadoc -- it's not clear to me, at a glance, why this should be true. I guess this is only called with all the postings for a term or something?

  forDeltaUtil.decodeAndPrefixSum(bitsPerValue, docInUtil, prevDocID, docBuffer);
  encoding = DeltaEncoding.PACKED;
} else if (bitsPerValue == 0) {
  // dense block: 128 one bits
Contributor:
confusing -- since we set the bitset to all zeros?

Contributor Author:
I'm not sure what is confusing: docBitSet.set(0, BLOCK_SIZE) sets BLOCK_SIZE bits to true. I refactored a bit, hopefully it is clearer.

Contributor:
ah, sorry, I read it as set the bits to zero, but that is wrong, thanks

for (int i = 0; i < numLongs - 1; ++i) {
  docCumulativeWordPopCounts[i] = Long.bitCount(docBitSet.getBits()[i]);
}
for (int i = 1; i < numLongs - 1; ++i) {
Contributor:
sneaky!

Contributor Author:
Indeed. :) I added a comment to make it clearer.
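
For readers following along: the cumulative word popcounts answer rank queries in constant time, i.e. given the position of a set bit, they yield the index of that posting within the block without scanning earlier words. A hedged sketch with illustrative names (not the PR's exact code):

// Index of the posting at bitIndex within the block, computed as the number
// of set bits strictly below bitIndex. cumulativePopCounts[w] holds the total
// popcount of words 0..w of the block's bit set.
static int rank(long[] bits, int[] cumulativePopCounts, int bitIndex) {
  int word = bitIndex >> 6;
  int before = word == 0 ? 0 : cumulativePopCounts[word - 1];
  long mask = (1L << (bitIndex & 63)) - 1; // bits below bitIndex within its word
  return before + Long.bitCount(bits[word] & mask);
}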

  docCumulativeWordPopCounts[i] += docCumulativeWordPopCounts[i - 1];
}
docCumulativeWordPopCounts[numLongs - 1] = BLOCK_SIZE;
assert docCumulativeWordPopCounts[numLongs - 2]
Contributor:
what happens if we have fewer than BLOCK_SIZE postings to encode? Do these go in a different encoding?

Contributor Author:
We only use the bit set encoding for "full" blocks. Tail blocks, which may have fewer than 128 doc IDs to record, keep using the current encoding that stores deltas using group-varint; they never use a bit set. A simplified sketch of this decision follows.
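
A hedged sketch of the writer-side decision just described (the density threshold is an assumption for illustration, not the PR's exact condition):

static final int BLOCK_SIZE = 128;

// Only "full" blocks of exactly BLOCK_SIZE docs are candidates for the bit
// set encoding, and only when they are dense enough that the bit set is
// cheaper than packed deltas. Tail blocks keep group-varint deltas.
static boolean useBitSetEncoding(int numDocs, int firstDoc, int lastDoc) {
  if (numDocs != BLOCK_SIZE) {
    return false; // tail block
  }
  int numLongs = (lastDoc - firstDoc) / 64 + 1; // longs the bit set would need
  return numLongs <= 32; // density threshold (assumed for illustration)
}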

    doc = docBuffer[docBufferUpto];
    break;
  case UNARY:
    int next = docBitSet.nextSetBit(doc - docBitSetBase + 1);
Contributor:
I wonder if there would be any benefit in maintaining next as a member variable that would always be doc - docBitSetBase. I guess it would save a single addition here, so maybe not worth it.
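
For context, advancing within a bit-set-encoded block boils down to a nextSetBit call relative to the block's base doc ID. A minimal sketch with illustrative names (java.util.BitSet standing in for Lucene's bit set):

// Next doc after `doc` within a bit-set-encoded block, or -1 if the block is
// exhausted. docBitSetBase is the doc ID that bit 0 of the bit set maps to.
static int nextDocInBlock(java.util.BitSet docBitSet, int docBitSetBase, int doc) {
  int next = docBitSet.nextSetBit(doc - docBitSetBase + 1);
  return next < 0 ? -1 : docBitSetBase + next;
}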

  s += i;
  spareBitSet.set(s);
}
level0Output.writeByte((byte) -numBitSetLongs);
Contributor:
I guess this is guaranteed to fit in a byte by limits on BLOCK_SIZE?

Contributor Author:
Indeed, I added a comment.
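
For readers: the trick under discussion is to use one signed byte as a tag, where a negative value signals the bit set encoding and its magnitude is the number of longs, while non-negative values keep their existing meaning for delta-encoded blocks. A hedged sketch of the reader side (assumed semantics, not the PR's exact code):

static void readBlockHeader(java.io.DataInput in) throws java.io.IOException {
  byte token = in.readByte();
  if (token < 0) {
    int numBitSetLongs = -token; // bit-set-encoded block with this many longs
    // ... read numBitSetLongs longs of bit set data ...
  } else {
    // ... delta-encoded block: token keeps its existing meaning ...
  }
}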

msokolov (Contributor) commented:
Looks like one of the checks failed with "org.apache.lucene.index.CheckIndex$CheckIndexException: Field 'vector' has repeated neighbors of node 2424 with value 2450" -- unrelated; sorry, I mean to get these cleared up soon!

jpountz merged commit 26e5a8d into apache:main on Jan 14, 2025
2 of 5 checks passed
jpountz deleted the bitset_block branch January 14, 2025 18:02
jpountz added this to the 10.2.0 milestone Jan 14, 2025
jpountz added a commit that referenced this pull request Jan 14, 2025
Successfully merging this pull request may close these issues:

bitset codec for off heap filters [LUCENE-5052]