LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR) #69

gsmiller · 2021-04-06T23:50:35Z

Description

Switch over to PFOR encoding for doc IDs (instead of FOR) to achieve better index compression.

Solution

Details are in the Jira issue. In short, I explored the index size vs. decompression speed tradeoffs using luceneutil benchmarks and found a ~3.3% index size reduction with no significant QPS impact.
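For readers unfamiliar with the tradeoff, the following is a minimal sketch of why a patched scheme can shrink a block; it is not the code in this PR. The class name, method names, the block size of 128, and the cap of 7 exceptions are all assumptions chosen for illustration. FOR must pick a bit width wide enough for the largest value in the block, while a patched variant can pick a narrower width and store a few outliers separately.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's PForUtil.
public class PforSketch {

    // Number of bits needed to represent a non-negative value (0 needs 0 bits).
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v);
    }

    // FOR: every value in the block is packed using the width of the largest value.
    static int forBitsPerValue(long[] block) {
        long max = 0;
        for (long v : block) {
            max = Math.max(max, v);
        }
        return bitsRequired(max);
    }

    // Patched FOR (sketch): ignore up to maxExceptions of the largest values when
    // choosing the width; those values would be stored separately as exceptions.
    static int pforBitsPerValue(long[] block, int maxExceptions) {
        long[] sorted = block.clone();
        Arrays.sort(sorted);
        int idx = Math.max(0, sorted.length - 1 - maxExceptions);
        return bitsRequired(sorted[idx]);
    }

    public static void main(String[] args) {
        long[] deltas = new long[128];   // assumed block size
        Arrays.fill(deltas, 3);
        deltas[17] = 1000;               // a single outlier
        System.out.println("FOR bits/value:  " + forBitsPerValue(deltas));
        System.out.println("PFOR bits/value: " + pforBitsPerValue(deltas, 7));
    }
}
```

In this toy block a single outlier forces FOR to 10 bits per value, while the patched width is 2 bits plus the cost of recording one exception. The extra work of applying exceptions at decode time is the speed side of the tradeoff the benchmarks were checking.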

Tests

In addition to benchmarks, I ported over the PForDeltaUtil unit tests to ensure unit test coverage.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

jpountz

I left a few comments but this looks great in general.

lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java

lucene/core/src/java/org/apache/lucene/codecs/lucene90/ForUtil.java

jpountz · 2021-04-08T13:54:46Z

I think we leave the implementation as is and hope that we can do something better with more explicit vectorization support in the future

(Sorry for replying here; GitHub prevents me from replying on the existing thread.)

+1 Let's go with whichever of `arr[i] = IDENTITY_PLUS_ONE[i] * val + base` or `arr[i] = (i+1) * val + base` runs fastest in your microbenchmark. We can still improve things later if we find a way to trick the JVM into auto-vectorizing this loop.

gsmiller · 2021-04-08T18:23:46Z

@jpountz

+1 Let's go with whichever of `arr[i] = IDENTITY_PLUS_ONE[i] * val + base` or `arr[i] = (i+1) * val + base` runs fastest in your microbenchmark. We can still improve things later if we find a way to trick the JVM into auto-vectorizing this loop.

Perfect, thanks! I'm changing this back to `(i + 1) * val + base` because it consistently performs slightly better in microbenchmarks. That is somewhat surprising, but I suppose the simple addition could be more efficient than an array reference. In the results below, arrayRef == 0 is this implementation, while arrayRef == 1 references `IDENTITY_PLUS_ONE[i]`:

Benchmark                                        (arrayRef)  (bitsPerValue)  (exceptionCount)  (sameVal)   Mode  Cnt  Score   Error   Units
PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder           0               0                 0          2  thrpt   20  7.915 ± 0.008  ops/us
PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder           1               0                 0          2  thrpt   20  7.695 ± 0.010  ops/us
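For context, the two loop bodies being compared can be sketched as below. This is a hedged reconstruction from the expressions quoted in the thread, not the code from the PR: the class name, method names, and the block size of 128 are assumptions. When every delta in a block equals `val`, the prefix sum at position `i` is `(i + 1) * val + base`, so the block can be filled directly without decoding.

```java
// Hypothetical sketch of the two variants discussed above.
public class PrefixSumOfSketch {
    static final int BLOCK_SIZE = 128; // assumed block size

    // Lookup table holding 1, 2, ..., BLOCK_SIZE (the arrayRef == 1 variant).
    static final long[] IDENTITY_PLUS_ONE = new long[BLOCK_SIZE];

    static {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            IDENTITY_PLUS_ONE[i] = i + 1;
        }
    }

    // arrayRef == 0: compute the multiplier from the loop index.
    static void fillArithmetic(long[] arr, long val, long base) {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            arr[i] = (i + 1) * val + base;
        }
    }

    // arrayRef == 1: read the multiplier from the lookup table.
    static void fillArrayRef(long[] arr, long val, long base) {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            arr[i] = IDENTITY_PLUS_ONE[i] * val + base;
        }
    }

    public static void main(String[] args) {
        long[] a = new long[BLOCK_SIZE];
        long[] b = new long[BLOCK_SIZE];
        fillArithmetic(a, 2, 10);
        fillArrayRef(b, 2, 10);
        System.out.println(java.util.Arrays.equals(a, b)); // both variants agree
    }
}
```

Both variants produce identical output, so the choice between them is purely a performance question, which is why the decision was left to the microbenchmark.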

gsmiller · 2021-04-13T17:11:24Z

@jpountz I think I've addressed all of your feedback at this point. No rush if you've got other work occupying your time right now of course, just wanted to check in and make sure you're not waiting on me to make some additional changes. Thanks again for all your feedback!

jpountz

This looks great. I'll merge soon.

lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java

rmuir · 2021-04-13T17:39:50Z

Thanks @gsmiller ! I'm not really competent to review this stuff, just here for moral support. But the numbers look good to me.

gsmiller · 2021-04-13T21:14:56Z

Thanks @jpountz, @rmuir !

Change Postings back to using FOR in Lucene99PostingsFormat

We are still keeping PFOR for positions only. This is a partial revert of #69 which brings back ForDeltaUtil.

* fix merge commit
* Add forgotten forDeltaUtil calls to reader
* Addressing comments: adding Lucene90RWPostingsFormat + more. Also:
  * Change to Changes.txt
  * Removal of dead code which was only used in unit tests
  * Removal of test code from PForUtil
* Changes.txt edit in right place now
* Apply suggestions from code review: `90 -> 99 refactoring`
* Remove decodeTo32 from ForUtil and regenerate

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>

Greg Miller added 3 commits April 6, 2021 16:44

LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

992d6ef

fix javadoc

4666227

spotless miss

192c503

jpountz reviewed Apr 7, 2021

Greg Miller and others added 2 commits April 7, 2021 07:12

PR feedback

bbe0969

Update lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUti…

2902a06

…l.java Co-authored-by: Adrien Grand <jpountz@gmail.com>

prefixSumOf tweak

a79373d

gsmiller requested review from rmuir and jpountz April 9, 2021 19:38

jpountz approved these changes Apr 13, 2021

lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java

jpountz merged commit fbbdc62 into apache:main Apr 14, 2021

gsmiller deleted the LUCENE-9850/pfordocids-pr branch April 14, 2021 17:26

This was referenced Aug 20, 2021

Can PForUtil be further auto-vectorized? [LUCENE-9918] #10957

Open

Add generation/ checksumming task for gen_ForUtil.py [LUCENE-9915] #10954

Closed

jpountz mentioned this pull request Oct 25, 2023

Adding option to codec to disable patching in Lucene's PFOR encoding #12696

Closed

slow-J added a commit to slow-J/lucene that referenced this pull request Oct 31, 2023

Change Postings back to using FOR in Lucene99PostingsFormat

f05542c

We are still keeping PFOR for positions only. This is a partial revert of apache#69 which brings back ForDeltaUtil.

slow-J added a commit to slow-J/lucene that referenced this pull request Oct 31, 2023

Change Postings back to using FOR in Lucene99PostingsFormat

c8575cc

We are still keeping PFOR for positions only. This is a partial revert of apache#69 which brings back ForDeltaUtil.

slow-J mentioned this pull request Oct 31, 2023

Remove patching for doc blocks. #12741

Merged

slow-J added a commit to slow-J/lucene that referenced this pull request Nov 6, 2023

Change Postings back to using FOR in Lucene99PostingsFormat

fc88d53

We are still keeping PFOR for positions only. This is a partial revert of apache#69 which brings back ForDeltaUtil.
