Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR) #69

Merged
merged 6 commits into from
Apr 14, 2021

Conversation

gsmiller
Copy link
Contributor

@gsmiller gsmiller commented Apr 6, 2021

Description

Switch over to PFOR encoding for doc IDs (instead of FOR) to achieve better index compression.

Solution

Details are in the Jira issue, but I explored the index size vs. decompression speed tradeoffs using luceneutil benchmarks and found ~3.3% index size reduction with no significant OPS impact.

Tests

In addition to benchmarks, I ported over the PForDeltaUtil unit tests to ensure unit test coverage.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments but this looks great in general.

Greg Miller and others added 2 commits April 7, 2021 07:12
…l.java

Co-authored-by: Adrien Grand <jpountz@gmail.com>
@jpountz
Copy link
Contributor

jpountz commented Apr 8, 2021

I think we leave the implementation as is and hope that we can do something better with more explicit vectorization support in the future

(Sorry replying here as Github prevents me from replying on the existing thread)

+1 Let go with whichever of arr[i] = IDENTITY_PLUS_ONE[i] * val + base or arr[i] = (i+1) * val + base runs fastest in your micro benchmark. We can still improve things later if we find a way to trick the JVM into auto-vectorizing this loop.

@gsmiller
Copy link
Contributor Author

gsmiller commented Apr 8, 2021

@jpountz

+1 Let go with whichever of arr[i] = IDENTITY_PLUS_ONE[i] * val + base or arr[i] = (i+1) * val + base runs fastest in your micro benchmark. We can still improve things later if we find a way to trick the JVM into auto-vectorizing this loop.

Perfect, thanks! I'm changing this back to (i + 1) * val + base because it (somewhat surprisingly maybe, but I suppose this simple addition could be more efficient than an array reference) does consistently perform slightly better in microbenchmarks (arraryRef == 0 is this implementation while arrayRef == 1 references IDENTITY_PLUS_ONE[i]):

Benchmark                                        (arrayRef)  (bitsPerValue)  (exceptionCount)  (sameVal)   Mode  Cnt  Score   Error   Units
PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder           0               0                 0          2  thrpt   20  7.915 ± 0.008  ops/us
PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder           1               0                 0          2  thrpt   20  7.695 ± 0.010  ops/us

@gsmiller gsmiller requested review from rmuir and jpountz April 9, 2021 19:38
@gsmiller
Copy link
Contributor Author

@jpountz I think I've addressed all of your feedback at this point. No rush if you've got other work occupying your time right now of course, just wanted to check in and make sure you're not waiting on me to make some additional changes. Thanks again for all your feedback!

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I'll merge soon.

@rmuir
Copy link
Member

rmuir commented Apr 13, 2021

Thanks @gsmiller ! I'm not really competent to review this stuff, just here for moral support. But the numbers look good to me.

@gsmiller
Copy link
Contributor Author

Thanks @jpountz, @rmuir !

@jpountz jpountz merged commit fbbdc62 into apache:main Apr 14, 2021
@gsmiller gsmiller deleted the LUCENE-9850/pfordocids-pr branch April 14, 2021 17:26
janhoy pushed a commit to cominvent/lucene that referenced this pull request May 12, 2021
* Consolidate developer docs into top level /dev-docs, and provide a single pointer to other places that host developer oriented docs.
* Some small tweaks to the cloud testing script.
slow-J added a commit to slow-J/lucene that referenced this pull request Oct 31, 2023
We are still keeping PFOR for positions only.
This is a partial revert of apache#69 which brings back ForDeltaUtil.
slow-J added a commit to slow-J/lucene that referenced this pull request Oct 31, 2023
We are still keeping PFOR for positions only.
This is a partial revert of apache#69 which brings back ForDeltaUtil.
slow-J added a commit to slow-J/lucene that referenced this pull request Nov 6, 2023
We are still keeping PFOR for positions only.
This is a partial revert of apache#69 which brings back ForDeltaUtil.
mikemccand pushed a commit that referenced this pull request Nov 6, 2023
* Change Postings back to using FOR in Lucene99PostingsFormat

We are still keeping PFOR for positions only.
This is a partial revert of #69 which brings back ForDeltaUtil.

* fix merge commit

* Add forgotten forDeltaUtil calls to reader

* Addressing comments: adding Lucene90RWPostingsFormat + more

Also:
* Change to Changes.txt
* Removal of dead code which was only used in unit tests
* Removal of test code from PForUtil

* Changes.txt edit in right place now

* Apply suggestions from code review: `90 -> 99 refactoring`

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>

* Remove decodeTo32 from ForUtil and regenerate

---------

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
mikemccand pushed a commit that referenced this pull request Nov 6, 2023
* Change Postings back to using FOR in Lucene99PostingsFormat

We are still keeping PFOR for positions only.
This is a partial revert of #69 which brings back ForDeltaUtil.

* fix merge commit

* Add forgotten forDeltaUtil calls to reader

* Addressing comments: adding Lucene90RWPostingsFormat + more

Also:
* Change to Changes.txt
* Removal of dead code which was only used in unit tests
* Removal of test code from PForUtil

* Changes.txt edit in right place now

* Apply suggestions from code review: `90 -> 99 refactoring`

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>

* Remove decodeTo32 from ForUtil and regenerate

---------

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants