Try turning off patching in Lucene's PFOR encoding #46
Comments
++ Is there an easy way to disable it? Based on my reading of the code, it does not seem to be configurable.
I forked a test PostingsFormat, PostingsReader, PostingsWriter, Codec, and ForUtil that do not patch exceptions (large values). I did a quick test, but I want to re-index and confirm again. I ran a benchmark (m6g.4xlarge) with COUNT, TOP_10_COUNT, and TOP_100 against Lucene with PFOR (from the previous test after d1b928c) and against Lucene without the patching in PFOR. See the test code here: slow-J/lucene@cd68926 (please let me know if there are any improvements I could have made). To run the code in the benchmark, I ran Attaching results:
Did you get a chance to check the index size impact?
I also wonder whether removing patching is more or less impactful on Graviton3?
Re-indexed again to make sure I have the correct version built. With patching turned on (baseline): With patching turned off: So turning off patching causes a 5.0208% increase in the size of the index. I'll test Graviton3 when I get a chance.
Variables
Candidate: Comparing to baseline results from #36 Baseline: Changes with turning off the patching in PFOR encoding: So the improvement to Attaching results.json.
This is quite a compelling gain. Did you turn off patching for both
I turned off all patching in postings (both I will re-run the Graviton3 benchmark to double check will the
Ran the Graviton3 benchmark again. COUNT, avg: 10,822 μs. So this time compared to the control, we have So there is some variance between benchmark runs, and no
The Lucene PR apache/lucene#12741 has been merged! Resolving!
One difference between Lucene and Tantivy is that Lucene uses "patched" FOR (PFOR): the large values in a block are held out as exceptions so that the remaining values can be encoded with fewer bits, trading CPU for lower storage space.
Let's try temporarily disabling the patching in Lucene, to match how Tantivy encodes, and see how much performance it is costing us.
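To make the trade-off described above concrete, here is a minimal sketch of a simplified cost model, not Lucene's or Tantivy's actual encoder. It compares the packed size of one block under plain FOR (every value stored at the width of the largest value) with a patched variant that may hold out a few large values as exceptions. The function names, the cap of 7 exceptions, and the flat 16-bit cost per exception are illustrative assumptions, not details taken from this thread.

```python
def bits_required(value: int) -> int:
    """Bits needed to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def for_cost_bits(block: list[int]) -> int:
    """Plain FOR: every value is packed at the width of the largest value."""
    width = max(bits_required(v) for v in block)
    return width * len(block)


def pfor_cost_bits(
    block: list[int],
    max_exceptions: int = 7,            # assumed cap, for illustration only
    exception_overhead_bits: int = 16,  # assumed flat cost per exception
) -> int:
    """Patched FOR under a simplified model: pick the cheapest packing width,
    holding out up to `max_exceptions` of the largest values as exceptions."""
    widths = sorted(bits_required(v) for v in block)
    best = widths[-1] * len(block)  # zero exceptions is the same as plain FOR
    for num_exceptions in range(1, min(max_exceptions, len(block) - 1) + 1):
        # Width of the largest value that is still packed normally.
        width = widths[-1 - num_exceptions]
        cost = width * len(block) + num_exceptions * exception_overhead_bits
        best = min(best, cost)
    return best


# A block of 128 small values with a single large outlier.
block = [5] * 127 + [100_000]
print(for_cost_bits(block))   # 2176: all 128 values stored at 17 bits
print(pfor_cost_bits(block))  # 400: 128 values at 3 bits plus one exception
```

In this model a single outlier makes plain FOR several times larger for that block, which is the storage that patching saves. The price is an extra pass at decode time to apply the exceptions, and that is the CPU cost this issue proposes to measure.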