-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-level skipping on posting lists [LUCENE-866] #1941
Comments
Michael Busch (migrated from JIRA) Attaching the first version of this patch. |
Michael Busch (migrated from JIRA) I ran a few performance tests with this patch. I built a rather big I compare query evaluation time with one-level skipping (old) and The following queries contain very frequent and very unique terms. Query: +lucene +search +engine +library Query: +apache +http +server Even though I/O is reduced for the next query, it runs a bit slower. Query: +multi +level +skipping For the next query there is slightly more I/O necessary in the Query: +beatles +yellow +submarine However, if I change the query a little bit, then the speed-up Query: +"the beatles" +yellow +submarine It would be interesting to run more sophisticated benchmarks. I'm not that familiar with the new benchmark stuff that has |
Yonik Seeley (@yonik) (migrated from JIRA) Looks interesting! Have you done any performance testing? I guess best case for this patch would be "skip to a random doc", and worst case would be skipTo(currDoc+2) in a loop (or a conjunction across terms that appear in almost all docs). I haven't dug into the code, but do you avoid storing extra levels when the docfreq is small? |
Michael Busch (migrated from JIRA) > and worst case would be skipTo(currDoc+2) in a loop True that should be the worst case. A query with 3 stop words Query: +the +and +a Maybe a possible optimization could be to avoid using the higher levels > do you avoid storing extra levels when the docfreq is small? I only store another level if it contains at least one SkipDatum. The .frq file in my index grew by 1.3% in the multi-level case |
Michael Busch (migrated from JIRA) #1963 will introduce the new method BufferedIndexInput.setBufferSize(). Since the length of the different skip levels is known I can set the buffer length before I clone the skip stream in this patch. |
Michael Busch (migrated from JIRA) I ran some more performance experiments to test the average speedup. Here are the results: old (one level skipping): new (multi level): This is a speedup of about 16% and i/o savings of 42%. Then I changed the algorithm a bit to not always start on the Time: 49500 ms. This 20% speedup is for average AND queries. Certain queries I will attach a new patch as soon as #1963 is committed. |
Michael Busch (migrated from JIRA) New version of the patch with the following changes:
All unit tests pass. I think this patch is ready to commit now. |
Paul Elschot (migrated from JIRA) Patch applies cleanly here.
It's possible that this is due to another diff in my working copy, but a quick look did |
Doron Cohen (migrated from JIRA) On a clean checkout it applies cleanly and this test (and all others) pass. |
Michael McCandless (@mikemccand) (migrated from JIRA) Hmm – that NullPointerException I think is from the assert I added for #1963. I will fix. |
Paul Elschot (migrated from JIRA) The last change to BufferedIndexInput indeed makes all core tests pass again here with this patch applied. |
Michael Busch (migrated from JIRA) Thanks for reviewing Paul and Doron, and thanks for fixing the NPE, Mike!! I'm planning to commit this in a day or so. |
Yonik Seeley (@yonik) (migrated from JIRA) Nice job! +1 to commit |
Michael Busch (migrated from JIRA) This patch updates the fileformats document. It also adds a comment |
Michael Busch (migrated from JIRA) Committed. |
To accelerate posting list skips (TermDocs.skipTo(int)) Lucene uses skip lists.
The default skip interval is set to 16. If we want to skip e. g. 100 documents,
then it is not necessary to read 100 entries from the posting list, but only
100/16 = 6 skip list entries plus 100%16 = 4 entries from the posting list. This
speeds up conjunction (AND) and phrase queries significantly.
However, the skip interval is always a compromise. If you have a very big index
with huge posting lists and you want to skip over lets say 100k documents, then
it is still necessary to read 100k/16 = 6250 entries from the skip list. For big
indexes the skip interval could be set to a higher value, but then after a big
skip a long scan to the target doc might be necessary.
A solution for this compromise is to have multi-level skip lists that guarantee a
logarithmic amount of skips to any target in the posting list. This patch
implements such an approach in the following way:
Example for skipInterval = 3:
c (skip level 2)
c c c (skip level 1)
x x x x x x x x x x (skip level 0)
d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d (posting list)
3 6 9 12 15 18 21 24 27 30 (df)
d - document
x - skip data
c - skip data with child pointer
Skip level i contains every skipInterval-th entry from skip level i-1. Therefore the
number of entries on level i is: floor(df / ((skipInterval ^ (i + 1))).
Each skip entry on a level i>0 contains a pointer to the corresponding skip entry in
list i-1. This guarantees a logarithmic amount of skips to find the target document.
Implementations details:
simplify those classes. The two new classes AbstractSkipListReader and
AbstractSkipListWriter implement the skipping functionality.
multiple skip levels, they do not implement an actual skip data format. The two
new subclasses DefaultSkipListReader and Writer implement the skip data format
that is currently used in Lucene (with two file pointers for the freq and prox
file and with payload length information). I added this extra layer to be
prepared for flexible indexing and different posting list formats.
File format changes:
I added the new parameter 'maxSkipLevels' to the term dictionary and increased the
version of this file. If maxSkipLevels is set to one, then the format of the freq
file does not change at all, because we only have one skip level as before. For
backwards compatibility maxSkipLevels is set to one automatically if an index
without the new parameter is read.
In case maxSkipLevels > 1, then the frq file changes as follows:
FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
SkipData --> <<SkipLevelLength, SkipLevel>^(Min(maxSkipLevels,
floor(log(DocFreq/log(skipInterval))) - 1)>, SkipLevel>
SkipLevel --> <SkipDatum>DocFreq/(SkipInterval(Level + 1))
Remark: The length of the SkipLevel is not stored for level 0, because 1) it is not
needed, and 2) the format of this file does not change for maxSkipLevels=1 then.
All unit tests pass with this patch.
Migrated from LUCENE-866 by Michael Busch, 2 votes, resolved May 31 2007
Attachments: fileformats.patch, lucene-866.patch (versions: 2)
Linked issues:
The text was updated successfully, but these errors were encountered: