Optimize block wand for one and several TermScorer. #1190

fmassot · 2021-10-30T00:25:11Z

I made some changes to your initial commit @fulmicoton:

First, based on the paper by Ding and Suel, I made a small change to the block wand algorithm. The function block_max_was_too_low_advance_one_scorer now takes the minimum of last_doc_in_block() over all scorers up to and including the pivot scorer. Before, we were taking the doc() of the pivot scorer.
Second, I think there was a tiny bug in block_max_was_too_low_advance_one_scorer: we were using doc_to_seek_after + 1 even for scorers after the pivot. There is a very small probability of missing a document here, so I think we should not add +1 in this case.
Third, I updated the way the scorer to advance is chosen. We now take the scorer with the best score among scorers[..pivot_len].
Finally, I added an implementation for the case where we have a single TermScorer, because the code is simple and readable and the performance is far better (around 3x faster). With that, we are on par with Lucene.
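
The selection logic described above can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: `ScorerState` and `choose_seek` are hypothetical names, the scorers are plain structs rather than tantivy `TermScorer`s, and `TERMINATED` is a stand-in sentinel defined locally for the sketch.

```rust
type DocId = u32;
type Score = f32;

// Stand-in sentinel meaning "no more documents" (an assumption for this sketch).
const TERMINATED: DocId = i32::MAX as u32;

/// Hypothetical snapshot of the state of one term scorer.
#[derive(Debug, Clone, Copy)]
struct ScorerState {
    /// Document the scorer is currently positioned on.
    doc: DocId,
    /// Last document of the block the scorer is currently in.
    last_doc_in_block: DocId,
    /// Upper bound of the score this term can contribute.
    max_score: Score,
}

/// Returns `(scorer_to_seek, doc_to_seek)`.
///
/// Assumes `scorers` is sorted by `doc` and `1 <= pivot_len <= scorers.len()`.
fn choose_seek(scorers: &[ScorerState], pivot_len: usize) -> (usize, DocId) {
    let mut scorer_to_seek = pivot_len - 1;
    let mut best_max_score = scorers[scorer_to_seek].max_score;
    let mut doc_to_seek_after = scorers[scorer_to_seek].last_doc_in_block;

    // Among scorers[..pivot_len]: take the smallest block end and
    // remember the scorer with the highest max score.
    for ord in (0..pivot_len - 1).rev() {
        let scorer = &scorers[ord];
        doc_to_seek_after = doc_to_seek_after.min(scorer.last_doc_in_block);
        if scorer.max_score > best_max_score {
            best_max_score = scorer.max_score;
            scorer_to_seek = ord;
        }
    }

    // Step past the block end, unless we are already at the end.
    if doc_to_seek_after != TERMINATED {
        doc_to_seek_after += 1;
    }

    // Scorers after the pivot cap the target with their current doc, without +1,
    // so a document they are positioned on cannot be skipped.
    for scorer in &scorers[pivot_len..] {
        doc_to_seek_after = doc_to_seek_after.min(scorer.doc);
    }

    (scorer_to_seek, doc_to_seek_after)
}
```

For example, with two scorers up to the pivot whose blocks end at 10 and 8, and a third scorer already positioned on doc 8, the sketch selects the second scorer (highest max score) and seeks to 8 rather than 9.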

I also created a branch with the benchmark results here: https://github.com/quickwit-inc/search-benchmark-game/tree/blockwand-for-termquery

I noticed that the performance is better on average and for union we have:

Top 10 - UNION: 1,081 μs | 1,419 μs

A proptest was also added. Co-authored-by: Paul Masurel <paul.masurel@gmail.com> Co-authored-by: François Massot <francois.massot@gmail.com>

codecov-commenter · 2021-10-30T18:04:56Z

Codecov Report

Merging #1190 (b25dbeb) into main (5916ced) will decrease coverage by 0.01%.
The diff coverage is 98.38%.

@@            Coverage Diff             @@
##             main    #1190      +/-   ##
==========================================
- Coverage   93.99%   93.97%   -0.02%     
==========================================
  Files         204      204              
  Lines       34606    34662      +56     
==========================================
+ Hits        32528    32574      +46     
- Misses       2078     2088      +10

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/query/boolean_query/mod.rs | `100.00% <ø> (ø)` | |
| src/query/boolean_query/block_wand.rs | `96.85% <98.36%> (+0.41%)` | ⬆️ |
| src/query/term_query/term_weight.rs | `95.06% <100.00%> (ø)` | |
| src/indexer/segment_updater.rs | `93.75% <0.00%> (-1.03%)` | ⬇️ |
| src/fastfield/reader.rs | `94.06% <0.00%> (-0.85%)` | ⬇️ |
| src/directory/watch_event_router.rs | `95.41% <0.00%> (-0.77%)` | ⬇️ |
| src/postings/stacker/expull.rs | `93.44% <0.00%> (-0.44%)` | ⬇️ |
| src/core/index.rs | `93.75% <0.00%> (+0.19%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

fmassot · 2021-10-31T10:18:13Z

OK, there is one more subtlety to take into account: we need to advance the scorer that has the greatest idf. I will update the algorithm and the benchmark.

fulmicoton · 2021-10-31T13:43:21Z

src/query/boolean_query/block_wand.rs

    for scorer_ord in (0..pivot_len - 1).rev() {
        let scorer = &scorers[scorer_ord];
        if scorer.last_doc_in_block() <= doc_to_seek_after {
            doc_to_seek_after = scorer.last_doc_in_block();
+        }
+        if scorers[scorer_ord].max_score > global_max_score {
+            global_max_score = scorers[scorer_ord].max_score;


Ah yes, this is way better! Advancing any scorer is "correct", but we want to advance the TermScorer with the lowest docfreq.

fulmicoton · 2021-10-31T13:55:35Z

src/query/boolean_query/block_wand.rs

            scorer_to_seek = scorer_ord;
        }
    }
+    // Add +1 to go to the next block unless we are already at the end.
+    if doc_to_seek_after != TERMINATED {
+        doc_to_seek_after += 1;


you are right this was a bug.

fulmicoton · 2021-10-31T13:59:47Z

src/query/boolean_query/block_wand.rs

+        // the threshold.
+        while scorer.block_max_score() < threshold {
+            let last_doc_in_block = scorer.last_doc_in_block();
+            if doc == TERMINATED {


did you mean last_doc_in_block here maybe?

fulmicoton · 2021-10-31T14:03:45Z

src/query/boolean_query/block_wand.rs

+            if score > threshold {
+                threshold = callback(doc, score);
+            }
+            if doc >= scorer.last_doc_in_block() {


Does this >= only ever trigger as ==?
If so, should we add a debug_assert on == maybe?

Indeed, we don't need the assert, this is pretty straightforward.
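
The single-scorer idea discussed in this thread can be illustrated with a simplified sketch. It is an assumption-laden model, not the function from this PR: `Block` and `block_wand_single` are hypothetical names, posting blocks are plain in-memory vectors, and the cursor operations (shallow seek, seek, advance) are replaced by a simple loop over blocks. It keeps the two comparisons visible in the diff: a block is skipped when its max score is below the threshold, and the callback is invoked only when a score exceeds the threshold.

```rust
type DocId = u32;
type Score = f32;

/// Hypothetical in-memory posting block for a single term.
struct Block {
    /// (doc id, score) pairs, sorted by doc id.
    postings: Vec<(DocId, Score)>,
    /// Precomputed upper bound of the scores in this block.
    block_max_score: Score,
}

/// Visits the blocks of one term, skipping every block that cannot beat
/// the current threshold. `callback` receives each hit that exceeds the
/// threshold and returns the new threshold.
///
/// Returns the number of blocks that were actually scored.
fn block_wand_single(
    blocks: &[Block],
    mut threshold: Score,
    callback: &mut dyn FnMut(DocId, Score) -> Score,
) -> usize {
    let mut blocks_scored = 0;
    for block in blocks {
        // The whole block is skipped without scoring any document when
        // its upper bound is below the threshold.
        if block.block_max_score < threshold {
            continue;
        }
        blocks_scored += 1;
        for &(doc, score) in &block.postings {
            if score > threshold {
                // The collector may raise the threshold, which makes
                // later blocks more likely to be skipped.
                threshold = callback(doc, score);
            }
        }
    }
    blocks_scored
}
```

A top-1 collector can be modelled by a callback that records the hit and returns its score as the new threshold; every block whose max score is lower than the best score seen so far is then skipped.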

Added optimisation using block wand for single TermScorer.

6338030

A proptest was also added. Co-authored-by: Paul Masurel <paul.masurel@gmail.com> Co-authored-by: François Massot <francois.massot@gmail.com>

fmassot changed the title ~~Added optimisation using block wand for single TermScorer.~~ Optimize block wand for one or several TermScorer. Oct 30, 2021

fmassot force-pushed the blockwand-for-termquery branch from 7f9627c to 8e2158e Compare October 30, 2021 18:40

Fix block wand algorithm by taking the last doc id of scores until th…

f1424f2

…e pivot scorer (included).

fmassot force-pushed the blockwand-for-termquery branch from 8e2158e to f1424f2 Compare October 30, 2021 18:42

fmassot changed the title ~~Optimize block wand for one or several TermScorer.~~ Optimize block wand for one and several TermScorer. Oct 31, 2021

In block wand, when block max score is lower than the threshold, adva…

2aee4fb

…nce the scorer with best score.

fulmicoton reviewed Oct 31, 2021

View reviewed changes

fulmicoton approved these changes Oct 31, 2021

View reviewed changes

Fix wrong condition in block_wand_single_scorer and add debug_assert …

b25dbeb

…to have an equality check on doc to break the loop.

fmassot force-pushed the blockwand-for-termquery branch from afbf2e3 to b25dbeb Compare October 31, 2021 15:49

fulmicoton approved these changes Nov 1, 2021

View reviewed changes

fulmicoton merged commit 0462754 into main Nov 1, 2021

fulmicoton deleted the blockwand-for-termquery branch November 1, 2021 00:18

This was referenced Feb 18, 2022

fix open bytes index PSeitz/tantivy#1

Closed

aggregation PSeitz/tantivy#2

Closed
