Optimize block wand for one and several TermScorer. #1190

Merged
merged 4 commits into from
Nov 1, 2021

Conversation

fmassot
Contributor

@fmassot fmassot commented Oct 30, 2021

I made some changes to your initial commit @fulmicoton

  • First, based on the paper by Ding and Suel, I made a small change to the block wand algorithm. The function block_max_was_too_low_advance_one_scorer now takes the minimum of the last doc ids in the current block over all scorers up to the pivot scorer. Before, we were taking the doc() of the pivot scorer.
  • Second, I think there was a tiny bug in block_max_was_too_low_advance_one_scorer: we were doing doc_to_seek_after + 1 even on a scorer after the pivot. There is a very small probability of missing a document here, so I think we should not add +1 in this case.
  • Third, I updated the way the scorer to advance is chosen. Now we take the scorer with the best score among scorers[..pivot_len].
  • Finally, I added the implementation for the case where we have a single TermScorer, because the code is simple and readable and the performance is far better (around 3x faster). With that, we are on par with Lucene.
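
To make the first three points concrete, here is a minimal sketch of the selection step as it can be read from this description and the reviewed snippets. ToyScorer, choose_scorer_and_target and the TERMINATED value are hypothetical stand-ins for illustration, not tantivy's actual types. Capping the target at the current doc of scorers after the pivot (without the +1) is an assumption based on the second point.

```rust
/// Sentinel doc id meaning "no more documents" (hypothetical value for this sketch).
const TERMINATED: u32 = u32::MAX;

/// Hypothetical, simplified stand-in for the state of a term scorer.
#[derive(Debug, Clone)]
struct ToyScorer {
    doc: u32,               // current doc id
    last_doc_in_block: u32, // last doc id of the current block
    max_score: f32,         // upper bound on the score this term can contribute
}

/// Returns (index of the scorer to advance, doc id to seek to).
/// Assumes `scorers` is sorted by `doc` and `1 <= pivot_len <= scorers.len()`.
fn choose_scorer_and_target(scorers: &[ToyScorer], pivot_len: usize) -> (usize, u32) {
    let mut scorer_to_seek = pivot_len - 1;
    let mut best_max_score = scorers[scorer_to_seek].max_score;
    let mut doc_to_seek_after = scorers[scorer_to_seek].last_doc_in_block;
    for ord in (0..pivot_len - 1).rev() {
        let scorer = &scorers[ord];
        // Minimum of the block ends over all scorers up to the pivot.
        doc_to_seek_after = doc_to_seek_after.min(scorer.last_doc_in_block);
        // Prefer advancing the scorer with the highest max score.
        if scorer.max_score > best_max_score {
            best_max_score = scorer.max_score;
            scorer_to_seek = ord;
        }
    }
    // Add +1 to go to the next block unless we are already at the end.
    if doc_to_seek_after != TERMINATED {
        doc_to_seek_after += 1;
    }
    // Scorers after the pivot only cap the target at their current doc: no +1 here.
    for scorer in &scorers[pivot_len..] {
        doc_to_seek_after = doc_to_seek_after.min(scorer.doc);
    }
    (scorer_to_seek, doc_to_seek_after)
}

fn main() {
    let scorers = vec![
        ToyScorer { doc: 3, last_doc_in_block: 40, max_score: 1.5 },
        ToyScorer { doc: 7, last_doc_in_block: 25, max_score: 2.5 },
        ToyScorer { doc: 30, last_doc_in_block: 90, max_score: 0.5 },
    ];
    let (ord, target) = choose_scorer_and_target(&scorers, 2);
    // prints: advance scorer 1 to doc 26
    println!("advance scorer {} to doc {}", ord, target);
}
```

If the third scorer were positioned on doc 10 instead of 30, the target would be 10 rather than 11, which is the case where adding +1 could skip a document.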

I also created a branch with the benchmark results here: https://github.com/quickwit-inc/search-benchmark-game/tree/blockwand-for-termquery

I noticed that the performance is better on average and for union we have:

  • Top 10 - UNION: 1,081 μs | 1,419 μs

A proptest was also added.

Co-authored-by: Paul Masurel <paul.masurel@gmail.com>
Co-authored-by: François Massot <francois.massot@gmail.com>
@fmassot fmassot changed the title Added optimisation using block wand for single TermScorer. Optimize block wand for one or several TermScorer. Oct 30, 2021
@codecov-commenter

codecov-commenter commented Oct 30, 2021

Codecov Report

Merging #1190 (b25dbeb) into main (5916ced) will decrease coverage by 0.01%.
The diff coverage is 98.38%.

@@            Coverage Diff             @@
##             main    #1190      +/-   ##
==========================================
- Coverage   93.99%   93.97%   -0.02%     
==========================================
  Files         204      204              
  Lines       34606    34662      +56     
==========================================
+ Hits        32528    32574      +46     
- Misses       2078     2088      +10     
Impacted Files                            Coverage Δ
src/query/boolean_query/mod.rs            100.00% <ø> (ø)
src/query/boolean_query/block_wand.rs     96.85% <98.36%> (+0.41%) ⬆️
src/query/term_query/term_weight.rs       95.06% <100.00%> (ø)
src/indexer/segment_updater.rs            93.75% <0.00%> (-1.03%) ⬇️
src/fastfield/reader.rs                   94.06% <0.00%> (-0.85%) ⬇️
src/directory/watch_event_router.rs       95.41% <0.00%> (-0.77%) ⬇️
src/postings/stacker/expull.rs            93.44% <0.00%> (-0.44%) ⬇️
src/core/index.rs                         93.75% <0.00%> (+0.19%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fmassot fmassot force-pushed the blockwand-for-termquery branch from 7f9627c to 8e2158e Compare October 30, 2021 18:40
@fmassot fmassot force-pushed the blockwand-for-termquery branch from 8e2158e to f1424f2 Compare October 30, 2021 18:42
@fmassot fmassot changed the title Optimize block wand for one or several TermScorer. Optimize block wand for one and several TermScorer. Oct 31, 2021
@fmassot
Contributor Author

fmassot commented Oct 31, 2021

OK, there is one more subtlety to take into account: we need to advance the scorer that has the greatest idf. I will update the algorithm and the benchmark.
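
For context on why "greatest idf" and "lowest docfreq" pick the same scorer, here is a small sketch using the common BM25 idf formula. The function name and the exact formula are assumptions for illustration; they are not taken from this PR.

```rust
/// BM25-style inverse document frequency (formula assumed for this sketch).
/// `doc_freq` is the number of documents containing the term,
/// `doc_count` is the total number of documents.
fn idf(doc_freq: u64, doc_count: u64) -> f64 {
    let n = doc_freq as f64;
    let total = doc_count as f64;
    (1.0 + (total - n + 0.5) / (n + 0.5)).ln()
}

fn main() {
    // A rare term gets a higher idf than a frequent one.
    println!("rare term:     {:.3}", idf(10, 1_000));
    println!("frequent term: {:.3}", idf(500, 1_000));
}
```

Under this formula idf decreases as the document frequency grows, so the scorer with the greatest idf is the one over the shortest posting list, which is also the one where a seek skips the fewest postings per document advanced.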

for scorer_ord in (0..pivot_len - 1).rev() {
    let scorer = &scorers[scorer_ord];
    if scorer.last_doc_in_block() <= doc_to_seek_after {
        doc_to_seek_after = scorer.last_doc_in_block();
    }
    if scorers[scorer_ord].max_score > global_max_score {
        global_max_score = scorers[scorer_ord].max_score;
Collaborator

@fulmicoton fulmicoton Oct 31, 2021

Ah yes, this is way better! Advancing any scorer is "correct", but we want to advance the TermScorer with the lowest docfreq.

        scorer_to_seek = scorer_ord;
    }
}
// Add +1 to go to the next block unless we are already at the end.
if doc_to_seek_after != TERMINATED {
    doc_to_seek_after += 1;
Collaborator

You are right, this was a bug.

// the threshold.
while scorer.block_max_score() < threshold {
    let last_doc_in_block = scorer.last_doc_in_block();
    if doc == TERMINATED {
Collaborator

Did you mean last_doc_in_block here, maybe?

if score > threshold {
    threshold = callback(doc, score);
}
if doc >= scorer.last_doc_in_block() {
Collaborator

@fulmicoton fulmicoton Oct 31, 2021

Can this >= only ever trigger as ==?
If so, should we add a debug_assert on == maybe?

Contributor Author

Indeed, we don't need the assert, this is pretty straightforward.
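
For readers following the single-TermScorer thread, here is a rough sketch of the idea behind that loop: skip whole blocks whose maximum score is below the current threshold, and only score documents in the remaining blocks. ToyBlock and block_wand_single are hypothetical names, and the posting list is modeled as in-memory blocks with precomputed scores, which is a simplification and not how the PR's code is organized.

```rust
/// Hypothetical block of a posting list: (doc id, precomputed score) pairs.
struct ToyBlock {
    postings: Vec<(u32, f32)>,
}

impl ToyBlock {
    /// Upper bound on the score of any document in this block.
    fn block_max_score(&self) -> f32 {
        self.postings.iter().map(|&(_, score)| score).fold(f32::MIN, f32::max)
    }
}

/// Visits blocks in order and skips any block whose max score is below the threshold.
/// `callback` is called with (doc, score) for each document scoring above the
/// threshold and returns the new threshold.
/// Returns the number of blocks that were actually scored.
fn block_wand_single<F>(blocks: &[ToyBlock], mut threshold: f32, mut callback: F) -> usize
where
    F: FnMut(u32, f32) -> f32,
{
    let mut scored_blocks = 0;
    for block in blocks {
        if block.block_max_score() < threshold {
            // No document in this block can beat the threshold: skip it entirely.
            continue;
        }
        scored_blocks += 1;
        for &(doc, score) in &block.postings {
            if score > threshold {
                threshold = callback(doc, score);
            }
        }
    }
    scored_blocks
}

fn main() {
    let blocks = vec![
        ToyBlock { postings: vec![(1, 0.5), (2, 2.0)] },
        ToyBlock { postings: vec![(10, 1.0), (11, 1.5)] },
        ToyBlock { postings: vec![(20, 3.0), (21, 0.1)] },
    ];
    // Top-1 collection: the new threshold is simply the best score seen so far.
    let mut hits = Vec::new();
    let scored = block_wand_single(&blocks, f32::MIN, |doc, score| {
        hits.push((doc, score));
        score
    });
    println!("scored {} of {} blocks, hits: {:?}", scored, blocks.len(), hits);
}
```

In this toy run the second block is skipped because its maximum score (1.5) is below the threshold reached after the first block (2.0).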

…to have an equality check on doc to break the loop.
@fmassot fmassot force-pushed the blockwand-for-termquery branch from afbf2e3 to b25dbeb Compare October 31, 2021 15:49
@fulmicoton fulmicoton merged commit 0462754 into main Nov 1, 2021
@fulmicoton fulmicoton deleted the blockwand-for-termquery branch November 1, 2021 00:18
This was referenced Feb 18, 2022
3 participants