-
-
Notifications
You must be signed in to change notification settings - Fork 709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize block wand for one and several TermScorer. #1190
Conversation
A proptest was also added. Co-authored-by: Paul Masurel <paul.masurel@gmail.com> Co-authored-by: François Massot <francois.massot@gmail.com>
Codecov Report
@@ Coverage Diff @@
## main #1190 +/- ##
==========================================
- Coverage 93.99% 93.97% -0.02%
==========================================
Files 204 204
Lines 34606 34662 +56
==========================================
+ Hits 32528 32574 +46
- Misses 2078 2088 +10
Continue to review full report at Codecov.
|
7f9627c
to
8e2158e
Compare
…e pivot scorer (included).
8e2158e
to
f1424f2
Compare
Ok there is one more subtleties to take into account: we need to advance the scorer that have the greatest idf. I will update the algorithm and the benchmark. |
…nce the scorer with best score.
for scorer_ord in (0..pivot_len - 1).rev() { | ||
let scorer = &scorers[scorer_ord]; | ||
if scorer.last_doc_in_block() <= doc_to_seek_after { | ||
doc_to_seek_after = scorer.last_doc_in_block(); | ||
} | ||
if scorers[scorer_ord].max_score > global_max_score { | ||
global_max_score = scorers[scorer_ord].max_score; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes this is way better! Advancing any scorer is "correct", but we want to advance the termscorer with the lowest docfreq.
scorer_to_seek = scorer_ord; | ||
} | ||
} | ||
// Add +1 to go to the next block unless we are already at the end. | ||
if doc_to_seek_after != TERMINATED { | ||
doc_to_seek_after += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right this was a bug.
// the threshold. | ||
while scorer.block_max_score() < threshold { | ||
let last_doc_in_block = scorer.last_doc_in_block(); | ||
if doc == TERMINATED { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean last_doc_in_block here maybe?
if score > threshold { | ||
threshold = callback(doc, score); | ||
} | ||
if doc >= scorer.last_doc_in_block() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This >=
only triggers as ==
?
If this is correct, should we add a debug_assert on == maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, we don't need the assert, this is pretty straightforward.
…to have an equality check on doc to break the loop.
afbf2e3
to
b25dbeb
Compare
I made some changes to your initial commit @fulmicoton
block_max_was_too_low_advance_one_scorer
now takes the min of all last doc id scorers until the pivot scorer. Before, we were taking thedoc()
of the pivot scorer.block_max_was_too_low_advance_one_scorer
: we were doingdoc_to_seek_after + 1
even on scorer after the pivot. There is a very small probability to miss a document here, I think we should not add +1 in this case.TermScorer
because the code is simple and readable and the performance is far better (around x3 faster). With that, we are on par with Lucene.I also created a branch with the benchmark results here: https://github.com/quickwit-inc/search-benchmark-game/tree/blockwand-for-termquery
I noticed that the performance is better on average and for union we have: