
perf: implement XTR for retrieving multivector #3437

Merged: 14 commits into lancedb:main, Mar 4, 2025

Conversation

BubbleCal
Contributor

This PR introduces XTR, which can score documents without fetching the original multivectors, so no IO is needed when searching on a multivector column.

For documents missed by a single query vector, it uses that query's minimum returned similarity as the estimated similarity.
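The scoring rule described above can be sketched as a small reference implementation (hypothetical code for illustration, not the PR's actual implementation; `xtr_scores` and its signature are invented): each query vector contributes its returned similarity for documents it found, and its minimum returned similarity as an estimate for documents it missed.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical reference implementation of the XTR scoring rule: a
// document's score is the sum over query vectors of its similarity,
// falling back to that query's minimum returned similarity when the
// query missed the document.
fn xtr_scores(per_query: &[HashMap<u64, f32>]) -> HashMap<u64, f32> {
    // each query's minimum returned similarity, used as the estimate
    let min_sims: Vec<f32> = per_query
        .iter()
        .map(|m| m.values().copied().fold(f32::INFINITY, f32::min))
        .collect();
    let all_rows: HashSet<u64> = per_query
        .iter()
        .flat_map(|m| m.keys().copied())
        .collect();
    all_rows
        .into_iter()
        .map(|row| {
            let score: f32 = per_query
                .iter()
                .zip(&min_sims)
                .map(|(m, &min_sim)| *m.get(&row).unwrap_or(&min_sim))
                .sum();
            (row, score)
        })
        .collect()
}

fn main() {
    // two query vectors, each with its ANN results (row id -> similarity)
    let q1 = HashMap::from([(1_u64, 2.0_f32), (2, 3.0)]);
    let q2 = HashMap::from([(2_u64, 1.0_f32), (3, 0.5)]);
    let scores = xtr_scores(&[q1, q2]);
    // row 2 was found by both queries: 3.0 + 1.0 = 4.0
    println!("{:?}", scores.get(&2)); // Some(4.0)
}
```

The streaming version in the PR produces the same totals without materializing all per-query result sets at once.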

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal marked this pull request as ready for review February 8, 2025 08:10
@codecov-commenter

codecov-commenter commented Feb 8, 2025

Codecov Report

Attention: Patch coverage is 85.31746% with 37 lines in your changes missing coverage. Please review.

Project coverage is 78.48%. Comparing base (33ae43b) to head (8d5a835).

Files with missing lines Patch % Lines
rust/lance/src/io/exec/knn.rs 85.44% 24 Missing and 7 partials ⚠️
rust/lance/src/dataset/scanner.rs 79.31% 1 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3437      +/-   ##
==========================================
- Coverage   78.48%   78.48%   -0.01%     
==========================================
  Files         252      252              
  Lines       94011    94220     +209     
  Branches    94011    94220     +209     
==========================================
+ Hits        73783    73947     +164     
- Misses      17232    17279      +47     
+ Partials     2996     2994       -2     
Flag Coverage Δ
unittests 78.48% <85.31%> (-0.01%) ⬇️


Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@wjones127 wjones127 (Contributor) left a comment

Something seems off in the algorithm, with how missed_similarities is handled. Could you address my comment, and also maybe write a unit test that shows we get correct results out of this?

Comment on lines +739 to +740
let row_ids = batch[ROW_ID].as_primitive::<UInt64Type>();
let dists = batch[DIST_COL].as_primitive::<Float32Type>();
Contributor

Since we are using values here, can we add a debug assert that there are no nulls?

Contributor Author

added

// at most, we will have k * refine_factor results for each query
let mut results = HashMap::with_capacity(k * refactor);
let mut missed_similarities = 0.0;
while let Some((min_sim, batch)) = reduced_inputs.try_next().await? {
Contributor

I don't understand the algorithm in the paper deeply, but it seems odd to me that the order of the ANN queries matters. It appears that later batches will be adding a higher missed_similarities value. Is that intentional?

It looks like the output order of select_all isn't deterministic. https://docs.rs/futures/latest/futures/stream/fn.select_all.html

Contributor Author

It's intentional. This is a bit complicated, so I will add more comments about it. Consider updating the final results with a batch from one query vector; for a row x:

  • if x exists in results but not in batch: use min_sim as the estimated similarity, so the contribution is min_sim
  • if x exists in both: the contribution is the sim in batch
  • if x exists only in batch: all queries before this one missed the row; the algorithm maintains missed_similarities as the sum of min_sim so far, so the contribution is missed_similarities + sim
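The three cases above can be sketched as a streaming merge. This is a simplified, hypothetical version of the PR's loop (`merge_batch` and its signature are invented for illustration), including a check that the final scores do not depend on the order the batches arrive:

```rust
use std::collections::HashMap;

// Merge one query vector's ANN batch into the running XTR scores.
// `min_sim` is the smallest similarity this query returned;
// `missed_similarities` accumulates `min_sim` over earlier batches.
// (Hypothetical sketch, not the PR's actual code.)
fn merge_batch(
    results: &mut HashMap<u64, f32>,
    batch: &HashMap<u64, f32>,
    min_sim: f32,
    missed_similarities: &mut f32,
) {
    for (row_id, score) in results.iter_mut() {
        match batch.get(row_id) {
            Some(sim) => *score += sim, // row seen by this query
            None => *score += min_sim,  // estimated contribution
        }
    }
    // rows seen for the first time are credited with the sum of all
    // earlier queries' estimates
    for (row_id, sim) in batch {
        results.entry(*row_id).or_insert(*missed_similarities + sim);
    }
    *missed_similarities += min_sim;
}

fn main() {
    let b1 = HashMap::from([(1_u64, 2.0_f32), (2, 3.0)]); // min_sim = 2.0
    let b2 = HashMap::from([(2_u64, 1.0_f32), (3, 0.5)]); // min_sim = 0.5
    for order in [[(&b1, 2.0), (&b2, 0.5)], [(&b2, 0.5), (&b1, 2.0)]] {
        let mut results = HashMap::new();
        let mut missed = 0.0;
        for (batch, min_sim) in order {
            merge_batch(&mut results, batch, min_sim, &mut missed);
        }
        // same scores in either order: {1: 2.5, 2: 4.0, 3: 2.5}
        assert_eq!(results.get(&2), Some(&4.0));
    }
    println!("order-independent");
}
```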

Contributor

Ah I see. It makes more sense now that I know what part was missing.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal
Contributor Author

Something seems off in the algorithm, with how missed_similarities is handled. Could you address my comment, and also maybe write a unit test that shows we get correct results out of this?

we have tests here https://github.com/lancedb/lance/pull/3437/files#diff-6de816b72e7c722316243c57df4f809ad34dc8581367c72335154dada48c40edL993

@BubbleCal BubbleCal requested a review from wjones127 February 11, 2025 07:35
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@wjones127
Contributor

Something seems off in the algorithm, with how missed_similarities is handled. Could you address my comment, and also maybe write a unit test that shows we get correct results out of this?

we have tests here https://github.com/lancedb/lance/pull/3437/files#diff-6de816b72e7c722316243c57df4f809ad34dc8581367c72335154dada48c40edL993

I meant more a test that the XTR algorithm itself is working as expected. Part of why I'm having a hard time understanding this PR is that there are no tests showing the expected behavior of the algorithm.

Comment on lines +792 to +796
// for a row `r`:
// if `r` is in only `results`, then `results[r] += min_sim`
// if `r` is in only `query_results`, then `results[r] = query_results[r] + missed_similarities`,
// here `missed_similarities` is the sum of `min_sim` from previous iterations
// if `r` is in both, then `results[r] += query_results[r]`
Contributor

I'm still having trouble understanding the logic.

So let's imagine batch 1 is:

  • rowid: 1, distance: 2.0
  • rowid: 2, distance 3.0

And batch 2 is:

  • rowid: 2, distance: 1.0
  • rowid: 3, distance: 0.5

If batch 1 comes in first, then we should get:

  • rowid 1, distance = 2.0 + 2.0 = 4.0 (distance in batch 1, plus min sim from batch 1)
  • rowid 2, distance = 3.0 + 1.0 = 4.0 (sum of distances across batches)
  • rowid 3, distance = 0.0 + 0.5 = 0.5 (min distance plus 0.5)

But if batch 2 comes in first, then we will get:

  • rowid 1, distance = 2.0 + 0.5 = 2.5 (min distance of batch 2, plus distance in batch 1)
  • rowid 2, distance = 1.0 + 3.0 = 4.0
  • rowid 3, distance = 0.5 + 0.5 = 1.0

rowid 1 and rowid 3 seem to get completely different distances depending on the order that the queries finish. And there doesn't seem to be any definition of what queries finish first.

Contributor Author

min_sim is from the current batch; missed_similarities is the sum of the min_sim of all batches before it.

so if batch 1 comes first:

  • rowid 1, sim = 2.0 + 0.5 = 2.5 (sim in batch 1 + min_sim in batch 2)
  • rowid 2, sim = 3.0 + 1.0 = 4.0 (sum of sim)
  • rowid 3, sim = 2.0 + 0.5 = 2.5 (missed_sim_sum (min_sim in batch 1) + sim in batch 2)

if batch 2 comes first:

  • rowid 1, sim = 0.5 + 2.0 = 2.5 (missed_sim_sum (min_sim in batch 2) + sim in batch 1)
  • rowid 2, sim = 3.0 + 1.0 = 4.0
  • rowid 3, sim = 0.5 + 2.0 = 2.5 (sim in batch 2 + min_sim in batch 1)

Yeah, it's very complicated... I should add some tests to verify that the scores are the same in any order.

Contributor

Oh, I think I see now. It wasn't clear initially which parts run per batch and which state is carried across batches.

@BubbleCal BubbleCal requested a review from wjones127 February 19, 2025 06:33
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
.collect::<Vec<_>>();

let mut res: Option<HashMap<_, _>> = None;
for perm in batches.into_iter().permutations(3) {
@BubbleCal BubbleCal (Contributor Author) Feb 19, 2025

I'm not sure whether select_all produces items in stream order when all of them are ready, but running all 6 permutations should be enough to verify the result even if it yields them in a random order.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal merged commit 87f055f into lancedb:main Mar 4, 2025
27 checks passed