feat: do brute force search on unindexed data #3036

BubbleCal · 2024-10-23T11:17:40Z

fix #3014
Before this change, FTS ignored new, unindexed data.
This PR adds the ability to run a flat (brute-force) search over unindexed data.
When computing the BM25 score of unindexed rows, the only thing that differs is the IDF, which is determined by the number of documents containing the token, say idf(nq, num_docs):

For a known token, the IDF is computed from the index, so unindexed rows are not counted toward the number of documents containing the token.
For an unknown token, the IDF is simply idf(1, num_docs): the token is rare, so it should contribute more to the score.
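
A minimal sketch of the IDF rule described above. The PR description only names idf(nq, num_docs); the Lucene-style BM25 formula used here is an assumption, and `token_idf` and the `index_doc_freq` map are hypothetical names introduced for illustration, not the actual Lance API.

```rust
use std::collections::HashMap;

/// Lucene-style BM25 IDF. Assumed formula: the PR only refers to
/// `idf(nq, num_docs)` without spelling it out.
fn idf(nq: usize, num_docs: usize) -> f32 {
    let n = num_docs as f32;
    let nq = nq as f32;
    ((n - nq + 0.5) / (nq + 0.5) + 1.0).ln()
}

/// Hypothetical helper: pick the IDF for a token while scoring unindexed rows.
/// `index_doc_freq` maps each indexed token to the number of indexed
/// documents that contain it.
fn token_idf(index_doc_freq: &HashMap<String, usize>, num_docs: usize, token: &str) -> f32 {
    match index_doc_freq.get(token) {
        // Known token: use the document frequency recorded in the index,
        // so unindexed rows are not counted.
        Some(&nq) => idf(nq, num_docs),
        // Unknown token: treat it as rare, i.e. idf(1, num_docs).
        None => idf(1, num_docs),
    }
}
```

With this rule, a token the index has never seen always scores at least as high as any indexed token, since idf decreases as nq grows.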

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

codecov-commenter · 2024-10-23T13:04:26Z

Codecov Report

Attention: Patch coverage is 84.91620% with 54 lines in your changes missing coverage. Please review.

Project coverage is 78.32%. Comparing base (f17d88d) to head (f9eeea3).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/inverted/index.rs	78.43%	18 Missing and 4 partials ⚠️
rust/lance/src/io/exec/fts.rs	74.68%	15 Missing and 5 partials ⚠️
rust/lance/src/dataset/scanner.rs	86.51%	0 Missing and 12 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3036      +/-   ##
==========================================
+ Coverage   78.24%   78.32%   +0.07%     
==========================================
  Files         240      240              
  Lines       77284    78699    +1415     
  Branches    77284    78699    +1415     
==========================================
+ Hits        60470    61638    +1168     
- Misses      13696    13939     +243     
- Partials     3118     3122       +4

Flag	Coverage Δ
unittests	`78.32% <84.91%> (+0.07%)`	⬆️


wjones127

Thanks for starting on this. I'd like us to try to move towards expressing things more in terms of datafusion plans, to make these queries more composable. I think flat_bm25_search_stream could be its own ExecutionPlan, and you can use UnionExec to combine its results with the FtsExec output.

rust/lance-index/src/scalar/inverted/index.rs

rust/lance/src/io/exec/fts.rs

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

wjones127 · 2024-10-24T17:58:47Z

rust/lance/src/io/exec/fts.rs

+                    let unindexed_stream = input.execute(partition, context)?;
+                    let unindexed_result_stream =
+                        flat_bm25_search_stream(unindexed_stream, column, query, index);


Hmm it looks like for each index, we independently create a new scan. So if we are searching over 2 columns, that's two independent scans of unindexed data. I feel like this would be a lot more efficient if we executed the scan once, and called flat_bm25_search on each batch for each index instead.

Right now, the logic is:

    for index in indexes:
        for batch in scan_unindexed_data():
            yield flat_bm25_search(index, batch)

And I think we should instead do:

    for batch in scan_unindexed_data():
        for index in indexes:
            yield flat_bm25_search(index, batch)

Yeah, we do it this way because different indexes may be created at different times. Say, index A could have been created when the table had 100K rows, while index B was created when the table had 200K rows.

A single scan is still doable, but it would need to scan the union of unindexed fragments across all indexes. Documents can be large, so that would be wasteful; that's why I decided to scan once per index (column).

Yeah, we do it this way because different indexes may be created at different times. Say, index A could have been created when the table had 100K rows, while index B was created when the table had 200K rows.

Yeah, that makes sense. They might cover different fragments.

But in the code here, they are all sharing the unindexed_stream. So doesn't that make your code as written incorrect too?

This unindexed_stream comes from the input for each index (column); each one is from here.

Okay, I see now. Thanks.
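
To illustrate the trade-off discussed in this thread, here is a small sketch comparing the per-index scan with a single scan over the union of unindexed fragments. It models fragments as plain integer ids; the function names are hypothetical and do not correspond to anything in the Lance codebase.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: fragments in the table that a given index does not cover.
fn unindexed_fragments(all: &BTreeSet<u32>, indexed: &BTreeSet<u32>) -> BTreeSet<u32> {
    all.difference(indexed).copied().collect()
}

/// Per-index plan: each index gets its own scan over only the fragments it
/// is missing.
fn per_index_scans(all: &BTreeSet<u32>, indexes: &[BTreeSet<u32>]) -> Vec<BTreeSet<u32>> {
    indexes
        .iter()
        .map(|indexed| unindexed_fragments(all, indexed))
        .collect()
}

/// Single-scan plan: one scan over the union of every index's unindexed
/// fragments. Each index would then have to skip fragments it already covers.
fn union_scan(all: &BTreeSet<u32>, indexes: &[BTreeSet<u32>]) -> BTreeSet<u32> {
    per_index_scans(all, indexes)
        .into_iter()
        .flatten()
        .collect()
}
```

For example, with fragments {0, 1, 2}, index A covering {0} and index B covering {0, 1}, the per-index scans read {1, 2} for A and {2} for B, while the union scan reads {1, 2} and must keep fragment 1 away from index B, which has already indexed it.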

rust/lance/src/dataset/scanner.rs

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

wjones127 · 2024-10-31T19:31:10Z

rust/lance/src/io/exec/fts.rs

+                    let unindexed_stream = input.execute(partition, context)?;
+                    let unindexed_result_stream =
+                        flat_bm25_search_stream(unindexed_stream, column, query, index);


Okay, I see now. Thanks.

BubbleCal added 4 commits October 19, 2024 12:24

fix: do brute force search on unindexed data

4a0643e

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

35c72e2

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Merge branch 'main' of https://github.com/lancedb/lance into bf-fts

c31d44d

fix

3f138f7

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

github-actions bot added the enhancement New feature or request label Oct 23, 2024

BubbleCal changed the title ~~feat: do brute force search on unindexed data~~ fix: do brute force search on unindexed data Oct 23, 2024

github-actions bot added the bug Something isn't working label Oct 23, 2024

BubbleCal requested review from westonpace and wjones127 October 23, 2024 11:20

BubbleCal added 5 commits October 23, 2024 19:36

fix

0ff2b7e

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

47fba58

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

60109db

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

remove unused code

20ea0f8

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

clean code

6dc3cb7

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

wjones127 reviewed Oct 23, 2024

rust/lance-index/src/scalar/inverted/index.rs (outdated)

rust/lance/src/io/exec/fts.rs (outdated)

BubbleCal added 4 commits October 24, 2024 12:54

fix

f25ff0c

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

a636d12

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

79082cb

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

9eef9ab

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal marked this pull request as ready for review October 24, 2024 13:04

BubbleCal requested a review from wjones127 October 24, 2024 13:04

wjones127 reviewed Oct 24, 2024

wjones127 changed the title ~~fix: do brute force search on unindexed data~~ feat: do brute force search on unindexed data Oct 24, 2024

wjones127 removed the bug Something isn't working label Oct 24, 2024

BubbleCal added 2 commits October 25, 2024 16:49

fix & add more tests

d0c170d

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Merge branch 'main' of https://github.com/lancedb/lance into bf-fts

f9eeea3

BubbleCal requested a review from wjones127 October 25, 2024 12:06

wjones127 approved these changes Oct 31, 2024

BubbleCal merged commit dcaee1d into lancedb:main Oct 31, 2024
23 checks passed

BubbleCal mentioned this pull request Nov 19, 2024

Graphrag：The recall effect is not ideal lancedb/lancedb#1825

Open
