Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: do brute force search on unindexed data #3036

Merged
merged 15 commits into from
Oct 31, 2024
Merged

Conversation

BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Oct 23, 2024

fix #3014
Before this, FTS ignores the new unindexed data.
Here we introduce the ability to do flat search on unindexed data.
To calculate BM25 score of the unindexed rows, the only diff thing is IDF (determined by the number of documents containing the token), say idf(nq, num_docs):

  • For known token, its IDF is calculated by the index, which means this algo doesn't count the unindexed rows as the number of documents containing the token
  • For unknown token, its IDF is simply idf(1, num_docs), because this token is rare so it should contribute more to the score.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions github-actions bot added the enhancement New feature or request label Oct 23, 2024
@BubbleCal BubbleCal changed the title feat: do brute force search on unindexed data fix: do brute force search on unindexed data Oct 23, 2024
@github-actions github-actions bot added the bug Something isn't working label Oct 23, 2024
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov-commenter
Copy link

codecov-commenter commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 84.91620% with 54 lines in your changes missing coverage. Please review.

Project coverage is 78.32%. Comparing base (f17d88d) to head (f9eeea3).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/inverted/index.rs 78.43% 18 Missing and 4 partials ⚠️
rust/lance/src/io/exec/fts.rs 74.68% 15 Missing and 5 partials ⚠️
rust/lance/src/dataset/scanner.rs 86.51% 0 Missing and 12 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3036      +/-   ##
==========================================
+ Coverage   78.24%   78.32%   +0.07%     
==========================================
  Files         240      240              
  Lines       77284    78699    +1415     
  Branches    77284    78699    +1415     
==========================================
+ Hits        60470    61638    +1168     
- Misses      13696    13939     +243     
- Partials     3118     3122       +4     
Flag Coverage Δ
unittests 78.32% <84.91%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting on this. I'd like for us to try to more towards expressing things in more in terms of datafusion plans, to make these queries more composable. I think the flat_bm25_search_stream could be it's own ExecutionPlan and you can use UnionExec to combine the results of that one with the FtsExec output.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal marked this pull request as ready for review October 24, 2024 13:04
@BubbleCal BubbleCal requested a review from wjones127 October 24, 2024 13:04
Comment on lines +291 to +293
let unindexed_stream = input.execute(partition, context)?;
let unindexed_result_stream =
flat_bm25_search_stream(unindexed_stream, column, query, index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm it looks like for each index, we independently create a new scan. So if we are searching over 2 columns, that's two independent scans of unindexed data. I feel like this would be a lot more efficient if we executed the scan once, and called flat_bm25_search on each batch for each index instead.

Right now, the logic is:

for index in index:
   for batch in scan_unindexed_data():
       yield flat_bm25_search(index, batch)

And I think we should instead do:

for batch in scan_unindexed_data():
    for index in index:
        yield flat_bm25_search(index, batch)

Copy link
Contributor Author

@BubbleCal BubbleCal Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, do this because diff index may be created at diff time, say index A could be created at the time the table had 100K rows, but index B was created at the time the table had 200K rows.

single scan is still doable, but it needs to scan "union of unindexed fragments over indexes". Any documents can be large and that would be a waste, so decided to scan for each index (column)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, do this because diff index may be created at diff time, say index A could be created at the time the table had 100K rows, but index B was created at the time the table had 200K rows.

Yeah that makes sense. They might cover different fragments.

But in the code here, they are all sharing the unindexed_stream. So doesn't that make your code as written incorrect too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this unindexed_stream is from input for each index(column), each one is from here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I seen now. Thanks.

@wjones127 wjones127 changed the title fix: do brute force search on unindexed data feat: do brute force search on unindexed data Oct 24, 2024
@wjones127 wjones127 removed the bug Something isn't working label Oct 24, 2024
@BubbleCal BubbleCal requested a review from wjones127 October 25, 2024 12:06
Comment on lines +291 to +293
let unindexed_stream = input.execute(partition, context)?;
let unindexed_result_stream =
flat_bm25_search_stream(unindexed_stream, column, query, index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I seen now. Thanks.

@BubbleCal BubbleCal merged commit dcaee1d into lancedb:main Oct 31, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

full_text_search ignores new data
3 participants