feat: do brute force search on unindexed data #3036
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##             main    #3036      +/-   ##
==========================================
+ Coverage   78.24%   78.32%   +0.07%
==========================================
  Files         240      240
  Lines       77284    78699    +1415
  Branches    77284    78699    +1415
==========================================
+ Hits        60470    61638    +1168
- Misses      13696    13939     +243
- Partials     3118     3122       +4
Thanks for starting on this. I'd like us to move towards expressing things more in terms of DataFusion plans, to make these queries more composable. I think `flat_bm25_search_stream` could be its own `ExecutionPlan`, and you can use `UnionExec` to combine its results with the `FtsExec` output.
let unindexed_stream = input.execute(partition, context)?;
let unindexed_result_stream =
    flat_bm25_search_stream(unindexed_stream, column, query, index);
Hmm, it looks like we independently create a new scan for each index. So if we are searching over 2 columns, that's two independent scans of the unindexed data. I feel like it would be a lot more efficient to execute the scan once and call `flat_bm25_search` on each batch for each index instead.
Right now, the logic is:
    for index in indexes:
        for batch in scan_unindexed_data():
            yield flat_bm25_search(index, batch)
And I think we should instead do:
    for batch in scan_unindexed_data():
        for index in indexes:
            yield flat_bm25_search(index, batch)
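To make the difference in scan count concrete, here is a minimal Rust sketch of the two loop orders. It assumes every index needs the same unindexed rows. The `Batch` and `Index` types and the bodies of `scan_unindexed_data` and `flat_bm25_search` are hypothetical stand-ins, not Lance's actual API; the scan is simulated with a counter.

```rust
use std::cell::Cell;

/// Hypothetical stand-ins for a record batch and an FTS index.
type Batch = Vec<&'static str>;

struct Index {
    column: &'static str,
}

/// Simulated scan over unindexed data; bumps `scans` each time a scan starts.
fn scan_unindexed_data(scans: &Cell<usize>) -> Vec<Batch> {
    scans.set(scans.get() + 1);
    vec![vec!["a", "b"], vec!["c"]]
}

/// Placeholder for the per-batch flat search; returns (column, rows in batch).
fn flat_bm25_search(index: &Index, batch: &Batch) -> (&'static str, usize) {
    (index.column, batch.len())
}

/// Index-major order: one scan per index.
fn index_major(indexes: &[Index], scans: &Cell<usize>) -> Vec<(&'static str, usize)> {
    let mut out = Vec::new();
    for index in indexes {
        for batch in scan_unindexed_data(scans) {
            out.push(flat_bm25_search(index, &batch));
        }
    }
    out
}

/// Batch-major order: a single scan shared by all indexes.
fn batch_major(indexes: &[Index], scans: &Cell<usize>) -> Vec<(&'static str, usize)> {
    let mut out = Vec::new();
    for batch in scan_unindexed_data(scans) {
        for index in indexes {
            out.push(flat_bm25_search(index, &batch));
        }
    }
    out
}
```

With two indexes, `index_major` starts two scans and `batch_major` starts one, while both produce the same number of per-batch results.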
Yeah, I did it this way because different indexes may be created at different times. Say index A was created when the table had 100K rows, but index B was created when the table had 200K rows.
A single scan is still doable, but it would need to scan the union of the unindexed fragments over all indexes. Documents can be large, so that would be wasteful; I decided to scan once per index (column) instead.
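The point about differing coverage can be sketched with plain sets of fragment IDs. This is an illustration only: representing fragments as `u32` IDs and the helper names `unindexed_fragments` and `union_unindexed` are assumptions made for the example, not how Lance tracks index coverage.

```rust
use std::collections::BTreeSet;

/// Fragments not covered by an index: all fragments minus the indexed ones.
fn unindexed_fragments(all: &BTreeSet<u32>, indexed: &BTreeSet<u32>) -> BTreeSet<u32> {
    all.difference(indexed).copied().collect()
}

/// Fragments a single shared scan would have to read:
/// the union of the unindexed fragments over every index.
fn union_unindexed(all: &BTreeSet<u32>, indexes: &[BTreeSet<u32>]) -> BTreeSet<u32> {
    indexes
        .iter()
        .flat_map(|indexed| unindexed_fragments(all, indexed))
        .collect()
}
```

For example, with fragments 0 through 3, an index A covering {0, 1} and an index B covering {0, 1, 2}, A has unindexed fragments {2, 3} and B has only {3}. A shared scan would read {2, 3} for both columns, even though B does not need fragment 2.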
> Yeah, I did it this way because different indexes may be created at different times. Say index A was created when the table had 100K rows, but index B was created when the table had 200K rows.

Yeah, that makes sense. They might cover different fragments.
But in the code here, they all share the `unindexed_stream`. So doesn't that make your code as written incorrect too?
This `unindexed_stream` comes from `input`, and there is a separate `input` for each index (column); each one comes from here
Okay, I see now. Thanks.
fix #3014
Before this change, FTS ignored new, unindexed data.
Here we introduce the ability to do flat search on unindexed data.
To calculate the BM25 score of unindexed rows, the only thing that differs is the IDF, which is determined by `nq`, the number of documents containing the token. A token that exists in the index uses `idf(nq, num_docs)`; a token that does not uses `idf(1, num_docs)`, because such a token is rare and should contribute more to the score.
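A minimal sketch of that IDF choice follows. The formula inside `idf` is the common Lucene-style BM25 IDF and is an assumption here; the PR's actual `idf` may differ. `flat_search_idf` is a hypothetical helper name, and it assumes a token missing from the index falls back to `nq = 1`.

```rust
/// BM25 inverse document frequency.
/// Assumption: the Lucene-style formula ln((N - nq + 0.5) / (nq + 0.5) + 1).
fn idf(nq: usize, num_docs: usize) -> f32 {
    let (nq, num_docs) = (nq as f32, num_docs as f32);
    ((num_docs - nq + 0.5) / (nq + 0.5) + 1.0).ln()
}

/// IDF used when scoring an unindexed row (hypothetical helper).
/// `indexed_nq` is the number of indexed documents containing the token,
/// or `None` if the token does not appear in the index at all.
fn flat_search_idf(indexed_nq: Option<usize>, num_docs: usize) -> f32 {
    // A token unseen by the index is treated as if one document contained it.
    idf(indexed_nq.unwrap_or(1), num_docs)
}
```

Under this formula, a token the index has never seen gets a larger IDF than a token found in many indexed documents, which matches the intent that rare tokens contribute more to the score.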