-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: parallelize ngram indexing #3501
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3501 +/- ##
==========================================
- Coverage 78.48% 78.46% -0.03%
==========================================
Files 252 252
Lines 94011 94078 +67
Branches 94011 94078 +67
==========================================
+ Hits 73783 73815 +32
- Misses 17232 17268 +36
+ Partials 2996 2995 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
rust/lance-index/src/scalar/ngram.rs
Outdated
@@ -465,12 +467,56 @@ impl NGramIndexBuilder { | |||
let schema = data.schema(); | |||
Self::validate_schema(schema.as_ref())?; | |||
|
|||
let num_shards = *LANCE_FTS_NUM_SHARDS; | |||
let buffer_size = get_num_compute_intensive_cpus() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure I understand the buffer_size
calculation? How is it related to get_num_compute_intensive_cpus
? Why not just used a small fixed number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh I forget why it's calculated as this, this is from FTS indexing code. But yes it looks like a small fixed numbers is fine here. set it to 2
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
total indexing time reduced from 23s to 5s