perf: parallelize ngram indexing #3501

BubbleCal · 2025-03-03T08:21:41Z

Total indexing time for the ngram_index(1000000) benchmark dropped from about 23s to about 5s:

ngram_index(1000000)    time:   [5.1192 s 5.1756 s 5.2319 s]
                        change: [-78.163% -77.791% -77.410%] (p = 0.00 < 0.05)
                        Performance has improved.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

codecov-commenter · 2025-03-03T08:59:48Z

Codecov Report

Attention: Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.

Project coverage is 78.46%. Comparing base (33ae43b) to head (e520053).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/ngram.rs	97.14%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3501      +/-   ##
==========================================
- Coverage   78.48%   78.46%   -0.03%     
==========================================
  Files         252      252              
  Lines       94011    94078      +67     
  Branches    94011    94078      +67     
==========================================
+ Hits        73783    73815      +32     
- Misses      17232    17268      +36     
+ Partials     2996     2995       -1

Flag	Coverage Δ
unittests	`78.46% <97.14%> (-0.03%)`	⬇️



westonpace · 2025-03-03T13:31:49Z

rust/lance-index/src/scalar/ngram.rs

@@ -465,12 +467,56 @@ impl NGramIndexBuilder {
        let schema = data.schema();
        Self::validate_schema(schema.as_ref())?;

+        let num_shards = *LANCE_FTS_NUM_SHARDS;
+        let buffer_size = get_num_compute_intensive_cpus()


I'm not entirely sure I understand the buffer_size calculation. How is it related to get_num_compute_intensive_cpus? Why not just use a small fixed number?

tbh I forget why it's calculated this way; it comes from the FTS indexing code. But yes, a small fixed number looks fine here. Set it to 2.
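
For readers following this thread, here is a minimal sketch of the general pattern under discussion: several shard workers build partial n-gram postings in parallel, each fed through a bounded channel with a small fixed capacity. It is not the code from this PR. The names `trigrams` and `build_index`, the round-robin routing by row id, and the use of `std::sync::mpsc::sync_channel` are all assumptions made for illustration; the actual change reads the shard count from `LANCE_FTS_NUM_SHARDS` and may partition work differently.

```rust
use std::collections::{BTreeSet, HashMap};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical helper: lowercase character trigrams of `text`.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Hypothetical builder: maps each trigram to the set of row ids containing it.
/// Rows are routed round-robin to `num_shards` worker threads. Each worker has
/// its own bounded channel with `buffer_size` slots, so the producer blocks
/// instead of queueing unbounded input when a worker falls behind.
fn build_index(
    docs: &[&str],
    num_shards: usize,
    buffer_size: usize,
) -> HashMap<String, BTreeSet<u64>> {
    assert!(num_shards > 0, "need at least one shard");

    let mut senders = Vec::with_capacity(num_shards);
    let mut workers = Vec::with_capacity(num_shards);
    for _ in 0..num_shards {
        let (tx, rx) = sync_channel::<(u64, String)>(buffer_size);
        senders.push(tx);
        workers.push(thread::spawn(move || {
            // Each worker owns a private partial index, so no locking is needed.
            let mut partial: HashMap<String, BTreeSet<u64>> = HashMap::new();
            for (row_id, text) in rx {
                for gram in trigrams(&text) {
                    partial.entry(gram).or_default().insert(row_id);
                }
            }
            partial
        }));
    }

    // Producer: hand each row to a shard; `send` blocks when that shard's buffer is full.
    for (row_id, doc) in docs.iter().enumerate() {
        senders[row_id % num_shards]
            .send((row_id as u64, doc.to_string()))
            .expect("worker exited early");
    }
    // Dropping the senders closes the channels and lets the workers finish.
    drop(senders);

    // Merge the per-shard partial indexes into one.
    let mut index: HashMap<String, BTreeSet<u64>> = HashMap::new();
    for worker in workers {
        for (gram, rows) in worker.join().expect("worker panicked") {
            index.entry(gram).or_default().extend(rows);
        }
    }
    index
}
```

Calling `build_index(&["lance", "dance"], 2, 2)` would use two workers with two buffered rows each. In this sketch the buffer only bounds how far the producer can run ahead of a worker, which is one reason a small constant such as 2 can be enough without tying it to the CPU count.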


perf: parallize ngram indexing

5d528bb

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

github-actions bot added the performance label Mar 3, 2025

BubbleCal changed the title ~~perf: parallize ngram indexing~~ perf: parallelize ngram indexing Mar 3, 2025

add benchmark

472f96a

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal marked this pull request as ready for review March 3, 2025 12:42

BubbleCal requested a review from westonpace March 3, 2025 12:42

fmt

6258527

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace approved these changes Mar 3, 2025

BubbleCal added 2 commits March 3, 2025 22:16

remove buffer_size

fb21ee4

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

e520053

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace merged commit a144028 into lancedb:main Mar 3, 2025
27 checks passed

BubbleCal mentioned this pull request Mar 7, 2025

Parallelize ngram index creation #3496

Closed
