perf: reduce the required memory for indexing FTS #2926
Conversation
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##             main    #2926      +/-   ##
==========================================
+ Coverage   78.22%   78.26%   +0.03%
==========================================
  Files         239      239
  Lines       76576    76712     +136
==========================================
+ Hits        59902    60038     +136
+ Misses      13626    13604      -22
- Partials     3048     3070      +22
```
How much higher than expected? By default the Lance reader will buffer at least 2GB of I/O. Also, depending on your batch size, there may be some compute buffer in use (the default of 8Ki usually generates pretty small batches, but if you are using a larger-than-default batch size it may be larger).
About 8GB higher than expected.
How about the writer? The algorithm is reading/indexing/writing concurrently.
Could use a little more explanation in comments as to how this works.
```rust
lazy_static! {
    static ref FLUSH_THRESHOLD: usize = std::env::var("FLUSH_THRESHOLD")
        .unwrap_or_else(|_| "256".to_string())
        .parse()
        .expect("failed to parse FLUSH_THRESHOLD");
    static ref FLUSH_SIZE: usize = std::env::var("FLUSH_SIZE")
        .unwrap_or_else(|_| "64".to_string())
        .parse()
        .expect("failed to parse FLUSH_SIZE");
    static ref NUM_SHARDS: usize = std::env::var("NUM_SHARDS")
        .unwrap_or_else(|_| "8".to_string())
        .parse()
        .expect("failed to parse NUM_SHARDS");
    static ref CHANNEL_SIZE: usize = std::env::var("CHANNEL_SIZE")
        .unwrap_or_else(|_| "8".to_string())
        .parse()
        .expect("failed to parse CHANNEL_SIZE");
}
```
Could you add doc comments on what these mean and how they should be set? And could we add a prefix like `LANCE_FTS_`?
Fixed
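For illustration, a sketch of what the fixed version plausibly looks like, with the `LANCE_FTS_` prefix the reviewer asked for and doc comments; the exact names and wording in the merged code are not shown in this thread:

```rust
lazy_static! {
    /// Per-shard flush threshold in MB: once a shard's in-memory buffer
    /// reaches this size it is flushed to disk. Total indexing memory is
    /// roughly this value times the shard count (256MB * 8 = 2GB by default).
    static ref FLUSH_THRESHOLD: usize = std::env::var("LANCE_FTS_FLUSH_THRESHOLD")
        .unwrap_or_else(|_| "256".to_string())
        .parse()
        .expect("failed to parse LANCE_FTS_FLUSH_THRESHOLD");

    /// Number of shards the token maps are split across; this also bounds
    /// the number of indexing workers and the I/O buffer size.
    static ref NUM_SHARDS: usize = std::env::var("LANCE_FTS_NUM_SHARDS")
        .unwrap_or_else(|_| "8".to_string())
        .parse()
        .expect("failed to parse LANCE_FTS_NUM_SHARDS");
}
```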
```rust
    let shard = hasher.finish() as usize % NUM_SHARDS;
    token_maps[shard].insert(token, token_id);
}
```

```rust
let mut token_maps = (0..num_shards).map(|_| HashMap::new()).collect_vec();
```
I'm pretty sure you can just do this:
```diff
-let mut token_maps = (0..num_shards).map(|_| HashMap::new()).collect_vec();
+let mut token_maps = vec![HashMap::new(); num_shards];
```
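(The `vec![elem; n]` form clones its element, which requires the map's key and value types to be `Clone`; that presumably holds here since the maps store tokens and ids. It also drops the `itertools` `collect_vec` call on this line.)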
Fixed
```rust
// spawn workers to build the index
let buffer_size = get_num_compute_intensive_cpus()
    .saturating_sub(num_shards)
    .max(1)
    .min(num_shards);
```
So we are choosing a number between `1..=num_shards`, from `NUM_CPUS - NUM_SHARDS`? So if I have 8 CPUs and 8 shards, I get 1. But if I have 12 CPUs and 8 shards, I get 4. Is that intended? I'm not sure what the buffer size is here.
Yes, this buffer is used for I/O, so we want `NUM_CPUS - NUM_SHARDS`. But going higher than `NUM_SHARDS` may not help much, because the I/O is already faster than the indexing workers, so we also cap it at `NUM_SHARDS`.
Added a comment to explain this.
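To make the clamping concrete, a small standalone sketch of the same formula (the helper name `io_buffer_size` is made up for illustration; in the PR the CPU count comes from `get_num_compute_intensive_cpus()`):

```rust
/// Clamp the I/O buffer size to 1..=num_shards, starting from the
/// CPUs left over after the indexing workers take theirs.
fn io_buffer_size(num_cpus: usize, num_shards: usize) -> usize {
    num_cpus
        .saturating_sub(num_shards) // CPUs not used by indexing workers
        .max(1)                     // always keep at least one I/O buffer
        .min(num_shards)            // more than num_shards doesn't help
}

fn main() {
    assert_eq!(io_buffer_size(8, 8), 1);  // fully subscribed: minimum buffer
    assert_eq!(io_buffer_size(12, 8), 4); // spare CPUs go to I/O
    assert_eq!(io_buffer_size(32, 8), 8); // capped at the shard count
}
```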
Now it requires about 12GB of memory at peak to index the 40M Wikipedia dataset.
This PR introduces a parameter called `FLUSH_THRESHOLD` to control the memory footprint while indexing. The total memory footprint should be `FLUSH_THRESHOLD * NUM_SHARDS`, which is 256MB * 8 = 2GB by default. However, the real memory footprint is much higher than expected, maybe because of some buffers/cache for I/O or something else.