
perf: reduce the required memory for indexing FTS #2926

Merged: 19 commits merged into lancedb:main on Oct 17, 2024

Conversation

@BubbleCal (Contributor) commented Sep 24, 2024

It now requires about 12 GB of memory at peak to index the 40M-row Wikipedia dataset.
This PR introduces a parameter called FLUSH_THRESHOLD to control the memory footprint while indexing. The total memory footprint should be FLUSH_THRESHOLD * NUM_SHARDS, which is 256 MB * 8 = 2 GB by default.

However, the real memory footprint is much higher than expected, possibly because of buffers or caches for I/O, or something else.
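To make the FLUSH_THRESHOLD * NUM_SHARDS bound concrete, here is a hypothetical sketch of the flush-on-threshold idea; ShardBuilder and the postings layout are illustrative assumptions, not this PR's actual builder code:

// Hypothetical sketch only; not the PR's actual builder code.
const FLUSH_THRESHOLD_BYTES: usize = 256 * 1024 * 1024; // 256 MB per shard

struct ShardBuilder {
    postings: Vec<(u32, u64)>, // buffered (token_id, row_id) pairs
    buffered_bytes: usize,
}

impl ShardBuilder {
    fn add(&mut self, token_id: u32, row_id: u64) {
        self.postings.push((token_id, row_id));
        self.buffered_bytes += std::mem::size_of::<(u32, u64)>();
        // Each shard flushes independently once it crosses the threshold, so
        // total buffered memory stays near FLUSH_THRESHOLD * NUM_SHARDS.
        if self.buffered_bytes >= FLUSH_THRESHOLD_BYTES {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // Write buffered postings to disk here (omitted), then reset.
        self.postings.clear();
        self.buffered_bytes = 0;
    }
}

fn main() {
    let mut shard = ShardBuilder { postings: Vec::new(), buffered_bytes: 0 };
    shard.add(42, 7);
}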

westonpace and others added 5 commits September 24, 2024
@codecov-commenter commented Sep 25, 2024

Codecov Report

Attention: Patch coverage is 68.28087% with 131 lines in your changes missing coverage. Please review.

Project coverage is 78.26%. Comparing base (631e9bf) to head (3c74b73).

Files with missing lines                        | Patch % | Lines
rust/lance-index/src/scalar/inverted/builder.rs | 69.07%  | 75 missing, 15 partials ⚠️
rust/lance-index/src/scalar/inverted/index.rs   | 66.39%  | 37 missing, 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2926      +/-   ##
==========================================
+ Coverage   78.22%   78.26%   +0.03%     
==========================================
  Files         239      239              
  Lines       76576    76712     +136     
==========================================
+ Hits        59902    60038     +136     
+ Misses      13626    13604      -22     
- Partials     3048     3070      +22     
Flag Coverage Δ
unittests 78.26% <68.28%> (+0.03%) ⬆️


@BubbleCal BubbleCal requested review from westonpace and eddyxu October 9, 2024 08:56
@BubbleCal BubbleCal marked this pull request as ready for review October 10, 2024 03:37
@westonpace (Contributor):
How much higher than expected? By default the Lance reader will buffer at least 2 GB of I/O. Also, depending on your batch size, there may be some compute buffer in use (the default of 8Ki usually generates pretty small batches, but if you are using a larger-than-default batch size it may be larger).

@BubbleCal (Contributor, Author):

> How much higher than expected? By default the Lance reader will buffer at least 2 GB of I/O. […]

About 8 GB higher than expected.
I'm using the default batch size, but chunking the batch stream with chunk_size=4096.
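For illustration, one plausible reading of "chunking the batch stream" is grouping stream items 4096 at a time. This generic sketch uses the futures crate's StreamExt::chunks rather than lance's actual reader API, so treat the setup as an assumption:

// Generic sketch; assumes futures::StreamExt::chunks, not lance's reader API.
use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    let batches = stream::iter(0..10_000u32); // stand-in for a record-batch stream
    let mut chunked = batches.chunks(4096);   // yields Vecs of up to 4096 items
    while let Some(chunk) = chunked.next().await {
        println!("processing a chunk of {} items", chunk.len());
    }
}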

@BubbleCal (Contributor, Author):

> How much higher than expected? By default the Lance reader will buffer at least 2 GB of I/O. […]

What about the writer? The algorithm is reading, indexing, and writing concurrently.

@wjones127 (Contributor) left a review:

Could use a little more explanation in comments as to how this works.

Comment on lines 40 to 55
static ref FLUSH_THRESHOLD: usize = std::env::var("FLUSH_THRESHOLD")
.unwrap_or_else(|_| "256".to_string())
.parse()
.expect("failed to parse FLUSH_THRESHOLD");
static ref FLUSH_SIZE: usize = std::env::var("FLUSH_SIZE")
.unwrap_or_else(|_| "64".to_string())
.parse()
.expect("failed to parse FLUSH_SIZE");
static ref NUM_SHARDS: usize = std::env::var("NUM_SHARDS")
.unwrap_or_else(|_| "8".to_string())
.parse()
.expect("failed to parse NUM_SHARDS");
static ref CHANNEL_SIZE: usize = std::env::var("CHANNEL_SIZE")
.unwrap_or_else(|_| "8".to_string())
.parse()
.expect("failed to parse CHANNEL_SIZE");
@wjones127 (Contributor):

Could you add doc comments on what these mean and how they should be set?
And could we add a prefix like LANCE_FTS_?

@BubbleCal (Contributor, Author):

Fixed
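For illustration, a minimal sketch of what the prefixed, documented variables might look like; the exact names and doc text here are assumptions, not the merged code:

// Hypothetical sketch; the merged code's exact names and comments may differ.
lazy_static::lazy_static! {
    /// Size in MiB a shard may buffer before flushing to disk. Peak indexing
    /// memory is roughly LANCE_FTS_FLUSH_THRESHOLD * LANCE_FTS_NUM_SHARDS.
    static ref FLUSH_THRESHOLD: usize = std::env::var("LANCE_FTS_FLUSH_THRESHOLD")
        .unwrap_or_else(|_| "256".to_string())
        .parse()
        .expect("failed to parse LANCE_FTS_FLUSH_THRESHOLD");

    /// Number of shards, i.e. parallel indexing workers.
    static ref NUM_SHARDS: usize = std::env::var("LANCE_FTS_NUM_SHARDS")
        .unwrap_or_else(|_| "8".to_string())
        .parse()
        .expect("failed to parse LANCE_FTS_NUM_SHARDS");
}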

let shard = hasher.finish() as usize % NUM_SHARDS;
token_maps[shard].insert(token, token_id);
}
let mut token_maps = (0..num_shards).map(|_| HashMap::new()).collect_vec();
@wjones127 (Contributor):

I'm pretty sure you can just do this:

Suggested change
- let mut token_maps = (0..num_shards).map(|_| HashMap::new()).collect_vec();
+ let mut token_maps = vec![HashMap::new(); num_shards];
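A side note on the suggestion: vec![elem; n] works here because the macro clones its element, and HashMap is Clone whenever its key and value types are; it also drops the itertools collect_vec call at this site.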

@BubbleCal (Contributor, Author):

Fixed

Comment on lines 119 to 123
// spawn workers to build the index
let buffer_size = get_num_compute_intensive_cpus()
.saturating_sub(num_shards)
.max(1)
.min(num_shards);
@wjones127 (Contributor):

So we are choosing a number between 1..=num_shards, starting from NUM_CPUS - NUM_SHARDS? So if I have 8 CPUs and 8 shards, I get 1. But if I have 12 CPUs and 8 shards, I get 4. Is that intended? I'm not sure what the buffer size is here.

@BubbleCal (Contributor, Author):

Yes, this buffer is used for I/O, so we want NUM_CPUS - NUM_SHARDS. But going higher than NUM_SHARDS may not help much, because the I/O side is already faster than the indexing workers, so we also cap it at NUM_SHARDS.
Added a comment to explain this.
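Restating the arithmetic from this thread as a self-contained sketch (the function name is illustrative):

// Illustrative restatement of the clamp in builder.rs discussed above.
fn io_buffer_size(num_cpus: usize, num_shards: usize) -> usize {
    num_cpus.saturating_sub(num_shards).max(1).min(num_shards)
}

fn main() {
    assert_eq!(io_buffer_size(8, 8), 1);   // 8 - 8 = 0, raised to the floor of 1
    assert_eq!(io_buffer_size(12, 8), 4);  // 12 - 8 = 4, already within 1..=8
    assert_eq!(io_buffer_size(32, 8), 8);  // 32 - 8 = 24, capped at num_shards
}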

@BubbleCal BubbleCal requested a review from wjones127 October 17, 2024 01:42
@BubbleCal BubbleCal merged commit 707b78d into lancedb:main Oct 17, 2024
22 checks passed