perf: reduce the required memory for indexing FTS #2926
Conversation
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##             main    #2926      +/-   ##
==========================================
+ Coverage   78.22%   78.26%   +0.03%
==========================================
  Files         239      239
  Lines       76576    76712     +136
==========================================
+ Hits        59902    60038     +136
+ Misses      13626    13604      -22
- Partials     3048     3070      +22
```
How much higher than expected? By default the Lance reader will buffer at least 2GB of I/O. Also, depending on your batch size, there may be some compute buffer in use (the default of 8Ki usually generates pretty small batches, but if you are using a larger-than-default batch size it may be larger).
About 8GB higher than expected.
How about the writer? The algorithm is reading/indexing/writing concurrently.
Could use a little more explanation in comments as to how this works.
```rust
lazy_static! {
    static ref FLUSH_THRESHOLD: usize = std::env::var("FLUSH_THRESHOLD")
        .unwrap_or_else(|_| "256".to_string())
        .parse()
        .expect("failed to parse FLUSH_THRESHOLD");
    static ref FLUSH_SIZE: usize = std::env::var("FLUSH_SIZE")
        .unwrap_or_else(|_| "64".to_string())
        .parse()
        .expect("failed to parse FLUSH_SIZE");
    static ref NUM_SHARDS: usize = std::env::var("NUM_SHARDS")
        .unwrap_or_else(|_| "8".to_string())
        .parse()
        .expect("failed to parse NUM_SHARDS");
    static ref CHANNEL_SIZE: usize = std::env::var("CHANNEL_SIZE")
        .unwrap_or_else(|_| "8".to_string())
        .parse()
        .expect("failed to parse CHANNEL_SIZE");
}
```
Could you add doc comments on what these mean and how they should be set? And could we add a prefix like `LANCE_FTS_`?
Fixed
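For illustration, a sketch of what the fixed version plausibly looks like, with the `LANCE_FTS_` prefix the reviewer asked for and doc comments; the exact names and wording in the merged code are not shown in this thread:

```rust
lazy_static! {
    /// Per-shard flush threshold in MB: once a shard's in-memory buffer
    /// reaches this size it is flushed to disk. Total indexing memory is
    /// roughly this value times the shard count (256MB * 8 = 2GB by default).
    static ref FLUSH_THRESHOLD: usize = std::env::var("LANCE_FTS_FLUSH_THRESHOLD")
        .unwrap_or_else(|_| "256".to_string())
        .parse()
        .expect("failed to parse LANCE_FTS_FLUSH_THRESHOLD");

    /// Number of shards the token maps are split across; this also bounds
    /// the number of indexing workers and the I/O buffer size.
    static ref NUM_SHARDS: usize = std::env::var("LANCE_FTS_NUM_SHARDS")
        .unwrap_or_else(|_| "8".to_string())
        .parse()
        .expect("failed to parse LANCE_FTS_NUM_SHARDS");
}
```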
```rust
    let shard = hasher.finish() as usize % NUM_SHARDS;
    token_maps[shard].insert(token, token_id);
}
```

```rust
let mut token_maps = (0..num_shards).map(|_| HashMap::new()).collect_vec();
```
I'm pretty sure you can just do this:
```diff
-let mut token_maps = (0..num_shards).map(|_| HashMap::new()).collect_vec();
+let mut token_maps = vec![HashMap::new(); num_shards];
```
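(The `vec![elem; n]` form clones its element, which requires the map's key and value types to be `Clone`; that presumably holds here since the maps store tokens and ids. It also drops the `itertools` `collect_vec` call on this line.)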
Fixed
```rust
// spawn workers to build the index
let buffer_size = get_num_compute_intensive_cpus()
    .saturating_sub(num_shards)
    .max(1)
    .min(num_shards);
```
So we are choosing a number between `1..=num_shards`, from `NUM_CPUS - NUM_SHARDS`? So if I have 8 CPUs and 8 shards, I get 1. But if I have 12 CPUs and 8 shards, I get 4. Is that intended? I'm not sure what the buffer size is here.
Yes, this buffer is used for I/O, so we want `NUM_CPUS - NUM_SHARDS`. But going higher than `NUM_SHARDS` may not help much, because the I/O is already faster than the indexing workers, so we also cap it at `NUM_SHARDS`.
Added a comment to explain this.
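To make the clamping concrete, a small standalone sketch of the same formula (the helper name `io_buffer_size` is made up for illustration; in the PR the CPU count comes from `get_num_compute_intensive_cpus()`):

```rust
/// Clamp the I/O buffer size to 1..=num_shards, starting from the
/// CPUs left over after the indexing workers take theirs.
fn io_buffer_size(num_cpus: usize, num_shards: usize) -> usize {
    num_cpus
        .saturating_sub(num_shards) // CPUs not used by indexing workers
        .max(1)                     // always keep at least one I/O buffer
        .min(num_shards)            // more than num_shards doesn't help
}

fn main() {
    assert_eq!(io_buffer_size(8, 8), 1);  // fully subscribed: minimum buffer
    assert_eq!(io_buffer_size(12, 8), 4); // spare CPUs go to I/O
    assert_eq!(io_buffer_size(32, 8), 8); // capped at the shard count
}
```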
Now it requires about 12GB of memory at peak to index the 40M Wikipedia dataset.
This PR introduces a parameter called `FLUSH_THRESHOLD` to control the memory footprint while indexing. The total memory footprint should be `FLUSH_THRESHOLD * NUM_SHARDS`, which is 256MB * 8 = 2GB by default. However, the real memory footprint is much higher than expected, maybe because of some buffers/cache for I/O or something else.