fix: filter out null values when sampling for index training #3404

Merged 7 commits on Jan 28, 2025
Changes from 1 commit
cleanup
wjones127 committed Jan 27, 2025

wjones127 Will Jones
commit 31aae7a8402756466ef28fdad9cd44040bc94f36
1 change: 1 addition & 0 deletions rust/lance-core/src/utils/tokio.rs
@@ -37,6 +37,7 @@ lazy_static::lazy_static! {
.thread_name("lance-cpu")
.max_blocking_threads(get_num_compute_intensive_cpus())
.worker_threads(1)
.enable_time()
// keep the thread alive "forever"
.thread_keep_alive(Duration::from_secs(u64::MAX))
.build()
19 changes: 18 additions & 1 deletion rust/lance/src/index/vector/utils.rs
@@ -254,6 +254,20 @@ impl PartitionLoadLock {
}
}

/// Generate random ranges to sample from a dataset.
///
/// This will return an iterator of ranges that together cover the whole dataset.
/// The iterator is not capped at the sample size, so the caller can decide when to stop.
/// This is useful when the caller wants to sample a fixed number of rows, but
/// has an additional filter that must be applied.
///
/// Parameters:
/// * `num_rows`: number of rows in the dataset
/// * `sample_size_hint`: the target number of rows to be sampled in the end.
/// This is a hint for the minimum number of rows that will be consumed, but
/// the caller may consume more than this.
/// * `block_size`: the byte size of ranges that should be used.
/// * `byte_width`: the byte width of the vectors that will be sampled.
fn random_ranges(
num_rows: usize,
sample_size_hint: usize,
@@ -270,7 +284,10 @@ fn random_ranges(
indices.shuffle(&mut rng);
Box::new(indices.into_iter())
} else {
// Create slices of size `sample_granularity` to sample from
// If the sample is a small proportion, then we can instead use a set
// to track which bins we have seen. We start by using the sample_size_hint
// to provide an efficient start, and from there we randomly choose bins
// one by one.
let num_bins = num_rows.div_ceil(rows_per_batch);
// Start with the minimum number we will need.
let min_sample_size = sample_size_hint / rows_per_batch;
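The hunk above is cut off, so the following is only a minimal sketch of the idea described in the comments: track visited bins in a set and draw new ones at random until the caller stops. The function name `sketch_random_ranges`, the `seed` parameter, and the small `XorShift64` generator are hypothetical and exist only to keep the example free of external crates. The sketch also omits the `sample_size_hint` warm start and the shuffle branch shown in the diff.

```rust
use std::collections::HashSet;
use std::ops::Range;

/// Tiny xorshift64 PRNG (hypothetical helper, used only so this sketch
/// compiles without any external crate).
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Pseudo-random value in `0..n` (slightly biased, fine for a sketch).
    fn below(&mut self, n: usize) -> usize {
        (self.next() % n as u64) as usize
    }
}

/// Hypothetical stand-in for the idea behind `random_ranges`: lazily yield
/// every bin of `rows_per_batch` rows exactly once, in random order.
fn sketch_random_ranges(
    num_rows: usize,
    rows_per_batch: usize,
    seed: u64,
) -> impl Iterator<Item = Range<usize>> {
    assert!(rows_per_batch > 0, "rows_per_batch must be non-zero");
    let num_bins = num_rows.div_ceil(rows_per_batch);
    // xorshift must not start from a zero state.
    let mut rng = XorShift64(seed.max(1));
    // Bins that have already been handed out.
    let mut seen: HashSet<usize> = HashSet::new();

    std::iter::from_fn(move || {
        if seen.len() >= num_bins {
            return None; // the whole dataset has been covered
        }
        // Rejection sampling: redraw until an unseen bin comes up.
        loop {
            let bin = rng.below(num_bins);
            if seen.insert(bin) {
                let start = bin * rows_per_batch;
                // The last bin may be shorter than `rows_per_batch`.
                let end = (start + rows_per_batch).min(num_rows);
                return Some(start..end);
            }
        }
    })
}
```

A caller that filters out null vectors could keep pulling ranges from this iterator and stop once it has collected enough non-null rows. Rejection sampling is cheap while the sample is a small share of the bins, which matches the case the comment in the diff describes.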