Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve v3 indexing perf #3525

Merged
merged 40 commits into from
Mar 13, 2025
Merged

Conversation

BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Mar 10, 2025

  • avoids writing/loading the original vectors to disk to significantly reduce the IO time (doesn't help for IVF_FLAT)
  • use quantized storage to build HNSW so it can be much faster

IVF_PQ is 3x faster than before on sift1M

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…indexing-perf

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@@ -140,64 +139,37 @@ pub trait VectorStore: Send + Sync + Sized + Clone {
fn dist_calculator(&self, query: ArrayRef) -> Self::DistanceCalculator<'_>;

fn dist_calculator_from_id(&self, id: u32) -> Self::DistanceCalculator<'_>;

fn distance_between(&self, a: u32, b: u32) -> f32;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method can't be implemented efficiently for PQ, so remove it, it's always the same as dist_calculator_from_id(a).distance(b)

fn distance_between(&self, a: u32, b: u32) -> f32;

fn dist_calculator_from_native(&self, _query: ArrayRef) -> Self::DistanceCalculator<'_> {
todo!("Implement this")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think any type implements this for now, so remove it

&& ProductQuantizer::use_residual(distance_type)
&& ivf.centroids.is_some()
{
transformers.push(Arc::new(ResidualTransform::new(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the residual transformer to be in IVFTransformer, which happens before shuffling the data into partitions, so that we don't need to write/read the original vectors, to save IO time/space

store: &ObjectStore,
_dataset: &Arc<Dataset>,
_column: &str,
_store: &ObjectStore,
Copy link
Contributor Author

@BubbleCal BubbleCal Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will probably need these params in the future PR soon, so keep them for now

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal marked this pull request as ready for review March 11, 2025 14:00
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 80.66298% with 35 lines in your changes missing coverage. Please review.

Project coverage is 78.72%. Comparing base (4358df1) to head (27ff3a4).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-index/src/vector/pq/storage.rs 77.86% 29 Missing ⚠️
rust/lance-index/src/vector/storage.rs 66.66% 0 Missing and 4 partials ⚠️
rust/lance-index/src/vector/pq/transform.rs 83.33% 0 Missing and 1 partial ⚠️
rust/lance-index/src/vector/v3/shuffler.rs 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3525      +/-   ##
==========================================
+ Coverage   78.64%   78.72%   +0.07%     
==========================================
  Files         254      255       +1     
  Lines       94952    95163     +211     
  Branches    94952    95163     +211     
==========================================
+ Hits        74674    74916     +242     
+ Misses      17251    17211      -40     
- Partials     3027     3036       +9     
Flag Coverage Δ
unittests 78.72% <80.66%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions but looks good to me. Overall I think the result is simpler so this is nice!

Comment on lines +14 to +16

#[derive(Debug)]
pub struct FlatTransformer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[derive(Debug)]
pub struct FlatTransformer {
/// Renames the input column to FLAT_COLUMN
#[derive(Debug)]
pub struct FlatTransformer {

Does this transformer do anything else? Or is it just renaming the column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just renames the column

pub fn new(distance_type: DistanceType, quantizer: Q) -> Result<Self> {
let transformers = if matches!(Q::quantization_type(), QuantizationType::Product) {
let metadata = quantizer.metadata(None)?;
vec![Arc::new(TransposeTransformer::new(metadata.to_string())?) as _]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the roundtrip through JSON needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metadata can be from the other quantizer but this handles only PQ case

@@ -265,7 +266,7 @@ impl ShuffleReader for IvfShufflerReader {
Arc::new(schema),
reader.read_stream(
lance_io::ReadBatchParams::RangeFull,
4096,
u32::MAX,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this use too much memory? Can shuffle files be large?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah probably, but we will concat all batches later so it's the same

@BubbleCal BubbleCal merged commit 15420d5 into lancedb:main Mar 13, 2025
26 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants