Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: support hamming distance & binary vector #3198

Merged
merged 15 commits into from
Dec 7, 2024

Conversation

BubbleCal
Copy link
Contributor

No description provided.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions github-actions bot added the enhancement New feature or request label Dec 4, 2024
pub fn nearest<T: ArrowPrimitiveType>(
&mut self,
column: &str,
q: &PrimitiveArray<T>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a breaking change, code like this can't work since now:

.nearest(column, q.as_primitive(), k)

because the compiler can't infer the type

@@ -228,3 +228,81 @@ impl TryFrom<Quantizer> for FlatQuantizer {
}
}
}

#[derive(Debug, Clone, DeepSizeOf)]
pub struct FlatBinQuantizer {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to add this rather than reusing FlatQuantizer even flat quantizer does nothing, because we determine the VectorStore type by Quantization

@@ -647,6 +676,7 @@ mod tests {
#[case(4, DistanceType::L2, 1.0)]
#[case(4, DistanceType::Cosine, 1.0)]
#[case(4, DistanceType::Dot, 1.0)]
#[case(4, DistanceType::Hamming, 0.9)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even it's flat search, the recall for hamming can't always hit 100%, because there are many vectors that are at the same distance from the query vector

@BubbleCal BubbleCal changed the title feat: support hamming distance & binary vector feat!: support hamming distance & binary vector Dec 4, 2024
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions github-actions bot added the python label Dec 5, 2024
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov-commenter
Copy link

codecov-commenter commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 67.40947% with 117 lines in your changes missing coverage. Please review.

Project coverage is 78.54%. Comparing base (6c7b9fd) to head (bd27747).
Report is 132 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-index/src/vector/flat/storage.rs 56.12% 41 Missing and 2 partials ⚠️
rust/lance-index/src/vector/flat/index.rs 53.19% 21 Missing and 1 partial ⚠️
rust/lance-linalg/src/distance/hamming.rs 36.00% 16 Missing ⚠️
rust/lance-linalg/src/kmeans.rs 82.66% 13 Missing ⚠️
rust/lance/src/index.rs 53.33% 6 Missing and 1 partial ⚠️
rust/lance-index/src/vector/quantizer.rs 0.00% 5 Missing ⚠️
rust/lance/src/dataset/scanner.rs 72.22% 4 Missing and 1 partial ⚠️
rust/lance/src/index/vector/ivf.rs 80.00% 2 Missing ⚠️
rust/lance/src/index/vector/utils.rs 0.00% 0 Missing and 2 partials ⚠️
rust/lance-index/src/vector/hnsw/builder.rs 50.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3198      +/-   ##
==========================================
- Coverage   78.55%   78.54%   -0.01%     
==========================================
  Files         244      244              
  Lines       83213    84298    +1085     
  Branches    83213    84298    +1085     
==========================================
+ Hits        65370    66216     +846     
- Misses      15056    15286     +230     
- Partials     2787     2796       +9     
Flag Coverage Δ
unittests 78.54% <67.40%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal marked this pull request as ready for review December 5, 2024 12:18
…ector

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might still be good for @chebbyChefNEQ or @eddyxu to look at the indexing code but the rest looks ok to me.

Looks like we needed to duplicate a few structs to have a "Bin" version and a "Float" version?

Comment on lines 213 to 230
// deprecated, use `try_from_batch` instead
pub fn new(vectors: FixedSizeListArray, distance_type: DistanceType) -> Self {
let row_ids = Arc::new(UInt64Array::from_iter_values(0..vectors.len() as u64));
let vectors = Arc::new(vectors);

let batch = RecordBatch::try_from_iter_with_nullable(vec![
(ROW_ID, row_ids.clone() as ArrayRef, true),
(FLAT_COLUMN, vectors.clone() as ArrayRef, true),
])
.unwrap();

Self {
batch,
distance_type,
row_ids,
vectors,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid this deprecated method in Bin path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, deleted

Comment on lines +673 to +677
} else if dt.data_type().is_floating() {
coerce_float_vector(
q.as_any().downcast_ref::<Float32Array>().unwrap(),
FloatType::try_from(dt.data_type())?,
)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error message isn't quite right here. If dt is not a floating point type and q is a vector, but does not exactly match dt (e.g. UInt16 or something) then the error the user will get is Column {} is not a vector column. However, it would be nicer to give them an error like Column {} has type UInt8 and the query vector is UInt16.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, fixed!

.try_collect::<Vec<_>>()
.await?;
concat_batches(&Arc::new(ArrowSchema::from(&projection)), &batches)?
Box::pin(scanner.try_into_batch()).await?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be more thorough for try_into_batch to return a boxed future instead of being async?

@BubbleCal
Copy link
Contributor Author

BubbleCal commented Dec 6, 2024

Might still be good for @chebbyChefNEQ or @eddyxu to look at the indexing code but the rest looks ok to me.

Looks like we needed to duplicate a few structs to have a "Bin" version and a "Float" version?

Quantization determines the VectorStore, VectorStore determines the DistCalculator. For best performance we are storing the vectors ref / query as reference to slice of primitive type (f32/u8) in DistCalculator, so yes now we have to dup the structs to make the type system happy.

it's avoidable, by making DistCalculator store ArrayRef, but will make search slower

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@@ -21,7 +21,7 @@ exclude = ["python"]
resolver = "2"

[workspace.package]
version = "0.20.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not relevant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR introduces a breaking change so bump the version

@@ -31,7 +31,7 @@ fn bench_hnsw(c: &mut Criterion) {

let data = generate_random_array_with_seed::<Float32Type>(TOTAL * DIMENSION, SEED);
let fsl = FixedSizeListArray::try_new_from_values(data, DIMENSION as i32).unwrap();
let vectors = Arc::new(FlatStorage::new(fsl.clone(), DistanceType::L2));
let vectors = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, do we need FlatStorage to be typed? This is sad

Copy link
Contributor Author

@BubbleCal BubbleCal Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... unfortunately now we have to do this.
there are 2 ways to avoid this in my mind but they would introduce some new issues as well:

  1. storing ArrayRef in dist calculator then cast it to &PrimitiveArray<T> according distance type / column type. This is what we did before, it's slow
  2. just storing everything in bytes ref &[u8] then do unsafe cast

@BubbleCal BubbleCal requested a review from eddyxu December 7, 2024 06:41
@BubbleCal BubbleCal merged commit 4444c60 into lancedb:main Dec 7, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants