-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: support hamming distance & binary vector #3198
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
pub fn nearest<T: ArrowPrimitiveType>( | ||
&mut self, | ||
column: &str, | ||
q: &PrimitiveArray<T>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a breaking change, code like this can't work since now:
.nearest(column, q.as_primitive(), k)
because the compiler can't infer the type
@@ -228,3 +228,81 @@ impl TryFrom<Quantizer> for FlatQuantizer { | |||
} | |||
} | |||
} | |||
|
|||
#[derive(Debug, Clone, DeepSizeOf)] | |||
pub struct FlatBinQuantizer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to add this rather than reusing FlatQuantizer
even flat quantizer does nothing, because we determine the VectorStore
type by Quantization
@@ -647,6 +676,7 @@ mod tests { | |||
#[case(4, DistanceType::L2, 1.0)] | |||
#[case(4, DistanceType::Cosine, 1.0)] | |||
#[case(4, DistanceType::Dot, 1.0)] | |||
#[case(4, DistanceType::Hamming, 0.9)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even it's flat search, the recall for hamming can't always hit 100%, because there are many vectors that are at the same distance from the query vector
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3198 +/- ##
==========================================
- Coverage 78.55% 78.54% -0.01%
==========================================
Files 244 244
Lines 83213 84298 +1085
Branches 83213 84298 +1085
==========================================
+ Hits 65370 66216 +846
- Misses 15056 15286 +230
- Partials 2787 2796 +9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…ector Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might still be good for @chebbyChefNEQ or @eddyxu to look at the indexing code but the rest looks ok to me.
Looks like we needed to duplicate a few structs to have a "Bin" version and a "Float" version?
// deprecated, use `try_from_batch` instead | ||
pub fn new(vectors: FixedSizeListArray, distance_type: DistanceType) -> Self { | ||
let row_ids = Arc::new(UInt64Array::from_iter_values(0..vectors.len() as u64)); | ||
let vectors = Arc::new(vectors); | ||
|
||
let batch = RecordBatch::try_from_iter_with_nullable(vec![ | ||
(ROW_ID, row_ids.clone() as ArrayRef, true), | ||
(FLAT_COLUMN, vectors.clone() as ArrayRef, true), | ||
]) | ||
.unwrap(); | ||
|
||
Self { | ||
batch, | ||
distance_type, | ||
row_ids, | ||
vectors, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid this deprecated method in Bin
path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, deleted
} else if dt.data_type().is_floating() { | ||
coerce_float_vector( | ||
q.as_any().downcast_ref::<Float32Array>().unwrap(), | ||
FloatType::try_from(dt.data_type())?, | ||
)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the error message isn't quite right here. If dt
is not a floating point type and q
is a vector, but does not exactly match dt
(e.g. UInt16
or something) then the error the user will get is Column {} is not a vector column
. However, it would be nicer to give them an error like Column {} has type UInt8 and the query vector is UInt16
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, fixed!
rust/lance/src/index/vector/utils.rs
Outdated
.try_collect::<Vec<_>>() | ||
.await?; | ||
concat_batches(&Arc::new(ArrowSchema::from(&projection)), &batches)? | ||
Box::pin(scanner.try_into_batch()).await? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be more thorough for try_into_batch
to return a boxed future instead of being async?
it's avoidable, by making |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@@ -21,7 +21,7 @@ exclude = ["python"] | |||
resolver = "2" | |||
|
|||
[workspace.package] | |||
version = "0.20.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this PR introduces a breaking change so bump the version
@@ -31,7 +31,7 @@ fn bench_hnsw(c: &mut Criterion) { | |||
|
|||
let data = generate_random_array_with_seed::<Float32Type>(TOTAL * DIMENSION, SEED); | |||
let fsl = FixedSizeListArray::try_new_from_values(data, DIMENSION as i32).unwrap(); | |||
let vectors = Arc::new(FlatStorage::new(fsl.clone(), DistanceType::L2)); | |||
let vectors = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, do we need FlatStorage to be typed? This is sad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah... unfortunately now we have to do this.
there are 2 ways to avoid this in my mind but they would introduce some new issues as well:
- storing
ArrayRef
in dist calculator then cast it to&PrimitiveArray<T>
according distance type / column type. This is what we did before, it's slow - just storing everything in bytes ref
&[u8]
then do unsafe cast
No description provided.