-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: improve v3 indexing perf #3525
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…indexing-perf Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@@ -140,64 +139,37 @@ pub trait VectorStore: Send + Sync + Sized + Clone { | |||
fn dist_calculator(&self, query: ArrayRef) -> Self::DistanceCalculator<'_>; | |||
|
|||
fn dist_calculator_from_id(&self, id: u32) -> Self::DistanceCalculator<'_>; | |||
|
|||
fn distance_between(&self, a: u32, b: u32) -> f32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method can't be implemented efficiently for PQ, so remove it, it's always the same as dist_calculator_from_id(a).distance(b)
fn distance_between(&self, a: u32, b: u32) -> f32; | ||
|
||
fn dist_calculator_from_native(&self, _query: ArrayRef) -> Self::DistanceCalculator<'_> { | ||
todo!("Implement this") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think any type implements this for now, so remove it
&& ProductQuantizer::use_residual(distance_type) | ||
&& ivf.centroids.is_some() | ||
{ | ||
transformers.push(Arc::new(ResidualTransform::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved the residual transformer to be in IVFTransformer, which happens before shuffling the data into partitions, so that we don't need to write/read the original vectors, to save IO time/space
store: &ObjectStore, | ||
_dataset: &Arc<Dataset>, | ||
_column: &str, | ||
_store: &ObjectStore, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will probably need these params in the future PR soon, so keep them for now
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3525 +/- ##
==========================================
+ Coverage 78.64% 78.72% +0.07%
==========================================
Files 254 255 +1
Lines 94952 95163 +211
Branches 94952 95163 +211
==========================================
+ Hits 74674 74916 +242
+ Misses 17251 17211 -40
- Partials 3027 3036 +9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions but looks good to me. Overall I think the result is simpler so this is nice!
|
||
#[derive(Debug)] | ||
pub struct FlatTransformer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[derive(Debug)] | |
pub struct FlatTransformer { | |
/// Renames the input column to FLAT_COLUMN | |
#[derive(Debug)] | |
pub struct FlatTransformer { |
Does this transformer do anything else? Or is it just renaming the column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just renames the column
pub fn new(distance_type: DistanceType, quantizer: Q) -> Result<Self> { | ||
let transformers = if matches!(Q::quantization_type(), QuantizationType::Product) { | ||
let metadata = quantizer.metadata(None)?; | ||
vec![Arc::new(TransposeTransformer::new(metadata.to_string())?) as _] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the roundtrip through JSON needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata can be from the other quantizer but this handles only PQ case
@@ -265,7 +266,7 @@ impl ShuffleReader for IvfShufflerReader { | |||
Arc::new(schema), | |||
reader.read_stream( | |||
lance_io::ReadBatchParams::RangeFull, | |||
4096, | |||
u32::MAX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this use too much memory? Can shuffle files be large?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah probably, but we will concat all batches later so it's the same
IVF_PQ is 3x faster than before on sift1M