feat: add bit-width, cardinality and data-size to datablock statistics #2986
Conversation
    Bitpack,
    Fsst,
    FixedSizeBinary,
}
We can put this enum into a dedicated file, `encodings.rs`. Later, each registered encoding will also need to supply the `DataBlock` type it can work with. Optionally, each registered encoding can also provide a cost function (to compute the compression ratio, among other things); for example, the compression ratio of bit-packing can be computed directly from `DataBlock` statistics.
I am thinking about making all encodings trait objects (inspired by DataFusion's built-in function trait objects).
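If it helps the discussion, here is a rough, self-contained sketch of that combination: a registry of encoding trait objects, each declaring the block kind it works with and optionally estimating a compression ratio from block statistics alone. Every name in it (`Encoding`, `BlockStats`, `compression_ratio`, `pick_encoding`) is hypothetical and not part of this PR or of lance-encoding.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the statistics this PR attaches to DataBlocks.
struct BlockStats {
    bits_per_value: u64,
    max_bit_width: u64,
}

trait Encoding: Send + Sync {
    fn name(&self) -> &'static str;
    /// The DataBlock variant this encoding consumes, e.g. "FixedWidth".
    fn compatible_block(&self) -> &'static str;
    /// Estimated compressed size / original size, derived purely from statistics.
    fn compression_ratio(&self, stats: &BlockStats) -> Option<f64> {
        let _ = stats;
        None // encodings without a cost model can fall back to other heuristics
    }
}

struct Bitpack;

impl Encoding for Bitpack {
    fn name(&self) -> &'static str {
        "bitpack"
    }
    fn compatible_block(&self) -> &'static str {
        "FixedWidth"
    }
    fn compression_ratio(&self, stats: &BlockStats) -> Option<f64> {
        // Bit-packing shrinks each value from bits_per_value to max_bit_width bits.
        Some(stats.max_bit_width as f64 / stats.bits_per_value as f64)
    }
}

// Pick the registered encoding with the best (lowest) estimated ratio.
fn pick_encoding(registry: &[Arc<dyn Encoding>], stats: &BlockStats) -> Option<&'static str> {
    registry
        .iter()
        .filter_map(|e| e.compression_ratio(stats).map(|ratio| (e.name(), ratio)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(name, _)| name)
}

fn main() {
    let registry: Vec<Arc<dyn Encoding>> = vec![Arc::new(Bitpack)];
    let stats = BlockStats { bits_per_value: 32, max_bit_width: 7 };
    assert_eq!(pick_encoding(&registry, &stats), Some("bitpack"));
}
```

The point of the `compression_ratio` default is that encodings without a cheap cost model can simply opt out and be chosen by other heuristics.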
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2986      +/-   ##
==========================================
+ Coverage   78.84%   78.88%   +0.03%
==========================================
  Files         236      237       +1
  Lines       73552    74696    +1144
  Branches    73552    74696    +1144
==========================================
+ Hits        57995    58922     +927
- Misses      12550    12751     +201
- Partials     3007     3023      +16
I like the direction! I have a few suggestions and we'll want to clean up the fmt/clippy warnings but then we can finish this up.
rust/lance-encoding/src/data.rs
    // count_nulls will be handled differently after V2.1
    pub fn count_nulls(&mut self) -> u64 {
        let nulls_buf = &self.borrow_and_clone().into_buffers()[0];
Can you access `self.nulls` directly (e.g. `let nulls_buf = self.nulls.borrow_and_clone()`)? I don't think you need `into_buffers`?
Sorry, the `count_nulls` method here is a vestige. I removed `count_nulls` for now, as `NullableDataBlock` will be removed soon.
rust/lance-encoding/src/data.rs
    // count_nulls will be handled differently after V2.1
    pub fn count_nulls(&mut self) -> u64 {
        let nulls_buf = &self.borrow_and_clone().into_buffers()[0];
        let boolean_buf = BooleanBuffer::new(nulls_buf.into(), 0, nulls_buf.len() * 8);
You'll need to use `self.data.num_values()`. `nulls_buf.len() * 8` is not correct because the last byte may not be complete.
removed this method completely.
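For the record, a small standalone sketch of why `len * 8` overcounts: it treats padding bits in the last byte as real slots. This uses plain arrow-buffer with made-up buffer contents, not this PR's code.

```rust
use arrow_buffer::{BooleanBuffer, Buffer};

fn main() {
    // A validity bitmap for 10 values occupies 2 bytes; the last 6 bits are padding.
    let validity = Buffer::from(&[0b0101_1111u8, 0b0000_0011][..]);
    let num_values = 10;

    // Wrong: `len * 8` treats the 6 padding bits as real slots (16 instead of 10).
    let wrong = BooleanBuffer::new(validity.clone(), 0, validity.len() * 8);

    // Right: bound the view by the actual number of values in the block.
    let right = BooleanBuffer::new(validity, 0, num_values);
    let null_count = num_values - right.count_set_bits();

    assert_eq!(wrong.len(), 16);
    assert_eq!(right.len(), 10);
    assert_eq!(null_count, 2); // values 5 and 7 are null
}
```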
rust/lance-encoding/src/data.rs
    pub fn data_size(&self) -> u64 {
        match self {
            Self::AllNull(_) => 0,
            Self::Nullable(inner) => inner.data_size(),
            Self::FixedWidth(inner) => inner.data_size(),
            Self::FixedSizeList(inner) => inner.data_size(),
            Self::VariableWidth(inner) => inner.data_size(),
            // not implemented yet
            Self::Struct(_) => 0,
            // not implemented yet
            Self::Dictionary(_) => 0,
            Self::Opaque(inner) => inner.data_size(), // Handle OpaqueBlock case
        }
    }
Nice. I like this method.
rust/lance-encoding/src/data.rs
            Self::FixedSizeList(inner) => inner.data_size(),
            Self::VariableWidth(inner) => inner.data_size(),
            // not implemented yet
            Self::Struct(_) => 0,
Please use `todo!()` instead of `0`. That way we don't accidentally forget we haven't done this.
fixed.
rust/lance-encoding/src/data.rs
            // not implemented yet
            Self::Struct(_) => 0,
            // not implemented yet
            Self::Dictionary(_) => 0,
Ditto, please use `todo!()`.
fixed.
        let total_nulls_size_in_bytes = (concatenated_array.nulls().unwrap().len() + 7) / 8;
        assert!(block.data_size() == (total_buffer_size + total_nulls_size_in_bytes) as u64);
    }
Let's add a test for `count_nulls`?
Removed `count_nulls`.
//! Data layouts to represent encoded data in a sub-Arrow format
//!
//! These [`DataBlock`] structures represent physical layouts. They fill a gap somewhere
//! between [`arrow_data::data::ArrayData`] (which, as a collection of buffers, is too
//! generic because it doesn't give us enough information about what those buffers represent)
//! and [`arrow_array::array::Array`] (which is too specific, because it cares about the
//! logical data type).
//!
//! In addition, the layouts represented here are slightly stricter than Arrow's layout rules.
//! For example, offset buffers MUST start with 0. These additional restrictions impose a
//! slight penalty on encode (to normalize arrow data) but make the development of encoders
//! and decoders easier (since they can rely on a normalized representation)
Replace this comment?
fixed.
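As an illustration of the stricter layout rule in the module doc above (plain arrow-rs, not this crate's code): a sliced Arrow array keeps its parent's offsets, so a normalized variable-width layout has to rebase them to start at 0.

```rust
use arrow_array::StringArray;

fn main() {
    let parent = StringArray::from(vec!["a", "bb", "ccc", "dddd"]);
    // Slicing keeps the parent's offset buffer, so the first offset is 3, not 0.
    let sliced = parent.slice(2, 2); // "ccc", "dddd"
    assert_eq!(sliced.value_offsets(), &[3, 6, 10]);

    // Normalizing on encode rebases the offsets so decoders can always assume
    // a zero-based offset buffer.
    let first = sliced.value_offsets()[0];
    let rebased: Vec<i32> = sliced.value_offsets().iter().map(|o| o - first).collect();
    assert_eq!(rebased, vec![0, 3, 7]);
}
```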
                None
            }
            Self::FixedWidth(data_block) => data_block.get_stat(stat),
            Self::FixedSizeList(data_block) => data_block.child.get_stat(stat),
Cardinality and bit-width of a fixed-size-list will technically be different from the cardinality and bit-width of the child (e.g. a fixed-size-list of size 2 over child values [1, 2, 1, 2] has child cardinality 2 but list cardinality 1). Maybe just leave this as a `todo!()` or `None` for now.
Fixed, put in a `todo!()`.
            // TODO: Decimal128
            // when self.bits_per_value is not 8, 16, 32, or 64, the data is already bit-packed and
            // `self.bits_per_value` is its max_bit_width (except for Decimal128 and Decimal256)
            _ => Arc::new(UInt64Array::from(vec![self.bits_per_value])),
This is fine. No rush to bit-pack anything that isn't 8/16/32/64 right now.
Looks like this needs a rebase, but we can merge once that is done and CI passes.
The statistics here are different from Arrow array statistics. One way to think about the concept is that array statistics are logical statistics, while DataBlock statistics are physical statistics. DataBlock statistics are data-type agnostic; they aim to facilitate encoding selection and to provide a centralized place to calculate encoding parameters.
#2981
#2980
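To make the logical vs. physical distinction above concrete, here is a toy, self-contained calculator over a plain `u64` buffer that mirrors the three statistics this PR tracks (bit-width, cardinality, data-size). It deliberately does not use lance-encoding's API; the names (`PhysicalStats`, `compute_stats`) are illustrative only.

```rust
use std::collections::HashSet;

// Illustrative stand-in for "physical statistics": nothing here knows (or
// cares about) the logical Arrow data type, only the raw values.
struct PhysicalStats {
    max_bit_width: u64, // minimum bits needed per value (drives bit-packing)
    cardinality: u64,   // number of distinct values (drives dictionary encoding)
    data_size: u64,     // raw buffer size in bytes
}

fn compute_stats(values: &[u64]) -> PhysicalStats {
    let max_bit_width = values
        .iter()
        .map(|v| 64 - v.leading_zeros() as u64)
        .max()
        .unwrap_or(0)
        .max(1); // an all-zero column still needs at least one bit
    let cardinality = values.iter().collect::<HashSet<_>>().len() as u64;
    let data_size = (values.len() * std::mem::size_of::<u64>()) as u64;
    PhysicalStats { max_bit_width, cardinality, data_size }
}

fn main() {
    let stats = compute_stats(&[3, 3, 7, 1]);
    // 7 needs 3 bits, there are 3 distinct values, and the raw data is 4 * 8 bytes.
    assert_eq!(stats.max_bit_width, 3);
    assert_eq!(stats.cardinality, 3);
    assert_eq!(stats.data_size, 32);
}
```

In the PR itself these statistics hang off the DataBlock (see the `get_stat(...)` calls and `data_size()` shown earlier), so every encoder can query them from one place instead of recomputing them per encoding.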