feat: add bit-width, cardinality and data-size to datablock statistics #2986

Merged
merged 11 commits into from
Oct 14, 2024

Conversation

@broccoliSpicy broccoliSpicy commented Oct 7, 2024

The statistics here are different from Arrow array statistics.

One way to think about it: array statistics are logical statistics, while datablock statistics are physical statistics.

They are data-type agnostic. They aim to facilitate encoding selection and to centralize the calculation of encoding parameters.

#2981
#2980
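
To make the logical/physical distinction concrete, here is a minimal sketch, not the code from this PR. The names `Stat`, `FixedWidthBlock::new` and the `u64`-valued `get_stat` are assumptions for illustration (the diff quoted further down suggests the real `get_stat` returns an Arrow array). The point is that a physical block only sees bytes and a width, so an Int32 column and a Float32 column with the same number of values yield the same data-size statistic.

```rust
use std::collections::HashMap;

/// Hypothetical statistic kinds, named after the PR title.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Stat {
    DataSize,
    BitWidth,
    Cardinality,
}

/// A physical block of fixed-width values: raw bytes plus a width.
/// Nothing here records whether the values are Int32, Float32 or Date32.
pub struct FixedWidthBlock {
    pub data: Vec<u8>,
    pub bits_per_value: u64,
    stats: HashMap<Stat, u64>,
}

impl FixedWidthBlock {
    /// Builds the block and eagerly records the statistics this sketch knows about.
    pub fn new(data: Vec<u8>, bits_per_value: u64) -> Self {
        let mut stats = HashMap::new();
        stats.insert(Stat::DataSize, data.len() as u64);
        Self {
            data,
            bits_per_value,
            stats,
        }
    }

    /// Returns `None` for statistics that were never computed.
    pub fn get_stat(&self, stat: Stat) -> Option<u64> {
        self.stats.get(&stat).copied()
    }
}
```

Returning `None` for a statistic that has not been computed lets a caller fall back to a default encoding instead of acting on a made-up value.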

@github-actions github-actions bot added the enhancement New feature or request label Oct 7, 2024
@broccoliSpicy broccoliSpicy changed the title feat: datablock statistics(sketch, draft) feat: add NullCount and DataSize to datablock statistics Oct 9, 2024
Bitpack,
Fsst,
FixedSizeBinary,
}
Contributor Author

@broccoliSpicy broccoliSpicy Oct 9, 2024

We can move this enum into a dedicated file, encodings.rs, later. Each registered encoding also needs to declare the DataBlock types it can work with. Optionally, each registered encoding can also provide a cost function (to compute the compression ratio, and possibly more).
For example, the compression ratio of bit-packing can be computed directly from DataBlock statistics.
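
As a rough illustration of that last point, the sketch below estimates a bit-packing ratio from two numbers only: the stored width and the widest value seen. The function name and signature are hypothetical, and the estimate ignores any per-chunk metadata a real bit-packing encoding would add.

```rust
/// Estimated compression ratio of bit-packing, computed from statistics alone.
///
/// `bits_per_value` is the stored width (for example 8, 16, 32 or 64) and
/// `max_bit_width` is the width of the widest value present in the block.
/// Returns `None` when the statistics are unusable (zero or inconsistent widths).
pub fn bitpack_compression_ratio(bits_per_value: u64, max_bit_width: u64) -> Option<f64> {
    if max_bit_width == 0 || max_bit_width > bits_per_value {
        return None;
    }
    // Every value shrinks from `bits_per_value` bits to `max_bit_width` bits.
    Some(bits_per_value as f64 / max_bit_width as f64)
}
```

No pass over the data is needed at selection time, which is the appeal of computing the statistic once per block.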

Contributor Author

@broccoliSpicy broccoliSpicy Oct 10, 2024

I am thinking about making every encoding a trait object (inspired by DataFusion's built-in function trait objects).
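
One possible shape for that idea is sketched below. The trait `Encoding`, the struct `BlockStats`, the two example encodings and `pick_encoding` are all hypothetical names for illustration; nothing here is taken from the PR or from DataFusion. Each registered encoding reports an estimated size from block statistics, and the selector picks the smallest estimate.

```rust
/// Hypothetical bundle of physical statistics for one block.
pub struct BlockStats {
    pub bits_per_value: u64,
    pub max_bit_width: u64,
    pub data_size: u64,
}

/// Hypothetical trait implemented by every registered encoding.
pub trait Encoding {
    fn name(&self) -> &'static str;
    /// Estimated encoded size in bytes, or `None` if the encoding does not apply.
    fn estimated_size(&self, stats: &BlockStats) -> Option<u64>;
}

pub struct Flat;
pub struct Bitpack;

impl Encoding for Flat {
    fn name(&self) -> &'static str {
        "flat"
    }
    fn estimated_size(&self, stats: &BlockStats) -> Option<u64> {
        Some(stats.data_size)
    }
}

impl Encoding for Bitpack {
    fn name(&self) -> &'static str {
        "bitpack"
    }
    fn estimated_size(&self, stats: &BlockStats) -> Option<u64> {
        // Only worthwhile when values are strictly narrower than their storage.
        if stats.max_bit_width == 0 || stats.max_bit_width >= stats.bits_per_value {
            return None;
        }
        let packed_bits = stats.data_size * stats.max_bit_width;
        // Round up to whole bytes of the original proportion.
        Some((packed_bits + stats.bits_per_value - 1) / stats.bits_per_value)
    }
}

/// Picks the applicable encoding with the smallest estimated size.
pub fn pick_encoding<'a>(
    registry: &'a [Box<dyn Encoding>],
    stats: &BlockStats,
) -> Option<&'a dyn Encoding> {
    let mut best: Option<(u64, &'a dyn Encoding)> = None;
    for encoding in registry {
        if let Some(size) = encoding.estimated_size(stats) {
            if best.map_or(true, |(best_size, _)| size < best_size) {
                // `&**encoding` turns `&Box<dyn Encoding>` into `&dyn Encoding`.
                best = Some((size, &**encoding));
            }
        }
    }
    best.map(|(_, encoding)| encoding)
}
```

With trait objects, adding an encoding means adding one `impl` and one registry entry, rather than extending a central enum and every `match` over it.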

codecov-commenter commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 87.36059% with 136 lines in your changes missing coverage. Please review.

Project coverage is 78.88%. Comparing base (73ab2b5) to head (19009b2).
Report is 5 commits behind head on main.

Files with missing lines                                  Patch %   Lines
rust/lance-encoding/src/statistics.rs                     89.03%    94 Missing and 3 partials ⚠️
rust/lance-encoding/src/data.rs                           80.59%    26 Missing ⚠️
rust/lance-encoding/src/encodings/physical/fsst.rs        0.00%     8 Missing ⚠️
...t/lance-encoding/src/encodings/physical/bitpack.rs     50.00%    2 Missing ⚠️
...-encoding/src/encodings/physical/block_compress.rs     0.00%     2 Missing ⚠️
...ust/lance-encoding/src/encodings/logical/binary.rs     0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2986      +/-   ##
==========================================
+ Coverage   78.84%   78.88%   +0.03%     
==========================================
  Files         236      237       +1     
  Lines       73552    74696    +1144     
  Branches    73552    74696    +1144     
==========================================
+ Hits        57995    58922     +927     
- Misses      12550    12751     +201     
- Partials     3007     3023      +16     
Flag Coverage Δ
unittests 78.88% <87.36%> (+0.03%) ⬆️


@broccoliSpicy broccoliSpicy changed the title feat: add NullCount and DataSize to datablock statistics feat: add bit-width and data-size to datablock statistics Oct 10, 2024
Contributor

@westonpace westonpace left a comment

I like the direction! I have a few suggestions and we'll want to clean up the fmt/clippy warnings but then we can finish this up.


// count_nulls will be handled differently after V2.1
pub fn count_nulls(&mut self) -> u64 {
let nulls_buf = &self.borrow_and_clone().into_buffers()[0];
Contributor

Can you access self.nulls directly (e.g. let nulls_buf = self.nulls.borrow_and_clone())? I don't think you need into_buffers?

Contributor Author

@broccoliSpicy broccoliSpicy Oct 11, 2024

Sorry, the count_nulls method here is a vestige. I have removed count_nulls for now, since NullableDataBlock will be removed soon.

// count_nulls will be handled differently after V2.1
pub fn count_nulls(&mut self) -> u64 {
let nulls_buf = &self.borrow_and_clone().into_buffers()[0];
let boolean_buf = BooleanBuffer::new(nulls_buf.into(), 0, nulls_buf.len() * 8);
Contributor

You'll need to use self.data.num_values().

nulls_buf.len() * 8 is not correct because the last byte may not be complete.

Contributor Author

removed this method completely.
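
Although the method was removed, the reviewer's point about the last byte is easy to show in isolation. The sketch below is a standalone illustration, not Lance code: it counts nulls in an Arrow-style validity bitmap (least-significant bit first, a set bit means valid) using an explicit `num_values`, and masks off the padding bits in the final byte. The function name and signature are assumptions.

```rust
/// Counts nulls in an Arrow-style validity bitmap (LSB-first, 1 = valid).
///
/// `num_values` is the logical length. Using `validity.len() * 8` instead would
/// also count the padding bits in the last byte, which carry no meaning.
pub fn count_nulls(validity: &[u8], num_values: usize) -> usize {
    assert!(validity.len() * 8 >= num_values, "bitmap too short");
    let full_bytes = num_values / 8;
    let remaining_bits = num_values % 8;

    // Count valid bits in the bytes that are completely used.
    let mut valid: usize = validity[..full_bytes]
        .iter()
        .map(|byte| byte.count_ones() as usize)
        .sum();

    // Count valid bits in the partially used last byte, ignoring padding.
    if remaining_bits > 0 {
        let mask = (1u8 << remaining_bits) - 1;
        valid += (validity[full_bytes] & mask).count_ones() as usize;
    }

    num_values - valid
}
```

For ten values stored in two bytes, treating all sixteen bits as values would report six extra nulls whenever the padding bits are zero.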

Comment on lines 704 to 717
pub fn data_size(&self) -> u64 {
    match self {
        Self::AllNull(_) => 0,
        Self::Nullable(inner) => inner.data_size(),
        Self::FixedWidth(inner) => inner.data_size(),
        Self::FixedSizeList(inner) => inner.data_size(),
        Self::VariableWidth(inner) => inner.data_size(),
        // not implemented yet
        Self::Struct(_) => 0,
        // not implemented yet
        Self::Dictionary(_) => 0,
        Self::Opaque(inner) => inner.data_size(), // Handle OpaqueBlock case
    }
}
Contributor

Nice. I like this method.

Self::FixedSizeList(inner) => inner.data_size(),
Self::VariableWidth(inner) => inner.data_size(),
// not implemented yet
Self::Struct(_) => 0,
Contributor

Please use todo!() instead of 0. That way we don't accidentally forget we haven't done this.

Contributor Author

fixed.
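
The difference between returning 0 and calling todo!() can be seen in a reduced example. The `Block` enum below is a hypothetical two-variant stand-in for the real DataBlock, used only to show the behaviour: an unimplemented arm panics loudly instead of silently reporting a size of zero.

```rust
/// Hypothetical, reduced stand-in for a data block enum.
pub enum Block {
    FixedWidth(Vec<u8>),
    Struct(Vec<Block>),
}

impl Block {
    pub fn data_size(&self) -> u64 {
        match self {
            Self::FixedWidth(bytes) => bytes.len() as u64,
            // Panics with "not yet implemented: ..." instead of returning a
            // plausible-looking but wrong size.
            Self::Struct(_) => todo!("data_size for struct blocks"),
        }
    }
}
```

A silent 0 could feed into encoding selection as if the block were empty, whereas the panic surfaces the gap the first time a struct block reaches this path.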

// not implemented yet
Self::Struct(_) => 0,
// not implemented yet
Self::Dictionary(_) => 0,
Contributor

Ditto, please use todo!()

Contributor Author

fixed.


let total_nulls_size_in_bytes = (concatenated_array.nulls().unwrap().len() + 7) / 8;
assert!(block.data_size() == (total_buffer_size + total_nulls_size_in_bytes) as u64);
}
Contributor

Let's add a test for count_nulls?

Contributor Author

removed count_nulls
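
The data_size assertion in the test quoted in this thread adds the validity bitmap, rounded up to whole bytes, to the value buffers. A small standalone sketch of that arithmetic follows; the helper names are hypothetical.

```rust
/// Bytes needed for a validity bitmap holding one bit per value, rounded up.
pub fn bitmap_bytes(num_values: usize) -> usize {
    (num_values + 7) / 8
}

/// Total size of a nullable block: value bytes plus validity bitmap bytes.
pub fn nullable_data_size(values_bytes: usize, num_values: usize) -> u64 {
    (values_bytes + bitmap_bytes(num_values)) as u64
}
```

For example, ten 32-bit values occupy 40 bytes of values plus 2 bytes of validity bits.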

Comment on lines 4 to 15
//! Data layouts to represent encoded data in a sub-Arrow format
//!
//! These [`DataBlock`] structures represent physical layouts. They fill a gap somewhere
//! between [`arrow_data::data::ArrayData`] (which, as a collection of buffers, is too
//! generic because it doesn't give us enough information about what those buffers represent)
//! and [`arrow_array::array::Array`] (which is too specific, because it cares about the
//! logical data type).
//!
//! In addition, the layouts represented here are slightly stricter than Arrow's layout rules.
//! For example, offset buffers MUST start with 0. These additional restrictions impose a
//! slight penalty on encode (to normalize arrow data) but make the development of encoders
//! and decoders easier (since they can rely on a normalized representation)
Contributor

Replace this comment?

Contributor Author

fixed.

None
}
Self::FixedWidth(data_block) => data_block.get_stat(stat),
Self::FixedSizeList(data_block) => data_block.child.get_stat(stat),
Contributor

Cardinality and bit-width of a fixed-size-list will technically be different than the cardinality and bit-width of the child. Maybe just leave this as a todo!() or None for now.

Contributor Author

fixed. put in a todo!()
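
To see why delegating to the child gives the wrong answer for cardinality, consider this standalone sketch. The two helper functions are hypothetical and work on plain `u32` slices rather than on Lance data blocks: one counts distinct child values, the other counts distinct lists of a fixed dimension over the same flat buffer.

```rust
use std::collections::HashSet;

/// Number of distinct values in the flat child buffer.
pub fn child_cardinality(values: &[u32]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}

/// Number of distinct fixed-size lists when the same buffer is read in
/// groups of `dimension` values. Trailing values that do not fill a list
/// are ignored.
pub fn list_cardinality(values: &[u32], dimension: usize) -> usize {
    assert!(dimension > 0, "dimension must be positive");
    values.chunks_exact(dimension).collect::<HashSet<_>>().len()
}
```

With the child values 1, 1, 1, 2, 2, 1, 2, 2 and a dimension of 2, the child holds two distinct values while the lists [1, 1], [1, 2], [2, 1] and [2, 2] are all distinct.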

// TODO: Decimal128
// when self.bits_per_value is not one of (8, 16, 32, 64), the data is already bit-packed and
// `self.bits_per_value` is its max_bit_width (except for Decimal128 and Decimal256)
_ => Arc::new(UInt64Array::from(vec![self.bits_per_value])),
Contributor

This is fine. No rush to bit-pack anything that isn't 8/16/32/64 right now.
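
The dispatch discussed in this thread can be sketched without Arrow as follows. This is an assumption-laden illustration, not the PR's implementation: it treats values as unsigned little-endian words, reports 0 for an all-zero block, and returns a plain `u64` where the quoted code wraps the result in a `UInt64Array`. The function names are hypothetical.

```rust
/// ORs together all little-endian values of `bytes_per_value` bytes (at most 8).
fn or_all_values(data: &[u8], bytes_per_value: usize) -> u64 {
    let mut seen = 0u64;
    for chunk in data.chunks_exact(bytes_per_value) {
        let mut word = [0u8; 8];
        word[..bytes_per_value].copy_from_slice(chunk);
        seen |= u64::from_le_bytes(word);
    }
    seen
}

/// Width in bits of the widest value in the block.
///
/// For the byte-aligned widths the data is scanned; for any other width the
/// data is assumed to be bit-packed already, so the stored width is returned.
pub fn max_bit_width(data: &[u8], bits_per_value: u64) -> u64 {
    match bits_per_value {
        8 | 16 | 32 | 64 => {
            let seen = or_all_values(data, (bits_per_value / 8) as usize);
            // The highest set bit across all values determines the width.
            u64::from(64 - seen.leading_zeros())
        }
        other => other,
    }
}
```

ORing the values first means only one `leading_zeros` call is needed per block. Negative signed integers would report the full stored width under this unsigned reading.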

@broccoliSpicy broccoliSpicy changed the title feat: add bit-width and data-size to datablock statistics feat: add bit-width, cardinality and data-size to datablock statistics Oct 11, 2024
Contributor

@westonpace westonpace left a comment

Looks like this needs a rebase but we can merge once that is done and CI passes

@broccoliSpicy broccoliSpicy merged commit 8f95fbe into lancedb:main Oct 14, 2024
22 checks passed