feat: add bit-width, cardinality and data-size to datablock statistics #2986
Conversation
    Bitpack,
    Fsst,
    FixedSizeBinary,
}
We can put this enum into a dedicated file, `encodings.rs`. Later, each registered encoding will also need to supply the `DataBlock` type it can work with. Optionally, each registered encoding can also provide a cost function (to compute the compression ratio, among other things); for example, the compression ratio of bit-packing can be computed directly from `DataBlock` statistics.
I am thinking about making all encodings trait objects (inspired by DataFusion's built-in function trait objects).
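If it helps the discussion, here is a rough, self-contained sketch of that combination: a registry of encoding trait objects, each declaring the block kind it works with and optionally estimating a compression ratio from block statistics alone. Every name in it (`Encoding`, `BlockStats`, `compression_ratio`, `pick_encoding`) is hypothetical and not part of this PR or of lance-encoding.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the statistics this PR attaches to DataBlocks.
struct BlockStats {
    bits_per_value: u64,
    max_bit_width: u64,
}

trait Encoding: Send + Sync {
    fn name(&self) -> &'static str;
    /// The DataBlock variant this encoding consumes, e.g. "FixedWidth".
    fn compatible_block(&self) -> &'static str;
    /// Estimated compressed size / original size, derived purely from statistics.
    fn compression_ratio(&self, stats: &BlockStats) -> Option<f64> {
        let _ = stats;
        None // encodings without a cost model can fall back to other heuristics
    }
}

struct Bitpack;

impl Encoding for Bitpack {
    fn name(&self) -> &'static str {
        "bitpack"
    }
    fn compatible_block(&self) -> &'static str {
        "FixedWidth"
    }
    fn compression_ratio(&self, stats: &BlockStats) -> Option<f64> {
        // Bit-packing shrinks each value from bits_per_value to max_bit_width bits.
        Some(stats.max_bit_width as f64 / stats.bits_per_value as f64)
    }
}

// Pick the registered encoding with the best (lowest) estimated ratio.
fn pick_encoding(registry: &[Arc<dyn Encoding>], stats: &BlockStats) -> Option<&'static str> {
    registry
        .iter()
        .filter_map(|e| e.compression_ratio(stats).map(|ratio| (e.name(), ratio)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(name, _)| name)
}

fn main() {
    let registry: Vec<Arc<dyn Encoding>> = vec![Arc::new(Bitpack)];
    let stats = BlockStats { bits_per_value: 32, max_bit_width: 7 };
    assert_eq!(pick_encoding(&registry, &stats), Some("bitpack"));
}
```

The point of the `compression_ratio` default is that encodings without a cheap cost model can simply opt out and be chosen by other heuristics.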
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2986      +/-   ##
==========================================
+ Coverage   78.84%   78.88%   +0.03%
==========================================
  Files         236      237       +1
  Lines       73552    74696    +1144
  Branches    73552    74696    +1144
==========================================
+ Hits        57995    58922     +927
- Misses      12550    12751     +201
- Partials     3007     3023      +16
I like the direction! I have a few suggestions and we'll want to clean up the fmt/clippy warnings but then we can finish this up.
rust/lance-encoding/src/data.rs
    // count_nulls will be handled differently after V2.1
    pub fn count_nulls(&mut self) -> u64 {
        let nulls_buf = &self.borrow_and_clone().into_buffers()[0];
Can you access `self.nulls` directly (e.g. `let nulls_buf = self.nulls.borrow_and_clone()`)? I don't think you need `into_buffers`?
Sorry, the `count_nulls` method here is a vestige. I removed `count_nulls` for now, as `NullableDataBlock` will be removed soon.
rust/lance-encoding/src/data.rs
    // count_nulls will be handled differently after V2.1
    pub fn count_nulls(&mut self) -> u64 {
        let nulls_buf = &self.borrow_and_clone().into_buffers()[0];
        let boolean_buf = BooleanBuffer::new(nulls_buf.into(), 0, nulls_buf.len() * 8);
You'll need to use `self.data.num_values()`. `nulls_buf.len() * 8` is not correct because the last byte may not be complete.
removed this method completely.
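For the record, a small standalone sketch of why `len * 8` overcounts: it treats padding bits in the last byte as real slots. This uses plain arrow-buffer with made-up buffer contents, not this PR's code.

```rust
use arrow_buffer::{BooleanBuffer, Buffer};

fn main() {
    // A validity bitmap for 10 values occupies 2 bytes; the last 6 bits are padding.
    let validity = Buffer::from(&[0b0101_1111u8, 0b0000_0011][..]);
    let num_values = 10;

    // Wrong: `len * 8` treats the 6 padding bits as real slots (16 instead of 10).
    let wrong = BooleanBuffer::new(validity.clone(), 0, validity.len() * 8);

    // Right: bound the view by the actual number of values in the block.
    let right = BooleanBuffer::new(validity, 0, num_values);
    let null_count = num_values - right.count_set_bits();

    assert_eq!(wrong.len(), 16);
    assert_eq!(right.len(), 10);
    assert_eq!(null_count, 2); // values 5 and 7 are null
}
```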
rust/lance-encoding/src/data.rs
    pub fn data_size(&self) -> u64 {
        match self {
            Self::AllNull(_) => 0,
            Self::Nullable(inner) => inner.data_size(),
            Self::FixedWidth(inner) => inner.data_size(),
            Self::FixedSizeList(inner) => inner.data_size(),
            Self::VariableWidth(inner) => inner.data_size(),
            // not implemented yet
            Self::Struct(_) => 0,
            // not implemented yet
            Self::Dictionary(_) => 0,
            Self::Opaque(inner) => inner.data_size(), // Handle OpaqueBlock case
        }
    }
Nice. I like this method.
rust/lance-encoding/src/data.rs
            Self::FixedSizeList(inner) => inner.data_size(),
            Self::VariableWidth(inner) => inner.data_size(),
            // not implemented yet
            Self::Struct(_) => 0,
Please use `todo!()` instead of `0`. That way we don't accidentally forget we haven't done this.
fixed.
rust/lance-encoding/src/data.rs
            // not implemented yet
            Self::Struct(_) => 0,
            // not implemented yet
            Self::Dictionary(_) => 0,
Ditto, please use `todo!()`.
fixed.
        let total_nulls_size_in_bytes = (concatenated_array.nulls().unwrap().len() + 7) / 8;
        assert!(block.data_size() == (total_buffer_size + total_nulls_size_in_bytes) as u64);
    }
Let's add a test for `count_nulls`?
Removed `count_nulls`.
//! Data layouts to represent encoded data in a sub-Arrow format
//!
//! These [`DataBlock`] structures represent physical layouts. They fill a gap somewhere
//! between [`arrow_data::data::ArrayData`] (which, as a collection of buffers, is too
//! generic because it doesn't give us enough information about what those buffers represent)
//! and [`arrow_array::array::Array`] (which is too specific, because it cares about the
//! logical data type).
//!
//! In addition, the layouts represented here are slightly stricter than Arrow's layout rules.
//! For example, offset buffers MUST start with 0. These additional restrictions impose a
//! slight penalty on encode (to normalize arrow data) but make the development of encoders
//! and decoders easier (since they can rely on a normalized representation)
Replace this comment?
fixed.
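As an illustration of the stricter layout rule in the module doc above (plain arrow-rs, not this crate's code): a sliced Arrow array keeps its parent's offsets, so a normalized variable-width layout has to rebase them to start at 0.

```rust
use arrow_array::StringArray;

fn main() {
    let parent = StringArray::from(vec!["a", "bb", "ccc", "dddd"]);
    // Slicing keeps the parent's offset buffer, so the first offset is 3, not 0.
    let sliced = parent.slice(2, 2); // "ccc", "dddd"
    assert_eq!(sliced.value_offsets(), &[3, 6, 10]);

    // Normalizing on encode rebases the offsets so decoders can always assume
    // a zero-based offset buffer.
    let first = sliced.value_offsets()[0];
    let rebased: Vec<i32> = sliced.value_offsets().iter().map(|o| o - first).collect();
    assert_eq!(rebased, vec![0, 3, 7]);
}
```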
                None
            }
            Self::FixedWidth(data_block) => data_block.get_stat(stat),
            Self::FixedSizeList(data_block) => data_block.child.get_stat(stat),
Cardinality and bit-width of a fixed-size-list will technically be different from the cardinality and bit-width of the child (e.g. a fixed-size-list of size 2 over child values [1, 2, 1, 2] has child cardinality 2 but list cardinality 1). Maybe just leave this as a `todo!()` or `None` for now.
Fixed, put in a `todo!()`.
            // TODO: Decimal128
            // when self.bits_per_value is not 8, 16, 32, or 64, the data is already bit-packed and
            // `self.bits_per_value` is its max_bit_width (except for Decimal128 and Decimal256)
            _ => Arc::new(UInt64Array::from(vec![self.bits_per_value])),
This is fine. No rush to bit-pack anything that isn't 8/16/32/64 right now.
Looks like this needs a rebase, but we can merge once that is done and CI passes.
The statistics here are different from Arrow array statistics. One way to think about the concept is that array statistics are logical statistics, while DataBlock statistics are physical statistics. DataBlock statistics are data-type agnostic; they aim to facilitate encoding selection and to provide a centralized place to calculate encoding parameters.
#2981
#2980
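To make the logical vs. physical distinction above concrete, here is a toy, self-contained calculator over a plain `u64` buffer that mirrors the three statistics this PR tracks (bit-width, cardinality, data-size). It deliberately does not use lance-encoding's API; the names (`PhysicalStats`, `compute_stats`) are illustrative only.

```rust
use std::collections::HashSet;

// Illustrative stand-in for "physical statistics": nothing here knows (or
// cares about) the logical Arrow data type, only the raw values.
struct PhysicalStats {
    max_bit_width: u64, // minimum bits needed per value (drives bit-packing)
    cardinality: u64,   // number of distinct values (drives dictionary encoding)
    data_size: u64,     // raw buffer size in bytes
}

fn compute_stats(values: &[u64]) -> PhysicalStats {
    let max_bit_width = values
        .iter()
        .map(|v| 64 - v.leading_zeros() as u64)
        .max()
        .unwrap_or(0)
        .max(1); // an all-zero column still needs at least one bit
    let cardinality = values.iter().collect::<HashSet<_>>().len() as u64;
    let data_size = (values.len() * std::mem::size_of::<u64>()) as u64;
    PhysicalStats { max_bit_width, cardinality, data_size }
}

fn main() {
    let stats = compute_stats(&[3, 3, 7, 1]);
    // 7 needs 3 bits, there are 3 distinct values, and the raw data is 4 * 8 bytes.
    assert_eq!(stats.max_bit_width, 3);
    assert_eq!(stats.cardinality, 3);
    assert_eq!(stats.data_size, 32);
}
```

In the PR itself these statistics hang off the DataBlock (see the `get_stat(...)` calls and `data_size()` shown earlier), so every encoder can query them from one place instead of recomputing them per encoding.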