-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add dictionary encoding #3134
feat: add dictionary encoding #3134
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3134 +/- ##
==========================================
- Coverage 77.95% 77.69% -0.26%
==========================================
Files 242 243 +1
Lines 81904 82206 +302
Branches 81904 82206 +302
==========================================
+ Hits 63848 63874 +26
- Misses 14890 15152 +262
- Partials 3166 3180 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
after make append method in DataBlockBuilderImpl use immutable borrow:
|
First 100 rows: first 100 rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done. A few initial questions, mostly around alignment, but I like the direction
rust/lance-encoding/src/data.rs
Outdated
assert!(block.bits_per_offset == 32); | ||
|
||
let offsets = block.offsets.borrow_to_typed_slice::<u32>(); | ||
let offsets = offsets.as_ref(); | ||
let offsets: &[u32] = cast_slice(&block.offsets); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use try_cast_slice
. This cast should always be safe but this isn't in a critical section and probably worth it just to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! fixed.
rust/lance-encoding/src/encoder.rs
Outdated
let encoding = ProtobufUtils::binary_block(); | ||
Ok((encoder, encoding)) | ||
} else { | ||
todo!("Implement BlockCompression for VariableWidth DataBlock with offsets type u32") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo!("Implement BlockCompression for VariableWidth DataBlock with offsets type u32") | |
todo!("Implement BlockCompression for VariableWidth DataBlock with 64 bit offsets") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! fixed
let offsets = offsets.as_ref(); | ||
// the first 4 bytes store the number of values, then 4 bytes for bytes_start_offset, | ||
// then offsets data, then bytes data. | ||
let bytes_start_offset = std::mem::size_of_val(offsets) as u32 + 4 + 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're using u32
directly above it seems you may as well do std::mem::size_of::<u32>()
instead of size_of_val
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using the size_of_val
to get the whole length in bytes of &[u32] here,
std::mem::size_of::<u32>()
can only get me the size of a element.
let output_total_bytes = | ||
((bytes_start_offset as usize + variable_width_data.data.len()) + 3) / 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why the +3 / 4
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is a mistake, thanks!
output.extend_from_slice( | ||
&BLOCK_PAD_BUFFER[..pad_bytes::<BINARY_BLOCK_ALIGNMENT>(output.len())], | ||
); | ||
Ok(LanceBuffer::reinterpret_vec(output)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since output
is already u8
you can just do LanceBuffer::Owned(output)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! fixed.
// pad this chunk to make it align to 4 bytes. | ||
output.extend_from_slice( | ||
&BLOCK_PAD_BUFFER[..pad_bytes::<BINARY_BLOCK_ALIGNMENT>(output.len())], | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think padding should be a concern of the code using the block compressor and not a concern of the block compressor itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed.
let offsets = header[2..2 + num_values as usize + 1].to_vec(); | ||
|
||
Ok(DataBlock::VariableWidth(VariableWidthBlock { | ||
data: LanceBuffer::Owned( | ||
data[bytes_start_offset | ||
..bytes_start_offset + offsets[num_values as usize] as usize] | ||
.to_vec(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to avoid both of these to_vec
copies by using LanceBuffer::slice_with_length
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@@ -289,6 +289,7 @@ impl ValueEncoder { | |||
} | |||
} | |||
|
|||
// fn compress(&self, data: DataBlock) -> Result<(VariableWidthBlock, pb::ArrayEncoding)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// fn compress(&self, data: DataBlock) -> Result<(VariableWidthBlock, pb::ArrayEncoding)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
// These come from the protobuf | ||
dictionary_decompressor: Arc<dyn BlockDecompressor>, | ||
dictionary_buf_position_and_size: (u64, u64), | ||
dictionary_data_alignment: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this alignment for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this dictionary_data_alignment
is used later for converting the raw bytes::Bytes fetched from disk IO to LanceBuffer
LanceBuffer::from_bytes(
dictionary_data,
dictionary.dictionary_data_alignment,
),
@@ -2189,6 +2316,114 @@ impl PrimitiveStructuralEncoder { | |||
}) | |||
} | |||
|
|||
fn dicitionary_encoding(mut data_block: DataBlock, cardinality: u64) -> (DataBlock, DataBlock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe name this dictionary_encode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha, yeah, good point, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This PR tries to support dictionary encoding by integrating it with
MiniBlock PageLayout
.The general approach here is:
In a
MiniBlock PageLayout
, there is a optionaldictionary field
that stores a dictionary encoding if thisminiblock
has a dictionary.The rational for this is that if we dictionary encoding something, it's indices will definitely fall into a
MiniBlockLayout
.By doing this, we don't need to have a specific
DictionaryEncoding
, it can be anyArrayEncoding
.The
Dictionary
and theindices
are cascaded into another encoding automatically.Currently, the dictionary is stored inside the page along with
chunk meta data
andchunk data
, this is not ideal and is aTODO
task.This is a draft for discussion with the above idea so I only supported
FixedWidthDataBlock
with this encoding, the effort to add support forVariableWidthData
is trivial.#3123