feat: add the basic encode path for 2.1 #3002
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3002      +/-   ##
==========================================
- Coverage   78.97%   78.24%   -0.73%
==========================================
  Files         238      238
  Lines       75606    76234     +628
  Branches    75606    76234     +628
==========================================
- Hits        59707    59647      -60
- Misses      12870    13553     +683
- Partials     3029     3034       +5
Nice work! Some minor suggestions
python/src/file.rs
Outdated
)
)
Accident?
Ah, thanks. Fixed.
rust/lance-encoding/src/decoder.rs
Outdated
@@ -248,15 +248,51 @@ use crate::{BufferScheduler, EncodingsIo};
// If users are getting batches over 10MiB large then it's time to reduce the batch size
const BATCH_SIZE_BYTES_WARNING: u64 = 10 * 1024 * 1024;

/// Top-levle encoding message for a page. Wraps both the
- /// Top-levle encoding message for a page. Wraps both the
+ /// Top-level encoding message for a page. Wraps both the
Good catch. Fixed.
DataType::List(_child) | DataType::LargeList(_child) => {
todo!()
}
Will this be in a follow up PR? Or isn't hit by this code path?
In a follow-up. I haven't done lists / repetition levels completely yet.
bits_per_value: 16,
num_values,
});
let levels_field = Field::new_arrow("", DataType::UInt16, false)?;
Is it worth pulling this out into a static?
I'll do that in the read PR
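For reference, one way to build such a value once and reuse it is `std::sync::OnceLock`. This is only a sketch: `LevelsField` and `levels_field` are hypothetical stand-ins, not Lance's `Field` type or anything from this PR.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the levels field descriptor (not Lance's `Field`).
#[derive(Debug, PartialEq)]
struct LevelsField {
    name: &'static str,
    bits_per_value: u8,
    nullable: bool,
}

/// Builds the descriptor on first use and returns the same shared
/// reference on every later call.
fn levels_field() -> &'static LevelsField {
    static LEVELS_FIELD: OnceLock<LevelsField> = OnceLock::new();
    LEVELS_FIELD.get_or_init(|| LevelsField {
        name: "",
        bits_per_value: 16,
        nullable: false,
    })
}
```

Callers would then use `levels_field()` instead of constructing a new field each time a page is encoded.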
.map(|child| child.finish(external_buffers))
.collect::<FuturesOrdered<_>>();
async move {
let mut encoded_columns = Vec::new();
- let mut encoded_columns = Vec::new();
+ let mut encoded_columns = Vec::with_capacity(child_columns.len());
Fixed.
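The point of the suggestion is that the final length is known before the loop runs, so the vector can be allocated once. A minimal sketch of the pattern, using a hypothetical synchronous helper `collect_lengths` in place of the PR's async code:

```rust
/// Hypothetical stand-in for gathering one result per child column.
/// The PR's code awaits futures; this only illustrates the allocation pattern.
fn collect_lengths(child_columns: &[Vec<u8>]) -> Vec<usize> {
    // The number of results is known up front, so reserve the space once
    // instead of letting the vector grow as items are pushed.
    let mut encoded_columns = Vec::with_capacity(child_columns.len());
    for child in child_columns {
        encoded_columns.push(child.len());
    }
    encoded_columns
}
```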
rust/lance-file/src/v2/writer.rs
Outdated
return Err(Error::InvalidInput { source: format!("cannot write batch with {} rows because {} rows have already been written and Lance files cannot contain more than 2^32 rows", num_rows, self.rows_written).into(), location: location!() });
}
Isn't the limit u64::MAX?
Yes it is. Updated the message.
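For illustration, a guard against overflowing a `u64` row counter can be written with `checked_add`. This is a sketch under assumptions: `checked_row_total` is a hypothetical helper, and the error wording is not the updated message from the PR (which is not shown here).

```rust
/// Hypothetical helper: returns the new running total, or an error message
/// if adding `num_rows` would overflow the u64 row counter.
fn checked_row_total(rows_written: u64, num_rows: u64) -> Result<u64, String> {
    rows_written.checked_add(num_rows).ok_or_else(|| {
        format!(
            "cannot write batch with {} rows because {} rows have already been written and the total would exceed u64::MAX",
            num_rows, rows_written
        )
    })
}
```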
rust/lance-encoding/src/encoder.rs
Outdated
/// Accessing this data will require 2 IOPS and accessing in a random-access fashion will require
/// a repetition index.
pub trait VariablePerValueCompressor: std::fmt::Debug + Send + Sync {
/// Compress the data into a single buffer where each value is encoded with the same number of bits
- /// Compress the data into a single buffer where each value is encoded with the same number of bits
+ /// Compress the data into a single buffer where each value is encoded with different number of bits
Thanks! Updated the comment.
// For example, 1 would mean there are 2 values in the chunk and 12 would mean there
// are 4Ki values in the chunk.
//
// This must be <= 12 (i.e. <= 4096 values)
Not sure whether this will be a problem here: in compression algorithms like fastlanes bitpacking, when the input has fewer than 1024 values, the compression algorithm itself pads the input to 1024 values. The padded values need to be written to disk and are needed for decoding.
That's fine. Compressors are free to pad as much as they want. The last block is allowed to have a non-power-of-two number of values and, if it has fewer than 1024 values and fastlanes needs to pad, that works as long as it marks num_bytes correctly (to include the padding).
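The two rules discussed here (a chunk's value count is stored as a log2 exponent of at most 12, and a padding compressor must count its padding in num_bytes) can be sketched as follows. The function names, the 1024-value block size parameter, and the fixed bits-per-value packing are assumptions for illustration, not the PR's actual API.

```rust
/// Largest allowed exponent: 2^12 = 4096 values per chunk.
const MAX_LOG_NUM_VALUES: u8 = 12;

/// Number of values in a full chunk with the given log2 exponent,
/// or `None` if the exponent exceeds the limit.
fn values_in_chunk(log_num_values: u8) -> Option<u64> {
    if log_num_values <= MAX_LOG_NUM_VALUES {
        Some(1u64 << log_num_values)
    } else {
        None
    }
}

/// Bytes a compressor would report if it pads `num_values` up to a multiple
/// of `block` values and packs each value in `bits_per_value` bits.
/// The result includes the padding, which is what num_bytes must cover.
fn padded_num_bytes(num_values: usize, block: usize, bits_per_value: usize) -> usize {
    // Round the value count up to the next multiple of `block`.
    let padded_values = (num_values + block - 1) / block * block;
    // Round the bit count up to whole bytes.
    (padded_values * bits_per_value + 7) / 8
}
```

For example, a final chunk of 300 values packed at 3 bits each with a 1024-value block would report 384 bytes, not the 113 bytes the unpadded values would need.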
}
}

fn encode_miniblock(
We may need to adjust the function signature here later when we want to do recursive encoding (arrays -> datablock, Result<EncodedPage> -> Result<DataBlock>), but I think it is fine for now.
I think only MiniBlockCompressor::compress will need recursion, not encode_miniblock, but I do agree we still need to figure out that recursion.
excellent work!
Adds a mini-block encoder
Adds structural encoder for struct and primitive
Adds compressor impl's for value compression
…2 bytes rep/def. Clean up comments
8dd5819 to 2c4966f
Going to merge on green so I can get the read path PR up (and this code isn't used yet anyways).
Adds a mini-block encoder
Adds "structural encoder" (2.1 concept) for struct and primitive
Adds compressor impl's for value compression