Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add the basic encode path for 2.1 #3002

Merged
merged 3 commits into from
Oct 15, 2024

Conversation

westonpace
Copy link
Contributor

Adds a mini-block encoder
Adds "structural encoder" (2.1 concept) for struct and primitive
Adds compressor impl's for value compression

@github-actions github-actions bot added enhancement New feature or request python labels Oct 13, 2024
@codecov-commenter
Copy link

codecov-commenter commented Oct 13, 2024

Codecov Report

Attention: Patch coverage is 19.35953% with 554 lines in your changes missing coverage. Please review.

Project coverage is 78.24%. Comparing base (6473f26) to head (2c4966f).

Files with missing lines Patch % Lines
.../lance-encoding/src/encodings/logical/primitive.rs 12.86% 235 Missing and 2 partials ⚠️
rust/lance-encoding/src/encoder.rs 4.27% 111 Missing and 1 partial ⚠️
...ust/lance-encoding/src/encodings/physical/value.rs 0.00% 95 Missing ⚠️
...ust/lance-encoding/src/encodings/logical/struct.rs 17.64% 56 Missing ⚠️
rust/lance-encoding/src/format.rs 0.00% 18 Missing ⚠️
rust/lance-encoding/src/decoder.rs 46.87% 17 Missing ⚠️
rust/lance-file/src/v2/reader.rs 77.77% 3 Missing and 3 partials ⚠️
rust/lance-file/src/v2/writer.rs 71.42% 4 Missing and 2 partials ⚠️
rust/lance-core/src/datatypes/field.rs 0.00% 3 Missing ⚠️
rust/lance-encoding/src/encodings/logical/list.rs 81.25% 2 Missing and 1 partial ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3002      +/-   ##
==========================================
- Coverage   78.97%   78.24%   -0.73%     
==========================================
  Files         238      238              
  Lines       75606    76234     +628     
  Branches    75606    76234     +628     
==========================================
- Hits        59707    59647      -60     
- Misses      12870    13553     +683     
- Partials     3029     3034       +5     
Flag Coverage Δ
unittests 78.24% <19.35%> (-0.73%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Some minor suggestions

)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accident?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks. Fixed.

@@ -248,15 +248,51 @@ use crate::{BufferScheduler, EncodingsIo};
// If users are getting batches over 10MiB large then it's time to reduce the batch size
const BATCH_SIZE_BYTES_WARNING: u64 = 10 * 1024 * 1024;

/// Top-levle encoding message for a page. Wraps both the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Top-levle encoding message for a page. Wraps both the
/// Top-level encoding message for a page. Wraps both the

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

Comment on lines +1117 to +1119
DataType::List(_child) | DataType::LargeList(_child) => {
todo!()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be in a follow up PR? Or isn't hit by this code path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow-up. I haven't done lists / repetition levels completely yet.

bits_per_value: 16,
num_values,
});
let levels_field = Field::new_arrow("", DataType::UInt16, false)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth pulling this out into a static?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that in the read PR

.map(|child| child.finish(external_buffers))
.collect::<FuturesOrdered<_>>();
async move {
let mut encoded_columns = Vec::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut encoded_columns = Vec::new();
let mut encoded_columns = Vec::with_capacity(child_columns.len());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment on lines 325 to 326
return Err(Error::InvalidInput { source: format!("cannot write batch with {} rows because {} rows have already been written and Lance files cannot contain more than 2^32 rows", num_rows, self.rows_written).into(), location: location!() });
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the limit u64::MAX?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is. Updated the message.

/// Accessing this data will require 2 IOPS and accessing in a random-access fashion will require
/// a repetition index.
pub trait VariablePerValueCompressor: std::fmt::Debug + Send + Sync {
/// Compress the data into a single buffer where each value is encoded with the same number of bits
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Compress the data into a single buffer where each value is encoded with the same number of bits
/// Compress the data into a single buffer where each value is encoded with different number of bits

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated the comment.

// For example, 1 would mean there are 2 values in the chunk and 12 would mean there
// are 4Ki values in the chunk.
//
// This must be <= 12 (i.e. <= 4096 values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this will be a problem here:

in compression algorithms like fastlanes bitpacking, when the input is less than 1024 values, the compression algorithm itself will pad the input size to 1024 values. the padded values are needed to write to disk and needed for decoding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. Compressors are free to pad as much as they want. The last block is allowed to have a non-power-of-two and, if it's less than 1024 values and fastlanes needs to pad then as long as it marks num_bytes correctly (to include the padding)

}
}

fn encode_miniblock(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may need to adjust the function signature here later when we want to do recursively encoding(arrays -> datablock, Result<EncodedPage> ->Result<DataBlock>) but I think it is fine now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only MiniBlockCompressor::compress will need recursion, not encode_miniblock but I do agree we still need to figure out that recursion still.

Copy link
Contributor

@broccoliSpicy broccoliSpicy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent work!

Adds a mini-block encoder
Adds structural encoder for struct and primitive
Adds compressor impl's for value compression
@westonpace westonpace force-pushed the feat/2.1-encoding-path-2 branch from 8dd5819 to 2c4966f Compare October 15, 2024 14:29
@westonpace
Copy link
Contributor Author

Going to merge on green so I can get the read path PR up (and this code isn't used yet anyways)

@broccoliSpicy broccoliSpicy merged commit f60a6ce into lancedb:main Oct 15, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants