Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: reconnect block compression #2878

Merged
merged 3 commits into from
Sep 18, 2024

Conversation

westonpace
Copy link
Contributor

This allows block compression (zstd) in the narrow case of using it for binary data. A more generalized approach to block compression can be handled as part of 2.1.

@codecov-commenter
Copy link

codecov-commenter commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 81.08108% with 7 lines in your changes missing coverage. Please review.

Project coverage is 77.84%. Comparing base (9c361fe) to head (0f80146).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...st/lance-encoding/src/encodings/physical/binary.rs 78.26% 3 Missing and 2 partials ⚠️
rust/lance-encoding/src/encoder.rs 85.71% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2878      +/-   ##
==========================================
- Coverage   77.98%   77.84%   -0.15%     
==========================================
  Files         231      231              
  Lines       70643    70199     -444     
  Branches    70643    70199     -444     
==========================================
- Hits        55090    54643     -447     
- Misses      12424    12685     +261     
+ Partials     3129     2871     -258     
Flag Coverage Δ
unittests 77.84% <81.08%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably expose the compression level. Then users could tune the compression ratio to compression speed tradeoff.

let bytes_encoding = ProtobufUtils::flat_encoding(
/*bits_per_value=*/ 8,
bytes_buffer_index,
self.compression_scheme.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clippy

Suggested change
self.compression_scheme.clone(),
self.compression_scheme,

@westonpace
Copy link
Contributor Author

We should probably expose the compression level.

@wjones127 ah, now we run into the question of how we should allow this kind of advanced configuration

We could use a separate metadata key lance-encoding:compression-level or we could set the lance-encoding:compression key to a non-trivial value (e.g. { "lance-encoding:compression": "zstd-3" }). (or there are probably a lot of other things we can do).

In general, I don't like exposing too much configuration to the user (which seems hypocritical to say since I think I've added half a dozen environment variables in the last month) without good reason. Why would a user choose something other than the default compression level?

@westonpace westonpace force-pushed the feat/reconnect-compression branch from ff5c1e4 to 0f80146 Compare September 18, 2024 22:21
@westonpace westonpace merged commit ef39518 into lancedb:main Sep 18, 2024
22 checks passed
@niyue
Copy link
Contributor

niyue commented Oct 18, 2024

We should probably expose the compression level

We could use a separate metadata key lance-encoding:compression-level or we could set the lance-encoding:compression key to a non-trivial value (e.g. { "lance-encoding:compression": "zstd-3" })

Why would a user choose something other than the default compression level?

@wjones127 @westonpace
I would like to know whether exposing the compression level as a configurable option is still something being considered. If this is considered, I would be happy to submit a PR for it.

Zstd offers 22 compression levels, each yielding different compression ratios. The compressed data size can vary by up to 50% between levels 1 and 22, depending on data distribution (e.g., [1]). Some databases, like AWS Athena, allow users to specify the Zstd compression level via DDL [2]. Similarly, it would be beneficial if Lance provided the ability to customize compression settings, enabling users to balance time and space efficiency based on their use cases. Thanks.

[1] https://www.reddit.com/r/compression/comments/18e524n/zstd_compression_ratios_by_level/
[2] https://docs.aws.amazon.com/athena/latest/ug/compression-support-zstd-levels.html

@wjones127
Copy link
Contributor

I'm for it, if we can find a sensible place to put it.

@westonpace
Copy link
Contributor Author

I'm ok with exposing it. I'd prefer a entry in the field metadata like lance-encoding:compression-level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants