Support more compressions in S3 Source #5110

Open
Phlair opened this issue Jul 30, 2021 · 5 comments

Comments

@Phlair
Contributor

Phlair commented Jul 30, 2021

Tell us about the problem you're trying to solve

Currently S3 (via abstract files source) only supports gzip and bzip2 compressions.

Describe the solution you’d like

Build in support for other compression formats such as:

  • Zip
  • Lzma
  • Xz
  • Snappy

This should extend to all supported file formats (where appropriate) and sit within abstract-files-source so that any new files storage source built on top of this inherits the compression support.

If that link is broken, we may have refactored where that code sits and I forgot to update it. Comment @ me and I'll fix it!
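
For illustration, a shared decompression layer of the kind described above could be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not the connector's actual code: the function name `open_decompressed` and the suffix-based dispatch are assumptions, the zip branch assumes a seekable stream holding exactly one file, and snappy is left out because the standard library has no snappy module.

```python
import bz2
import gzip
import io
import lzma
import zipfile
from typing import BinaryIO

# Hypothetical suffix -> opener table (an assumption for this sketch).
_OPENERS = {
    ".gz": lambda raw: gzip.GzipFile(fileobj=raw, mode="rb"),
    ".bz2": lambda raw: bz2.BZ2File(raw, mode="rb"),
    ".xz": lambda raw: lzma.LZMAFile(raw, mode="rb", format=lzma.FORMAT_XZ),
    ".lzma": lambda raw: lzma.LZMAFile(raw, mode="rb", format=lzma.FORMAT_ALONE),
}


def open_decompressed(raw: BinaryIO, filename: str) -> BinaryIO:
    """Wrap `raw` in a decompressing reader chosen from the filename suffix."""
    lower = filename.lower()
    if lower.endswith(".zip"):
        # zipfile needs a seekable stream; only single-file archives are handled.
        archive = zipfile.ZipFile(raw)
        members = [n for n in archive.namelist() if not n.endswith("/")]
        if len(members) != 1:
            raise ValueError(
                f"expected exactly one file in {filename}, found {len(members)}"
            )
        return archive.open(members[0])
    for suffix, opener in _OPENERS.items():
        if lower.endswith(suffix):
            return opener(raw)
    # Unknown suffix: assume the stream is not compressed.
    return raw


# Usage: a gzip-compressed CSV held in memory.
stream = io.BytesIO(gzip.compress(b"id,name\n1,a\n"))
print(open_decompressed(stream, "data.csv.gz").read())
```

Because the wrapper returns a plain binary file-like object, each format parser could consume it without knowing which compression was applied.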

@darian-heede
Contributor

Hi, is there an update for this issue? I'm especially interested in having zip compression support for this source connector.

@ryjabe

ryjabe commented Mar 3, 2023

+1 to this goal, specifically for support of the snappy compression. The lack of support there is blocking users I have been collaborating with.

@YowanR
Contributor

YowanR commented Mar 3, 2023

@lazebnyi Can you please scope this issue? Especially support for snappy compression, please?
cc @davydov-d as well
I'd like to understand the LoE involved here, please 🙏

@davydov-d
Collaborator

@YowanR do we need all of these compression types for all the supported formats? If yes, the level of effort will be high.

I did some quick research: we currently support four formats (avro, jsonl, parquet and csv). All of them except avro are backed by the pyarrow library, which supports reading bzip2 and gzip by default. It also supports brotli, lz4, snappy and zstd, but as far as I understand we'll need to do some work to integrate these compression types. I did not find any mention in the docs of regular zip, lzma or xz support.

The avro format is backed by the fastavro library, whose docs say it supports snappy, deflate, zstandard, bzip2, lz4 and xz. So I think we should break this task down. Here's my suggestion, ordered by LoE:

  1. Support snappy and xz for avro -- only need to verify things work by default
  2. Support snappy for other file formats -- need to do some coding but use existing tools.
  3. Support zip, lzma, xz (except for avro) -- need to find or implement new solutions
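
As a rough aid for the third step, a compressed file can often be recognised from its leading bytes before choosing a decompressor. The sketch below is an assumption for illustration, not existing connector code: `sniff_compression` is a hypothetical helper, and it only covers formats with a well-known signature. Raw snappy blocks and legacy `.lzma` files have no reliable signature, so they would still need to be identified by file extension or configuration.

```python
import bz2
import gzip
import lzma
from typing import Optional

# Well-known leading byte signatures, checked in order.
_MAGIC = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"\xff\x06\x00\x00sNaPpY", "snappy-framed"),
]


def sniff_compression(header: bytes) -> Optional[str]:
    """Return a compression name for the first bytes of a file, or None."""
    for magic, name in _MAGIC:
        if header.startswith(magic):
            return name
    return None


# Usage: pass the first few bytes of each object.
print(sniff_compression(gzip.compress(b"hello")))  # gzip
print(sniff_compression(bz2.compress(b"hello")))   # bzip2
print(sniff_compression(lzma.compress(b"hello")))  # xz
print(sniff_compression(b"id,name\n1,a\n"))        # None
```

Sniffing only applies to whole-file compression. Avro and Parquet files apply their codec internally, so they would be passed to fastavro or pyarrow unchanged.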

@ryjabe

ryjabe commented Mar 6, 2023

To add more context to my previous note, the user required snappy for Parquet files.

Thank you Denys for the additional scoping here.

Status: Backlog (unscoped)