
Allow on-the-fly decompression when reading CSV/JSON data? #8323

Open
baggiponte opened this issue Apr 18, 2023 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@baggiponte
Contributor

Problem description

pandas can decompress JSON and CSV files on the fly when reading. For example:

import pandas as pd

# pandas infers gzip compression from the ".gz" suffix and decompresses while reading.
with pd.read_json("path/to/compressed.json.gz", lines=True, chunksize=100, nrows=1000) as reader:
    for chunk in reader:
        ...

At first I assumed polars could not (or should not) do this, because I thought that, in order to decompress a file, pandas would have to decompress the whole file under the hood. However, read_json allows reading the data in chunks, and I noticed that only the specified chunksize is decompressed.

The bad news is that this only happens when the engine parameter is set to ujson: when using pyarrow, decompression is significantly slower, and I assume this is because pyarrow has to decompress the whole file (it might simply be slower overall, but I couldn't tell).
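For reference, this is roughly the comparison I mean (the path is illustrative; in recent pandas versions read_json accepts engine="ujson" or engine="pyarrow", and the pyarrow engine requires lines=True):

# Default "ujson" engine: decompresses incrementally when reading in chunks.
with pd.read_json("path/to/compressed.json.gz", lines=True, chunksize=100, engine="ujson") as reader:
    for chunk in reader:
        ...

# "pyarrow" engine: noticeably slower on the same file, presumably because the
# whole file is decompressed up front.
df = pd.read_json("path/to/compressed.json.gz", lines=True, engine="pyarrow")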

Do you think that Rust's arrow2 might support that? Or that polars should have this feature?

Would love to help, but unfortunately I am only proficient on the Python side.

baggiponte added the enhancement label on Apr 18, 2023
@lucifermorningstar1305

lucifermorningstar1305 commented Sep 25, 2024

Has this been integrated into Python now?

@ohanf
Contributor

ohanf commented Sep 25, 2024

As of 1.7.1 I added support in pl.read_json; please confirm your use case is being met. There was already support via both of the NDJSON read/scan functions before that release. Note that, as far as I can tell, we do have to decompress the whole file into memory first for both file formats. I don't think we can easily adopt the chunking suggestion with the current implementation, although I only got familiar enough with the code to add what was missing, so I could be missing something.
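For anyone arriving here later, a minimal sketch of what this looks like from Python (file paths are illustrative; per the comments in this thread, the whole file is decompressed into memory rather than read in chunks):

import polars as pl

# Going by this thread, both NDJSON entry points accept a compressed file and
# decompress it fully into memory before parsing (no chunked streaming).
df = pl.read_ndjson("path/to/compressed.ndjson.gz")   # eager
lf = pl.scan_ndjson("path/to/compressed.ndjson.gz")   # lazy
df = lf.collect()

# Regular (non-line-delimited) JSON, per the comment above, since 1.7.1:
df2 = pl.read_json("path/to/compressed.json.gz")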

@lucifermorningstar1305

I am using the scan_ndjson one and it's working so far.
