
Allow on-the-fly decompression when reading CSV/JSON data? #8323

Open
baggiponte opened this issue Apr 18, 2023 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@baggiponte
Contributor

Problem description

pandas can decompress JSON and CSV files on the fly when reading. For example:

import pandas as pd

# pandas infers gzip compression from the ".gz" suffix and decompresses while reading.
with pd.read_json("path/to/compressed.json.gz", lines=True, chunksize=100, nrows=1000) as reader:
    for chunk in reader:
        ...

At first I assumed polars could not (or should not) do this, because I thought that, in order to decompress a file, pandas would have to decompress the whole file under the hood. However, read_json allows reading the data in chunks, and I noticed that only the specified chunksize is decompressed.

The bad news is that this only happens when the engine parameter is set to ujson: when using pyarrow, decompression is significantly slower, and I assume this is because pyarrow has to decompress the whole file (it might simply be slower overall, but I couldn't tell).
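For reference, this is roughly the comparison I mean (the path is illustrative; in recent pandas versions read_json accepts engine="ujson" or engine="pyarrow", and the pyarrow engine requires lines=True):

# Default "ujson" engine: decompresses incrementally when reading in chunks.
with pd.read_json("path/to/compressed.json.gz", lines=True, chunksize=100, engine="ujson") as reader:
    for chunk in reader:
        ...

# "pyarrow" engine: noticeably slower on the same file, presumably because the
# whole file is decompressed up front.
df = pd.read_json("path/to/compressed.json.gz", lines=True, engine="pyarrow")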

Do you think that Rust's arrow2 might support that? Or that polars should have this feature?

Would love to help, but unfortunately I am only proficient on the Python side.

baggiponte added the enhancement label on Apr 18, 2023
@lucifermorningstar1305

lucifermorningstar1305 commented Sep 25, 2024

Has this been integrated into Python now?

@ohanf
Contributor

ohanf commented Sep 25, 2024

As of 1.7.1 I added support in pl.read_json; please confirm your use case is being met. There was already support via both of the NDJSON read/scan functions before that release. Note that, as far as I can tell, we do have to decompress the whole file into memory first for both file formats. I don't think we can easily adopt the chunking suggestion with the current implementation, although I only got familiar enough with the code to add what was missing, so I could be missing something.
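For anyone arriving here later, a minimal sketch of what this looks like from Python (file paths are illustrative; per the comments in this thread, the whole file is decompressed into memory rather than read in chunks):

import polars as pl

# Going by this thread, both NDJSON entry points accept a compressed file and
# decompress it fully into memory before parsing (no chunked streaming).
df = pl.read_ndjson("path/to/compressed.ndjson.gz")   # eager
lf = pl.scan_ndjson("path/to/compressed.ndjson.gz")   # lazy
df = lf.collect()

# Regular (non-line-delimited) JSON, per the comment above, since 1.7.1:
df2 = pl.read_json("path/to/compressed.json.gz")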

@lucifermorningstar1305

I am using the scan_ndjson one and it's working so far.
