As of 1.7.1 I added support in pl.read_json; please confirm your use case is being met. There was actually support via both of the ndjson read/scan functions before that release. Note that I believe we do have to decompress the whole file into memory first for both file formats. I don't think we can easily adopt the chunking suggestion with the current implementation, although I only got familiar enough with the code to add what was missing, so I could be overlooking something.
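To make the distinction concrete, here is a minimal standard-library sketch of the two strategies discussed in this thread: decompressing the whole file into memory versus stream-decompressing only the lines that are needed. It is not how polars or pandas implement their readers; the helper names (`write_sample`, `read_all`, `read_head`) and the sample data are hypothetical.

```python
import gzip
import itertools
import json
import os
import tempfile


def write_sample(path, n_rows):
    """Write n_rows of gzip-compressed NDJSON (hypothetical sample data)."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for i in range(n_rows):
            fh.write(json.dumps({"id": i}) + "\n")


def read_all(path):
    """Decompress the whole file into memory, then parse every line."""
    with gzip.open(path, "rb") as fh:
        data = fh.read()  # the full decompressed payload is held in memory
    return [json.loads(line) for line in data.splitlines()]


def read_head(path, n):
    """Stream-decompress and parse only the first n lines."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        # gzip decompresses incrementally, so the rest of the file is not read
        return [json.loads(line) for line in itertools.islice(fh, n)]
```

Streaming like this only works line by line for newline-delimited JSON; a single top-level JSON array generally has to be read in full before it can be parsed with `json.loads`.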
Problem description

`pandas` can decompress JSON and CSV files before reading them.

At first I thought `polars` could/would/should not do it, because I assumed that, in order to decompress a file, `pandas` would have to decompress the whole file under the hood. However, `read_json` allows reading data in chunks, and I noticed that only the specified `chunksize` is decompressed.

The bad news is that this only happens when the `engine` parameter is set to `ujson`: when using `pyarrow`, I noticed that decompression is significantly slower, and I assumed this happened because `pyarrow` has to decompress the whole file (it might be that it is simply slower, but I couldn't tell).

Do you think that Rust's `arrow2` might support that? Or that `polars` should have this feature?

Would love to help, but unfortunately I am only proficient on the Python side.
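For reference, the pandas behaviour described above can be sketched roughly as follows. This is an illustration rather than the original example from the issue: the file name, the sample rows, and the `chunk_lengths` helper are hypothetical, and `engine` is left at its default (`ujson` in pandas 2.x).

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def chunk_lengths(path, chunksize):
    """Read a compressed NDJSON file in chunks and return each chunk's row count."""
    lengths = []
    # chunksize requires lines=True; compression is inferred from the ".gz" suffix
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:  # each chunk is a DataFrame of up to `chunksize` rows
            lengths.append(len(chunk))
    return lengths


# Build a small gzip-compressed NDJSON file (hypothetical sample data).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as fh:
    for i in range(5):
        fh.write(json.dumps({"id": i, "value": i * 10}) + "\n")
```

Calling `chunk_lengths(sample_path, 2)` iterates over the reader lazily, so each chunk is parsed only when it is requested.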