For the further arguments you can pass to PyArrow as keyword arguments, see
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
We actually do not accept all arguments of parquet.read_table. We accept (1) arguments specified in dataset_kwargs, which are passed to parquet.ParquetDataset, and (2) all other arguments, which are passed to Scanner.from_fragment.
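A minimal sketch of the split described above, assuming the reader receives one kwargs dict and routes `dataset_kwargs` to `parquet.ParquetDataset` and everything else to `Scanner.from_fragment`. The function name and shape are illustrative, not Ray's actual implementation.

```python
def split_parquet_args(**arrow_parquet_args):
    """Split user-supplied kwargs into the two groups the reader forwards.

    Hypothetical helper: keys under "dataset_kwargs" would go to
    parquet.ParquetDataset; all remaining keys to Scanner.from_fragment.
    """
    dataset_kwargs = arrow_parquet_args.pop("dataset_kwargs", {})
    scanner_kwargs = arrow_parquet_args  # whatever is left over
    return dataset_kwargs, scanner_kwargs
```

Any argument that fits neither group (such as read_dictionary, which belongs to pyarrow.parquet.read_table) would fall into the scanner group and fail there, which matches the behavior reported in this issue.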
This is a bad user experience; we should think about how to improve it. IMO we should maintain a whitelist of arguments to pass through to PyArrow.
The read_parquet API documentation looks clear to me; it seems the internal implementation is not honoring it. If not all arguments are supported, we need to document the supported ones in the API.
The Parquet files are read with the ray.data.read_parquet() API, which has an argument **arrow_parquet_args to pass arguments through to pyarrow.read_table() (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html). This, however, doesn't work. For example,
ds = ray.data.read_parquet("example://iris.parquet", read_dictionary=None)
results in an error. It looks like the Parquet sampling for in-memory data size estimation is not handling the pass-through arguments correctly.
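One way a sampling path can avoid choking on pass-through arguments is to filter the kwargs down to what the callee's signature actually accepts before invoking it. This is a generic sketch of that technique, not Ray's code; `read_table` here is a stand-in for the real reader.

```python
import inspect

def filter_kwargs_for(func, kwargs):
    """Keep only the kwargs that `func` declares as parameters (illustrative)."""
    accepted = set(inspect.signature(func).parameters)
    return {k: v for k, v in kwargs.items() if k in accepted}

def sample_rows(path, **reader_kwargs):
    # Hypothetical sampling entry point: forward only the args the
    # underlying reader understands, silently dropping the rest.
    def read_table(path, columns=None):  # stand-in for the real reader
        return {"path": path, "columns": columns}

    safe_kwargs = filter_kwargs_for(read_table, reader_kwargs)
    return read_table(path, **safe_kwargs)
```

Whether to drop unknown arguments silently (as above) or raise on them is a design choice; raising is friendlier when, as in this issue, the user expects the argument to take effect.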