[Datasets] The read args for pyarrow are not passed through for Parquet #30915

jianoaix · 2022-12-06T03:09:03Z

Parquet files are read with the ray.data.read_parquet() API, which takes **arrow_parquet_args to pass arguments through to pyarrow.parquet.read_table() (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html).

However, this doesn't work. For example, ds = ray.data.read_parquet("example://iris.parquet", read_dictionary=None) results in this error:

(_sample_piece pid=3910021) TypeError: from_fragment() got an unexpected keyword argument 'read_dictionary'
(_sample_piece pid=3910021) 2022-12-06 03:11:15,460	INFO worker.py:763 -- Task failed with retryable exception: TaskID(109125051471eab0ffffffffffffffffffffffff01000000).
(_sample_piece pid=3910021) Traceback (most recent call last):
(_sample_piece pid=3910021)   File "python/ray/_raylet.pyx", line 830, in ray._raylet.execute_task
(_sample_piece pid=3910021)     class_name = actor.__class__.__name__
(_sample_piece pid=3910021)   File "python/ray/_raylet.pyx", line 834, in ray._raylet.execute_task
(_sample_piece pid=3910021)     with core_worker.profile_event(b"task:execute"):
(_sample_piece pid=3910021)   File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 465, in _sample_piece
(_sample_piece pid=3910021)     batches = piece.to_batches(
(_sample_piece pid=3910021)   File "pyarrow/_dataset.pyx", line 945, in pyarrow._dataset.Fragment.to_batches
(_sample_piece pid=3910021)   File "pyarrow/_dataset.pyx", line 928, in pyarrow._dataset.Fragment.scanner
(_sample_piece pid=3910021)   File "pyarrow/_dataset.pyx", line 2365, in pyarrow._dataset.Scanner.from_fragment
(_sample_piece pid=3910021) TypeError: from_fragment() got an unexpected keyword argument 'read_dictionary'

It looks like the Parquet sampling for in-memory data size estimation is not handling the pass-through args correctly.


c21 · 2022-12-16T06:56:01Z

The root cause is that pyarrow._dataset.Scanner.from_fragment does not accept the read_dictionary argument. Users have to work around this with dataset_kwargs:

>>> ds = ray.data.read_parquet("example://iris.parquet", dataset_kwargs={"read_dictionary":["variety"]})
>>> ds
Dataset(num_blocks=1, num_rows=150, schema={sepal.length: double, sepal.width: double, petal.length: double, petal.width: double, variety: dictionary<values=string, indices=int32, ordered=0>})

We have 3 places to pass Parquet arguments:

1. Create pyarrow.parquet.ParquetDataset: Have to use dataset_kwargs. read_dictionary is a valid argument - https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
2. Sample Parquet file: No need to use dataset_kwargs. read_dictionary is NOT a valid argument - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment
3. Read Parquet file: No need to use dataset_kwargs. read_dictionary is NOT a valid argument - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment
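
To illustrate why read_dictionary fails as a top-level keyword but works inside dataset_kwargs, here is a minimal sketch of the routing across the three places listed above. make_dataset, scan_fragment, and read_parquet_sketch are hypothetical stand-ins written for this example. They are not the real PyArrow or Ray functions, and the actual Ray code may differ in detail.

```python
# Hypothetical stand-ins for illustration only; NOT the real PyArrow/Ray code.

def make_dataset(path, read_dictionary=None):
    """Stand-in for creating pyarrow.parquet.ParquetDataset.

    Like the real class, it accepts read_dictionary.
    """
    return {"path": path, "read_dictionary": read_dictionary}


def scan_fragment(dataset, columns=None, batch_size=None):
    """Stand-in for Scanner.from_fragment.

    Like the real method, it does not accept read_dictionary.
    """
    return {"dataset": dataset, "columns": columns, "batch_size": batch_size}


def read_parquet_sketch(path, **arrow_parquet_args):
    # Place 1: only the contents of dataset_kwargs reach dataset creation.
    dataset_kwargs = arrow_parquet_args.pop("dataset_kwargs", {})
    dataset = make_dataset(path, **dataset_kwargs)

    # Place 2: sampling forwards every remaining keyword to the scanner.
    # An unsupported keyword raises TypeError here.
    scan_fragment(dataset, **arrow_parquet_args)

    # Place 3: the actual read forwards the same remaining keywords.
    return scan_fragment(dataset, **arrow_parquet_args)
```

With this sketch, read_parquet_sketch("iris.parquet", read_dictionary=["variety"]) raises a TypeError at the sampling step, while read_parquet_sketch("iris.parquet", dataset_kwargs={"read_dictionary": ["variety"]}) succeeds.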

Our documentation for read_parquet is confusing:

For further arguments you can pass to pyarrow as a keyword argument, see
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

We do not actually accept all arguments of parquet.read_table. We accept (1) arguments specified in dataset_kwargs, which go to parquet.ParquetDataset, and (2) other arguments, which go to Scanner.from_fragment.

This is a bad user experience, and we should think about how to improve it. IMO we should keep a whitelist of arguments to pass to PyArrow.
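
One way the whitelist idea could look, as a rough sketch only: validate the keywords up front and fail with a message that points at dataset_kwargs, instead of letting a remote sampling task raise a TypeError. The names SCANNER_ARGS and validate_read_args are hypothetical, and the set of allowed names is chosen for illustration, not taken from Ray.

```python
# Hypothetical allowlist for illustration; not Ray's actual list of arguments.
SCANNER_ARGS = frozenset({"columns", "filter", "batch_size"})


def validate_read_args(kwargs, allowed=SCANNER_ARGS):
    """Return a copy of kwargs, or raise if any keyword is not allowed."""
    unknown = sorted(set(kwargs) - set(allowed))
    if unknown:
        raise ValueError(
            f"Unsupported read_parquet arguments: {unknown}. "
            f"Supported arguments: {sorted(allowed)}. "
            "Arguments for ParquetDataset must be passed in dataset_kwargs."
        )
    return dict(kwargs)
```

Checking on the driver before any task is launched would surface the problem once, with a clear message, rather than as a retried task failure in the worker logs.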

jianoaix · 2022-12-17T00:21:23Z

The read_parquet API documentation looks clear to me. It seems the internal implementation is not honoring it. If not all args are supported, we need to document that in the API.

jianoaix added the P1 (Issue that should be fixed within a few weeks) and data (Ray Data-related issues) labels Dec 6, 2022

jianoaix assigned c21 Dec 6, 2022

c21 mentioned this issue Dec 17, 2022

[Datasets] Allow specify batch_size when reading Parquet file #31165

Merged

7 tasks

anyscalesam closed this as completed Nov 2, 2023
