[Datasets] The read args for pyarrow are not passed through for Parquet #30915

Closed
jianoaix opened this issue Dec 6, 2022 · 2 comments
Labels
data Ray Data-related issues P1 Issue that should be fixed within a few weeks

Comments

@jianoaix
Contributor

jianoaix commented Dec 6, 2022

The Parquet files are read with the ray.data.read_parquet() API, which takes **arrow_parquet_args to pass arguments through to pyarrow.parquet.read_table() (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html).

However, this doesn't work. For example, ds = ray.data.read_parquet("example://iris.parquet", read_dictionary=None) results in this error:

(_sample_piece pid=3910021) TypeError: from_fragment() got an unexpected keyword argument 'read_dictionary'
(_sample_piece pid=3910021) 2022-12-06 03:11:15,460	INFO worker.py:763 -- Task failed with retryable exception: TaskID(109125051471eab0ffffffffffffffffffffffff01000000).
(_sample_piece pid=3910021) Traceback (most recent call last):
(_sample_piece pid=3910021)   File "python/ray/_raylet.pyx", line 830, in ray._raylet.execute_task
(_sample_piece pid=3910021)     class_name = actor.__class__.__name__
(_sample_piece pid=3910021)   File "python/ray/_raylet.pyx", line 834, in ray._raylet.execute_task
(_sample_piece pid=3910021)     with core_worker.profile_event(b"task:execute"):
(_sample_piece pid=3910021)   File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 465, in _sample_piece
(_sample_piece pid=3910021)     batches = piece.to_batches(
(_sample_piece pid=3910021)   File "pyarrow/_dataset.pyx", line 945, in pyarrow._dataset.Fragment.to_batches
(_sample_piece pid=3910021)   File "pyarrow/_dataset.pyx", line 928, in pyarrow._dataset.Fragment.scanner
(_sample_piece pid=3910021)   File "pyarrow/_dataset.pyx", line 2365, in pyarrow._dataset.Scanner.from_fragment
(_sample_piece pid=3910021) TypeError: from_fragment() got an unexpected keyword argument 'read_dictionary'

It looks like the Parquet sampling used for in-memory data size estimation is not handling the pass-through args correctly.

@jianoaix jianoaix added P1 Issue that should be fixed within a few weeks data Ray Data-related issues labels Dec 6, 2022
@c21
Contributor

c21 commented Dec 16, 2022

The root cause is that pyarrow._dataset.Scanner.from_fragment does not accept the read_dictionary argument. Users have to work around this with dataset_kwargs:

>>> ds = ray.data.read_parquet("example://iris.parquet", dataset_kwargs={"read_dictionary":["variety"]})
>>> ds
Dataset(num_blocks=1, num_rows=150, schema={sepal.length: double, sepal.width: double, petal.length: double, petal.width: double, variety: dictionary<values=string, indices=int32, ordered=0>})

We have 3 places to pass Parquet arguments:

  1. Create pyarrow.parquet.ParquetDataset: Have to use dataset_kwargs. read_dictionary is a valid argument - https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
  2. Sample Parquet file: No need to use dataset_kwargs. read_dictionary is NOT a valid argument - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment
  3. Read Parquet file: No need to use dataset_kwargs. read_dictionary is NOT a valid argument - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment
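To make the three call sites above concrete, here is a minimal sketch of how the keyword arguments appear to be routed. It uses plain-Python stand-ins rather than Ray or pyarrow: make_parquet_dataset, scanner_from_fragment, and read_parquet_sketch are hypothetical names with simplified signatures, not the actual implementation.

```python
# Stand-ins for the pyarrow entry points. The signatures are simplified
# assumptions for illustration, not the real pyarrow signatures.
def make_parquet_dataset(paths, read_dictionary=None, filters=None):
    # Plays the role of pyarrow.parquet.ParquetDataset (place 1).
    return {"paths": paths, "read_dictionary": read_dictionary}


def scanner_from_fragment(fragment, columns=None, batch_size=None, filter=None):
    # Plays the role of Scanner.from_fragment (places 2 and 3).
    return {"fragment": fragment, "columns": columns, "batch_size": batch_size}


def read_parquet_sketch(paths, **arrow_parquet_args):
    # (1) Only the contents of dataset_kwargs reach the dataset constructor.
    dataset_kwargs = arrow_parquet_args.pop("dataset_kwargs", {})
    dataset = make_parquet_dataset(paths, **dataset_kwargs)
    # (2)/(3) Every remaining keyword goes to the scanner, which rejects
    # names it does not know, such as read_dictionary.
    scanner = scanner_from_fragment("fragment-0", **arrow_parquet_args)
    return dataset, scanner
```

With this routing, read_parquet_sketch("iris.parquet", read_dictionary=None) raises a TypeError from the scanner stand-in, while passing dataset_kwargs={"read_dictionary": ["variety"]} succeeds, which mirrors the behavior reported in this issue.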

Our documentation for read_parquet is confusing:

For further arguments you can pass to pyarrow as a keyword argument, see
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

We do not actually accept all arguments for parquet.read_table. We accept (1) arguments specified in dataset_kwargs, which go to parquet.ParquetDataset, and (2) any other arguments, which go to Scanner.from_fragment.

This is a bad user experience, and we should think about how to improve it. IMO we should keep a whitelist of arguments to pass to PyArrow.
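One way the whitelist idea could look, as a rough sketch only: the two sets below are hypothetical and incomplete, and split_parquet_args is an illustrative helper, not an existing Ray function. The real lists would have to be checked against the pyarrow version in use.

```python
# Hypothetical allow-lists, for illustration only. A real implementation
# would need to verify these names against the installed pyarrow version.
DATASET_ARGS = {"read_dictionary", "filters", "schema"}
SCANNER_ARGS = {"columns", "filter", "batch_size"}


def split_parquet_args(**kwargs):
    """Split user kwargs by destination and reject unknown names early."""
    dataset_args, scanner_args, unknown = {}, {}, []
    for name, value in kwargs.items():
        if name in DATASET_ARGS:
            dataset_args[name] = value
        elif name in SCANNER_ARGS:
            scanner_args[name] = value
        else:
            unknown.append(name)
    if unknown:
        # Fail up front with a readable message instead of a TypeError
        # raised later inside a remote sampling task.
        raise ValueError(f"Unsupported Parquet read arguments: {sorted(unknown)}")
    return dataset_args, scanner_args
```

The point of the design is that the user would no longer need to know about dataset_kwargs: each argument is sent to the call that accepts it, and anything unrecognized is rejected before any task is launched.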

@jianoaix
Contributor Author

The read_parquet API documentation looks clear to me. It seems the internal implementation is not honoring it. If not all args are supported, we need to document that in the API.
