Issue on page /data/examples/nyc_taxi_basic_processing.html #27738
The file metadata provider is actually not related here. Here is one example working on my side (please let me know if it does not work for you): I created a TSV file from the Wikipedia article on tab-separated values - https://en.wikipedia.org/wiki/Tab-separated_values .
Then I can read it from Ray:
@davidxiaozhi - by the way, if the example above does not work for you, it could be related to the schema of your TSV file. Please share your TSV file here so we can debug further if needed.
curl -O https://storage.googleapis.com/criteo-cail-datasets/day_1.gz
@c21 At first it was a schema problem, and adding the schema was enough. However, when the data contains malformed rows, Ray reports an error and cannot proceed. The invalid_row_handler configured on ParseOptions does not take effect, so the problematic rows are not skipped:

def skip_comment(row):
    print("===skip:", row)
    return 'skip'

parse_options = csv.ParseOptions(delimiter="\t", invalid_row_handler=skip_comment)
read_options = csv.ReadOptions(
    column_names=DEFAULT_COLUMN_NAMES,
    skip_rows=1,
    use_threads=True,
)
file = ray.data.read_csv(
    "/home/maer/zhipeng.li/data_tmp/data_0.tsv",
    parallelism=200,
    ray_remote_args={"num_cpus": 0.25},
    read_options=read_options,
    parse_options=parse_options,
)

Traceback (most recent call last):
FREQUENCY_THRESHOLD = 3. Here is the row from the dataset that triggers the error (the schema is DEFAULT_COLUMN_NAMES):

1 5 110 16 1 0 14 7 1 306 62770d79 e21f5d58 afea442f 945c7fcf 38b02748 6fcd6dcb 3580aa21 28808903 46dedfa6 2e027dc1 0c7c4231 95981d1f 00c5ffb be4ee537 8a0b74cc 4cdc3efa d20856aa b8170bba 9512c20b c38e2f28 14f65a5d 25b1b089 d7c1fc0b 7caf609c 30436bfc ed10571d
See discussion in #27738 Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
@matthewdeng Do you know what prevents us from being compatible with a more recent PyArrow release?
@pcmoritz the issue is being tracked in #22310, @clarkzinzow can shed more light on the forward fix for PyArrow 10. |
Ah, this is https://issues.apache.org/jira/browse/ARROW-10739. Well, I don't think we can close this bug until that's resolved then :) However, if we install pyarrow >= 7.0 manually, I have now verified that it works (on the latest Ray master with Kai's fix) on test.tsv:
import ray
from pyarrow import csv

def skip_comment(row):
    print("===skip:", row)
    return 'skip'

parse_options = csv.ParseOptions(delimiter="\t", invalid_row_handler=skip_comment)
ds = ray.data.read_csv("test.tsv", parse_options=parse_options, partition_filter=None)
ds.show()

This results in the correct output.
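For reference, the contract of the `invalid_row_handler` callback (return `'skip'` to drop a malformed row, anything else to raise) can be sketched in plain Python. This is an illustrative stand-in for pyarrow's behavior, not pyarrow itself; `parse_tsv` is a hypothetical helper:

```python
import csv
import io

def parse_tsv(text, num_columns, invalid_row_handler):
    """Parse TSV text, asking the handler what to do with rows whose
    column count does not match the expected schema (mirrors the
    spirit of pyarrow.csv.ParseOptions.invalid_row_handler)."""
    rows = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) != num_columns:
            action = invalid_row_handler(row)
            if action == "skip":
                continue  # drop the malformed row
            raise ValueError(f"invalid row: {row!r}")
        rows.append(row)
    return rows

def skip_comment(row):
    return "skip"

data = "a\tb\tc\n1\t2\t3\nbad-row\n4\t5\t6\n"
rows = parse_tsv(data, num_columns=3, invalid_row_handler=skip_comment)
# rows == [['a', 'b', 'c'], ['1', '2', '3'], ['4', '5', '6']]
```

The malformed single-field row is silently dropped because the handler returns `"skip"`; returning anything else surfaces the row as an error instead.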
@davidxiaozhi This should unblock you, but let us know if there is more trouble :)
To use the patch from #28326 / #28327 without upgrading Ray, here is a workaround: #28326 (comment)
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
Hi again! This issue is being closed because there has been no further activity in the 14 days since the last message. Please feel free to reopen it, or open a new issue, if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for opening the issue!
I want to use this framework, but I found that there are too few demos and the cost of getting started is very high. For example, when I tried to read TSV files, it directly reported the following error:
File "/root/conda/lib/python3.9/site-packages/ray/data/read_api.py", line 529, in read_csv
return read_datasource(
File "/root/conda/lib/python3.9/site-packages/ray/data/read_api.py", line 269, in read_datasource
block_list.ensure_metadata_for_first_block()
File "/root/conda/lib/python3.9/site-packages/ray/data/impl/lazy_block_list.py", line 305, in ensure_metadata_for_first_block
metadata = ray.get(metadata_ref)
File "/root/conda/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/root/conda/lib/python3.9/site-packages/ray/worker.py", line 1831, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ArrowInvalid): ray::_execute_read_task() (pid=22123, ip=172.16.15.11)
File "/root/conda/lib/python3.9/site-packages/ray/data/impl/lazy_block_list.py", line 451, in _execute_read_task
block = task()
File "/root/conda/lib/python3.9/site-packages/ray/data/datasource/datasource.py", line 146, in call
for block in result:
File "/root/conda/lib/python3.9/site-packages/ray/data/datasource/file_based_datasource.py", line 212, in read_files
yield output_buffer.next()
File "/root/conda/lib/python3.9/site-packages/ray/data/impl/output_buffer.py", line 74, in next
block = self._buffer.build()
File "/root/conda/lib/python3.9/site-packages/ray/data/impl/delegating_block_builder.py", line 64, in build
return self._builder.build()
File "/root/conda/lib/python3.9/site-packages/ray/data/impl/table_block.py", line 85, in build
return self._concat_tables(tables)
File "/root/conda/lib/python3.9/site-packages/ray/data/impl/arrow_block.py", line 91, in _concat_tables
return pyarrow.concat_tables(tables, promote=True)
File "pyarrow/table.pxi", line 2338, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Can't unify schema with duplicate field names.
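The "duplicate field names" failure above often comes from the first data row being treated as the header: if that row contains repeated values, every block ends up with duplicate column names and `concat_tables` cannot unify the schemas. A quick stdlib check for this condition (a hedged sketch; `header_has_duplicates` is a hypothetical helper, not a Ray or PyArrow API) might look like:

```python
import csv
import io

def header_has_duplicates(tsv_text):
    """Return True if the first row of a TSV, read as a header,
    contains repeated field names."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return len(header) != len(set(header))

# A data row mistakenly used as a header usually repeats values:
print(header_has_duplicates("1\t5\t1\t0\n"))  # True
print(header_has_duplicates("a\tb\tc\n"))     # False
```

Passing explicit `column_names` via `ReadOptions`, as shown in the comments above, avoids using the first row as a header at all.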
By reading the source code and consulting the official demos, I initially concluded that I might need to implement DefaultFileMetadataProvider myself.
But I cannot find any relevant documentation at all, and I hope the project can provide many more demos to reduce the learning cost.
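More generally, a malformed TSV file can be triaged before handing it to Ray at all. The sketch below uses only the standard library; `check_tsv_columns` is a hypothetical helper, not part of Ray's API. It reports the distribution of column counts so rows that will break schema inference stand out:

```python
import csv
import tempfile

def check_tsv_columns(path):
    """Count how many rows have each column count in a TSV file."""
    counts = {}
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            counts[len(row)] = counts.get(len(row), 0) + 1
    return counts

# Demo on a small file containing one malformed row.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write("a\tb\tc\n1\t2\t3\nbad-row\n")
    path = f.name

counts = check_tsv_columns(path)
# A healthy file has a single key; here the lone 1-column row is the culprit.
```

A result with more than one key pinpoints the rows to clean up (or to skip via `invalid_row_handler`) before the Ray job runs.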