[Datasets] CSV reading of unspecified extension is not robust #26605

holdenk · 2022-07-15T19:03:03Z

What happened + What you expected to happen

Trying to load a CSV file fails because the extension does not match.

fs = fsspec.filesystem('https')
ds = ray.data.read_csv(
    "https://https://gender-pay-gap.service.gov.uk/viewing/download-data/2021",
    filesystem=fs)

Results in

(_prepare_read pid=2326326) Filter: FileExtensionFilter(extensions=['.csv'], allow_if_no_extensions=False)

There is no easy way to override this that I can see.

Versions / Dependencies

ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
Python 3.8.5

Reproduction script

fs = fsspec.filesystem('https')
ds = ray.data.read_csv(
    "https://https://gender-pay-gap.service.gov.uk/viewing/download-data/2021",
    filesystem=fs)

Issue Severity

High: It blocks me from completing my task.

The text was updated successfully, but these errors were encountered:

clarkzinzow · 2022-07-15T19:15:11Z

@holdenk Have you tried setting partition_filter=None in the read_csv call?

fs = fsspec.filesystem('https')
ds = ray.data.read_csv(
    "https://https://gender-pay-gap.service.gov.uk/viewing/download-data/2021",
    filesystem=fs,
    partition_filter=None)

Docs: https://docs.ray.io/en/master/data/package-ref.html#ray.data.read_csv

pcmoritz · 2022-07-18T19:09:18Z

^ The above seems to work but is pretty counter-intuitive / not clear how to find this flag for a user -- @clarkzinzow @matthewdeng @c21 @jianoaix are there things we can do here to make the UX less painful?

jianoaix · 2022-07-18T20:27:06Z

We can probably improve the documentation, and include the examples of how to read target subset of files, to make the API usage more clear.

holdenk · 2022-07-19T00:49:32Z

That does help. I think, from my perspective, even just making the naming consistent, e.g. I looked how to set a FileExtensionFilter after I ran into this error, but the CSV reader calls the file extension filter a partition_filter (which... doesn't make sense to me? it's not filtering partitions).

c21 · 2022-07-30T00:33:09Z

I think we should improve the error message here. A better error message should look like

Did not find any available files to read from input directory or paths.
Please double check the input files are having expected extensions: [.csv], or set `partition_filter` field to None to disable file filtering.

A little bit more background: we have two types of partition filter: (1).Hive partition, (2).file directory. So I agree the naming is kind of confusing, but it's really just filtering out files based on file extension in directory.

Additional question: @holdenk - ~~curious what's the file extension in your case for thoses CSV files? Is it no file extension?~~ nvm, tried your link in example, it's a CSV file without extension. thought it was a directory.

…und (#27353) User raised issue in #26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found. Signed-off-by: Cheng Su <scnju13@gmail.com>

…und (ray-project#27353) User raised issue in ray-project#26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found. Signed-off-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

…und (ray-project#27353) User raised issue in ray-project#26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found. Signed-off-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>

thatcort · 2022-09-12T13:26:25Z

I don't think filtering files based on an assumed file extension is OK. For example, if the file is tab-delimited, will it look for a .tsv file instead of .csv? Best to just let the user specify the file name, no?

Currently read_csv filters out files without .csv extension when reading. This behavior seems to be surprising to users, and reported to be bad user experience in 3+ user reports (#26605). We should change to NOT filter files by default. Verified Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) does not filter out CSV files by default. I don't see a strong reason why we want to do it in a different way in Ray. Added documentation in case users want to use partition_filter to filter out files, and gave an example to filter out files with .csv extension. Also improve the error message when reading CSV file

Currently read_csv filters out files without .csv extension when reading. This behavior seems to be surprising to users, and reported to be bad user experience in 3+ user reports (ray-project#26605). We should change to NOT filter files by default. Verified Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) does not filter out CSV files by default. I don't see a strong reason why we want to do it in a different way in Ray. Added documentation in case users want to use partition_filter to filter out files, and gave an example to filter out files with .csv extension. Also improve the error message when reading CSV file Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

holdenk added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 15, 2022

richardliaw changed the title ~~Ray component: Dataset~~ [data] CSV reading of unspecified extension is not robust Jul 15, 2022

richardliaw added data Ray Data-related issues P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 15, 2022

clarkzinzow changed the title ~~[data] CSV reading of unspecified extension is not robust~~ [Datasets] CSV reading of unspecified extension is not robust Jul 15, 2022

c21 self-assigned this Jul 30, 2022

c21 mentioned this issue Aug 2, 2022

[Datasets] Better error message for partition filtering if no file found #27353

Merged

7 tasks

richardliaw mentioned this issue Aug 26, 2022

[Core] read_csv read_json can't work without a suffix #28124

Closed

richardliaw added this to the Ray 2.1 milestone Oct 1, 2022

c21 mentioned this issue Oct 4, 2022

[Datasets] read_csv not filter out files by default #29032

Merged

7 tasks

clarkzinzow closed this as completed in #29032 Oct 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Datasets] CSV reading of unspecified extension is not robust #26605

[Datasets] CSV reading of unspecified extension is not robust #26605

holdenk commented Jul 15, 2022

clarkzinzow commented Jul 15, 2022 •

edited by pcmoritz

Loading

pcmoritz commented Jul 18, 2022 •

edited

Loading

jianoaix commented Jul 18, 2022

holdenk commented Jul 19, 2022

c21 commented Jul 30, 2022 •

edited

Loading

thatcort commented Sep 12, 2022 •

edited

Loading

[Datasets] CSV reading of unspecified extension is not robust #26605

[Datasets] CSV reading of unspecified extension is not robust #26605

Comments

holdenk commented Jul 15, 2022

What happened + What you expected to happen

Versions / Dependencies

Reproduction script

Issue Severity

clarkzinzow commented Jul 15, 2022 • edited by pcmoritz Loading

pcmoritz commented Jul 18, 2022 • edited Loading

jianoaix commented Jul 18, 2022

holdenk commented Jul 19, 2022

c21 commented Jul 30, 2022 • edited Loading

thatcort commented Sep 12, 2022 • edited Loading

clarkzinzow commented Jul 15, 2022 •

edited by pcmoritz

Loading

pcmoritz commented Jul 18, 2022 •

edited

Loading

c21 commented Jul 30, 2022 •

edited

Loading

thatcort commented Sep 12, 2022 •

edited

Loading