Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] CSV reading of unspecified extension is not robust #26605

Closed
holdenk opened this issue Jul 15, 2022 · 6 comments · Fixed by #29032
Closed

[Datasets] CSV reading of unspecified extension is not robust #26605

holdenk opened this issue Jul 15, 2022 · 6 comments · Fixed by #29032
Assignees
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P1 Issue that should be fixed within a few weeks

Comments

@holdenk
Copy link
Contributor

holdenk commented Jul 15, 2022

What happened + What you expected to happen

Trying to load a CSV file fails because the extension does not match.

fs = fsspec.filesystem('https')
ds = ray.data.read_csv(
    "https://https://gender-pay-gap.service.gov.uk/viewing/download-data/2021",
    filesystem=fs)

Results in

(_prepare_read pid=2326326) Filter: FileExtensionFilter(extensions=['.csv'], allow_if_no_extensions=False)

There is no easy way to override this that I can see.

Versions / Dependencies

ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
Python 3.8.5

Reproduction script

fs = fsspec.filesystem('https')
ds = ray.data.read_csv(
    "https://https://gender-pay-gap.service.gov.uk/viewing/download-data/2021",
    filesystem=fs)

Issue Severity

High: It blocks me from completing my task.

@holdenk holdenk added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 15, 2022
@richardliaw richardliaw changed the title Ray component: Dataset [data] CSV reading of unspecified extension is not robust Jul 15, 2022
@richardliaw richardliaw added data Ray Data-related issues P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 15, 2022
@clarkzinzow
Copy link
Contributor

clarkzinzow commented Jul 15, 2022

@holdenk Have you tried setting partition_filter=None in the read_csv call?

fs = fsspec.filesystem('https')
ds = ray.data.read_csv(
    "https://https://gender-pay-gap.service.gov.uk/viewing/download-data/2021",
    filesystem=fs,
    partition_filter=None)

Docs: https://docs.ray.io/en/master/data/package-ref.html#ray.data.read_csv

@clarkzinzow clarkzinzow changed the title [data] CSV reading of unspecified extension is not robust [Datasets] CSV reading of unspecified extension is not robust Jul 15, 2022
@pcmoritz
Copy link
Contributor

pcmoritz commented Jul 18, 2022

^ The above seems to work but is pretty counter-intuitive / not clear how to find this flag for a user -- @clarkzinzow @matthewdeng @c21 @jianoaix are there things we can do here to make the UX less painful?

@jianoaix
Copy link
Contributor

We can probably improve the documentation, and include the examples of how to read target subset of files, to make the API usage more clear.

@holdenk
Copy link
Contributor Author

holdenk commented Jul 19, 2022

That does help. I think, from my perspective, even just making the naming consistent, e.g. I looked how to set a FileExtensionFilter after I ran into this error, but the CSV reader calls the file extension filter a partition_filter (which... doesn't make sense to me? it's not filtering partitions).

@c21
Copy link
Contributor

c21 commented Jul 30, 2022

I think we should improve the error message here. A better error message should look like

Did not find any available files to read from input directory or paths.
Please double check the input files are having expected extensions: [.csv], or set `partition_filter` field to None to disable file filtering.

A little bit more background: we have two types of partition filter: (1).Hive partition, (2).file directory. So I agree the naming is kind of confusing, but it's really just filtering out files based on file extension in directory.

Additional question: @holdenk - curious what's the file extension in your case for thoses CSV files? Is it no file extension? nvm, tried your link in example, it's a CSV file without extension. thought it was a directory.

@c21 c21 self-assigned this Jul 30, 2022
clarkzinzow pushed a commit that referenced this issue Aug 10, 2022
…und (#27353)

User raised issue in #26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found.

Signed-off-by: Cheng Su <scnju13@gmail.com>
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this issue Aug 10, 2022
…und (ray-project#27353)

User raised issue in ray-project#26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this issue Aug 18, 2022
…und (ray-project#27353)

User raised issue in ray-project#26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
@thatcort
Copy link

thatcort commented Sep 12, 2022

I don't think filtering files based on an assumed file extension is OK. For example, if the file is tab-delimited, will it look for a .tsv file instead of .csv? Best to just let the user specify the file name, no?

@richardliaw richardliaw added this to the Ray 2.1 milestone Oct 1, 2022
clarkzinzow pushed a commit that referenced this issue Oct 7, 2022
Currently read_csv filters out files without .csv extension when reading. This behavior seems to be surprising to users, and reported to be bad user experience in 3+ user reports (#26605). We should change to NOT filter files by default.

Verified Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) does not filter out CSV files by default. I don't see a strong reason why we want to do it in a different way in Ray.

Added documentation in case users want to use partition_filter to filter out files, and gave an example to filter out files with .csv extension.

Also improve the error message when reading CSV file
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022
Currently read_csv filters out files without .csv extension when reading. This behavior seems to be surprising to users, and reported to be bad user experience in 3+ user reports (ray-project#26605). We should change to NOT filter files by default.

Verified Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) does not filter out CSV files by default. I don't see a strong reason why we want to do it in a different way in Ray.

Added documentation in case users want to use partition_filter to filter out files, and gave an example to filter out files with .csv extension.

Also improve the error message when reading CSV file

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P1 Issue that should be fixed within a few weeks
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants