[Datasets] CSV reading of unspecified extension is not robust #26605
Comments
@holdenk Have you tried setting partition_filter=None, like this?
fs = fsspec.filesystem('https')
ds = ray.data.read_csv(
    "https://gender-pay-gap.service.gov.uk/viewing/download-data/2021",
    filesystem=fs,
    partition_filter=None)
Docs: https://docs.ray.io/en/master/data/package-ref.html#ray.data.read_csv
^ The above seems to work, but it is pretty counter-intuitive, and it is not clear how a user would find this flag. @clarkzinzow @matthewdeng @c21 @jianoaix are there things we can do here to make the UX less painful?
We can probably improve the documentation and include examples of how to read a target subset of files, to make the API usage clearer.
That does help. I think, from my perspective, even just making the naming consistent, e.g. I looked how to set a |
I think we should improve the error message here. A better error message should look like
A little more background: we have two types of partition filter: (1) Hive partition, (2) file directory. So I agree the naming is kind of confusing, but it's really just filtering out files in the directory based on file extension. Additional question: @holdenk -
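To illustrate the extension-based filtering described above, here is a minimal sketch in plain Python. It is not Ray's implementation; `has_extension` and `filter_by_extension` are hypothetical helpers, and the `s3://` paths are made up for the example. It only shows why a URL whose path has no `.csv` suffix ends up filtered out, leaving nothing to read.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def has_extension(path: str, extension: str) -> bool:
    """Return True if the path component of `path` ends with `.<extension>`."""
    # Look only at the URL path, so dots in the host name are ignored.
    suffix = PurePosixPath(urlparse(path).path).suffix
    return suffix.lower() == f".{extension.lower()}"


def filter_by_extension(paths, extension="csv"):
    """Keep only paths with a matching extension (hypothetical helper)."""
    return [p for p in paths if has_extension(p, extension)]


paths = [
    "s3://bucket/data/part-0.csv",  # made-up path, kept
    "s3://bucket/data/_SUCCESS",    # made-up path, dropped: no extension
    "https://gender-pay-gap.service.gov.uk/viewing/download-data/2021",
]
kept = filter_by_extension(paths)
print(kept)
```

The download URL from this issue is dropped because its last path segment, `2021`, carries no extension, even though the server returns CSV content.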
…und (ray-project#27353) User raised issue in ray-project#26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found. Signed-off-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
…und (ray-project#27353) User raised issue in ray-project#26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found. Signed-off-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
I don't think filtering files based on an assumed file extension is OK. For example, if the file is tab-delimited, will it look for a |
Currently read_csv filters out files without a .csv extension when reading. This behavior is surprising to users and has been reported as a bad user experience in 3+ user reports (#26605). We should change it to NOT filter files by default. Verified that Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) do not filter out CSV files by default. I don't see a strong reason why we would want to do it differently in Ray. Added documentation in case users want to use partition_filter to filter out files, and gave an example that filters out files without a .csv extension. Also improved the error message when reading a CSV file.
Currently read_csv filters out files without a .csv extension when reading. This behavior is surprising to users and has been reported as a bad user experience in 3+ user reports (ray-project#26605). We should change it to NOT filter files by default. Verified that Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) do not filter out CSV files by default. I don't see a strong reason why we would want to do it differently in Ray. Added documentation in case users want to use partition_filter to filter out files, and gave an example that filters out files without a .csv extension. Also improved the error message when reading a CSV file. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
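The proposed default can be sketched as follows, assuming a reader that accepts an optional path predicate. This is an illustration in plain Python, not Ray's code: `select_paths` and `csv_only` are hypothetical names, and the input paths are made up. With no filter every path is read; filtering by extension becomes an explicit opt-in.

```python
from typing import Callable, Iterable, List, Optional

# A path filter is any predicate over a path string (assumed shape).
PathFilter = Callable[[str], bool]


def select_paths(
    paths: Iterable[str],
    partition_filter: Optional[PathFilter] = None,
) -> List[str]:
    """Return the paths to read; with no filter, nothing is dropped."""
    if partition_filter is None:
        return list(paths)
    return [p for p in paths if partition_filter(p)]


def csv_only(path: str) -> bool:
    """Opt-in filter: accept only paths ending in .csv (case-insensitive)."""
    return path.lower().endswith(".csv")


paths = ["data/a.csv", "data/b.tsv", "data/download-2021"]  # made-up paths
read_all = select_paths(paths)
read_csv_only = select_paths(paths, partition_filter=csv_only)
print(read_all)
print(read_csv_only)
```

Keeping the filter as a caller-supplied predicate means a tab-delimited or extensionless file is read unless the caller asks for it to be skipped.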
What happened + What you expected to happen
Trying to load a CSV file fails because the extension does not match.
Results in
There is no easy way to override this that I can see.
Versions / Dependencies
ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
Python 3.8.5
Reproduction script
Issue Severity
High: It blocks me from completing my task.