Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Better error message for partition filtering if no file found #27353

Merged
merged 4 commits into from
Aug 10, 2022

Conversation

c21
Copy link
Contributor

@c21 c21 commented Aug 2, 2022

Signed-off-by: Cheng Su scnju13@gmail.com

Why are these changes needed?

User raised issue in #26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found.

Before this PR, the error message is:

>>> import ray
>>> ds = ray.data.read_csv("/Users/chengsu/try/image-2")
2022-08-01 16:49:59,805	INFO worker.py:1490 -- Started a local Ray instance.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/chengsu/ray/python/ray/data/read_api.py", line 585, in read_csv
    return read_datasource(
  File "/Users/chengsu/ray/python/ray/data/read_api.py", line 268, in read_datasource
    requested_parallelism, min_safe_parallelism, read_tasks = ray.get(
  File "/Users/chengsu/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/chengsu/ray/python/ray/_private/worker.py", line 2245, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::_get_read_tasks() (pid=27887, ip=127.0.0.1)
  File "/Users/chengsu/ray/python/ray/data/read_api.py", line 1143, in _get_read_tasks
    reader.get_read_tasks(requested_parallelism),
  File "/Users/chengsu/ray/python/ray/data/datasource/file_based_datasource.py", line 442, in get_read_tasks
    np.array_split(paths, parallelism), np.array_split(file_sizes, parallelism)
  File "<__array_function__ internals>", line 180, in array_split
  File "/Users/chengsu/opt/anaconda3/envs/chengsu-dev/lib/python3.9/site-packages/numpy/lib/shape_base.py", line 778, in array_split
    raise ValueError('number sections must be larger than 0.') from None
ValueError: number sections must be larger than 0.

After this PR, the error message is:

>>> import ray
>>> ds = ray.data.read_csv("/Users/chengsu/try/image-2")
2022-08-01 17:12:35,667	INFO worker.py:1490 -- Started a local Ray instance.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/chengsu/ray/python/ray/data/read_api.py", line 585, in read_csv
    return read_datasource(
  File "/Users/chengsu/ray/python/ray/data/read_api.py", line 268, in read_datasource
    requested_parallelism, min_safe_parallelism, read_tasks = ray.get(
  File "/Users/chengsu/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/chengsu/ray/python/ray/_private/worker.py", line 2245, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::_get_read_tasks() (pid=31935, ip=127.0.0.1)
  File "/Users/chengsu/ray/python/ray/data/read_api.py", line 1136, in _get_read_tasks
    reader = ds.create_reader(**kwargs)
  File "/Users/chengsu/ray/python/ray/data/datasource/file_based_datasource.py", line 212, in create_reader
    return _FileBasedDatasourceReader(self, **kwargs)
  File "/Users/chengsu/ray/python/ray/data/datasource/file_based_datasource.py", line 359, in __init__
    raise ValueError(
ValueError: Not found any input file to read from. Please double check 'partition_filter' field is set properly.

Related issue number

#26605

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cleaner fix would be to make a PathFilter, and then we subclass it with PartitionFilter and ExtensionFilter. That should be able to address the root cause for the user (they felt confused by the parameter name being "partition" while it's nothing about partition).

if len(self._paths) == 0:
raise ValueError(
"Not found any input file to read from. Please double "
"check the input files are having expected extension(s): "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not necessarily doing an extension filtering (it can be just partition filtering yields empty paths) so we should not just point to extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jianoaix - thanks for pointing it out. I agree with your point, so changed the error message. I am happy to look into a better API for path filtering as followup. Right now I think the error message is pretty confusing ValueError: number sections must be larger than 0. and worth to fix. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry I wasn't meaning to get the API name right in this PR.

For the message itself, if more detailed/specific information wanted, maybe use isinstance() to dispatch PartitionFilter and ExtensionFilter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use isinstance() to dispatch PartitionFilter and ExtensionFilter?

hmm for PathPartitionFilter it seems a little hard to extract a message - we support an arbitrary filter_fn inside it. I changed the error message to be:

"Not found any input file to read from. Please double check
'partition_filter' field is set properly."

c21 added 3 commits August 9, 2022 10:22
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few nits!

python/ray/data/datasource/file_based_datasource.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_dataset_formats.py Outdated Show resolved Hide resolved
Signed-off-by: Cheng Su <scnju13@gmail.com>
@c21
Copy link
Contributor Author

c21 commented Aug 10, 2022

The failed book-linkcheck is unrelated, already fixed in master - #27721 .

@c21 c21 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 10, 2022
@clarkzinzow clarkzinzow merged commit 853c859 into ray-project:master Aug 10, 2022
@c21 c21 deleted the filter-empty branch August 10, 2022 19:52
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Aug 10, 2022
…und (ray-project#27353)

User raised issue in ray-project#26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…und (ray-project#27353)

User raised issue in ray-project#26605, where the user found the error message was quite non-actionable when partition filtering input files, and no files with required extension being found.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants