Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIR][Numpy] Add numpy narrow waist to Preprocessor and BatchMapper #28418

Merged
merged 40 commits into from
Sep 29, 2022

Conversation

jiaodong
Copy link
Member

@jiaodong jiaodong commented Sep 10, 2022

Why are these changes needed?

We need to add a numpy path in AIR to facilitate deep learning.

Internally we support arrow / pandas as dataset format, but user facing formats should only be pandas / numpy.

Therefore this PR also updated internal dispatch logic for inter-op among different data format & transform format combinations.

Changes

  • Added _transform_numpy() to BatchMapper
  • Added _transform_numpy() to Preprocessor base class
  • Added batch_format field in BatchMapper to match map_batches behavior
  • Default Preprocessor and BatchMapper to batch_format="pandas"
  • Removed all _transform_arrow() related code such that only pandas & numpy are valid transformation types
  • For multiple column arrow / pandas table, in numpy path we transform them into Dict[str, ndarray]
  • For single column arrow / pandas table, in numpy path we transform them into ndarray

Related issue number

#28346, #28522, #28524

Closes #28523

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jiaodong jiaodong marked this pull request as ready for review September 19, 2022 23:17
@jiaodong jiaodong added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 20, 2022
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some initial comments. Will review more in depth later

python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
@jiaodong
Copy link
Member Author

So chatted with @amogkam a bit offline, a few changes we need:

  1. batch_format should be BatchMapper only thing, and base class Preprocessor should still have fallback paths that decides transformation format based on data type

  2. We only need to add numpy path to DL related preprocessors, no strong need for majority of other ones yet. In the future we should expect to see some numpy-only preprocessor, some pandas-only preprocessor and a few that implements both interface.

Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking super good, mostly nits, biggest call out is making sure that we get the "tensor dataset" (our single-tensor-column representation of a collection of tensors) semantics right and that we have proper test coverage thereon.

python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_batch_mapper.py Outdated Show resolved Hide resolved
assert isinstance(with_pandas_and_arrow.transform_batch(table), pyarrow.Table)
assert isinstance(with_numpy.transform_batch(table), (np.ndarray, dict))
assert isinstance(
with_numpy.transform_batch(table_single_column), (np.ndarray, dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since column should still come back as a table unless that single column is representing a "tensor dataset", i.e. its column name is "__value__" and its dtype is TensorDtype.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline -- at UDF / transform_batch level we still return numpy data type, we can still have single column table in post-processing since numpy data will be converted back to table (arrow/pandas) format

# Auto select data_format = "arrow" -> batch_format = "numpy" for performance
assert isinstance(with_pandas_and_numpy.transform_batch(table), (np.ndarray, dict))
assert isinstance(
with_pandas_and_numpy.transform_batch(table_single_column), (np.ndarray, dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline -- at UDF / transform_batch level we still return numpy data type, we can still have single column table in post-processing since numpy data will be converted back to table (arrow/pandas) format

python/ray/data/tests/test_preprocessors.py Show resolved Hide resolved
python/ray/data/tests/test_preprocessors.py Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
python/ray/data/preprocessor.py Outdated Show resolved Hide resolved
@jiaodong
Copy link
Member Author

Test failures are irrelevant, due to gRPC upgrade on ray client, which this PR did not touch.

python/ray/air/util/data_batch_conversion.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_batch_mapper.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_batch_mapper.py Outdated Show resolved Hide resolved
jiaodong and others added 3 commits September 27, 2022 10:30
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Signed-off-by: Jiao <sophchess@gmail.com>
@jiaodong jiaodong added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Sep 27, 2022
Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

python/ray/air/tests/test_data_batch_conversion.py Outdated Show resolved Hide resolved
python/ray/air/tests/test_data_batch_conversion.py Outdated Show resolved Hide resolved
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks lgtm!

scv119 added a commit to scv119/ray that referenced this pull request Oct 17, 2022
rkooo567 pushed a commit that referenced this pull request Oct 22, 2022
This is a quick and relatively safer attempt to address #29324

In #28418 we attempted to unify ray.air utils with shared utils function but triggered expensive ray.data imports.

Where longer term and more robust solution should be #27658
rickyyx pushed a commit that referenced this pull request Oct 24, 2022
This is a quick and relatively safer attempt to address #29324

In #28418 we attempted to unify ray.air utils with shared utils function but triggered expensive ray.data imports.

Where longer term and more robust solution should be #27658
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…r` (ray-project#28418)

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
This is a quick and relatively safer attempt to address ray-project#29324

In ray-project#28418 we attempted to unify ray.air utils with shared utils function but triggered expensive ray.data imports.

Where longer term and more robust solution should be ray-project#27658

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove _transform_arrow from Preprocessors
6 participants