[AIR][Numpy] Add numpy narrow waist to `Preprocessor` and `BatchMapper` #28418

jiaodong · 2022-09-10T00:23:42Z

Why are these changes needed?

AIR needs a numpy path to support deep learning workloads.

Internally, datasets are stored in arrow or pandas format, but the user-facing formats should be only pandas and numpy.

This PR therefore also updates the internal dispatch logic that handles interop between the different combinations of data format and transform format.

Changes

- Added _transform_numpy() to BatchMapper
- Added _transform_numpy() to the Preprocessor base class
- Added a batch_format field to BatchMapper to match map_batches behavior
- Preprocessor and BatchMapper default to batch_format="pandas"
- Removed all _transform_arrow() related code, so only pandas and numpy are valid transformation types
- In the numpy path, a multi-column arrow / pandas table is transformed into Dict[str, ndarray]
- In the numpy path, a single-column arrow / pandas table is transformed into an ndarray
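
The last two items describe a conversion rule that can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in this PR: `table_to_numpy` is a hypothetical helper name, and it handles only pandas input to stay dependency-light (an arrow table would need an equivalent conversion first).

```python
from typing import Dict, Union

import numpy as np
import pandas as pd


def table_to_numpy(df: pd.DataFrame) -> Union[np.ndarray, Dict[str, np.ndarray]]:
    """Hypothetical helper illustrating the numpy-path conversion rule."""
    if len(df.columns) == 1:
        # Single-column table: hand the transform a bare ndarray.
        return df.iloc[:, 0].to_numpy()
    # Multi-column table: one ndarray per column, keyed by column name.
    return {name: df[name].to_numpy() for name in df.columns}


single = table_to_numpy(pd.DataFrame({"x": [1, 2, 3]}))
multi = table_to_numpy(pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]}))
```

Here `single` is an ndarray and `multi` is a dict with keys "x" and "y", which matches the two cases listed above.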

Related issue number

#28346, #28522, #28524

Closes #28523

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

amogkam

Left some initial comments. Will review more in depth later

python/ray/data/preprocessor.py

jiaodong · 2022-09-20T19:15:47Z

Chatted with @amogkam a bit offline; a few changes we need:

- batch_format should be a BatchMapper-only option, and the base class Preprocessor should still have fallback paths that decide the transformation format based on the data type.
- We only need to add a numpy path to DL-related preprocessors; there is no strong need for the majority of the other ones yet. In the future we should expect to see some numpy-only preprocessors, some pandas-only preprocessors, and a few that implement both interfaces.
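
A rough sketch of the fallback idea, assuming a preprocessor may override either or both transform methods. `SketchPreprocessor`, `pick_transform_format`, and the two subclasses are hypothetical names for illustration and are not Ray's implementation; the preference for numpy on arrow data follows the test comment quoted later in this thread, while preferring pandas on pandas data is an assumption.

```python
import numpy as np
import pandas as pd


class SketchPreprocessor:
    """Hypothetical stand-in for a preprocessor base class (not Ray's code)."""

    def _transform_pandas(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    def _transform_numpy(self, batch):
        raise NotImplementedError

    def _implemented_formats(self):
        # A path counts as implemented only if the subclass overrides it.
        cls = type(self)
        formats = []
        if cls._transform_pandas is not SketchPreprocessor._transform_pandas:
            formats.append("pandas")
        if cls._transform_numpy is not SketchPreprocessor._transform_numpy:
            formats.append("numpy")
        return formats

    def pick_transform_format(self, data_format: str) -> str:
        """Choose a transform path from the dataset format ("pandas" or "arrow")."""
        formats = self._implemented_formats()
        if not formats:
            raise NotImplementedError("no transform path implemented")
        if len(formats) == 1:
            # Only one path exists, so the data is converted to fit it.
            return formats[0]
        # Both paths exist: keep pandas data in pandas (assumption),
        # and send arrow data down the numpy path.
        return "pandas" if data_format == "pandas" else "numpy"


class NumpyOnly(SketchPreprocessor):
    def _transform_numpy(self, batch):
        return batch


class Both(SketchPreprocessor):
    def _transform_pandas(self, df):
        return df

    def _transform_numpy(self, batch):
        return batch
```

With these definitions, a numpy-only preprocessor always takes the numpy path, while one that implements both interfaces is routed by the format of the incoming data.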

clarkzinzow

Looking super good, mostly nits. The biggest callout is making sure that we get the "tensor dataset" (our single-tensor-column representation of a collection of tensors) semantics right and that we have proper test coverage for it.

python/ray/data/preprocessor.py

python/ray/data/tests/test_batch_mapper.py

clarkzinzow · 2022-09-21T15:33:23Z

python/ray/data/tests/test_preprocessors.py

-    assert isinstance(with_pandas_and_arrow.transform_batch(table), pyarrow.Table)
+    assert isinstance(with_numpy.transform_batch(table), (np.ndarray, dict))
+    assert isinstance(
+        with_numpy.transform_batch(table_single_column), (np.ndarray, dict)


A single column should still come back as a table unless that single column represents a "tensor dataset", i.e. its column name is "__value__" and its dtype is TensorDtype.

jiaodong · 2022-09-24

Discussed offline: at the UDF / transform_batch level we still return a numpy data type. We can still end up with a single-column table in post-processing, since numpy data is converted back to table (arrow/pandas) format.
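
The conversion back to a table mentioned in this reply could look roughly like the sketch below. It is an illustration only: `numpy_to_table` is a hypothetical helper, it targets pandas, it assumes 1-D arrays, and it omits the TensorDtype extension type that a real tensor dataset column would use. The "__value__" column name is the one quoted in the review comment.

```python
import numpy as np
import pandas as pd

VALUE_COL = "__value__"  # column name quoted in the review comment


def numpy_to_table(batch) -> pd.DataFrame:
    """Hypothetical helper: wrap numpy output back into a pandas table."""
    if isinstance(batch, np.ndarray):
        # A bare ndarray becomes a single-column table (1-D input assumed).
        return pd.DataFrame({VALUE_COL: batch})
    # A dict of ndarrays becomes one column per key.
    return pd.DataFrame({name: arr for name, arr in batch.items()})


from_array = numpy_to_table(np.array([1, 2, 3]))
from_dict = numpy_to_table({"x": np.array([1, 2]), "y": np.array([3.0, 4.0])})
```

So a transform can return an ndarray or a dict while the dataset still ends up holding a table, including a single-column one.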

clarkzinzow · 2022-09-21T15:34:57Z

python/ray/data/tests/test_preprocessors.py

+    # Auto select data_format = "arrow" -> batch_format = "numpy" for performance
+    assert isinstance(with_pandas_and_numpy.transform_batch(table), (np.ndarray, dict))
+    assert isinstance(
+        with_pandas_and_numpy.transform_batch(table_single_column), (np.ndarray, dict)


Ditto here.

jiaodong · 2022-09-24

Discussed offline: at the UDF / transform_batch level we still return a numpy data type. We can still end up with a single-column table in post-processing, since numpy data is converted back to table (arrow/pandas) format.

python/ray/data/tests/test_preprocessors.py

python/ray/data/preprocessor.py

jiaodong · 2022-09-27T16:22:39Z

Test failures are unrelated: they come from the gRPC upgrade on Ray Client, which this PR does not touch.

python/ray/air/util/data_batch_conversion.py

python/ray/data/tests/test_batch_mapper.py

clarkzinzow

LGTM!

python/ray/air/tests/test_data_batch_conversion.py

amogkam

Thanks lgtm!

This is a quick and relatively safe attempt to address #29324. In #28418 we attempted to unify ray.air utils with a shared utility function, but that triggered expensive ray.data imports. The longer-term, more robust solution should be #27658.
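
One common way to avoid paying for an expensive import at module load time is to move the import inside the function that needs it. The sketch below shows that general pattern only; it is not taken from the follow-up change, and `statistics` is a stand-in for a heavy package such as ray.data.

```python
def summarize(values):
    """Return the mean of `values`, importing the dependency lazily."""
    # Deferred import: the module is loaded on the first call to summarize(),
    # not when the enclosing module is imported.
    import statistics

    return statistics.mean(values)
```

Importing the module that defines `summarize` stays cheap, and the cost is paid only by callers that actually use it.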

…r` (ray-project#28418) Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

This is a quick and relatively safe attempt to address ray-project#29324. In ray-project#28418 we attempted to unify ray.air utils with a shared utility function, but that triggered expensive ray.data imports. The longer-term, more robust solution should be ray-project#27658. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

jiaodong added 4 commits September 9, 2022 17:22

wip

792fb9e

wip

ce3d06d

better format with numpy tests

5908083

working without arrow

6c9e95b

jiaodong assigned clarkzinzow and amogkam Sep 14, 2022

jiaodong added 5 commits September 16, 2022 08:50

wip

6a85fbe

Merge branch 'master' into numpy_preprocessor

e8415b2

fixed test_dataset_config test

4f33964

add batch_format to BatchMapper

7f793fa

fix tests

277b23e

jiaodong marked this pull request as ready for review September 19, 2022 23:17

jiaodong requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix and c21 as code owners September 19, 2022 23:17

jiaodong added air labels Sep 19, 2022

Merge branch 'master' into numpy_preprocessor

0e53b83

jiaodong added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 20, 2022

amogkam reviewed Sep 20, 2022

python/ray/data/preprocessor.py (outdated)

python/ray/data/preprocessor.py (outdated)

jiaodong added 2 commits September 20, 2022 14:25

move batch format to BatchMapper only, and restore transform type infer path if none is given

2058d0b

docstring change

c4f083c

jiaodong assigned matthewdeng and ericl Sep 21, 2022

clarkzinzow requested changes Sep 21, 2022

ericl reviewed Sep 21, 2022

python/ray/data/preprocessor.py (outdated)

python/ray/data/preprocessor.py (outdated)

python/ray/data/preprocessor.py (outdated)

python/ray/data/preprocessor.py (outdated)

clarkzinzow requested changes Sep 27, 2022

python/ray/air/util/data_batch_conversion.py (outdated)

python/ray/data/tests/test_batch_mapper.py (outdated)

python/ray/data/tests/test_batch_mapper.py (outdated)

jiaodong and others added 3 commits September 27, 2022 10:30

Apply suggestions from code review

217e568

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Signed-off-by: Jiao <sophchess@gmail.com>

update tests

9adf4f0

fix tests

450ba34

jiaodong added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Sep 27, 2022

clarkzinzow approved these changes Sep 27, 2022

python/ray/air/tests/test_data_batch_conversion.py (outdated)

python/ray/air/tests/test_data_batch_conversion.py (outdated)

address comment about arrow tensor format with fixes around extension array bug + chucked array combining

06c5597

clarkzinzow approved these changes Sep 28, 2022

amogkam and others added 6 commits September 28, 2022 11:00

backwards compat

b6f3d0c

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>

move transform pyarrow func to air utils

2157762

Merge branch 'numpy_preprocessor' of https://github.com/jiaodong/ray into numpy_preprocessor

ff73f79

travis

7d201d4

fix minimal install

3a5fff9

remove deps of packaging for py3.6-3.8

8a70d8f

richardliaw merged commit 9c39a28 into ray-project:master Sep 29, 2022

jiaodong mentioned this pull request Sep 29, 2022

Add _transform_numpy to all DL preprocessors, ex: BatchMapper #28522

Closed

amogkam reviewed Sep 29, 2022

jiaodong mentioned this pull request Sep 29, 2022

Add numpy combination for _transform_batch to all DL preprocessors, ex: BatchMapper #28524

Closed

scv119 mentioned this pull request Oct 17, 2022

[ci][release] many_tasks failed #29324

Closed

scv119 added a commit to scv119/ray that referenced this pull request Oct 17, 2022

Revert "[AIR][Numpy] Add numpy narrow waist to Preprocessor and `BatchMapper` (ray-project#28418)"

4b3a3bc

This reverts commit 9c39a28.

scv119 mentioned this pull request Oct 17, 2022

Revert "[AIR][Numpy] Add numpy narrow waist to Preprocessor and `Ba… #29406

Closed

7 tasks

jiaodong mentioned this pull request Oct 20, 2022

[AIR] Inline AIR level ray.data imports #29517

Merged

7 tasks
