[AIR][Numpy] Add numpy narrow waist to Preprocessor and BatchMapper
#28418
Conversation
Left some initial comments. Will review more in depth later
So chatted with @amogkam a bit offline, a few changes we need:
…er path if none is given
Looking super good, mostly nits, biggest call out is making sure that we get the "tensor dataset" (our single-tensor-column representation of a collection of tensors) semantics right and that we have proper test coverage thereon.
assert isinstance(with_pandas_and_arrow.transform_batch(table), pyarrow.Table)
assert isinstance(with_numpy.transform_batch(table), (np.ndarray, dict))
assert isinstance(
    with_numpy.transform_batch(table_single_column), (np.ndarray, dict)
)
A single column should still come back as a table unless that column represents a "tensor dataset" (our single-tensor-column representation), i.e. its column name is "__value__" and its dtype is TensorDtype.
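To illustrate that check, here is a minimal sketch. The sentinel name "__value__" comes from the comment above; the real check would also verify the column's dtype is Ray's TensorDtype, which this sketch omits:

```python
import numpy as np
import pandas as pd

TENSOR_COLUMN_NAME = "__value__"  # sentinel column name from the discussion above

def is_tensor_dataset_batch(df: pd.DataFrame) -> bool:
    # Sketch only: the real check also verifies the dtype is TensorDtype;
    # here we just look for the single sentinel column.
    return list(df.columns) == [TENSOR_COLUMN_NAME]

# An ordinary single-column batch should still come back as a table...
plain = pd.DataFrame({"a": [1, 2, 3]})
# ...while the sentinel single-column form is the "tensor dataset" case.
tensor = pd.DataFrame({TENSOR_COLUMN_NAME: [np.zeros(2), np.zeros(2)]})

print(is_tensor_dataset_batch(plain), is_tensor_dataset_batch(tensor))  # False True
```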
Discussed offline: at the UDF / transform_batch level we still return a numpy data type. We can still have a single-column table in post-processing, since the numpy data will be converted back to a table (Arrow/Pandas) format.
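A rough sketch of that post-processing conversion. The helper name is hypothetical (this is not Ray's actual code), and the "__value__" column name is assumed from the thread above:

```python
import numpy as np
import pandas as pd

TENSOR_COLUMN_NAME = "__value__"  # assumed sentinel column name

def numpy_to_pandas(batch):
    # Hypothetical helper: convert the numpy representation a UDF
    # returns back into a tabular batch.
    if isinstance(batch, dict):
        # dict of column name -> ndarray maps onto a multi-column table
        return pd.DataFrame(batch)
    # a bare ndarray becomes the single-column tensor representation
    return pd.DataFrame({TENSOR_COLUMN_NAME: list(batch)})

df = numpy_to_pandas(np.array([1, 2, 3]))
print(list(df.columns))  # ['__value__']
```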
# Auto-select data_format = "arrow" -> batch_format = "numpy" for performance
assert isinstance(with_pandas_and_numpy.transform_batch(table), (np.ndarray, dict))
assert isinstance(
    with_pandas_and_numpy.transform_batch(table_single_column), (np.ndarray, dict)
)
Ditto here.
Discussed offline: at the UDF / transform_batch level we still return a numpy data type. We can still have a single-column table in post-processing, since the numpy data will be converted back to a table (Arrow/Pandas) format.
Test failures are unrelated; they come from a gRPC upgrade on the Ray client, which this PR did not touch.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Signed-off-by: Jiao <sophchess@gmail.com>
LGTM!
… array bug + chunked array combining
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
…into numpy_preprocessor
Thanks lgtm!
…tchMapper` (ray-project#28418)" This reverts commit 9c39a28.
…r` (ray-project#28418) Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
This is a quick and relatively safe attempt to address ray-project#29324. In ray-project#28418 we attempted to unify ray.air utils with shared util functions but triggered expensive ray.data imports. The longer-term and more robust solution is ray-project#27658. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Why are these changes needed?
We need to add a numpy path in AIR to facilitate deep learning.
Internally we support Arrow / Pandas as dataset formats, but the user-facing formats should only be Pandas / NumPy.
Therefore this PR also updates the internal dispatch logic for interop among the different data format and transform format combinations.
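The auto-selection part of that dispatch can be sketched roughly as follows. The function name and signature are invented for illustration (this is not Ray's actual code); the rule itself comes from the "Auto select data_format = arrow -> batch_format = numpy" comment in the tests:

```python
def select_batch_format(data_format, requested=None):
    # Illustrative dispatch, not Ray's implementation: internal data is
    # Arrow or Pandas, but the UDF only ever sees pandas or numpy batches.
    if requested is not None:
        return requested
    # Auto-select: Arrow-backed data pairs with numpy batches for
    # performance; Pandas-backed data stays pandas.
    return "numpy" if data_format == "arrow" else "pandas"

print(select_batch_format("arrow"))   # numpy
print(select_batch_format("pandas"))  # pandas
```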
Changes
- Add `batch_format` field in `BatchMapper` to match `map_batches` behavior
- Default `Preprocessor` and `BatchMapper` to `batch_format="pandas"`
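To make the `batch_format` behavior concrete, here is an illustrative stand-in. `BatchMapperSketch` is invented for this example and is not Ray's implementation; it only mimics the idea that the UDF sees batches in the declared format:

```python
import pandas as pd

class BatchMapperSketch:
    # Illustrative stand-in for the behavior described above:
    # the UDF receives batches in the declared batch_format.
    def __init__(self, fn, batch_format="pandas"):  # "pandas" is the new default
        self.fn = fn
        self.batch_format = batch_format

    def transform_batch(self, df: pd.DataFrame):
        if self.batch_format == "numpy":
            # hand the UDF a dict of column-name -> ndarray, like map_batches
            out = self.fn({c: df[c].to_numpy() for c in df.columns})
            return pd.DataFrame(out)  # convert back to a table afterwards
        return self.fn(df)

mapper = BatchMapperSketch(lambda b: {c: v + 1 for c, v in b.items()},
                           batch_format="numpy")
print(mapper.transform_batch(pd.DataFrame({"x": [1, 2]}))["x"].tolist())  # [2, 3]
```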
Related issue number
#28346, #28522, #28524
Closes #28523
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.