[AIR - Datasets] Hide tensor extension from UDFs. #27019
Conversation
Thanks @clarkzinzow!
Thanks for the quick progress on this! Can you also make sure to kick off the nightly release test on this PR with the new API: https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/gpu_batch_prediction.py for batch prediction? Training should be OK given that the example you changed covers incremental learning. FYI: depending on Balaji's PR #26975, if he merges first, the notebooks could change.
On a meta point: I'm struggling a bit to reconcile the performance and UX aspects of this. From the Datasets perspective, preprocessors and predictors are UDFs, which we say should not need to use TensorArray (since Datasets will handle it automatically). Yet we should still use TensorArray in Predictors and Preprocessors for performance reasons. As a user or as a developer, it seems like I still need to be aware of TensorArray for performance reasons, so it's not completely hidden away. Let's keep iterating on this in the future, to see if there is a solution that can improve the UX without sacrificing performance.
+1 I also think we should go deeper regarding how to use …
@amogkam The ideal solution is pretty simple: Pandas needs to finish making extension types first-class types (they have a few bugs/missing features here), and Torch and TensorFlow need to respect the …

The current UX push is to make users extension-type agnostic while providing the ability to opt in if they're worried about performance or memory usage, while ensuring that we, as internal library developers, use the extension type in the appropriate places when it results in free performance/efficiency wins. I think this is the current best compromise between end-user UX and perf/efficiency by default.
Yes, both your points make sense! Agreed this is a good compromise for now.
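To make the performance point above concrete, here is a minimal NumPy-only sketch (no Ray APIs involved, purely illustrative) contrasting the contiguous layout that a tensor extension column preserves with the object-dtype fallback, where getting back to a dense tensor requires a full copy:

```python
import numpy as np

# Contiguous layout, analogous to what a tensor extension column stores:
# basic slicing is zero-copy.
contiguous = np.zeros((1000, 32, 32))
view = contiguous[10:20]
assert view.base is contiguous  # a view, not a copy

# Object-dtype fallback: one boxed ndarray pointer per row.
boxed = np.empty(1000, dtype=object)
boxed[:] = list(contiguous)

# Densifying the object column requires re-stacking every row (a full copy).
stacked = np.stack(list(boxed))
assert stacked.shape == (1000, 32, 32)
```

This is why, even with casting hidden from UDFs, internal library code still benefits from keeping tensor columns contiguous.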
@amogkam @jiaodong @matthewdeng @ericl @jianoaix ready for another pass!
" ])\n", | ||
" df.loc[:, \"image\"] = [\n", | ||
" torchvision_transforms(image).numpy() for image in df[\"image\"]\n", | ||
" ]\n", |
@jiaodong Note that this is tensor extension type agnostic, but we still need to convert the Torch tensors back to NumPy ndarrays before storing them in the batch.
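As a hedged sketch of that pattern (the names here are illustrative stand-ins, not the notebook's actual transform), a UDF receives its tensor column as a plain ndarray batch, applies a per-batch transform, and returns plain ndarrays, mirroring the `torchvision_transforms(image).numpy()` conversion above without requiring Torch:

```python
import numpy as np

def preprocess(batch):
    """UDF over a dict-of-ndarrays batch; it never sees a TensorArray.

    `batch["image"]` arrives as a plain np.ndarray of shape (N, H, W);
    the output is also a plain ndarray, so any tensor extension casting
    happens outside the UDF.
    """
    images = batch["image"].astype(np.float32) / 255.0  # normalize to [0, 1]
    return {"image": images}

batch = {"image": np.full((3, 2, 2), 255, dtype=np.uint8)}
out = preprocess(batch)
```

With real torchvision, `tensor.numpy()` is the zero-copy interop step mentioned below; the key point is that both input and output stay NumPy.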
Ack, having UDFs be purely tensor-type agnostic and working with NumPy + DataFrames is already a big dev UX improvement :) I think modern DL frameworks all have good zero-copy NumPy interop, so it should be much less concerning.
I'm curious what would happen to existing user code / examples / tests if they still wrapped output / input with TensorArray, or if they're working with ragged tensors but didn't explicitly disable auto-casting?
""" | ||
if ndarray.dtype.type is np.object_: | ||
try: | ||
# Try to convert the NumPy ndarray to a non-object dtype. |
This allows us to support the non-Datasets (and non-tensor extension type) Serve use case.
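A standalone sketch of that conversion logic (a simplified stand-in, not the actual helper's name or full signature), showing the densify-or-leave-as-is behavior for uniform vs. ragged object columns:

```python
import numpy as np

def try_densify(col: np.ndarray) -> np.ndarray:
    """Try to convert an object-dtype column of per-row ndarrays into a
    single contiguous, non-object ndarray; leave it as-is on failure."""
    if col.dtype.type is np.object_:
        try:
            return np.stack(list(col))
        except ValueError:
            # Ragged (heterogeneous-shaped) rows can't be stacked.
            return col
    return col

uniform = np.empty(3, dtype=object)
uniform[:] = [np.ones((2, 2)) for _ in range(3)]
dense = try_densify(uniform)   # contiguous (3, 2, 2) float array

ragged = np.empty(2, dtype=object)
ragged[:] = [np.ones(2), np.ones(3)]
same = try_densify(ragged)     # ragged rows: column stays object dtype
```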
"TensorArray but the conversion failed, leaving column " | ||
f"as-is: {e}" | ||
) | ||
df.reset_index(drop=True, inplace=True) |
Slicing opaque object columns fails silently in user-code if we don't reset the index here (Pandas foot-gun that our tensor extension type avoids).
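The foot-gun can be reproduced with plain pandas (an illustrative sketch, not the Datasets code path): a slice keeps the original index, so integer lookups that assume a fresh index fail until the index is reset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"image": [np.zeros(4) for _ in range(5)]})
batch = df.iloc[2:]  # a slice: its index is [2, 3, 4], not [0, 1, 2]

# Label-based lookup with 0 fails, since 0 is no longer in the index.
try:
    batch["image"][0]
    raised = False
except KeyError:
    raised = True

# Resetting the index restores the expected positional behavior.
batch = batch.reset_index(drop=True)
first = batch["image"][0]  # now works
```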
Thanks @clarkzinzow!
LG for ray/data

Some test failures:
@ericl Yep, I'm looking into them.
@matthewdeng @ericl All tests and examples are passing; merging to bypass the slow MacOS queue.
We previously added automatic tensor extension casting on Datasets transformation outputs so that the user would not have to worry about tensor column casting; however, this current state creates several issues:

1. Not all tensors are supported, which means that we need an opaque object dtype (i.e. an ndarray of ndarray pointers) fallback for the Pandas-only case. Known unsupported tensor use cases:
   a. Heterogeneous-shaped (i.e. ragged) tensors
   b. Struct arrays
2. UDFs will expect a NumPy column and won't know what to do with our TensorArray type. E.g., torchvision transforms don't respect the array protocol (which they should), and instead only support Torch tensors and NumPy ndarrays; passing a TensorArray column or a TensorArrayElement (a single item in the TensorArray column) fails.
3. Implicit casting with an object dtype fallback on UDF outputs can make the input type to downstream UDFs nondeterministic, where the user won't know whether they'll get a TensorArray column or an object dtype column.
4. The tensor extension cast fallback warning spams the logs.

This PR:

1. Adds automatic casting of tensor extension columns to NumPy ndarray columns for Datasets UDF inputs, meaning UDFs never have to see tensor extensions and the UDF input column types are consistent and deterministic; this fixes both (2) and (3).
2. No longer implicitly falls back to an opaque object dtype when TensorArray casting fails (e.g. for ragged tensors), and instead raises an error; this fixes (4) but removes our support for (1).
3. Adds a global enable_tensor_extension_casting config flag, which is True by default, that controls whether we perform this automatic casting. Turning off the implicit casting provides a path for (1), since the tensor extension can be avoided when working with ragged tensors in Pandas land.

Turning off this flag also allows the user to explicitly control their tensor extension casting, if they want to work with it in their UDFs in order to reap the benefits of fewer data copies, more efficient slicing, stronger column typing, etc.
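As a rough usage sketch of that opt-out (hedged: the exact import path and context object should be checked against the Ray version in use; this assumes the Ray 2.x-era `DatasetContext` API), disabling the automatic casting is a one-line config change:

```python
from ray.data.context import DatasetContext

# Assumption: enable_tensor_extension_casting lives on the global
# DatasetContext. Turning it off keeps TensorArray columns visible to
# UDFs and allows object-dtype columns for ragged tensors.
ctx = DatasetContext.get_current()
ctx.enable_tensor_extension_casting = False
```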
Checks
- I've run scripts/format.sh to lint the changes in this PR.