
[AIR - Datasets] Hide tensor extension from UDFs. #27019

Merged

Conversation

clarkzinzow
Contributor

@clarkzinzow clarkzinzow commented Jul 26, 2022

We previously added automatic tensor extension casting on Datasets transformation outputs so that users wouldn't have to worry about tensor column casting; however, the current state creates several issues:

  1. Not all tensors are supported, which means that we’ll need to have an opaque object dtype (i.e. ndarray of ndarray pointers) fallback for the Pandas-only case. Known unsupported tensor use cases:
    a. Heterogeneous-shaped (i.e. ragged) tensors
    b. Struct arrays
  2. UDFs will expect a NumPy column and won’t know what to do with our TensorArray type. E.g., torchvision transforms don’t respect the array protocol (which they should), and instead only support Torch tensors and NumPy ndarrays; passing a TensorArray column or a TensorArrayElement (a single item in the TensorArray column) fails.
  3. Implicit casting with object dtype fallback on UDF outputs can make the input type to downstream UDFs nondeterministic, where the user won’t know if they’ll get a TensorArray column or an object dtype column.
  4. The tensor extension cast fallback warning spams the logs.
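For illustration of case (1a), a toy sketch (plain Pandas, not Ray code): ragged tensors have no fixed-shape representation, so a Pandas column can only hold them as an opaque object-dtype column of ndarray pointers.

```python
import numpy as np
import pandas as pd

# Heterogeneous-shaped (ragged) tensors can't share a dense layout, so
# Pandas stores them as an object-dtype column of ndarray pointers.
ragged = pd.Series([np.zeros((2, 2)), np.zeros((3, 2))])
print(ragged.dtype)  # object
```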

This PR:

  • Adds automatic casting of tensor extension columns to NumPy ndarray columns for Datasets UDF inputs, meaning the UDFs will never have to see tensor extensions and that the UDF input column types will be consistent and deterministic; this fixes both (2) and (3).
  • No longer implicitly falls back to an opaque object dtype when TensorArray casting fails (e.g. for ragged tensors), and instead raises an error; this fixes (4) but removes our support for (1).
  • Adds a global enable_tensor_extension_casting config flag, which is True by default, that controls whether we perform this automatic casting. Turning off the implicit casting provides a path for (1), where the tensor extension can be avoided when working with ragged tensors in Pandas land. Turning off this flag also lets the user explicitly control their tensor extension casting, if they want to work with it in their UDFs in order to reap the benefits of fewer data copies, more efficient slicing, stronger column typing, etc.
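As a rough sketch of the input-side casting (a hypothetical helper using a built-in Pandas extension dtype as a stand-in, not the actual Datasets implementation), the idea is that any extension-typed column is converted to a plain NumPy-backed column before the UDF ever sees it:

```python
import pandas as pd

def cast_extension_columns_to_numpy(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: convert extension-typed columns to plain
    # NumPy-backed columns so UDF inputs are consistent and deterministic.
    out = df.copy()
    for name in out.columns:
        if pd.api.types.is_extension_array_dtype(out[name]):
            out[name] = out[name].to_numpy()
    return out

# "Int64" is a Pandas extension dtype, standing in for a tensor extension.
df = pd.DataFrame({"id": pd.array([1, 2, 3], dtype="Int64"), "x": [0.1, 0.2, 0.3]})
cast = cast_extension_columns_to_numpy(df)
print(pd.api.types.is_extension_array_dtype(cast["id"]))  # False
```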

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@amogkam amogkam left a comment


Thanks @clarkzinzow!

@jiaodong
Member

jiaodong commented Jul 26, 2022

Thanks for the quick progress on this! Can you also make sure to kick off the nightly release test on this PR with the new API, for batch prediction? https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/gpu_batch_prediction.py

Training should be OK, given that the example you changed covers incremental learning.

FYI: depending on Balaji's PR #26975, if it merges first the notebooks could change.

@amogkam
Contributor

amogkam commented Jul 26, 2022

On a meta point: I'm struggling a bit to reconcile the performance and UX aspects of this.

From the Datasets perspective, preprocessors and predictors are UDFs, which we say should not need to use TensorArray (since Datasets handles it automatically). Yet we should still be using TensorArray in Predictors and Preprocessors for performance reasons.

As a user or a developer, it seems like I still need to be aware of TensorArray for performance reasons, so it's not completely hidden away. Let's keep iterating on this in the future to see if there is a solution that can improve the UX without sacrificing performance.

@jiaodong
Member

Let's keep iterating on this in the future, to see if there is a solution that can improve the UX without sacrificing performance.

+1. I also think we should go deeper on how to use TensorArray in the right way, for both end users and performance. So far we have consensus on the UX (hide it from the user), but we're still missing the performance implications (for AIR developers) and future plans (supporting ragged tensors with the right UX and performance).

@clarkzinzow
Contributor Author

Let's keep iterating on this in the future, to see if there is a solution that can improve the UX without sacrificing performance.

@amogkam The ideal solution is pretty simple: Pandas needs to finish making extension types first-class types (they have a few bugs/missing features here), Torch and TensorFlow need to respect the __array__ protocol which has become standard, and we need to support heterogeneous-shaped tensors in our extension type. At that point, we can always use the tensor extension type, and the user will never have to be aware of the extension type since conversion to/from NumPy/Torch/TensorFlow will happen automatically via the __array__ protocol and the Pandas/Arrow extension type registries.
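As a minimal illustration of the __array__ protocol referred to above (a toy class, not Ray's TensorArray): any consumer that calls np.asarray() on a protocol-respecting container gets the underlying ndarray automatically, without a copy.

```python
import numpy as np

class ToyTensorColumn:
    # Toy container illustrating the __array__ protocol: consumers that
    # call np.asarray() get the underlying ndarray without a copy.
    def __init__(self, data: np.ndarray):
        self._data = data

    def __array__(self, dtype=None, copy=None):
        return self._data if dtype is None else self._data.astype(dtype)

col = ToyTensorColumn(np.arange(6).reshape(3, 2))
arr = np.asarray(col)
print(arr.shape)  # (3, 2)
```

If Torch and TensorFlow consumed inputs this way, a TensorArray column could be passed to them directly and the conversion would be invisible to the user.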

@clarkzinzow
Contributor Author

The current UX push is to make users extension type agnostic while providing the ability to opt-in if they're worried about performance or memory usage, while ensuring that we, as internal library developers, use the extension type in the appropriate places if it results in free performance/efficiency wins. I think this is the current best compromise between end-user UX and perf/efficiency by default.

@amogkam
Contributor

amogkam commented Jul 26, 2022

Yes both your points make sense! Agreed this is a good compromise for now.

@clarkzinzow clarkzinzow force-pushed the air/fix/hide-tensor-extension branch 3 times, most recently from 5abe09a to b730545 Compare July 27, 2022 01:41
@clarkzinzow clarkzinzow force-pushed the air/fix/hide-tensor-extension branch from b730545 to fef615e Compare July 27, 2022 01:49
@clarkzinzow
Contributor Author

@amogkam @jiaodong @matthewdeng @ericl @jianoaix ready for another pass!

        ])
        df.loc[:, "image"] = [
            torchvision_transforms(image).numpy() for image in df["image"]
        ]
Contributor Author


@jiaodong Note that this is tensor extension type agnostic, but we still need to convert the Torch tensors back to NumPy ndarrays before storing them in the batch.

Member


Ack. Having UDFs be purely tensor-type agnostic and working with NumPy + DataFrames is already a big dev UX improvement :) I think modern DL frameworks all have good zero-copy NumPy interop, so it should be much less concerning.

Member


I'm curious what would happen to existing user code / examples / tests if they still wrap outputs / inputs with TensorArray, or if they're working with ragged tensors but didn't explicitly disable auto-casting?

    """
    if ndarray.dtype.type is np.object_:
        try:
            # Try to convert the NumPy ndarray to a non-object dtype.
Contributor Author


This allows us to support the non-Datasets (and non-tensor extension type) Serve use case.
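The conversion attempt under discussion can be sketched roughly as follows (a simplified, hypothetical helper — not the actual Ray code): densify an object-dtype ndarray of subarrays when the shapes agree, and let ragged inputs raise so the caller can decide what to do.

```python
import numpy as np

def try_to_non_object_dtype(ndarray: np.ndarray) -> np.ndarray:
    # Sketch: if a column arrived as an object-dtype ndarray of subarrays,
    # try to densify it into a contiguous non-object ndarray; ragged inputs
    # raise a ValueError instead of silently falling back.
    if ndarray.dtype.type is np.object_:
        return np.array([np.asarray(x) for x in ndarray])
    return ndarray

uniform = np.empty(2, dtype=object)
uniform[:] = [np.array([1, 2]), np.array([3, 4])]
print(try_to_non_object_dtype(uniform).shape)  # (2, 2)
```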

                "TensorArray but the conversion failed, leaving column "
                f"as-is: {e}"
            )
    df.reset_index(drop=True, inplace=True)
Contributor Author


Slicing opaque object columns fails silently in user-code if we don't reset the index here (Pandas foot-gun that our tensor extension type avoids).
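The foot-gun can be reproduced outside Ray with any object column (a hypothetical minimal repro): after slicing, the index is no longer the default, and assigning a default-indexed Series aligns on labels rather than positions, silently introducing NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.array([1]), np.array([2]), np.array([3])]})
part = df.iloc[1:].copy()  # index is now [1, 2], not [0, 1]

# Series assignment aligns on labels, not positions: the new Series has
# labels [0, 1], so row label 2 silently becomes NaN.
part["a"] = pd.Series([np.array([9]), np.array([8])])
print(part["a"].isna().tolist())  # [False, True]

# Resetting the index first restores positional semantics:
fixed = df.iloc[1:].reset_index(drop=True)
fixed["a"] = pd.Series([np.array([9]), np.array([8])])
print(fixed["a"].isna().any())  # False
```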


Contributor

@jianoaix jianoaix left a comment


LG for ray/data

@ericl
Contributor

ericl commented Jul 27, 2022

Some test failures:

data = array([array([1, 2]), array([3, 4]), array([5, 6])], dtype=object)

    def _transform_np_array(data: np.ndarray):
        if not isinstance(data, np.ndarray) and hasattr(data, '__array__'):
            data = np.array(data, copy=False)
        if len(data.shape) != 2:
            raise ValueError('Expecting 2 dimensional numpy.ndarray, got: ',
>                            data.shape)
E   ValueError: ('Expecting 2 dimensional numpy.ndarray, got: ', (3,))
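The failure mode is easy to reproduce outside the test (minimal repro, not the Ray test itself): an object-dtype array of subarrays is one-dimensional even when the subarrays all share a shape, so the 2-D check trips.

```python
import numpy as np

col = np.empty(3, dtype=object)
col[:] = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
print(col.shape)  # (3,) — 1-D, so the "2 dimensional" check fails

# np.stack recovers the dense 2-D view when the subarrays share a shape:
dense = np.stack(col)
print(dense.shape)  # (3, 2)
```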

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 27, 2022
@clarkzinzow
Contributor Author

@ericl Yep, I'm looking into them.

@clarkzinzow
Contributor Author

@matthewdeng @ericl All tests and examples are passing, merging to bypass the slow MacOS queue.

@clarkzinzow clarkzinzow merged commit df124d0 into ray-project:master Jul 28, 2022
@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 28, 2022
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Jul 28, 2022

Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022