feat: add support for integer array indexing #900

Merged
merged 31 commits into from
Feb 22, 2025

Conversation

kgryte
Contributor

@kgryte kgryte commented Feb 17, 2025

This PR:

  • closes RFC: indexing with multi-dimensional integer arrays #669 by adding support for integer array indexing to the specification.
  • constrains support to indexing tuples which combine integers and integer arrays.
  • punts support for tuples containing integer arrays along with ellipses and slices to a future revision of the standard.
  • specifies the behavior of "vectorized" indexing, as implemented in NumPy et al.
  • does not limit integer arrays to one dimension, even though Dask does not support multi-dimensional integer arrays.

Notes

  • Based on test results in test indexing with arrays array-api-tests#341, NumPy, CuPy, PyTorch, JAX, and Dask all implement at least a limited form of vectorized indexing. Dask does not currently support providing a multi-dimensional integer array in order to return an array having more than one dimension.
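
For readers less familiar with "vectorized" indexing, the following is a minimal sketch of the behavior described above, using NumPy as a stand-in for a conforming library. The array `x` and the index names `rows` and `cols` are illustrative only and do not come from the specification text.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# Two integer arrays in one indexing tuple are broadcast against each other;
# the result takes the broadcast shape of the index arrays.
rows = np.array([[0], [2]])   # shape (2, 1)
cols = np.array([1, 3])       # shape (2,)
picked = x[rows, cols]        # shape (2, 2)

# An integer combined with an integer array: the integer selects one row,
# and the array selects elements within that row.
row_one = x[1, cols]          # shape (2,)

print(picked)
print(row_one)
```

Slices and ellipses mixed with integer arrays are deliberately left out of the sketch, since this PR defers them to a future revision.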

References

@kgryte kgryte added API extension Adds new functions or objects to the API. topic: Indexing Array indexing. labels Feb 17, 2025
@kgryte kgryte added this to the v2024 milestone Feb 17, 2025
@kgryte kgryte requested a review from ev-br February 17, 2025 13:07
@kgryte
Contributor Author

kgryte commented Feb 17, 2025

cc @shoyer

Co-authored-by: Evgeni Burovski <evgeny.burovskiy@gmail.com>
@jni

jni commented Feb 17, 2025

Is there a reason for the spec to be limited to 1D arrays/tuples of 1D arrays? For example, I do this all the time:

import numpy as np
arr = np.array([5, 6, 7, 8])
idx = np.array([[0, 1], [1, 2], [2, 3]])
print(arr[idx])

which prints:

[[5 6]
 [6 7]
 [7 8]]

Do any of the above libraries fail this test?

@ev-br
Member

ev-br commented Feb 17, 2025

One other thing I don't immediately see in this addition is whether this relates to __getitem__ only or also to __setitem__.

@kgryte
Contributor Author

kgryte commented Feb 17, 2025

Do any of the above libraries fail this test?

@jni Dask seems to fail with a multi-dimensional integer array, but will work with one-dimensional integer arrays. @ev-br Can confirm.

@shoyer
Contributor

shoyer commented Feb 17, 2025

  • And further limits integer arrays to one-dimension, as Dask does not support integer arrays having more than one dimension.

I would really like to see this restriction relaxed, as it significantly limits the utility of this feature. Dask could absolutely support this via existing functionality (e.g., broadcast_arrays and ravel before indexing, followed by reshape).

Not every indexing operation can be supported efficiently in the fully distributed case, but it is far better to support indexing with multi-dimensional arrays in the standard with compatibility code in dask, than to force users to do indexing with flattened 1D arrays, which can have severe negative performance implications and makes it even harder to write efficient distributed code.
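
The broadcast, ravel, index, reshape approach mentioned above can be sketched roughly as follows. This is an illustration in NumPy, not Dask code: `take_nd` is a hypothetical helper name, and the sketch assumes the underlying library supports indexing with a tuple of one-dimensional integer arrays.

```python
import numpy as np

def take_nd(x, *indices):
    """Hypothetical helper: emulate multi-dimensional integer array indexing
    using only indexing with one-dimensional integer arrays."""
    # Bring all index arrays to a common shape.
    broadcasted = np.broadcast_arrays(*indices)
    out_shape = broadcasted[0].shape
    # Flatten each index array so that only 1-D index arrays are used.
    flat = tuple(ix.ravel() for ix in broadcasted)
    result = x[flat]
    # Restore the broadcast shape, keeping any trailing unindexed dimensions.
    return result.reshape(out_shape + result.shape[1:])

arr = np.array([5, 6, 7, 8])
idx = np.array([[0, 1], [1, 2], [2, 3]])
print(take_nd(arr, idx))
```

Whether this is efficient in a distributed setting is a separate question; the sketch only shows that the result can be expressed in terms of operations that already exist.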

@rgommers
Member

Dask could absolutely support this via existing functionality (e.g., broadcast_arrays and ravel before indexing, followed by reshape).

Is the only reason Dask doesn't have it that no one has done the work? You still have a Dask commit bit, right? Any sense of whether a PR there will meet resistance?

Contributor

@seberg seberg left a comment

Feel free to ignore the "I feel it is a bit verbose" comments, obviously. I mostly saw the same things Stephan did. I think a few points should be highlighted more, because it is easy to miss the interesting bits if you know indexing well enough.

Notes/Questions:

  • Do we need to spell out whether tuple subclasses should be supported or are undefined?
  • Should we note that whether the result is a copy depends on library-specific rules (if a library supports view semantics)?
  • I think it makes sense to prescribe an IndexError on non-integer arrays. (Again, probably de facto prescribed by earlier text.)
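
As a point of reference for the last two questions, here is a short sketch of what NumPy currently does. It shows one library's behavior only and is not a statement of what the standard requires; the variable names are illustrative.

```python
import numpy as np

x = np.arange(4)

# Indexing with a floating-point array raises IndexError in NumPy.
float_idx = np.array([0.0, 1.0])
try:
    x[float_idx]
    raised = False
except IndexError:
    raised = True

# Integer array indexing returns a copy in NumPy, so writing to the
# result does not modify the original array.
y = x[np.array([0, 1])]
y[0] = 99

print(raised, np.shares_memory(x, y), x[0])
```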

@shoyer
Contributor

shoyer commented Feb 18, 2025

Dask could absolutely support this via existing functionality (e.g., broadcast_arrays and ravel before indexing, followed by reshape).

Is the only reason Dask doesn't have it that no one has done the work? You still have a Dask commit bit, right? Any sense of whether a PR there will meet resistance?

Efficient fully general indexing doesn't fit easily into Dask's data model, because Dask needs to understand the structure of the distributed computation without evaluating data. So if you index a dask array by another chunked dask array, there is no way to avoid the need to do all-to-all transfer between chunks, basically duplicating the data on every distributed worker.

That said, the same thing is true for dask.array.take(), which Dask does implement.

I do probably still have a commit bit for Dask, but it's been many years since I contributed anything to the project.

@kgryte
Contributor Author

kgryte commented Feb 20, 2025

Thanks, @ev-br, @shoyer, and @seberg for the reviews. I've updated the proposed text based on your feedback with generalization to multi-dimensional integer arrays.

@kgryte
Contributor Author

kgryte commented Feb 21, 2025

We discussed this PR during the workgroup meeting on Feb 20, 2025, and arrived at the following consensus:

  1. Dask should be able to support multi-dimensional array indexing. While random access of multiple elements may not be particularly efficient in a distributed context, there is no technical barrier preventing its implementation altogether.
  2. While ndonnx does not currently support bracket integer array indexing, it does support integer array indexing via its gatherND API: https://onnx.ai/onnx/operators/onnx__GatherND.html. As such, it should be possible to implement.
  3. While Dask does not currently support indexing with multiple non-zero-dimensional integer arrays, it was determined that, without being able to provide multiple integer arrays, integer array indexing would be too limited to be useful.
  4. Similarly, being able to index with integer arrays having more than one dimension was considered desirable. In the compat layer, we should be able to add a helper function which can provide a workaround for Dask: namely, flatten and then unravel.
  5. While lists and other sequences can help with ergonomics, consensus was to only specify behavior for integer arrays, not array-like values.
  6. __setitem__ semantics are punted to v2025 and are to be left unspecified for the 2024 revision.
  7. Integer arrays should have the default array index data type in order for array indexing operations to be portable across array libraries.

Based on the above, I've updated the text accordingly. The main updates were items 6-7. Otherwise, the text was given a thumbs-up during the workgroup meeting.
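
To illustrate item 7, a minimal sketch in NumPy: index arrays that arrive with a narrower integer type are cast before indexing. The sketch assumes `np.intp` stands in for the default array index data type; other libraries may use a different type, and the variable names are illustrative.

```python
import numpy as np

x = np.array([5, 6, 7, 8])

# Index data stored as int32, e.g. loaded from a file.
idx32 = np.array([[0, 1], [2, 3]], dtype=np.int32)

# Cast to the default index dtype before indexing (assumption: np.intp
# plays that role for NumPy).
idx = idx32.astype(np.intp)
result = x[idx]

print(result)
```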

Member

@ev-br ev-br left a comment

LGTM modulo optional nits.

Co-authored-by: Evgeni Burovski <evgeny.burovskiy@gmail.com>
@kgryte
Contributor Author

kgryte commented Feb 22, 2025

As we discussed this PR during the workgroup meeting and no further comments have been made, I'll go ahead and merge. Any further adjustments can be made as errata to the 2024 specification revision. Thanks all!

@kgryte kgryte merged commit 0498721 into data-apis:main Feb 22, 2025
3 checks passed
@kgryte kgryte deleted the feat/integer-array-indexing branch February 22, 2025 02:15
Labels
API extension Adds new functions or objects to the API. topic: Indexing Array indexing.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC: indexing with multi-dimensional integer arrays
7 participants