docs: improve ragged indexing docs #2247
@jpivarski I noticed recently that boolean indexing is not strictly required to have the same shape as the underlying array, which is a reflection of the fact that we normalise boolean arrays to integers without knowledge of the array being indexed. As I see it, this is a policy decision. If we want to permit this, then we don't need to fix anything. If not, then we probably need to avoid this normalisation and explicitly handle the boolean arrays in each content's |
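To make the point concrete, here is a minimal pure-Python sketch of why a length mismatch can go unnoticed. It assumes the normalisation amounts to turning the positions of True values into integer indices. The helper names `mask_to_indices` and `take` are hypothetical and are not Awkward's actual code.

```python
def mask_to_indices(mask):
    """Convert a flat boolean mask into the integer positions of its True values.

    Hypothetical helper: it never looks at the array being indexed,
    so it cannot check that the lengths agree.
    """
    return [i for i, flag in enumerate(mask) if flag]


def take(data, indices):
    """Select items of `data` by integer position (hypothetical helper)."""
    return [data[i] for i in indices]


data = [10, 20, 30, 40, 50]
short_mask = [True, False, True]  # deliberately shorter than `data`

# After normalisation, only integer positions remain; the mask length is lost.
indices = mask_to_indices(short_mask)
selected = take(data, indices)
print(indices, selected)
```

In this model the short mask is accepted silently, because the integer form carries no record of the mask's original length. Rejecting it would mean comparing lengths before the conversion.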
This is new text, and it looks good: it explains not just the slicing rules but also how to use them with argmax and such.
We can get another tutorial "for free" by moving this docstring:
awkward/src/awkward/highlevel.py
Lines 530 to 948 in 082e485
All methods of selecting items described in | |
[NumPy indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html) | |
are supported with one exception | |
([combining advanced and basic indexing](https://numpy.org/doc/stable/user/basics.indexing.html#combining-advanced-and-basic-indexing) | |
with basic indexes *between* two advanced indexes: the definition | |
NumPy chose for the result does not have a generalization beyond | |
rectilinear arrays). | |
The `where` parameter can be any of the following or a tuple of | |
the following. | |
* **An integer** selects one element. Like Python/NumPy, it is | |
zero-indexed: `0` is the first item, `1` is the second, etc. | |
Negative indexes count from the end of the list: `-1` is the | |
last, `-2` is the second-to-last, etc. | |
Indexes beyond the size of the array, either because they're too | |
large or because they're too negative, raise errors. In | |
particular, some nested lists might contain a desired element | |
while others don't; this would raise an error. | |
* **A slice** (either a Python `slice` object or the | |
`start:stop:step` syntax) selects a range of elements. The | |
`start` and `stop` values are zero-indexed; `start` is inclusive | |
and `stop` is exclusive, like Python/NumPy. Negative `step` | |
values are allowed, but a `step` of `0` is an error. Slices | |
beyond the size of the array are not errors but are truncated, | |
like Python/NumPy. | |
* **A string** selects a tuple or record field, even if its | |
position in the tuple is to the left of the dimension where the | |
tuple/record is defined. (See <<<projection>>> below.) This is | |
similar to NumPy's | |
[field access](https://numpy.org/doc/stable/user/basics.indexing.html#field-access), | |
except that strings are allowed in the same tuple with other | |
slice types. While record fields have names, tuple fields are | |
integer strings, such as `"0"`, `"1"`, `"2"` (always | |
non-negative). Be careful to distinguish these from non-string | |
integers. | |
* **An iterable of strings** (not the top-level tuple) selects | |
multiple tuple/record fields. | |
* **An ellipsis** (either the Python `Ellipsis` object or the | |
`...` syntax) skips as many dimensions as needed to put the | |
rest of the slice items to the innermost dimensions. | |
* **A np.newaxis** or its equivalent, None, does not select items | |
but introduces a new regular dimension in the output with size | |
`1`. This is a convenient way to explicitly choose a dimension | |
for broadcasting. | |
* **A boolean array** with the same length as the current dimension | |
(or any iterable, other than the top-level tuple) selects elements | |
corresponding to each True value in the array, dropping those | |
that correspond to each False. The behavior is similar to | |
NumPy's | |
[compress](https://docs.scipy.org/doc/numpy/reference/generated/numpy.compress.html) | |
function. | |
* **An integer array** (or any iterable, other than the top-level | |
tuple) selects elements like a single integer, but produces a | |
regular dimension of as many as are desired. The array can have | |
any length, any order, and it can have duplicates and incomplete | |
coverage. The behavior is similar to NumPy's | |
[take](https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html) | |
function. | |
* **An integer Array with missing (None) items** selects multiple | |
values by index, as above, but None values are passed through | |
to the output. This behavior matches pyarrow's | |
[Array.take](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.take) | |
which also manages arrays with missing values. See | |
<<<option indexing>>> below. | |
* **An Array of nested lists**, ultimately containing booleans or | |
integers and having the same lengths of lists at each level as | |
the Array to which they're applied, selects by boolean or by | |
integer at the deeply nested level. Missing items at any level | |
above the deepest level must broadcast. See <<<nested indexing>>> below. | |
A tuple of the above applies each slice item to a dimension of the | |
data, which can be very expressive. More than one flat boolean/integer | |
array are "iterated as one" as described in the | |
[NumPy documentation](https://numpy.org/doc/stable/user/basics.indexing.html#integer-array-indexing). | |
Filtering | |
********* | |
A common use of selection by boolean arrays is to filter a dataset by | |
some property. For instance, to get the odd values of | |
>>> array = ak.Array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) | |
one can put an array expression with True for each odd value inside | |
square brackets: | |
>>> array[array % 2 == 1] | |
<Array [1, 3, 5, 7, 9] type='5 * int64'> | |
This technique is so common in NumPy and Pandas data analysis that it | |
is often read as a syntax, rather than a consequence of array slicing. | |
The extension to nested arrays like | |
>>> array = ak.Array([[[0, 1, 2], [], [3, 4], [5]], [[6, 7, 8], [9]]]) | |
allows us to use the same syntax more generally. | |
>>> array[array % 2 == 1] | |
<Array [[[1], [], [3], [5]], [[7], [9]]] type='2 * var * var * int64'> | |
In this example, the boolean array is itself nested (see | |
<<<nested indexing>>> below). | |
>>> array % 2 == 1 | |
<Array [[[False, True, False], ..., [True]], ...] type='2 * var * var * bool'> | |
This also applies to data with record structures. | |
For nested data, we often need to select the first or first two | |
elements from variable-length lists. That can be a problem if some | |
lists are empty. A function like #ak.num can be useful for first | |
selecting by the lengths of lists. | |
>>> array = ak.Array([[1.1, 2.2, 3.3], | |
... [], | |
... [4.4, 5.5], | |
... [6.6], | |
... [], | |
... [7.7, 8.8, 9.9]]) | |
... | |
>>> array[ak.num(array) > 0, 0] | |
<Array [1.1, 4.4, 6.6, 7.7] type='4 * float64'> | |
>>> array[ak.num(array) > 1, 1] | |
<Array [2.2, 5.5, 8.8] type='3 * float64'> | |
It's sometimes also a problem that "cleaning" the dataset by dropping | |
empty lists changes its alignment, so that it can no longer be used | |
in calculations with "uncleaned" data. For this, #ak.mask can be | |
useful because it inserts None in positions that fail the filter, | |
rather than removing them. | |
>>> ak.mask(array, ak.num(array) > 1) | |
<Array [[1.1, 2.2, 3.3], ..., [7.7, ..., 9.9]] type='6 * option[var * float64]'> | |
Note, however, that the `0` or `1` to pick the first or second | |
item of each nested list is in the second dimension, so the first | |
dimension of the slice must be a `:`. | |
>>> ak.mask(array, ak.num(array) > 1)[:, 0] | |
<Array [1.1, None, 4.4, None, None, 7.7] type='6 * ?float64'> | |
>>> ak.mask(array, ak.num(array) > 1)[:, 1] | |
<Array [2.2, None, 5.5, None, None, 8.8] type='6 * ?float64'> | |
Another syntax for | |
ak.mask(array, array_of_booleans) | |
is | |
array.mask[array_of_booleans] | |
(which is 5 characters away from simply filtering the `array`). | |
Projection | |
********** | |
The following | |
>>> array = ak.Array([[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [2, 2]}], | |
... [{"x": 3.3, "y": [3, 3, 3]}], | |
... [{"x": 0, "y": []}, {"x": 1.1, "y": [1, 1, 1]}]]) | |
has records inside of nested lists: | |
>>> array.type.show() | |
3 * var * { | |
x: float64, | |
y: var * int64 | |
} | |
In principle, one should select nested lists before record fields, | |
>>> array[2, :, "x"] | |
<Array [0, 1.1] type='2 * float64'> | |
>>> array[::2, :, "x"] | |
<Array [[1.1, 2.2], [0, 1.1]] type='2 * var * float64'> | |
but it's also possible to select record fields first. | |
>>> array["x"] | |
<Array [[1.1, 2.2], [3.3], [0, 1.1]] type='3 * var * float64'> | |
The string can "commute" to the left through integers and slices to | |
get the same result as it would in its "natural" position. | |
>>> array[2, :, "x"] | |
<Array [0, 1.1] type='2 * float64'> | |
>>> array[2, "x", :] | |
<Array [0, 1.1] type='2 * float64'> | |
>>> array["x", 2, :] | |
<Array [0, 1.1] type='2 * float64'> | |
This is analogous to selecting rows (integer indexes) before columns | |
(string names) or columns before rows, except that the rows are | |
more complex (like a Pandas | |
[MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)). | |
This would be an expensive operation in a typical object-oriented | |
environment, in which the records with fields `"x"` and `"y"` are | |
akin to C structs, but for columnar Awkward Arrays, projecting | |
through all records to produce an array of nested lists of `"x"` | |
values just changes the metadata (no loop over data, and therefore | |
fast). | |
Thus, data analysts should think of records as fluid objects that | |
can be easily projected apart and zipped back together with | |
#ak.zip. | |
Note, however, that while a column string can "commute" with row | |
indexes to the left of its position in the tree, it can't commute | |
to the right. For example, it's possible to use slices inside | |
`"y"` because `"y"` is a list: | |
>>> array[0, :, "y"] | |
<Array [[1], [2, 2]] type='2 * var * int64'> | |
>>> array[0, :, "y", 0] | |
<Array [1, 2] type='2 * int64'> | |
but it's not possible to move `"y"` to the right | |
>>> array[0, :, 0, "y"] | |
IndexError: while attempting to slice | |
<Array [[{x: 1.1, y: [1]}, {...}], ...] type='3 * var * {x: float64, y:...'> | |
with | |
(0, :, 0, 'y') | |
at inner NumpyArray of length 2, using sub-slice (0). | |
because the `array[0, :, 0, ...]` slice applies to both `"x"` and | |
`"y"` before `"y"` is selected, and `"x"` is a one-dimensional | |
NumpyArray that can't take more than its share of slices. | |
Finally, note that the dot (`__getattr__`) syntax is equivalent to a single | |
string in a slice (`__getitem__`) if the field name is a valid Python | |
identifier and doesn't conflict with #ak.Array methods or properties. | |
>>> array.x | |
<Array [[1.1, 2.2], [3.3], [0, 1.1]] type='3 * var * float64'> | |
>>> array.y | |
<Array [[[1], [2, 2]], ..., [[], [1, ...]]] type='3 * var * var * int64'> | |
Nested Projection | |
***************** | |
If records are nested within records, you can use a series of strings in | |
the selector to drill down. For instance, with the following | |
>>> array = ak.Array([ | |
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}}, | |
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}}, | |
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}}]) | |
we can go directly to the numerical data by specifying a string for the | |
outer field and a string for the inner field. | |
>>> array["a", "x"] | |
<Array [1, 1, 1] type='3 * int64'> | |
>>> array["a", "y"] | |
<Array [2, 2, 2] type='3 * int64'> | |
>>> array["b", "y"] | |
<Array [20, 20, 20] type='3 * int64'> | |
>>> array["c", "y"] | |
<Array [2.2, 2.2, 2.2] type='3 * float64'> | |
As with single projections, the dot (`__getattr__`) syntax is equivalent | |
to a single string in a slice (`__getitem__`) if the field name is a valid | |
Python identifier and doesn't conflict with #ak.Array methods or properties. | |
>>> array.a.x | |
<Array [1, 1, 1] type='3 * int64'> | |
You can even get every field of the same name within an outer record using | |
a list of field names for the outer record. The following selects the `"x"` | |
field of `"a"`, `"b"`, and `"c"` records: | |
>>> array[["a", "b", "c"], "x"].show() | |
[{a: 1, b: 10, c: 1.1}, | |
{a: 1, b: 10, c: 1.1}, | |
{a: 1, b: 10, c: 1.1}] | |
You don't need to get all fields: | |
>>> array[["a", "b"], "x"].show() | |
[{a: 1, b: 10}, | |
{a: 1, b: 10}, | |
{a: 1, b: 10}] | |
And you can select lists of field names at all levels: | |
>>> array[["a", "b"], ["x", "y"]].show() | |
[{a: {x: 1, y: 2}, b: {x: 10, y: 20}}, | |
{a: {x: 1, y: 2}, b: {x: 10, y: 20}}, | |
{a: {x: 1, y: 2}, b: {x: 10, y: 20}}] | |
Option indexing | |
*************** | |
NumPy arrays can be sliced by all of the above slice types except | |
arrays with missing values and arrays with nested lists, both of | |
which are inexpressible in NumPy. Missing values, represented by | |
None in Python, are called option types (#ak.types.OptionType) in | |
Awkward Array and can be used as a slice. | |
For example, | |
>>> array = ak.Array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]) | |
can be sliced with a boolean array | |
>>> array[[False, False, False, False, True, False, True, False, True]] | |
<Array [5.5, 7.7, 9.9] type='3 * float64'> | |
or a boolean array containing None values: | |
>>> array[[False, False, False, False, True, None, True, None, True]] | |
<Array [5.5, None, 7.7, None, 9.9] type='5 * ?float64'> | |
Similarly for arrays of integers and None: | |
>>> array[[0, 1, None, None, 7, 8]] | |
<Array [1.1, 2.2, None, None, 8.8, 9.9] type='6 * ?float64'> | |
This is the same behavior as pyarrow's | |
[Array.take](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.take), | |
which establishes a convention for how to interpret slice arrays | |
with option type: | |
>>> import pyarrow as pa | |
>>> array = pa.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]) | |
>>> array.take(pa.array([0, 1, None, None, 7, 8])) | |
<pyarrow.lib.DoubleArray object at 0x7efc7f060210> | |
[ | |
1.1, | |
2.2, | |
null, | |
null, | |
8.8, | |
9.9 | |
] | |
Nested indexing | |
*************** | |
Awkward Array's nested lists can be used as slices as well, as long | |
as the type at the deepest level of nesting is boolean or integer. | |
For example, | |
>>> array = ak.Array([[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [], [[5.5]]]) | |
can be sliced at the top level with one-dimensional arrays: | |
>>> array[[False, True, True]] | |
<Array [[], [[5.5]]] type='2 * var * var * float64'> | |
>>> array[[1, 2]] | |
<Array [[], [[5.5]]] type='2 * var * var * float64'> | |
with singly nested lists: | |
>>> array[[[False, True, True], [], [True]]] | |
<Array [[[], [3.3, 4.4]], [], [[5.5]]] type='3 * var * var * float64'> | |
>>> array[[[1, 2], [], [0]]] | |
<Array [[[], [3.3, 4.4]], [], [[5.5]]] type='3 * var * var * float64'> | |
and with doubly nested lists: | |
>>> array[[[[False, True, False], [], [True, False]], [], [[False]]]] | |
<Array [[[1.1], [], [3.3]], [], [[]]] type='3 * var * var * float64'> | |
>>> array[[[[1], [], [0]], [], [[]]]] | |
<Array [[[1.1], [], [3.3]], [], [[]]] type='3 * var * var * float64'> | |
The key thing is that the nested slice has the same number of elements | |
as the array it's slicing at every level of nesting that it reproduces. | |
This is similar to the requirement that boolean arrays have the same | |
length as the array they're filtering. | |
This kind of slicing is useful because NumPy's | |
[universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html) | |
produce arrays with the same structure as the original array, which | |
can then be used as filters. | |
>>> ((array * 10) % 2 == 1).show() | |
[[[False, True, False], [], [True, False]], | |
[], | |
[[True]]] | |
>>> (array[(array * 10) % 2 == 1]).show() | |
[[[1.1], [], [3.3]], | |
[], | |
[[5.5]]] | |
Functions whose names start with "arg" return index positions, which | |
can be used with the integer form. | |
>>> np.argmax(array, axis=-1).show() | |
[[2, None, 1], | |
[], | |
[0]] | |
>>> array[np.argmax(array, axis=-1)].show() | |
[[[3.3, 4.4], None, []], | |
[], | |
[[5.5]]] | |
Here, the `np.argmax` returns the integer position of the maximum | |
element or None for empty arrays. It's a nice example of | |
<<<option indexing>>> with <<<nested indexing>>>. | |
When applying a nested index with missing (None) entries at levels | |
higher than the last level, the indexer must have the same dimension | |
as the array being indexed, and the resulting output will have missing | |
entries at the corresponding locations, e.g. for | |
>>> array[ [[[0, None, 2, None, None], None, [1]], None, [[0]]] ].show() | |
[[[0, None, 2.2, None, None], None, [4.4]], | |
None, | |
[[5.5]]] | |
the sub-list at entry 0,0 is extended as the masked entries are | |
acting at the last level, while the higher levels of the indexer all | |
have the same dimension as the array being indexed. |
into the tutorial area where it will be more visible.
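If it helps the tutorial, the nested boolean filtering rule from the docstring could be accompanied by a small pure-Python model. The sketch below is only an illustration under the assumption that the mask mirrors the data's list structure; `filter_nested` and `is_odd` are hypothetical names, and this is not how Awkward Array implements slicing.

```python
def filter_nested(data, mask):
    """Apply a nested boolean mask to nested Python lists.

    The mask must have the same length as the data at every level
    of nesting that it reproduces.
    """
    if len(data) != len(mask):
        raise IndexError("mask and data must have the same length at every level")
    result = []
    for item, flag in zip(data, mask):
        if isinstance(flag, list):
            # Descend one level: the mask is itself nested here.
            result.append(filter_nested(item, flag))
        elif flag:
            # A True value keeps the item; a False value drops it.
            result.append(item)
    return result


def is_odd(nested):
    """Build a mask with the same structure as `nested`, True for odd values."""
    return [is_odd(x) if isinstance(x, list) else x % 2 == 1 for x in nested]


array = [[[0, 1, 2], [], [3, 4], [5]], [[6, 7, 8], [9]]]
mask = is_odd(array)
filtered = filter_nested(array, mask)
print(filtered)
```

On the nested example from the Filtering section, this model yields the same nested values as `array[array % 2 == 1]` in the docstring. A flat mask applied at the top level keeps or drops whole sub-lists.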
We've used three words now for mostly the same thing: "jagged" (I started with that because Wikipedia preferred it), "ragged" (this is what I should have used, because it's more widespread in the SciPy community), and "awkward" (new here). Are you using a different word than "ragged" because it also includes missing values? I wonder if "ragged, masked indexing" might be better, since it ties in with a word the reader might already know.
I try to use capitalization consistently and have decided to capitalize "Awkward Array" and even "Awkward" when it's used as an adjective: "Awkward indexing." If it's lowercase, it is less likely to be recognized as a brand name, and then it takes on the ordinary English meaning of "clumsy or difficult."
TL;DR
how-to-filter-ragged.md
.how-to-filter-masked.md
.