docs: improve ragged indexing docs #2247

Merged
merged 3 commits into main from agoose77/docs-addition-to-indexing on Feb 16, 2023
Conversation

@agoose77 (Collaborator) commented Feb 16, 2023

TL;DR

  • Slightly rework how-to-filter-ragged.md.
  • Add how-to-filter-masked.md.
  • Rename "ragged indexing" to "Awkward indexing".

@agoose77 agoose77 requested a review from jpivarski February 16, 2023 11:35
@agoose77 agoose77 marked this pull request as ready for review February 16, 2023 11:35
@agoose77 (Collaborator, Author)

@jpivarski I noticed recently that boolean indexing is not strictly required to have the same shape as the underlying array. This reflects the fact that we normalise boolean arrays to integers without knowledge of the array being indexed.

As I see it, this is a policy decision. If we want to permit this, then we don't need to fix anything. If not, then we probably need to avoid this normalisation and explicitly handle the boolean arrays in each content's _getitem_XXX.
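To illustrate what I mean by normalisation, here is a rough sketch using np.nonzero as a stand-in for the internal conversion (not the actual code path):

import numpy as np
import awkward as ak

array = ak.Array([10, 20, 30, 40, 50])

# A boolean mask shorter than the array: once it is converted to integer
# positions, the mask's original length is no longer visible, so no
# length check ever fires.
mask = np.array([True, False, True])
positions = np.nonzero(mask)[0]  # array([0, 2])
array[positions]                 # <Array [10, 30] type='2 * int64'>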

@jpivarski (Member) left a comment

This is new text, and it looks good: it explains not just the slicing rules but also how to use them with argmax and the like.

We can get another tutorial "for free" by moving this docstring:

All methods of selecting items described in
[NumPy indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html)
are supported with one exception
([combining advanced and basic indexing](https://numpy.org/doc/stable/user/basics.indexing.html#combining-advanced-and-basic-indexing)
with basic indexes *between* two advanced indexes: the definition
NumPy chose for the result does not have a generalization beyond
rectilinear arrays).
The `where` parameter can be any of the following or a tuple of
the following.
* **An integer** selects one element. Like Python/NumPy, it is
zero-indexed: `0` is the first item, `1` is the second, etc.
Negative indexes count from the end of the list: `-1` is the
last, `-2` is the second-to-last, etc.
Indexes beyond the size of the array, either because they're too
large or because they're too negative, raise errors. In
particular, some nested lists might contain a desired element
while others don't; this would raise an error.
* **A slice** (either a Python `slice` object or the
`start:stop:step` syntax) selects a range of elements. The
`start` and `stop` values are zero-indexed; `start` is inclusive
and `stop` is exclusive, like Python/NumPy. Negative `step`
values are allowed, but a `step` of `0` is an error. Slices
beyond the size of the array are not errors but are truncated,
like Python/NumPy.
* **A string** selects a tuple or record field, even if its
position in the tuple is to the left of the dimension where the
tuple/record is defined. (See <<<projection>>> below.) This is
similar to NumPy's
[field access](https://numpy.org/doc/stable/user/basics.indexing.html#field-access),
except that strings are allowed in the same tuple with other
slice types. While record fields have names, tuple fields are
integer strings, such as `"0"`, `"1"`, `"2"` (always
non-negative). Be careful to distinguish these from non-string
integers.
* **An iterable of strings** (not the top-level tuple) selects
multiple tuple/record fields.
* **An ellipsis** (either the Python `Ellipsis` object or the
`...` syntax) skips as many dimensions as needed to put the
rest of the slice items to the innermost dimensions.
* **A np.newaxis** or its equivalent, None, does not select items
but introduces a new regular dimension in the output with size
`1`. This is a convenient way to explicitly choose a dimension
for broadcasting.
* **A boolean array** with the same length as the current dimension
(or any iterable, other than the top-level tuple) selects elements
corresponding to each True value in the array, dropping those
that correspond to each False. The behavior is similar to
NumPy's
[compress](https://docs.scipy.org/doc/numpy/reference/generated/numpy.compress.html)
function.
* **An integer array** (or any iterable, other than the top-level
tuple) selects elements like a single integer, but produces a
regular dimension of as many as are desired. The array can have
any length, any order, and it can have duplicates and incomplete
coverage. The behavior is similar to NumPy's
[take](https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html)
function.
* **An integer Array with missing (None) items** selects multiple
values by index, as above, but None values are passed through
to the output. This behavior matches pyarrow's
[Array.take](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.take)
which also manages arrays with missing values. See
<<<option indexing>>> below.
* **An Array of nested lists**, ultimately containing booleans or
integers and having the same lengths of lists at each level as
the Array to which they're applied, selects by boolean or by
integer at the deeply nested level. Missing items at any level
above the deepest level must broadcast. See <<<nested indexing>>> below.
A tuple of the above applies each slice item to a dimension of the
data, which can be very expressive. Multiple flat boolean/integer
arrays are "iterated as one", as described in the
[NumPy documentation](https://numpy.org/doc/stable/user/basics.indexing.html#integer-array-indexing).
A few of these slice types are illustrated in the short sketch below.
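For orientation, here is a small sketch of several of the basic slice types
applied to a ragged array (the comments describe the selections rather than
showing exact output):
>>> import numpy as np
>>> import awkward as ak
>>> array = ak.Array([[0.0, 1.1, 2.2], [3.3, 4.4]])
>>> array[1]              # an integer: the second list, [3.3, 4.4]
>>> array[0, 1:3]         # a tuple of an integer and a slice: [1.1, 2.2]
>>> array[..., 0]         # an ellipsis: the first item of each innermost list
>>> array[:, np.newaxis]  # np.newaxis: adds a regular, length-1 dimension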
Filtering
*********
A common use of selection by boolean arrays is to filter a dataset by
some property. For instance, to get the odd values of
>>> array = ak.Array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
one can put an array expression with True for each odd value inside
square brackets:
>>> array[array % 2 == 1]
<Array [1, 3, 5, 7, 9] type='5 * int64'>
This technique is so common in NumPy and Pandas data analysis that it
is often read as a syntax, rather than a consequence of array slicing.
The extension to nested arrays like
>>> array = ak.Array([[[0, 1, 2], [], [3, 4], [5]], [[6, 7, 8], [9]]])
allows us to use the same syntax more generally.
>>> array[array % 2 == 1]
<Array [[[1], [], [3], [5]], [[7], [9]]] type='2 * var * var * int64'>
In this example, the boolean array is itself nested (see
<<<nested indexing>>> below).
>>> array % 2 == 1
<Array [[[False, True, False], ..., [True]], ...] type='2 * var * var * bool'>
This also applies to data with record structures.
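For example (a sketch with a small record array; the field names are only
illustrative):
>>> records = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
>>> records[records["x"] % 2 == 1].show()
[{x: 1, y: 1.1},
{x: 3, y: 3.3}]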
For nested data, we often need to select the first or first two
elements from variable-length lists. That can be a problem if some
lists are empty. A function like #ak.num can be useful for first
selecting by the lengths of lists.
>>> array = ak.Array([[1.1, 2.2, 3.3],
... [],
... [4.4, 5.5],
... [6.6],
... [],
... [7.7, 8.8, 9.9]])
...
>>> array[ak.num(array) > 0, 0]
<Array [1.1, 4.4, 6.6, 7.7] type='4 * float64'>
>>> array[ak.num(array) > 1, 1]
<Array [2.2, 5.5, 8.8] type='3 * float64'>
It's sometimes also a problem that "cleaning" the dataset by dropping
empty lists changes its alignment, so that it can no longer be used
in calculations with "uncleaned" data. For this, #ak.mask can be
useful because it inserts None in positions that fail the filter,
rather than removing them.
>>> ak.mask(array, ak.num(array) > 1)
<Array [[1.1, 2.2, 3.3], ..., [7.7, ..., 9.9]] type='6 * option[var * float64]'>
Note, however, that the `0` or `1` to pick the first or second
item of each nested list is in the second dimension, so the first
dimension of the slice must be a `:`.
>>> ak.mask(array, ak.num(array) > 1)[:, 0]
<Array [1.1, None, 4.4, None, None, 7.7] type='6 * ?float64'>
>>> ak.mask(array, ak.num(array) > 1)[:, 1]
<Array [2.2, None, 5.5, None, None, 8.8] type='6 * ?float64'>
Another syntax for
ak.mask(array, array_of_booleans)
is
array.mask[array_of_booleans]
(which is 5 characters away from simply filtering the `array`).
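Using the array from the #ak.num examples above, the two spellings should
give the same result (a sketch):
>>> array.mask[ak.num(array) > 1]
<Array [[1.1, 2.2, 3.3], ..., [7.7, ..., 9.9]] type='6 * option[var * float64]'>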
Projection
**********
The following
>>> array = ak.Array([[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [2, 2]}],
... [{"x": 3.3, "y": [3, 3, 3]}],
... [{"x": 0, "y": []}, {"x": 1.1, "y": [1, 1, 1]}]])
has records inside of nested lists:
>>> array.type.show()
3 * var * {
    x: float64,
    y: var * int64
}
In principle, one should select nested lists before record fields,
>>> array[2, :, "x"]
<Array [0, 1.1] type='2 * float64'>
>>> array[::2, :, "x"]
<Array [[1.1, 2.2], [0, 1.1]] type='2 * var * float64'>
but it's also possible to select record fields first.
>>> array["x"]
<Array [[1.1, 2.2], [3.3], [0, 1.1]] type='3 * var * float64'>
The string can "commute" to the left through integers and slices to
get the same result as it would in its "natural" position.
>>> array[2, :, "x"]
<Array [0, 1.1] type='2 * float64'>
>>> array[2, "x", :]
<Array [0, 1.1] type='2 * float64'>
>>> array["x", 2, :]
<Array [0, 1.1] type='2 * float64'>
This is analogous to selecting rows (integer indexes) before columns
(string names) or columns before rows, except that the rows are
more complex (like a Pandas
[MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)).
This would be an expensive operation in a typical object-oriented
environment, in which the records with fields `"x"` and `"y"` are
akin to C structs, but for columnar Awkward Arrays, projecting
through all records to produce an array of nested lists of `"x"`
values just changes the metadata (no loop over data, and therefore
fast).
Thus, data analysts should think of records as fluid objects that
can be easily projected apart and zipped back together with
#ak.zip.
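For example, the `"x"` and `"y"` projections of the array above can be
reassembled with #ak.zip (a sketch; `depth_limit=2` stops the broadcasting
at the level of the original records):
>>> rezipped = ak.zip({"x": array["x"], "y": array["y"]}, depth_limit=2)
>>> rezipped.type.show()
3 * var * {
    x: float64,
    y: var * int64
}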
Note, however, that while a column string can "commute" with row
indexes to the left of its position in the tree, it can't commute
to the right. For example, it's possible to use slices inside
`"y"` because `"y"` is a list:
>>> array[0, :, "y"]
<Array [[1], [2, 2]] type='2 * var * int64'>
>>> array[0, :, "y", 0]
<Array [1, 2] type='2 * int64'>
but it's not possible to move `"y"` to the right
>>> array[0, :, 0, "y"]
IndexError: while attempting to slice
<Array [[{x: 1.1, y: [1]}, {...}], ...] type='3 * var * {x: float64, y:...'>
with
(0, :, 0, 'y')
at inner NumpyArray of length 2, using sub-slice (0).
because the `array[0, :, 0, ...]` slice applies to both `"x"` and
`"y"` before `"y"` is selected, and `"x"` is a one-dimensional
NumpyArray that can't take more than its share of slices.
Finally, note that the dot (`__getattr__`) syntax is equivalent to a single
string in a slice (`__getitem__`) if the field name is a valid Python
identifier and doesn't conflict with #ak.Array methods or properties.
>>> array.x
<Array [[1.1, 2.2], [3.3], [0, 1.1]] type='3 * var * float64'>
>>> array.y
<Array [[[1], [2, 2]], ..., [[], [1, ...]]] type='3 * var * var * int64'>
Nested Projection
*****************
If records are nested within records, you can use a series of strings in
the selector to drill down. For instance, with the following
>>> array = ak.Array([
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}},
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}},
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}}])
we can go directly to the numerical data by specifying a string for the
outer field and a string for the inner field.
>>> array["a", "x"]
<Array [1, 1, 1] type='3 * int64'>
>>> array["a", "y"]
<Array [2, 2, 2] type='3 * int64'>
>>> array["b", "y"]
<Array [20, 20, 20] type='3 * int64'>
>>> array["c", "y"]
<Array [2.2, 2.2, 2.2] type='3 * float64'>
As with single projections, the dot (`__getattr__`) syntax is equivalent
to a single string in a slice (`__getitem__`) if the field name is a valid
Python identifier and doesn't conflict with #ak.Array methods or properties.
>>> array.a.x
<Array [1, 1, 1] type='3 * int64'>
You can even get every field of the same name within an outer record using
a list of field names for the outer record. The following selects the `"x"`
field of `"a"`, `"b"`, and `"c"` records:
>>> array[["a", "b", "c"], "x"].show()
[{a: 1, b: 10, c: 1.1},
{a: 1, b: 10, c: 1.1},
{a: 1, b: 10, c: 1.1}]
You don't need to get all fields:
>>> array[["a", "b"], "x"].show()
[{a: 1, b: 10},
{a: 1, b: 10},
{a: 1, b: 10}]
And you can select lists of field names at all levels:
>>> array[["a", "b"], ["x", "y"]].show()
[{a: {x: 1, y: 2}, b: {x: 10, y: 20}},
{a: {x: 1, y: 2}, b: {x: 10, y: 20}},
{a: {x: 1, y: 2}, b: {x: 10, y: 20}}]
Option indexing
***************
NumPy arrays can be sliced by all of the above slice types except
arrays with missing values and arrays with nested lists, both of
which are inexpressible in NumPy. Missing values, represented by
None in Python, are called option types (#ak.types.OptionType) in
Awkward Array and can be used as a slice.
For example,
>>> array = ak.Array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
can be sliced with a boolean array
>>> array[[False, False, False, False, True, False, True, False, True]]
<Array [5.5, 7.7, 9.9] type='3 * float64'>
or a boolean array containing None values:
>>> array[[False, False, False, False, True, None, True, None, True]]
<Array [5.5, None, 7.7, None, 9.9] type='5 * ?float64'>
Similarly for arrays of integers and None:
>>> array[[0, 1, None, None, 7, 8]]
<Array [1.1, 2.2, None, None, 8.8, 9.9] type='6 * ?float64'>
This is the same behavior as pyarrow's
[Array.take](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.take),
which establishes a convention for how to interpret slice arrays
with option type:
>>> import pyarrow as pa
>>> array = pa.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
>>> array.take(pa.array([0, 1, None, None, 7, 8]))
<pyarrow.lib.DoubleArray object at 0x7efc7f060210>
[
1.1,
2.2,
null,
null,
8.8,
9.9
]
Nested indexing
***************
Awkward Array's nested lists can be used as slices as well, as long
as the type at the deepest level of nesting is boolean or integer.
For example,
>>> array = ak.Array([[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [], [[5.5]]])
can be sliced at the top level with one-dimensional arrays:
>>> array[[False, True, True]]
<Array [[], [[5.5]]] type='2 * var * var * float64'>
>>> array[[1, 2]]
<Array [[], [[5.5]]] type='2 * var * var * float64'>
with singly nested lists:
>>> array[[[False, True, True], [], [True]]]
<Array [[[], [3.3, 4.4]], [], [[5.5]]] type='3 * var * var * float64'>
>>> array[[[1, 2], [], [0]]]
<Array [[[], [3.3, 4.4]], [], [[5.5]]] type='3 * var * var * float64'>
and with doubly nested lists:
>>> array[[[[False, True, False], [], [True, False]], [], [[False]]]]
<Array [[[1.1], [], [3.3]], [], [[]]] type='3 * var * var * float64'>
>>> array[[[[1], [], [0]], [], [[]]]]
<Array [[[1.1], [], [3.3]], [], [[]]] type='3 * var * var * float64'>
The key thing is that the nested slice has the same number of elements
as the array it's slicing at every level of nesting that it reproduces.
This is similar to the requirement that boolean arrays have the same
length as the array they're filtering.
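One way to see this (a sketch using #ak.num) is that the list lengths of the
slicer match those of the array at each level of nesting it reproduces:
>>> ak.num(array, axis=1)
<Array [3, 0, 1] type='3 * int64'>
>>> ak.num(ak.Array([[[False, True, False], [], [True, False]], [], [[False]]]), axis=1)
<Array [3, 0, 1] type='3 * int64'>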
This kind of slicing is useful because NumPy's
[universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
produce arrays with the same structure as the original array, which
can then be used as filters.
>>> ((array * 10) % 2 == 1).show()
[[[False, True, False], [], [True, False]],
[],
[[True]]]
>>> (array[(array * 10) % 2 == 1]).show()
[[[1.1], [], [3.3]],
[],
[[5.5]]]
Functions whose names start with "arg" return index positions, which
can be used with the integer form.
>>> np.argmax(array, axis=-1).show()
[[2, None, 1],
[],
[0]]
>>> array[np.argmax(array, axis=-1)].show()
[[[3.3, 4.4], None, []],
[],
[[5.5]]]
Here, the `np.argmax` returns the integer position of the maximum
element or None for empty arrays. It's a nice example of
<<<option indexing>>> with <<<nested indexing>>>.
When applying a nested index with missing (None) entries at levels
higher than the last level, the indexer must have the same dimension
as the array being indexed, and the resulting output will have missing
entries at the corresponding locations, e.g. for
>>> array[ [[[0, None, 2, None, None], None, [1]], None, [[0]]] ].show()
[[[0, None, 2.2, None, None], None, [4.4]],
None,
[[5.5]]]
the sub-list at entry 0,0 is extended because the missing (None) entries
act at the last level, while the higher levels of the indexer all have the
same dimension as the array being indexed.

into the tutorial area where it will be more visible.

We've used three words now for mostly the same thing: "jagged" (I started with that because Wikipedia preferred it), "ragged" (this is what I should have used, because it's more widespread in the SciPy community), and "awkward" (new here). Are you using a different word than "ragged" because it also includes missing values? I wonder if "ragged, masked indexing" might be better, since it ties in with a word the reader might already know.

I try to use capitalization consistently and have decided to capitalize "Awkward Array" and even "Awkward" when it's used as an adjective: "Awkward indexing." If it's lowercase, it will less likely be recognized as a brand name, and then it takes on the ordinary English meaning of "clumsy or difficult."

@jpivarski jpivarski merged commit 32ca867 into main Feb 16, 2023
@jpivarski jpivarski deleted the agoose77/docs-addition-to-indexing branch February 16, 2023 15:13