docs: improve ragged indexing docs #2247

Merged
merged 3 commits into main from agoose77/docs-addition-to-indexing on Feb 16, 2023
Conversation

@agoose77 (Collaborator) commented Feb 16, 2023

TL;DR

  • Slightly rework how-to-filter-ragged.md.
  • Add how-to-filter-masked.md.
  • Rename "ragged indexing" to "Awkward indexing".

@agoose77 agoose77 requested a review from jpivarski February 16, 2023 11:35
@agoose77 agoose77 marked this pull request as ready for review February 16, 2023 11:35
@agoose77 (Collaborator, Author)

@jpivarski I noticed recently that boolean indexing is not strictly required to have the same shape as the underlying array. This reflects the fact that we normalise boolean arrays to integers without knowledge of the array being indexed.

As I see it, this is a policy decision. If we want to permit this, then we don't need to fix anything. If not, then we probably need to avoid this normalisation and explicitly handle the boolean arrays in each content's _getitem_XXX.
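To illustrate what I mean by normalisation, here is a rough sketch using np.nonzero as a stand-in for the internal conversion (not the actual code path):

import numpy as np
import awkward as ak

array = ak.Array([10, 20, 30, 40, 50])

# A boolean mask shorter than the array: once it is converted to integer
# positions, the mask's original length is no longer visible, so no
# length check ever fires.
mask = np.array([True, False, True])
positions = np.nonzero(mask)[0]  # array([0, 2])
array[positions]                 # <Array [10, 30] type='2 * int64'>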

@jpivarski (Member) left a comment

This is new text, and it looks good: it explains not just the slicing rules but also how to use them with argmax and the like.

We can get another tutorial "for free" by moving this docstring:

All methods of selecting items described in
[NumPy indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html)
are supported with one exception
([combining advanced and basic indexing](https://numpy.org/doc/stable/user/basics.indexing.html#combining-advanced-and-basic-indexing)
with basic indexes *between* two advanced indexes: the definition
NumPy chose for the result does not have a generalization beyond
rectilinear arrays).
The `where` parameter can be any of the following or a tuple of
the following.
* **An integer** selects one element. Like Python/NumPy, it is
zero-indexed: `0` is the first item, `1` is the second, etc.
Negative indexes count from the end of the list: `-1` is the
last, `-2` is the second-to-last, etc.
Indexes beyond the size of the array, either because they're too
large or because they're too negative, raise errors. In
particular, some nested lists might contain a desired element
while others don't; this would raise an error.
* **A slice** (either a Python `slice` object or the
`start:stop:step` syntax) selects a range of elements. The
`start` and `stop` values are zero-indexed; `start` is inclusive
and `stop` is exclusive, like Python/NumPy. Negative `step`
values are allowed, but a `step` of `0` is an error. Slices
beyond the size of the array are not errors but are truncated,
like Python/NumPy.
* **A string** selects a tuple or record field, even if its
position in the tuple is to the left of the dimension where the
tuple/record is defined. (See <<<projection>>> below.) This is
similar to NumPy's
[field access](https://numpy.org/doc/stable/user/basics.indexing.html#field-access),
except that strings are allowed in the same tuple with other
slice types. While record fields have names, tuple fields are
integer strings, such as `"0"`, `"1"`, `"2"` (always
non-negative). Be careful to distinguish these from non-string
integers.
* **An iterable of strings** (not the top-level tuple) selects
multiple tuple/record fields.
* **An ellipsis** (either the Python `Ellipsis` object or the
`...` syntax) skips as many dimensions as needed to put the
rest of the slice items to the innermost dimensions.
* **A np.newaxis** or its equivalent, None, does not select items
but introduces a new regular dimension in the output with size
`1`. This is a convenient way to explicitly choose a dimension
for broadcasting.
* **A boolean array** with the same length as the current dimension
(or any iterable, other than the top-level tuple) selects elements
corresponding to each True value in the array, dropping those
that correspond to each False. The behavior is similar to
NumPy's
[compress](https://docs.scipy.org/doc/numpy/reference/generated/numpy.compress.html)
function.
* **An integer array** (or any iterable, other than the top-level
tuple) selects elements like a single integer, but produces a
regular dimension of as many as are desired. The array can have
any length, any order, and it can have duplicates and incomplete
coverage. The behavior is similar to NumPy's
[take](https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html)
function.
* **An integer Array with missing (None) items** selects multiple
values by index, as above, but None values are passed through
to the output. This behavior matches pyarrow's
[Array.take](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.take)
which also manages arrays with missing values. See
<<<option indexing>>> below.
* **An Array of nested lists**, ultimately containing booleans or
integers and having the same lengths of lists at each level as
the Array to which they're applied, selects by boolean or by
integer at the deeply nested level. Missing items at any level
above the deepest level must broadcast. See <<<nested indexing>>> below.
A tuple of the above applies each slice item to a dimension of the
data, which can be very expressive. Multiple flat boolean/integer
arrays are "iterated as one", as described in the
[NumPy documentation](https://numpy.org/doc/stable/user/basics.indexing.html#integer-array-indexing).
A few of these slice types are illustrated in the short sketch below.
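For orientation, here is a small sketch of several of the basic slice types
applied to a ragged array (the comments describe the selections rather than
showing exact output):
>>> import numpy as np
>>> import awkward as ak
>>> array = ak.Array([[0.0, 1.1, 2.2], [3.3, 4.4]])
>>> array[1]              # an integer: the second list, [3.3, 4.4]
>>> array[0, 1:3]         # a tuple of an integer and a slice: [1.1, 2.2]
>>> array[..., 0]         # an ellipsis: the first item of each innermost list
>>> array[:, np.newaxis]  # np.newaxis: adds a regular, length-1 dimension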
Filtering
*********
A common use of selection by boolean arrays is to filter a dataset by
some property. For instance, to get the odd values of
>>> array = ak.Array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
one can put an array expression with True for each odd value inside
square brackets:
>>> array[array % 2 == 1]
<Array [1, 3, 5, 7, 9] type='5 * int64'>
This technique is so common in NumPy and Pandas data analysis that it
is often read as a syntax, rather than a consequence of array slicing.
The extension to nested arrays like
>>> array = ak.Array([[[0, 1, 2], [], [3, 4], [5]], [[6, 7, 8], [9]]])
allows us to use the same syntax more generally.
>>> array[array % 2 == 1]
<Array [[[1], [], [3], [5]], [[7], [9]]] type='2 * var * var * int64'>
In this example, the boolean array is itself nested (see
<<<nested indexing>>> below).
>>> array % 2 == 1
<Array [[[False, True, False], ..., [True]], ...] type='2 * var * var * bool'>
This also applies to data with record structures.
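For example (a sketch with a small record array; the field names are only
illustrative):
>>> records = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
>>> records[records["x"] % 2 == 1].show()
[{x: 1, y: 1.1},
{x: 3, y: 3.3}]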
For nested data, we often need to select the first or first two
elements from variable-length lists. That can be a problem if some
lists are empty. A function like #ak.num can be useful for first
selecting by the lengths of lists.
>>> array = ak.Array([[1.1, 2.2, 3.3],
... [],
... [4.4, 5.5],
... [6.6],
... [],
... [7.7, 8.8, 9.9]])
...
>>> array[ak.num(array) > 0, 0]
<Array [1.1, 4.4, 6.6, 7.7] type='4 * float64'>
>>> array[ak.num(array) > 1, 1]
<Array [2.2, 5.5, 8.8] type='3 * float64'>
It's sometimes also a problem that "cleaning" the dataset by dropping
empty lists changes its alignment, so that it can no longer be used
in calculations with "uncleaned" data. For this, #ak.mask can be
useful because it inserts None in positions that fail the filter,
rather than removing them.
>>> ak.mask(array, ak.num(array) > 1)
<Array [[1.1, 2.2, 3.3], ..., [7.7, ..., 9.9]] type='6 * option[var * float64]'>
Note, however, that the `0` or `1` to pick the first or second
item of each nested list is in the second dimension, so the first
dimension of the slice must be a `:`.
>>> ak.mask(array, ak.num(array) > 1)[:, 0]
<Array [1.1, None, 4.4, None, None, 7.7] type='6 * ?float64'>
>>> ak.mask(array, ak.num(array) > 1)[:, 1]
<Array [2.2, None, 5.5, None, None, 8.8] type='6 * ?float64'>
Another syntax for
ak.mask(array, array_of_booleans)
is
array.mask[array_of_booleans]
(which is 5 characters away from simply filtering the `array`).
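Using the array from the #ak.num examples above, the two spellings should
give the same result (a sketch):
>>> array.mask[ak.num(array) > 1]
<Array [[1.1, 2.2, 3.3], ..., [7.7, ..., 9.9]] type='6 * option[var * float64]'>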
Projection
**********
The following
>>> array = ak.Array([[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [2, 2]}],
... [{"x": 3.3, "y": [3, 3, 3]}],
... [{"x": 0, "y": []}, {"x": 1.1, "y": [1, 1, 1]}]])
has records inside of nested lists:
>>> array.type.show()
3 * var * {
    x: float64,
    y: var * int64
}
In principle, one should select nested lists before record fields,
>>> array[2, :, "x"]
<Array [0, 1.1] type='2 * float64'>
>>> array[::2, :, "x"]
<Array [[1.1, 2.2], [0, 1.1]] type='2 * var * float64'>
but it's also possible to select record fields first.
>>> array["x"]
<Array [[1.1, 2.2], [3.3], [0, 1.1]] type='3 * var * float64'>
The string can "commute" to the left through integers and slices to
get the same result as it would in its "natural" position.
>>> array[2, :, "x"]
<Array [0, 1.1] type='2 * float64'>
>>> array[2, "x", :]
<Array [0, 1.1] type='2 * float64'>
>>> array["x", 2, :]
<Array [0, 1.1] type='2 * float64'>
This is analogous to selecting rows (integer indexes) before columns
(string names) or columns before rows, except that the rows are
more complex (like a Pandas
[MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)).
This would be an expensive operation in a typical object-oriented
environment, in which the records with fields `"x"` and `"y"` are
akin to C structs, but for columnar Awkward Arrays, projecting
through all records to produce an array of nested lists of `"x"`
values just changes the metadata (no loop over data, and therefore
fast).
Thus, data analysts should think of records as fluid objects that
can be easily projected apart and zipped back together with
#ak.zip.
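For example, the `"x"` and `"y"` projections of the array above can be
reassembled with #ak.zip (a sketch; `depth_limit=2` stops the broadcasting
at the level of the original records):
>>> rezipped = ak.zip({"x": array["x"], "y": array["y"]}, depth_limit=2)
>>> rezipped.type.show()
3 * var * {
    x: float64,
    y: var * int64
}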
Note, however, that while a column string can "commute" with row
indexes to the left of its position in the tree, it can't commute
to the right. For example, it's possible to use slices inside
`"y"` because `"y"` is a list:
>>> array[0, :, "y"]
<Array [[1], [2, 2]] type='2 * var * int64'>
>>> array[0, :, "y", 0]
<Array [1, 2] type='2 * int64'>
but it's not possible to move `"y"` to the right
>>> array[0, :, 0, "y"]
IndexError: while attempting to slice
<Array [[{x: 1.1, y: [1]}, {...}], ...] type='3 * var * {x: float64, y:...'>
with
(0, :, 0, 'y')
at inner NumpyArray of length 2, using sub-slice (0).
because the `array[0, :, 0, ...]` slice applies to both `"x"` and
`"y"` before `"y"` is selected, and `"x"` is a one-dimensional
NumpyArray that can't take more than its share of slices.
Finally, note that the dot (`__getattr__`) syntax is equivalent to a single
string in a slice (`__getitem__`) if the field name is a valid Python
identifier and doesn't conflict with #ak.Array methods or properties.
>>> array.x
<Array [[1.1, 2.2], [3.3], [0, 1.1]] type='3 * var * float64'>
>>> array.y
<Array [[[1], [2, 2]], ..., [[], [1, ...]]] type='3 * var * var * int64'>
Nested Projection
*****************
If records are nested within records, you can use a series of strings in
the selector to drill down. For instance, with the following
>>> array = ak.Array([
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}},
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}},
... {"a": {"x": 1, "y": 2}, "b": {"x": 10, "y": 20}, "c": {"x": 1.1, "y": 2.2}}])
we can go directly to the numerical data by specifying a string for the
outer field and a string for the inner field.
>>> array["a", "x"]
<Array [1, 1, 1] type='3 * int64'>
>>> array["a", "y"]
<Array [2, 2, 2] type='3 * int64'>
>>> array["b", "y"]
<Array [20, 20, 20] type='3 * int64'>
>>> array["c", "y"]
<Array [2.2, 2.2, 2.2] type='3 * float64'>
As with single projections, the dot (`__getattr__`) syntax is equivalent
to a single string in a slice (`__getitem__`) if the field name is a valid
Python identifier and doesn't conflict with #ak.Array methods or properties.
>>> array.a.x
<Array [1, 1, 1] type='3 * int64'>
You can even get every field of the same name within an outer record using
a list of field names for the outer record. The following selects the `"x"`
field of `"a"`, `"b"`, and `"c"` records:
>>> array[["a", "b", "c"], "x"].show()
[{a: 1, b: 10, c: 1.1},
{a: 1, b: 10, c: 1.1},
{a: 1, b: 10, c: 1.1}]
You don't need to get all fields:
>>> array[["a", "b"], "x"].show()
[{a: 1, b: 10},
{a: 1, b: 10},
{a: 1, b: 10}]
And you can select lists of field names at all levels:
>>> array[["a", "b"], ["x", "y"]].show()
[{a: {x: 1, y: 2}, b: {x: 10, y: 20}},
{a: {x: 1, y: 2}, b: {x: 10, y: 20}},
{a: {x: 1, y: 2}, b: {x: 10, y: 20}}]
Option indexing
***************
NumPy arrays can be sliced by all of the above slice types except
arrays with missing values and arrays with nested lists, both of
which are inexpressible in NumPy. Missing values, represented by
None in Python, are called option types (#ak.types.OptionType) in
Awkward Array and can be used as a slice.
For example,
>>> array = ak.Array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
can be sliced with a boolean array
>>> array[[False, False, False, False, True, False, True, False, True]]
<Array [5.5, 7.7, 9.9] type='3 * float64'>
or a boolean array containing None values:
>>> array[[False, False, False, False, True, None, True, None, True]]
<Array [5.5, None, 7.7, None, 9.9] type='5 * ?float64'>
Similarly for arrays of integers and None:
>>> array[[0, 1, None, None, 7, 8]]
<Array [1.1, 2.2, None, None, 8.8, 9.9] type='6 * ?float64'>
This is the same behavior as pyarrow's
[Array.take](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.take),
which establishes a convention for how to interpret slice arrays
with option type:
>>> import pyarrow as pa
>>> array = pa.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
>>> array.take(pa.array([0, 1, None, None, 7, 8]))
<pyarrow.lib.DoubleArray object at 0x7efc7f060210>
[
1.1,
2.2,
null,
null,
8.8,
9.9
]
Nested indexing
***************
Awkward Array's nested lists can be used as slices as well, as long
as the type at the deepest level of nesting is boolean or integer.
For example,
>>> array = ak.Array([[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [], [[5.5]]])
can be sliced at the top level with one-dimensional arrays:
>>> array[[False, True, True]]
<Array [[], [[5.5]]] type='2 * var * var * float64'>
>>> array[[1, 2]]
<Array [[], [[5.5]]] type='2 * var * var * float64'>
with singly nested lists:
>>> array[[[False, True, True], [], [True]]]
<Array [[[], [3.3, 4.4]], [], [[5.5]]] type='3 * var * var * float64'>
>>> array[[[1, 2], [], [0]]]
<Array [[[], [3.3, 4.4]], [], [[5.5]]] type='3 * var * var * float64'>
and with doubly nested lists:
>>> array[[[[False, True, False], [], [True, False]], [], [[False]]]]
<Array [[[1.1], [], [3.3]], [], [[]]] type='3 * var * var * float64'>
>>> array[[[[1], [], [0]], [], [[]]]]
<Array [[[1.1], [], [3.3]], [], [[]]] type='3 * var * var * float64'>
The key thing is that the nested slice has the same number of elements
as the array it's slicing at every level of nesting that it reproduces.
This is similar to the requirement that boolean arrays have the same
length as the array they're filtering.
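One way to see this (a sketch using #ak.num) is that the list lengths of the
slicer match those of the array at each level of nesting it reproduces:
>>> ak.num(array, axis=1)
<Array [3, 0, 1] type='3 * int64'>
>>> ak.num(ak.Array([[[False, True, False], [], [True, False]], [], [[False]]]), axis=1)
<Array [3, 0, 1] type='3 * int64'>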
This kind of slicing is useful because NumPy's
[universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
produce arrays with the same structure as the original array, which
can then be used as filters.
>>> ((array * 10) % 2 == 1).show()
[[[False, True, False], [], [True, False]],
[],
[[True]]]
>>> (array[(array * 10) % 2 == 1]).show()
[[[1.1], [], [3.3]],
[],
[[5.5]]]
Functions whose names start with "arg" return index positions, which
can be used with the integer form.
>>> np.argmax(array, axis=-1).show()
[[2, None, 1],
[],
[0]]
>>> array[np.argmax(array, axis=-1)].show()
[[[3.3, 4.4], None, []],
[],
[[5.5]]]
Here, the `np.argmax` returns the integer position of the maximum
element or None for empty arrays. It's a nice example of
<<<option indexing>>> with <<<nested indexing>>>.
When applying a nested index with missing (None) entries at levels
higher than the last level, the indexer must have the same dimension
as the array being indexed, and the resulting output will have missing
entries at the corresponding locations, e.g. for
>>> array[ [[[0, None, 2, None, None], None, [1]], None, [[0]]] ].show()
[[[0, None, 2.2, None, None], None, [4.4]],
None,
[[5.5]]]
the sub-list at entry 0,0 is extended because the missing (None) entries
act at the last level, while the higher levels of the indexer all have the
same dimension as the array being indexed.

into the tutorial area where it will be more visible.

We've used three words now for mostly the same thing: "jagged" (I started with that because Wikipedia preferred it), "ragged" (this is what I should have used, because it's more widespread in the SciPy community), and "awkward" (new here). Are you using a different word than "ragged" because it also includes missing values? I wonder if "ragged, masked indexing" might be better, since it ties in with a word the reader might already know.

I try to use capitalization consistently and have decided to capitalize "Awkward Array" and even "Awkward" when it's used as an adjective: "Awkward indexing." If it's lowercase, it will less likely be recognized as a brand name, and then it takes on the ordinary English meaning of "clumsy or difficult."

@jpivarski jpivarski merged commit 32ca867 into main Feb 16, 2023
@jpivarski jpivarski deleted the agoose77/docs-addition-to-indexing branch February 16, 2023 15:13