From d10964b88be04469fe1f304a93127a8621ed85aa Mon Sep 17 00:00:00 2001 From: Angus Hollands Date: Thu, 16 Feb 2023 10:14:33 +0000 Subject: [PATCH 1/2] docs: improve ragged indexing docs --- docs/user-guide/how-to-filter-ragged.md | 150 ++++++++++++++---------- 1 file changed, 85 insertions(+), 65 deletions(-) diff --git a/docs/user-guide/how-to-filter-ragged.md b/docs/user-guide/how-to-filter-ragged.md index 8442b073af..d132c05f3c 100644 --- a/docs/user-guide/how-to-filter-ragged.md +++ b/docs/user-guide/how-to-filter-ragged.md @@ -4,15 +4,15 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.14.1 + jupytext_version: 1.14.4 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- -How to filter lists within arrays using ragged slicing -====================================================== +How to filter with a ragged index array +======================================= ```{code-cell} ipython3 import awkward as ak @@ -21,8 +21,6 @@ import numpy as np ## What is ragged slicing? -+++ - One of the most powerful features of NumPy is the expressiveness of its indexing system. A NumPy array [can be sliced in many different ways](https://numpy.org/doc/stable/user/basics.indexing.html#basic-indexing), such as with a single integer, or an array of integers. Awkward Array implements most of these indexing styles, but adds an additional variant: _ragged indexing_. +++ @@ -67,13 +65,14 @@ To produce this result, we need ragged indexing. +++ +(how-to-filter-masked:building-a-ragged-index)= ## Building a ragged index +++ Ragged indexing requires an index array that 1. has a structure matching the array being sliced **up to** (but not including) the final dimension of the index -2. has at _least_ one ragged (`var`) dimension. +2. 
has at _least_ one ragged (`var`) dimension **or** contain missing values By structure, we mean the number of sublists in each dimension, which can be seen with {func}`ak.num`: @@ -162,11 +161,12 @@ Clearly this index produces the result that we were aiming for! +++ +(how-to-filter-ragged:indexing-with-argmin-and-argmax)= ## Indexing with `argmin` and `argmax` +++ -Ragged indexing is especially useful when combined with the positional {func}`ak.argmin` and {func}`ak.argmax` reducers. These functions accept an `keepdims=True` argument that can be used to keep _the same number of dimensions_ as the original array. +Ragged indexing is especially useful when combined with the positional {func}`ak.argmin` and {func}`ak.argmax` reducers. These functions accept an `keepdims=True` argument that can be used to keep _the same number of dimensions_ as the original array. There is also a `mask_identity` argument is explained in {ref}`how-to-filter-ragged:indexing-with-missing-values`. For now, we will set it to `False`. ```{code-cell} ipython3 array = ak.Array( @@ -179,45 +179,59 @@ array = ak.Array( array ``` -Without `keepdims=True`, all reducers collapse a dimension of the original array +With `keepdims=False`, all reducers collapse a dimension of the original array: ```{code-cell} ipython3 -ak.argmin(array, axis=1) +ak.argmin(array, axis=1, keepdims=False, mask_identity=False) ``` If we try and use this index to slice `array`, it will likely not produce the result we might initially expect: ```{code-cell} ipython3 -array[ak.argmin(array, axis=1)] +array[ak.argmin(array, axis=1, keepdims=False, mask_identity=False)] ``` Instead of pulling out the smallest items in `array` along `axis=1`, we have simply re-arranged the sublists of `array` along `axis=0`. Our index has only a single dimension, so for each value in `ak.argmin(array, axis=-1)`, Awkward pulls out the corresponding item from `array`. 
We want to pull values out of the _second_ dimension, so our index array needs to be two dimensional. +++ -Let's now look at what happens with `keepdims=True`: +Let's now look at what happens with `keepdims=True`. The result is a two dimensional, fully regular array, with no missing values: ```{code-cell} ipython3 -ak.argmin(array, axis=-1, keepdims=True) +ak.argmin(array, axis=-1, keepdims=True, mask_identity=False) ``` +Before we can use this as an index array, we need to convert _at least_ one dimension to a ragged dimension. This follows from rule (2) described in {ref}`how-to-filter-masked:building-a-ragged-index`. + ```{code-cell} ipython3 -array[ak.argmin(array, axis=-1, keepdims=True)] +ak.from_regular( + ak.argmin(array, axis=-1, keepdims=True, mask_identity=False) +) ``` -This now produces the expected result! +We can now use this array to index into `array`: -+++ +```{code-cell} ipython3 +array[ + ak.from_regular( + ak.argmin(array, axis=-1, keepdims=True, mask_identity=False) + ) +] +``` -## Filtering with missing sublists +it produces the expected result! +++ -Ragged indexing supports using `None` in place of empty sublists within an index. For example +(how-to-filter-ragged:indexing-with-missing-values)= +## Indexing with missing values + +The `keepdims` argument of {func}`ak.argmin` and {func}`ak.argmax` is very useful for creating index arrays. However, what happens when it is not possible to define a minimum value, i.e. for empty sublists? Let's first create an array that contains empty sublists: ```{code-cell} ipython3 array = ak.Array( [ + [], [10, 3, 2, 9], [4, 5, 5, 12, 6], [], @@ -227,86 +241,92 @@ array = ak.Array( array ``` -Let's use build a ragged index to pull some values out of `array`. Rather than using empty lists, we can use `None` to mask out sublists that we don't care about: +Awkward reducers accept a `mask_identity` argument that yields the identity value for the reducer instead of `None` when reducing empty lists. 
Computing {func}`ak.argmax`, we find that the identity for the {func}`ak.argmax` is `-1`: ```{code-cell} ipython3 -array[ - [ - [0, 1], - None, - [], - [2], - ], -] +ak.argmax(array, keepdims=True, axis=-1, mask_identity=False) ``` -If we compare this with simply providing an empty sublist, - ```{code-cell} ipython3 -array[ - [ - [0, 1], - [], - [], - [2], - ], -] +ak.argmax(array, keepdims=True, axis=-1, mask_identity=True) ``` -we can see that the `None` value introduces an - -+++ +What happens if we try and use the array produced with `mask_identity=False` to index into `array`? -## Filtering with booleans - -+++ - -Awkward Array's ragged indexing is a generalisation of the advanced indexing supported by NumPy. It is therefore reasonable to ask whether Awkward supports ragged indexing with boolean values, selecting only values for which the index is `True`. Let's create an array of integers: +As discussed in {ref}`how-to-filter-ragged:indexing-with-argmin-and-argmax`, we first need to convert _at least_ one dimension to a ragged dimension ```{code-cell} ipython3 -numbers = ak.Array( - [ - [0, 1, 2, 3], - [4, 5, 6], - [8, 9, 10, 11, 12], - ] +index = ak.from_regular( + ak.argmax(array, keepdims=True, axis=-1, mask_identity=False) ) ``` -We can use ragged indexing to keep only the even values. Let's generate a boolean mask with the same structure as `numbers`. In order for there to be a single boolean value for each item in `numbers`, the filter array must have exactly the same number of elements. Ufuncs are powerful means of generating boolean masks, as they directly preserve the exact structure of the original array: +Now, if we try and index into `array` with `index`, it will raise an exception ```{code-cell} ipython3 -is_even = (numbers % 2) == 0 -``` +:tags: [raises-exception] -```{code-cell} ipython3 -numbers +array[index] ``` +From the error message, it is clear that for some sublist(s) the index `-1` is out of range. 
This makes sense; some of our sublists are empty, meaning that there is no valid integer to index into them. + +Now let's look at the result of indexing with `mask_identity=True`. + ```{code-cell} ipython3 -is_even +index = ak.argmax(array, keepdims=True, axis=-1, mask_identity=True) ``` -Now we can use `is_even` to slice `numbers`: +Because it contains an option type, we do not need to convert the index array to a ragged array. We can see that this index succeeds: ```{code-cell} ipython3 -numbers[is_even] +array[index] ``` -Note that this is different to what would happen with NumPy's boolean indexing: +Here, the missing values in the index array correspond to missing values _in the output array_. + ++++ + +## Indexing with missing sublists + +Ragged indexing also supports using `None` in place of _empty sublists_ within an index. For example, given the following array ```{code-cell} ipython3 -numbers_np = np.array( +array = ak.Array( [ - [0, 1, 2, 3], - [4, 5, 6, 7], - [8, 9, 10, 11], + [10, 3, 2, 9], + [4, 5, 5, 12, 6], + [], + [8, 9, -1], ] ) +array ``` +let's use build a ragged index to pull out some particular values. Rather than using empty lists, we can use `None` to mask out sublists that we don't care about: + ```{code-cell} ipython3 -numbers_np[(numbers_np % 2) == 0] +array[ + [ + [0, 1], + None, + [], + [2], + ], +] +``` + +If we compare this with simply providing an empty sublist, + +```{code-cell} ipython3 +array[ + [ + [0, 1], + [], + [], + [2], + ], +] ``` -NumPy, lacking a ragged array structure, has to flatten the result whereas Awkward Array preserves the number of dimensions in the result. +we can see that the `None` value introduces an option-type into the final result. `None` values can be used at _any_ level in the index array to introduce an option-type at that depth in the result. 
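
The interaction between `mask_identity` and indexing described in this patch can be sketched with a small pure-Python model. This is only a mental model under stated assumptions, not how Awkward Array implements it: the helpers `argmax_keepdims` and `take_per_list` are hypothetical names invented for illustration, and plain nested lists stand in for `ak.Array`.

```python
# Pure-Python mental model (NOT Awkward's implementation).
# `argmax_keepdims` and `take_per_list` are hypothetical helper names.

def argmax_keepdims(lists, mask_identity=True):
    """Per-sublist argmax that keeps one entry per sublist (like keepdims=True).

    Empty sublists yield None when mask_identity is True, otherwise -1.
    """
    out = []
    for sub in lists:
        if sub:
            # max() returns the first maximal position on ties
            out.append([max(range(len(sub)), key=sub.__getitem__)])
        else:
            out.append([None if mask_identity else -1])
    return out


def take_per_list(lists, index):
    """Pick items from each sublist; None in the index becomes None in the result."""
    result = []
    for sub, idx in zip(lists, index):
        if idx is None:
            result.append(None)  # missing sublist in the index
        else:
            result.append([None if i is None else sub[i] for i in idx])
    return result


array = [[], [10, 3, 2, 9], [4, 5, 5, 12, 6], [], [8, 9, -1]]

masked_index = argmax_keepdims(array, mask_identity=True)
masked_result = take_per_list(array, masked_index)

unmasked_index = argmax_keepdims(array, mask_identity=False)
try:
    take_per_list(array, unmasked_index)
    unmasked_failed = False
except IndexError:
    # -1 is not a valid position in an empty sublist
    unmasked_failed = True

# A whole sublist of the index may also be None
sublist_result = take_per_list(
    [[10, 3, 2, 9], [4, 5, 5, 12, 6], [], [8, 9, -1]],
    [[0, 1], None, [], [2]],
)
```

In this model the masked index succeeds and carries its `None` entries through to the output, whereas the `-1` identity fails on the empty sublists, which mirrors the behaviour the text describes.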
From cb59d5014944a2619a4fb0dcfcb89eb84012c796 Mon Sep 17 00:00:00 2001 From: Angus Hollands Date: Thu, 16 Feb 2023 10:59:34 +0000 Subject: [PATCH 2/2] wip: work on explaining masking --- docs/_toc.yml | 4 +- docs/user-guide/how-to-filter-masked.md | 124 +++++++++++++++++++++-- docs/user-guide/how-to-filter-ragged.md | 127 ++++++++---------------- 3 files changed, 157 insertions(+), 98 deletions(-) diff --git a/docs/_toc.yml b/docs/_toc.yml index 5a13d5e92a..9f50ff13ce 100644 --- a/docs/_toc.yml +++ b/docs/_toc.yml @@ -99,9 +99,9 @@ subtrees: - file: user-guide/how-to-filter-cut-mask title: "Cuts vs. masks [todo]" - file: user-guide/how-to-filter-ragged - title: "Slicing lists within arrays" + title: "Using ragged arrays" - file: user-guide/how-to-filter-masked - title: "Slices with missing values [todo]" + title: "Using arrays with missing values" - file: user-guide/how-to-restructure title: "Restructuring data" diff --git a/docs/user-guide/how-to-filter-masked.md b/docs/user-guide/how-to-filter-masked.md index c40cc613a1..9305374a82 100644 --- a/docs/user-guide/how-to-filter-masked.md +++ b/docs/user-guide/how-to-filter-masked.md @@ -4,20 +4,128 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.10.3 + jupytext_version: 1.14.4 kernelspec: - display_name: Python 3 + display_name: Python 3 (ipykernel) language: python name: python3 --- -How to use slices that have missing values -========================================== +How to filter with arrays containing missing values +=================================================== -**This is a stub:** I intend to write this article, but haven't yet. +```{code-cell} ipython3 +import awkward as ak +import numpy as np +``` -If you need it soon, create an issue saying so and I'll make it a higher priority. 
+(how-to-filter-ragged:indexing-with-missing-values)= +## Indexing with missing values +In {ref}`how-to-filter-masked:building-an-awkward-index`, we looked at building arrays of integers to perform awkward indexing using {func}`ak.argmin` and {func}`ak.argmax`. In particular, the `keepdims` argument of {func}`ak.argmin` and {func}`ak.argmax` is very useful for creating arrays that can be used to index into the original array. However, reducers such as {func}`ak.argmax` behave differently when they are asked to operate upon empty lists. -[![](../image/github-issues-documentation.png)](https://github.com/scikit-hep/awkward-1.0/issues/new?assignees=&labels=docs&template=documentation.md&title=) +Let's first create an array that contains empty sublists: -The text of your issue doesn't have to be much more than a link to this page, so I can be sure which page you're referring to. If you add details about how and why you need it, however, I may be able to tailor the text to help you more. +```{code-cell} ipython3 +array = ak.Array( + [ + [], + [10, 3, 2, 9], + [4, 5, 5, 12, 6], + [], + [8, 9, -1], + ] +) +array +``` + +Awkward reducers accept a `mask_identity` argument, which changes the {attr}`ak.Array.type` and the values of the result: + +```{code-cell} ipython3 +ak.argmax(array, keepdims=True, axis=-1, mask_identity=False) +``` + +```{code-cell} ipython3 +ak.argmax(array, keepdims=True, axis=-1, mask_identity=True) +``` + +Setting `mask_identity=False` yields the identity value for the reducer instead of `None` when reducing empty lists. From the above examples of {func}`ak.argmax`, we can see that the identity for {func}`ak.argmax` is `-1`. What happens if we try and use the array produced with `mask_identity=False` to index into `array`? 
+ ++++ + +As discussed in {ref}`how-to-filter-ragged:indexing-with-argmin-and-argmax`, we first need to convert _at least_ one dimension to a ragged dimension: + +```{code-cell} ipython3 +index = ak.from_regular( + ak.argmax(array, keepdims=True, axis=-1, mask_identity=False) +) +``` + +Now, if we try and index into `array` with `index`, it will raise an exception: + +```{code-cell} ipython3 +:tags: [raises-exception] + +array[index] +``` + +From the error message, it is clear that for some sublist(s) the index `-1` is out of range. This makes sense; some of our sublists are empty, meaning that there is no valid integer to index into them. + +Now let's look at the result of indexing with `mask_identity=True`. + +```{code-cell} ipython3 +index = ak.argmax(array, keepdims=True, axis=-1, mask_identity=True) +``` + +Because it contains an option type, `index` already satisfies rule (2) in {ref}`how-to-filter-masked:building-an-awkward-index`, and we do not need to convert it to a ragged array. We can see that this index succeeds: + +```{code-cell} ipython3 +array[index] +``` + +Here, the missing values in the index array correspond to missing values _in the output array_. + ++++ + +## Indexing with missing sublists + +Ragged indexing also supports using `None` in place of _empty sublists_ within an index. For example, given the following array + +```{code-cell} ipython3 +array = ak.Array( + [ + [10, 3, 2, 9], + [4, 5, 5, 12, 6], + [], + [8, 9, -1], + ] +) +array +``` + +let's build a ragged index to pull out some particular values. 
Rather than using empty lists, we can use `None` to mask out sublists that we don't care about: + +```{code-cell} ipython3 +array[ + [ + [0, 1], + None, + [], + [2], + ], +] +``` + +If we compare this with simply providing an empty sublist, + +```{code-cell} ipython3 +array[ + [ + [0, 1], + [], + [], + [2], + ], +] +``` + +we can see that the `None` value introduces an option-type into the final result. 
`None` values can be used at _any_ level in the index array to introduce an option-type at that depth in the result. diff --git a/docs/user-guide/how-to-filter-ragged.md b/docs/user-guide/how-to-filter-ragged.md index d132c05f3c..f573f29d3d 100644 --- a/docs/user-guide/how-to-filter-ragged.md +++ b/docs/user-guide/how-to-filter-ragged.md @@ -11,17 +11,17 @@ kernelspec: name: python3 --- -How to filter with a ragged index array -======================================= +How to filter with ragged arrays +================================ ```{code-cell} ipython3 import awkward as ak import numpy as np ``` -## What is ragged slicing? +## What is awkward indexing? -One of the most powerful features of NumPy is the expressiveness of its indexing system. A NumPy array [can be sliced in many different ways](https://numpy.org/doc/stable/user/basics.indexing.html#basic-indexing), such as with a single integer, or an array of integers. Awkward Array implements most of these indexing styles, but adds an additional variant: _ragged indexing_. +One of the most powerful features of NumPy is the expressiveness of its indexing system. A NumPy array [can be sliced in many different ways](https://numpy.org/doc/stable/user/basics.indexing.html#basic-indexing), such as with a single integer, or an array of integers. Awkward Array implements most of these indexing styles, but adds an additional variant: _awkward indexing_. +++ @@ -61,16 +61,16 @@ type: 3 * var * var * float64 +++ -To produce this result, we need ragged indexing. +To produce this result, we need awkward indexing. +++ -(how-to-filter-masked:building-a-ragged-index)= -## Building a ragged index +(how-to-filter-masked:building-an-awkward-index)= +## Building an awkward index +++ -Ragged indexing requires an index array that +Awkward indexing requires an index array that 1. has a structure matching the array being sliced **up to** (but not including) the final dimension of the index 2. 
has at _least_ one ragged (`var`) dimension **or** contain missing values @@ -90,11 +90,11 @@ ak.num(array, axis=0) ak.num(array, axis=1) ``` -To put this more simply, the final dimension of the ragged index is used to pull items out of the array. Therefore, Awkward needs the preceeding dimensions to line up! +To put this more simply, the final dimension of the awkward index is used to pull items out of the array. Therefore, Awkward needs the preceding dimensions to line up! +++ -Recall that we wanted to pull out the following result from `array` using ragged indexing: +Recall that we wanted to pull out the following result from `array` using awkward indexing: ``` [[[], [3.3], [7.7]], [], @@ -133,7 +133,7 @@ array ak.local_index(array) ``` -To create our ragged index, all we need to do is create an array _like_ `ak.local_index(array)`, but with only the local indices that we want to keep, i.e. +To create our awkward index, all we need to do is create an array _like_ `ak.local_index(array)`, but with only the local indices that we want to keep, i.e. ```{code-cell} ipython3 index = ak.Array( @@ -151,7 +151,7 @@ We can see that this array matches the leading structure of `array`, and has at index.type.show() ``` -Let's see what slicing `array` with this ragged index looks like: +Let's see what slicing `array` with this awkward index looks like: ```{code-cell} ipython3 array[index] @@ -166,7 +166,7 @@ Clearly this index produces the result that we were aiming for! +++ -Ragged indexing is especially useful when combined with the positional {func}`ak.argmin` and {func}`ak.argmax` reducers. These functions accept an `keepdims=True` argument that can be used to keep _the same number of dimensions_ as the original array. There is also a `mask_identity` argument is explained in {ref}`how-to-filter-ragged:indexing-with-missing-values`. For now, we will set it to `False`. 
+Awkward indexing is especially useful when combined with the positional {func}`ak.argmin` and {func}`ak.argmax` reducers. These functions accept a `keepdims=True` argument that can be used to keep _the same number of dimensions_ as the original array. There is also a `mask_identity` argument, which is explained in {ref}`how-to-filter-ragged:indexing-with-missing-values`. For now, we will set it to `False`. ```{code-cell} ipython3 array = ak.Array( @@ -201,7 +201,7 @@ Let's now look at what happens with `keepdims=True`. The result is a two dimensi ak.argmin(array, axis=-1, keepdims=True, mask_identity=False) ``` -Before we can use this as an index array, we need to convert _at least_ one dimension to a ragged dimension. This follows from rule (2) described in {ref}`how-to-filter-masked:building-a-ragged-index`. +Before we can use this as an index array, we need to convert _at least_ one dimension to a ragged dimension. This follows from rule (2) described in {ref}`how-to-filter-masked:building-an-awkward-index`. ```{code-cell} ipython3 ak.from_regular( @@ -223,110 +223,61 @@ it produces the expected result! +++ -(how-to-filter-ragged:indexing-with-missing-values)= -## Indexing with missing values +## Filtering with booleans +As described in {ref}`how-to-filter-masked:building-an-awkward-index`, Awkward Array's awkward indexing is a generalisation of the advanced indexing supported by NumPy. It is therefore reasonable to ask whether Awkward supports awkward indexing with +_boolean_ values, selecting only values for which the index is `True`. -The `keepdims` argument of {func}`ak.argmin` and {func}`ak.argmax` is very useful for creating index arrays. However, what happens when it is not possible to define a minimum value, i.e. for empty sublists? 
Let's first create an array that contains empty sublists: +Let's create an array of integers: ```{code-cell} ipython3 -array = ak.Array( +numbers = ak.Array( [ - [], - [10, 3, 2, 9], - [4, 5, 5, 12, 6], - [], - [8, 9, -1], + [0, 1, 2, 3], + [4, 5, 6], + [8, 9, 10, 11, 12], ] ) -array ``` -Awkward reducers accept a `mask_identity` argument that yields the identity value for the reducer instead of `None` when reducing empty lists. Computing {func}`ak.argmax`, we find that the identity for the {func}`ak.argmax` is `-1`: +We can use awkward indexing to keep only the even values. Let's generate a boolean mask with the same structure as `numbers`. In order for there to be a single boolean value for each item in `numbers`, the filter array must have exactly the same number of elements. Ufuncs, such as {func}`np.mod`, are powerful tools for generating boolean masks, as they directly preserve the exact structure of the original array: ```{code-cell} ipython3 -ak.argmax(array, keepdims=True, axis=-1, mask_identity=False) +is_even = (numbers % 2) == 0 +is_even ``` ```{code-cell} ipython3 -ak.argmax(array, keepdims=True, axis=-1, mask_identity=True) +numbers ``` -What happens if we try and use the array produced with `mask_identity=False` to index into `array`? - -As discussed in {ref}`how-to-filter-ragged:indexing-with-argmin-and-argmax`, we first need to convert _at least_ one dimension to a ragged dimension +Now we can use `is_even` to slice `numbers`: ```{code-cell} ipython3 -index = ak.from_regular( - ak.argmax(array, keepdims=True, axis=-1, mask_identity=False) -) +numbers[is_even] ``` -Now, if we try and index into `array` with `index`, it will raise an exception +Note that this is different to what would happen with NumPy's boolean indexing: ```{code-cell} ipython3 -:tags: [raises-exception] - -array[index] -``` - -From the error message, it is clear that for some sublist(s) the index `-1` is out of range. 
This makes sense; some of our sublists are empty, meaning that there is no valid integer to index into them. - -Now let's look at the result of indexing with `mask_identity=True`. - -```{code-cell} ipython3 -index = ak.argmax(array, keepdims=True, axis=-1, mask_identity=True) -``` - -Because it contains an option type, we do not need to convert the index array to a ragged array. We can see that this index succeeds: - -```{code-cell} ipython3 -array[index] -``` - -Here, the missing values in the index array correspond to missing values _in the output array_. - -+++ - -## Indexing with missing sublists - -Ragged indexing also supports using `None` in place of _empty sublists_ within an index. For example, given the following array - -```{code-cell} ipython3 -array = ak.Array( +numbers_np = np.array( [ - [10, 3, 2, 9], - [4, 5, 5, 12, 6], - [], - [8, 9, -1], + [0, 1, 2, 3], + [4, 5, 6, 7], + [8, 9, 10, 11], ] ) -array ``` -let's use build a ragged index to pull out some particular values. Rather than using empty lists, we can use `None` to mask out sublists that we don't care about: - ```{code-cell} ipython3 -array[ - [ - [0, 1], - None, - [], - [2], - ], -] +numbers_np[(numbers_np % 2) == 0] ``` -If we compare this with simply providing an empty sublist, +NumPy, lacking a ragged array structure, has to flatten the result whereas Awkward Array preserves the number of dimensions in the result. ```{code-cell} ipython3 -array[ - [ - [0, 1], - [], - [], - [2], - ], +numbers[ + [[True, False, True, False], + [False], + [False, True, False]] ] ``` - -we can see that the `None` value introduces an option-type into the final result. `None` values can be used at _any_ level in the index array to introduce an option-type at that depth in the result.
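
The boolean filtering described in the "Filtering with booleans" section can be modelled in a few lines of pure Python. This is a sketch under stated assumptions rather than Awkward's implementation: `filter_ragged` is a hypothetical helper, nested lists stand in for `ak.Array`, and the model assumes the mask has exactly the same structure as the data.

```python
# Pure-Python mental model of ragged boolean filtering (NOT Awkward's code).
# `filter_ragged` is a hypothetical helper name used only for illustration.

def filter_ragged(lists, mask):
    """Keep the items whose mask entry is True, sublist by sublist."""
    if len(lists) != len(mask):
        raise ValueError("mask must have one sublist per data sublist")
    out = []
    for sub, flags in zip(lists, mask):
        if len(sub) != len(flags):
            raise ValueError("each mask sublist must match its data sublist")
        out.append([item for item, keep in zip(sub, flags) if keep])
    return out


numbers = [[0, 1, 2, 3], [4, 5, 6], [8, 9, 10, 11, 12]]

# Element-wise test that preserves the nesting, like a ufunc would
is_even = [[item % 2 == 0 for item in sub] for sub in numbers]

# Structure-preserving result: one (possibly shorter) sublist per input sublist
kept = filter_ragged(numbers, is_even)

# What a flattening boolean index (as in NumPy) would give instead
flattened = [item for sub in kept for item in sub]
```

Here `kept` retains one sublist per input sublist, while `flattened` shows the one-dimensional result that NumPy-style boolean indexing would produce for the same selection.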