jennybc_lists_lesson.Rmd

---
title: "Exploring and Extracting Data in Lists"
author: "Jenny Bryan, adapted by Clarke Iakovakis"
output: 
  html_document:
    df_print: paged
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 2
---

```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", error = TRUE, cache = TRUE)
library(png)
library(repurrrsive)
```

## Attribution

This lesson was created and is copyrighted by Jenny Bryan, available at <https://jennybc.github.io/purrr-tutorial/ls00_inspect-explore.html> and distributed under the terms of a [Creative Commons BY-NC 4.0 License](http://creativecommons.org/licenses/by-nc/4.0/). It has been adapted by Clarke Iakovakis and the adaptation is likewise distributed under a [Creative Commons BY-NC 4.0 License](http://creativecommons.org/licenses/by-nc/4.0/).

```{r,fig.height=1,echo=FALSE}
cc <- readPNG(file.path("./images/cc bync.png"))
grid::grid.raster(cc)
```

## Binder link to this notebook:

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ciakovx/ciakovx.github.io/master?filepath=jennybc_lists_lesson.ipynb)

<https://mybinder.org/v2/gh/ciakovx/ciakovx.github.io/master?filepath=jennybc_lists_lesson.ipynb>

## Load packages

Load purrr and repurrrsive, which contains recursive list examples.

```{r}
library(purrr)
library(repurrrsive)
```

## Inspect and explore

List inspection is very important and also fairly miserable. Before you can apply a function to every element of a list, you'd better understand the list!

You need to develop a toolkit for list inspection. Be on the look out for:

  * What is the length of the list?
  * Are the components homogeneous, i.e. do they have the same overall structure, albeit containing different data?
  * Note the length, names, and types of the constituent objects.
  
> I have no idea what's in this list or what its structure is! Please send help.

Understand this is **situation normal**, especially when your list comes from querying a poorly documented API. This is often true even when your list has been created completely within R. How many of us perfectly understand the structure of a fitted linear model object? You just have to embark on a voyage of discovery and figure out what's in there. Happy trails.

### Indexing, review

Remember, there are 3 ways to pull elements out of a list:

  * The `$` operator. Extracts a single element by name. Name can be unquoted, if syntactic.
    ```{r}
    x <- list(a = "a", b = 2)
    x$a
    x$b
    ```
  * `[[` a.k.a. double square bracket. Extracts a single element by name or position. Name must be quoted, if provided directly. Name or position can also be stored in a variable.
    ```{r}
    x <- list(a = "a", b = 2)
    x[["a"]]
    x[[2]]
    nm <- "a"
    x[[nm]]
    i <- 2
    x[[i]]
    ```
  * `[` a.k.a. single square bracket. Regular vector indexing. For a list input, this always returns a list!
    ```{r}
    x <- list(a = "a", b = 2)
    x["a"]
    x[c("a", "b")]
    x[c(FALSE, TRUE)]
    ```

### `str()`

`str()` can help with basic list inspection, although it's still rather frustrating. Learn to love the `max.level` and `list.len` arguments. You can use them to keep the output of `str()` down to a manageable volume.

Once you begin to suspect or trust that your list is homogeneous, i.e. consists of sub-lists with similar structure, it's often a good idea to do an in-depth study of a single element. In general, remember you can combine list inspection via `str(..., list.len = x, max.level = y)` with single `[` and double `[[` square bracket indexing.

The repurrrsive package provides examples of lists. We explore them below, to lay the groundwork for other lessons, and to demonstrate list inspection strategies.

### listviewer and RStudio's Object Explorer

 The RStudio IDE (v1.1 and higher) offers an [Object Explorer](https://blog.rstudio.com/2017/08/22/rstudio-v1-1-preview-object-explorer/) that provides interactive inspection and code generation tools for hierarchical objects, such as lists. You can invoke it via the GUI or in code as `View(YOUR_UGLY_LIST)`.
 
However, that won't help you expose list exploration in something like this website. I am using the [listviewer](https://CRAN.R-project.org/package=listviewer) package to do this below. It allows you to expose list exploration in a rendered `.Rmd` document.
To replicate this experience locally, call, e.g., `listviewer::jsonedit(got_chars, mode = "view")`.


```{r}
library(listviewer)
```

## Wes Anderson color palettes

`wesanderson` is a simple list containing color palettes from the [wesanderson package](https://cran.r-project.org/package=wesanderson). Each component is a palette, named after a movie, and contains a character vector of colors as hexadecimal triplets.

```{r}
str(wesanderson)
```

### Explore `wesanderson`

```{r echo = FALSE}
jsonedit(wesanderson, mode = "view", elementId = "wesanderson")
```

You can get a similar experience in RStudio via `View(wesanderson)`.

## Game of Thrones POV characters

`got_chars` is a list with information on the `r length(got_chars)` point-of-view characters from the first five books in the Song of Ice and Fire series by George R. R. Martin. Retrieved from [An API Of Ice And Fire](https://anapioficeandfire.com). Each component corresponds to one character and contains `r length(got_chars[[1]])` components which are named atomic vectors of various lengths and types.

```{r}
str(got_chars, list.len = 3)
str(got_chars[[1]], list.len = 8)
```

### Explore `got_chars`

```{r echo = FALSE}
jsonedit(number_unnamed(got_chars), mode = "view", elementId = "got_chars")
```

You can get a similar experience in RStudio via `View(got_chars)`.

## GitHub users and repositories

`gh_users` is a list with information on 6 GitHub users. `gh_repos` is a nested list, also of length 6, where each component is another list with information on up to 30 of that user's repositories. Retrieved from the [GitHub API](https://developer.github.com/v3/).

```{r}
str(gh_users, max.level = 1)
```

### Explore `gh_users`

```{r echo = FALSE}
jsonedit(number_unnamed(gh_users), mode = "view", elementId = "gh_users")
```

You can get a similar experience in RStudio via `View(gh_users)`.

### Explore `gh_repos`

```{r echo = FALSE}
jsonedit(number_unnamed(gh_repos), mode = "view", elementId = "gh_repos")
```

You can get a similar experience in RStudio via `View(gh_repos)`.

#### Exercises

1. Read the documentation on `str()`. What does `max.level` control? Apply `str()` to `wesanderson` and/or `got_chars` and experiment with `max.level = 0`, `max.level = 1`, and `max.level = 2`. Which will you use in practice with deeply nested lists?
1. What does the `list.len` argument of `str()` control? What is its default value? Call `str()` on `got_chars` and then on a single component of `got_chars` with `list.len` set to a value much smaller than the default. What range of values do you think you'll use in real life?
1. Call `str()` on `got_chars`, specifying both `max.level` and `list.len`.
1. Call `str()` on the first element of `got_chars`, i.e. the first Game of Thrones character. Use what you've learned to pick an appropriate combination of `max.level` and `list.len`.

## Vectorized and "list-ized" operations

Recall that many operations "just work" in a vectorized fashion in R:

```{r}
(3:5) ^ 2
sqrt(c(9, 16, 25))
```

Through the magic of R, the operations "raise to the power of 2" and "take the square root" were applied to each individual element of the numeric vector input. Someone -- but not you! -- has written a `for()` loop:

```{r eval = FALSE}
for (i in 1:n) {
  output[[i]] <- f(input[[i]])
}
```

Automatic vectorization is possible because our input is an atomic vector: the individual atoms are always of length one, always of uniform type.

What if the input is a list? You have to be more intentional to apply a function `f()` to each element of a list, i.e. to "list-ize" computation. This makes sense because the data structure itself does not guarantee that it makes any sense at all to apply a common function `f()` to each element of the list. You must guarantee that.

`purrr::map()` is a function for applying a function to each element of a list. The [closest base R function](bk01_base-functions.html) is `lapply()`. Here's how the square root example of the above would look if the input was in a list.

```{r}
map(c(9, 16, 25), sqrt)
```

A template for basic `map()` usage:

```{r eval = FALSE}
map(YOUR_LIST, YOUR_FUNCTION)
```

Below we explore these useful features of `purrr::map()` and friends:

  * Shortcuts for `YOUR_FUNCTION` when you want to extract list elements by name or position
  * Simplify and specify the type of output via `map_chr()`, `map_lgl()`, etc.
  
This is where you begin to see the differences between `purrr::map()` and `base::lapply()`.

### Name and position shortcuts

Who are these Game of Thrones characters?

We want the elements with name "name", so we do this (we restrict to the first few elements purely to conserve space):

```{r}
map(got_chars[1:4], "name")
```

We are exploiting one of purrr's most useful features: a shortcut to create a function that extracts an element based on its name.

A companion shortcut is used if you provide a positive integer to `map()`. This creates a function that extracts an element based on position.

The 3rd element of each character's list is his or her name and we get them like so:

```{r}
map(got_chars[5:8], 3)
```

To recap, here are two shortcuts for making the `.f` function that `map()` will apply:

  * provide "TEXT" to extract the element named "TEXT"
    - equivalent to `function(x) x[["TEXT"]]`
  * provide `i` to extract the `i`-th element
    - equivalent to `function(x) x[[i]]`

You will frequently see `map()` used together with [the pipe `%>%`](http://r4ds.had.co.nz/pipes.html). These calls produce the same result as the above.

```{r eval = FALSE}
got_chars %>% 
  map("name")
got_chars %>% 
  map(3)
```

#### Exercises

1. Use `names()` to inspect the names of the list elements associated with a single character. What is the index or position of the `playedBy` element? Use the character and position shortcuts to extract the `playedBy` elements for all characters.
1. What happens if you use the character shortcut with a string that does not appear in the lists' names?
1. What happens if you use the position shortcut with a number greater than the length of the lists?
1. What if these shortcuts did not exist? Write a function that takes a list and a string as input and returns the list element that bears the name in the string. Apply this to `got_chars` via `map()`. Do you get the same result as with the shortcut? Reflect on code length and readability.
1. Write another function that takes a list and an integer as input and returns the list element at that position. Apply this to `got_chars` via `map()`. How does this result and process compare with the shortcut?

### Type-specific map

`map()` always returns a list, even if all the elements have the same flavor and are of length one. But in that case, you might prefer a simpler object: **an atomic vector**.

If you expect `map()` to return output that can be turned into an atomic vector, it is best to use a type-specific variant of `map()`. This is more efficient than using `map()` to get a list and then simplifying the result in a second step. Also purrr will alert you to any problems, i.e. if one or more inputs has the wrong type or length. This is the [increased rigor about type alluded to in the section about coercion](bk00_vectors-and-lists.html#coercion).

Our current examples are suitable for demonstrating `map_chr()`, since the requested elements are always character.

```{r}
map_chr(got_chars[9:12], "name")
map_chr(got_chars[13:16], 3)
```

Besides `map_chr()`, there are other variants of `map()`, with the target type conveyed by the name:

  * `map_lgl()`, `map_int()`, `map_dbl()`
  
#### Exercises

1. For each character, the second element is named "id". This is the character's id in the [API Of Ice And Fire](https://anapioficeandfire.com). Use a type-specific form of `map()` and an extraction shortcut to extract these ids into an integer vector.
1. Use your list inspection strategies to find the list element that is logical. There is one! Use a type-specific form of `map()` and an extraction shortcut to extract these values for all characters into a logical vector.

### Extract multiple values

What if you want to retrieve multiple elements? Such as the character's name and culture? First, recall how we do this with the list for a single user:

```{r}
got_chars[[3]][c("name", "culture", "gender", "born")]
```

We use single square bracket indexing and a character vector to index by name. How will we ram this into the `map()` framework? To paraphrase Chambers, ["everything that happens in R is a function call"](http://adv-r.had.co.nz/Functions.html#all-calls) and indexing with `[` is no exception.

It feels (and maybe looks) weird, but we can map `[` just like any other function. Recall `map()` usage:

```{r eval = FALSE}
map(.x, .f, ...)
```

The function `.f` will be `[`. And we finally get to use `...`! This is where we pass the character vector of the names of our desired elements. We inspect the result for two characters.

```{r}
x <- map(got_chars, `[`, c("name", "culture", "gender", "born"))
str(x[16:17])
```

Some people find this ugly and might prefer the `extract()` function from magrittr.

```{r}
library(magrittr)
x <- map(got_chars, extract, c("name", "culture", "gender", "born"))
str(x[18:19])
```

#### Exercises

1. Use your list inspection skills to determine the position of the elements named "name", "gender", "culture", "born", and "died". Map `[` or `magrittr::extract()` over users, requesting these elements by position instead of name.

### Data frame output

We just learned how to extract multiple elements per user by mapping `[`. But, since `[` is non-simplifying, each user's elements are returned in a list. And, as it must, `map()` itself returns list. We've traded one recursive list for another recursive list, albeit a slightly less complicated one.

How can we "stack up" these results row-wise, i.e. one row per user and variables for "name", "gender", etc.? A data frame would be the perfect data structure for this information.

This is what `map_dfr()` is for.

```{r}
map_dfr(got_chars, extract, c("name", "culture", "gender", "id", "born", "alive"))
```

Finally! A data frame! Hallelujah!

Notice how the variables have been automatically type converted. It's a beautiful thing. Until it's not. When programming, it is safer, but more cumbersome, to explicitly specify type and build your data frame the usual way.

```{r}
library(tibble)
got_chars %>% {
  tibble(
       name = map_chr(., "name"),
    culture = map_chr(., "culture"),
     gender = map_chr(., "gender"),       
         id = map_int(., "id"),
       born = map_chr(., "born"),
      alive = map_lgl(., "alive")
  )
}
```

*Syntax notes: The dot `.` above is the placeholder for the primary input: `got_chars` in this case. The curly braces `{}` surrounding the `tibble()` call prevent `got_chars` from being passed in as the first argument of `tibble()`.*

#### Exercises

1. Use `map_dfr()` to create the same data frame as above, but indexing with a vector of positive integers instead of names.