---
title: "Exploring and Extracting Data in Lists"
author: "Jenny Bryan, adapted by Clarke Iakovakis"
output:
  html_document:
    df_print: paged
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 2
---
```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", error = TRUE, cache = TRUE)
library(png)
library(repurrrsive)
```
## Attribution
This lesson was created and is copyrighted by Jenny Bryan, available at <https://jennybc.github.io/purrr-tutorial/ls00_inspect-explore.html> and distributed under the terms of a [Creative Commons BY-NC 4.0 License](http://creativecommons.org/licenses/by-nc/4.0/). It has been adapted by Clarke Iakovakis and the adaptation is likewise distributed under a [Creative Commons BY-NC 4.0 License](http://creativecommons.org/licenses/by-nc/4.0/).
```{r,fig.height=1,echo=FALSE}
cc <- readPNG(file.path("./images/cc bync.png"))
grid::grid.raster(cc)
```
## Binder link to this notebook:
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ciakovx/ciakovx.github.io/master?filepath=jennybc_lists_lesson.ipynb)
<https://mybinder.org/v2/gh/ciakovx/ciakovx.github.io/master?filepath=jennybc_lists_lesson.ipynb>
## Load packages
Load purrr and repurrrsive, which contains recursive list examples.
```{r}
library(purrr)
library(repurrrsive)
```
## Inspect and explore
List inspection is very important and also fairly miserable. Before you can apply a function to every element of a list, you'd better understand the list!
You need to develop a toolkit for list inspection. Be on the lookout for:
* What is the length of the list?
* Are the components homogeneous, i.e. do they have the same overall structure, albeit containing different data?
* Note the length, names, and types of the constituent objects.
> I have no idea what's in this list or what its structure is! Please send help.
Understand this is **situation normal**, especially when your list comes from querying a poorly documented API. This is often true even when your list has been created completely within R. How many of us perfectly understand the structure of a fitted linear model object? You just have to embark on a voyage of discovery and figure out what's in there. Happy trails.
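
As a minimal sketch of that checklist, the chunk below uses only base R on `got_chars` from repurrrsive (a list introduced later in this lesson). The object names `lens`, `same_names`, and `types` are illustrative choices, not part of any package.

```{r}
library(repurrrsive)

# 1. What is the length of the list?
length(got_chars)

# 2. Are the components homogeneous? Compare every component's
#    length and names against the first component.
lens <- lengths(got_chars)
unique(lens)
same_names <- vapply(
  got_chars,
  function(x) identical(names(x), names(got_chars[[1]])),
  logical(1)
)
all(same_names)

# 3. What are the types of the constituent objects?
types <- vapply(got_chars[[1]], typeof, character(1))
types
```

If `unique(lens)` returns a single value and `all(same_names)` is `TRUE`, it is reasonable to study one component closely and expect the rest to look alike.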
### Indexing, review
Remember, there are 3 ways to pull elements out of a list:
* The `$` operator. Extracts a single element by name. Name can be unquoted, if syntactic.
```{r}
x <- list(a = "a", b = 2)
x$a
x$b
```
* `[[` a.k.a. double square bracket. Extracts a single element by name or position. Name must be quoted, if provided directly. Name or position can also be stored in a variable.
```{r}
x <- list(a = "a", b = 2)
x[["a"]]
x[[2]]
nm <- "a"
x[[nm]]
i <- 2
x[[i]]
```
* `[` a.k.a. single square bracket. Regular vector indexing. For a list input, this always returns a list!
```{r}
x <- list(a = "a", b = 2)
x["a"]
x[c("a", "b")]
x[c(FALSE, TRUE)]
```
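
The practical difference between the three forms is what comes back. The short check below, using the same `x` as above, is one way to confirm it for yourself.

```{r}
x <- list(a = "a", b = 2)

# Single bracket keeps the list wrapper, even for one element.
is.list(x["b"])

# Double bracket and `$` return the element itself.
is.numeric(x[["b"]])
identical(x[["b"]], x$b)

# Unwrapping the one-element list gives the same thing as `[[`.
identical(x["b"][[1]], x[["b"]])
```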
### `str()`
`str()` can help with basic list inspection, although it's still rather frustrating. Learn to love the `max.level` and `list.len` arguments. You can use them to keep the output of `str()` down to a manageable volume.
Once you begin to suspect or trust that your list is homogeneous, i.e. consists of sub-lists with similar structure, it's often a good idea to do an in-depth study of a single element. In general, remember you can combine list inspection via `str(..., list.len = x, max.level = y)` with single `[` and double `[[` square bracket indexing.
The repurrrsive package provides examples of lists. We explore them below, to lay the groundwork for other lessons, and to demonstrate list inspection strategies.
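
Here is a small sketch of combining `str()` arguments with indexing, using `got_chars` from repurrrsive. The specific values of `max.level` and `list.len` and the field names chosen are arbitrary, picked only to keep the output short.

```{r}
library(repurrrsive)

# Top level only: show the first 4 components without descending into them.
str(got_chars, max.level = 1, list.len = 4)

# Drill into one component with `[[`, then trim it with `[` before printing.
str(got_chars[[1]][c("name", "culture", "alive")])

# Or keep the whole component but cap the number of fields shown.
str(got_chars[[1]], list.len = 5)
```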
### listviewer and RStudio's Object Explorer
The RStudio IDE (v1.1 and higher) offers an [Object Explorer](https://blog.rstudio.com/2017/08/22/rstudio-v1-1-preview-object-explorer/) that provides interactive inspection and code generation tools for hierarchical objects, such as lists. You can invoke it via the GUI or in code as `View(YOUR_UGLY_LIST)`.
However, that won't help you expose list exploration in something like this website. I am using the [listviewer](https://CRAN.R-project.org/package=listviewer) package to do this below. It allows you to expose list exploration in a rendered `.Rmd` document.
To replicate this experience locally, call, e.g., `listviewer::jsonedit(got_chars, mode = "view")`.
```{r}
library(listviewer)
```
## Wes Anderson color palettes
`wesanderson` is a simple list containing color palettes from the [wesanderson package](https://cran.r-project.org/package=wesanderson). Each component is a palette, named after a movie, and contains a character vector of colors as hexadecimal triplets.
```{r}
str(wesanderson)
```
### Explore `wesanderson`
```{r echo = FALSE}
jsonedit(wesanderson, mode = "view", elementId = "wesanderson")
```
You can get a similar experience in RStudio via `View(wesanderson)`.
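
Because each palette is a plain character vector, ordinary indexing is enough to work with `wesanderson`. The sketch below uses base R only; the names `n_colors`, `first_palette`, and `first_colors` are illustrative.

```{r}
library(repurrrsive)

# Number of colors in each palette.
n_colors <- lengths(wesanderson)
n_colors

# One palette, pulled out by position with `[[`.
first_palette <- wesanderson[[1]]
first_palette

# The first color of every palette, as a named character vector.
first_colors <- vapply(wesanderson, function(p) p[[1]], character(1))
first_colors
```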
## Game of Thrones POV characters
`got_chars` is a list with information on the `r length(got_chars)` point-of-view characters from the first five books in the *A Song of Ice and Fire* series by George R. R. Martin, retrieved from [An API Of Ice And Fire](https://anapioficeandfire.com). Each component corresponds to one character and contains `r length(got_chars[[1]])` components, which are named atomic vectors of various lengths and types.
```{r}
str(got_chars, list.len = 3)
str(got_chars[[1]], list.len = 8)
```
### Explore `got_chars`
```{r echo = FALSE}
jsonedit(number_unnamed(got_chars), mode = "view", elementId = "got_chars")
```
You can get a similar experience in RStudio via `View(got_chars)`.
## GitHub users and repositories
`gh_users` is a list with information on 6 GitHub users. `gh_repos` is a nested list, also of length 6, where each component is another list with information on up to 30 of that user's repositories. Retrieved from the [GitHub API](https://developer.github.com/v3/).
```{r}
str(gh_users, max.level = 1)
```
### Explore `gh_users`
```{r echo = FALSE}
jsonedit(number_unnamed(gh_users), mode = "view", elementId = "gh_users")
```
You can get a similar experience in RStudio via `View(gh_users)`.
### Explore `gh_repos`
```{r echo = FALSE}
jsonedit(number_unnamed(gh_repos), mode = "view", elementId = "gh_repos")
```
You can get a similar experience in RStudio via `View(gh_repos)`.
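
To make the extra level of nesting in `gh_repos` concrete, the sketch below walks from the top of the list down to a single field using chained `[[`. It assumes the field names `full_name` and `login`, which are present in the data shipped with repurrrsive.

```{r}
library(repurrrsive)

# One component per user at the top level of both lists.
length(gh_users)
length(gh_repos)

# Each component of gh_repos is itself a list of repositories.
lengths(gh_repos)

# Chain `[[` to reach a single field: user 1, repository 1, full name.
gh_repos[[1]][[1]][["full_name"]]

# The matching user record is a flat named list.
gh_users[[1]][["login"]]
```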
#### Exercises
1. Read the documentation on `str()`. What does `max.level` control? Apply `str()` to `wesanderson` and/or `got_chars` and experiment with `max.level = 0`, `max.level = 1`, and `max.level = 2`. Which will you use in practice with deeply nested lists?
1. What does the `list.len` argument of `str()` control? What is its default value? Call `str()` on `got_chars` and then on a single component of `got_chars` with `list.len` set to a value much smaller than the default. What range of values do you think you'll use in real life?
1. Call `str()` on `got_chars`, specifying both `max.level` and `list.len`.
1. Call `str()` on the first element of `got_chars`, i.e. the first Game of Thrones character. Use what you've learned to pick an appropriate combination of `max.level` and `list.len`.
## Vectorized and "list-ized" operations
Recall that many operations "just work" in a vectorized fashion in R:
```{r}
(3:5) ^ 2
sqrt(c(9, 16, 25))
```
Through the magic of R, the operations "raise to the power of 2" and "take the square root" were applied to each individual element of the numeric vector input. Someone -- but not you! -- has written a `for()` loop:
```{r eval = FALSE}
for (i in 1:n) {
  output[[i]] <- f(input[[i]])
}
```
Automatic vectorization is possible because our input is an atomic vector: the individual atoms are always of length one, always of uniform type.
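
For illustration, here is a runnable version of that loop for the square root example. It is only a sketch of what vectorization spares you from writing, not how R implements `sqrt()` internally.

```{r}
input <- c(9, 16, 25)
n <- length(input)

# Pre-allocate the output, then fill it one element at a time.
output <- vector("double", n)
for (i in seq_len(n)) {
  output[[i]] <- sqrt(input[[i]])
}
output

# The vectorized call gives the same answer in one step.
identical(output, sqrt(input))
```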
What if the input is a list? You have to be more intentional to apply a function `f()` to each element of a list, i.e. to "list-ize" computation. This makes sense because the data structure itself does not guarantee that it makes any sense at all to apply a common function `f()` to each element of the list. You must guarantee that.
`purrr::map()` applies a function to each element of a list. The [closest base R function](bk01_base-functions.html) is `lapply()`. Here's how the square root example above looks with `map()`. The input here is still an atomic vector, which `map()` processes element by element just as it would a list, and the output is always a list.
```{r}
map(c(9, 16, 25), sqrt)
```
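
As a quick check on the comparison with `lapply()`, the sketch below applies both to a genuine list. For this simple case the two results are identical; the differences show up in the shortcuts and type-specific variants.

```{r}
library(purrr)

input <- list(9, 16, 25)

via_map <- map(input, sqrt)
via_lapply <- lapply(input, sqrt)

# Both return a list with one element per input element.
str(via_map)
identical(via_map, via_lapply)
```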
A template for basic `map()` usage:
```{r eval = FALSE}
map(YOUR_LIST, YOUR_FUNCTION)
```
Below we explore these useful features of `purrr::map()` and friends:
* Shortcuts for `YOUR_FUNCTION` when you want to extract list elements by name or position
* Simplify and specify the type of output via `map_chr()`, `map_lgl()`, etc.
This is where you begin to see the differences between `purrr::map()` and `base::lapply()`.
### Name and position shortcuts
Who are these Game of Thrones characters?
We want the elements with name "name", so we do this (we restrict to the first few elements purely to conserve space):
```{r}
map(got_chars[1:4], "name")
```
We are exploiting one of purrr's most useful features: a shortcut to create a function that extracts an element based on its name.
A companion shortcut is used if you provide a positive integer to `map()`. This creates a function that extracts an element based on position.
The 3rd element of each character's list is his or her name and we get them like so:
```{r}
map(got_chars[5:8], 3)
```
To recap, here are two shortcuts for making the `.f` function that `map()` will apply:

* provide "TEXT" to extract the element named "TEXT"
    - equivalent to `function(x) x[["TEXT"]]`
* provide `i` to extract the `i`-th element
    - equivalent to `function(x) x[[i]]`
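
The equivalences in this recap can be checked on a small example. The list `pets` below is hypothetical toy data made up for this sketch; it is not part of repurrrsive.

```{r}
library(purrr)

# Hypothetical toy data: two records with the same structure.
pets <- list(
  list(name = "Rex",    legs = 4L),
  list(name = "Tweety", legs = 2L)
)

# Name shortcut versus the explicit function it stands for.
by_name    <- map(pets, "name")
by_name_fn <- map(pets, function(x) x[["name"]])
identical(by_name, by_name_fn)

# Position shortcut versus the explicit function it stands for.
by_pos    <- map(pets, 2)
by_pos_fn <- map(pets, function(x) x[[2]])
identical(by_pos, by_pos_fn)
```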
You will frequently see `map()` used together with [the pipe `%>%`](http://r4ds.had.co.nz/pipes.html). These calls extract the same elements as above, but for all characters rather than a subset.
```{r eval = FALSE}
got_chars %>%
  map("name")
got_chars %>%
  map(3)
```
#### Exercises
1. Use `names()` to inspect the names of the list elements associated with a single character. What is the index or position of the `playedBy` element? Use the character and position shortcuts to extract the `playedBy` elements for all characters.
1. What happens if you use the character shortcut with a string that does not appear in the lists' names?
1. What happens if you use the position shortcut with a number greater than the length of the lists?
1. What if these shortcuts did not exist? Write a function that takes a list and a string as input and returns the list element that bears the name in the string. Apply this to `got_chars` via `map()`. Do you get the same result as with the shortcut? Reflect on code length and readability.
1. Write another function that takes a list and an integer as input and returns the list element at that position. Apply this to `got_chars` via `map()`. How does this result and process compare with the shortcut?
### Type-specific map
`map()` always returns a list, even if all the elements have the same flavor and are of length one. But in that case, you might prefer a simpler object: **an atomic vector**.
If you expect `map()` to return output that can be turned into an atomic vector, it is best to use a type-specific variant of `map()`. This is more efficient than using `map()` to get a list and then simplifying the result in a second step. Also, purrr will alert you to any problems, i.e. if one or more inputs has the wrong type or length. This is the [increased rigor about type alluded to in the section about coercion](bk00_vectors-and-lists.html#coercion).
Our current examples are suitable for demonstrating `map_chr()`, since the requested elements are always character.
```{r}
map_chr(got_chars[9:12], "name")
map_chr(got_chars[13:16], 3)
```
Besides `map_chr()`, there are other variants of `map()`, with the target type conveyed by the name:
* `map_lgl()`, `map_int()`, `map_dbl()`
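
To see both the simplification and the "alert" behavior, here is a sketch on hypothetical toy data (the `pets` list is invented for this example). The error is caught with `tryCatch()` only so the chunk runs to completion; the replacement message is ours, not purrr's.

```{r}
library(purrr)

# Hypothetical toy data, not from repurrrsive.
pets <- list(
  list(name = "Rex",    legs = 4L, tags = c("dog", "loud")),
  list(name = "Tweety", legs = 2L, tags = "bird")
)

map_chr(pets, "name")  # character vector
map_int(pets, "legs")  # integer vector

# "tags" is not always of length one, so map_chr() refuses to simplify it.
tags <- tryCatch(
  map_chr(pets, "tags"),
  error = function(e) "map_chr() raised an error"
)
tags
```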
#### Exercises
1. For each character, the second element is named "id". This is the character's id in the [API Of Ice And Fire](https://anapioficeandfire.com). Use a type-specific form of `map()` and an extraction shortcut to extract these ids into an integer vector.
1. Use your list inspection strategies to find the list element that is logical. There is one! Use a type-specific form of `map()` and an extraction shortcut to extract these values for all characters into a logical vector.
### Extract multiple values
What if you want to retrieve multiple elements? Such as the character's name and culture? First, recall how we do this with the list for a single character:
```{r}
got_chars[[3]][c("name", "culture", "gender", "born")]
```
We use single square bracket indexing and a character vector to index by name. How will we ram this into the `map()` framework? To paraphrase Chambers, ["everything that happens in R is a function call"](http://adv-r.had.co.nz/Functions.html#all-calls) and indexing with `[` is no exception.
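
If the idea of indexing as a function call is unfamiliar, this small sketch shows the bracket operators being called by name, with backticks, on a toy list.

```{r}
x <- list(a = "a", b = 2, c = TRUE)

# Operator form and function-call form of single bracket indexing.
identical(x[c("a", "b")], `[`(x, c("a", "b")))

# The same holds for double bracket indexing.
identical(x[["b"]], `[[`(x, "b"))
```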
It feels (and maybe looks) weird, but we can map `[` just like any other function. Recall `map()` usage:
```{r eval = FALSE}
map(.x, .f, ...)
```
The function `.f` will be `[`. And we finally get to use `...`! This is where we pass the character vector of the names of our desired elements. We inspect the result for two characters.
```{r}
x <- map(got_chars, `[`, c("name", "culture", "gender", "born"))
str(x[16:17])
```
Some people find this ugly and might prefer the `extract()` function from magrittr.
```{r}
library(magrittr)
x <- map(got_chars, extract, c("name", "culture", "gender", "born"))
str(x[18:19])
```
#### Exercises
1. Use your list inspection skills to determine the position of the elements named "name", "gender", "culture", "born", and "died". Map `[` or `magrittr::extract()` over `got_chars`, requesting these elements by position instead of name.
### Data frame output
We just learned how to extract multiple elements per character by mapping `[`. But, since `[` is non-simplifying, each character's elements are returned in a list. And, as it must, `map()` itself returns a list. We've traded one recursive list for another recursive list, albeit a slightly less complicated one.
How can we "stack up" these results row-wise, i.e. one row per character and variables for "name", "gender", etc.? A data frame would be the perfect data structure for this information.
This is what `map_dfr()` is for.
```{r}
map_dfr(got_chars, extract, c("name", "culture", "gender", "id", "born", "alive"))
```
Finally! A data frame! Hallelujah!
Notice how the variables have been automatically type converted. It's a beautiful thing. Until it's not. When programming, it is safer, but more cumbersome, to explicitly specify type and build your data frame the usual way.
```{r}
library(tibble)
got_chars %>% {
  tibble(
    name = map_chr(., "name"),
    culture = map_chr(., "culture"),
    gender = map_chr(., "gender"),
    id = map_int(., "id"),
    born = map_chr(., "born"),
    alive = map_lgl(., "alive")
  )
}
```
*Syntax notes: The dot `.` above is the placeholder for the primary input: `got_chars` in this case. The curly braces `{}` surrounding the `tibble()` call prevent `got_chars` from being passed in as the first argument of `tibble()`.*
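
To make the syntax notes concrete, the sketch below builds the same small tibble with and without the pipe. The `pets` list is hypothetical toy data invented for this example; `%>%` is available here because purrr re-exports it.

```{r}
library(purrr)   # also provides the pipe %>%
library(tibble)

# Hypothetical toy data, not from repurrrsive.
pets <- list(
  list(name = "Rex",    legs = 4L),
  list(name = "Tweety", legs = 2L)
)

# Piped: braces stop `pets` from becoming tibble()'s first argument,
# and `.` stands in for `pets` wherever it appears.
piped <- pets %>% {
  tibble(
    name = map_chr(., "name"),
    legs = map_int(., "legs")
  )
}

# Unpiped: the same call with `pets` written out explicitly.
unpiped <- tibble(
  name = map_chr(pets, "name"),
  legs = map_int(pets, "legs")
)

all.equal(piped, unpiped)
```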
#### Exercises
1. Use `map_dfr()` to create the same data frame as above, but indexing with a vector of positive integers instead of names.