# List columns
```{r}
# Libraries
library(tidyverse)
library(dcldata)
```
Recall that tibbles are lists of vectors.
```{r echo=FALSE}
knitr::include_graphics("images/list-columns/tibble.png", dpi = image_dpi)
```
Usually, these vectors are _atomic vectors_, so the elements in the columns are single values, like "a" or 1.
```{r echo=FALSE}
knitr::include_graphics("images/list-columns/tibble-atomic.png", dpi = image_dpi)
```
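You can check both claims on a small example. The sketch below uses a made-up tibble, `df`, which is not part of this chapter's data.

```{r}
library(tidyverse)

# A hypothetical two-column tibble, for illustration only
df <- tibble(x = c("a", "b", "c"), y = c(1, 2, 3))

# The tibble itself is a list...
typeof(df)

# ...and each of its columns is an atomic vector
typeof(df$x)
typeof(df$y)
is.atomic(df$x)
```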
Tibbles can also have columns that are lists. These columns are (appropriately) called _list columns_.
```{r echo=FALSE}
knitr::include_graphics(
  "images/list-columns/tibble-list-col.png", dpi = image_dpi
)
```
List columns are more flexible than normal, atomic vector columns. Lists can contain anything, so a list column can be made up of atomic vectors, other lists, tibbles, etc.
```{r echo=FALSE}
knitr::include_graphics(
  "images/list-columns/tibble-list-col-vectors.png",
  dpi = image_dpi
)
```
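One way to see this flexibility is to build a list column by hand, by passing a list to `tibble()`. This is only a minimal sketch with invented values. The names `mixed` and `contents` are hypothetical.

```{r}
library(tidyverse)

mixed <-
  tibble(
    id = 1:3,
    contents = list(
      c(1.5, 2.5),             # an atomic vector
      list("a", TRUE),         # another list
      tibble(z = c("x", "y"))  # a tibble
    )
  )

# The column is a list, and its elements have different classes
typeof(mixed$contents)
map_chr(mixed$contents, ~ class(.x)[[1]])
```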
As you'll see, this can be a useful way to store data. In this chapter, you'll learn how to create list columns, how to turn list columns back into normal columns, and how to manipulate list columns.
## Creating
Typically, you'll create list columns by manipulating an existing tibble. There are three primary ways to create list columns:
* `nest()`
* `summarize()` and `list()`
* `rowwise()` and `mutate()`
### `nest()`
`countries` is a simplified version of `dcldata::gm_countries`, which contains Gapminder data on 197 countries.
```{r}
countries <-
  gm_countries %>%
  select(name, region_gm4, un_status, un_admission, income_wb_2017)

countries
```
The tidyr function `nest()` creates list columns of tibbles.
```{r echo=FALSE}
knitr::include_graphics(
  "images/list-columns/tibble-list-col-tibbles.png", dpi = image_dpi
)
```
Pass `nest()` the names of the columns to put into each individual tibble. `nest()` will create one row for each unique combination of values in the remaining variables. For example, say we select just two columns from `countries`
```{r}
countries %>%
  select(region_gm4, name)
```
and then nest `name`.
```{r}
regions <-
  countries %>%
  select(region_gm4, name) %>%
  nest(countries = name)

regions
```
`nest()` created one tibble for each `region_gm4`.
```{r echo=FALSE}
knitr::include_graphics(
  "images/list-columns/regions.png",
  dpi = image_dpi
)
```
Each of these tibbles contains all the countries that belong to that region.
```{r}
regions$countries[[1]]
```
The entire column is a list.
```{r}
typeof(regions$countries)
```
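To see the mechanics on data small enough to check by eye, here is a minimal sketch using a made-up tibble that is not part of `dcldata`. The names `pets`, `owner`, `pet`, and `by_owner` are invented for illustration.

```{r}
library(tidyverse)

# Hypothetical data: three pets belonging to two owners
pets <-
  tibble(
    owner = c("Ana", "Ana", "Ben"),
    pet = c("cat", "dog", "fish")
  )

# Nest `pet`; the remaining variable, `owner`, defines the rows
by_owner <-
  pets %>%
  nest(pets = pet)

# One row per owner, with a list column of tibbles
by_owner

# The tibble for the first owner
by_owner$pets[[1]]
```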
If we nest multiple variables, the individual tibbles will have multiple columns.
```{r}
regions_data <-
  countries %>%
  nest(data = c(name, un_status, un_admission, income_wb_2017))

regions_data
```
```{r}
regions_data$data[[1]]
```
You can specify columns to nest using the same syntax as `select()`.
```{r}
countries %>%
  nest(data = !region_gm4)
```
You can also create multiple list columns at once.
```{r}
countries %>%
  nest(countries = name, data = c(name, contains("un"), income_wb_2017))
```
### `summarize()` and `list()`
You've used `group_by()` and `summarize()` to collapse groups into single rows. We can also use `summarize()` to create a list column, where each element is a vector, list, or tibble.
If you supply `list()` with multiple atomic vectors, it will create a list of atomic vectors.
```{r}
list(c(1, 2, 3), c("a", "b", "c"))
```
We can use `summarize()` and `list()` to create a list of atomic vectors where each vector corresponds to one `region_gm4`. For example, the following creates a list column of countries.
```{r}
countries_collapsed <-
  countries %>%
  group_by(region_gm4) %>%
  summarize(countries = list(name))

countries_collapsed
```
The `countries` column is similar to the one created earlier with `nest()`, except each element is an atomic vector, not a tibble.
```{r}
typeof(countries_collapsed$countries[[1]])
```
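What if we want to manipulate each vector before creating the list column? For example, say we want to arrange all country names alphabetically. The following doesn't give us one row per region. `sort(name)` returns more than one value per group, so, depending on your version of dplyr, `summarize()` either throws an error or returns one row per country.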
```{r error=TRUE}
countries %>%
  group_by(region_gm4) %>%
  summarize(countries = sort(name))
```
Instead, wrap the sorted vector in `list()` so that each region's names are collected into a single list element.
```{r}
countries %>%
  group_by(region_gm4) %>%
  summarize(countries = list(sort(name)))
```
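The same pattern is easier to verify on a tiny dataset. The following sketch assumes a made-up tibble, `pets`. All of its names and values are hypothetical.

```{r}
library(tidyverse)

# Hypothetical data, deliberately out of alphabetical order
pets <-
  tibble(
    owner = c("Ana", "Ana", "Ben"),
    pet = c("dog", "cat", "fish")
  )

pets_collapsed <-
  pets %>%
  group_by(owner) %>%
  # list() collects each owner's sorted pets into a single element
  summarize(pets = list(sort(pet)))

pets_collapsed

# Each element is an atomic vector, not a tibble
pets_collapsed$pets[[1]]
typeof(pets_collapsed$pets[[1]])
```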
Here's another example, which stores in `countries` only the countries whose names begin with "A".
```{r}
a_countries <-
  countries %>%
  group_by(region_gm4) %>%
  summarize(countries = list(str_subset(name, "^A")))

a_countries

a_countries$countries[[1]]
```
### `mutate()`
The third way to create a list column is to use `rowwise()` and `mutate()`. For example, the following creates a list column where the element for each country is a vector of random numbers with one number per character in the country's name.
```{r}
countries %>%
  select(name) %>%
  rowwise() %>%
  mutate(random = list(rnorm(n = str_length(name)))) %>%
  ungroup()
```
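purrr's `map()` offers another way to build the same kind of column without `rowwise()`, because `map()` already returns a list with one element per input. The sketch below assumes a made-up tibble, `people`. It is an illustration of the idea rather than the approach used elsewhere in this chapter.

```{r}
library(tidyverse)

# Hypothetical data for illustration
people <- tibble(name = c("Al", "Bea", "Cleo"))

people_random <-
  people %>%
  # map() iterates over the name lengths and returns a list
  mutate(random = map(str_length(name), ~ rnorm(n = .x)))

people_random

# Each vector has as many elements as its name has characters
lengths(people_random$random)
```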
## Unnesting
To transform a list column into normal columns, use `unnest()`. Here's our tibble with a list column of country names.
```{r}
regions
```
Supply the `cols` argument of `unnest()` with the names of the columns to unnest.
```{r}
regions %>%
  unnest(cols = countries)
```
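`unnest()` works on list columns of tibbles and on list columns of atomic vectors. As a rough check, the sketch below nests and collapses a made-up tibble, `pets`, and then unnests both results. All names and values are hypothetical.

```{r}
library(tidyverse)

# Hypothetical data for illustration
pets <-
  tibble(
    owner = c("Ana", "Ana", "Ben"),
    pet = c("cat", "dog", "fish")
  )

# A list column of tibbles, then back to regular columns
from_tibbles <-
  pets %>%
  nest(pets = pet) %>%
  unnest(cols = pets)

# A list column of atomic vectors, then back to regular columns
from_vectors <-
  pets %>%
  group_by(owner) %>%
  summarize(pet = list(pet)) %>%
  unnest(cols = pet)

from_tibbles
from_vectors
```

In both cases, the result has one row per pet, as in the original tibble.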
## Manipulating
To manipulate list columns, you'll often find it helpful to use row-wise operations. For example, say we want to find the number of countries in each region. Here's `regions` again.
```{r}
regions
```
We can't call `length()` directly on `countries`, because we'll just get the length of the entire column.
```{r}
regions %>%
  mutate(num_countries = length(countries))
```
Instead, we need to iterate over each element or row of `countries` separately.
```{r}
regions %>%
  rowwise() %>%
  mutate(num_countries = nrow(countries)) %>%
  ungroup()
```
Note that we need to use `nrow()` because each element of `countries` is actually a tibble.
This code finds the proportion of a region's country names that end in "a".
```{r}
regions %>%
  rowwise() %>%
  mutate(
    ends_in_a = sum(str_detect(countries$name, "a$")) / nrow(countries)
  ) %>%
  ungroup() %>%
  arrange(desc(ends_in_a))
```
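purrr's typed `map_*()` functions are an alternative to `rowwise()` for this kind of per-element work, since they apply a function to each element of the list column and return an atomic vector. The following is a minimal sketch on a made-up nested tibble, `by_owner`. The data and column names are hypothetical.

```{r}
library(tidyverse)

# Hypothetical nested data: one row per owner, with a tibble of pets
by_owner <-
  tibble(
    owner = c("Ana", "Ana", "Ben"),
    pet = c("cat", "dog", "fish")
  ) %>%
  nest(pets = pet)

pet_summary <-
  by_owner %>%
  mutate(
    # Number of rows in each nested tibble
    num_pets = map_int(pets, nrow),
    # Proportion of each owner's pet names that end in "g"
    prop_ends_in_g = map_dbl(pets, ~ mean(str_detect(.x$pet, "g$")))
  )

pet_summary
```

With this approach there is no grouping to remove afterwards, so no `ungroup()` is needed.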