# Tidy evaluation {#action-tidy}
```{r, include = FALSE}
source("common.R")
```
If you are using the tidyverse from Shiny, you will almost certainly encounter the challenge of programming with tidy evaluation. Tidy evaluation is the technique that allows you to refer to variables within a data frame, without having to think about it, or do anything special. That's what makes code like this work:
```{r, eval = FALSE}
diamonds %>% filter(x == z)
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex()
```
First we'll go over the basic motivation, and the key idea that makes tidy evaluation more convenient for data analysis and less convenient for programming.
This chapter will focus on the combination of tidy evaluation with Shiny. If you want to learn more about the general challenges of using tidy evaluation in a package, see
<http://ggplot2.tidyverse.org/dev/articles/ggplot2-in-packages.html> (or the dplyr equivalent, when it exists).
As well as Shiny, this chapter will use both ggplot2 and dplyr to show the main use cases of tidy evaluation and Shiny together.
```{r setup}
library(shiny)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
```
## Motivation {#tidy-motivation}
The key idea of tidy evaluation is that it blurs the line between two meanings of "variable":
* Environment variables (env-vars for short) are programming variables.
  Formally, they are bindings between names and values, usually created by `<-`.
* Data frame variables (data-vars for short) are "statistical" variables
  that live inside a data frame. In base R you usually access them with `$` and
  `[[`.
Take this piece of code:
```{r}
df <- data.frame(x = runif(3), y = runif(3))
df$x
```
It creates an env-var called `df` that contains two data-vars, `x` and `y`. Then it extracts the data-var `x` out of the data frame `df` using `$`.
Tidy evaluation makes it easier to write data analysis code because it blurs the distinction between the two. In most (but not all) base R functions you need to refer to a data-var with `$`, leading to code that repeats the name of the data frame many times:
```{r}
diamonds[diamonds$x == 0 | diamonds$y == 0, ]
```
The dplyr equivalent, `filter()`, uses tidy evaluation to allow you to refer to a data-var as if it were an env-var:
```{r}
filter(diamonds, x == 0 | y == 0)
```
(dplyr's `filter()` is inspired by base R's `subset()`. `subset()` uses an ad hoc approach to reach the same result as tidy evaluation, so unfortunately the techniques in this chapter don't apply to it.)
You usually use these verbs purely with data-vars, but they work equally well with env-vars:
```{r}
min_carat <- 1
diamonds %>% filter(carat > min_carat)
```
I think this blurring of the meaning of "variable" is a really nice feature for interactive data analysis, because it allows you to refer to data-vars as is, without any prefix. It also seems to be fairly intuitive, since many newer R users will attempt to write `diamonds[x == 0 | y == 0, ]`. But when you start to program with these tools, you're going to have to grapple with the distinction. This will be hard because you've never had to think about it before, so it'll take a while for your brain to learn these new concepts and categories. However, once you've teased apart the idea of "variable" into data-variable and env-variable, I think you'll find it fairly easy to use.
## Solutions
### Tidy evaluation in Shiny apps
The key to resolving this ambiguity is to make use of two __pronouns__ that are built into tidy evaluation: `.data` and `.env`. As you might guess from their names, these pronouns let you state explicitly whether you mean a data-var or an env-var. For example, we can rewrite the filter used above:
```{r}
diamonds %>% filter(.data$carat > .env$min_carat)
```
This doesn't immediately help us in Shiny apps, because the results from inputs are usually strings, and using `.data$var` isn't going to work because it's going to look for a data-var called `var`, not the data-var whose name is stored in the env-var `var`. Fortunately, base R already has a solution for this: use `[[` instead of `$`, giving `.data[[var]]`.
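To see the difference outside of an app, here is a minimal sketch using the built-in `mtcars` data. The env-var name `var` and the choice of `"cyl"` are assumptions made purely for illustration.

```{r}
library(dplyr, warn.conflicts = FALSE)

# `var` is an env-var holding the *name* of a data-var as a string
var <- "cyl"

# `.data$var` looks for a column literally called "var", which mtcars lacks,
# so we capture the resulting error instead of letting it stop the script
dollar_result <- tryCatch(
  filter(mtcars, .data$var > 6),
  error = function(e) e
)
inherits(dollar_result, "error")

# `.data[[var]]` looks up the column named by the string stored in `var`
bracket_result <- filter(mtcars, .data[[var]] > 6)
nrow(bracket_result)
```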
Let's apply this to a simple example:
```{r}
ui <- fluidPage(
  selectInput("var", "Variable", choices = names(diamonds)),
  tableOutput("output")
)
server <- function(input, output, session) {
  data <- reactive(filter(diamonds, input$var > 0))
  output$output <- renderTable(head(data()))
}
```
This doesn't work because `input$var` isn't a data-var: it's an env-var containing the name of a data-var (stored as a string). Unfortunately, it also fails silently instead of giving a useful error message, because `input$var` will be a string like `"carat"`, and comparing a string with a number is valid R:
```{r}
"carat" > 0
```
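Because that comparison is a single `TRUE`, `filter()` recycles it to every row and nothing is dropped. A small check of this, assuming the built-in `mtcars` data, with `var` standing in for `input$var` and an arbitrary threshold of 100:

```{r}
library(dplyr, warn.conflicts = FALSE)

var <- "hp"  # stands in for input$var

# The string "hp" is compared with 100 as text, giving a single TRUE
# that filter() recycles to all rows, so no rows are removed
wrong <- filter(mtcars, var > 100)
nrow(wrong) == nrow(mtcars)
```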
We can fix the problem by using `.data` and `[[`:
```{r}
server <- function(input, output, session) {
  data <- reactive(filter(diamonds, .data[[input$var]] > 0))
  output$output <- renderTable(head(data()))
}
```
### Tidy evaluation in functions
Note that this is a slightly different problem from using tidy evaluation inside your own functions, which calls for a slightly different solution. You can use `.data` + `[[`, but it doesn't create a very user friendly function:
```{r}
filter_var <- function(df, var, val) {
  filter(df, .data[[var]] > val)
}
filter_var(diamonds, "carat", 4)
```
This function is a bit weird because it takes the name of the variable as a string, so it doesn't work like most other tidyverse functions. Here we need to use a slightly different technique:
```{r}
filter_var <- function(df, var, val) {
  filter(df, {{ var }} > val)
}
filter_var(diamonds, carat, 4)
```
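The use of `{{` tells `filter()` to use the expression that the caller supplied for `var`, rather than looking for a data-var that is literally called `var`.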
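As a further sketch of the same technique, embracing also works when an argument is passed on to several verbs. The helper `mean_by()` below is hypothetical and exists only for illustration; it uses the built-in `mtcars` data.

```{r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: group by one data-var and average another.
# Both `group` and `value` are supplied unquoted, like in other dplyr verbs.
mean_by <- function(df, group, value) {
  df %>%
    group_by({{ group }}) %>%
    summarise(mean = mean({{ value }}), n = n())
}

res <- mean_by(mtcars, cyl, mpg)
res
```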
### `parse()`
Finally, it's worth a note about using `paste()` + `parse()` + `eval()`. It's a tempting approach because it means that you don't have to learn much that is new. But it has a major downside: the user of your app can run arbitrary R code. This isn't super important if it's a Shiny app that only you use, but avoiding it is a good habit to get into. Otherwise it's very easy to accidentally create a security hole in an app that you share more widely.
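The following sketch illustrates the risk under simple assumptions: `user_input` stands in for a string typed into a text input, and the tiny data frame `df` is made up for the example. The `message()` call is harmless here, but it could be any R code.

```{r}
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(carat = c(0.5, 1.5))

# Imagine this string came from a text input; it smuggles in an extra call
user_input <- "{ message('arbitrary code ran'); carat }"

# paste() + parse() + eval() runs whatever the string contains
code <- paste0("filter(df, ", user_input, " > 1)")
unsafe <- eval(parse(text = code))

# .data[[ ]] treats the string only as a column name, so the same input
# fails with an error instead of being executed
safe <- tryCatch(
  filter(df, .data[[user_input]] > 1),
  error = function(e) e
)
inherits(safe, "error")
```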
## Case studies
### Plotting
```{r}
ui <- fluidPage(
  selectInput("x", "X variable", choices = names(iris)),
  selectInput("y", "Y variable", choices = names(iris)),
  plotOutput("plot")
)
server <- function(input, output, session) {
  output$plot <- renderPlot({
    ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) +
      geom_point(position = ggforce::position_auto()) +
      labs(x = input$x, y = input$y)
  })
}
```
I use the special `ggforce::position_auto()` to automatically spread the points out when one axis is discrete. Once you've mastered the basics of tidy evaluation you'll quickly find that the challenge becomes making your app general enough to work with many different types of variable.
Instead of using `position_auto()` we could allow the user to pick the geom:
```{r}
ui <- fluidPage(
  selectInput("x", "X variable", choices = names(iris)),
  selectInput("y", "Y variable", choices = names(iris)),
  selectInput("geom", "geom", c("point", "smooth", "jitter")),
  plotOutput("plot")
)
server <- function(input, output, session) {
  plot_geom <- reactive({
    switch(input$geom,
      point = geom_point(),
      smooth = geom_smooth(se = FALSE),
      jitter = geom_jitter()
    )
  })
  output$plot <- renderPlot({
    ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) +
      plot_geom() +
      labs(x = input$x, y = input$y)
  })
}
```
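The plotting logic can also be tried outside of Shiny by moving it into a plain function that takes strings. This is only a sketch: `plot_xy()` and its arguments are hypothetical names, and the explicit `stop()` for an unknown geom is an addition that the app does not need, because a `selectInput()` restricts the choices.

```{r}
library(ggplot2)

# Hypothetical helper mirroring the body of renderPlot() above
plot_xy <- function(df, x, y, geom = "point") {
  layer <- switch(geom,
    point = geom_point(),
    smooth = geom_smooth(se = FALSE),
    jitter = geom_jitter(),
    stop("Unknown geom: ", geom)
  )
  # x and y are strings, so we index into the .data pronoun with [[
  ggplot(df, aes(.data[[x]], .data[[y]])) +
    layer +
    labs(x = x, y = y)
}

p <- plot_xy(iris, "Sepal.Length", "Petal.Length", geom = "jitter")
p
```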
### Filtering and arranging
The same idea works for dplyr:
```{r}
library(dplyr, warn.conflicts = FALSE)
ui <- fluidPage(
  selectInput("var", "Select variable", choices = names(mtcars)),
  sliderInput("min", "Minimum value", 0, min = 0, max = 100),
  selectInput("sort", "Sort by", choices = names(mtcars)),
  tableOutput("data")
)
server <- function(input, output, session) {
  # Keep the slider range in sync with the selected variable
  observeEvent(input$var, {
    rng <- range(mtcars[[input$var]])
    updateSliderInput(session, "min", value = rng[[1]], min = rng[[1]], max = rng[[2]])
  })
  output$data <- renderTable({
    mtcars %>%
      filter(.data[[input$var]] > input$min) %>%
      arrange(.data[[input$sort]])
  })
}
```
Most other problems can be solved by combining this technique with your existing programming skills. For example, what if you wanted to conditionally sort in either ascending or descending order?
```{r}
ui <- fluidPage(
  selectInput("var", "Sort by", choices = names(mtcars)),
  checkboxInput("desc", "Descending order?"),
  tableOutput("data")
)
server <- function(input, output, session) {
  sorted <- reactive({
    if (input$desc) {
      arrange(mtcars, desc(.data[[input$var]]))
    } else {
      arrange(mtcars, .data[[input$var]])
    }
  })
  output$data <- renderTable(sorted())
}
```
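The sorting logic is ordinary R code, so it can be checked without running the app. In this sketch, `sort_by()` is a hypothetical helper that reproduces the body of the `sorted` reactive as a plain function:

```{r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: `var` is a column name stored as a string
sort_by <- function(df, var, descending = FALSE) {
  if (descending) {
    arrange(df, desc(.data[[var]]))
  } else {
    arrange(df, .data[[var]])
  }
}

asc <- sort_by(mtcars, "mpg")
dsc <- sort_by(mtcars, "mpg", descending = TRUE)
head(asc$mpg, 3)
head(dsc$mpg, 3)
```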
As you provide more control, you'll find the code gets more and more complicated, and it becomes harder and harder to create a user interface that is both comprehensive _and_ user friendly. This is why I've always focussed on code tools for data analysis: creating good UIs is really really hard!
## Additional challenges
The final section of this chapter covers a grab bag of additional topics that are important for various applications.
### Selection semantics
Most tidyverse functions (e.g. `dplyr::mutate()`, `dplyr::filter()`, `dplyr::group_by()`, `ggplot2::aes()`) have what we call __action__ semantics, which means that you can perform any action inside of them. Other functions have __selection__ semantics: instead of general computation, you select variables using a special domain specific language that includes helpers like `starts_with()` and `ends_with()`. The most important function that has selection semantics is `dplyr::select()`, but the set also includes many tidyr functions, like `pivot_longer()`, `pivot_wider()`, `separate()`, `extract()`, and `unite()`. Selection semantics are powered by the tidyselect package.
Working with functions that use selection semantics is slightly different to those that use action semantics because there is no `.data` pronoun. Instead you use the helper `one_of()` or `all_of()`[^one-vs-all]:
[^one-vs-all]: `one_of()` is available in all versions of the tidyselect package, but the name is not very informative, so we recommend using `all_of()` if it's available to you.
```{r}
ui <- fluidPage(
  selectInput("vars", "Variables", names(mtcars), multiple = TRUE),
  tableOutput("data")
)
server <- function(input, output, session) {
  output$data <- renderTable({
    req(input$vars)
    mtcars %>% select(one_of(input$vars))
  })
}
```
(If you wanted all of the variables _except_ those selected, you could use `-one_of(input$vars)`.)
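Here is the same selection outside of an app, written with `all_of()`. This sketch assumes a version of tidyselect that provides `all_of()`, and the character vector `vars` stands in for `input$vars`.

```{r}
library(dplyr, warn.conflicts = FALSE)

vars <- c("mpg", "cyl")  # stands in for input$vars

# Keep only the named columns
kept <- mtcars %>% select(all_of(vars))
names(kept)

# Keep everything except the named columns
dropped <- mtcars %>% select(-all_of(vars))
names(dropped)
```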
### Multiple variables
As shown in the previous example, working with multiple variables is trivial when you're working with a function that uses selection semantics: you can just pass a character vector of variable names into `one_of()`/`all_of()`. The challenge is operating on multiple variables when the function has action semantics, as is common with dplyr functions. There are two ways to work with multiple variables, depending on which version of dplyr you are working with. I'll illustrate them with an app that allows you to select any number of variables to count their unique values.
```{r}
ui <- fluidPage(
  selectInput("vars", "Variables", names(mtcars), multiple = TRUE),
  tableOutput("count")
)
```
In dplyr 0.8 and earlier, every function that uses action semantics also has a variant that has selection semantics, with the suffix `_at`. The easiest approach is just to switch from action to selection semantics by changing the function that you're programming with.
```{r}
server <- function(input, output, session) {
  output$count <- renderTable({
    req(input$vars)
    mtcars %>%
      group_by_at(input$vars) %>%
      summarise(n = n())
  })
}
```
dplyr 1.0.0 provides a more flexible approach: inside of any function with action semantics, you can use `across()` to access selection semantics:
```{r}
server <- function(input, output, session) {
  output$count <- renderTable({
    req(input$vars)
    mtcars %>%
      group_by(across(all_of(input$vars))) %>%
      summarise(n = n())
  })
}
```
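The counting pipeline can be run directly at the console. The sketch below assumes dplyr 1.0.0 or later; `vars` stands in for `input$vars`, and `.groups = "drop"` is an addition that just returns an ungrouped result.

```{r}
library(dplyr, warn.conflicts = FALSE)

vars <- c("cyl", "am")  # stands in for input$vars

# across() gives group_by() selection semantics, so a character vector works
counts <- mtcars %>%
  group_by(across(all_of(vars))) %>%
  summarise(n = n(), .groups = "drop")
counts
```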
Things are mildly more complicated for `mutate()` and `summarise()` because you also need to supply a function to perform the operation.
```{r}
ui <- fluidPage(
  selectInput("vars_g", "Group by", names(mtcars), multiple = TRUE),
  selectInput("vars_s", "Summarise", names(mtcars), multiple = TRUE),
  tableOutput("data")
)
# dplyr 0.8.0
server <- function(input, output, session) {
  output$data <- renderTable({
    mtcars %>%
      group_by_at(input$vars_g) %>%
      summarise_at(input$vars_s, mean)
  })
}
# dplyr 1.0.0
server <- function(input, output, session) {
  output$data <- renderTable({
    mtcars %>%
      group_by(across(all_of(input$vars_g))) %>%
      summarise(across(all_of(input$vars_s), mean), n = n())
  })
}
```
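To check what the `across()` version computes, here is a minimal sketch outside of Shiny, again assuming dplyr 1.0.0 or later. `vars_g` and `vars_s` stand in for the two inputs, and `.groups = "drop"` is an addition that returns an ungrouped result.

```{r}
library(dplyr, warn.conflicts = FALSE)

vars_g <- "cyl"           # stands in for input$vars_g
vars_s <- c("mpg", "hp")  # stands in for input$vars_s

# across() needs both the columns to operate on and the function to apply
means <- mtcars %>%
  group_by(across(all_of(vars_g))) %>%
  summarise(across(all_of(vars_s), mean), n = n(), .groups = "drop")
means
```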
### Action semantics and user supplied data
There is one additional complication when you're working with user supplied data and action semantics. Take the following app: it allows the user to upload a tsv file, then select a variable, and filter by it. It will work for the vast majority of inputs you might try it with:
```{r}
ui <- fluidPage(
  fileInput("data", "dataset", accept = ".tsv"),
  selectInput("var", "var", character()),
  numericInput("min", "min", 1, min = 0, step = 1),
  tableOutput("output")
)
server <- function(input, output, session) {
  data <- reactive({
    req(input$data)
    vroom::vroom(input$data$datapath)
  })
  observeEvent(data(), {
    updateSelectInput(session, "var", choices = names(data()))
  })
  observeEvent(input$var, {
    val <- data()[[input$var]]
    updateNumericInput(session, "min", value = min(val))
  })
  output$output <- renderTable({
    req(input$var)
    data() %>%
      filter(.data[[input$var]] > input$min) %>%
      arrange(.data[[input$var]]) %>%
      head(10)
  })
}
```
There is a subtle problem with the use of `filter()`. Let's focus in on that code so we can play around and see the problem more easily outside of the app.
```{r}
df <- data.frame(x = 1, y = 2)
input <- list(var = "x", min = 0)
df %>% filter(.data[[input$var]] > input$min)
```
If you experiment with this code, you'll find that it appears to work just fine for the vast majority of data frames. However, there's one big problem: what happens if the data frame contains a variable called `input`?
```{r, error = TRUE}
df <- data.frame(x = 1, y = 2, input = 3)
df %>% filter(.data[[input$var]] > input$min)
```
We get an error message because `filter()` is attempting to evaluate `df$input$min`:
```{r, error = TRUE}
df$input$min
```
This problem is again due to the ambiguity of data-variables and env-variables. Tidy evaluation always prefers to use a data-variable if both are available. We can resolve the ambiguity by telling `filter()` not to look in the data frame for `input`, and instead only use an env-variable[^bang-bang]:
```{r}
df <- data.frame(x = 1, y = 2, input = 3)
df %>% filter(.data[[input$var]] > .env$input$min)
```
[^bang-bang]: Instead of using `.env`, you can also use `!!` if you know about it, e.g. `df %>% filter(.data[[input$var]] > !!input$min)`. This is evaluated at a slightly different time, and I think it is mildly less appealing because it lacks the symmetry of `.data` vs `.env`, and you need to know about `!!`. But it's a fine solution if you're happy with `!!`.
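The same collision can happen with any name, not just `input`. As a sketch, the hypothetical helper `filter_above()` below has an argument called `min`, and the made-up data frame also has a column called `min`; `.env$min` makes sure the argument is used.

```{r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper whose argument name `min` collides with a column name
filter_above <- function(df, var, min) {
  filter(df, .data[[var]] > .env$min)
}

df <- data.frame(x = c(1, 5, 10), min = c(100, 100, 100))

# Without .env, `min` would resolve to the column (all 100) and drop every
# row; with .env it resolves to the function argument
above <- filter_above(df, "x", min = 3)
above
```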
At this point you might wonder if you're better off without `filter()`, and just write the equivalent base R code:
```{r}
df[df[[input$var]] > input$min, ]
```
That's fine too, as long as you're aware of all the edge cases where `filter()` behaves differently. In this case:
* You'll need `drop = FALSE` if `df` contains a single column (otherwise you'll
  get a vector instead of a data frame).
* You'll need to use `which()` or similar to drop any missing values.
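Both edge cases show up in this small sketch. The one-column data frame with a missing value is made up for the example, and `input` is an ordinary list standing in for the Shiny input object.

```{r}
df <- data.frame(x = c(1, NA, 3))
input <- list(var = "x", min = 2)

# Naive subsetting: the single column collapses to a vector, and the row
# where the comparison is NA is kept as NA
naive <- df[df[[input$var]] > input$min, ]
naive

# which() drops the NA comparison and drop = FALSE keeps the data frame
fixed <- df[which(df[[input$var]] > input$min), , drop = FALSE]
fixed
```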
In general, if you're using dplyr for very simple cases, you might find it easier to use implementations that don't rely on tidy evaluation. However, in my opinion, one of the major advantages of the tidyverse is not just in routine application, but in the careful thought that has been applied to edge cases so that functions work more consistently. I don't want to oversell this, but at the same time, it's easy to forget the quirks of specific base R functions, and write code that works 95% of the time, but fails in unusual ways the other 5% of the time.