action-tidy.Rmd

# Tidy evaluation {#action-tidy}

```{r, include = FALSE}
source("common.R")
```

If you are using the tidyverse from Shiny, you will almost certainly encounter the challenge of programming with tidy evaluation. Tidy evaluation is the technique that allows you to refer to variables within a data frame, without having to think about it, or do anything special. That's what makes code like this work:

```{r, eval = FALSE}
diamonds %>% filter(x == z)

ggplot(diamonds, aes(x = carat, y = price)) + 
  geom_hex()
```

First we'll go over the basic motivation, and the key idea that makes tidy evaluation more convenient for data analysis and less convenient for programming.

This article will focus on the combination of tidy evaluation with Shiny. If you want to learn more about the general challenges of using tidy evaluation in a package, see
<http://ggplot2.tidyverse.org/dev/articles/ggplot2-in-packages.html> (or the dplyr equivalent, when it exists).

As well as Shiny, this chapter will use both ggplot2 and dplyr to show the main use cases of tidy evaluation and Shiny together.

```{r setup}
library(shiny)

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
```

## Motivation {#tidy-motivation}

The key idea of tidy evaluation is that it blurs the line between two meaning of "variable":

* Environment variables (env-vars for short), are programming variables. 
  Formally, they are between names and values that are usually created by `<-`.

* Data frame variables (data-vars for short), are "statistical" variables 
  that live inside a data frame. In base R you usually access them with `$` and
  `[[`.
  
Take this piece of code:

```{r}
df <- data.frame(x = runif(3), y = runif(3))
df$x
```

It creates a env-var called `df`, that contains two data-vars `x` and `y`. Then it extracts the data-var `x` out of the data frame `df` using `$`.

Tidy evaluation makes it to write data analysis code because it blurs the distinction between the two. In most (but not all) base R functions you need to refer to a data-variable with `$`, leading to code that repeats the name of the data frame many times:

```{r}
diamonds[diamonds$x == 0 | diamonds$y == 0, ]
```

The dplyr equivalent, `filter()`, uses tidy evaluation to allow you to refer to a data-var as if it was a env-var:

```{r}
filter(diamonds, x == 0 | y == 0)
```

(dplyr's `filter()` is inspired by base R's `subset()`. `subset()` uses an ad hoc approach to each the same reasult as tidy evaluation, so unfortunately the same techniques don't apply to it.)

You usually these verbs purely with data-vars, but they work equally well with env-vars:

```{r}
min_carat <- 1
diamonds %>% filter(carat > min_carat)
```

I think this blurring of the meaning of variable is a really nice feature for interactive data analysis, because it allows you to refer to data-vars as is, without any prefix. And this seems to be fairly intuitive, since many newer R users will attempt to write `diamonds[x == 0 | y == 0, ]`. But when you start to program with these tools, you're going to have to grapple with the distinction. And this will be hard because you've never had to think about it before, so it'll take a while for your brain to learn these new concepts and categories. However, once you've teased apart the idea of "variable" in data-varialbe and env-variable, I think you'll find it fairly easy to use.

## Solutions

### Tidy evaluation in Shiny apps

The key to resolving this ambiguity is to make use of two __pronouns__ that are built into tidy evaluation: `.data` and `.env`. As you might guess from the name, these pronouns allow you to remove the ambiguity introduced by tidy evaluation. For example, we can rewrite the filter used above:

```{r}
diamonds %>% filter(.data$carat > .env$min_carat)
```

This doesn't immediately help us in Shiny apps, because the results from inputs are usually  strings, and using `.data$var` isn't going to work becaues it's going to look for a data-var called `var`, not a data-var stored in the env-var `var`. Fortunately base R already has a solution for this: `.data[[var]]`.

Let's apply this to a simple example:

```{r}
ui <- fluidPage(
  selectInput("var", "Variable", choices = names(diamonds)),
  tableOutput("output")
)
server <- function(input, output, session) {
  data <- reactive(filter(diamonds, input$var > 0))
  output$output <- renderTable(head(data()))
}
```

This doesn't work because `input$var` isn't a data-var: it's an env-var containing the name of a data-var (stored as string). Unfortunately it also fails to give a useful error message because `input$var` will be a string like "carat" and:

```{r}
"carat" > 0
```

We can fix the problem by using `.data` and `[[`:

```{r}
server <- function(input, output, session) {
  data <- reactive(filter(diamonds, .data[[input$var]] > 0))
  output$output <- renderTable(head(data()))
}
```

### Tidy evaluation in functions

You should note that this a slightly different problem to use of tidy evaluation functions. Where we need a slightly different solution. You can use `.data` + `[[`, but it doesn't create a very user friendly function:

```{r}
filter_var <- function(df, var, val) {
  filter(df, .data[[var]] > val)
}
filter_var(diamonds, "carat", 4)
```

This function is a bit weird because it takes the name of the variable as a string, so it doesn't work like most other tidyverse functions. Here we need to use a slightly different technique:

```{r}
filter_var <- function(df, var, val) {
  filter(df, {{ var }} > val)
}
filter_var(diamonds, carat, 4)
```

The use of `{{` tells 

### `parse()`

Finally, it's worth a note about using `paste()` + `parse()` + `eval()`.  It's tempting approach because it means that you don't have to learn much new. But it has some major downsides. This is a bad idea because it means that the user of your app can run arbitrary R code. This isn't super important if its a Shiny app that only use you, but it's a good habit to get into --- otherwise it's very easy to accidentally create a security hole in an app that you share more widely.

## Case studies

### Plotting

```{r}
ui <- fluidPage(
  selectInput("x", "X variable", choices = names(iris)),
  selectInput("y", "Y variable", choices = names(iris)),
  plotOutput("plot")
)
server <- function(input, output, session) {
  output$plot <- renderPlot({
    ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) +
      geom_point(position = ggforce::position_auto()) +
      labs(x = input$x, y = input$y)
  })
}
```

I use the special `ggforce::position_auto()` to automatically spread the points out when one axis is discrete. Once you've mastered the basics of tidy evaluation you'll quickly find that the challenge becomes making your app general enough to work with many different types of variable.

Instead of using `position_auto()` we could allow the user to pick the geom:

```{r}
ui <- fluidPage(
  selectInput("x", "X variable", choices = names(iris)),
  selectInput("y", "Y variable", choices = names(iris)),
  selectInput("geom", "geom", c("point", "smooth", "jitter")),
  plotOutput("plot")
)
server <- function(input, output, session) {
  plot_geom <- reactive({
    switch(input$geom,
      point = geom_point(),
      smooth = geom_smooth(se = FALSE),
      jitter = geom_jitter()
    )
  })
  
  output$plot <- renderPlot({
    ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) +
      plot_geom() + 
      labs(x = input$x, y = input$y)
  })
}
```

### Filtering and arranging

Same idea works for dplyr:

```{r}
library(dplyr, warn.conflicts = FALSE)

ui <- fluidPage(
  selectInput("var", "Select variable", choices = names(mtcars)),
  sliderInput("min", "Minimum value", 0, min = 0, max = 100),
  selectInput("sort", "Sort by", choices = names(mtcars)),
  tableOutput("data")
)
server <- function(input, output, session) {
  observeEvent(input$var, {
    rng <- range(mtcars[[input$var]])
    updateSliderInput(session, "min", value = rng[[1]], min = rng[[1]], max = rng[[2]])
  })
  
  output$data <- renderTable({
    mtcars %>% 
      filter(.data[[input$var]] > input$min) %>% 
      arrange(.data[[input$sort]])
  })
}
```

Most other problems can be solved by combining this techique with your existing programming skills. For example, what if you wanted to conditionally sort in either ascending or descending order?

```{r}
ui <- fluidPage(
  selectInput("var", "Sort by", choices = names(mtcars)),
  checkboxInput("desc", "Descending order?"),
  tableOutput("data")
)
server <- function(input, output, session) {
  sorted <- reactive({
    if (input$desc) {
      arrange(mtcars, desc(.data[[input$var]]))
    } else {
      arrange(mtcars, .data[[input$var]])
    }
  })
  output$data <- renderTable(sorted())
}
```

As you provide more control, you'll find the code gets more and more complicated, and it becomes harder and harder to create a user interface that is both comprehensive _and_ user friendly. This is why I've always focussed on code tools for data analysis: creating good UIs is really really hard!

## Additional challenges

The final section of this chapter covers a grab bag of additional topics that are important for various applications.

### Selection semantics

Most tidyverse functions (e.g. `dplyr::mutate()`, `dplyr::filter()`, `dplyr::group_by()`, `ggplot2::aes()`) have what we call __action__ semantics, which means that you can perform any action inside of them. Other function have __selection__ semantics; instead of general computation you can select variables using a special domain specific language that includes helpers like `starts_with()`, and `ends_with()`. The most important function that has selection semantics is `dplyr::select()`, but the set also includes many tidyr like `pivot_longer()` and `pivot_wider()`, `separate()`, `extract()`, and `unite()` functions. Selection semantics are powered by the tidyselect package.

Working with functions that use selection semantics is slightly different to those that use action semantics because there is no `.data` pronoun. Instead you use the helper `one_of()` or `all_of()`[^one-vs-all]:

[^one-vs-all]: `one_of()` is available in all versions of the tidyselect package, but the name is not very informative, so we recommend using `all_of()` if it's available to you.

```{r}
ui <- fluidPage(
  selectInput("vars", "Variables", names(mtcars), multiple = TRUE),
  tableOutput("data")
)

server <- function(input, output, session) {
  output$data <- renderTable({
    req(input$vars)
    mtcars %>% select(one_of(input$vars))
  })
}
```

(If you wanted all of the variables _except_ those selected you could use `-one_of(input$vars)))`.

### Multiple variables

As shown in the previous example, working with multiple variables is trivial when you're working with a function that uses selection semantics: you can just pass a character vector of variable names in to `one_of()`/`all_of()`. The challenge is operating on multiple variables when the function has action semantics, as is common with dplyr functions. There are two ways to work with multiple variables, depending on which version of dplyr you are working with. I'll illustrate them with an app that allows you to select any number of variables to count their unique values.

```{r}
ui <- fluidPage(
  selectInput("vars", "Variables", names(mtcars), multiple = TRUE),
  tableOutput("count")
)
```

In dplyr 0.8 and earlier, every function that uses action semantics also has a variant that has selection semantics, with the suffix `_at`. The easiest approach is to just to switch from action to selection semantics by changing the function that you're programing with.

```{r}
server <- function(input, output, session) {
  output$count <- renderTable({
    req(input$vars)
    
    mtcars %>% 
      group_by_at(input$vars) %>% 
      summarise(n = n())
  })
}
```

dplyr 1.0.0 provides a more flexible approach: inside of any function with action semantics, you can use `across()` to access selection semantics:

```{r}
server <- function(input, output, session) {
  output$count <- renderTable({
    req(input$vars)
    
    mtcars %>% 
      group_by(across(all_of(input$vars))) %>% 
      summarise(n = n())
  })
}
```

Things are mildly more complicated for `mutate()` and `summarise()` because you also need to supply a function to perform the operation.

```{r}
ui <- fluidPage(
  selectInput("vars_g", "Group by", names(mtcars), multiple = TRUE),
  selectInput("vars_s", "Summarise", names(mtcars), multiple = TRUE),
  tableOutput("data")
)

# dplyr 0.8.0
server <- function(input, output, session) {
  output$data <- renderTable({
    mtcars %>% 
      group_by(across(all_of(input$vars_g))) %>% 
      summarise(across(all_of(input$vars_s), mean), n = n())
  })
}

# dplyr 1.0.0
server <- function(input, output, session) {
  output$data <- renderTable({
    mtcars %>% 
      group_by_at(input$vars_g) %>% 
      summarise_at(input$vars_s, mean)
  })
}
```

### Action semantics and user supplied data

There is one additional complication when you're working with user supplied data and action semantics. Take the following app: it allows the user to upload a tsv file, then select a variable, and filter by it. It will work for the vast majority of inputs you might try it with:

```{r}
ui <- fluidPage(
  fileInput("data", "dataset", accept = ".tsv"),
  selectInput("var", "var", character()),
  numericInput("min", "min", 1, min = 0, step = 1),
  tableOutput("output")
)
server <- function(input, output, session) {
  data <- reactive({
    req(input$data)
    vroom::vroom(input$data$datapath)
  })
  observeEvent(data(), {
    updateSelectInput(session, "var", choices = names(data()))
  })
  observeEvent(input$var, {
    val <- data()[[input$var]]
    updateNumericInput(session, "min", value = min(val))
  })
  
  output$output <- renderTable({
    req(input$var)
    
    data() %>% 
      filter(.data[[input$var]] > input$min) %>% 
      arrange(.data[[input$var]]) %>% 
      head(10)
  })
}
```

There is a subtle problem with the use of `filter()`. Let's focus in on that code so we can play around and see the problem more easily outside of the app.

```{r}
df <- data.frame(x = 1, y = 2)
input <- list(var = "x", min = 0)

df %>% filter(.data[[input$var]] > input$min)
```

If you experiment with this code, you'll find that it appears to work just fine for vast majority of data frames. However, there's one big problem: what happens the data frame contains a variable called `input`?

```{r, error = TRUE}
df <- data.frame(x = 1, y = 2, input = 3)
df %>% filter(.data[[input$var]] > input$min)
```

We get an error message because `filter()` is attempting to evaluate `df$input$min`:

```{r, error = TRUE}
df$input$min
```

This problem is again due to the ambiguity of data-variables and env-variables. Tidy evaluation always prefers to use a data-variable if both are available. We can resolve the amibugity by telling `filter()` not to look in the data frame for `input`, and instead only use an env-variable[^bang-bang]:

```{r}
df <- data.frame(x = 1, y = 2, input = 3)
df %>% filter(.data[[input$var]] > .env$input$min)
```

[^bang-bang]: Instead of use `.env`, you can also use use `!!` if you know about it, e.g. `df %>% filter(.data[[input$var]] > !!input$min)`. This is evaluated at a slightly different time, and I think is mildly less appealing because it lacks the symmetry of `.data` vs `.env`, and you need to know about `!!`. But it's a fine solution if you're happy with `!!`.

At this point you might wonder if you're better off without `filter()`, and just write the equivalent base R code:

```{r}
df[df[[input$var]] > input$min, ]
```

That's fine too, as long as you're aware of all the edge cases where `filter()` behaves differently. In this case:

* You'll need `drop = FALSE` if `df` contains a single column (otherwise you'll 
  get a vector instead of a data frame)
  
* You'll need to use `which()` or similar to drop any missing values.

In general, if you're using dplyr for very simple cases, you might find it easier to use implementations that don't rely on tidy evaluation. However, in my opinion, one of the major advantages of the tidyverse is not just in routine application, but in the careful thought that has been applied to edge cases so that functions work more consistently. I don't want to oversell this, but at the same time, it's easy to forget the quirks of specific base R functions, and write code that works 95% of the time, but fails in unusual ways the other 5% of the time.