Rewrite `case_when()` using `vec_case_when()` #6286

DavisVaughan · 2022-06-01T13:48:56Z

Closes #6261
Closes #6206
Closes #5106
Closes #6145
Closes #6225

The intention is to move vec_case_when() to vctrs and rewrite in C, but getting the semantics right in R is more important for now.

vec_case_when() will also be used to back if_else() and coalesce() for sure. I think it could also back na_if(). It could additionally back replace_when() if we want to implement that.

case_when() has gained a new interface. Formulas are no longer used; instead, you pass pairs of condition/value inputs. There is also an explicit .default argument now, along with new .ptype and .size arguments.

The formula interface still works for now, and I'm optimistic that it is 100% backwards compatible. We don't do anything to actively dissuade people from using the formula interface right now. However, you can't use any of the new arguments with the old interface.

Open question:

Should we have a .missing argument in case_when() and vec_case_when()? This would:

- Match if_else()
- Be easier than supplying is.na(x), "value" for most cases
- Provide a way to handle the case where the missing value pops up in the computation rather than in x itself (see below)

We decided that .default should ONLY handle the case where all conditions are FALSE. That means it doesn't handle the case where no condition is TRUE and at least one of the conditions is NA; in that case the NA now gets propagated through. This is generally what people want, see this example:

library(dplyr, warn.conflicts = FALSE)

x <- c(1, 2, NA, 4, 5)

# Confusing because `NA` isn't propagated.
# The `NA` gets assigned to `"high"` which feels wrong.
case_when(
  x <= 2 ~ "low",
  x <= 4 ~ "med",
  TRUE ~ "high"
)
#> [1] "low"  "low"  "high" "med"  "high"

# New interface propagates `NA` - good!
case_when(
  x <= 2, "low",
  x <= 4, "med",
  .default = "high"
)
#> [1] "low"  "low"  NA     "med"  "high"

# Handle `NA` easily because we knew where it came from
case_when(
  x <= 2, "low",
  x <= 4, "med",
  is.na(x), "unknown",
  .default = "high"
)
#> [1] "low"     "low"     "unknown" "med"     "high"

The above was the motivating case for this change. But it is a little more frustrating if the NAs occur in the computation of the condition rather than in x itself. I still think the .default behavior is right; the problem is that it is now hard to handle the NA cases explicitly. A .missing argument would give us an easy, explicit way to handle the computed NAs.

library(dplyr, warn.conflicts = FALSE)

x <- c(-1, 0, 1)

# Generated `NA` is treated like `FALSE`.
# `TRUE ~` applies to everything that is left, but does that really make sense?
case_when(
  sqrt(x) > 0 ~ "big",
  TRUE ~ "little"
)
#> Warning in sqrt(x): NaNs produced
#> [1] "little" "little" "big"

# Propagates `NA`, which seems reasonable
case_when(
  sqrt(x) > 0, "big",
  .default = "little"
)
#> Warning in sqrt(x): NaNs produced
#> [1] NA       "little" "big"

# But there is no easy way to "handle" the `NA` because it happens
# in the computation, not in `x`
case_when(
  sqrt(x) > 0, "big",
  is.na(x), "missing",
  .default = "little"
)
#> Warning in sqrt(x): NaNs produced
#> [1] NA       "little" "big"

# Proposal:
# case_when(
#   sqrt(x) > 0, "big",
#   .default = "little",
#   .missing = "missing"
# )
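
For reference, the semantics under discussion can be sketched in a few lines of base R. This is only an illustration under stated assumptions: case_when_sketch() and its arguments are hypothetical names, it only handles character outputs with scalar values, and it is not how vec_case_when() is implemented in this PR.

```r
# Hypothetical base R helper illustrating the semantics discussed above.
# Not dplyr's or vctrs' implementation.
case_when_sketch <- function(conditions, values,
                             default = NA_character_,
                             missing = NA_character_) {
  size <- length(conditions[[1]])
  out <- rep(NA_character_, size)
  matched <- rep(FALSE, size)  # some condition was TRUE for this element
  saw_na <- rep(FALSE, size)   # some condition was NA for this element

  for (i in seq_along(conditions)) {
    cond <- conditions[[i]]
    # The first TRUE condition wins; NA never counts as a match
    hit <- !matched & !is.na(cond) & cond
    out[hit] <- values[[i]]
    matched <- matched | hit
    saw_na <- saw_na | is.na(cond)
  }

  # `default` only applies when every condition was FALSE
  out[!matched & !saw_na] <- default
  # No TRUE and at least one NA: propagate, or use `missing` if supplied
  out[!matched & saw_na] <- missing
  out
}

x <- c(-1, 0, 1)
cond <- suppressWarnings(sqrt(x) > 0)  # NA, FALSE, TRUE

res_default <- case_when_sketch(list(cond), list("big"), default = "little")
res_missing <- case_when_sketch(list(cond), list("big"),
                                default = "little", missing = "missing")
```

With only a default, the computed NA is propagated; with the hypothetical missing value supplied, the first element becomes "missing" while the other two are unchanged.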

With backwards compatible support for the old formula interface

markfairbanks · 2022-06-03T14:29:24Z

I think the proposal of a .missing arg is a good way to go since .default won't handle NAs. With a single vector the benefit isn't quite as obvious, but with 2+ vectors you would have to keep track of every vector used in a condition and create an is.na() condition for each one.

devtools::load_all(".")
#> ℹ Loading dplyr

x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, NA, 5)
z <- c(1, 2, 3, 4, NA)

case_when(
  x <= 2 & y <= 2, "low",
  x <= 4 & y <= 4, "med",
  x == 5 & z == 5, "different",
  is.na(x) | is.na(y) | is.na(z), "unknown",
  .default = "high"
)
#> [1] "low"     "low"     "unknown" "unknown" "unknown"
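
As a side note, the combined is.na() condition in the example above does not have to be written out by hand. A minimal base R sketch, assuming the vectors all have the same length (the variable names any_missing and same are just for illustration):

```r
x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, NA, 5)
z <- c(1, 2, 3, 4, NA)

# Combine is.na() over any number of vectors with an element-wise OR
any_missing <- Reduce(`|`, lapply(list(x, y, z), is.na))

# stats::complete.cases() flags positions with no NA in any input,
# so its negation marks the same positions
same <- !complete.cases(x, y, z)
```

Either result could be passed as the "unknown" condition, which at least keeps the list of vectors in one place.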

DavisVaughan · 2022-06-07T14:48:21Z

@markfairbanks that example is actually a little tricky because of how TRUE & NA = NA but FALSE & NA = FALSE, which can result in this:

x <- c(1, 6)
y <- c(1, NA)

case_when(
  x <= 2 & y <= 2, "low",
  x <= 4 & y <= 4, "med",
  .default = "high",
  .missing = "unknown"
)
#> [1] "low"  "high"

x <= 2 & y <= 2
#> [1]  TRUE FALSE
x <= 4 & y <= 4
#> [1]  TRUE FALSE

It seems like you'd still have to check for missing values with is.na(x) | is.na(y) | is.na(z) in this case because the computation of the condition actually removed the NA that was present in the original vectors. I don't think there is any way around this.

I still think .missing is useful for the simple cases like x <= 4 on its own where NA is propagated. And it is definitely still useful for the case in my original example where the computation of the condition generated the missing value.

I think this is more of a case where you have to understand what base R gives you rather than a question of how case_when() should work.
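
To make the base R behavior concrete, here is a small sketch of the three-valued logic involved, plus one way to spot elements where an NA in the inputs was absorbed by the condition. The variable names are only for illustration.

```r
# A definite FALSE (for &) or TRUE (for |) absorbs an NA
and_true  <- TRUE & NA    # NA
and_false <- FALSE & NA   # FALSE
or_true   <- TRUE | NA    # TRUE
or_false  <- FALSE | NA   # NA

x <- c(1, 6)
y <- c(1, NA)

cond <- x <= 4 & y <= 4   # TRUE FALSE: the NA in `y` is gone

# Elements where an input was missing but the condition is not
absorbed <- (is.na(x) | is.na(y)) & !is.na(cond)
absorbed                  # FALSE TRUE
```

The second element is exactly the case a .missing argument could not see, because the condition it receives is already a plain FALSE.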

DavisVaughan · 2022-06-07T15:13:26Z

This also comes up when %in% is used, which I think is a fairly common use case of case_when().

x <- c("a", "b", NA, "e")

# NA turned into FALSE in the condition
vec_case_when(
  x %in% c("a", "e", "i", "o", "u"), "vowel",
  .default = "consonant",
  .missing = "missing"
)
#> [1] "vowel"     "consonant" "consonant" "vowel"
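
The reason is that %in% never returns NA, unlike ==. A minimal sketch of the difference, with a hypothetical helper (in_keep_na() is not part of dplyr, vctrs, or base R) showing what an NA-preserving variant might look like:

```r
x <- c("a", "b", NA, "e")
vowels <- c("a", "e", "i", "o", "u")

in_plain <- x %in% vowels   # TRUE FALSE FALSE TRUE
eq_plain <- x == "a"        # TRUE FALSE    NA FALSE

# Hypothetical helper: like %in%, but keeps NA where `x` is missing
in_keep_na <- function(x, table) {
  out <- x %in% table
  out[is.na(x)] <- NA
  out
}

in_na <- in_keep_na(x, vowels)  # TRUE FALSE NA TRUE
```

With a condition like in_na, the missing element would reach the NA handling instead of being silently counted as a non-match.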

Maybe the original design where .default applies to both FALSE and NA was correct? Then you are forced to deal with the missing values yourself, however they might come up, rather than relying on .missing, which might not always work as you expect (because sometimes your NAs can become FALSE when the condition is computed). The main reason we didn't like the original design is that when missing values do propagate through, you get weird results like this, which the new version does handle better.

x <- c(1, 2, NA)

# NA propagated, then `TRUE ~` overrides the propagation
dplyr::case_when(
  x %% 2 == 0 ~ "even",
  TRUE ~ "odd"
)
#> [1] "odd"  "even" "odd"

# NA propagated and `.default` doesn't handle it
vec_case_when(
  x %% 2 == 0, "even",
  .default = "odd"
)
#> [1] "odd"  "even" NA

DavisVaughan · 2022-06-17T15:59:50Z

Superseded by #6300

DavisVaughan added 6 commits June 1, 2022 08:59

Implement vec_case_when()

b53d0cc

Update case_when() to use the new interface

2f21e8b

With backwards compatible support for the old formula interface

Update internal usage of case_when()

e44b9ab

Add a test specifically for tidyverse#6261 and tidyverse#6206

0e44aed

Add an example of creating multiple columns at once

1503c21

NEWS bullet

4413e28

Add a .missing argument to case_when()

59dfb0a

DavisVaughan mentioned this pull request Jun 17, 2022

Rewrite case_when() using vec_case_when() #6300

Merged

DavisVaughan closed this Jun 17, 2022

DavisVaughan deleted the feature/vec-case-when branch July 1, 2022 14:15
