Rewrite case_when() using vec_case_when() #6286

Closed


DavisVaughan
Member

@DavisVaughan DavisVaughan commented Jun 1, 2022

Closes #6261
Closes #6206
Closes #5106
Closes #6145
Closes #6225

The intention is to move vec_case_when() to vctrs and rewrite it in C, but getting the semantics right in R is more important for now.

vec_case_when() will also be used to back if_else() and coalesce() for sure. I think it could also back na_if(). It could additionally back replace_when() if we want to implement that.


case_when() has gained a new interface. Formulas are no longer used; instead, you pass pairs of condition/value inputs. There is also an explicit .default argument now, along with new .ptype and .size arguments.

The formula interface still works for now, and I'm optimistic that it is 100% backwards compatible. We don't do anything to actively dissuade people from using the formula interface right now. However, you can't use any of the new arguments with the old interface.


Open question:

Should we have a .missing argument in case_when() and vec_case_when()? This would:

  • Match if_else()
  • Be easier than supplying is.na(x), "value" for most cases
  • Provide a way to handle the case where the missing value pops up in the computation rather than in x itself (see below)

We decided that .default should ONLY handle the case where all conditions are FALSE. That means it doesn't handle the case where no conditions are TRUE and at least one of those conditions is NA; in that case the NA now gets propagated through. This is generally what people want, as this example shows:

library(dplyr, warn.conflicts = FALSE)

x <- c(1, 2, NA, 4, 5)

# Confusing because `NA` isn't propagated.
# The `NA` gets assigned to `"high"` which feels wrong.
case_when(
  x <= 2 ~ "low",
  x <= 4 ~ "med",
  TRUE ~ "high"
)
#> [1] "low"  "low"  "high" "med"  "high"

# New interface propagates `NA` - good!
case_when(
  x <= 2, "low",
  x <= 4, "med",
  .default = "high"
)
#> [1] "low"  "low"  NA     "med"  "high"

# Handle `NA` easily because we knew where it came from
case_when(
  x <= 2, "low",
  x <= 4, "med",
  is.na(x), "unknown",
  .default = "high"
)
#> [1] "low"     "low"     "unknown" "med"     "high"

The above was the motivating case for this change. But it is a little more frustrating if the NAs occur in the computation of the condition rather than in x itself. I still think the .default behavior is right; the problem is that it is now hard to handle the NA cases explicitly. A .missing argument would let us handle the computed NAs explicitly and easily.

library(dplyr, warn.conflicts = FALSE)

x <- c(-1, 0, 1)

# Generated `NA` is treated like `FALSE`.
# `TRUE ~` applies to everything that is left, but does that really make sense?
case_when(
  sqrt(x) > 0 ~ "big",
  TRUE ~ "little"
)
#> Warning in sqrt(x): NaNs produced
#> [1] "little" "little" "big"

# Propagates `NA`, which seems reasonable
case_when(
  sqrt(x) > 0, "big",
  .default = "little"
)
#> Warning in sqrt(x): NaNs produced
#> [1] NA       "little" "big"

# But there is no easy way to "handle" the `NA` because it happens
# in the computation, not in `x`
case_when(
  sqrt(x) > 0, "big",
  is.na(x), "missing",
  .default = "little"
)
#> Warning in sqrt(x): NaNs produced
#> [1] NA       "little" "big"

# Proposal:
# case_when(
#   sqrt(x) > 0, "big",
#   .default = "little",
#   .missing = "missing"
# )
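To make the proposed semantics concrete, here is a minimal base R sketch. The function case_when_sketch() and its arguments are hypothetical names for illustration only; this is not the implementation in this PR or in vctrs, and it only handles length-1 character values. It assumes the rule described above: the first TRUE condition wins, `default` applies only when every condition is FALSE, and `missing` applies when no condition is TRUE and at least one is NA.

```r
# Hypothetical helper for illustration; not dplyr's or vctrs' implementation.
# `conditions`: list of logical vectors of equal length.
# `values`: list of length-1 character values, one per condition.
case_when_sketch <- function(conditions, values,
                             default = NA_character_,
                             missing = NA_character_) {
  n <- length(conditions[[1]])
  out <- rep(default, n)     # start from the "all conditions FALSE" result
  decided <- rep(FALSE, n)   # TRUE once an earlier condition has matched
  saw_na <- rep(FALSE, n)    # TRUE if any condition was NA at this position

  for (i in seq_along(conditions)) {
    cond <- conditions[[i]]
    # A hit needs a non-missing TRUE at a position not yet decided
    hit <- !decided & !is.na(cond) & cond
    out[hit] <- values[[i]]
    decided <- decided | hit
    saw_na <- saw_na | is.na(cond)
  }

  # No TRUE condition and at least one NA condition: use `missing`
  out[!decided & saw_na] <- missing
  out
}

x <- c(-1, 0, 1)
cond <- suppressWarnings(sqrt(x) > 0)  # NA, FALSE, TRUE

# Without `missing`, the computed NA propagates
res_default <- case_when_sketch(list(cond), list("big"), default = "little")

# With `missing`, the computed NA gets an explicit value
res_missing <- case_when_sketch(
  list(cond), list("big"),
  default = "little", missing = "missing"
)
```

Under these assumptions, res_default is NA, "little", "big" and res_missing is "missing", "little", "big", which matches the behavior the proposal describes.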

@markfairbanks
Contributor

I think the proposed .missing arg is a good way to go since .default won't handle NAs. With a single vector the use case isn't quite as obvious, but with 2+ vectors you would have to keep track of every vector used in a condition and create an is.na() condition for each one.

devtools::load_all(".")
#> ℹ Loading dplyr

x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, NA, 5)
z <- c(1, 2, 3, 4, NA)

case_when(
  x <= 2 & y <= 2, "low",
  x <= 4 & y <= 4, "med",
  x == 5 & z == 5, "different",
  is.na(x) | is.na(y) | is.na(z), "unknown",
  .default = "high"
)
#> [1] "low"     "low"     "unknown" "unknown" "unknown"
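The bookkeeping described above could be wrapped in a small helper. This is only a sketch in base R; any_missing() is a hypothetical name and not part of dplyr or this PR.

```r
x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, NA, 5)
z <- c(1, 2, 3, 4, NA)

# Hypothetical helper: TRUE wherever at least one supplied vector is NA
any_missing <- function(...) {
  Reduce(`|`, lapply(list(...), is.na))
}

flag <- any_missing(x, y, z)
flag
#> [1] FALSE FALSE  TRUE  TRUE  TRUE
```

The result could be passed as a single condition in place of is.na(x) | is.na(y) | is.na(z). In base R, !complete.cases(x, y, z) should give the same flags.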

@DavisVaughan
Member Author

@markfairbanks that example is actually a little tricky because TRUE & NA = NA but FALSE & NA = FALSE, which can result in this:

x <- c(1, 6)
y <- c(1, NA)

case_when(
  x <= 2 & y <= 2, "low",
  x <= 4 & y <= 4, "med",
  .default = "high",
  .missing = "unknown"
)
#> [1] "low"  "high"

x <= 2 & y <= 2
#> [1]  TRUE FALSE
x <= 4 & y <= 4
#> [1]  TRUE FALSE

It seems like you'd still have to check for missing values with is.na(x) | is.na(y) | is.na(z) in this case because the computation of the condition actually removed the NA that was present in the original vectors. I don't think there is any way around this.

I still think .missing is useful for the simple cases like x <= 4 on its own where NA is propagated. And it is definitely still useful for the case in my original example where the computation of the condition generated the missing value.

I think this is more of a case where you have to understand what base R gives you rather than a question of how case_when() should work.
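For reference, the base R behavior in question is the three-valued logic of & and |. The snippet below only uses base R and restates the x and y from the example above, showing how an NA in an input can vanish from the combined condition.

```r
# `&` only yields NA when the other side cannot settle the answer
TRUE & NA
#> [1] NA
FALSE & NA
#> [1] FALSE

# `|` is the mirror image
TRUE | NA
#> [1] TRUE
FALSE | NA
#> [1] NA

x <- c(1, 6)
y <- c(1, NA)
cond <- x <= 4 & y <= 4

is.na(y)     # the input has a missing value
#> [1] FALSE  TRUE
is.na(cond)  # the condition built from it does not
#> [1] FALSE FALSE
```

Because cond contains no NA, nothing downstream of the condition can tell that y was missing at the second position.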

@DavisVaughan
Member Author

DavisVaughan commented Jun 7, 2022

This also comes up when %in% is used, which I think is a fairly common use case of case_when().

x <- c("a", "b", NA, "e")

# NA turned into FALSE in the condition
vec_case_when(
  x %in% c("a", "e", "i", "o", "u"), "vowel",
  .default = "consonant",
  .missing = "missing"
)
#> [1] "vowel"     "consonant" "consonant" "vowel"
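The NA disappears here because %in% is built on match() and reports FALSE for a missing value that is not in the table, whereas == returns NA. A sketch in base R of the difference, plus one possible way to put the NA back by hand (the variable names are for illustration only):

```r
x <- c("a", "b", NA, "e")
vowels <- c("a", "e", "i", "o", "u")

# `%in%` never returns NA
in_vowels <- x %in% vowels
in_vowels
#> [1]  TRUE FALSE FALSE  TRUE

# `==` does propagate the missing value
eq_a <- x == "a"
eq_a
#> [1]  TRUE FALSE    NA FALSE

# Restore the NA explicitly so the condition propagates it again
in_vowels_na <- replace(in_vowels, is.na(x), NA)
in_vowels_na
#> [1]  TRUE FALSE    NA  TRUE
```

Whether a condition like in_vowels_na is preferable to an explicit is.na(x) branch is a matter of taste; either way the caller has to know that %in% dropped the NA.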

Maybe the original design, where .default applies to both FALSE and NA, was correct? Then you are forced to deal with missing values yourself, however they come up, rather than relying on .missing, which might not always work as you expect (because sometimes your NAs become FALSE when the condition is computed). The main reason we didn't like the original design is that when missing values do propagate through, you get weird results like this, which the new version handles better.

x <- c(1, 2, NA)

# NA propagated, then `TRUE ~` overrides the propagation
dplyr::case_when(
  x %% 2 == 0 ~ "even",
  TRUE ~ "odd"
)
#> [1] "odd"  "even" "odd"

# NA propagated and `.default` doesn't handle it
vec_case_when(
  x %% 2 == 0, "even",
  .default = "odd"
)
#> [1] "odd"  "even" NA

@DavisVaughan
Member Author

Superseded by #6300

@DavisVaughan DavisVaughan deleted the feature/vec-case-when branch July 1, 2022 14:15
2 participants