`rep_slice_sample()` doesn't work with integer sampling weights #480

andrewpbray · 2023-03-07T21:05:09Z

I'm trying my hand at teaching an intro course using purely rep_slice_sample() and an as yet to be written rep_shuffle_col() or something of the sort. This morning I ran into an unexpected behavior of rep_slice_sample(), particular since it differs from slice_sample(). I will have time later today to look into a fix but just thought I'd put up an issue to track it.

Shoutout to @mine-cetinkaya-rundel . That function that you wrote is still going strong like 8 years later!

library(tidyverse)
library(infer)

df <- tibble(id = 1:5, 
             letter = factor(c("a", "b", "c", "d", "e")))

# works when weight vec sums to 1
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = c(.5, .4, .3, .2, .1))
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     1 a     
#>  2         1     3 c     
#>  3         2     3 c     
#>  4         2     5 e     
#>  5         3     5 e     
#>  6         3     2 b     
#>  7         4     1 a     
#>  8         4     3 c     
#>  9         5     1 a     
#> 10         5     2 b

# works when weight vec doesn't sum to 1
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = c(.5, .5, .5, .5, .5))
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     5 e     
#>  2         1     4 d     
#>  3         2     2 b     
#>  4         2     3 c     
#>  5         3     2 b     
#>  6         3     4 d     
#>  7         4     1 a     
#>  8         4     4 d     
#>  9         5     3 c     
#> 10         5     5 e

# but doesn't work when weight vec is integer
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = id)
#> Error: `weight_by` must be 'numeric vector with length `nrow(.data)` = 5', not 'closure'.

# however integer weight vec works for slice_sample()
slice_sample(df, n = 2, 
             weight_by = id)
#> # A tibble: 2 × 2
#>      id letter
#>   <int> <fct> 
#> 1     5 e     
#> 2     3 c

^{Created on 2023-03-07 by the reprex package (v2.0.1)}

The text was updated successfully, but these errors were encountered:

simonpcouch · 2023-03-08T14:56:21Z

Hey Andrew!

Note that integers are numerics:

is.numeric(1L)
#> [1] TRUE

so this isn’t actually an issue with integers vs floats.

It looks like the error you saw here is:

Error: weight_by must be ‘numeric vector with length nrow(.data) = 5’, not ‘closure’.

This is because rep_slice_sample doesn’t support tidyselect, so needs the actual weights vector df$id rather than the tidyselect spec id. Passing the actual weights works fine:

library(tidyverse)
library(infer)

df <- tibble(id = 1:5, 
             letter = factor(c("a", "b", "c", "d", "e")))

rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = df$id)
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     4 d     
#>  2         1     5 e     
#>  3         2     5 e     
#>  4         2     4 d     
#>  5         3     3 c     
#>  6         3     5 e     
#>  7         4     5 e     
#>  8         4     1 a     
#>  9         5     4 d     
#> 10         5     5 e

If we feel it's in scope for the package to support tidy selection here as well, I'm up for it!

^{Created on 2023-03-08 with reprex v2.0.2}

andrewpbray · 2023-03-08T18:36:33Z

Oh, I'm so glad you reviewed this! That's news to me that integers are numerics.

Since this function was adapted (from rep_sample_n()) to mirror changes to dplyr, to me makes sense to keep that parallel going in terms of tidyselect. Do you think it could just be implemented here or if we should really back propagate to any function where tidyselect could play a role? Seeing as how this is just one of the ancillary functions, maybe we could restrict the implementation of tidyselect to just this function?

simonpcouch · 2023-03-08T19:19:44Z

It'd feel fine to me to just implement that change only to weight_by in this function!

(Ope, I wrote tidyselect when I meant data-masking. Woopsies.😆) We already use something data-masking-esque in specify() and now generate() (via the variables argument). generate() uses a check_cols() helper that makes this happen, so we can piggyback off of that machinery.

andrewpbray · 2023-03-08T19:23:44Z

Sounds good to me!

Do you have time to do this? If not, I can take a swipe but it'd likely be awhile before I'd have the brain-space to reacquaint myself with the codebase.

simonpcouch · 2023-03-08T19:49:38Z

Yup, sure thing!🥑

github-actions · 2023-03-26T00:40:12Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

andrewpbray mentioned this issue Mar 7, 2023

Update rep_sample_n.R #481

Closed

simonpcouch mentioned this issue Mar 10, 2023

allow unquoted column name for weight_by in rep_slice_sample() #483

Merged

andrewpbray closed this as completed in #483 Mar 11, 2023

github-actions bot locked and limited conversation to collaborators Mar 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`rep_slice_sample()` doesn't work with integer sampling weights #480

`rep_slice_sample()` doesn't work with integer sampling weights #480

andrewpbray commented Mar 7, 2023

simonpcouch commented Mar 8, 2023

andrewpbray commented Mar 8, 2023

simonpcouch commented Mar 8, 2023

andrewpbray commented Mar 8, 2023

simonpcouch commented Mar 8, 2023

github-actions bot commented Mar 26, 2023

rep_slice_sample() doesn't work with integer sampling weights #480

rep_slice_sample() doesn't work with integer sampling weights #480

Comments

andrewpbray commented Mar 7, 2023

simonpcouch commented Mar 8, 2023

andrewpbray commented Mar 8, 2023

simonpcouch commented Mar 8, 2023

andrewpbray commented Mar 8, 2023

simonpcouch commented Mar 8, 2023

github-actions bot commented Mar 26, 2023

`rep_slice_sample()` doesn't work with integer sampling weights #480

`rep_slice_sample()` doesn't work with integer sampling weights #480