Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rep_slice_sample() doesn't work with integer sampling weights #480

Closed
andrewpbray opened this issue Mar 7, 2023 · 6 comments · Fixed by #483
Closed

rep_slice_sample() doesn't work with integer sampling weights #480

andrewpbray opened this issue Mar 7, 2023 · 6 comments · Fixed by #483

Comments

@andrewpbray
Copy link
Collaborator

I'm trying my hand at teaching an intro course using purely rep_slice_sample() and an as yet to be written rep_shuffle_col() or something of the sort. This morning I ran into an unexpected behavior of rep_slice_sample(), particular since it differs from slice_sample(). I will have time later today to look into a fix but just thought I'd put up an issue to track it.

Shoutout to @mine-cetinkaya-rundel . That function that you wrote is still going strong like 8 years later!

library(tidyverse)
library(infer)

df <- tibble(id = 1:5, 
             letter = factor(c("a", "b", "c", "d", "e")))

# works when weight vec sums to 1
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = c(.5, .4, .3, .2, .1))
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     1 a     
#>  2         1     3 c     
#>  3         2     3 c     
#>  4         2     5 e     
#>  5         3     5 e     
#>  6         3     2 b     
#>  7         4     1 a     
#>  8         4     3 c     
#>  9         5     1 a     
#> 10         5     2 b

# works when weight vec doesn't sum to 1
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = c(.5, .5, .5, .5, .5))
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     5 e     
#>  2         1     4 d     
#>  3         2     2 b     
#>  4         2     3 c     
#>  5         3     2 b     
#>  6         3     4 d     
#>  7         4     1 a     
#>  8         4     4 d     
#>  9         5     3 c     
#> 10         5     5 e

# but doesn't work when weight vec is integer
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = id)
#> Error: `weight_by` must be 'numeric vector with length `nrow(.data)` = 5', not 'closure'.

# however integer weight vec works for slice_sample()
slice_sample(df, n = 2, 
             weight_by = id)
#> # A tibble: 2 × 2
#>      id letter
#>   <int> <fct> 
#> 1     5 e     
#> 2     3 c

Created on 2023-03-07 by the reprex package (v2.0.1)

@simonpcouch
Copy link
Collaborator

Hey Andrew!

Note that integers are numerics:

is.numeric(1L)
#> [1] TRUE

so this isn’t actually an issue with integers vs floats.

It looks like the error you saw here is:

Error: weight_by must be ‘numeric vector with length nrow(.data) = 5’, not ‘closure’.

This is because rep_slice_sample doesn’t support tidyselect, so needs the actual weights vector df$id rather than the tidyselect spec id. Passing the actual weights works fine:

library(tidyverse)
library(infer)

df <- tibble(id = 1:5, 
             letter = factor(c("a", "b", "c", "d", "e")))

rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = df$id)
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     4 d     
#>  2         1     5 e     
#>  3         2     5 e     
#>  4         2     4 d     
#>  5         3     3 c     
#>  6         3     5 e     
#>  7         4     5 e     
#>  8         4     1 a     
#>  9         5     4 d     
#> 10         5     5 e

If we feel it's in scope for the package to support tidy selection here as well, I'm up for it!

Created on 2023-03-08 with reprex v2.0.2

@andrewpbray
Copy link
Collaborator Author

Oh, I'm so glad you reviewed this! That's news to me that integers are numerics.

Since this function was adapted (from rep_sample_n()) to mirror changes to dplyr, to me makes sense to keep that parallel going in terms of tidyselect. Do you think it could just be implemented here or if we should really back propagate to any function where tidyselect could play a role? Seeing as how this is just one of the ancillary functions, maybe we could restrict the implementation of tidyselect to just this function?

@simonpcouch
Copy link
Collaborator

It'd feel fine to me to just implement that change only to weight_by in this function!

(Ope, I wrote tidyselect when I meant data-masking. Woopsies.😆) We already use something data-masking-esque in specify() and now generate() (via the variables argument). generate() uses a check_cols() helper that makes this happen, so we can piggyback off of that machinery.

@andrewpbray
Copy link
Collaborator Author

Sounds good to me!

Do you have time to do this? If not, I can take a swipe but it'd likely be awhile before I'd have the brain-space to reacquaint myself with the codebase.

@simonpcouch
Copy link
Collaborator

Yup, sure thing!🥑

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants