A different use of the numeric_as_char argument in messy_linelist() function #192

Closed
Karim-Mane opened this issue Feb 19, 2025 · 5 comments · Fixed by #199

Comments

@Karim-Mane
Member

Is your feature request related to a problem? Please describe.
This is a follow-up to the discussion in PR #187 about the effect of the numeric_as_char argument.

Describe the solution you'd like
As I mentioned in that discussion, my suggestions are as follows:

  • convert a few numbers (but not all) in the age column into character strings. This would still turn the whole column into character, but it would let us showcase the messy nature of the column (a mixture of numbers and letters).
  • add a prefix and/or suffix to some values in the id column. That would be useful to test the corresponding standardize_subject_ids() function in {cleanepi}.
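To illustrate the first suggestion, here is a minimal base R sketch (it does not use {simulist}; the `age` values are made up for illustration). It shows that replacing only some values with words still makes the whole vector character, and that those entries no longer parse back to integers, which is the kind of mess a cleaning step has to deal with.

```r
# Made-up ages for illustration only
age <- c(42L, 6L, 43L, 19L)

# Replacing some (not all) values with words forces a character vector
age_messy <- as.character(age)
age_messy[c(2, 4)] <- c("six", "nineteen")
class(age_messy)
#> [1] "character"

# The word entries cannot be parsed back to integers and become NA
parsed <- suppressWarnings(as.integer(age_messy))
is.na(parsed)
#> [1] FALSE  TRUE FALSE  TRUE
```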

Additional context
I also suggest renaming this argument to something like int_as_char, as it only affects columns of type integer.

@joshwlambert
Member

There doesn't seem to be a standardize_subject_ids() function in the {cleanepi} NAMESPACE.

@Karim-Mane
Member Author

I meant the check_subject_ids() function instead.

@joshwlambert
Member

> I also suggest renaming this argument into something like int_as_char as this only has effect on columns of type integer.

I don't believe this is the case. The $ct_value column output by sim_linelist() is numeric, and its type changes depending on whether numeric_as_char is TRUE or FALSE when messy_linelist() is called. In the reprex below, $ct_value is character when numeric_as_char = TRUE and numeric when numeric_as_char = FALSE.

library(simulist)
set.seed(1234)
ll <- sim_linelist()
head(ll)
#>   id        case_name case_type sex age date_onset date_reporting
#> 1  1   Carlton Aragon suspected   m  42 2023-01-01     2023-01-01
#> 2  2 Baaqir al-Demian  probable   m   6 2023-01-06     2023-01-06
#> 3  4      Joshua Sher  probable   m  43 2023-01-07     2023-01-07
#> 4  5    Austin Porter suspected   m  19 2023-01-11     2023-01-11
#> 5  6    Sara Tennyson suspected   f  22 2023-01-09     2023-01-09
#> 6  7    Rachel Nguyen  probable   f  46 2023-01-13     2023-01-13
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1     2023-01-05      died   2023-01-14               <NA>              <NA>
#> 2           <NA>      died   2023-01-09         2022-12-27        2023-01-02
#> 3           <NA> recovered         <NA>         2023-01-07        2023-01-09
#> 4           <NA> recovered         <NA>         2023-01-04        2023-01-09
#> 5           <NA> recovered         <NA>         2023-01-01        2023-01-07
#> 6           <NA>      died   2023-01-25         2023-01-07        2023-01-09
#>   ct_value
#> 1       NA
#> 2       NA
#> 3       NA
#> 4       NA
#> 5       NA
#> 6       NA
sapply(ll, class)
#>                 id          case_name          case_type                sex 
#>          "integer"        "character"        "character"        "character" 
#>                age         date_onset     date_reporting     date_admission 
#>          "integer"             "Date"             "Date"             "Date" 
#>            outcome       date_outcome date_first_contact  date_last_contact 
#>        "character"             "Date"             "Date"             "Date" 
#>           ct_value 
#>          "numeric"
messy_ll1 <- messy_linelist(ll, numeric_as_char = TRUE)
head(messy_ll1)
#>     id        case_name case_type  sex         age date_onset date_reporting
#> 1  one             <NA>      <NA>    m   forty-two 2023-01-01     2023-01-01
#> 2  two Baaqir al-Demian  probable Male         six       <NA>     2023-01-06
#> 3  two Baaqir al-Demian  probable Male         six       <NA>     2023-01-06
#> 4 four             <NA>  probable male forty-three 2023-01-07     2023-01-07
#> 5 five    Austin Porter suspected    m    nineteen 2023-01-11     2023-01-11
#> 6  six    Sara Tennyson suspected    f  twenty-two 2023-01-09     2023-01-09
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1     2023-01-05      died   2023-01-14               <NA>              <NA>
#> 2           <NA>      died   2023-01-09         2022-12-27        2023-01-02
#> 3           <NA>      died   2023-01-09         2022-12-27        2023-01-02
#> 4           <NA> recovered         <NA>               <NA>        2023-01-09
#> 5           <NA> recovered         <NA>         2023-01-04        2023-01-09
#> 6           <NA> recovered         <NA>         2023-01-01        2023-01-07
#>   ct_value
#> 1     <NA>
#> 2     <NA>
#> 3     <NA>
#> 4     <NA>
#> 5     <NA>
#> 6     <NA>
sapply(messy_ll1, class)
#>                 id          case_name          case_type                sex 
#>        "character"        "character"        "character"        "character" 
#>                age         date_onset     date_reporting     date_admission 
#>        "character"        "character"        "character"        "character" 
#>            outcome       date_outcome date_first_contact  date_last_contact 
#>        "character"        "character"        "character"        "character" 
#>           ct_value 
#>        "character"
messy_ll2 <- messy_linelist(ll, numeric_as_char = FALSE)
head(messy_ll2)
#>      id        case_name case_type    sex         age date_onset date_reporting
#> 1   one   CarltonzAragon suspected   male   forty-two 2023-01-01     2023-01-01
#> 2   two Baaqir al-Demian  probable      M         six 2023-01-06     2023-01-06
#> 3  four      Joshua Sher  prqbable   <NA> forty-three 2023-01-07           <NA>
#> 4  five             <NA> suspected   <NA>    nineteen 2023-01-11     2023-01-11
#> 5   six    Sara Tennyson suspected female  twenty-two 2023-01-09           <NA>
#> 6 seven    Rachel Nguyej  probable      F   forty-six 2023-01-13     2023-01-13
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1     2023-01-05      dieg   2023-01-14               <NA>              <NA>
#> 2           <NA>      died   2023-01-09         2022-12-27        2023-01-02
#> 3           <NA> recovered         <NA>               <NA>        2023-01-09
#> 4           <NA> recovered         <NA>         2023-01-04        2023-01-09
#> 5           <NA> recovered         <NA>         2023-01-01              <NA>
#> 6           <NA>      died   2023-01-25               <NA>        2023-01-09
#>   ct_value
#> 1       NA
#> 2       NA
#> 3       NA
#> 4       NA
#> 5       NA
#> 6       NA
sapply(messy_ll2, class)
#>                 id          case_name          case_type                sex 
#>        "character"        "character"        "character"        "character" 
#>                age         date_onset     date_reporting     date_admission 
#>        "character"        "character"        "character"        "character" 
#>            outcome       date_outcome date_first_contact  date_last_contact 
#>        "character"        "character"        "character"        "character" 
#>           ct_value 
#>          "numeric"

Created on 2025-02-20 with reprex v2.1.1
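A compact way to check which columns differ in type between two outputs is to compare their classes directly. The sketch below uses toy data frames in base R rather than real messy_linelist() output; `col_classes()` and the toy objects are hypothetical names introduced for illustration.

```r
# Hypothetical helper: first class of every column as a named character vector
col_classes <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Toy stand-in for a line list (not real sim_linelist() output)
ll_toy <- data.frame(
  id = 1:3,
  age = c(42L, 6L, 43L),
  ct_value = c(25.1, NA, 27.3)
)

# Stand-in for numeric_as_char = TRUE: every column becomes character
as_char <- ll_toy
as_char[] <- lapply(ll_toy, as.character)

# Stand-in for numeric_as_char = FALSE: only the integer columns change
only_int <- ll_toy
only_int$id <- as.character(only_int$id)
only_int$age <- as.character(only_int$age)

# Columns whose type differs between the two versions
differs <- names(ll_toy)[col_classes(as_char) != col_classes(only_int)]
differs
#> [1] "ct_value"
```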

@Karim-Mane
Member Author

Thanks @joshwlambert for your efforts to clarify this.

My suggestion was not meant to show that the function fails to produce the documented output for the given arguments. Instead, I was proposing to change the effect of arguments such as numeric_as_char so that they introduce some messiness across columns, which we can then use to showcase {cleanepi} functionality.
Basically, I am suggesting that this argument should not only convert numeric columns into character, but also make some (not all) values in the age and id columns messy.

However, if you are happy with the current behaviour of the function, please feel free to close this issue and ignore it.

@joshwlambert
Member

Thanks for the feedback.

I've addressed these requests in PR #199.

> convert few numbers (but not all) into character in the age columns as that would convert the column into character. But that way we can feature the messy character of the column (with a mixture of numbers and letters).

In #199 I've updated int_as_word to prop_int_as_word, which now accepts a number between 0 and 1 instead of a logical. This allows you to customise the proportion of integers that are converted to words. The entire column will still be of type character.
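For readers who want to see the idea in isolation, here is a rough base R sketch of converting a proportion of integers to words. It is not the code from #199: `int_to_word()` and `some_ints_as_words()` are hypothetical helpers, the word lookup only covers 0 to 99, and converting exactly round(prop * n) values is an assumption made for the example.

```r
# Hypothetical helper: spell out integers between 0 and 99
int_to_word <- function(x) {
  ones <- c("zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen",
            "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
            "nineteen")
  tens <- c("twenty", "thirty", "forty", "fifty", "sixty", "seventy",
            "eighty", "ninety")
  stopifnot(all(x >= 0 & x <= 99))
  vapply(x, function(n) {
    if (n < 20) return(ones[n + 1])
    word <- tens[n %/% 10 - 1]
    if (n %% 10 > 0) word <- paste(word, ones[n %% 10 + 1], sep = "-")
    word
  }, character(1))
}

# Hypothetical helper: NOT simulist's implementation of prop_int_as_word
some_ints_as_words <- function(x, prop) {
  stopifnot(is.numeric(prop), length(prop) == 1, prop >= 0, prop <= 1)
  out <- as.character(x)
  # Assumption: convert exactly round(prop * n) values, chosen at random
  idx <- sample.int(length(x), size = round(prop * length(x)))
  out[idx] <- int_to_word(x[idx])
  out
}

set.seed(1)
age <- c(42L, 6L, 43L, 19L, 22L, 46L)
messy_age <- some_ints_as_words(age, prop = 0.5)
class(messy_age)
#> [1] "character"
sum(messy_age != as.character(age))
#> [1] 3
```

Whatever proportion is chosen, the result is a character vector, which matches the behaviour described above.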

> add a prefix and/or suffix to some values in the id column. That would be useful to test the corresponding standardize_subject_ids() function in {cleanepi}.

#199 adds a new setting to messy_linelist(): inconsistent_id. By default this is FALSE, so the $id column is not affected. If set to TRUE, it will add a random three-letter prefix or suffix, or both, to roughly 10% of the IDs in the $id column.
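As a rough picture of what this looks like, here is a base R sketch. It is not the implementation in #199: `add_id_affixes()` is a hypothetical helper, and the lowercase letters, the lack of a separator, and changing exactly round(prop * n) IDs are all assumptions made for the example.

```r
# Hypothetical helper: NOT simulist's implementation of inconsistent_id
add_id_affixes <- function(id, prop = 0.1) {
  stopifnot(is.numeric(prop), length(prop) == 1, prop >= 0, prop <= 1)
  out <- as.character(id)
  # Assumption: change exactly round(prop * n) IDs, chosen at random
  idx <- sample.int(length(out), size = round(prop * length(out)))
  # Three random lowercase letters pasted together
  rand_affix <- function() {
    paste(sample(letters, 3, replace = TRUE), collapse = "")
  }
  for (i in idx) {
    position <- sample(c("prefix", "suffix", "both"), size = 1)
    if (position != "suffix") out[i] <- paste0(rand_affix(), out[i])
    if (position != "prefix") out[i] <- paste0(out[i], rand_affix())
  }
  out
}

set.seed(42)
id <- 1:50
messy_id <- add_id_affixes(id, prop = 0.1)
changed <- messy_id != as.character(id)
sum(changed)
#> [1] 5
```

IDs messed up this way could then be used to exercise an ID check such as check_subject_ids() in {cleanepi}.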
