Add messy_linelist() function #187

Merged: 47 commits, Feb 18, 2025

joshwlambert
Member

This PR adds the messy_linelist() function to the package (closes #183). This function takes the output of sim_linelist() or the first list element of the output of sim_outbreak() and converts the clean line list data into messy line list data.

A new internal function, .spelling_mistake(), is added and is called by messy_linelist().
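As a rough illustration of the kind of edit such a helper might make, here is a minimal base R sketch that swaps one letter of a string for a different, randomly chosen lowercase letter. The name `spelling_mistake_sketch()` and its logic are assumptions made for illustration; this is not the package's internal `.spelling_mistake()`.

```r
# Hypothetical helper (not the simulist internal): replace one letter in a
# string with a different, randomly chosen lowercase letter.
spelling_mistake_sketch <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  chars <- strsplit(x, split = "", fixed = TRUE)[[1]]
  # only letters are candidates, so digits, spaces and punctuation are kept
  letter_idx <- grep("[[:alpha:]]", chars)
  if (length(letter_idx) == 0L) {
    return(x)
  }
  # sample a position by index to avoid sample()'s length-1 special case
  pos <- letter_idx[sample.int(length(letter_idx), size = 1L)]
  # exclude the current letter so the output always differs from the input
  candidates <- setdiff(letters, tolower(chars[pos]))
  chars[pos] <- sample(candidates, size = 1L)
  paste(chars, collapse = "")
}

set.seed(1)
spelling_mistake_sketch("confirmed")
```

A single-letter substitution is only one possible kind of mistake; deletions, insertions or transpositions could be sketched in the same way.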

Unit tests are added for the messy_linelist() function.

The {english} package is added as an Imported dependency in the DESCRIPTION. This is called by messy_linelist() when int_as_word = TRUE (which it is by default) to convert integers into words.
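For readers unfamiliar with {english}, the following is a minimal sketch of converting integers to words with that package. The wrapper name `int_to_words_sketch()` is hypothetical, and how messy_linelist() calls {english} internally is not shown in this PR description, so treat this as an assumption about usage rather than the actual implementation.

```r
# Hypothetical wrapper: convert an integer vector into English words using
# the {english} package (must be installed for this to run).
int_to_words_sketch <- function(x) {
  if (!requireNamespace("english", quietly = TRUE)) {
    stop("The {english} package is required for this sketch.")
  }
  # english::english() returns an object of class "english";
  # as.character() turns it into a plain character vector
  as.character(english::english(x))
}

if (requireNamespace("english", quietly = TRUE)) {
  ages <- c(4L, 29L, 85L)
  age_words <- int_to_words_sketch(ages)
  print(age_words)
}
```

Note that once a column is written as words it is a character column, so as.numeric() on it returns NA for every element.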

The README is updated to include:

  • {cleanepi} in the Complimentary R packages section
  • {messy} in the Related projects section

The design-principles.Rmd vignette is updated, adding a bullet point on the function naming convention for exported post-processing functions (renaming of truncation() required, #186), and adding {english} to the list of hard package dependencies.

joshwlambert added the enhancement (New feature or request) label on Feb 10, 2025

This pull request:

  • Adds 1 new dependencies (direct and indirect)
  • Adds 1 new system dependencies
  • Removes 0 existing dependencies (direct and indirect)
  • Removes 0 existing system dependencies

Reach out on slack (#code-review or #help channels) to double check if there are base R alternatives to the new dependencies.

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)


#'
#' @examples
#' linelist <- sim_linelist()
#' messy_linelist <- messy_linelist(linelist)
Member

prop_missing is 0.1 by default. Does this mean that 10% of the cells in the data frame will be set to NA?

The simulated data already contains a certain number of missing values, so the 10% adds further missing values to the simulated messy data.

set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
num_na_before <- sum(is.na(unlist(linelist)))
num_na_before
#> [1] 333

# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
num_na_after <- sum(is.na(unlist(messy_linelist)))
num_na_after
#> [1] 505

# expected number of introduced random missingness 
num_missing <- round(prod(dim(messy_linelist)) * 0.1)
num_missing
#> [1] 205

Created on 2025-02-11 with reprex v2.1.0

I am wondering about the definition of this parameter and what its effect should be. One suggestion:

  • Could it be the minimum percentage of missing values in the messy data after simulation? In that case, the function would only add missing values if the initial dataset contains a smaller percentage than this.
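To make the suggested "minimum percentage" semantics concrete, here is a minimal base R sketch. The helper name `top_up_missing()` and the argument `min_prop` are hypothetical and not part of the package; the sketch counts existing NAs towards the target and only adds the shortfall.

```r
# Hypothetical helper: top up missing values so that at least `min_prop` of
# all cells are NA. Existing NAs count towards the target.
top_up_missing <- function(df, min_prop = 0.1) {
  stopifnot(is.data.frame(df), min_prop >= 0, min_prop <= 1)
  is_missing <- is.na(df) # logical matrix with the same shape as df
  target <- ceiling(min_prop * length(is_missing))
  n_extra <- target - sum(is_missing)
  if (n_extra <= 0) {
    return(df) # already at or above the minimum
  }
  # sample only from cells that are not yet missing
  available <- which(!is_missing)
  chosen <- available[sample.int(length(available), size = n_extra)]
  is_missing[chosen] <- TRUE
  # assigning through a logical matrix sets the selected cells to NA
  df[is_missing] <- NA
  df
}

set.seed(1)
toy <- data.frame(a = c(1, NA, 3, 4, 5), b = letters[1:5])
sum(is.na(top_up_missing(toy, min_prop = 0.5)))
```

Under these semantics a dataset that already exceeds the threshold is returned unchanged, which differs from adding a fixed proportion of new NAs.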

Member Author

Thanks for raising this as it has been mentioned before and may be causing some confusion.

The line list output by sim_linelist() does not contain any missing values. However, it does contain NAs. The NAs are used for recovery times, which are not supplied in the line list by default (see issue #36 for a discussion on this), for hospital admission times for those who are not admitted to hospital, and for Ct values for cases that are not "confirmed", because we assume that only confirmed cases have a PCR test that provides a Ct value. The line list is still regarded as complete.

messy_linelist() introduces ~10% missing values by inserting NAs at random (possibly overwriting existing NAs). This 10% does not include cells that are supposed to be NA (e.g. cases that recover without the user specifying an onset-to-recovery delay distribution).
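The effect of sampling cells without regard to their current value can be sketched in base R. The helper `add_random_missing()` below is hypothetical and only mimics the behaviour described here; it is not the package code. If some sampled cells are already NA, the NA count rises by less than the number of sampled cells, which would be consistent with the reprex above, where the count rose by 172 (from 333 to 505) against an expected 205.

```r
# Hypothetical helper: set a random `prop` of ALL cells to NA, without
# checking whether a sampled cell is already NA.
add_random_missing <- function(df, prop = 0.1) {
  n_cells <- prod(dim(df))
  n_new <- round(n_cells * prop)
  mask <- matrix(FALSE, nrow = nrow(df), ncol = ncol(df))
  mask[sample.int(n_cells, size = n_new)] <- TRUE
  df[mask] <- NA
  df
}

set.seed(1)
# toy data frame that already contains NAs by design
toy <- data.frame(
  id = 1:10,
  ct_value = c(NA, 22.3, 24.5, 24.8, NA, NA, 25.1, NA, 23.9, NA)
)
messy_toy <- add_random_missing(toy, prop = 0.2)

# the increase is at most the number of sampled cells, and smaller whenever
# a sampled cell was already NA
n_added <- sum(is.na(messy_toy)) - sum(is.na(toy))
n_added <= round(prod(dim(toy)) * 0.2)
```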

I would say the main question to resolve is whether there is a better way to represent cells in the line list <data.frame> that should be empty (e.g. Ct values for probable cases) without using NA, or a way to disambiguate NAs.
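One possible way to disambiguate NAs, sketched here purely as an assumption and not as something the package does, is to record which cells are NA by design before any missingness is introduced, and compare against that mask afterwards. The toy data and variable names below are hypothetical.

```r
# toy line list: ct_value is NA by design for cases that are not confirmed
linelist <- data.frame(
  case_type = c("confirmed", "probable", "confirmed", "suspected"),
  ct_value = c(22.3, NA, 24.5, NA)
)

# record which cells are NA by design before any missingness is introduced
structural_na <- is.na(linelist)

# missingness introduced afterwards (hard-coded here for the example)
messy <- linelist
messy$ct_value[1] <- NA

# NAs present now that were not structural are the introduced "missing" cells
introduced_na <- is.na(messy) & !structural_na

sum(structural_na)
sum(introduced_na)
```

The mask has to be carried alongside the data, so this trades a cleaner `<data.frame>` for extra bookkeeping; a sentinel value in the cell itself would be an alternative design.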

#' * `numeric_as_char = TRUE`
#' * `date_as_char = TRUE`
#' * `inconsistent_dates = FALSE`
#' * `int_as_word = TRUE`
Member

This argument allows numeric columns to be converted to character. However, the values in those columns remain the same (except where an NA is introduced).

set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
head(linelist)
#>   id       case_name case_type sex age date_onset date_reporting date_admission
#> 1  1     James Manis suspected   m  59 2023-01-01     2023-01-01     2023-01-09
#> 2  2   Anisa Hatcher confirmed   f  90 2023-01-01     2023-01-01           <NA>
#> 3  3     Morgan Bohn confirmed   f   4 2023-01-02     2023-01-02           <NA>
#> 4  5    David Welter confirmed   m  29 2023-01-04     2023-01-04           <NA>
#> 5  6   Sade Phillips suspected   f  14 2023-01-05     2023-01-05     2023-01-09
#> 6  7 Sameeha al-Zaki  probable   f  85 2023-01-06     2023-01-06     2023-01-08
#>     outcome date_outcome date_first_contact date_last_contact ct_value
#> 1      died   2023-01-13               <NA>              <NA>       NA
#> 2 recovered         <NA>         2022-12-31        2023-01-05     22.3
#> 3 recovered         <NA>         2022-12-30        2023-01-01     24.5
#> 4 recovered         <NA>         2023-01-05        2023-01-05     24.8
#> 5      died   2023-01-23         2023-01-07        2023-01-08       NA
#> 6 recovered         <NA>         2023-01-03        2023-01-06       NA

# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
head(messy_linelist)
#>   id       case_name case_type    sex  age date_onset date_reporting
#> 1  1     James Manis suspected      m <NA> 2023-01-01     2023-01-01
#> 2  2   Anisa Hatcher confirmed      f   90       <NA>     2023-01-01
#> 3  3     Morgan Bohn cenfirmed      f    4 2023-01-02     2023-01-02
#> 4  5    Datid Welter confirmed      M   29 2023-01-04     2023-01-04
#> 5  6   Sade Phillips suspected Female   14 2023-01-05     2023-01-05
#> 6  7 Sameeha al-Zaki  probable female   85 2023-01-06     2023-01-06
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1           <NA>      died   2023-01-13               <NA>              <NA>
#> 2           <NA> recovered         <NA>         2022-12-31        2023-01-05
#> 3           <NA> recovered         <NA>         2022-12-30        2023-01-01
#> 4           <NA> recovered         <NA>         2023-01-05              <NA>
#> 5     2023-01-09      died   2023-01-23         2023-01-07        2023-01-08
#> 6     2023-01-08 recovered         <NA>         2023-01-03        2023-01-06
#>   ct_value
#> 1     <NA>
#> 2     22.3
#> 3     24.5
#> 4     24.8
#> 5     <NA>
#> 6     <NA>

# the values differ only where an NA has been introduced
are_different <- setdiff(linelist$age, as.numeric(messy_linelist$age))
if (length(are_different) > 0) {
  # get their indices (%in% also handles more than one differing value)
  idx <- which(linelist$age %in% are_different)

  # display their replacement in the messy data
  messy_linelist$age[idx]
}
#> [1] NA

Created on 2025-02-12 with reprex v2.1.0

You could consider:

  • writing some of the values in the age column as words (the reverse of what {numberize} does, as also suggested by @Degoot-AM)
  • adding a prefix and/or suffix to some values in the id column. That would be useful for testing the corresponding functionality in {cleanepi}.
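The second suggestion could look something like the following base R sketch. The helper `add_id_affixes()` and its arguments are hypothetical and are not part of {simulist} or {cleanepi}; the sketch simply converts the ids to character and decorates a random subset.

```r
# Hypothetical helper: add a prefix and/or suffix to a random subset of ids.
add_id_affixes <- function(id, prop = 0.5, prefix = "ID-", suffix = "") {
  stopifnot(prop >= 0, prop <= 1)
  id <- as.character(id)
  n_change <- round(length(id) * prop)
  # choose which ids to modify
  idx <- sample.int(length(id), size = n_change)
  id[idx] <- paste0(prefix, id[idx], suffix)
  id
}

set.seed(1)
add_id_affixes(1:10, prop = 0.3)
```

Because only some ids are changed, the column ends up with mixed formats, which is the situation an id-cleaning step would need to handle.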

Member Author

The messy_linelist() function has changed since the reprex in your comment. The current code changes all ages (integers) to words.

set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
head(linelist)
#>   id       case_name case_type sex age date_onset date_reporting date_admission
#> 1  1     James Manis suspected   m  59 2023-01-01     2023-01-01     2023-01-09
#> 2  2   Anisa Hatcher confirmed   f  90 2023-01-01     2023-01-01           <NA>
#> 3  3     Morgan Bohn confirmed   f   4 2023-01-02     2023-01-02           <NA>
#> 4  5    David Welter confirmed   m  29 2023-01-04     2023-01-04           <NA>
#> 5  6   Sade Phillips suspected   f  14 2023-01-05     2023-01-05     2023-01-09
#> 6  7 Sameeha al-Zaki  probable   f  85 2023-01-06     2023-01-06     2023-01-08
#>     outcome date_outcome date_first_contact date_last_contact ct_value
#> 1      died   2023-01-13               <NA>              <NA>       NA
#> 2 recovered         <NA>         2022-12-31        2023-01-05     22.3
#> 3 recovered         <NA>         2022-12-30        2023-01-01     24.5
#> 4 recovered         <NA>         2023-01-05        2023-01-05     24.8
#> 5      died   2023-01-23         2023-01-07        2023-01-08       NA
#> 6 recovered         <NA>         2023-01-03        2023-01-06       NA

# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
head(messy_linelist)
#>      id       case_name case_type    sex         age date_onset date_reporting
#> 1   one     James Manis suspected      m        <NA> 2023-01-01     2023-01-01
#> 2   two   Anisa Hatcher confirmed      f      ninety       <NA>     2023-01-01
#> 3 three     Morgan Bohn cenfirmed      f        four 2023-01-02     2023-01-02
#> 4  five    Datid Welter confirmed      M twenty-nine 2023-01-04     2023-01-04
#> 5   six   Sade Phillips suspected Female    fourteen 2023-01-05     2023-01-05
#> 6 seven Sameeha al-Zaki  probable female eighty-five 2023-01-06     2023-01-06
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1           <NA>      died   2023-01-13               <NA>              <NA>
#> 2           <NA> recovered         <NA>         2022-12-31        2023-01-05
#> 3           <NA> recovered         <NA>         2022-12-30        2023-01-01
#> 4           <NA> recovered         <NA>         2023-01-05              <NA>
#> 5     2023-01-09      died   2023-01-23         2023-01-07        2023-01-08
#> 6     2023-01-08 recovered         <NA>         2023-01-03        2023-01-06
#>   ct_value
#> 1     <NA>
#> 2     22.3
#> 3     24.5
#> 4     24.8
#> 5     <NA>
#> 6     <NA>

as.numeric(messy_linelist$age)
#> Warning: NAs introduced by coercion
#>   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#>  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#>  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#>  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [151] NA NA NA NA NA NA NA NA

Created on 2025-02-17 with reprex v2.1.1

Please let me know if this resolves your point or whether there are other changes required. Thanks.

#' mistakes and inconsistencies, as well as coerce date types.
#'
#' @param linelist Line list `<data.frame>` output from [sim_linelist()].
#' @inheritParams create_config
Member

I suggest updating the function documentation by adding the following section, so that the function arguments are clearly described in the corresponding section.

#' @param ... Other arguments that can be used to modify the default behaviour
#' of the function. Accepted arguments are:
#'
#' \describe{
#'   \item{`prop_missing`}{A numeric between 0 and 1 used to ... Default is
#'   `0.1`.}
#'   \item{`missing_value`}{A numeric or character used to represent the added
#'   missing values. Default is `NA`.}
#'   \item{`prop_spelling_mistakes`}{A numeric between 0 and 1 used to specify
#'   the proportion of spelling mistakes in the messy data. Default is `0.1`.}
#'   \item{`inconsistent_sex`}{A boolean used to specify whether the values in
#'   the sex column should be inconsistent (TRUE) or not. Default is `TRUE`.}
#'   \item{`sex_as_numeric`}{A boolean used to specify whether the values in the
#'   sex column should be of type numeric or not. Default is `FALSE`.}
#'   \item{`numeric_as_char`}{A boolean used to specify whether numeric columns
#'   should be converted into character or not. Default is `TRUE`.}
#'   \item{`date_as_char`}{A boolean used to specify whether Date columns should
#'   be converted into character or not. Default is `TRUE`.}
#'   \item{`inconsistent_dates`}{A boolean used to specify whether the values in
#'   columns of type Date should be inconsistent or not. Default is `FALSE`.}
#'   \item{`int_as_word`}{A boolean used to specify whether to convert
#'   integers into words or not. Default is `TRUE`.}
#' }
#'
#'

That way, you can remove lines 24 to 34 in the current function documentation.

Member Author

Thanks for suggesting this! This is a really nice way of formatting the documentation that I'd overlooked.

However, \describe{} doesn't seem to work in @param, so I've put it in @details instead (see f311bc6). If you know of a way to put it in @param, please let me know, as I think it would be better placed there.
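For reference, the layout described here might look roughly like the sketch below. The function `messy_stub()` is a hypothetical stand-in used only to carry the roxygen block, and the contents of commit f311bc6 are not reproduced; only two arguments from the suggestion above are shown.

```r
#' Stub used only to illustrate the documentation layout
#'
#' @param linelist Line list `<data.frame>`.
#' @param ... Named arguments that modify the default behaviour of the
#'   function; see Details for the accepted arguments.
#'
#' @details
#' Accepted arguments in `...` are:
#'
#' \describe{
#'   \item{`missing_value`}{A numeric or character used to represent the added
#'   missing values. Default is `NA`.}
#'   \item{`int_as_word`}{A boolean used to specify whether to convert
#'   integers into words or not. Default is `TRUE`.}
#' }
#'
#' @return The input `<data.frame>`, unchanged (this is only a stub).
messy_stub <- function(linelist, ...) {
  linelist
}
```

Keeping a short pointer in `@param ...` means readers who look at the Arguments section are still directed to the full list under Details.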

joshwlambert
Member Author

lintr is also flagging warnings on the main branch, so I'm going to merge this PR and then fix the linting issues in a separate PR.

Successfully merging this pull request may close these issues.

Feedback on the messy Function
3 participants