Add `messy_linelist()` function #187
…post-processing functions
This pull request:
Reach out on slack ( (Note that results may be inaccurate if you branched from an outdated version of the target branch.) |
#'
#' @examples
#' linelist <- sim_linelist()
#' messy_linelist <- messy_linelist(linelist)
`prop_missing` is `0.1` by default. Does this mean that 10% of the cells in the data frame will be set to `NA`?
The simulated data already contains a number of missing values, so the 10% adds further missing values to the simulated messy data.
set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
num_na_before <- sum(is.na(unlist(linelist)))
num_na_before
#> [1] 333
# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
num_na_after <- sum(is.na(unlist(messy_linelist)))
num_na_after
#> [1] 505
# expected number of introduced random missingness
num_missing <- round(prod(dim(messy_linelist)) * 0.1)
num_missing
#> [1] 205
Created on 2025-02-11 with reprex v2.1.0
I am wondering about the definition of this parameter and what its effect should be. A suggestion could be:
- Could it be the minimum percentage of missing values in the messy data after simulation? In this case, the function would only add more missing values if the percentage of missing values in the initial dataset is below this threshold.
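The "minimum percentage" reading suggested above could be sketched roughly as follows. This is only an illustration in base R: `top_up_missing()` is a hypothetical helper, not part of {simulist}, and it assumes `NA` is the only missing-value marker.

```r
# Hypothetical helper (not part of simulist): treat `prop_missing` as a floor
# on the total proportion of NA cells rather than as an amount to add.
top_up_missing <- function(df, prop_missing = 0.1) {
  stopifnot(is.data.frame(df), prop_missing >= 0, prop_missing <= 1)
  is_missing <- is.na(df) # logical matrix, one entry per cell
  target <- ceiling(length(is_missing) * prop_missing)
  n_to_add <- target - sum(is_missing)
  if (n_to_add <= 0) {
    return(df) # already at or above the floor, nothing to do
  }
  # sample only from cells that currently hold a value
  candidates <- which(!is_missing, arr.ind = TRUE)
  picked <- candidates[sample.int(nrow(candidates), n_to_add), , drop = FALSE]
  for (i in seq_len(nrow(picked))) {
    df[picked[i, "row"], picked[i, "col"]] <- NA
  }
  df
}

set.seed(1)
toy <- data.frame(age = c(59, 90, 4, 29), ct_value = c(NA, 22.3, 24.5, 24.8))
# 1 of 8 cells is already NA, which is above a 10% floor: nothing is added
sum(is.na(top_up_missing(toy, prop_missing = 0.1)))
# a 50% floor tops the data up to 4 NA cells out of 8
sum(is.na(top_up_missing(toy, prop_missing = 0.5)))
```

Under this definition the argument controls the final proportion of `NA`s, so existing `NA`s count towards the target instead of being added to.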
Thanks for raising this as it has been mentioned before and may be causing some confusion.
The line list output by `sim_linelist()` does not contain any missing values. However, it does contain `NA`s. The `NA`s are used for recovery times, which are not supplied in the line list by default (see issue #36 for a discussion on this), for hospital admission times for those who are not admitted to hospital, and for Ct values for cases that are not "confirmed", because we assume that only confirmed cases have a PCR which provides a Ct value. The line list is still regarded as complete.
`messy_linelist()` introduces 10% missing values by inserting `NA`s at random (possibly even overwriting existing `NA`s) to provide ~10% missing. This does not include cells that are supposed to be `NA` (e.g. cases that recover without the user specifying an onset-to-recovery delay distribution).
I would say the main question to resolve is whether there is a better way to represent cells in the line list `<data.frame>` that should be empty (e.g. Ct values for probable cases) without using `NA`, or a way to disambiguate `NA`s.
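One possible way to disambiguate the two kinds of `NA` is to record a mask of the structurally empty cells before adding noise. The sketch below uses a toy data frame and a manual stand-in for `messy_linelist()`; it is an illustration, not how the package works. As a rough check on the reprex above, 333 existing `NA`s plus 205 expected new ones would give 538, whereas 505 were observed, which would be consistent with some draws landing on cells that were already `NA` (assuming the function blanks about `round(0.1 * cells)` cells).

```r
# Toy line list: ct_value is NA by design for non-confirmed cases.
linelist <- data.frame(
  case_type = c("suspected", "confirmed", "confirmed", "probable"),
  ct_value = c(NA, 22.3, 24.5, NA)
)

# Record which cells are empty by design before any noise is added.
structural_na <- is.na(linelist)

# Stand-in for messy_linelist(): blank one already-empty and one observed cell.
messy <- linelist
messy$ct_value[c(1, 2)] <- NA

# NAs that are genuinely new, as opposed to those overwriting an existing NA.
introduced_na <- is.na(messy) & !structural_na
sum(introduced_na)
# The same count, derived from the before and after totals.
sum(is.na(messy)) - sum(structural_na)
```

Keeping the mask alongside the data (for example as an attribute) would let downstream code tell "not applicable" apart from "lost", without changing the `NA` encoding itself.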
R/messy_linelist.R
Outdated
#' * `numeric_as_char = TRUE`
#' * `date_as_char = TRUE`
#' * `inconsistent_dates = FALSE`
#' * `int_as_word = TRUE`
This argument allows numeric columns to be converted into character. However, the values in those columns remain the same (except where an `NA` is introduced).
set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
head(linelist)
#> id case_name case_type sex age date_onset date_reporting date_admission
#> 1 1 James Manis suspected m 59 2023-01-01 2023-01-01 2023-01-09
#> 2 2 Anisa Hatcher confirmed f 90 2023-01-01 2023-01-01 <NA>
#> 3 3 Morgan Bohn confirmed f 4 2023-01-02 2023-01-02 <NA>
#> 4 5 David Welter confirmed m 29 2023-01-04 2023-01-04 <NA>
#> 5 6 Sade Phillips suspected f 14 2023-01-05 2023-01-05 2023-01-09
#> 6 7 Sameeha al-Zaki probable f 85 2023-01-06 2023-01-06 2023-01-08
#> outcome date_outcome date_first_contact date_last_contact ct_value
#> 1 died 2023-01-13 <NA> <NA> NA
#> 2 recovered <NA> 2022-12-31 2023-01-05 22.3
#> 3 recovered <NA> 2022-12-30 2023-01-01 24.5
#> 4 recovered <NA> 2023-01-05 2023-01-05 24.8
#> 5 died 2023-01-23 2023-01-07 2023-01-08 NA
#> 6 recovered <NA> 2023-01-03 2023-01-06 NA
# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
head(messy_linelist)
#> id case_name case_type sex age date_onset date_reporting
#> 1 1 James Manis suspected m <NA> 2023-01-01 2023-01-01
#> 2 2 Anisa Hatcher confirmed f 90 <NA> 2023-01-01
#> 3 3 Morgan Bohn cenfirmed f 4 2023-01-02 2023-01-02
#> 4 5 Datid Welter confirmed M 29 2023-01-04 2023-01-04
#> 5 6 Sade Phillips suspected Female 14 2023-01-05 2023-01-05
#> 6 7 Sameeha al-Zaki probable female 85 2023-01-06 2023-01-06
#> date_admission outcome date_outcome date_first_contact date_last_contact
#> 1 <NA> died 2023-01-13 <NA> <NA>
#> 2 <NA> recovered <NA> 2022-12-31 2023-01-05
#> 3 <NA> recovered <NA> 2022-12-30 2023-01-01
#> 4 <NA> recovered <NA> 2023-01-05 <NA>
#> 5 2023-01-09 died 2023-01-23 2023-01-07 2023-01-08
#> 6 2023-01-08 recovered <NA> 2023-01-03 2023-01-06
#> ct_value
#> 1 <NA>
#> 2 22.3
#> 3 24.5
#> 4 24.8
#> 5 <NA>
#> 6 <NA>
# ages present in the original data but absent from the messy data
are_different <- setdiff(linelist$age, as.numeric(messy_linelist$age))
if (length(are_different) > 0) {
  # get their indices (%in% also handles more than one differing value)
  idx <- which(linelist$age %in% are_different)
  # display their replacement in the messy data
  messy_linelist$age[idx]
}
#> [1] NA
Created on 2025-02-12 with reprex v2.1.0
You could consider:
- writing some of the values in the `age` column in letters (the reverse action of {numberize}, as also suggested by @Degoot-AM)
- adding a prefix and/or suffix to some values in the `id` column. That would be useful to test the corresponding functionality in {cleanepi}.
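The prefix/suffix idea from the list above could look something like this in base R. `add_id_affixes()` and its arguments are hypothetical names used only for illustration; nothing here is taken from {simulist} or {cleanepi}.

```r
# Hypothetical helper (not part of simulist or cleanepi): wrap a random
# subset of IDs in a prefix and/or suffix, returning a character vector.
add_id_affixes <- function(id, prop = 0.5, prefix = "ID-", suffix = "") {
  stopifnot(prop >= 0, prop <= 1)
  id <- as.character(id)
  n_change <- round(length(id) * prop)
  if (n_change == 0) {
    return(id) # nothing to modify
  }
  idx <- sample.int(length(id), n_change)
  id[idx] <- paste0(prefix, id[idx], suffix)
  id
}

set.seed(1)
# roughly half of the IDs gain the "PAT-" prefix, the rest stay as digits
add_id_affixes(c(1, 2, 3, 5, 6, 7), prop = 0.5, prefix = "PAT-")
```

Because only a proportion of the values is modified, the resulting column mixes bare and affixed IDs, which is the kind of inconsistency a cleaning function would need to detect.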
The `messy_linelist()` function has changed since the reprex in your comment. The current code changes all ages (`integers`) to words.
set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
head(linelist)
#> id case_name case_type sex age date_onset date_reporting date_admission
#> 1 1 James Manis suspected m 59 2023-01-01 2023-01-01 2023-01-09
#> 2 2 Anisa Hatcher confirmed f 90 2023-01-01 2023-01-01 <NA>
#> 3 3 Morgan Bohn confirmed f 4 2023-01-02 2023-01-02 <NA>
#> 4 5 David Welter confirmed m 29 2023-01-04 2023-01-04 <NA>
#> 5 6 Sade Phillips suspected f 14 2023-01-05 2023-01-05 2023-01-09
#> 6 7 Sameeha al-Zaki probable f 85 2023-01-06 2023-01-06 2023-01-08
#> outcome date_outcome date_first_contact date_last_contact ct_value
#> 1 died 2023-01-13 <NA> <NA> NA
#> 2 recovered <NA> 2022-12-31 2023-01-05 22.3
#> 3 recovered <NA> 2022-12-30 2023-01-01 24.5
#> 4 recovered <NA> 2023-01-05 2023-01-05 24.8
#> 5 died 2023-01-23 2023-01-07 2023-01-08 NA
#> 6 recovered <NA> 2023-01-03 2023-01-06 NA
# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
head(messy_linelist)
#> id case_name case_type sex age date_onset date_reporting
#> 1 one James Manis suspected m <NA> 2023-01-01 2023-01-01
#> 2 two Anisa Hatcher confirmed f ninety <NA> 2023-01-01
#> 3 three Morgan Bohn cenfirmed f four 2023-01-02 2023-01-02
#> 4 five Datid Welter confirmed M twenty-nine 2023-01-04 2023-01-04
#> 5 six Sade Phillips suspected Female fourteen 2023-01-05 2023-01-05
#> 6 seven Sameeha al-Zaki probable female eighty-five 2023-01-06 2023-01-06
#> date_admission outcome date_outcome date_first_contact date_last_contact
#> 1 <NA> died 2023-01-13 <NA> <NA>
#> 2 <NA> recovered <NA> 2022-12-31 2023-01-05
#> 3 <NA> recovered <NA> 2022-12-30 2023-01-01
#> 4 <NA> recovered <NA> 2023-01-05 <NA>
#> 5 2023-01-09 died 2023-01-23 2023-01-07 2023-01-08
#> 6 2023-01-08 recovered <NA> 2023-01-03 2023-01-06
#> ct_value
#> 1 <NA>
#> 2 22.3
#> 3 24.5
#> 4 24.8
#> 5 <NA>
#> 6 <NA>
as.numeric(messy_linelist$age)
#> Warning: NAs introduced by coercion
#> [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [151] NA NA NA NA NA NA NA NA
Created on 2025-02-17 with reprex v2.1.1
Please let me know if this resolves your point or whether there are other changes required. Thanks.
#' mistakes and inconsistencies, as well as coerce date types.
#'
#' @param linelist Line list `<data.frame>` output from [sim_linelist()].
#' @inheritParams create_config
I suggest updating the function documentation by adding the following section, so that the accepted arguments are clearly listed in the corresponding section.
#' @param ... Other arguments that can be used to modify the default behaviour
#' of the function. Accepted arguments are:
#'
#' \describe{
#' \item{`prop_missing`}{A numeric between 0 and 1 used to ... Default is
#' `0.1`}
#' \item{`missing_value`}{A numeric or character used to represent the added
#' missing values. Default is `NA`.}
#' \item{`prop_spelling_mistakes`}{A numeric between 0 and 1 used to specify
#' the proportion of spelling mistakes in the messy data. Default is `0.1`.}
#'   \item{`inconsistent_sex`}{A boolean used to specify whether the values in
#'   the sex column should be inconsistent (`TRUE`) or not. Default is `TRUE`.}
#' \item{`sex_as_numeric`}{A boolean used to specify whether the values in the
#' sex column should be of type numeric or not. Default is `FALSE`.}
#' \item{`numeric_as_char`}{A boolean used to specify whether numeric columns
#' should be converted into character or not. Default is `TRUE`.}
#' \item{`date_as_char`}{A boolean used to specify whether Date columns should
#' be converted into character or not. Default is `TRUE`.}
#' \item{`inconsistent_dates`}{A boolean used to specify whether the values in
#' columns of type Date should be inconsistent or not. Default is `FALSE`.}
#'   \item{`int_as_word`}{A boolean used to specify whether to convert
#'   integers into words or not. Default is `TRUE`.}
#' }
#'
That way, you can remove lines 24 to 34 in the current function documentation.
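For readers unfamiliar with the pattern, one common way to let `...` override a set of documented defaults is sketched below. `make_config()` is a hypothetical stand-in: the argument names and default values are copied from the list above, but the merging and validation logic is an assumption and may differ from what `create_config()` actually does.

```r
# Hypothetical stand-in for a config builder. Names and defaults follow the
# review comment; the merging logic is an assumption for illustration.
make_config <- function(...) {
  defaults <- list(
    prop_missing = 0.1,
    missing_value = NA,
    prop_spelling_mistakes = 0.1,
    inconsistent_sex = TRUE,
    sex_as_numeric = FALSE,
    numeric_as_char = TRUE,
    date_as_char = TRUE,
    inconsistent_dates = FALSE,
    int_as_word = TRUE
  )
  overrides <- list(...)
  unknown <- setdiff(names(overrides), names(defaults))
  # reject anything that is unnamed or not one of the documented arguments
  if (length(overrides) > 0 &&
    (is.null(names(overrides)) || length(unknown) > 0)) {
    stop("Unknown or unnamed arguments: ", toString(unknown), call. = FALSE)
  }
  utils::modifyList(defaults, overrides, keep.null = TRUE)
}

cfg <- make_config(prop_missing = 0.3, int_as_word = FALSE)
cfg$prop_missing
cfg$date_as_char # untouched defaults are kept
```

Rejecting unknown names early means a typo in an argument name fails loudly instead of silently falling back to the default.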
Thanks for suggesting this! This is a really nice way of formatting the documentation that I'd overlooked.
However, `\describe{}` doesn't seem to work in `@param`, so instead I've put it in `@details` (see f311bc6). If you know of a way to put it in `@param`, please let me know as I think it would be better placed there.
Co-authored-by: Karim-Mane <karimanee@outlook.com>
The lintr is also flagging warnings on the |
This PR adds the `messy_linelist()` function to the package (closes #183). This function takes the output of `sim_linelist()` or the first list element of the output of `sim_outbreak()` and converts the clean line list data into messy line list data.
A new internal function, `.spelling_mistake()`, is added, which is called by `messy_linelist()`.
Unit tests are added for the `messy_linelist()` function.
The {english} package is added as an Imported dependency in the `DESCRIPTION`. This is called by `messy_linelist()` when `int_as_word = TRUE` (which it is by default) to convert integers into words.
The `README` is updated to include:
The `design-principles.Rmd` vignette is updated, adding a bullet point on the function naming convention for exported post-processing functions (renaming of `truncation()` required, #186), and adding {english} to the list of hard package dependencies.
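As a rough idea of what a single-letter spelling mistake (such as "cenfirmed" in the reprex output earlier in this thread) might involve, here is a minimal base R sketch. `spelling_mistake_sketch()` is a hypothetical function written for illustration; the internal `.spelling_mistake()` added in this PR may work differently.

```r
# Hypothetical sketch; the real .spelling_mistake() may work differently.
# Replaces one randomly chosen character of a string with a different letter.
spelling_mistake_sketch <- function(x) {
  stopifnot(is.character(x), length(x) == 1, nchar(x) > 0)
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  pos <- sample.int(length(chars), 1)
  # draw a replacement that differs from the current letter
  chars[pos] <- sample(setdiff(letters, tolower(chars[pos])), 1)
  paste(chars, collapse = "")
}

set.seed(1)
spelling_mistake_sketch("confirmed")
```

Excluding the current letter from the candidates guarantees that the output always differs from the input in exactly one position.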