Add `messy_linelist()` function #187

joshwlambert · 2025-02-10T14:10:17Z

This PR adds the messy_linelist() function to the package (closes #183). This function takes the output of sim_linelist() or the first list element of the output of sim_outbreak() and converts the clean line list data into messy line list data.

A new internal function, .spelling_mistake(), is added and is called by messy_linelist().

Unit tests are added for the messy_linelist() function.

The {english} package is added as an Imported dependency in the DESCRIPTION. This is called by messy_linelist() when int_as_word = TRUE (which it is by default) to convert integers into words.

The README is updated to include:

{cleanepi} in the Complimentary R packages section
{messy} in the Related projects section

The design-principles.Rmd vignette is updated, adding a bullet point on the function naming convention for exported post-processing functions (renaming of truncation() required, #186), and adding {english} to the list of hard package dependencies.

github-actions · 2025-02-10T14:11:55Z

This pull request:

Adds 1 new dependencies (direct and indirect)
Adds 1 new system dependencies
Removes 0 existing dependencies (direct and indirect)
Removes 0 existing system dependencies

Reach out on slack (#code-review or #help channels) to double check if there are base R alternatives to the new dependencies.

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)


Karim-Mane · 2025-02-11T17:04:07Z

R/messy_linelist.R

+#'
+#' @examples
+#' linelist <- sim_linelist()
+#' messy_linelist <- messy_linelist(linelist)


prop_missing is 0.1 by default. Does this mean that 10% of the cells in the data frame will be set to NA?

The simulated data already contains a certain number of missing values, so the 10% adds more missing values to the simulated messy data.

set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
num_na_before <- sum(is.na(unlist(linelist)))
num_na_before
#> [1] 333

# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
num_na_after <- sum(is.na(unlist(messy_linelist)))
num_na_after
#> [1] 505

# expected number of introduced random missingness
num_missing <- round(prod(dim(messy_linelist)) * 0.1)
num_missing
#> [1] 205

Created on 2025-02-11 with reprex v2.1.0

I am wondering about the definition of this parameter and what its effect should be. A suggestion could be:

Could it be the minimum percentage of missing values in the messy data after simulation? In this case, the function would only add more missing values if the percentage of missing values in the initial dataset is below this threshold.
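One way to read this suggestion is sketched below, assuming prop_missing is treated as a floor on the overall proportion of NA cells. `top_up_missing()` and the toy data frame are hypothetical and written only for illustration; they are not part of simulist.

```r
# Hypothetical helper (not simulist behaviour): treat prop_missing as a
# minimum and only add NAs until that proportion of cells is reached.
top_up_missing <- function(df, prop_missing = 0.1) {
  is_na <- is.na(df) # logical matrix, one entry per cell
  n_needed <- round(length(is_na) * prop_missing) - sum(is_na)
  if (n_needed <= 0) {
    # already at or above the requested proportion: leave the data as is
    return(df)
  }
  # only sample from cells that are not already NA
  candidates <- which(!is_na)
  mask <- matrix(FALSE, nrow = nrow(df), ncol = ncol(df))
  mask[candidates[sample.int(length(candidates), size = n_needed)]] <- TRUE
  df[mask] <- NA
  df
}

set.seed(1)
# toy line list: 10 of the 60 cells are NA by design
toy <- data.frame(
  id = 1:20,
  age = seq(5, 100, by = 5),
  ct_value = c(rep(NA, 10), seq(20, 29))
)
kept <- top_up_missing(toy, prop_missing = 0.1)    # already above 10%
topped <- top_up_missing(toy, prop_missing = 0.25) # topped up to 15 NAs
```

Under this reading, a line list that already has more NAs than the requested proportion is returned unchanged, which may or may not be the behaviour users expect.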

joshwlambert · 2025-02-17

Thanks for raising this, as it has been mentioned before and may be causing some confusion.

The line list output by sim_linelist() does not contain any missing values. However, it does contain NAs. NAs are used for recovery times, which are not supplied in the line list by default (see issue #36 for a discussion on this); for hospital admission times of cases that are not admitted to hospital; and for Ct values of cases that are not "confirmed", because we assume that only confirmed cases have a PCR that provides a Ct value. The line list is still regarded as complete.

messy_linelist() introduces 10% missing values by inserting NAs at random (possibly overwriting existing NAs) to give ~10% missingness. This 10% does not count cells that are supposed to be NA (e.g. cases that recover without the user specifying an onset-to-recovery delay distribution).

I would say the main question to resolve is whether there is a better way to represent cells in the line list <data.frame> that should be empty (e.g. Ct values for probable cases) without using NA, or a way to disambiguate NAs.
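To make the arithmetic in the reprex above concrete: 10% of the cells is 205, but the NA count rose by 505 - 333 = 172, which is consistent with some of the randomly chosen cells already being NA (other messy_linelist() steps may also play a part). The sketch below reproduces that effect on a toy data frame. `add_missing()` is a hypothetical helper written for this illustration, not the simulist implementation.

```r
# Hypothetical helper for illustration only (not the simulist code):
# set a random proportion of cells to NA, ignoring which are already NA.
add_missing <- function(df, prop_missing = 0.1) {
  n_cells <- prod(dim(df))
  n_missing <- round(n_cells * prop_missing)
  mask <- matrix(FALSE, nrow = nrow(df), ncol = ncol(df))
  mask[sample.int(n_cells, size = n_missing)] <- TRUE
  df[mask] <- NA
  df
}

set.seed(1)
# toy line list: half of the ct_value cells are NA by design
toy <- data.frame(
  id = 1:20,
  age = seq(5, 100, by = 5),
  ct_value = c(rep(NA, 10), seq(20, 29))
)
n_sampled <- round(prod(dim(toy)) * 0.1) # 6 cells are sampled
na_before <- sum(is.na(toy))             # 10 structural NAs
na_after <- sum(is.na(add_missing(toy)))
# the net increase cannot exceed the number of sampled cells, and is
# smaller whenever a sampled cell was already NA
na_after - na_before <= n_sampled
```

Because the structural NAs and the introduced NAs are indistinguishable afterwards, counting NAs before and after only gives a lower bound on how many cells were touched.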

Karim-Mane · 2025-02-12T09:45:23Z

R/messy_linelist.R

+#' * `numeric_as_char = TRUE`
+#' * `date_as_char = TRUE`
+#' * `inconsistent_dates = FALSE`
+#' * `int_as_word = TRUE`


This argument allows numeric columns to be converted to character. However, the values in those columns remain the same (except when an NA is introduced).

set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
head(linelist)
#>   id       case_name case_type sex age date_onset date_reporting date_admission
#> 1  1     James Manis suspected   m  59 2023-01-01     2023-01-01     2023-01-09
#> 2  2   Anisa Hatcher confirmed   f  90 2023-01-01     2023-01-01           <NA>
#> 3  3     Morgan Bohn confirmed   f   4 2023-01-02     2023-01-02           <NA>
#> 4  5    David Welter confirmed   m  29 2023-01-04     2023-01-04           <NA>
#> 5  6   Sade Phillips suspected   f  14 2023-01-05     2023-01-05     2023-01-09
#> 6  7 Sameeha al-Zaki  probable   f  85 2023-01-06     2023-01-06     2023-01-08
#>     outcome date_outcome date_first_contact date_last_contact ct_value
#> 1      died   2023-01-13               <NA>              <NA>       NA
#> 2 recovered         <NA>         2022-12-31        2023-01-05     22.3
#> 3 recovered         <NA>         2022-12-30        2023-01-01     24.5
#> 4 recovered         <NA>         2023-01-05        2023-01-05     24.8
#> 5      died   2023-01-23         2023-01-07        2023-01-08       NA
#> 6 recovered         <NA>         2023-01-03        2023-01-06       NA

# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
head(messy_linelist)
#>   id       case_name case_type    sex  age date_onset date_reporting
#> 1  1     James Manis suspected      m <NA> 2023-01-01     2023-01-01
#> 2  2   Anisa Hatcher confirmed      f   90       <NA>     2023-01-01
#> 3  3     Morgan Bohn cenfirmed      f    4 2023-01-02     2023-01-02
#> 4  5    Datid Welter confirmed      M   29 2023-01-04     2023-01-04
#> 5  6   Sade Phillips suspected Female   14 2023-01-05     2023-01-05
#> 6  7 Sameeha al-Zaki  probable female   85 2023-01-06     2023-01-06
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1           <NA>      died   2023-01-13               <NA>              <NA>
#> 2           <NA> recovered         <NA>         2022-12-31        2023-01-05
#> 3           <NA> recovered         <NA>         2022-12-30        2023-01-01
#> 4           <NA> recovered         <NA>         2023-01-05              <NA>
#> 5     2023-01-09      died   2023-01-23         2023-01-07        2023-01-08
#> 6     2023-01-08 recovered         <NA>         2023-01-03        2023-01-06
#>   ct_value
#> 1     <NA>
#> 2     22.3
#> 3     24.5
#> 4     24.8
#> 5     <NA>
#> 6     <NA>

# the difference only happens
are_different <- setdiff(linelist$age, as.numeric(messy_linelist$age))
if (length(are_different) > 0) {
  # get their indices
  idx <- which(linelist$age == are_different)
  # display their replacement in the messy data
  messy_linelist$age[idx]
}
#> [1] NA

Created on 2025-02-12 with reprex v2.1.0

You could consider:

writing some of the values in the age column in words (the reverse of {numberize}, as also suggested by @Degoot-AM)

adding a prefix and/or suffix to some values in the id column. That would be useful for testing the corresponding functionality in {cleanepi}.
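The second idea could look something like the sketch below, which adds a prefix and suffix to a random subset of ids. `add_id_affix()`, its arguments, and the affix strings are all made up for illustration; none of them come from simulist or {cleanepi}.

```r
# Hypothetical helper, not part of simulist: add a prefix and suffix to a
# random subset of the values in an id column.
add_id_affix <- function(id, prop = 0.5, prefix = "ID_", suffix = "_x") {
  id <- as.character(id)
  n_change <- round(length(id) * prop)
  idx <- sample.int(length(id), size = n_change)
  id[idx] <- paste0(prefix, id[idx], suffix)
  id
}

set.seed(1)
ids <- c(1, 2, 3, 5, 6, 7)
messy_ids <- add_id_affix(ids, prop = 0.5)
# half of the ids now carry the affixes; the others are unchanged strings
n_affixed <- sum(grepl("^ID_.*_x$", messy_ids))
```

Note that the id column becomes character as a side effect, so a cleaning step would need both to strip the affixes and to restore the original type.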

joshwlambert · 2025-02-17

The messy_linelist() function has changed since the reprex in your comment. The current code changes all ages (integers) to words.

set.seed(1)
# simulate linelist
linelist <- simulist::sim_linelist()
head(linelist)
#>   id       case_name case_type sex age date_onset date_reporting date_admission
#> 1  1     James Manis suspected   m  59 2023-01-01     2023-01-01     2023-01-09
#> 2  2   Anisa Hatcher confirmed   f  90 2023-01-01     2023-01-01           <NA>
#> 3  3     Morgan Bohn confirmed   f   4 2023-01-02     2023-01-02           <NA>
#> 4  5    David Welter confirmed   m  29 2023-01-04     2023-01-04           <NA>
#> 5  6   Sade Phillips suspected   f  14 2023-01-05     2023-01-05     2023-01-09
#> 6  7 Sameeha al-Zaki  probable   f  85 2023-01-06     2023-01-06     2023-01-08
#>     outcome date_outcome date_first_contact date_last_contact ct_value
#> 1      died   2023-01-13               <NA>              <NA>       NA
#> 2 recovered         <NA>         2022-12-31        2023-01-05     22.3
#> 3 recovered         <NA>         2022-12-30        2023-01-01     24.5
#> 4 recovered         <NA>         2023-01-05        2023-01-05     24.8
#> 5      died   2023-01-23         2023-01-07        2023-01-08       NA
#> 6 recovered         <NA>         2023-01-03        2023-01-06       NA

# add noise to simulated linelist
messy_linelist <- simulist::messy_linelist(linelist)
head(messy_linelist)
#>      id       case_name case_type    sex         age date_onset date_reporting
#> 1   one     James Manis suspected      m        <NA> 2023-01-01     2023-01-01
#> 2   two   Anisa Hatcher confirmed      f      ninety       <NA>     2023-01-01
#> 3 three     Morgan Bohn cenfirmed      f        four 2023-01-02     2023-01-02
#> 4  five    Datid Welter confirmed      M twenty-nine 2023-01-04     2023-01-04
#> 5   six   Sade Phillips suspected Female    fourteen 2023-01-05     2023-01-05
#> 6 seven Sameeha al-Zaki  probable female eighty-five 2023-01-06     2023-01-06
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1           <NA>      died   2023-01-13               <NA>              <NA>
#> 2           <NA> recovered         <NA>         2022-12-31        2023-01-05
#> 3           <NA> recovered         <NA>         2022-12-30        2023-01-01
#> 4           <NA> recovered         <NA>         2023-01-05              <NA>
#> 5     2023-01-09      died   2023-01-23         2023-01-07        2023-01-08
#> 6     2023-01-08 recovered         <NA>         2023-01-03        2023-01-06
#>   ct_value
#> 1     <NA>
#> 2     22.3
#> 3     24.5
#> 4     24.8
#> 5     <NA>
#> 6     <NA>

as.numeric(messy_linelist$age)
#> Warning: NAs introduced by coercion
#>   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#>  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#>  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#>  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [151] NA NA NA NA NA NA NA NA

Created on 2025-02-17 with reprex v2.1.1

Please let me know if this resolves your point or whether there are other changes required. Thanks.

Karim-Mane · 2025-02-12T14:10:40Z

R/messy_linelist.R

+#' mistakes and inconsistencies, as well as coerce date types.
+#'
+#' @param linelist Line list `<data.frame>` output from [sim_linelist()].
+#' @inheritParams create_config


I suggest updating the function documentation by adding the following section, so that the function arguments are clearly described in the corresponding section.

#' @param ... Other arguments that can be used to modify the default behaviour
#' of the function. Accepted arguments are:
#'
#' \describe{
#'   \item{`prop_missing`}{A numeric between 0 and 1 used to ... Default is
#'   `0.1`}
#'   \item{`missing_value`}{A numeric or character used to represent the added
#'   missing values. Default is `NA`.}
#'   \item{`prop_spelling_mistakes`}{A numeric between 0 and 1 used to specify
#'   the proportion of spelling mistakes in the messy data. Default is `0.1`.}
#'   \item{`inconsistent_sex`}{A boolean used to specify whether the values in
#'   the sex column should be consistent (TRUE) or not. Default is `TRUE`.}
#'   \item{`sex_as_numeric`}{A boolean used to specify whether the values in the
#'   sex column should be of type numeric or not. Default is `FALSE`.}
#'   \item{`numeric_as_char`}{A boolean used to specify whether numeric columns
#'   should be converted into character or not. Default is `TRUE`.}
#'   \item{`date_as_char`}{A boolean used to specify whether Date columns should
#'   be converted into character or not. Default is `TRUE`.}
#'   \item{`inconsistent_dates`}{A boolean used to specify whether the values in
#'   columns of type Date should be inconsistent or not. Default is `FALSE`.}
#'   \item{`int_as_word`}{A boolean used to specify whether to convert the
#'   numeric columns into character or not. Default is `TRUE`.}
#' }
#'

That way, you can remove lines 24 to 34 in the current function documentation.

joshwlambert · 2025-02-17

Thanks for suggesting this! This is a really nice way of formatting the documentation that I'd overlooked.

However, \describe{} doesn't seem to work in @param, so instead I've put it in @details (see f311bc6). If you know of a way to put it in @param please let me know, as I think it would be better placed there.
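As a rough illustration of the layout described in this reply, the sketch below places a \describe{} block under @details and points to it from the `...` parameter. The function name `messy_stub()` is a placeholder and its body does nothing; the two item descriptions are copied from the review suggestion, and the rest of the roxygen text is assumed rather than taken from f311bc6.

```r
#' Placeholder function used only to show the documentation layout
#'
#' @param linelist Line list `<data.frame>`.
#' @param ... Named arguments that modify the default behaviour. See
#'   **Details** for the accepted arguments.
#'
#' @details
#' Accepted arguments are:
#'
#' \describe{
#'   \item{`missing_value`}{A numeric or character used to represent the
#'   added missing values. Default is `NA`.}
#'   \item{`prop_spelling_mistakes`}{A numeric between 0 and 1 used to
#'   specify the proportion of spelling mistakes in the messy data. Default
#'   is `0.1`.}
#' }
messy_stub <- function(linelist, ...) {
  # no-op body: the stub exists only so the roxygen block has a target
  linelist
}
```

The trade-off is that readers of the help page have to look under Details rather than Arguments to find the accepted names.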

joshwlambert · 2025-02-18T18:29:27Z

The lintr is also flagging warnings on the main branch so I'm going to merge this PR and then fix the linting issues in a separate PR.

joshwlambert added 30 commits February 10, 2025 10:33

add messy function

7f689f2

update NAMESPACE with messy function

51f3373

add unit test for messy

68005d6

add .spelling_mistake function

1c2ac7e

add spelling mistakes to messy function

38fb7b8

add unit tests for messy() spelling mistakes

6d63de5

add inconsistent_sex option to messy()

bad14ea

add unit tests for messy() inconsistent_sex feature

404ad59

add sex_as_numeric to messy()

a9c8a49

add unit tests for messy() sex_as_numeric = TRUE

e69fd83

add numeric_as_char to messy()

56ed45b

add units tests for numeric_as_char in messy()

c2156e2

add date_as_char to messy()

a10d4bd

add unit tests for date_as_char in messy()

97fa3dd

fix unit test expectations for messy()

2215677

add inconsistent_dates to messy()

87f010b

add unit tests for inconsistent_date in messy()

256e692

add @description doc to messy()

0df300d

linting messy()

5e1d07b

fix unit test for inconsistent_dates in messy()

ec657d9

fix object name bug in messy(), relates #183

2ed8609

add int_as_word to messy(), relates #183

4370eef

add {english} to Imports in DESCRIPTION

9df1579

add unit test for int_as_words in messy()

0534bcd

correct error message for incorrect arg in messy()

414dc45

add unit test for incorrect arg passed to messy()

b0783c6

rename messy() to messy_linelist(), relates #183

6a5d94f

call .check_linelist in messy_linelist()

c58dac6

add unit test for incorrect linelist in messy_linelist()

ad39526

add bullet point to design principles vignette on naming of exported post-processing functions

447218a

joshwlambert added the enhancement (New feature or request) label Feb 10, 2025

joshwlambert added 3 commits February 12, 2025 14:11

move int_as_word before numeric_as_char and as.data.frame and drop = FALSE used in numeric_as_char

4e38fb7

update numeric_as_char messy_linelist test to turn off int_as_word

2e767fe

update WORDLIST

a13935a

add stats namespace to runif() in messy_linelist()

c9bb816

handle tibble and other data.frame subclasses in messy_linelist()

37ce36c

Karim-Mane reviewed Feb 12, 2025

joshwlambert and others added 2 commits February 17, 2025 15:02

use && for scalar comparison in messy_linelist

291ebef

use describe{} in messy_linelist doc

f311bc6

Co-authored-by: Karim-Mane <karimanee@outlook.com>

joshwlambert added 2 commits February 17, 2025 17:27

tidy data.frame subclass handling into internal functions

5e584bc

add unit test for data.frame subsclass handling

906d393

joshwlambert added 2 commits February 18, 2025 18:18

add prop_duplicate_row to messy_linelist, relates #183

6d867f8

add unit test for prop_duplicate_row in messy_linelist() and update other unit tests

fc58ccc

fix typo in messy_linelist documentation

cac6f9d

joshwlambert merged commit f5e8bf5 into main Feb 18, 2025
9 of 10 checks passed

joshwlambert deleted the messy branch February 18, 2025 18:31

This was referenced Feb 19, 2025

add Missing values to specific columns #191

Closed

A different use of the numeric_as_char argument in messy_linelist() function #192

Closed

joshwlambert mentioned this pull request Feb 19, 2025

Rename truncation() to truncate_linelist() #193

Merged

Karim-Mane mentioned this pull request Feb 19, 2025

usage of \describe{} #194

Closed

joshwlambert mentioned this pull request Feb 19, 2025

Use \describe{} in messy_linelist ... documentation #196

Merged
