add Missing values to specific columns #191

Karim-Mane · 2025-02-19T14:33:25Z

Is your feature request related to a problem? Please describe.
This is a follow-up on the introduction of 10% missing data.

Thanks @joshwlambert for the clarification about the usage of NA in some columns. As you mentioned in the discussion of PR #187, it boils down to representing unavailable data in cases where the data is not collected (e.g. Ct values for non-confirmed cases).

I have run the sim_linelist() function several times to see which columns it introduces NA into. It came out to be five columns; see the outcome below.

final_linelist <- NULL
# simulate linelist 100 times
for (i in seq_len(100)) {
    linelist <- suppressWarnings(simulist::sim_linelist())
    x <- matrix(colSums(is.na(linelist)), ncol=ncol(linelist))
    final_linelist <- rbind(final_linelist, x)
}
colnames(final_linelist) <- names(linelist)

# detect columns with NAs
sum_na <- colSums(final_linelist)
names(sum_na) <- colnames(final_linelist)
sum_na[sum_na > 0]
#>     date_admission       date_outcome date_first_contact  date_last_contact 
#>              62589              67113                100                100 
#>           ct_value 
#>              38826

Created on 2025-02-19 with reprex v2.1.0

Describe the solution you'd like
I suggest that the 10% NA be introduced only in the remaining columns, i.e. in columns other than these five.


joshwlambert · 2025-02-21T11:51:41Z

Thanks for the informative description. I have addressed this request in PR #199. I've added a new internal .add_missing() function that adds what you've requested.

If the missing_value is NA, then the newly inserted missing values are not sampled from the existing NA elements in the <data.frame>, which avoids overwriting missing values. If the missing_value is changed by the user, for example to "N/A", then the .add_missing() function samples from all <data.frame> elements.
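To make the sampling rule above concrete, here is a minimal sketch in R. It is not the simulist source for .add_missing(): the function name pick_positions(), the prop argument and the rule that the proportion applies to the candidate cells are all assumptions made for illustration.

```r
# Hypothetical sketch; NOT the simulist implementation of .add_missing().
# Returns a two-column matrix (row, column) of cells chosen to become missing.
pick_positions <- function(df, prop = 0.1, missing_value = NA) {
  stopifnot(is.data.frame(df), prop >= 0, prop <= 1)
  na_mask <- is.na(df) # logical matrix, one cell per <data.frame> element
  if (is.na(missing_value)) {
    # default: only cells that are not already NA can be picked
    candidates <- which(!na_mask)
  } else {
    # custom missing_value (e.g. "N/A"): every cell can be picked
    candidates <- seq_along(na_mask)
  }
  # assumption: prop is applied to the number of candidate cells
  n_pick <- round(prop * length(candidates))
  picked <- candidates[sample.int(length(candidates), n_pick)]
  arrayInd(picked, .dim = dim(na_mask))
}

# toy data: 30 cells, 6 of which are already NA
toy <- data.frame(
  id = 1:10,
  ct_value = c(NA, 25.1, NA, 30.2, NA, 27.8, NA, NA, 22.4, NA),
  sex = rep(c("m", "f"), 5)
)
pos_na <- pick_positions(toy, prop = 0.1)                           # 24 candidates
pos_custom <- pick_positions(toy, prop = 0.1, missing_value = "N/A") # 30 candidates
```

With the default, none of the rows of pos_na can point at a cell that is already NA, so existing missing values are never "used up" by the random draw.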

.add_missing() also handles type coercion explicitly, to avoid unwanted type coercions when the user specifies a custom missing_value.
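As an illustration of why explicit coercion matters, here is a small sketch, assuming the custom missing_value is a character string such as "N/A". The helper insert_missing() is hypothetical and may differ from what .add_missing() actually does; it only shows one way to keep the conversion deliberate rather than leaving it to R's assignment rules.

```r
# Hypothetical helper; an assumption about what explicit coercion could look like,
# not the simulist implementation.
insert_missing <- function(x, rows, missing_value = NA) {
  if (!is.na(missing_value) && !is.character(x)) {
    # custom string marker: convert the whole column to character up front,
    # so dates keep their printed form instead of failing or being mangled
    x <- as.character(x)
  }
  # for missing_value = NA the column keeps its original type
  x[rows] <- missing_value
  x
}

dates <- as.Date("2025-01-01") + 0:2
out_custom <- insert_missing(dates, 2, missing_value = "N/A") # character vector
out_na <- insert_missing(dates, 2)                            # still a Date vector
ct <- insert_missing(c(25.1, 30.2), 1, missing_value = "N/A") # character vector
```

The design choice in this sketch is that a column changes type only when the user asks for a non-NA marker; with the default NA, numeric and date columns stay numeric and date.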

> I suggest the 10% NA to be introduced in the remaining columns i.e. in columns other than these five columns.

The approach taken in .add_missing() still allows missing values to be introduced into <data.frame> columns that already contain NAs when missing_value = NA (the default). This retains the feature of random missingness without overwriting existing NA values.

joshwlambert added a commit that referenced this issue Feb 20, 2025

introduces missing values for those not NA, closes #191

0ed3247

joshwlambert mentioned this issue Feb 21, 2025

Enhance messy_linelist() #199

Merged

joshwlambert closed this as completed in #199 Feb 21, 2025

joshwlambert closed this as completed in 3bad0cf Feb 21, 2025
