add Missing values to specific columns #191

Closed
Karim-Mane opened this issue Feb 19, 2025 · 1 comment · Fixed by #199

Comments

@Karim-Mane
Member

Is your feature request related to a problem? Please describe.
This is a follow-up to the introduction of 10% missing data.

Thanks @joshwlambert for the clarification about the usage of NA in some columns. As you mentioned in the discussion of PR #187, it boils down to representing unavailable data for cases where it is not collected (e.g. Ct values for non-confirmed cases).

I ran the function several times to see which columns returned by sim_linelist() contain NA. It turned out to be five columns; see the output below.

final_linelist <- NULL
# simulate linelist 100 times
for (i in seq_len(100)) {
    linelist <- suppressWarnings(simulist::sim_linelist())
    x <- matrix(colSums(is.na(linelist)), ncol=ncol(linelist))
    final_linelist <- rbind(final_linelist, x)
}
colnames(final_linelist) <- names(linelist)

# detect columns with NAs
sum_na <- colSums(final_linelist)
names(sum_na) <- colnames(final_linelist)
sum_na[sum_na > 0]
#>     date_admission       date_outcome date_first_contact  date_last_contact 
#>              62589              67113                100                100 
#>           ct_value 
#>              38826

Created on 2025-02-19 with reprex v2.1.0

Describe the solution you'd like
I suggest the 10% NA to be introduced in the remaining columns i.e. in columns other than these five columns.

@joshwlambert
Member

Thanks for the informative description. I have addressed this request in PR #199. I've added a new internal .add_missing() function that adds what you've requested.

If missing_value is NA, the cells chosen for the newly inserted missing values are not sampled from the existing NA elements in the <data.frame>, which avoids overwriting values that are already missing. If the user changes missing_value, for example to "N/A", then .add_missing() samples from all <data.frame> elements.
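To illustrate the sampling rule described above, here is a minimal sketch. It is not the simulist implementation of .add_missing(): the function name add_missing_sketch() and its prop argument are hypothetical, and it does no type handling beyond what base R does on assignment.

```r
# Hypothetical sketch of the sampling rule; NOT the simulist internal.
add_missing_sketch <- function(df, prop = 0.1, missing_value = NA) {
  stopifnot(is.data.frame(df), prop >= 0, prop <= 1)
  na_mask <- is.na(df)               # logical matrix, same shape as df
  n_total <- length(na_mask)
  # Skip cells that are already NA only when the marker itself is NA.
  candidates <- if (is.na(missing_value)) which(!na_mask) else seq_len(n_total)
  n_new <- min(round(prop * n_total), length(candidates))
  picked <- candidates[sample.int(length(candidates), n_new)]
  # Convert column-major linear indices back to row and column positions.
  rows <- (picked - 1L) %% nrow(df) + 1L
  cols <- (picked - 1L) %/% nrow(df) + 1L
  for (j in unique(cols)) {
    df[rows[cols == j], j] <- missing_value
  }
  df
}

set.seed(1)
df <- data.frame(
  id = 1:20,
  age = c(NA, 21:39),                # one pre-existing NA
  sex = rep(c("m", "f"), 10)
)
out <- add_missing_sketch(df, prop = 0.1)
# Number of newly introduced NAs (10% of 60 cells)
sum(is.na(out)) - sum(is.na(df))
```

Because the candidate cells exclude existing NAs, the count of new NAs equals the requested proportion of all cells, and the pre-existing NA in age is left in place.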

.add_missing() also performs explicit type coercion, so that columns are not coerced in unwanted ways when the user specifies a custom missing_value.
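As a rough illustration of why coercion matters: in base R, assigning a string into a numeric column silently promotes the whole column to character, and assigning an unparseable string into a Date column tends to fail. The sketch below converts the column to character explicitly before inserting the marker. The helper insert_marker() is hypothetical and is not how .add_missing() is written.

```r
df <- data.frame(
  date_onset = as.Date(c("2025-01-01", "2025-01-02", "2025-01-03")),
  ct_value = c(24.1, 25.3, 26.8)
)

# Hypothetical helper: make the coercion explicit, then insert the marker.
insert_marker <- function(df, rows, col, marker = "N/A") {
  df[[col]] <- as.character(df[[col]])  # dates become ISO strings, numbers become text
  df[[col]][rows] <- marker
  df
}

out <- insert_marker(df, rows = 2, col = "date_onset")
out <- insert_marker(out, rows = 1, col = "ct_value")
str(out)
```

Doing the conversion up front keeps the result predictable: every column that receives a custom marker ends up as character, rather than depending on each class's assignment method.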

> I suggest the 10% NA to be introduced in the remaining columns i.e. in columns other than these five columns.

The approach taken in .add_missing() still allows missing values to be introduced into <data.frame> columns that already contain NAs when missing_value = NA (the default). This nicely retains the feature of random missingness without overwriting existing NA values.
