Impute missing values while minimizing distortion of overall variable distributions by:
- Using available columns per row to create a bagged model.
- Applying that model to the non-NA rows to find the distribution of its residuals.
- Adding variation to the model's predictions by adding a randomly drawn residual to each one.
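The three steps above can be sketched for a single numeric column as follows. This is only an illustration under stated assumptions, not the imputer's actual code: the helper name `impute_with_residuals` is hypothetical, missing values are represented as `None`, and a one-predictor least-squares fit stands in for the bagged model.

```python
import random


def impute_with_residuals(x, y, seed=0):
    """Fill None entries of y from x, adding a randomly drawn residual.

    Hypothetical helper: a one-predictor least-squares line stands in
    for the bagged model described above.
    """
    rng = random.Random(seed)

    # Step 1: fit the model on rows where the target is observed.
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Step 2: apply the model to the non-NA rows to collect its residuals.
    residuals = [yi - (intercept + slope * xi) for xi, yi in observed]

    # Step 3: for each NA row, use the prediction plus one random residual.
    return [
        yi if yi is not None else intercept + slope * xi + rng.choice(residuals)
        for xi, yi in zip(x, y)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, None, 4.2, None, 5.8]
filled = impute_with_residuals(x, y)
```

Drawing from the observed residuals, rather than returning the bare prediction, keeps the imputed values from clustering on the fitted line, which is what would otherwise shrink the variable's variance.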
As designed, this imputer takes a dataframe whose categorical variables are encoded as strings and imputes values for all NAs. It starts with the columns that have the fewest NAs, then uses each newly NA-free column in the imputations that follow.
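The column ordering described here might look like the sketch below. It assumes a plain dict of lists in place of a dataframe, with `None` marking an NA; the function name `imputation_order` and the sample columns are made up for illustration, and the model fitting itself is left out.

```python
def imputation_order(table):
    """Return the names of columns containing NAs, fewest NAs first."""
    na_counts = {
        name: sum(value is None for value in column)
        for name, column in table.items()
    }
    incomplete = [name for name, count in na_counts.items() if count > 0]
    return sorted(incomplete, key=na_counts.get)


# Hypothetical data: strings for the categorical column, None for NAs.
table = {
    "age": [34, None, 51, None, None],
    "city": ["Oslo", "Lima", None, "Pune", "Oslo"],
    "income": [52.0, 48.5, 61.0, 39.9, 45.2],
}

order = imputation_order(table)

# Columns that are already complete are usable from the start.
usable = [name for name in table if name not in order]

plan = []
for target in order:
    # A real imputer would fit a model on `usable` and fill `target` here.
    plan.append((target, list(usable)))
    # Once filled, the column becomes available to later imputations.
    usable.append(target)
```

In this example `city` is imputed first using only `income`, and `age` is imputed second using both `income` and the newly filled `city`.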
The regression estimator is linear regression, and the classifier is a random forest.
This imputer is an implementation of a technique described in the following paper:
Joseph L. Schafer & Maren K. Olsen (1998) Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective, Multivariate Behavioral Research, 33:4, 545-571, DOI: 10.1207/s15327906mbr3304_5