Predicting Used Car Values

Overview

This dataset is a heavily obfuscated (due to binning and pseudonym usage for virtually all variables) customer history dataset in which vehicle appraisals and purchases are paired by customer provided by CarMax for their 2023 analytics showcase (this modeling was performed after/independently from the showcase competition and seeks to answer a different question involving predicting used car prices). Despite the obfuscation these are real data and they have a substantial amount of mising data (~1/3 of rows missing at least one variable).

I performed a limited form of iterative multiple imputation of multiple variables by seeding missing values with categoricaly (typically vehicle model) modes/means and then performing a final iterative pass using regression based methods. Based on test/train splits these methods typically have high performance (i.e. R^2 between .7-.9) for each variable.

Next, I optimize two decision tree regressors (one for appraisals one for purchases) by grid search. After exploring the impact of each parameter on the individual decision tree, I use these results to perform grid searches for random forest regressors. In the end, the random forest models prevail by a slight margin.

Performance

Overall, after rounding (for ordinal outputs) and combing appraisal and purchase records, R^2 is 0.91, with a concordance index of 0.95, meaning the model predicts true car values with a high degree of accuracy (and when it is wrong it usually is in the correct overall direction relative to the average). Perhaps an even better and (in this case stricter) measure of performance, Kendall's Tau (for ordinal correlation) still produces a value of 0.89 with a p-value of approx 0 (beyond floating point precision).

Notebooks

There are 4 parts, Pat 1 is just EDA, Part 2 is imputation, Part 2 is exploring the results of the decision tree grid search (model_optimizer.py generates the outputs, see python model_optimizer.py --help for more info), part 3 is exploring the test performance of the decision tree and random forest models.

Data / Model Availability

Raw and all intermediate data (including fitted models) are available on Google Drive.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
images		images
utils		utils
.gitignore		.gitignore
Pt1_EDA.ipynb		Pt1_EDA.ipynb
Pt2_Imputation.ipynb		Pt2_Imputation.ipynb
Pt3_Decision_Tree_Evaluation.ipynb		Pt3_Decision_Tree_Evaluation.ipynb
Pt4_Final_Model_Performance.ipynb		Pt4_Final_Model_Performance.ipynb
README.md		README.md
model_optimizer.py		model_optimizer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Predicting Used Car Values

Overview

Performance

Notebooks

Data / Model Availability

About

Releases

Packages

Languages

nkuehnle/kmax_car_value_regression

Folders and files

Latest commit

History

Repository files navigation

Predicting Used Car Values

Overview

Performance

Notebooks

Data / Model Availability

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages