Kaggle Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
Cookiecutter Data Science: This is the structure I am trying to use for my projects, based on a recommendation from a Senior Data Scientist at work.
The purpose of this project is to further develop my skills in data science. It serves as the starting point for a series of projects I want to complete to gain experience in this field. There are a few core things I want to take away from this project:
- Explore advanced regression techniques
- Research the theory
- Understand key assumptions
- Learn about the general processes
- Brush up on my data visualization & exploration skills
- Learn more about best practices for version control in data science
The goal of this project is to create a model that predicts the sale price of houses in Ames, Iowa; the dataset is an alternative to the common Boston Housing dataset. It was compiled by Dean De Cock, and his guidelines for this project are available in the References folder.
- Documentation for the different attributes in the dataset is also available in the References folder
Project Objective: Develop a model to predict sale prices for houses based on various numerical & categorical data
- Since this dataset provides a good opportunity to investigate the properties of regression models, that will be the focus
- In a real work environment, determining what type of technique to use beforehand is definitely not good practice
The training data consists of 1460 rows & 81 columns.
- Based on the documentation by Dean De Cock, there are 23 nominal, 23 ordinal, 14 discrete, and 20 continuous features
- There are 19 attributes with empty/NULL values; the attributes with the most missing values are Alley (Type of alley access to property), FireplaceQu (Fireplace quality), PoolQC (Pool quality), Fence (Fence quality), and MiscFeature (Miscellaneous feature not covered in other categories). A quick null-count check that reproduces these numbers is sketched below.
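The row/column and missing-value counts above can be reproduced with a quick null check on the training data. This is a minimal sketch; the file path is an assumption based on the standard Kaggle download.

```python
import pandas as pd

train = pd.read_csv("data/raw/train.csv")  # path is an assumption

print(train.shape)  # expected: (1460, 81)

# Count missing values per column and keep only the columns that have any
null_counts = train.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)       # columns with missing values, most missing first
print(len(null_counts))  # should be 19
```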
The variance inflation factor (VIF) was used to check for multicollinearity among the independent variables in this dataset; a minimal sketch of the check follows the list below. Based on this analysis, a few variables were dropped:
- BsmtFinSF1, BsmtFinSF2, and BsmtUnfSF were dropped and only the TotalBsmtSF variable was retained
- 1stFlrSF and 2ndFlrSF were dropped because they were highly correlated with GrLivArea
- GrLivArea still had a high VIF score of 8.2. Based on the heatmap, this could be caused by its correlation with the number of rooms and with garage-size metrics. However, since these could be real factors that impact the final sale price of a house, they were not removed.
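The VIF check can be run along the lines of the sketch below, assuming the training data is in a pandas DataFrame; the column subset shown is illustrative, not the exact list used. Note the caveat from the Stack Overflow link in the references: statsmodels' `variance_inflation_factor` expects an intercept column, so one is added with `add_constant` first.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

train = pd.read_csv("data/raw/train.csv")  # path is an assumption

# Illustrative subset of the numeric predictors discussed above
cols = ["GrLivArea", "TotalBsmtSF", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF",
        "1stFlrSF", "2ndFlrSF", "GarageArea", "TotRmsAbvGrd"]

# Add an intercept column; without it the VIF values come out misleadingly high
X = add_constant(train[cols].fillna(0))

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))
```

Features with very high VIF values are treated as candidates for dropping, as described in the bullets above.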
- PoolQC, MiscFeature, Alley, Fence, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, BsmtFinType1, BsmtExposure, BsmtFinType2, BsmtCond, BsmtQual, and MasVnrType all use NA to indicate that the particular house feature does not exist, so these NAs do not represent true missing values
- The NAs here were replaced with 'None'
- For Electrical, there is only 1 null value, so it will be imputed with the most common value, SBrkr. Even if this imputation is wrong, the impact should be minimal
- Lot Frontage was investigated further because it is a numerical field where 17.7% of the data is missing
- The Neighborhood & Lot Configuration will be used to help impute the missing Lot Frontage values
- Due to the number of outliers, the median Lot Frontage within each Neighborhood & Lot Configuration group will be used to impute the missing values (a sketch of this imputation follows the figure note below)
Fig.# - If the overall mean or median lot frontage were used, it would not reflect how much the values vary by Neighborhood & Lot Configuration.
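A minimal sketch of the imputation steps described above, assuming the Kaggle training data has been loaded into a pandas DataFrame (the file path is an assumption based on the standard download):

```python
import pandas as pd

train = pd.read_csv("data/raw/train.csv")  # path is an assumption

# Categorical columns where NA means "feature not present", not truly missing
none_cols = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu",
             "GarageType", "GarageFinish", "GarageQual", "GarageCond",
             "BsmtFinType1", "BsmtExposure", "BsmtFinType2", "BsmtCond",
             "BsmtQual", "MasVnrType"]
train[none_cols] = train[none_cols].fillna("None")

# Electrical: a single missing value, imputed with the most common category (SBrkr)
train["Electrical"] = train["Electrical"].fillna(train["Electrical"].mode()[0])

# LotFrontage: impute with the median of each Neighborhood & LotConfig group;
# fall back to the overall median for any group that is entirely missing
group_median = train.groupby(["Neighborhood", "LotConfig"])["LotFrontage"].transform("median")
train["LotFrontage"] = train["LotFrontage"].fillna(group_median)
train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median())
```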
- https://github.com/wblakecannon/ames/blob/master/ipynb/00-eda.ipynb
- https://github.com/thisisclement/Ames-Housing-Price-Prediction/blob/master/code/proj_2_eda_modelling.ipynb
- https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
- https://www.kaggle.com/masumrumi/a-detailed-regression-guide-with-house-pricing
- https://nycdatascience.com/blog/student-works/ames-higher-house-price-prediction/
- Encoding - https://ahume9.medium.com/ames-housing-prices-reconsidered-part-1-simple-encoding-d5f4e8bac675
- Nominal Encoding - https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
- Multicollinearity: https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
- Multiple Regression: https://www.youtube.com/watch?v=AkBjJ6OunR4&t=631s
- Variance_inflation_factor caveat when used to check for Multicollinearity: https://stackoverflow.com/questions/42658379/variance-inflation-factor-in-python