Kaggle Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
Cookiecutter Data Science: This is the structure I am trying to use for my projects, based on a recommendation from a Senior Data Scientist at work.
The purpose of this project is to further develop my skills in data science. It serves as the starting point for a series of projects I want to complete to gain experience in this field. There are a few core things I want to take away from this project:
- Explore advanced regression techniques
- Research the theory
- Understand key assumptions
- Learn about the general processes
- Brush up on my data visualization & exploration skills
- Learn more about best practices for version control in data science
The goal of this project is to create a model that predicts the sale price of houses in Ames, Iowa; the dataset is an alternative to the common Boston Housing dataset. It was compiled by Dean De Cock, and his guidelines for this project are available in the References folder.
- Documentation for the different attributes in the dataset is also available in the References folder
Project Objective: Develop a model to predict sale prices for houses based on various numerical & categorical data
- Since this dataset provides a good opportunity to investigate the properties of regression models, that will be the focus
- In a real work environment, determining what type of technique to use beforehand is definitely not good practice
The training data consists of 1460 rows & 81 columns.
- Based on the documentation by Dean De Cock, there are 23 nominal, 23 ordinal, 14 discrete, and 20 continuous features
- There are 19 attributes with empty/NULL values; the attributes with the most missing values are Alley (Type of alley access to property), FireplaceQu (Fireplace quality), PoolQC (Pool quality), Fence (Fence quality), and MiscFeature (Miscellaneous feature not covered in other categories). A quick null-count check that reproduces these numbers is sketched below.
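The row/column and missing-value counts above can be reproduced with a quick null check on the training data. This is a minimal sketch; the file path is an assumption based on the standard Kaggle download.

```python
import pandas as pd

train = pd.read_csv("data/raw/train.csv")  # path is an assumption

print(train.shape)  # expected: (1460, 81)

# Count missing values per column and keep only the columns that have any
null_counts = train.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)       # columns with missing values, most missing first
print(len(null_counts))  # should be 19
```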
The variance inflation factor (VIF) was used to check for multicollinearity among the independent variables in this dataset; a minimal sketch of the check follows the list below. Based on this analysis, a few variables were dropped:
- BsmtFinSF1, BsmtFinSF2, and BsmtUnfSF were dropped and only the TotalBsmtSF variable was retained
- 1stFlrSF and 2ndFlrSF were dropped because they were highly correlated with GrLivArea
- GrLivArea still had a high VIF score of 8.2. Based on the heatmap, this could be caused by its correlation with the number of rooms and with garage-size metrics. However, since these could be real factors that impact the final sale price of a house, they were not removed.
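The VIF check can be run along the lines of the sketch below, assuming the training data is in a pandas DataFrame; the column subset shown is illustrative, not the exact list used. Note the caveat from the Stack Overflow link in the references: statsmodels' `variance_inflation_factor` expects an intercept column, so one is added with `add_constant` first.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

train = pd.read_csv("data/raw/train.csv")  # path is an assumption

# Illustrative subset of the numeric predictors discussed above
cols = ["GrLivArea", "TotalBsmtSF", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF",
        "1stFlrSF", "2ndFlrSF", "GarageArea", "TotRmsAbvGrd"]

# Add an intercept column; without it the VIF values come out misleadingly high
X = add_constant(train[cols].fillna(0))

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))
```

Features with very high VIF values are treated as candidates for dropping, as described in the bullets above.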
- PoolQC, MiscFeature, Alley, Fence, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, BsmtFinType1, BsmtExposure, BsmtFinType2, BsmtCond, BsmtQual, and MasVnrType all use NA to indicate that the particular house feature does not exist, so these NAs do not represent true missing values
- The NAs here were replaced with 'None'
- For Electrical, there is only 1 null value, so it will be imputed with the most common value, SBrkr. Even if this imputation is wrong, the impact should be minimal
- Lot Frontage was investigated further because it is a numerical field where 17.7% of the data is missing
- The Neighborhood & Lot Configuration will be used to help impute the missing Lot Frontage values
- Due to the number of outliers, the median Lot Frontage within each Neighborhood & Lot Configuration group will be used to impute the missing values (a sketch of this imputation follows the figure note below)
Fig.# - If the overall mean or median lot frontage were used, it would not reflect how much the values vary by Neighborhood & Lot Configuration.
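A minimal sketch of the imputation steps described above, assuming the Kaggle training data has been loaded into a pandas DataFrame (the file path is an assumption based on the standard download):

```python
import pandas as pd

train = pd.read_csv("data/raw/train.csv")  # path is an assumption

# Categorical columns where NA means "feature not present", not truly missing
none_cols = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu",
             "GarageType", "GarageFinish", "GarageQual", "GarageCond",
             "BsmtFinType1", "BsmtExposure", "BsmtFinType2", "BsmtCond",
             "BsmtQual", "MasVnrType"]
train[none_cols] = train[none_cols].fillna("None")

# Electrical: a single missing value, imputed with the most common category (SBrkr)
train["Electrical"] = train["Electrical"].fillna(train["Electrical"].mode()[0])

# LotFrontage: impute with the median of each Neighborhood & LotConfig group;
# fall back to the overall median for any group that is entirely missing
group_median = train.groupby(["Neighborhood", "LotConfig"])["LotFrontage"].transform("median")
train["LotFrontage"] = train["LotFrontage"].fillna(group_median)
train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median())
```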
- https://github.com/wblakecannon/ames/blob/master/ipynb/00-eda.ipynb
- https://github.com/thisisclement/Ames-Housing-Price-Prediction/blob/master/code/proj_2_eda_modelling.ipynb
- https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
- https://www.kaggle.com/masumrumi/a-detailed-regression-guide-with-house-pricing
- https://nycdatascience.com/blog/student-works/ames-higher-house-price-prediction/
- Encoding - https://ahume9.medium.com/ames-housing-prices-reconsidered-part-1-simple-encoding-d5f4e8bac675
- Nominal Encoding - https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
- Multicollinearity: https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
- Multiple Regression: https://www.youtube.com/watch?v=AkBjJ6OunR4&t=631s
- Variance_inflation_factor caveat when used to check for Multicollinearity: https://stackoverflow.com/questions/42658379/variance-inflation-factor-in-python