#' ---
#' title: "Project 3"
#' author: "VinaTeam"
#' date: "December 3rd, 2021"
#' ---
# Problem 3 Introduction:
# On one Earth in the multiverse of madness: with home prices escalating in King County,
# an aspiring NoMaj (i.e. muggle), Jacob Kawalski, who has been sleepless since Queenie decided to
# join the dark side, launches a real estate business. First, however, he needs to understand the real estate market.
# Objectives:
# For this problem, we will analyze a sample drawn from the larger dataset 'kc_house_data_2.csv'. Based on this sample,
# we will build and select an appropriate model for the business that predicts home prices in King County.
# Data Description:
# The house_2.csv dataset contains tens of thousands of customer records and includes many variables,
# such as year, price, number of bathrooms, number of bedrooms, zipcode, square footage, etc.
# Libraries ------------------------------------------------------------
library(rpart)
library(rpart.plot)
library(forecast)
library(caret)
library(ISLR)
library(tree)
library(tidyr)
library(dplyr)
library(janitor)
library(ROSE)
library(ggplot2)
library(corrgram)
library(ggpubr)
# 1. Data import and preparation ---------------------------------------------
# For the data, we decided to remove all variables that were deemed unnecessary.
# Using domain knowledge, we removed variables such as the day of the week and the day of the sale,
# since we usually don't know exactly when a person will want to sell. We also removed sqft_living and sqft_lot,
# because the sqft_living15 and sqft_lot15 variants have post-renovation numbers and so are more up to date. We kept variables
# like condition and zipcode, because customers usually care about a house's condition and location.
# Beyond removing variables, we turned all the categorical variables into factors
# and dropped all rows with missing values.
#-----------------------------------------------------------------------------
## Data importing
house <- read.csv("house_2.csv", header = TRUE)
head(house, 10)
names(house)
## Removing unnecessary variables
house_filtered <- house[, -c(1,2,4,5,6,10,11,14,17,22,23)]
house_filtered <- drop_na(house_filtered)
names(house_filtered)
str(house_filtered)
names1 <- c("waterfront", "condition", "grade", "zipcode")
house_filtered[,names1] <- lapply(house_filtered[,names1], as.factor)
str(house_filtered)
# 2. Building regression model & Evaluating the model -------------------------------------
#
# After cleaning and transforming the data, now is the time to build the model.
# For the model, we chose linear regression because we felt it was a good fit
# for this problem and would tell us which variables are the most significant
# in predicting housing prices.
# For training, we set the random seed to 669 and used a 60/40 training/validation split.
# This means that 60% of the data is used to train the model, while the remaining 40% is held out
# to validate that the model still performs well on unfamiliar data.
#
#----------------------------------------------
## Setting seed, Creating training and validation set
set.seed(669)
train_index <- sample(1:nrow(house_filtered), 0.6 * nrow(house_filtered))
valid_index <- setdiff(1:nrow(house_filtered), train_index)
## Inputting index to the variables
train_df <- house_filtered[train_index, ]
valid_df <- house_filtered[valid_index, ]
## Counting rows of the training and validation sets
nrow(train_df)
nrow(valid_df)
## Comparing the training set and the validation set
compare_df_cols(train_df, valid_df)
## Exploring the relationships between the variables
## Correlogram
corrgram(train_df)
## Scatterplot
ggplot(data = train_df) + aes(x = price, y = sqft_living15) +
  geom_point() +
  ggtitle("Correlation between price and living area (sqft_living15)") +
  geom_smooth(method = lm, se = TRUE) +
  stat_cor(method = "pearson", label.x = 1000, label.y = 3)
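## To back the plot with numbers, the correlation of each numeric predictor
## with price can be listed directly (a sketch, using the same train_df as above
## and keeping only the numeric columns):
num_cols <- sapply(train_df, is.numeric)
sort(cor(train_df[, num_cols])[, "price"], decreasing = TRUE)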
###-----------------------------------------------------------------
### Living area (sqft_living15) has the strongest relationship with price.
###-------------------------------------------------------------------
## Building model from the training set
price_model <- lm(price ~ ., data = train_df)
summary(price_model)
### -----------------------------------------------------------------
### Based on the summary, we can conclude:
### Given the F-statistic and its p-value, the model is statistically significant overall.
### The adjusted R-squared is relatively high (80.87%), which indicates a good fit.
### Ten variables are significant in predicting price: year, bathrooms, floors,
### waterfront, grade, sqft_basement, yr_built, yr_renovated, zipcode and sqft_living15.
###-----------------------------------------------------------------
## Predicting the outcome of the training and validation set using the model
price_model_prediction_train <- predict(price_model, train_df)
price_model_prediction_valid <- predict(price_model, valid_df)
## Comparing errors between training and validation sets
accuracy(price_model_prediction_train, train_df$price)
accuracy(price_model_prediction_valid, valid_df$price)
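## As a sanity check, the validation RMSE reported by accuracy() can be
## reproduced by hand (a sketch, using the same objects as above):
sqrt(mean((valid_df$price - price_model_prediction_valid)^2))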
### -------------------------------------------------------------
### The RMSE of the validation set is only slightly higher than that of the training set,
### which suggests the model is not overfitting.
### The RMSE is relatively low, which indicates a good model.
###-------------------------------------------------------------
### Discussion & Evaluation-----------------------------------
### The final model was fairly accurate, as the RMSE is relatively low, and the adjusted R-squared is also relatively high.
### In the real world, many of these variables can be put to use, such as grade:
### a higher grade means the house can sell for a higher price.
### A grading system might not be available in all counties, but other significant variables,
### such as condition, bathrooms, and zipcode, also affect house prices.
###-------------------------------------------------------------
# 3. Predict new prices for new houses-------------------------
# Here we use the fitted model to predict prices for the new houses.
#------------------------------------------------------------
## Importing new records
house_pred <- read.csv("house_test_2.csv", header = TRUE)
names(house_pred)
## Removing the unused variables from the records
house_pred_filtered <- house_pred[, -c(1,2,4,5,6,9,10,13,16,21,22)]
names(house_pred_filtered)
str(house_pred_filtered)
house_pred_filtered[,names1] <- lapply(house_pred_filtered[,names1], as.factor)
str(house_pred_filtered)
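## A common pitfall when predicting on new records: a factor column in the new
## data may contain a level (e.g. a zipcode) the model never saw in training,
## which makes predict() fail. Aligning the levels with the training set guards
## against this (a sketch; a row with an unseen level then yields an NA
## prediction rather than an error):
for (v in names1) {
  house_pred_filtered[[v]] <- factor(house_pred_filtered[[v]],
                                     levels = levels(train_df[[v]]))
}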
## Predict record 1
house_1 <- house_pred_filtered[1,]
record_price1 <- predict(price_model, house_1)
record_price1
### The price of the first record is $410,910.1
## Predict record 2
house_2 <- house_pred_filtered[2,]
record_price2 <- predict(price_model, house_2)
record_price2
### The price of the second record is $331,777.3
## Predict record 3
house_3 <- house_pred_filtered[3,]
record_price3 <- predict(price_model, house_3)
record_price3
### The price of the last record is $343,520.8