---
title: "Black Friday Analysis"
author: "Sanata Sy-Sahande"
date: "October 22, 2018"
output: html_document
---
```{r setup, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(echo=TRUE,
                      cache=TRUE, autodep=TRUE, cache.comments=FALSE,
                      message=FALSE, warning=FALSE)
library(dplyr, quietly=TRUE)
# import data
bf <- read.csv("BlackFriday.csv", stringsAsFactors=FALSE, header=TRUE)
```
In this document, I first do some exploratory data analysis on the Black Friday dataset. I then run several prediction models to predict the amount purchased by a customer, and classification models to predict the product category of the purchase.
## Exploratory analysis
The dataset covers about 5,800 customers and 3,600 distinct products.
```{r}
length(unique(bf$Product_ID)) # ~3,600 distinct products
length(unique(bf$User_ID))    # ~5,800 customers
```
The average purchase is about \$9,300. The distribution is roughly bell-shaped, with spikes around \$15,000 and \$20,000.
```{r}
mean(bf$Purchase)
hist(bf$Purchase)
```
### Purchases as a function of features
The next series of plots shows how average purchase amounts vary with the other variables in the dataset.
At first glance, age, gender (men spend more on average), and product category appear to be the strongest predictors of purchase amount.
```{r}
# by age
spend.age <- bf %>% group_by(Age) %>%
  summarise(mean = mean(Purchase))
plot(spend.age$mean, xaxt="n", pch=16, ylim=c(8000, 10000), type='b',
     main = "Spending by Age Groups")
axis(1, 1:7, labels=spend.age$Age)

# by gender
spend.gender <- bf %>% group_by(Gender) %>%
  summarise(mean = mean(Purchase))
barplot(spend.gender$mean, names=spend.gender$Gender,
        main = "Spending by Gender")

# by occupation
spend.occ <- bf %>% group_by(Occupation) %>%
  summarise(mean = mean(Purchase)) %>% arrange(mean)
barplot(spend.occ$mean, names=spend.occ$Occupation,
        main = "Spending by Occupation (masked)")

# by marital status
spend.mar <- bf %>% group_by(Marital_Status) %>%
  summarise(mean = mean(Purchase)) %>% arrange(mean)
barplot(spend.mar$mean, names=spend.mar$Marital_Status,
        main = "Spending by Marital Status")

# by category
spend.cat <- bf %>% group_by(Product_Category_1) %>%
  summarise(mean = mean(Purchase)) %>% arrange(mean)
barplot(spend.cat$mean, names=spend.cat$Product_Category_1,
        main = "Spending by Product Category")

# by city
spend.city <- bf %>% group_by(City_Category) %>%
  summarise(mean = mean(Purchase)) %>% arrange(mean)
barplot(spend.city$mean, names=spend.city$City_Category,
        main = "Spending by City")
```
Next, I quickly check which product categories are most popular. Categories 1, 5, and 8 account for far more purchases than any other category. This class imbalance will come up again in the classification models.
```{r}
##number of purchases per category
n.cat <- bf %>% group_by(Product_Category_1) %>%
  summarise(n = n()) %>% arrange(n)
barplot(n.cat$n, names = n.cat$Product_Category_1,
        main = "Total Purchases by Product Category")
```
I then check whether some categories are popular simply because they contain more products. Indeed, categories 1, 5, and 8 also have the most distinct products.
```{r}
##number of products per category
n.prod <- bf %>% group_by(Product_Category_1) %>%
  summarise(nprod = n_distinct(Product_ID)) %>% arrange(nprod)
barplot(n.prod$nprod, names = n.prod$Product_Category_1,
        main = "Total Products by Product Category")
```
## Data cleaning
Next, I clean the data: I add customer-level summaries (number of distinct products and total spending per customer), drop the secondary product category variables to simplify the later classification, recode the ordinal variables to numeric, and convert the categorical variables to factors.
```{r}
# Data cleaning: product and purchase by customer
nprod.cust <- bf %>% group_by(User_ID) %>%
  summarise(user.nprod = n_distinct(Product_ID))
hist(nprod.cust$user.nprod)
summary(nprod.cust$user.nprod)
purch.cust <- bf %>% group_by(User_ID) %>%
  summarise(totspend = sum(Purchase))

# add to dataset
bf <- left_join(bf, nprod.cust, by="User_ID")
bf <- left_join(bf, purch.cust, by="User_ID")

# Data cleaning: drop variables
# Dropping cat2 and cat3 vars to simplify classification
bf <- bf %>% select(-Product_Category_2, -Product_Category_3)

# Data cleaning: edit variables
bf <- bf %>% mutate(Age = recode(Age, "0-17"=0,
                                 "18-25"=1,
                                 "26-35"=2,
                                 "36-45"=3,
                                 "46-50"=4,
                                 "51-55"=5,
                                 "55+"=6))
bf <- bf %>% mutate(Stay_In_Current_City_Years =
                      recode(Stay_In_Current_City_Years,
                             "0"=0, "1"=1, "2"=2, "3"=3, "4+"=4))

# Data cleaning: convert to factors
bf$Gender <- as.factor(bf$Gender)
bf$Occupation <- as.factor(bf$Occupation)
bf$City_Category <- as.factor(bf$City_Category)
bf$Product_Category_1 <- as.factor(bf$Product_Category_1)
```
## Regression: Predict Purchase Amount
I run several models to predict Purchase based on customer features. I first calculate the baseline RMSE with a simple OLS regression.
```{r}
#Data prep: make training and validation
set.seed(1)
train = sample(1:nrow(bf), nrow(bf)/2)
#LINEAR REG
lm.bf <- lm(Purchase ~ Gender + Age + Occupation +
              City_Category + Stay_In_Current_City_Years +
              Marital_Status,
            data=bf[train, ])
#predict
yhat = predict(lm.bf, newdata=bf[-train, ])
mse <- mean((bf$Purchase[-train] - yhat)^2)
sqrt(mse)
```
The RMSE is `r round(sqrt(mse))`: on average, the OLS predictions miss the actual purchase amount by about \$`r round(sqrt(mse))`, a large error relative to the typical purchase of roughly \$9,300.
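To put that error in context, here is a quick check (using the `mse` and `bf` objects already created above) that compares the RMSE to the mean and standard deviation of `Purchase`:
```{r}
# ratio of test RMSE to the average purchase and to the spread of purchases
sqrt(mse) / mean(bf$Purchase)
sqrt(mse) / sd(bf$Purchase)
```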
### Feature Selection
To improve on the OLS model, I use feature selection to decide which variables to keep. This is especially relevant because the dataset includes two categorical variables with many levels: occupation (21 levels) and product category (18 levels), which together expand into almost 40 dummy variables (see the quick check below).
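As a quick illustration (not part of the original analysis), the dummy columns generated by these two factors can be counted from the expanded design matrix:
```{r}
# count the dummy columns created by the two high-cardinality factors
# (Occupation and Product_Category_1 were converted to factors during cleaning)
ncol(model.matrix(~ Occupation + Product_Category_1, data = bf)) - 1
```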
I first determine the optimal number of variables to include in the model, comparing best subset, forward, and backward selection and using adjusted R-squared as the evaluation metric.
Below are the results of the forward selection approach, which indicate that after about 10 variables the gain in adjusted R-squared from adding more variables becomes negligible. (Similar results for best subset and backward selection are omitted.)
```{r}
library(leaps)
# forward selection
regfit.fwd = regsubsets(Purchase ~ Gender + Age + Occupation +
                          City_Category + Stay_In_Current_City_Years +
                          Marital_Status,
                        data=bf, nvmax=19, method="forward")
reg.summary = summary(regfit.fwd)
plot(reg.summary$adjr2, xlab="Number of Variables",
     ylab="Adjusted RSq", type="l")
which.max(reg.summary$adjr2)
fwd.coefs <- coef(regfit.fwd, 10)
fwd.coefs
```
```{r}
#First, create a vector that allocates each observation to one of k = 10 folds
k=10
set.seed(1)
folds = sample(1:k, nrow(bf), replace=TRUE)
cv.errors = matrix(NA, k, 15, dimnames=list(NULL, paste(1:15))) # matrix to store results

# prediction method for regsubsets objects (regsubsets has no predict() of its own)
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]]) # recover the model formula from the call
  mat = model.matrix(form, newdata)   # build the design matrix for the new data
  coefi = coef(object, id=id)         # coefficients of the id-variable model
  xvars = names(coefi)
  mat[, xvars] %*% coefi              # predictions
}
#In the jth fold, the elements of folds that equal j
#are in the test set, and the remainder are in the training set
for(j in 1:k){
  best.fit = regsubsets(Purchase ~ Gender + Age + Occupation +
                          City_Category + Stay_In_Current_City_Years +
                          Marital_Status,
                        data=bf[folds != j, ], nvmax=15)
  for(i in 1:15) {
    pred = predict.regsubsets(best.fit, bf[folds == j, ], id=i) # predictions for each model size
    # compute the test errors on the appropriate subset
    # and store them in cv.errors
    cv.errors[j, i] = mean((bf$Purchase[folds == j] - pred)^2)
  }
  # fills a k x 15 matrix in cv.errors
}
mean.cv.errors = apply(cv.errors, 2, mean) # average CV error for each model size
mean.cv.errors
par(mfrow=c(1,1))
plot(mean.cv.errors, type='b')
```
As an alternative, the cross-validation above determines the optimal number of variables by performing best subset selection with up to 15 variables within each of k=10 folds. The results confirm that 10 to 15 variables is the ideal range. I select 12 and obtain the corresponding coefficients from best subset selection on the full sample.
```{r}
reg.best = regsubsets(Purchase ~ Gender + Age + Occupation +
                        City_Category + Stay_In_Current_City_Years +
                        Marital_Status,
                      data=bf, nvmax=15)
coef(reg.best, 12)
lm.fit <- lm(Purchase ~ Gender +
               (Occupation==1) + (Occupation==7) +
               (Occupation==10) + (Occupation==12) +
               (Occupation==14) + (Occupation==15) +
               (Occupation==17) + (Occupation==19) +
               (Occupation==20) + (City_Category=="B") +
               (City_Category=="C"), data=bf[train, ])
#predict
yhat = predict(lm.fit, newdata=bf[-train, ])
mse <- mean((bf$Purchase[-train] - yhat)^2) #MSE
sqrt(mse)
```
Selecting the best subset did not noticeably improve the RMSE, which is now `r round(sqrt(mse))`. This suggests that a linear model may not be the right fit for these data, so I turn to regression trees instead.
## Regression Trees
In this section, I compare the performance of random forests and boosting to try to reduce the RMSE.
```{r}
library(gbm)
boost.bf = gbm(Purchase ~ Gender + Age + Occupation +
                 City_Category + Stay_In_Current_City_Years +
                 Marital_Status,
               data=bf[train, ],
               distribution="gaussian",  # use "bernoulli" for classification
               n.trees=500,
               interaction.depth = 4)    # limit depth of trees

# partial dependence plots: marginal effect of selected vars
par(mfrow=c(1,2))
plot(boost.bf, i="Gender")
plot(boost.bf, i="City_Category")

# predict purchase on test set
yhat.boost = predict(boost.bf, newdata=bf[-train, ], n.trees=500)
mse <- mean((yhat.boost - bf$Purchase[-train])^2)
sqrt(mse) #RMSE = 4932
```
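The random forest side of the comparison is only sketched here rather than run: `randomForest()` is slow on a dataset of this size, so the chunk below (not evaluated) illustrates how the comparison could be set up on a subsample of the training data. The subsample size, `ntree`, and `mtry` values are arbitrary illustrative choices, not results from the analysis above.
```{r, eval=FALSE}
library(randomForest)
# illustrative random forest fit on a subsample of the training observations
set.seed(1)
sub <- sample(train, 50000)  # subsample purely to keep run time manageable
rf.bf <- randomForest(Purchase ~ Gender + Age + Occupation +
                        City_Category + Stay_In_Current_City_Years +
                        Marital_Status,
                      data=bf[sub, ], ntree=100, mtry=3, importance=TRUE)
yhat.rf <- predict(rf.bf, newdata=bf[-train, ])
sqrt(mean((yhat.rf - bf$Purchase[-train])^2))  # RMSE, comparable to the boosting RMSE above
```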
Again, the boosted trees reduce the RMSE, but not by much. My guess is that customers' individual characteristics explain purchase amounts far less well than the products themselves do. This is borne out by comparing the R-squared of the original OLS model with that of a model that adds product category to the right-hand side.
```{r}
lm.bf <- lm(Purchase ~ Gender + Age + Occupation +
              City_Category + Stay_In_Current_City_Years +
              Marital_Status,
            data=bf[train, ])
sum.lm1 <- summary(lm.bf)
sum.lm1$r.squared
lm.bf <- lm(Purchase ~ Gender + Age + Occupation + Product_Category_1 +
              City_Category + Stay_In_Current_City_Years +
              Marital_Status,
            data=bf[train, ])
sum.lm2 <- summary(lm.bf)
sum.lm2$r.squared
```
As suspected, including product categories increases the training R-squared from `r round(sum.lm1$r.squared, 2)` to `r round(sum.lm2$r.squared, 2)`. But this is not particularly informative: it simply indicates that customers who buy products from more expensive categories spend more money. It seems, then, that the more appropriate task is to predict *which categories* of goods customers will buy--a classification problem.
However, before continuing, I make the case for why the prediction models were still informative. Despite their low explanatory power, we now have a better idea of which types of customers tend to spend more: male customers and those in a handful of key (masked) occupations. I would expect these same features to be relevant in the classification models.