This repository contains several methods for predicting red wine quality ratings from various features (along with my own explanations!). The data can be found as red.txt in the main directory.
Content:
- Regression Tree (packages: rpart)
- Pruned Regression Tree
With each model we will tune the parameters with the package caret (Classification And REgression Training).
library(caret) # provides train() and trainControl() used below
library(rpart)
library(e1071)
library(class)
library(VGAM)
library(xgboost)
Get an overview of the data with str() and check that no missing values are present.
str(red.df)
any(is.na(red.df)) # checking missing values
Split the dataset into a 90% training set and a 10% test set.
index <- sample(nrow(red.df), size = floor(nrow(red.df) * 0.1)) # random row indices for the test set
train <- red.df[-index, ]
test <- red.df[index, ]
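A random split changes on every run unless the seed is fixed. Here is a minimal sketch of a reproducible 90/10 split, using a made-up data frame `toy` as a stand-in for the wine data (the names `toy`, `n_test`, and `idx`, and the seed value, are my own assumptions for illustration):

```r
set.seed(42)                                   # any fixed seed makes the split repeatable
toy <- data.frame(id = 1:50, y = rnorm(50))    # hypothetical stand-in for red.df

n_test <- floor(0.1 * nrow(toy))               # 10% of the rows go to the test set
idx <- sample(nrow(toy), size = n_test)        # draw n_test distinct row indices

toy_train <- toy[-idx, ]                       # remaining 90% for training
toy_test  <- toy[idx, ]                        # held-out 10% for testing

c(train = nrow(toy_train), test = nrow(toy_test))
## train  test
##    45     5
```

Passing the row count as the first argument and the sample size as `size` is what makes the test rows come from the whole dataset rather than only from its first rows.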
The train() function from the caret package trains a model with the given arguments. Each method has its own tuning parameters; rpart has only one, cp (the complexity parameter). A grid of cp values can be passed to the tuneGrid argument, and train() searches it for the best result (e.g., the value of cp giving the lowest RMSE). The trControl argument specifies the type of resampling.
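To make the grid search more concrete, here is a rough hand-rolled sketch of what a cross-validated search over cp does. It is not caret's implementation. It calls rpart directly on the built-in mtcars data as a stand-in for the wine data, and the names `cp_grid`, `fold`, and `cv_rmse` are made up for illustration:

```r
library(rpart)

set.seed(1)
dat <- mtcars                                  # stand-in data, not the wine data
cp_grid <- c(0.001, 0.01, 0.05, 0.1)           # candidate complexity parameters
k <- 4                                         # number of cross-validation folds
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold label per row

# For each cp: fit on k-1 folds, score on the held-out fold, average the RMSE
cv_rmse <- sapply(cp_grid, function(cp) {
  fold_rmse <- sapply(seq_len(k), function(i) {
    fit <- rpart(mpg ~ ., data = dat[fold != i, ], method = "anova",
                 control = rpart.control(cp = cp))
    pred <- predict(fit, newdata = dat[fold == i, ])
    sqrt(mean((pred - dat$mpg[fold == i])^2))
  })
  mean(fold_rmse)
})

best_cp <- cp_grid[which.min(cv_rmse)]         # cp with the lowest average RMSE
data.frame(cp = cp_grid, RMSE = cv_rmse)
```

With train(), the grid and the resampling scheme are supplied through tuneGrid and trControl, so this loop does not have to be written by hand.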
Several models will be built and compared at the end of the repository.
Tree-based methods are simple and easy to interpret. They can be applied to both regression and classification problems. Extended methods such as bagging, random forests, and boosting are built upon basic decision trees.
Grid of tuning parameters
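As a small illustration of why trees are easy to read, here is a sketch that fits a regression tree with rpart directly and prints its splits. It uses the built-in mtcars data rather than the wine data, and the cp value of 0.001 is an arbitrary choice of mine. The last lines show one common way to prune the tree, by picking the cp with the lowest cross-validated error from the fitted tree's cp table:

```r
library(rpart)

set.seed(1)                                    # rpart's internal cross-validation is random
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(cp = 0.001))

print(fit)                                     # text listing of the splits and node means
printcp(fit)                                   # cp table with cross-validated error (xerror)

# Prune back to the cp with the lowest cross-validated error
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Each line of the printed tree is a rule (for example, a threshold on one feature) together with the mean response in that node, which is what makes the model straightforward to explain.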
grid.rt <- expand.grid(.cp = seq(0.001, 0.1, by = 0.001))
head(grid.rt)
## .cp
## 1 0.001
## 2 0.002
## 3 0.003
## 4 0.004
## 5 0.005
trControl <- trainControl(method = "cv", number = 10) # 10-fold Cross Validation
rtCV <- train(quality ~ ., data = train, # model training
              method = "rpart",
              tuneGrid = grid.rt,
              trControl = trControl)
Checking the model rtCV and its plot, we see that the lowest RMSE occurs at cp = 0.003.
rtCV
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.003.
plot(rtCV)
Final step: make predictions on the held-out test set, then check the MSE (mean squared error), which is the mean of the squared differences between each predicted value and the original value.
pred <- predict(rtCV, newdata = test)
head(pred) # prediction
## 105 43 55 50 78 90
## 5.093750 5.181818 5.411255 5.102564 5.365854 5.411255
head(test$quality) # original
## [1] 5 6 6 5 6 5
To find out the Mean Squared Error (MSE) of the prediction:
MSE <- mean((pred - test$quality)^2)
MSE
## [1] 0.4174422
rpart MSE = 0.42
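For intuition, here is a tiny worked example of the MSE calculation and its square root, the RMSE. The three predicted and actual values are made up for illustration and are not taken from the wine data:

```r
pred   <- c(5.1, 5.2, 5.4)        # hypothetical predictions
actual <- c(5, 6, 6)              # hypothetical true ratings

sq_err <- (pred - actual)^2       # approximately 0.01, 0.64, 0.36
mse    <- mean(sq_err)            # approximately 1.01 / 3
rmse   <- sqrt(mse)               # same units as the ratings themselves

c(MSE = mse, RMSE = rmse)
```

Since the RMSE is on the same scale as the quality score, it is often easier to interpret than the MSE. Taking the square root of the test MSE above (0.4174422) gives an RMSE of roughly 0.65 rating points.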