---
title: "Activity Classifications using Practical Machine Learning"
author: "Bill Felix"
date: "July 25, 2015"
output: html_document
---
#Executive Summary
The goal of this project is to predict the manner in which six participants performed barbell lifts. Each lift was performed in one of five ways: one correct method and four incorrect methods. Accelerometer data was collected from the belt, forearm, arm, and dumbbell during the lifts. The procedure below cleans and explores the data, then builds a Random Forest classification model with a target of greater than 99% accuracy, i.e. an out-of-sample error below 1%. The rendered HTML report is available here: http://rpubs.com/manlike_fox/95565
##Downloading and Importing Data
Locate the data files
```{r echo =T, cache =T}
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
```
Background information on the data files: http://groupware.les.inf.puc-rio.br/har
Import and structure missing data
**Note: It is important that missing values are handled in a consistent manner and that "NA", "#DIV/0!", and "" are all NA values.**
```{r echo =T, cache =T}
train <- read.csv(url(trainUrl), na.strings=c("NA","#DIV/0!",""))
test <- read.csv(url(testUrl), na.strings=c("NA","#DIV/0!",""))
```
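The effect of `na.strings` can be checked on a tiny inline CSV. This is a minimal sketch: the column names and values below are made up for illustration and are not taken from the course data.

```{r echo =T, eval =F}
## made-up sample: one "#DIV/0!" marker, one literal NA, one empty field
csvText <- "id,roll,kurtosis
1,1.5,#DIV/0!
2,NA,0.3
3,2.5,
"
## default settings: only "NA" is treated as missing
raw <- read.csv(text = csvText)
## all three markers treated as missing, as in the import above
clean <- read.csv(text = csvText, na.strings = c("NA", "#DIV/0!", ""))
class(raw$kurtosis) ## not numeric, because "#DIV/0!" is kept as text
class(clean$kurtosis) ## numeric
colSums(is.na(clean)) ## missing values per column
```

With the default settings the `kurtosis` column is read as text, so it would later be dropped as a non-numeric column instead of being recognized as a column with missing values.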
Look at the dimensions of the training data set
```{r echo =F, cache =T}
dim(train)
```
##Data Wrangling (variable selection)
Select the variables to keep for the model: remove columns that contain missing values, remove non-numeric (factor) columns such as `user_name`, and remove the variables related to the timestamp of each observation.
```{r echo =T, cache=T}
train <- train[,!sapply(train, function(x) any(is.na(x)))]
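isText <- sapply(train, function(x) is.factor(x) || is.character(x))
isText["classe"] <- FALSE ## always keep the outcome
train <- train[,!isText] ## drop text columns such as user_name
train$classe <- factor(train$classe) ## classification needs a factor outcome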
train <- train[,-c(1:4)]
dim(train) ## look
```
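The same filtering steps can be traced on a small data frame. This is a sketch under stated assumptions: the data frame `toy` and its values are hypothetical, and only the column names mimic the style of the real data.

```{r echo =T, eval =F}
## hypothetical miniature of the training data
toy <- data.frame(
  X = 1:4,
  user_name = factor(c("p1", "p2", "p1", "p3")),
  roll_belt = c(1.4, 1.5, 1.6, 1.7),
  max_roll_belt = c(NA, 2.1, NA, NA),
  classe = factor(c("A", "B", "A", "C"))
)
## step 1: drop every column that contains at least one NA
noNA <- toy[, !sapply(toy, anyNA)]
## step 2: drop text columns, but protect the outcome column by name
isText <- sapply(noNA, function(x) is.factor(x) || is.character(x))
isText["classe"] <- FALSE
kept <- noNA[, !isText]
names(kept)
```

Here `max_roll_belt` is removed in step 1 and `user_name` in step 2, leaving `X`, `roll_belt`, and `classe`.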
To explore potential relationships, we examine a measurement captured by all four accelerometers: the `total_accel` feature. Grouped by the `classe` variable, a boxplot shows a few subtle differences.
```{r echo =T, cache =T}
library(dplyr)
library(tidyr)
library(ggplot2)
t_accel <- train %>% select(classe, starts_with("total")) %>%
  gather(feature, value, -classe)
qplot(x = classe, y = value
      , data = t_accel
      , geom = c("boxplot")
      , fill = feature
      , main = "Comparison of Each Total Accel by Classe")
```
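A numeric companion to the boxplot is the median of each feature within each class. The sketch below uses base R on made-up values (the data frame `wide` is hypothetical) to show the wide-to-long reshaping followed by a grouped summary.

```{r echo =T, eval =F}
## hypothetical wide data: one column per total_accel feature
wide <- data.frame(
  classe = c("A", "A", "B", "B"),
  total_accel_belt = c(3, 5, 20, 22),
  total_accel_arm = c(30, 34, 25, 27)
)
## stack() turns the feature columns into (values, ind) pairs
stacked <- stack(wide[, -1])
long <- data.frame(
  classe = rep(wide$classe, times = ncol(wide) - 1),
  feature = stacked$ind,
  value = stacked$values
)
## median of each feature within each class
medians <- aggregate(value ~ classe + feature, data = long, FUN = median)
medians
```

Large gaps between class medians for a feature suggest that the feature carries some signal for the classifier.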
#Building a Machine Learning Model
##Creating Training and Validation sets
To train the model, we partition the training data and use 70% of it for fitting. The remaining 30% is held out as a validation set for estimating the out-of-sample error.
```{r echo =T, cache =T, message =F}
library(caret)
library(AppliedPredictiveModeling)
set.seed(88)
## setup sets
ourTrain <- createDataPartition(y = train$classe , p = .70, list =F)
training <- train[ ourTrain,] ## create 1st partition
validating <- train[ -ourTrain,] ## create 2nd partition
dim(training) ## look
dim(validating) ## look
```
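When the outcome is a factor, `createDataPartition` samples within each level so that class proportions are roughly preserved in both sets. The idea can be sketched in base R as follows. This illustrates stratified sampling on made-up labels and is not caret's actual implementation.

```{r echo =T, eval =F}
set.seed(88)
## made-up outcome with unbalanced classes
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
## row indices grouped by class
byClass <- split(seq_along(labels), labels)
## draw 70% of the rows within each class
inTrain <- unlist(lapply(byClass, function(i) {
  sample(i, size = round(0.70 * length(i)))
}))
trainLabels <- labels[inTrain]
validLabels <- labels[-inTrain]
table(trainLabels) / length(trainLabels) ## class proportions in the 70% split
table(labels) / length(labels) ## class proportions in the full data
```

Because the draw happens per class, the 70% split keeps the same 50/30/20 class mix as the full set.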
##Fit the Model
The `caret` package makes it simple to experiment with different machine learning algorithms. The Random Forest method was chosen after researching algorithms suited to classification. Random Forest trains each tree on a bootstrap sample of the data, which helps mitigate overfitting on unseen data. Source: https://en.wikipedia.org/wiki/Random_forest
```{r echo =T, cache =T, message =F, warning =F}
library(dplyr)
data <- training[,1:52]
classe <- training$classe
fit <- train(x = data, y = classe
             , method = "rf")
print(fit$finalModel)
```
The Random Forest reports an out-of-bag (OOB) error estimate of 0.7% above, which is below the 1% threshold set in the Executive Summary.
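The out-of-bag estimate comes from the bootstrap step: each tree is grown on a sample drawn with replacement, so some observations are left out of that sample and can act as test cases for that tree. A minimal base R sketch, using made-up row indices rather than the course data, shows that roughly 1 - 1/e (about 37%) of the observations are left out of any one bootstrap sample.

```{r echo =T, eval =F}
set.seed(1)
n <- 1000
## one bootstrap sample: n row indices drawn with replacement
inBag <- sample(n, size = n, replace = TRUE)
## rows never drawn are "out of bag" for this sample
outOfBag <- setdiff(seq_len(n), inBag)
oobFraction <- length(outOfBag) / n
oobFraction ## close to exp(-1)
```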
##Validation
The following procedure estimates the error rate on unseen data using the `validating` data set.
```{r echo =T, cache =T, eval =T}
predicts <- predict(fit, validating[,1:52])
confusionMatrix(predicts, validating$classe)
```
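The overall accuracy reported by `confusionMatrix` is the share of predictions on the diagonal of the cross-tabulation. A small base R sketch with made-up labels (the vectors `actual` and `predicted` are hypothetical) reproduces that calculation by hand.

```{r echo =T, eval =F}
## made-up labels: 10 observations, 3 classes
classes <- c("A", "B", "C")
actual <- factor(c("A","A","A","B","B","C","C","C","C","B"), levels = classes)
predicted <- factor(c("A","A","B","B","B","C","C","A","C","B"), levels = classes)
## rows are predictions, columns are the reference labels
tab <- table(Prediction = predicted, Reference = actual)
accuracy <- sum(diag(tab)) / sum(tab)
errorRate <- 1 - accuracy
accuracy ## 0.8, since 8 of 10 predictions are correct
```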
##Results
The prediction shows 99.24% accuracy on the validation set, i.e. an estimated out-of-sample error of 0.76%. The breakdown by class is shown visually below:
```{r echo =T, cache =T}
## select the per-class measures by name so the result does not depend on column order
measures <- c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Balanced Accuracy")
looks <- confusionMatrix(predicts, validating$classe)$byClass
viz <- as.data.frame(looks[, measures]) %>% gather(var, value, factor_key = TRUE)
viz$classe <- rep(c("A","B","C","D","E"), 5)
qplot(x = var, y = value
      , data = viz
      , color = classe
      , main = "Prediction Accuracy Measures by Class"
      , ylab = "Accuracy"
      , xlab = "Measures")
```
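The per-class measures treat each class in turn as the "positive" class and all other classes as "negative". The following base R sketch, on made-up labels, derives sensitivity, specificity, and balanced accuracy from a cross-tabulation. It is meant as an illustration of the definitions, not as a replacement for `confusionMatrix`.

```{r echo =T, eval =F}
## made-up labels: 10 observations, 3 classes
classes <- c("A", "B", "C")
actual <- factor(c("A","A","A","B","B","C","C","C","C","B"), levels = classes)
predicted <- factor(c("A","A","B","B","B","C","C","A","C","B"), levels = classes)
tab <- table(Prediction = predicted, Reference = actual)
## one-vs-rest counts for each class
truePos <- diag(tab)
falsePos <- rowSums(tab) - truePos ## predicted as the class, but actually another
falseNeg <- colSums(tab) - truePos ## actually the class, but predicted as another
trueNeg <- sum(tab) - truePos - falsePos - falseNeg
sensitivity <- truePos / (truePos + falseNeg)
specificity <- trueNeg / (trueNeg + falsePos)
balanced <- (sensitivity + specificity) / 2
round(cbind(sensitivity, specificity, balanced), 3)
```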
##Conclusion
The Random Forest method proved to be a great fit for this classification problem. The model performs so well out of the box that no further tuning was necessary. Training time could be improved with parallel resources; however, that was outside the scope of this project.
#Submissions
The following procedure applies the model to the test data set and writes out the submission files.
##Reciprocal PreProcessing to the Test Set
```{r echo =T, cache =T, eval =F}
test <- test[,!sapply(test, function(x) any(is.na(x)))]
test <- test[,!sapply(test, function(x) is.factor(x) || is.character(x))] ## drop text columns by type, not by position
test <- test[,-c(1:4)]
dim(test)
```
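Repeating the filtering steps on the test set only works if they happen to remove the same columns as in the training set. A more defensive alternative is to select the test columns by the training predictor names. The sketch below shows the idea on hypothetical data frames (`trainToy` and `testToy` are made up for illustration).

```{r echo =T, eval =F}
## hypothetical cleaned training data and raw test data
trainToy <- data.frame(
  roll_belt = c(1.4, 1.5),
  pitch_belt = c(8.1, 8.2),
  classe = factor(c("A", "B"))
)
testToy <- data.frame(
  problem_id = 1:2,
  pitch_belt = c(8.0, 8.3),
  extra = c(0, 1),
  roll_belt = c(1.3, 1.6)
)
## predictor names are everything in the training data except the outcome
predictors <- setdiff(names(trainToy), "classe")
## fail early if the test data lacks a predictor
stopifnot(all(predictors %in% names(testToy)))
## same columns, same order as the training predictors
testAligned <- testToy[, predictors]
names(testAligned)
```

Selecting by name also fixes the column order, so the test columns line up with the training predictors regardless of how the raw file is laid out.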
##Writing Out Submission Documents
```{r echo =T, message =F, eval =F}
library(dplyr)
predictions <- predict(fit, newdata = test[,1:52])
print(predictions)
```
```{r eval =F}
pml_write_files <- function(x){
  ## the output directory must exist before any file is written
  dir.create("./answers", showWarnings = FALSE)
  for(i in seq_along(x)){
    filename <- paste0("./answers/problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
pml_write_files(predictions)
```
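The writer can be checked without touching the working directory by pointing it at a temporary folder. The sketch below assumes a hypothetical variant, `write_answers`, that takes the output directory as an argument; the predictions are made up for illustration.

```{r echo =T, eval =F}
## hypothetical variant of the writer with a configurable output directory
write_answers <- function(x, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(x)) {
    filename <- file.path(dir, paste0("problem_id_", i, ".txt"))
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
  invisible(list.files(dir))
}
## made-up predictions written to a temporary folder
outDir <- file.path(tempdir(), "answers_demo")
demoPredictions <- factor(c("B", "A", "E"))
write_answers(demoPredictions, outDir)
## each file holds a single class label
secondAnswer <- readLines(file.path(outDir, "problem_id_2.txt"))
secondAnswer
```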