---
title: "Activity Classifications using Practical Machine Learning"
author: "Bill Felix"
date: "July 25, 2015"
output: html_document
---
#Executive Summary
The intent of this project is to predict the manner in which six participants performed barbell lifts. There are five possible classes: one correct method and four incorrect methods. Accelerometer data were collected from the belt, forearm, arm, and dumbbell during the lifts. The following procedure cleans and explores the data, and then builds a classification model using the Random Forest method, with the goal of achieving greater than 99% accuracy, that is, an Out of Sample Error below 1%. To view the resulting html file go here: http://rpubs.com/manlike_fox/95565

##Downloading and Importing Data
Locate the data files
```{r echo =T, cache =T}
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
```
Background information on the data files: http://groupware.les.inf.puc-rio.br/har

Import and structure missing data  
**Note: It is important that missing values are handled in a consistent manner and that "NA", "#DIV/0!", and "" are all NA values.**
```{r echo =T, cache =T}
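## treat all three tokens as NA in both files;
## stringsAsFactors = TRUE keeps `classe` a factor on R >= 4.0
train <- read.csv(url(trainUrl), na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = TRUE)
test <- read.csv(url(testUrl), na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = TRUE)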
```
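As a small illustration of what `na.strings` does, the sketch below parses a made-up three-row CSV. The column names and values are invented for this example and are not taken from the real files. All three tokens are read as `NA`, so the measurement columns are parsed as numeric rather than as text.

```r
## Hypothetical inline CSV; values are invented for illustration only
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,#DIV/0!,A
NA,,B
1.48,0.5,C"

toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

## Each of "NA", "#DIV/0!" and "" is read as a missing value,
## so both measurement columns are parsed as numeric
colSums(is.na(toy))
sapply(toy, class)
```

If `"#DIV/0!"` were left out of `na.strings`, the second column would be read as text, and a later filter on numeric or factor columns would treat it differently in the two files.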

Look at the dimensions of the training data set
```{r echo =F, cache =T}
dim(train)
```

##Data Wrangling (variable selection)
Determine which variables should be kept for the model. Columns that contain missing values are removed, as are non-numeric columns such as `user_name`. The first four remaining columns, which hold row-index and timestamp bookkeeping rather than sensor measurements, are removed as well. The outcome `classe` is always kept.
```{r echo =T, cache=T}
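## drop columns that contain any missing values
train <- train[, !sapply(train, anyNA)]
## drop non-numeric columns (e.g. user_name), but always keep the outcome
keep <- sapply(train, is.numeric)
keep["classe"] <- TRUE
train <- train[, keep]
## drop the four leading bookkeeping columns (row index, timestamps, window)
train <- train[, -c(1:4)]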

dim(train) ## look
```
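The same two filtering steps can be tried on a toy data frame. This is a minimal sketch: the column names imitate the style of the real data, but the rows are invented.

```r
## Toy data frame; all values are invented for illustration
toy <- data.frame(
  user_name = c("user1", "user2", "user3"),
  roll_belt = c(1.41, 1.42, 1.48),
  max_roll_belt = c(NA, 2.1, NA),
  classe = factor(c("A", "B", "A"))
)

## Step 1: drop any column that contains a missing value
toy <- toy[, !sapply(toy, anyNA)]

## Step 2: keep numeric columns plus the outcome column
keep <- sapply(toy, is.numeric)
keep["classe"] <- TRUE
toy <- toy[, keep]

names(toy)
```

Only `roll_belt` and `classe` survive: `max_roll_belt` is dropped for its missing values and `user_name` for not being numeric.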

To explore potential relationships, we examine a measurement that is captured by all four accelerometers. This example covers the `total_accel` features. Grouped by the `classe` variable, a boxplot shows a few subtle differences between classes.
```{r echo =T, cache =T}
library(dplyr)
library(tidyr)
library(ggplot2)

t_accel <- train %>% select(classe, starts_with("total")) %>%
    gather(feature, value, -classe)

qplot(x = classe, y = value
      , data = t_accel
      , geom = c("boxplot")
      , fill = feature
      , main = "Comparison of Each Total Accel by Classe")
```
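The reshaping step behind the plot can be sketched in base R on invented numbers. The example below stacks two hypothetical `total_accel` columns into long format and computes the per-class median, which corresponds to the centre line of each box.

```r
## Toy wide-format data; values are invented for illustration
toy <- data.frame(
  classe = c("A", "A", "B", "B"),
  total_accel_belt = c(3, 4, 5, 6),
  total_accel_arm = c(30, 34, 20, 22)
)

## Stack the measurement columns into long format:
## one row per (classe, feature, value) combination
long <- data.frame(classe = rep(toy$classe, times = 2), stack(toy[, -1]))
names(long) <- c("classe", "value", "feature")

## Median per class and feature, i.e. the centre line of each box
med <- aggregate(value ~ classe + feature, data = long, FUN = median)
med
```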

#Building a Machine Learning Model

##Creating Training and Validation sets
To train our model we create a partition containing 70% of the training data. The remaining 30% is held out as a validation set and used to estimate the Out of Sample error.
```{r echo =T, cache =T, message =F}
library(caret)

set.seed(88)

## setup sets
ourTrain <- createDataPartition(y = train$classe , p = .70, list =F)
training <- train[ ourTrain,] ## create 1st partition
validating <- train[ -ourTrain,] ## create 2nd partition

dim(training) ## look
dim(validating) ## look
```
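When the outcome is a factor, `createDataPartition` samples within each level so that both partitions keep roughly the same class balance. The sketch below is a rough base-R approximation of that idea on an invented outcome vector. It is not `caret`'s implementation.

```r
set.seed(88)

## Hypothetical outcome vector with unbalanced classes
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

## Within each class, sample 70% of the row indices
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}))

## Class proportions are preserved in both partitions
prop.table(table(classe[in_train]))
prop.table(table(classe[-in_train]))
```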

##Fit the Model
The `caret` package makes it simple to experiment with different machine learning algorithms. The Random Forest method was chosen after researching algorithms suited to classification. Random Forest grows each tree on a bootstrap sample of the training data, which helps mitigate potential overfitting on unseen data. Source: https://en.wikipedia.org/wiki/Random_forest
```{r echo =T, cache =T, message =F, warning =F}
library(dplyr)

data <- training[,1:52]
classe <- training$classe

fit <- train(x = data, y = classe
             , method = "rf")
print(fit$finalModel)
```

Random Forest reported an Out of Bag (OOB) error estimate of about 0.7% above, which is below the 1% error threshold set in the Executive Summary.
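
The OOB estimate is possible because each tree is grown on a bootstrap sample, and the rows left out of that sample can be predicted by a tree that never saw them. As a minimal sketch with a hypothetical row count, the simulation below shows that one bootstrap sample leaves out roughly a third of the rows.

```r
set.seed(88)

n <- 10000  ## hypothetical number of training rows

## One bootstrap sample: draw n row indices with replacement
boot_idx <- sample.int(n, size = n, replace = TRUE)

## Rows never drawn are "out of bag" for the tree grown on this sample
oob_fraction <- 1 - length(unique(boot_idx)) / n

oob_fraction  ## close to exp(-1), roughly 0.368
```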

##Validation
The following procedure estimates the error rate on unseen data using the `validating` data set.
```{r echo =T, cache =T, eval =T}
predicts <- predict(fit, validating[,1:52])
confusionMatrix(predicts, validating$classe)
```
##Results
The prediction shows 99.24% accuracy on the validation set, which corresponds to an estimated Out of Sample error of 0.76%. The per-class breakdown is shown visually below:
```{r echo =T, cache =T}
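cm <- confusionMatrix(predicts, validating$classe)
## select measures by name rather than by position, which depends on the caret version
measures <- c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Balanced Accuracy")
byClass <- cm$byClass[, measures]
## one row per class/measure pair (the matrix is unrolled column by column)
viz <- data.frame(classe = rep(sub("Class: ", "", rownames(byClass)), times = length(measures)),
                  var = factor(rep(measures, each = nrow(byClass)), levels = measures),
                  value = as.vector(byClass))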
qplot(x = var, y = value
      , data = viz
      , color = classe
      , main = "Prediction Accuracy Measures by Class"
      , ylab = "Accuracy"
      , xlab = "Measures")
```
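To make the measures in the plot concrete, the sketch below derives overall accuracy and the per-class sensitivity, specificity, and balanced accuracy from a small confusion table. The table is hypothetical and unrelated to the model's actual results. Each class is scored one-vs-rest.

```r
## Hypothetical 3-class confusion table (rows = predicted, columns = reference)
tab <- matrix(c(2, 0, 1,
                1, 2, 0,
                0, 0, 4),
              nrow = 3, byrow = TRUE,
              dimnames = list(Prediction = c("A", "B", "C"),
                              Reference  = c("A", "B", "C")))

## Overall accuracy is the share of observations on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)

tp <- diag(tab)
fp <- rowSums(tab) - tp  ## predicted as the class but actually another
fn <- colSums(tab) - tp  ## actually the class but predicted as another
tn <- sum(tab) - tp - fp - fn

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

accuracy
round(cbind(sensitivity, specificity, balanced_accuracy), 3)
```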

##Conclusion
The Random Forest method proved to be a great fit for this classification problem. The model performs so well out of the box that no further tuning was necessary. Training time could likely be reduced by using parallel resources; however, that was outside the scope of this project.

#Submissions
The following procedure applies the model to the test data set and writes out the submission files.

##Applying the Same Preprocessing to the Test Set
```{r echo =T, cache =T, eval =F}
## keep exactly the predictor columns that survived the training-set cleanup,
## so the test set matches the data the model was fit on
predictors <- setdiff(names(train), "classe")
test <- test[, predictors]

dim(test)
```

##Writing Out Submission Documents
```{r echo =T, message =F, eval =F}
library(dplyr)

predictions <- predict(fit, newdata = test[,1:52])
print(predictions)
```
```{r eval =F}
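## write each prediction to its own text file for submission
pml_write_files <- function(x) {
  dir.create("./answers", showWarnings = FALSE)  ## make sure the output folder exists
  for (i in seq_along(x)) {
    filename <- paste0("./answers/problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}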

pml_write_files(predictions)
```