Set-up

set.seed(12345)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

Introduction

Thanks to advances in technology, it is now possible to collect data about physical activity using wearable devices such as the Jawbone Up, Nike FuelBand, and Fitbit. The aim of this project is to use such data to predict how well a physical activity is performed.

Data

The dataset used in this analysis contains measurements from 6 individuals who performed barbell lifts correctly and incorrectly in 5 different ways. For a detailed explanation of the dataset, please see http://groupware.les.inf.puc-rio.br/har. The dataset, already split into training and test sets, was downloaded from the course website.

trainDf <- read.csv("pml-training.csv")
testDf <- read.csv("pml-testing.csv")
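
Note that in the raw CSV files missing values are reportedly encoded not only as NA but also as empty strings and the Excel artifact "#DIV/0!". If that is the case, an alternative read (not used in this run) makes the NA handling below more complete:

trainDf <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testDf <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))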

The training dataset has 19622 observations of 160 variables, many of which contain a large share of NA values.

The test dataset has 20 observations of the same 160 variables and likewise many NA values.
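
One quick way to verify the extent of the missingness (not part of the original output) is to look at the per-column NA proportions:

naShare <- colMeans(is.na(trainDf)) # proportion of NA values per column
table(round(naShare, 2))            # columns tend to be either complete or almost entirely NA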

At first glance, the first seven variables are identifiers rather than sensor measurements and are irrelevant to the analysis, so they are removed.
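
Before dropping them, the columns in question can be listed; identifiers such as the row index, user name, timestamps and window markers are expected here:

head(names(trainDf), 7) # the seven metadata columns about to be removed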

trainDf <- trainDf[,8:length(trainDf)]

Variables with a high proportion of NA values are removed from the dataset.

# see the appendix for the function removeNAs()
trainDf <- removeNAs(trainDf, 0.90)

As a rule of thumb, variables with near-zero variance have little predictive power and should be removed from the dataset.
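
To see which variables would be flagged and why, nearZeroVar can also return per-variable diagnostics (not shown in the original run):

nzvMetrics <- nearZeroVar(trainDf, saveMetrics = TRUE) # freqRatio, percentUnique, zeroVar, nzv per column
head(nzvMetrics[nzvMetrics$nzv, ])                     # the variables flagged as near-zero variance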

nzv <- nearZeroVar(trainDf)
trainDf <- trainDf[, -nzv]

The trainDf dataset is partitioned into training and validation sets, which makes it possible to estimate the out-of-sample error.

inTrain <- createDataPartition(y = trainDf$classe, p = 0.7, list = FALSE)
training <- trainDf[inTrain, ]
validation <- trainDf[-inTrain, ]
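
Since createDataPartition samples within each level of classe, the class distribution should be roughly the same in both splits; a quick check (not part of the original output):

round(prop.table(table(training$classe)), 3)   # class shares in the training split
round(prop.table(table(validation$classe)), 3) # should closely match the shares above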

Modelling

Predicting the type of physical activity is a classification problem. Random Forest is a common choice for such problems: it handles a large number of correlated predictors well and requires little tuning.

The caret package provides a single interface, the train function, for constructing models with a chosen method, here Random Forest (method = "rf"). Resampling settings are passed via trainControl, in this case cross-validation (method = "cv") with 3 folds. In short, a Random Forest model is fit with 3-fold cross-validation.

settingsRF <- trainControl(method="cv", 3) # RF settings: 3-fold cross-validation
modelRF <- train(classe ~ ., data=training, method="rf", trControl=settingsRF, ntree=250)
modelRF
## Random Forest 
## 
## 13737 samples
##    52 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 9158, 9158, 9158 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9884254  0.9853565
##   27    0.9887166  0.9857257
##   52    0.9844216  0.9802893
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
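
Although not included in the original output, caret's varImp function shows which sensor variables the forest relies on most:

varImp(modelRF)                 # predictors ranked by random forest importance
plot(varImp(modelRF), top = 10) # plot the ten most important predictors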

The validation dataset, together with the predict and confusionMatrix functions, is used to estimate the out-of-sample error of the model.

predRF <- predict(modelRF, newdata = validation)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
confusionMatrix(validation$classe, predRF)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    0    0    0    1
##          B    9 1127    3    0    0
##          C    0    5 1017    4    0
##          D    0    0   18  946    0
##          E    0    1    1    3 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9924          
##                  95% CI : (0.9898, 0.9944)
##     No Information Rate : 0.2858          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9903          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9946   0.9947   0.9788   0.9927   0.9991
## Specificity            0.9998   0.9975   0.9981   0.9964   0.9990
## Pos Pred Value         0.9994   0.9895   0.9912   0.9813   0.9954
## Neg Pred Value         0.9979   0.9987   0.9955   0.9986   0.9998
## Prevalence             0.2858   0.1925   0.1766   0.1619   0.1832
## Detection Rate         0.2843   0.1915   0.1728   0.1607   0.1830
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9972   0.9961   0.9885   0.9945   0.9990

The out-of-sample error of the model is 0.0076, i.e. 1 minus the 0.9924 validation accuracy. This error rate is acceptable, so the model is used for the final predictions.
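
For transparency, the same figure can be computed directly from the predictions:

oosError <- mean(predRF != validation$classe) # misclassification rate on the validation set
oosError                                      # equals 1 - 0.9924 = 0.0076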

Prediction

The model built above is used for predicting the type of activity using the test dataset testDf provided:

predictions <- predict(modelRF, testDf)
predictions
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

These predictions were submitted to the course website (Course Project Prediction Quiz) and were 100% accurate.

Conclusions

In this analysis, the manner (correct vs incorrect) in which a physical activity is performed is estimated from a dataset built specifically for this purpose. A Random Forest model with a 0.76% out-of-sample error was used to predict 20 test cases, resulting in 100% (20 out of 20) accuracy. Given these results, this kind of human activity can be said to be accurately predicted with a Random Forest model.

Appendix

removeNAs <- function(dataframe, cutoff) {
    tempDf <- dataframe
    # for every column in the dataset
    for (i in seq_along(dataframe)) {
        # if the proportion of NA values is at or above the cutoff
        if (sum(is.na(dataframe[, i])) / nrow(dataframe) >= cutoff) {
            # drop the offending column by name
            varName <- names(dataframe)[i]
            tempDf <- tempDf[, -which(names(tempDf) == varName)]
        }
    }
    # return the reduced data frame
    tempDf
}
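
For reference, the same filtering can be expressed as a vectorized one-liner; removeNAs2 is just an illustrative name, not used in the analysis above:

# equivalent, vectorized version of removeNAs()
removeNAs2 <- function(dataframe, cutoff) {
    # keep only the columns whose NA share is below the cutoff
    dataframe[, colMeans(is.na(dataframe)) < cutoff]
}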