set.seed(12345)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
Thanks to advancing technology, it is now possible to collect data about physical activity using wearable devices such as Jawbone Up, Nike FuelBand, and Fitbit. The aim of this project is to use physical activity information to predict how well these activities are performed.
The dataset used in this analysis contains information about 6 individuals who performed barbell lifts correctly and incorrectly in 5 different ways. For a detailed explanation of the dataset, please see http://groupware.les.inf.puc-rio.br/har. The dataset, already split into training and test sets, was downloaded from the course website.
trainDf <- read.csv("pml-training.csv")
testDf <- read.csv("pml-testing.csv")
The training dataset has 19622 observations of 160 variables and contains many NA values. The test dataset has 20 observations of 160 variables and likewise contains many NA values.
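For reference, these figures can be verified with a quick check along the following lines (a small sketch, not part of the original analysis; the counts quoted above come from the original run):
dim(trainDf)                          # 19622 rows, 160 columns
dim(testDf)                           # 20 rows, 160 columns
sum(colSums(is.na(trainDf)) > 0)      # number of columns containing NA values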
At first glance, the first seven variables (identifiers and timestamps) are irrelevant to the analysis and are therefore removed.
trainDf <- trainDf[,8:length(trainDf)]
Variables with a high proportion of NA values are removed from the dataset.
# see the appendix for the function removeNAs()
trainDf <- removeNAs(trainDf, 0.90)
As a rule of thumb, variables with near-zero variance, and hence little predictive power, should also be removed from the dataset.
nzv <- nearZeroVar(trainDf)
trainDf <- trainDf[, -nzv]
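After these two cleaning steps, the training data should be left with the 52 predictors plus the classe outcome reported in the model summary below; this can be confirmed with, for example:
dim(trainDf)    # expected: 19622 rows, 53 columns (52 predictors + classe)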
The trainDf dataset is partitioned into training and validation sets. This makes it possible to estimate the out-of-sample error of the model.
inTrain <- createDataPartition(y = trainDf$classe, p = 0.7, list = FALSE)
training <- trainDf[inTrain, ]
validation <- trainDf[-inTrain, ]
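As a sanity check (a small sketch, not part of the original report), the sizes of the two partitions should reflect the 70/30 split and match the model summary and confusion matrix totals shown below:
dim(training)     # expected: 13737 observations
dim(validation)   # expected: 5885 observations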
Predicting the type of physical activity is a classification problem, and Random Forest is a common and robust choice for this kind of task.
The caret package provides an interface, the train function, for constructing models with a method of interest, in this case Random Forest (method = "rf"). Using trainControl, settings can be passed to the train function, here cross validation (method = "cv") with 3 folds. In short, a Random Forest model with 3-fold cross validation is fitted.
settingsRF <- trainControl(method="cv", 3) # RF settings: 3-fold cross validation
modelRF <- train(classe ~ ., data=training, method="rf", trControl=settingsRF, ntree=250)
modelRF
## Random Forest
##
## 13737 samples
## 52 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 9158, 9158, 9158
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9884254 0.9853565
## 27 0.9887166 0.9857257
## 52 0.9844216 0.9802893
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
The validation dataset, in combination with the predict and confusionMatrix functions, is used to determine the out-of-sample error of the model.
predRF <- predict(modelRF, newdata = validation)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
confusionMatrix(validation$classe, predRF)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 0 0 0 1
## B 9 1127 3 0 0
## C 0 5 1017 4 0
## D 0 0 18 946 0
## E 0 1 1 3 1077
##
## Overall Statistics
##
## Accuracy : 0.9924
## 95% CI : (0.9898, 0.9944)
## No Information Rate : 0.2858
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9903
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9946 0.9947 0.9788 0.9927 0.9991
## Specificity 0.9998 0.9975 0.9981 0.9964 0.9990
## Pos Pred Value 0.9994 0.9895 0.9912 0.9813 0.9954
## Neg Pred Value 0.9979 0.9987 0.9955 0.9986 0.9998
## Prevalence 0.2858 0.1925 0.1766 0.1619 0.1832
## Detection Rate 0.2843 0.1915 0.1728 0.1607 0.1830
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9972 0.9961 0.9885 0.9945 0.9990
The out-of-sample error of the model is approximately 0.0076 (1 minus the validation accuracy of 0.9924). This error rate is acceptable, so the model is used for the final predictions.
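The same figure can be extracted directly from the confusion matrix object; a minimal sketch, using a temporary object (here called cm) and caret's overall-statistics slot, would be:
cm <- confusionMatrix(validation$classe, predRF)
1 - cm$overall["Accuracy"]    # out-of-sample error, approximately 0.0076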
The model built above is used to predict the type of activity for the provided test dataset testDf:
predictions <- predict(modelRF, testDf)
predictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These predictions were submitted to the course website (Course Project Prediction Quiz) and were 100% accurate.
In this analysis, the manner in which a physical activity is performed (correct vs incorrect execution) is estimated from a dataset built specifically for this purpose. A Random Forest model with an out-of-sample error of 0.76% was used to predict 20 test cases, resulting in 100% (20 out of 20) accuracy. Given these results, human activity can be said to be accurately predicted with a Random Forest model.
removeNAs <- function(dataframe, cutoff) {
    tempDf <- dataframe
    # for every column in the original dataset
    for (i in 1:length(dataframe)) {
        # if the proportion of NA values is at or above the cutoff,
        # drop that column (matched by name) from the working copy
        if (sum(is.na(dataframe[, i])) / nrow(dataframe) >= cutoff) {
            varName <- names(dataframe)[i]
            tempDf <- tempDf[, -which(names(tempDf) == varName)]
        }
    }
    # return the cleaned data frame
    tempDf
}