Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course, 2020/21.
Today we will introduce a model even more powerful than the Decision Tree for solving classification and regression problems: the Random Forest. We will rely on the R package {randomForest} to train Random Forests in both classification and regression contexts. In order to understand Random Forests we will introduce some important theoretical concepts: Bootstrap aggregating (a.k.a. Bagging), the Out-of-bag (OOB) error, and Feature Bagging (a.k.a. the random subspace method or attribute bagging). We will see how these approaches prevent overfitting when training a complex model like the Random Forest.
Following our discussion of Random Forests, we will get back to a single Decision Tree and discuss the basics of its algorithm step by step on a small data set, comparing our results with the results obtained from {rpart}.
# install.packages('randomForest')
Grab the HR_comma_sep.csv dataset from Kaggle and place it in your _data directory for this session. We will also use the Boston Housing dataset: BostonHousing.csv.
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(data.table)
library(randomForest)
library(ggrepel)
We begin by loading and inspecting the HR_comma_sep.csv
dataset:
dataSet <- read.csv(paste0(getwd(), "/_data/HR_comma_sep.csv"),
header = T,
check.names = T,
stringsAsFactors = F)
glimpse(dataSet)
Rows: 14,999
Columns: 10
$ satisfaction_level <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42, 0.45,…
$ last_evaluation <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53, 0.54,…
$ number_project <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2, 2, 2, 2, 4, 2, 5, 6, 2,…
$ average_montly_hours <int> 157, 262, 272, 223, 159, 153, 247, 259, 224, 142, 135, 305, 234, …
$ time_spend_company <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3, 3, 3, 3, 6, 3, 5, 4, 3,…
$ Work_accident <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ left <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ sales <chr> "sales", "sales", "sales", "sales", "sales", "sales", "sales", "s…
$ salary <chr> "low", "medium", "medium", "low", "low", "low", "low", "low", "lo…
The task is to predict the value of left - whether the employee has left the company or not - from the set of predictors shown in the glimpse() output above.
There are three important concepts to understand how a Random Forest builds upon Decision Trees:
Bootstrap aggregating (Bagging). We begin with some training set for our model, \(D\). Bagging generates \(m\) new training sets, \(D_i\), \(i = 1, 2, ..., m\), by randomly sampling from \(D\) uniformly and with replacement. The samples obtained in this way are known as bootstrap samples. In Random Forests, \(m\) simpler models - Decision Trees, precisely - are fitted, one for each bootstrap sample \(D_i\) (see the short R sketch following this list).
Out-of-bag (OOB) error. Each time a new bootstrap sample \(D_i\) is produced, some data points remain out of the bag, are not used in model training, and form the OOB set (the Out-of-bag set). The OOB error is a prediction error (remember this concept from our introduction to cross-validation?) which is computed from the OOB set, where the prediction is obtained by averaging the response (in regression) or as a majority vote (in classification) from all the trees in the forest that were not trained on that particular OOB instance.
The Random Subspace Method (Feature Bagging). The Random Subspace Method controls the complexity of the decision trees grown in a Random Forest: at each new split, only a randomly chosen subset of predictors is considered when searching for the optimal split. The Random Forests algorithm, as we will see, has a control parameter (mtry in {randomForest}) that determines how many features are randomly selected to produce a new split.
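Before fitting anything, here is a minimal sketch of Bagging and the OOB set in plain R - not how {randomForest} implements it internally, just an illustration using the dataSet loaded above (the names n, m, bootstrap_ix and oob_1 are ours):
set.seed(42)
n <- dim(dataSet)[1]
m <- 5                                     # - number of bootstrap samples
# - m bootstrap samples: sample row indices uniformly, with replacement
bootstrap_ix <- lapply(1:m, function(i) {
  sample(1:n, size = n, replace = TRUE)
})
# - the OOB set of the first bootstrap sample: rows that were never drawn into it
oob_1 <- setdiff(1:n, bootstrap_ix[[1]])
# - on average, roughly 1/e (about 37%) of the rows end up out-of-bag
length(oob_1)/n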
We will fit a Random Forest model to the HR_comma_sep.csv dataset with randomForest::randomForest() in R. In spite of all the precautionary measures built into the Random Forest model itself, we will still perform an additional outer k-fold cross-validation with 5 folds: better safe than sorry.
First, define the folds:
dataSet$ix <- sample(1:5, dim(dataSet)[1], replace = T)
table(dataSet$ix)
1 2 3 4 5
3116 2938 2982 2947 3016
Perform a 5-fold cross-validation for a set of Random Forests across the following control parameters:
ntree <- seq(250, 1000, by = 250)
mtry <- 3:(dim(dataSet)[2]-2)
# - mtry: we do not count `left` which is an outcome
# - and `ix` which is a fold index, so we have
# - dim(dataSet)[2]-2 predictors in the model.
# - start timer:
tstart <- Sys.time()
# - iterate:
# - lapply() across ntree
rfModels <- lapply(ntree, function(nt) {
# - lapply() across mtry
mtrycv <- lapply(mtry, function(mt) {
# - lapply() across folds
cv <- lapply(unique(dataSet$ix), function(fold) {
# - split training and test sets
testIx <- fold
trainIx <- setdiff(1:5, testIx)
trainSet <- dataSet %>%
dplyr::filter(ix %in% trainIx) %>%
dplyr::select(-ix)
testSet <- dataSet %>%
dplyr::filter(ix %in% testIx) %>%
dplyr::select(-ix)
# - `left` to factor for classification
# - w. randomForest()
trainSet$left <- as.factor(trainSet$left)
testSet$left <- as.factor(testSet$left)
# - Random Forest:
model <- randomForest::randomForest(formula = left ~ .,
data = trainSet,
ntree = nt,
mtry = mt
)
# - ROC analysis:
predictions <- predict(model,
newdata = testSet)
hit <- sum(ifelse(predictions == 1 & testSet$left == 1, 1, 0))
hit <- hit/sum(testSet$left == 1)
fa <- sum(ifelse(predictions == 1 & testSet$left == 0, 1, 0))
fa <- fa/sum(testSet$left == 0)
acc <- sum(predictions == testSet$left)
acc <- acc/length(testSet$left)
# - ROC output:
return(
data.frame(fold, hit, fa, acc)
)
})
# - collect results from all folds:
cv <- data.table::rbindlist(cv)
# - average ROC:
avg_roc <- data.frame(hit = mean(cv$hit),
fa = mean(cv$fa),
acc = mean(cv$acc),
mtry = mt)
return(avg_roc)
})
# - collect from all mtry values:
mtrycv <- rbindlist(mtrycv)
# - add ntree and out:
mtrycv$ntree <- nt
return(mtrycv)
})
# - collect all results
rfModels <- rbindlist(rfModels)
write.csv(rfModels,
paste0(getwd(), "/rfModels.csv"))
# - Report timing:
print(paste0("The estimation took: ",
difftime(Sys.time(), tstart, units = "mins"),
" minutes."))
[1] "The estimation took: 13.5653865853945 minutes."
print(rfModels)
Let’s inspect the CV results visually. Accuracy first:
rfModels$ntree <- factor(rfModels$ntree)
rfModels$mtry <- factor(rfModels$mtry)
ggplot(data = rfModels,
aes(x = mtry,
y = acc,
group = ntree,
color = ntree,
fill = ntree,
label = round(acc, 2))
) +
geom_path(size = .25) +
geom_point(size = 1.5) +
ggtitle("Random Forests CV: Accuracy") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
Hit rate:
ggplot(data = rfModels,
aes(x = mtry,
y = hit,
group = ntree,
color = ntree,
fill = ntree,
label = round(hit, 2))
) +
geom_path(size = .25) +
geom_point(size = 1.5) +
ggtitle("Random Forests CV: Hit Rate") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
False Alarm rate:
ggplot(data = rfModels,
aes(x = mtry,
y = fa,
group = ntree,
color = ntree,
fill = ntree,
label = round(fa, 2))
) +
geom_path(size = .25) +
geom_point(size = 1.5) +
ggtitle("Random Forests CV: FA Rate") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
And pick the best model based on the ROC analysis, maximizing the difference between the hit rate and the false alarm rate:
rfModels$diff <- rfModels$hit - rfModels$fa
w_best <- which.max(rfModels$diff)
rfModels[w_best, ]
And we can still refine the chosen model. First, refit it on the complete dataset with the selected control parameters and inspect variable importance:
dataSet <- dataSet %>%
dplyr::select(-ix)
dataSet$left <- factor(dataSet$left)
optimal_model <- randomForest::randomForest(formula = left ~ .,
data = dataSet,
ntree = 250,
mtry = 4)
optimal_model$importance
MeanDecreaseGini
satisfaction_level 2054.571142
last_evaluation 655.974267
number_project 937.381386
average_montly_hours 696.566033
time_spend_company 973.819607
Work_accident 15.990344
promotion_last_5years 2.813876
sales 60.474408
salary 32.918475
The cumulative OOB error up to the i-th tree can be found in the
first column of the optimal_model$err.rate
field:
head(optimal_model$err.rate)
OOB 0 1
[1,] 0.02296376 0.01938534 0.03422619
[2,] 0.02106197 0.01725903 0.03296703
[3,] 0.02157001 0.01664533 0.03711871
[4,] 0.02039531 0.01418807 0.04009201
[5,] 0.01891952 0.01280932 0.03843769
[6,] 0.01734679 0.01186805 0.03476969
Let’s take a closer look at it:
oobFrame <- as.data.frame(optimal_model$err.rate)
oobFrame$ntree <- 1:nrow(oobFrame)
ggplot(data = oobFrame,
aes(x = ntree,
y = OOB)) +
geom_path(size = .25) +
geom_point(size = 1.5) +
ggtitle("Cumulative OOB error from the CV optimal Random Forest model") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
We can observe how the cumulative OOB error stabilizes once a sufficient number of trees has been grown, and pick the number of trees that minimizes it:
w_tree <- which.min(oobFrame$OOB)
print(w_tree)
[1] 239
Finally, we retrain with the selected number of trees and check the ROC statistics on the training set itself. Note that these are in-sample predictions, so the results below are optimistic; the OOB error remains the more realistic estimate of out-of-sample performance:
optimal_model <- randomForest::randomForest(formula = left ~ .,
data = dataSet,
ntree = w_tree,
mtry = 4
)
predictions <- predict(optimal_model,
newdata = dataSet)
hit <- sum(ifelse(predictions == 1 & dataSet$left == 1, 1, 0))
hit <- hit/sum(dataSet$left == 1)
fa <- sum(ifelse(predictions == 1 & dataSet$left == 0, 1, 0))
fa <- fa/sum(dataSet$left == 0)
acc <- sum(predictions == dataSet$left)
acc <- acc/length(dataSet$left)
print(paste0("Accuracy: ", acc, "; Hit rate: ", hit, "; FA rate: ", fa))
[1] "Accuracy: 1; Hit rate: 1; FA rate: 0"
In {randomForest}, to obtain a Random Forest model for regression, simply leave the outcome as a numeric variable (i.e. do not convert it to a factor). We will use the Boston Housing dataset to demonstrate.
dataSet <- read.csv(paste0('_data/', 'BostonHousing.csv'),
header = T,
check.names = F,
stringsAsFactors = F)
head(dataSet)
The head() call above shows the variables. The medv variable - the median home value - is the outcome; the remaining columns serve as predictors.
Random Forest (no cross-validation this time):
rfRegModel <- randomForest::randomForest(formula = medv ~ .,
data = dataSet,
ntree = 1000,
mtry = 7
)
We can use the generic plot()
function to assess how the
MSE changes across ntree
:
plot(rfRegModel)
predictions <- predict(rfRegModel,
newdata = dataSet)
predictFrame <- data.frame(predicted_medv = predictions,
observed_medv = dataSet$medv)
ggplot(data = predictFrame,
aes(x = predicted_medv,
y = observed_medv)) +
geom_smooth(method = "lm", size = .25, color = "red") +
geom_point(size = 1.5, color = "black") +
geom_point(size = .75, color = "white") +
ggtitle("Random Forest in Regression: Observed vs Predicted\nBoston Housing Dataset") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
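As a rough check of the fit, we can compare the in-sample RMSE with the OOB-based estimate; for regression forests, {randomForest} stores the OOB mean squared error after each tree in the mse element of the fitted object (in_sample_rmse and oob_rmse below are our own names):
# - in-sample RMSE (optimistic, since the model has seen these data):
in_sample_rmse <- sqrt(mean((predictions - dataSet$medv)^2))
# - OOB RMSE with all trees grown:
oob_rmse <- sqrt(rfRegModel$mse[length(rfRegModel$mse)])
print(paste0("In-sample RMSE: ", round(in_sample_rmse, 3),
             "; OOB RMSE: ", round(oob_rmse, 3)))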
N.B. We are interested in the principles by which a split is decided in a Decision Tree. The following is neither a complete Decision Tree algorithm nor a fully correct implementation: for example, we do not deal with internal cross-validation across \(\alpha\) - the complexity parameter - at all, and for simplicity we consider only a binary (two-class) outcome. The aim is to demonstrate how Gini Impurity is used to compare candidate splits across both categorical and numerical features (in a regression tree, MSE would play the analogous role).
Step one: setup + load Titanic data.
### --- setup
library(rpart)
library(rpart.plot)
### --- Titanic dataset: TitanicSurvival from {carData}
library(carData)
data_set <- TitanicSurvival
data_set <- data_set[complete.cases(data_set), ]
head(data_set, 20)
Let’s now train a Decision Tree to predict survived
w.
{rpart}:
### --- decision tree model w. rpart
dec_tree <- rpart(survived ~ .,
data = data_set)
rpart.plot(dec_tree)
# - predictions
predictions <- predict(dec_tree,
newdata = data_set,
type = "prob")
predictions <- ifelse(predictions[, 1] <= predictions[, 2],
"yes",
"no")
# - elementary ROC analysis
acc <- sum(predictions == data_set$survived)/length(data_set$survived)
print(acc)
[1] 0.8049713
hit <- sum(predictions == "yes" & data_set$survived == "yes")/sum(data_set$survived == "yes")
print(hit)
[1] 0.5620609
fa <- sum(predictions == "yes" & data_set$survived == "no")/sum(data_set$survived == "no")
print(fa)
[1] 0.02746365
Q. How is a split decided? First, we develop a function to compute Gini Impurity and test it on a small nodes data.frame:
# - nodes data.frame to test with:
nodes <- data.frame(class_counts = c(142, 232))
print(nodes)
### --- First: a function to compute Gini Impurity
# - for classification trees
# - Gini Impurity
# - N.B. Implement Gini as
# - p1*(1-p1) + p2*(1-p2) = p1 - p1^2 + p2 - p2^2 = (p1 + p2) - (p1^2 + p2^2) = 1 - sum(p^2), since p1 + p2 = 1
gini_binary <- function(nodes) {
# - probabilities
p <- nodes$class_counts/sum(nodes$class_counts)
# - Gini Impurity
gini <- 1 - sum(p^2)
return(gini)
}
gini_binary(nodes)
[1] 0.4710458
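As a quick sanity check of the value above, with class counts 142 and 232 (so \(n = 374\)):
\[
p_1 = \frac{142}{374} \approx 0.380, \quad p_2 = \frac{232}{374} \approx 0.620, \quad
G = 1 - (p_1^2 + p_2^2) \approx 1 - (0.144 + 0.385) \approx 0.471.
\]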
Now the complicated part: a function, attribute_split()
,
to decide a split for both numerical and categorical predictors.
attribute_split <- function(params) {
# - input
# - the data:
data <- params[[1]]
# - the attribute (feature, predictor) that we wish to test
attribute <- params[[2]]
# - the outcome variable
target <- params[[3]]
# - type of attribute
att <- which(colnames(data) == attribute)
atype <- class(data[, att])
# - switch categorical and continuous
if (atype == "factor") {
# - CATEGORICAL
# - cardinality
acard <- length(unique(data[, att]))
# - possible splits
design <- combn(unique(data[, att]), 2)
design <- matrix(as.character(design),
ncol = acard)
# - If cardinality == 2
if (acard == 2) {
# - left
wd <- which(data[, att] == design[, 1])
d <- data[wd, target]
n1 <- length(d)
td <- as.numeric(table(d))
td <- data.frame(class_counts = td)
gini_split_1 <- gini_binary(td)
# - right
wd <- which(data[, att] == design[, 2])
d <- data[wd, target]
n2 <- length(d)
td <- as.numeric(table(d))
td <- data.frame(class_counts = td)
gini_split_2 <- gini_binary(td)
# - Gini
n <- n1 + n2
g <- n1/n*gini_split_1 + n2/n*gini_split_2
# - output
return(list(splits = design,
gini = g))
# - If cardinality > 2
} else {
# - gini for all possible splits
g <- apply(design, 2, function(x) {
# - left
wd <- which(data[, att] %in% x)
d <- data[wd, target]
n1 <- length(d)
td <- as.numeric(table(d))
td <- data.frame(class_counts = td)
gini_split_1 <- gini_binary(td)
# - right
wd <- which(!(data[, att] %in% x))
d <- data[wd, target]
n2 <- length(d)
td <- as.numeric(table(d))
td <- data.frame(class_counts = td)
gini_split_2 <- gini_binary(td)
# - Gini
n <- n1 + n2
return(n1/n*gini_split_1 + n2/n*gini_split_2)
})
# - output
return(list(splits = design,
gini = g))
}
} else {
# - CONTINUOUS
# - note: the outcome is still categorical, so we still use
# - Gini Impurity here; candidate split points are found by a
# - simple grid search (MSE would be the criterion if the
# - outcome were continuous, i.e. in a regression tree)
# - lower and upper bounds for the grid
lower_b <- min(data[, att]) + 1
upper_b <- max(data[, att]) - 1
# - objective
min_gini <- function(split) {
# - right
wd <- which(data[, att] > split)
d <- data[wd, target]
n1 <- length(d)
td <- as.numeric(table(d))
td <- data.frame(class_counts = td)
gini_split_1 <- gini_binary(td)
# - left
wd <- which(data[, att] <= split)
d <- data[wd, target]
n2 <- length(d)
td <- as.numeric(table(d))
td <- data.frame(class_counts = td)
gini_split_2 <- gini_binary(td)
n <- n1 + n2
return(n1/n*gini_split_1 + n2/n*gini_split_2)
}
# - optimization: grid search over candidate split points
split_grid <- seq(lower_b, upper_b, by = .1)
ginies <- sapply(split_grid, min_gini)
# - output
return(list(split = split_grid[which.min(ginies)[1]],
gini = ginies[which.min(ginies)]))
}
}
Test: first split.
# - first split:
params <- list(data_set, "sex", "survived")
attribute_split(params)
$splits
[,1] [,2]
[1,] "female" "male"
$gini
[1] 0.3433076
params <- list(data_set, "passengerClass", "survived")
attribute_split(params)
$splits
[,1] [,2] [,3]
[1,] "1st" "1st" "2nd"
[2,] "2nd" "3rd" "3rd"
$gini
[1] 0.4435625 0.4824558 0.4440288
params <- list(data_set, "age", "survived")
attribute_split(params)
$split
[1] 8.0667
$gini
[1] 0.4752871
Split at sex == "male"
:
# - split at `sex == "male"`:
params <- list(data_set[data_set$sex=="male", ], "age", "survived")
attribute_split(params)
$split
[1] 9.0333
$gini
[1] 0.3063536
Split at sex == "female"
# - split at sex == "female"
params <- list(data_set[data_set$sex=="female", ], "age", "survived")
attribute_split(params)
$split
[1] 30.5667
$gini
[1] 0.3648965
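To compare these hand-computed splits with what {rpart} actually chose, we can simply print the tree fitted earlier (dec_tree is still in memory from the {rpart} step above):
# - the splits chosen by rpart for the same data, for comparison:
print(dec_tree)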
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.