Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
We continue our exploration of Generalized Linear Models (GLMs) by generalizing the Binomial Logistic Regression even further to solve for classification in multiple categories. As we introduce the resulting model of Multinomial Logistic Regression we also revisit the theory of the Maximum Likelihood Estimate (MLE). Then we introduce the Receiver-Operating Characteristic (ROC) Analysis for classification problems as an important tool in model selection: the topic to be explored in detail in our following sessions.
Setup: install {nnet} for Multinomial Logistic Regression and {ggrepel} for labeling points in plots.
install.packages('nnet')
install.packages('ggrepel')
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(data.table)
library(nnet)
library(car)
library(ggrepel)
The Multinomial Logistic Regression model is a powerful classification tool. Consider a problem where some outcome variable can result in more than two discrete outcomes. For example, a customer visiting a webshop can end their visit (a) buying nothing, (b) buying some Product A, or (c) Product B, or (d) Product C, etc. If we have some information about a particular customer’s journey through the website (e.g. how much time they spent on particular pages, whether they visited the webshop before, or any other information that customers might have chosen to disclose on sign-up…), we can use it as a set of predictors of customer behavior resulting in any of the outcomes (a) through (d). We do that by means of a simple extension of the Binomial Logistic Regression that is used to solve for dichotomies: enter the Multinomial Logistic Regression model.
First, similar to what happens in dummy coding, given a set of \(K\) possible outcomes we choose one of them as a baseline. All results of the Multinomial Logistic Regression will then be interpreted as effects relative to that baseline outcome category, for example: for a unit increase in predictor \(X_1\), what is the change in odds of switching from (a) buying nothing to (b) buying Product A? We are already familiar with this logic, right?
So, consider a set of \(K-1\) independent Binary Logistic Models with only one predictor \(X\) where the baseline is now referred to as \(K\):
\[log\frac{P(Y_i = 1)}{P(Y_i = K)} = \beta_1X_i\]
\[log\frac{P(Y_i = 2)}{P(Y_i = K)} = \beta_2X_i\]
\[log\frac{P(Y_i = K-1)}{P(Y_i = K)} = \beta_{K-1}X_i\] N.B. The intercept \(\beta_0\) is not included for reasons of simplicity.
Obviously, we are introducing a new regression coefficient \(\beta_k\) for each possible value of the outcome \(k = 1, 2,.., K-1\). The log-odds are on the LHS while the linear model remains on the RHS.
Now we exponentiate the equations to arrive at the expressions for odds:
\[\frac{P(Y_i = 1)}{P(Y_i = K)} = e^{\beta_1X_i}\]
\[\frac{P(Y_i = 2)}{P(Y_i = K)} = e^{\beta_2X_i}\]
\[\frac{P(Y_i = K-1)}{P(Y_i = K)} = e^{\beta_{K-1}X_i}\]
And solve for \(P(Y_i = 1), P(Y_i = 2),.. P(Y_i = K-1)\):
\[P(Y_i = 1) = P(Y_i = K)e^{\beta_1X_i}\]
\[P(Y_i = 2) = P(Y_i = K)e^{\beta_2X_i}\]
\[P(Y_i = K-1) = P(Y_i = K)e^{\beta_{K-1}X_i}\]
All \(K\) probabilities \(P(Y_i = 1), P(Y_i = 2), .., P(Y_i = K)\) must sum to one; substituting the expressions above into that constraint gives \(P(Y_i = K)\left(1 + \sum_{k=1}^{K-1}e^{\beta_kX_i}\right) = 1\), so that
\[P(Y_i = K) = \frac{1}{1+\sum_{k=1}^{K-1}e^{\beta_kX_i}}\]
and then it is easy to derive the expressions for all \(K-1\) probabilities of the outcome resulting in a particular class:
\[P(Y_i = 1) = \frac{e^{\beta_1X_i}}{1+\sum_{k=1}^{K-1}e^{\beta_kX_i}}\]
\[P(Y_i = 2) = \frac{e^{\beta_2X_i}}{1+\sum_{k=1}^{K-1}e^{\beta_kX_i}}\]
\[P(Y_i = K-1) = \frac{e^{\beta_{K-1}X_i}}{1+\sum_{k=1}^{K-1}e^{\beta_kX_i}}\]
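To make these formulas concrete, here is a minimal sketch in R for \(K = 3\) classes and a single predictor; the coefficients b1 and b2 and the predictor value x are made up purely for illustration:
b1 <- .75  # - illustrative coefficient for class 1 vs the baseline
b2 <- -1.2 # - illustrative coefficient for class 2 vs the baseline
x <- 1.5   # - a single observed value of the predictor X
# - the baseline class K = 3 contributes exp(0) = 1 to the denominator
denominator <- 1 + exp(b1*x) + exp(b2*x)
p <- c(exp(b1*x), exp(b2*x), 1)/denominator
print(p)
# - the three probabilities sum to one, as they must
sum(p)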
Let’s begin with a simple example: classify iris$Species.
head(iris)
Now check if iris$Species is a factor:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Let’s pick versicolor as our baseline:
iris$Species <- relevel(iris$Species,
ref = 'versicolor')
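A quick check that versicolor is now the reference (i.e. first) level:
levels(iris$Species)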
Finally, we will use multinom() from {nnet} to estimate the Multinomial Logistic Regression model:
mlr_model <- nnet::multinom(Species ~ .,
data = iris)
# weights: 18 (10 variable)
initial value 164.791843
iter 10 value 8.858726
iter 20 value 6.109855
iter 30 value 5.978051
iter 40 value 5.960110
iter 50 value 5.949916
iter 60 value 5.949625
final value 5.949587
converged
summary(mlr_model)
Call:
nnet::multinom(formula = Species ~ ., data = iris)
Coefficients:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa -0.4917771 2.205180 6.313709 -10.370916 -5.135036
virginica -42.3514649 -2.471389 -6.655958 9.391305 18.207938
Std. Errors:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 42.21659 100.974966 162.889191 100.537988 48.061685
virginica 25.52521 2.394038 4.461874 4.703818 9.688497
Residual Deviance: 11.89917
AIC: 31.89917
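A quick cross-check: the Residual Deviance reported above is exactly twice the final value of the negative log-likelihood (5.949587) that the optimizer printed while fitting, and it is stored in the model object:
# - residual deviance = 2 * final negative log-likelihood
mlr_model$deviance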
The interpretation of the model results is similar to the Binomial Logistic Regression case. We will discuss the findings on iris$Species in the Session.
Grab the model coefficients:
coeffs <- as.data.frame(
summary(mlr_model)$coeff
)
print(coeffs)
And do not forget the exp(coefficient) thing to get back to the odds scale…
exp(coeffs)
What are the most influential predictors in categorizing setosa and virginica vs versicolor?
which.max(exp(coeffs)[1, ])
Sepal.Width
3
which.max(exp(coeffs)[2, ])
Petal.Width
5
And we have learned that we can tell setosa from versicolor by Sepal.Width, and virginica from versicolor by Petal.Width! N.B. As ever, stop before jumping to conclusions… It is a bit more complicated than that, I assure you (hint: take another look at the huge standard errors of the setosa coefficients in the model summary above).
Q: How well does the model perform overall? We will first derive the model predictions from the mlr_model model object in R:
predictions <- predict(mlr_model,
newdata = iris,
"probs")
predictions <- apply(predictions, 1, which.max)
table(predictions)
predictions
1 2 3
50 50 50
Store predictions in iris and then compute the model accuracy:
classes <- levels(iris$Species)
predictions <- classes[predictions]
iris$Prediction <- predictions
head(iris)
accuracy <- round(
sum(iris$Species == iris$Prediction)/dim(iris)[1]*100,
2)
print(paste0("The overal model accuracy is ", accuracy, "%."))
[1] "The overal model accuracy is 98.67%."
Predicted probabilities can also be obtained easily from fitted():
fittedProbs <- fitted(mlr_model)
head(fittedProbs)
versicolor setosa virginica
1 3.047579e-08 1.0000000 6.185179e-35
2 1.113008e-06 0.9999989 1.032478e-31
3 1.734736e-07 0.9999998 2.724517e-33
4 3.236004e-06 0.9999968 8.282840e-31
5 2.020802e-08 1.0000000 2.698905e-35
6 7.889966e-08 0.9999999 3.398755e-33
Of course:
table(rowSums(fittedProbs))
1
150
Important. Accuracy in itself does not tell the full story of how a given model performs. Moreover, from accuracy alone one can be led to completely incorrect and misleading conclusions about model performance. We need to introduce a set of more granular measures of the quality of a categorization. Consider the following four cases for any class of interest: a True Positive, where the observation belongs to the class and the model predicts that class; a False Negative, where the observation belongs to the class but the model predicts another; a False Positive, where the observation does not belong to the class but the model predicts it anyway; and a True Negative, where the observation does not belong to the class and the model does not predict it.
This is the Receiver-Operating Characteristic (ROC) Analysis of model performance. Compute the True Positive Rate (TPR), False Negative Rate (FNR), False Positive Rate (FPR), and True Negative Rate (TNR) for all classes:
mlr_model_ROC <- lapply(levels(iris$Species), function(x) {
  # - True Positive Rate (Hit Rate) for class x
  tpr <- sum(iris$Species == x & iris$Prediction == x)
  tpr <- tpr/sum(iris$Species == x)
  # - False Negative Rate (Miss Rate)
  fnr <- sum(iris$Species == x & iris$Prediction != x)
  fnr <- fnr/sum(iris$Species == x)
  # - False Positive Rate (False Alarm Rate)
  fpr <- sum(iris$Species != x & iris$Prediction == x)
  fpr <- fpr/sum(iris$Species != x)
  # - True Negative Rate (Correct Rejection Rate)
  tnr <- sum(iris$Species != x & iris$Prediction != x)
  tnr <- tnr/sum(iris$Species != x)
  roc <- c(tpr, fnr, fpr, tnr)
  names(roc) <- c('tpr', 'fnr', 'fpr', 'tnr')
  return(roc)
})
mlr_model_ROC <- reduce(mlr_model_ROC, rbind)
rownames(mlr_model_ROC) <- levels(iris$Species)
print(mlr_model_ROC)
tpr fnr fpr tnr
versicolor 0.98 0.02 0.01 0.99
setosa 1.00 0.00 0.00 1.00
virginica 0.98 0.02 0.01 0.99
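The counts behind these rates can be read off a confusion matrix of observed against predicted classes:
# - the four cases at a glance: observed vs predicted classes
table(observed = iris$Species, predicted = iris$Prediction)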
A good model avoids classification errors in the following sense: it maintains a high True Positive (Hit) Rate and a low False Positive (False Alarm, Type I Error) Rate across the classes.
We now need to briefly revisit the Binomial Logistic Regression example from Session 16.
dataSet <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(dataSet)
dataSet$rank <- factor(dataSet$rank)
mylogit <- glm(admit ~ gre + gpa + rank,
data = dataSet,
family = "binomial")
modelsummary <- summary(mylogit)
print(modelsummary)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = dataSet)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
Ok. Now we will store the model’s predicted probabilities alongside the data:
dataSet$predictions <- fitted(mylogit)
head(dataSet)
Now for the function. We will write a predictClass() function that takes two arguments: decBoundary - the decision boundary, and decData - a data.frame with two columns: observed, the observed class (either 1 or 0), and predicted, the predicted probability of being 1. The predictClass() function will automatically perform the ROC analysis following the class prediction.
predictClass <- function(decBoundary, decData) {
# - predict class
predClass <- ifelse(decData$predicted >= decBoundary, 1, 0)
# - ROC analysis
tpr <- sum(decData$observed == 1 & predClass == 1)
tpr <- tpr/sum(decData$observed == 1)
fnr <- sum(decData$observed == 1 & predClass == 0)
fnr <- fnr/sum(decData$observed == 1)
fpr <- sum(decData$observed == 0 & predClass == 1)
fpr <- fpr/sum(decData$observed == 0)
tnr <- sum(decData$observed == 0 & predClass == 0)
tnr <- tnr/sum(decData$observed == 0)
roc <- c(decBoundary, tpr, fnr, fpr, tnr)
names(roc) <- c('decBoundary', 'tpr', 'fnr', 'fpr', 'tnr')
return(roc)
}
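For instance, a single run with the conventional decision boundary of .5 looks like this:
# - classify at the .5 boundary and report the ROC measures
predictClass(.5,
             data.frame(observed = dataSet$admit,
                        predicted = dataSet$predictions))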
Now we will inspect the ROC analysis for this Binomial Logistic Regression model across a range of decision boundaries:
decBoundaries <- seq(.001, .999, by = .001)
ROC_dataset <- data.frame(observed = dataSet$admit,
predicted = dataSet$predictions)
ROC_results <- lapply(decBoundaries, function(x) {
as.data.frame(t(predictClass(x, ROC_dataset)))
})
ROC_results <- rbindlist(ROC_results)
head(ROC_results)
We now plot the False Positive Rate (fpr, False Alarm) vs the True Positive Rate (tpr, Hit):
ROC_results$diff <- ROC_results$tpr - ROC_results$fpr
ROC_results$label <- ""
ROC_results$label[which.max(ROC_results$diff)] <- "Here!"
ggplot(data = ROC_results,
aes(x = fpr,
y = tpr,
label = label)) +
geom_path(color = "red") + geom_abline(intercept = 0, slope = 1) +
geom_text_repel(arrow = arrow(length = unit(0.06, "inches"),
ends = "last",
type = "closed"),
min.segment.length = unit(0, 'lines'),
nudge_y = .1) +
ggtitle("ROC analysis for the Binomial Regression Model") +
xlab("Specificity (False Alarm Rate)") + ylab("Sensitivity (Hit Rate)") +
theme_bw() +
theme(plot.title = element_text(hjust = .5))
The optimal value of the decision boundary (sometimes called the cut-off) is the one at which the model achieves the highest possible Hit Rate together with the lowest possible False Alarm Rate.
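We can extract that boundary directly from ROC_results; the tpr - fpr difference that we maximized is known as Youden’s J statistic:
# - the decision boundary that maximizes tpr - fpr (Youden's J)
ROC_results$decBoundary[which.max(ROC_results$diff)]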
This analysis can be generalized to multiclass classification problems in different ways, but that definitely goes beyond the scope of an introductory course…
Recall from Binomial Logistic Regression that the Wald \(Z\)-test for a regression coefficient is obtained by dividing the estimate by its respective standard error:
summary(mlr_model)$coefficients
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa -0.4917771 2.205180 6.313709 -10.370916 -5.135036
virginica -42.3514649 -2.471389 -6.655958 9.391305 18.207938
summary(mlr_model)$standard.errors
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 42.21659 100.974966 162.889191 100.537988 48.061685
virginica 25.52521 2.394038 4.461874 4.703818 9.688497
Then we can obtain the \(Z\)-tests in the following way:
z <- summary(mlr_model)$coefficients/summary(mlr_model)$standard.errors
print(z)
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa -0.01164891 0.02183888 0.03876076 -0.1031542 -0.1068426
virginica -1.65920119 -1.03230969 -1.49174051 1.9965279 1.8793358
And because the \(Z\)-test statistic follows a standard normal distribution, the respective probabilities of the Type I Error (i.e. statistical significance) are:
p <- (1 - pnorm(abs(z), 0, 1)) * 2
print(p)
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 0.99070573 0.9825765 0.9690811 0.91784059 0.91491384
virginica 0.09707526 0.3019271 0.1357672 0.04587649 0.06019866
However, using \(Z\)-tests in Multinomial Logistic Regression is not recommended. Instead, we can use MASS::dropterm() to perform Likelihood-Ratio Tests on the model coefficients:
lrt <- MASS::dropterm(object = mlr_model,
                      scope = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                      test = 'Chisq',
                      trace = F, # - whether to print additional information
                      k = 2) # - penalty used for the Akaike Information Criterion (AIC)
# weights: 15 (8 variable)
initial value 164.791843
iter 10 value 19.997212
iter 20 value 6.933728
iter 30 value 6.659047
iter 40 value 6.646751
iter 50 value 6.641402
iter 60 value 6.637619
final value 6.633941
converged
# weights: 15 (8 variable)
initial value 164.791843
iter 10 value 12.235903
iter 20 value 7.983247
iter 30 value 7.750879
iter 40 value 7.746346
final value 7.746251
converged
# weights: 15 (8 variable)
initial value 164.791843
iter 10 value 41.301109
iter 20 value 16.748468
iter 30 value 14.595863
iter 40 value 12.950927
final value 12.950889
converged
# weights: 15 (8 variable)
initial value 164.791843
iter 10 value 15.825451
iter 20 value 12.188082
iter 30 value 11.888231
final value 11.885927
converged
as.data.frame(lrt)
The dropterm() function fits all models that differ from the full model by dropping a single predictor, and compares each reduced model against the full model. The results of this procedure suggest that we could drop both Sepal.Length and Sepal.Width from the model; let’s try it out:
data(iris)
iris$Species <- relevel(iris$Species,
ref = 'versicolor')
mlr_model_drop <- multinom(Species ~ Petal.Length + Petal.Width,
data = iris)
# weights: 12 (6 variable)
initial value 164.791843
iter 10 value 13.836346
iter 20 value 10.411183
iter 30 value 10.285364
iter 40 value 10.284340
iter 50 value 10.284134
iter 60 value 10.283943
final value 10.283879
converged
summary(mlr_model_drop)
Call:
multinom(formula = Species ~ Petal.Length + Petal.Width, data = iris)
Coefficients:
(Intercept) Petal.Length Petal.Width
setosa 27.34100 -8.761018 -7.64127
virginica -45.27185 5.754243 10.44731
Std. Errors:
(Intercept) Petal.Length Petal.Width
setosa 99.26516 80.219938 157.926079
virginica 13.61140 2.305799 3.755773
Residual Deviance: 20.56776
AIC: 32.56776
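We can put the two \(AIC\) values side by side, assuming the full model object mlr_model is still in the workspace:
# - AIC of the full vs. the reduced model
c(full = AIC(mlr_model), reduced = AIC(mlr_model_drop))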
The \(AIC\) of the model with all predictors included was 31.89917, and following the removal of Sepal.Length and Sepal.Width it increased only slightly, to 32.56776. Let’s perform additional checks:
z <- summary(mlr_model_drop)$coefficients/summary(mlr_model_drop)$standard.errors
print(z)
(Intercept) Petal.Length Petal.Width
setosa 0.275434 -0.1092125 -0.0483851
virginica -3.326025 2.4955534 2.7816679
p <- (1 - pnorm(abs(z), 0, 1)) * 2
print(p)
(Intercept) Petal.Length Petal.Width
setosa 0.7829828275 0.91303397 0.961409330
virginica 0.0008809408 0.01257608 0.005408034
predictions <- predict(mlr_model_drop,
newdata = iris,
"probs")
predictions <- apply(predictions, 1, which.max)
classes <- levels(iris$Species)
predictions <- classes[predictions]
iris$Prediction <- predictions
head(iris)
mlr_model_ROC <- lapply(levels(iris$Species), function(x) {
tpr <- sum(iris$Species == x & iris$Prediction == x)
tpr <- tpr/sum(iris$Species == x)
fnr <- sum(iris$Species == x & iris$Prediction != x)
fnr <- fnr/sum(iris$Species == x)
fpr <- sum(iris$Species != x & iris$Prediction == x)
fpr <- fpr/sum(iris$Species != x)
tnr <- sum(iris$Species != x & iris$Prediction != x)
tnr <- tnr/sum(iris$Species != x)
roc <- c(tpr, fnr, fpr, tnr)
names(roc) <- c('tpr', 'fnr', 'fpr', 'tnr')
return(roc)
})
mlr_model_ROC <- reduce(mlr_model_ROC, rbind)
rownames(mlr_model_ROC) <- levels(iris$Species)
print(mlr_model_ROC)
tpr fnr fpr tnr
versicolor 0.94 0.06 0.03 0.97
setosa 1.00 0.00 0.00 1.00
virginica 0.94 0.06 0.03 0.97
accuracy <- round(
sum(iris$Species == iris$Prediction)/dim(iris)[1]*100,
2)
print(paste0("The overal model accuracy is ", accuracy, "%."))
[1] "The overal model accuracy is 96%."
Recall that the accuracy was 98.67% with the full model. Would you trade a drop of less than three percentage points in model accuracy for not having to collect two additional predictors?
Now, as for multicollinearity in Multinomial Logistic Regression: run lm() as if the categorical outcome were a continuous variable and use vif() from {car} to assess the Variance Inflation Factor:
data(iris)
iris$Species <- relevel(iris$Species,
ref = 'versicolor')
iris$Species <- sapply(iris$Species, function(x) {
if (x == 'versicolor') {
x <- 0
} else if (x == 'setosa') {
x <- 1
} else {
x <- 2
}
})
mlr_model_vif <- lm(Species ~ .,
data = iris)
vif(mlr_model_vif)
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.072722 2.100872 31.261498 16.090175
And now one can see what the real problem with our Multinomial Logistic Regression models of the iris dataset is: Variance Inflation Factors above 10 (here, Petal.Length and Petal.Width) are commonly taken to signal serious multicollinearity, and the predictor correlations confirm it:
cor(select(iris, -Species))
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Multinomial Logistic Regression from English Wikipedia - An excellent introductory read!
{mnlogit} package vignette - {mnlogit} offers many advantages for GLMs compared to {nnet}, which is typically used for educational purposes
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.