Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course, 2020/21.
We will now begin to introduce a set of extremely useful statistical learning models in this Session. Their name - Generalized Linear Models (GLMs) - suggests that they are somehow related to Simple and Multiple Linear Regression models and yet go beyond them. That is correct: GLMs generalize the linear model, in which the predictors and their respective coefficients form a linear combination, by introducing link functions to solve the kinds of problems that cannot be handled by Linear Regression. For example, what if the problem is not to predict a continuous value of the criterion, but the outcome variable is a dichotomy, so that the problem becomes one of categorization? E.g. predict the sex of a respondent given a set \({X}\) of their features? Enter Binomial Logistic Regression, the simplest GLM. Another thing: GLMs cannot be estimated by minimizing the quadratic error, as we estimated Simple and Multiple Linear Regression in the previous Session 15; that method is known as Least Squares Estimation. To fit GLMs to our data, we will introduce the concept of Likelihood in Probability Theory and learn about the Maximum Likelihood Estimate!
Setup:
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(data.table)
Let us briefly recall a key assumption of the (Multiple) Linear Regression model: the outcome variable is continuous and can, in principle, take any value from -Inf to Inf. What if we observe a set of variables that describe a statistical experiment that can result in either of two discrete outcomes? For example, we observe a description of the behavior of a person, quantified in some way and organized into a set of variables, that should be used to predict the sex of that person? Or any other similar problem where the outcome can take only two values, say 0 or 1 (and immediately recall the Binomial Distribution)?
The assumptions of the Linear Model obviously constrain its application in such cases. We now ask the following question: would it be possible to generalize - to expand or modify - the Linear Model somehow so that it can encompass the categorization problem? It certainly sounds appealing to be able to take a set of predictors, combine them in a linear fashion, and estimate the coefficients so as to predict whether the outcome will turn out one way or another.
There is a way to develop such a generalization of the Linear Model. In its simplest form it is the Binomial Logistic Regression. Binomial Logistic Regression is very similar to multiple regression, except that the outcome variable is now categorical (i.e. it is measured on a nominal scale that is a dichotomy).
Let’s recall the form of the Linear Model with any number of predictors:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k + \epsilon\]
So we have a linear combination of \(k\) predictors \(\boldsymbol{X}\) plus the model error term \(\epsilon\) on the RHS, and the outcome variable \(Y\) on the LHS.
Now we assume that \(Y\) can take only two possible values, call them 0 and 1 for ease of discussion. We want to predict whether \(Y\) will happen to be 1 or not (0) given our observations of a set of predictors \(\boldsymbol{X}\). However, in Binary Logistic Regression we do not predict the value of the outcome itself, but rather the probability that the outcome will turn out 1 or 0 given the predictors.
In the simplest possible case, where there is only one predictor \(X_1\), this is exactly what we predict in Binary Logistic Regression:
\[P(Y) = p_1 = \frac{1}{1+e^{-(b_0 + b_1X_1)}}\] where \(b_0\) and \(b_1\) are the same good old linear model coefficients. As we will see, the linear coefficients receive a new interpretation in Binary Logistic Regression - one rather different from the one they have in the scope of the Linear Model.
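To get a feel for the shape of this function, here is a quick sketch that plots \(p_1\) against \(X_1\) for some made-up coefficients (the values of b0 and b1 below are illustrative only):
# - illustrative, made-up coefficients:
b0 <- -1
b1 <- 2
logisticCurve <- data.frame(x1 = seq(-5, 5, by = .1))
# - the logistic function of the linear combination b0 + b1*x1:
logisticCurve$p1 <- 1 / (1 + exp(-(b0 + b1 * logisticCurve$x1)))
ggplot(logisticCurve,
       aes(x = x1,
           y = p1)) +
  geom_line(size = .25) +
  ggtitle("The logistic function (illustrative b0 = -1, b1 = 2)") +
  theme_bw() +
  theme(panel.border = element_blank()) +
  theme(plot.title = element_text(hjust = .5))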
With \(k\) predictors we have:
\[P(Y) = p_1 = \frac{1}{1+e^{-(b_0 + b_1X_1 + b_2X_2 + ... + b_kX_k)}}\] Now the above equation looks like it fell from the clear blue sky to solve the problem. There is a clear motivation for its form, of course: imagine that instead of predicting the state of \(Y\) directly we decide to predict the odds of \(Y\) turning out 1 instead of 0:
\[odds = \frac{p_1}{1-p_1}\] Now comes the trick: if instead of predicting the odds \(p_1/(1-p_1)\) we decide to predict the log-odds (also called: the logit) from a linear combination of predictors
\[log \left( \frac{p_1}{1-p_1} \right) = b_0 + b_1X_1 + b_2X_2 + ... + b_kX_k\] it turns out that we can recover the odds by taking the exponent of both LHS and RHS:
\[\frac{p_1}{1-p_1} = e^{(b_0 + b_1X_1 + b_2X_2 + ... + b_kX_k)}\] and then by simple algebraic rearrangement we find that the probability \(p_1\) of the outcome \(Y\) turning out 1 is:
\[P(Y) = p_1 = \frac{1}{1+e^{-(b_0 + b_1X_1 + b_2X_2 + ... + b_kX_k)}}\]
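To see the algebra in one place, here is a minimal numerical sketch (the values of b0, b1 and x1 are made up for illustration): we compute the log-odds from the linear combination, exponentiate them to recover the odds, convert the odds to \(p_1\), and check that the result matches the logistic formula above.
# - made-up, illustrative values:
b0 <- -1.5
b1 <- 0.8
x1 <- 2
# - the linear combination is the log-odds (logit):
logOdds <- b0 + b1 * x1
# - exponentiate to recover the odds:
odds <- exp(logOdds)
# - convert the odds to the probability p1:
p1_from_odds <- odds / (1 + odds)
# - the logistic formula applied directly:
p1_direct <- 1 / (1 + exp(-(b0 + b1 * x1)))
all.equal(p1_from_odds, p1_direct)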
Now, imagine we set the following criterion: anytime we estimate \(p_1\) to be larger than or equal to \(.5\) we predict that \(Y=1\), and anytime \(p_1 < .5\) we predict that \(Y=0\). What we need to do in order to learn how to predict \(Y\) in this way is to estimate the coefficients \(b_0\), \(b_1\), \(b_2\), etc., just as we did in the case of a linear model. The estimation for GLMs is a bit different from what we learned in Session 15. But first let's see how to perform Binary Logistic Regression in R.
We will use a dataset from the UCLA Institute for Digital Research and Education's Statistical Consulting website (they also have a nice exposition of Binary Logistic Regression):
A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, effect admission into graduate school. The response variable, admit/don’t admit, is a binary variable. Source: UCLA’s Institute of Digital Research and Education
dataSet <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(dataSet)
Inspect the dataset:
dim(dataSet)
[1] 400 4
Let’s see what is in there for us:
str(dataSet)
'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
# - descriptive statistics
summary(dataSet)
admit gre gpa rank
Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
Median :0.0000 Median :580.0 Median :3.395 Median :2.000
Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
sapply(dataSet, sd)
admit gre gpa rank
0.4660867 115.5165364 0.3805668 0.9444602
So we need a model that predicts admit - a dichotomy - from the gre and gpa scores and the ranking of the educational institution found in rank. No wonder that the model can be written as admit ~ gre + gpa + rank in R:
# - rank to factor
# - Q: Why does rank go to factor?
# - A: Dummy coding... Remember?
dataSet$rank <- factor(dataSet$rank)
Here goes the glm() function:
# - model:
mylogit <- glm(admit ~ gre + gpa + rank,
data = dataSet,
family = "binomial")
modelsummary <- summary(mylogit)
print(modelsummary)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = dataSet)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
A word on the interpretation of the results. The Estimate column: just as in lm(), these are the model coefficients. The z value stands for the Wald's test of whether a coefficient is significantly different from zero (remember that we used the t-test in Linear Regression to test exactly the same hypothesis); the test statistic is obtained by dividing the coefficient by its standard error. N.B. There is a known problem with the Wald's z, look:
The reason why the Wald statistic should be used cautiously is because, when the regression coefficient is large, the standard error tends to become inflated, resulting in the Wald statistic being underestimated (see Menard, 1995). The inflation of the standard error increases the probability of rejecting a predictor as being significant when in reality it is making a significant contribution to the model (i.e. you are more likely to make a Type II error). From: Andy Field, DISCOVERING STATISTICS USING SPSS, Third Edition, Sage.
Then there are the null deviance and the residual deviance. Imagine that we try to predict whether the outcome turns out 1 by just looking at the distribution of the outcome and pretending that there are no predictors at all: that would be the baseline model for Binary Logistic Regression. The null deviance describes the error of that baseline, while the residual deviance describes the error of the current model. We will learn how to use them to assess the overall effect of the model.
In order to understand precisely how a Binary Logistic Regression model is assessed - and more importantly why it is assessed in the way it is assessed - we first need to introduce the concept of Likelihood in Probability Theory. Now this is one powerful idea that we need to discuss!
Imagine we toss a fair coin with \(p_H = .5\) twice and observe two Heads. The probability of observing two heads with \(p_H = .5\) is:
\[P(HH|p_H = .5) = .5 * .5 = .5^{2} = .25\] Now take a look at the following function: \(P(HH|p_H)\). Imagine that the data, the results of our observations - that we have seen two heads in a row - are fixed. Then \(P(HH|p_H)\) is a function of the parameter \(p_H\). Imagine that we start changing the value of \(p_H\) while keeping the data fixed, and with every change in the parameter we compute the value of \(P(HH|p_H)\) again and again. For example, what is the value of \(P(HH|p_H)\) if \(p_H = .3\)?
\[P(HH|p_H = .3) = .3 * .3 = .3^{2} = .09\] And what if \(p_H = .9\)?
\[P(HH|p_H = .9) = .9 * .9 = .9^{2} = .81\] We have observed two heads; in the universe of our small statistical experiment we have actually observed all heads, right? So, as we increase the value of \(p_H\), the value of \(P(HH|p_H)\) tends to increase: it was .09 when \(p_H = .3\), then .25 for \(p_H = .5\), and finally .81 for \(p_H = .9\). Even if we already know that the coin is fair - hence \(p_H = .5\) - the observed data alone suggest that higher values of \(p_H\) are more likely.
\(P(HH|p_H)\), also written as \(L(p_H|HH)\), reads: the likelihood of the parameter value \(p_H\) given the data \(HH\). We can plot the whole Likelihood function for this experiment easily:
likelihood <- data.frame(parameter = seq(.01, .99, by = .01))
likelihood$likelihood <- likelihood$parameter^2
ggplot(likelihood,
aes(x = parameter,
y = likelihood)) +
geom_smooth(size = .25, se = F) +
ggtitle("Likelihood function for HH") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
What if we have observed \(HHTTH\) in five tosses?
likelihood <- data.frame(parameter = seq(.01, .99, by = .01))
likelihood$likelihood <-
likelihood$parameter^2 * (1-likelihood$parameter)^2 * likelihood$parameter
ggplot(likelihood,
aes(x = parameter,
y = likelihood)) +
geom_smooth(size = .25, se = F) +
ggtitle("Likelihood function for HHTTH") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
How did we get to it? Look, if the data are \(HHTTH\), then the Likelihood function for \(p_H = .2\) must be:
\[P(HHTTH|p_H = .2) = .2 * .2 * (1-.2) * (1-.2) * .2 = .2^{2} * (1-.2)^{2} * .2 = .00512\] Let’s check in R:
.2^2*(1-.2)^2*.2
[1] 0.00512
And now we just need to compute the Likelihood function across the whole domain of \(p_H\). As simple as that!
Say we have observed the following sequence of outcomes (think of admit/don't admit in the admissions data playing the role of heads/tails): \(HHTHTTHHHT\). Assume that we know the parameter \(p_H\): we can then compute the likelihood \(L(p_H|HHTHTTHHHT)\) exactly as we did before. Now, this is the general form of the Binomial Likelihood (where \(Y\) stands for the observed data):
\[L(p|Y) = p_1^y(1-p_1)^{n-y}\] where \(y\) is the number of successes and \(n\) the total number of observations. For each observed data point then we have
\[L(p|y_i) = p_1^{y_i}(1-p_1)^{\bar{y_i}}\] where \({y_i}\) is the observed value of the outcome, \(Y\), and \(\bar{y_i}\) is its complement (e.g. 1 for 0 and 0 for 1). This form just determines which value will be used in the computation of the Likelihood function at each observed data point: it will be either \(p_1\) or \(1-p_1\). The likelihood function for a given value of \(p_1\) for the whole dataset is computed by multiplying the values of \(L(p|y_i)\) across the whole dataset (remember that multiplication in Probability is what conjunction is in Logic and Algebra).
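For instance, for the sequence \(HHTHTTHHHT\) mentioned above - six heads in ten tosses - we can evaluate \(p^6(1-p)^4\) over a grid of parameter values and read off where it peaks, along the lines of the earlier code:
likelihood <- data.frame(parameter = seq(.01, .99, by = .01))
# - HHTHTTHHHT: six heads and four tails, so L(p|Y) = p^6 * (1-p)^4
likelihood$likelihood <- likelihood$parameter^6 * (1 - likelihood$parameter)^4
# - the value of p that maximizes the likelihood:
likelihood$parameter[which.max(likelihood$likelihood)]
# - the maximum is at p = .6, exactly the observed proportion of heads (6/10)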
Q: But… how do we get to \(p_1\), the parameter value that we will use at each data point? A: We will search the parameter space, of course, \(\beta_0, \beta_1, ... \beta_k\) of linear coefficients in our Binary Logistic Model, computing \(p_1\) every time, and compute the likelihood function from it! In other words: we will search the parameter space to find the combination of \(\beta_0, \beta_1, ... \beta_k\) that produces the maximum of the likelihood function similarly as we have searched the space of linear coefficients to find the combination that minimizes the squared error in Simple Linear Regression.
So what combination of the linear coefficients is the best one?
It is the one that gives the maximum likelihood. This approach, known as Maximum Likelihood Estimation (MLE), stands behind many important statistical learning models. It is a cornerstone of Statistical Estimation Theory, and it is contrasted with the Least Squares Estimation that we used to estimate the Simple Linear Regression model in Session 15.
Now, there is a technical problem related to this approach. To obtain the likelihood for the whole dataset one needs to multiply as many very small numbers as there are data points. That causes numerical underflow: the products quickly become smaller than the smallest real numbers that digital computers can represent. The workaround is to use the logarithm of the likelihood instead, known as the log-likelihood (\(LL\)).
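A tiny sketch of why this matters on a digital computer: multiplying many small probabilities underflows to zero, while summing their logarithms stays representable.
# - multiplying many small probabilities underflows to zero:
prod(rep(1e-20, 20))
# - ...while summing their logarithms remains perfectly representable:
sum(log(rep(1e-20, 20)))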
Thus, while the Likelihood function for the whole dataset would be
\[L(p|Y) = \prod_{i=1}^{n}p_1^{y_i}(1-p_1)^{\bar{y_i}}\] the Log-Likelihood function would be:
\[LL(p|Y) = \sum_{i=1}^{n} y_ilog(p_1)+\bar{y_i}log(1-p_1)\] And finally here is how we solve the Binomial Logistic Regression problem:
Technically, in optimization we do not go for the maximum of \(L(p|Y)\) directly: we work with \(LL\) instead, and since numerical optimizers conventionally minimize rather than maximize, in practice we minimize the negative \(LL\), sometimes written simply as \(NLL\), the Negative Log-Likelihood function.
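To connect this with the model that we have already fitted, here is a small sketch that computes the negative log-likelihood by hand from the fitted probabilities of mylogit and compares it with what logLik() reports (we return to logLik() below):
# - fitted probabilities p_1 for every data point:
p1 <- fitted(mylogit)
# - observed outcomes (0/1):
y <- dataSet$admit
# - negative log-likelihood, summed over the whole dataset:
nll <- -sum(y * log(p1) + (1 - y) * log(1 - p1))
print(nll)
# - compare with what R reports:
all.equal(nll, -as.numeric(logLik(mylogit)))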
Let us now return to the output of our model and begin with the model coefficients. Coefficients in Binary Logistic Regression tell us about the change in the log-odds of the outcome for a one unit increase in the predictor variable:
modelcoefs <- modelsummary$coefficients
print(modelcoefs)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979073 1.139950936 -3.500132 0.0004650273
gre 0.002264426 0.001093998 2.069864 0.0384651284
gpa 0.804037549 0.331819298 2.423119 0.0153878974
rank2 -0.675442928 0.316489661 -2.134171 0.0328288188
rank3 -1.340203916 0.345306418 -3.881202 0.0001039415
rank4 -1.551463677 0.417831633 -3.713131 0.0002047107
To extract the coefficients only:
# - coefficients only:
mcoefs <- coef(mylogit)
mcoefs
(Intercept) gre gpa rank2 rank3 rank4
-3.989979073 0.002264426 0.804037549 -0.675442928 -1.340203916 -1.551463677
Now, the log-odds scale is odd in itself… But if we take the exp() of the coefficients, their interpretation becomes: by what factor do the odds of being 1 instead of being 0 change with a unit increase in the predictor.
# - You can also exponentiate the coefficients and
# - interpret them as factors of odds-ratios:
exp(coef(mylogit))
(Intercept) gre gpa rank2 rank3 rank4
0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
To obtain the confidence intervals:
# - Confidence Intervals on model coefficients:
confint(mylogit, level = .99)
0.5 % 99.5 %
(Intercept) -7.0108205916 -1.114337828
gre -0.0005261955 0.005132102
gpa -0.0401196327 1.676398025
rank2 -1.5005096781 0.137948894
rank3 -2.2490901281 -0.461746023
rank4 -2.6816859220 -0.509488785
Of course, use exp() to change to the odds scale:
# - Confidence Intervals on model coefficients:
exp(confint(mylogit, level = .99))
0.5 % 99.5 %
(Intercept) 0.0009020681 0.3281325
gre 0.9994739429 1.0051453
gpa 0.9606745042 5.3462641
rank2 0.2230164646 1.1479169
rank3 0.1054951680 0.6301824
rank4 0.0684476594 0.6008026
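With the fitted model at hand, we can also apply the .5 criterion introduced earlier: predict() with type = "response" returns the estimated probabilities \(p_1\), which we then threshold to obtain 0/1 predictions. A minimal sketch (predictedProb and predictedAdmit are just our illustrative names):
# - predicted probabilities from the fitted model:
predictedProb <- predict(mylogit, type = "response")
# - apply the .5 criterion to obtain 0/1 predictions:
predictedAdmit <- ifelse(predictedProb >= .5, 1, 0)
# - cross-tabulate observed vs. predicted outcomes:
table(observed = dataSet$admit, predicted = predictedAdmit)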
The model log-likelihood at the optimal parameter values is obtained from logLik():
# - The model log-likelihood is:
logLik(mylogit)
'log Lik.' -229.2587 (df=6)
# - Note: models w. higher logLik (i.e. lower negative log-likelihood) are better
Let’s now talk about AIC, the Akaike Information Criterion.
# - Akaike Information Criterion
# - see: https://en.wikipedia.org/wiki/Akaike_information_criterion
mylogit$aic
[1] 470.5175
# - Note: models w. lower AIC are better
The AIC is computed in the following way:
\[AIC = 2k - 2LL\] where \(k\) is the number of parameters estimated:
-2*as.numeric(logLik(mylogit)) + 2*6
[1] 470.5175
There is also the AIC() function:
AIC(mylogit)
[1] 470.5175
Finally, let’s explain what the deviance residual is. The squared deviance residual of each data point equals -2 times that point's contribution to the log-likelihood: \(-2log(p_1)\) when the observed outcome is 1, and \(-2log(1-p_1)\) when it is 0 - in other words, -2 times the logarithm of the probability that the model assigned to the outcome that actually occurred. And why would anyone construct such a measure of model error?
# - model deviance:
print(mylogit$deviance)
[1] 458.5175
Interestingly enough, \(-2LL\) equals the total model deviance…
ll <- logLik(mylogit)
-2*as.numeric(ll) == mylogit$deviance
[1] TRUE
Let's check again… First, the sum of the squared deviance residuals against the total deviance:
deviances <- residuals(mylogit,
type = "deviance")
sum(deviances^2) == mylogit$deviance
[1] TRUE
So, the deviance residuals decompose \(-2LL\) - the total model deviance - just as the residuals in Linear Regression decompose the total model error…
sum(deviances^2) == -2*as.numeric(ll)
[1] TRUE
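To make the connection fully explicit, here is a small sketch that reconstructs the deviance residuals by hand from the fitted probabilities and the observed outcomes, and compares them with what residuals(mylogit, type = "deviance") returned above:
# - fitted probabilities and observed outcomes:
p1 <- fitted(mylogit)
y <- dataSet$admit
# - squared deviance residual = -2 * (log-likelihood contribution of the point);
# - the sign of the residual is the sign of (y - p1):
devByHand <- sign(y - p1) * sqrt(-2 * (y * log(p1) + (1 - y) * log(1 - p1)))
all.equal(unname(devByHand), unname(deviances))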
Finally, does the model in itself have any effect? Did we gain anything from introducing the predictors? The difference between the null and the residual deviance follows a \(\chi^2\)-distribution…
# - Comparison to a so-called Null model (intercept only)
# - The following is Chi-Square distributed...
dev <- mylogit$null.deviance - mylogit$deviance
print(dev)
[1] 41.45903
… with dfs degrees of freedom:
dfs <- mylogit$df.null - mylogit$df.residual
print(dfs)
[1] 5
And then, as ever: we check how small the probability of observing a difference this extreme is, in order to control for the Type I Error:
pchisq(dev, dfs, lower.tail = FALSE)
[1] 7.578194e-08
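The same comparison can be cross-checked by explicitly fitting an intercept-only (null) model and handing both models to anova() with a \(\chi^2\) test - a minimal sketch (nullModel is just our name for it):
# - intercept-only (null) model:
nullModel <- glm(admit ~ 1,
                 data = dataSet,
                 family = "binomial")
# - deviance (likelihood-ratio) test of the full model against the null model:
anova(nullModel, mylogit, test = "Chisq")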
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.