Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
We will complete our journey into the world of Generalized Linear Models (GLMs) by discussing the application of Poisson regression to count data, and its generalization known as Negative Binomial regression, which can save the day when Poisson regression fails. We will then introduce cross-validation: a very important and extremely useful model selection procedure. In this Session we will discuss two forms of cross-validation (CV): the Leave-One-Out CV (LOOCV) and the K-fold CV.
Grab the fHH1.csv dataset from the proback/BeyondMLR repository.
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(data.table)
library(car)
library(ggrepel)
As with all GLMs, the problem to be solved is how to apply the Linear Model in a situation where our data fail to satisfy its assumptions. Poisson regression deals with the situation where our observations - the outcome variable - are counts per some unit of time or space. Remember the Poisson distribution, introduced earlier in our sessions, which is used to model the frequency of occurrences in a given time interval (or spatial area)? Its probability mass function is given by:
\[f(k; \lambda) = P(X = k) = \frac{\lambda^ke^{-\lambda}}{k!}\] where \(k\) is the number of occurrences, \(k=0,1,2,...\), \(e\) is Euler's number \(e=2.7182...\), and \(\lambda\) is the only distribution parameter, representing the Poisson distribution's mean and variance at the same time: \(\lambda = E(X) = Var(X)\).
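As a quick sanity check of this mean-variance property, here is a minimal simulation sketch (the choice \(\lambda = 4\) is arbitrary, for illustration only):
# Simulate a large Poisson sample: its mean and variance should both be close to lambda
set.seed(42)
x <- rpois(100000, lambda = 4)
mean(x)
var(x)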
Let's state the crucial model assumption precisely before we derive the model: Poisson regression assumes that the response variable \(Y\) has a Poisson distribution, and that the logarithm of its expected value can be modeled by a linear combination of (a) predictors and (b) coefficients (which need to be estimated, of course). More precisely, Poisson regression models \(\lambda_i\), the average number of occurrences of a phenomenon, as a function of one or more predictors. For example, we could consider the average number of car accidents in some state S in year Y as a function of a specific set of regulations that were in force in that state in that year. Let's take a closer look at the model:
\[log(\lambda_i) = \beta_0 + \beta_1x_i\]
or more generally
\[log(\lambda) = \beta_0 + \beta_1X\]
while assuming that the observed values \(Y_i\) follow a Poisson distribution with \(\lambda = \lambda_i\) for a given \(x_i\). For each observation, then, we could have a different value of \(\lambda\) depending on the particular value of the predictor \(X\). The model is easily expanded to encompass more predictors:
\[log(\lambda) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n\]
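To make the log link concrete, here is a tiny sketch with made-up coefficients (\(\beta_0 = 1.5\) and \(\beta_1 = -0.005\) are purely hypothetical values, not estimates):
# Hypothetical coefficients, for illustration only
beta0 <- 1.5
beta1 <- -0.005
age <- c(25, 50, 75)
# The model is linear on the log scale, so the expected count is exp(...)
lambda <- exp(beta0 + beta1 * age)
lambda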
The model assumptions are:
- Poisson response: the outcome variable is a count per unit of time or space, following a Poisson distribution;
- independence: the observations are independent of one another;
- mean = variance: by the properties of the Poisson distribution, the mean of the outcome equals its variance;
- linearity: the log of the mean, \(log(\lambda)\), is a linear function of the predictors.
We will use the Household Size in the Philippines case study from this excellent book on applied regression models in R: Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R, by Paul Roback and Julie Legler.
fHH1 <- fread(paste0(dataDir, "fHH1.csv"))
str(fHH1)
Classes ‘data.table’ and 'data.frame': 1500 obs. of 6 variables:
$ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ location: chr "CentralLuzon" "MetroManila" "DavaoRegion" "Visayas" ...
$ age : int 65 75 54 49 74 59 54 41 50 59 ...
$ total : int 0 3 4 3 3 6 5 5 6 4 ...
$ numLT5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ roof : chr "Predominantly Strong Material" "Predominantly Strong Material" "Predominantly Strong Material" "Predominantly Strong Material" ...
- attr(*, ".internal.selfref")=<externalptr>
What we would like to do is to be able to predict the total variable, which represents the number of people living in a household (other than the head of the household), from the following covariates:
- location: where the house is located (regions in the Philippines; the dataset comes from the Family Income and Expenditure Survey (FIES), spearheaded by the Philippine Statistics Authority (PSA));
- age: the age of the head of household;
- numLT5: the number of people in the household under 5 years of age;
- roof: the type of roof in the household (either Predominantly Light/Salvaged Material, or Predominantly Strong Material; stronger material can sometimes be used as a proxy for greater wealth).
Let's take a look at the probability distribution of the outcome variable:
ggplot(data = fHH1,
       aes(x = total)) +
  geom_bar(stat = 'count',
           colour = "black",
           fill = "aliceblue",
           alpha = .75) +
  theme_bw() +
  theme(panel.border = element_blank())
We have two integer and two categorical predictors. Let’s have a look at them, categorical predictors first:
ggplot(data = fHH1,
       aes(x = location)) +
  geom_bar(stat = 'count',
           colour = "black",
           fill = "indianred",
           alpha = .75) +
  theme_bw() +
  theme(panel.border = element_blank())
as.data.frame(table(fHH1$roof))
Now the numeric predictors:
ggplot(data = fHH1,
       aes(x = age)) +
  geom_bar(stat = 'count',
           colour = "black",
           fill = "green",
           alpha = .75) +
  theme_bw() +
  theme(panel.border = element_blank())
ggplot(data = fHH1,
       aes(x = numLT5)) +
  geom_bar(stat = 'count',
           colour = "black",
           fill = "green",
           alpha = .75) +
  theme_bw() +
  theme(panel.border = element_blank())
We need to test the assumption that the outcome variable follows a Poisson distribution. While it is almost always very difficult to say what distribution a variable follows in empirical problems, we can test the relationship between the outcome's mean and variance: they should be equal, right?
meanVar <- fHH1 %>%
dplyr::select(age, total) %>%
dplyr::mutate(ints = cut(age, breaks = 15)) %>%
dplyr::group_by(ints) %>%
dplyr::summarise(mean = mean(total),
var = var(total),
n = n())
ggplot(meanVar,
aes(x = mean, y = var)) +
geom_smooth(method = "lm", size = .25) +
geom_point(size = 2) +
geom_point(size = 1.5, color = "white") +
theme_bw() +
theme(panel.border = element_blank())
What I really did was to cut() - a handy base R function indeed! - the age variable into 15 intervals, and then plot the mean against the variance of the outcome values within each of the age intervals thus obtained.
Another trick that we can pull is to perform a non-parametric bootstrap and see if the obtained means and variances are correlated:
outcome <- data.frame(total = fHH1$total)
meanVar <- lapply(1:1000, function(x) {
s <- sample_n(outcome,
size = dim(outcome)[1],
replace = TRUE)
return(
data.frame(mean = mean(s$total),
var = var(s$total))
)
})
meanVar <- rbindlist(meanVar)
ggplot(meanVar,
aes(x = mean, y = var)) +
geom_smooth(method = "lm", size = .25) +
geom_point(size = 2) +
geom_point(size = 1.5, color = "white") +
theme_bw() +
theme(panel.border = element_blank())
What this exercise shows is that the mean and the variance of total seem to be correlated to a degree; but carefully observe how the variances exceed the respective bootstrap sample means.
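Both observations can be quantified directly from the bootstrap results computed above:
# Correlation between the bootstrapped means and variances
cor(meanVar$mean, meanVar$var)
# Proportion of bootstrap samples in which the variance exceeds the mean
mean(meanVar$var > meanVar$mean)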
Now, the Poisson regression model:
poiss_model <- glm(total ~ age,
                   family = "poisson",
                   data = fHH1)
summary(poiss_model)
Call:
glm(formula = total ~ age, family = "poisson", data = fHH1)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9079 -0.9637 -0.2155 0.6092 4.9561
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5499422 0.0502754 30.829 < 2e-16 ***
age -0.0047059 0.0009363 -5.026 5.01e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2362.5 on 1499 degrees of freedom
Residual deviance: 2337.1 on 1498 degrees of freedom
AIC: 6714
Number of Fisher Scoring iterations: 5
To obtain the model coefficients (with exponentiation, of course - remember the \(log(\lambda_i)\) transform of the outcome in the model):
exp(coef(poiss_model))
(Intercept) age
4.7111980 0.9953052
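Because the model is linear on the log scale, the exponentiated coefficients act multiplicatively on the expected count: each additional year of the head of household's age multiplies the expected household size by about 0.9953, i.e. a decrease of roughly 0.47% per year. A one-liner to express the coefficients as percentage changes:
# Percentage change in the expected count per unit increase of each predictor
round((exp(coef(poiss_model)) - 1) * 100, 2)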
The model log-likelihood and the Akaike Information Criterion (AIC) are:
logLik(poiss_model)
'log Lik.' -3354.984 (df=2)
poiss_model$aic
[1] 6713.968
And the confidence intervals for the model coefficients are again obtained from confint():
confint(poiss_model)
2.5 % 97.5 %
(Intercept) 1.451170100 1.648249185
age -0.006543163 -0.002872717
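Since exponentiation is monotonic, we can exponentiate the interval bounds as well, to read them on the same multiplicative scale as the coefficients above:
exp(confint(poiss_model))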
Now the full model:
poiss_model_full <- glm(total ~ location + age + numLT5 + roof,
                        family = "poisson",
                        data = fHH1)
summary(poiss_model_full)
Call:
glm(formula = total ~ location + age + numLT5 + roof, family = "poisson",
data = fHH1)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5315 -0.7207 -0.0900 0.6103 3.8415
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0613397 0.0750613 14.140 <2e-16 ***
locationDavaoRegion -0.0459257 0.0538820 -0.852 0.394
locationIlocosRegion 0.0090312 0.0527603 0.171 0.864
locationMetroManila 0.0381394 0.0472689 0.807 0.420
locationVisayas 0.0486360 0.0422648 1.151 0.250
age -0.0002257 0.0009587 -0.235 0.814
numLT5 0.3643417 0.0165647 21.995 <2e-16 ***
roofPredominantly Strong Material 0.0597142 0.0437171 1.366 0.172
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2362.5 on 1499 degrees of freedom
Residual deviance: 1897.9 on 1492 degrees of freedom
AIC: 6286.8
Number of Fisher Scoring iterations: 5
And let’s take a look at the full model coefficients:
exp(coef(poiss_model_full))
(Intercept) locationDavaoRegion locationIlocosRegion
2.8902405 0.9551130 1.0090721
locationMetroManila locationVisayas age
1.0388760 1.0498382 0.9997744
numLT5 roofPredominantly Strong Material
1.4395661 1.0615331
Compare the AICs of the model encompassing age only and the full model:
poiss_model_full$aic
[1] 6286.755
poiss_model$aic
[1] 6713.968
And we can see that the full model performs somewhat better, as (maybe) expected.
Remember how we have discovered that the variance of total - the outcome for the model in the fHH1 dataset - seems to be larger than its mean?
Overdispersion describes the observation that variation is higher than would be expected. Some distributions do not have a parameter to fit variability of the observation. For example, the normal distribution does that through the parameter \(\sigma\) (i.e. the standard deviation of the model), which is constant in a typical regression. In contrast, the Poisson distribution has no such parameter, and in fact the variance increases with the mean (i.e. the variance and the mean have the same value). In this latter case, for an expected value of \(E(y)= 5\), we also expect that the variance of observed data points is \(5\). But what if it is not? What if the observed variance is much higher, i.e. if the data are overdispersed? Source: Introduction: what is overdispersion? From: Advice for Problems in Environmental Statistics (APES) of the Department of Biometry and Environmental System Analysis at the University of Freiburg, and the Professorship for Theoretical Ecology at the University of Regensburg
One way to test whether overdispersion is present is to compute the dispersion parameter. It is obtained by dividing the model deviance by the respective number of degrees of freedom (which is \(n-p\) in this case, \(n\) being the number of observations and \(p\) the number of parameters). If no overdispersion is present, the value of the dispersion parameter should be close to 1. Let's see:
deviance <- poiss_model_full$deviance
n <- dim(fHH1)[1]
p <- length(poiss_model_full$coefficients)
df <- n - p
dispersionParameter <- deviance / df
print(dispersionParameter)
[1] 1.272035
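As a cross-check, we can refit the model with the quasi-Poisson family, which estimates a dispersion parameter instead of fixing it at 1 (a sketch; note that this estimate is based on Pearson residuals, so it will be close to, but not identical with, the deviance-based value above):
quasi_model <- glm(total ~ location + age + numLT5 + roof,
                   family = "quasipoisson",
                   data = fHH1)
summary(quasi_model)$dispersion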
Ok, and what do we do then? We need a model less constrained than the Poisson regression!
For example, Poisson regression analysis is commonly used to model count data. If overdispersion is a feature, an alternative model with additional free parameters may provide a better fit. In the case of count data, a Poisson mixture model like the negative binomial distribution can be proposed instead, in which the mean of the Poisson distribution can itself be thought of as a random variable drawn – in this case – from the gamma distribution thereby introducing an additional free parameter (note the resulting negative binomial distribution is completely characterized by two parameters). Source: Overdispersion, From Wikipedia, the free encyclopedia
In **Negative Binomial regression** we generate a value of \(\lambda\) for each observation from a Gamma distribution, and then generate a count from a Poisson distribution with the generated value of \(\lambda\). Mathematically, with this process we arrive at the Gamma-Poisson mixture, which is described by a Negative Binomial distribution with two parameters (one more than the simpler Poisson). Because we introduce that one additional parameter, the situation is far more flexible, and the counts can be more dispersed than would be expected for observations based on a Poisson with rate \(\lambda\) alone.
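Here is a minimal simulation sketch of that generative story (the Gamma shape and rate values are arbitrary, chosen only to make the overdispersion visible):
set.seed(42)
n <- 100000
# Draw a different lambda for each observation from a Gamma distribution...
lambda <- rgamma(n, shape = 2, rate = 0.5) # E(lambda) = shape/rate = 4
# ...and then draw each count from a Poisson with that lambda
y <- rpois(n, lambda)
mean(y) # close to 4
var(y)  # clearly larger than the mean: overdispersion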
library(MASS)
nb_model <- glm.nb(total ~ location + age + numLT5 + roof,
data = fHH1)
summary(nb_model)
Call:
glm.nb(formula = total ~ location + age + numLT5 + roof, data = fHH1,
init.theta = 24.74987389, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4636 -0.6913 -0.0850 0.5678 3.3832
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.080240 0.080844 13.362 <2e-16 ***
locationDavaoRegion -0.049495 0.057712 -0.858 0.391
locationIlocosRegion 0.008724 0.056592 0.154 0.877
locationMetroManila 0.039579 0.050658 0.781 0.435
locationVisayas 0.046035 0.045308 1.016 0.310
age -0.000590 0.001033 -0.571 0.568
numLT5 0.367798 0.018586 19.789 <2e-16 ***
roofPredominantly Strong Material 0.059352 0.047098 1.260 0.208
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(24.7499) family taken to be 1)
Null deviance: 2076.9 on 1499 degrees of freedom
Residual deviance: 1678.0 on 1492 degrees of freedom
AIC: 6273.4
Number of Fisher Scoring iterations: 1
Theta: 24.75
Std. Err.: 7.04
2 x log-likelihood: -6255.41
N.B. A lower value of \(\theta\) implies larger overdispersion.
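The estimated \(\theta\) enters the Negative Binomial variance function as \(Var(Y) = \mu + \mu^2/\theta\), so we can see just how mild the overdispersion is here; for instance, at an expected count of \(\mu = 4\) (a value chosen purely for illustration):
theta <- nb_model$theta
mu <- 4
mu + mu^2 / theta # approximately 4.65, only slightly above the Poisson value of 4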
When we decide to derive a model of some phenomenon in mathematical statistics, what we typically do is draw a sample of relevant observations and predictors and then train the model on the sampled data. As we have seen in the course of our previous sessions, the question of generalization is of crucial importance. We are not looking for models with significant coefficients and high goodness-of-fit (GoF) measures (such as \(R^2\), for example) or low badness-of-fit measures (such as \(AIC\)), but for models that will generalize beyond the data that we have used to train them. Thus, we are very much interested in the standard errors of the model coefficients: they let us know how widely the coefficients would vary if we were to sample the data over and over again from the respective population.
There is one serious problem to face here. Since we have only one sample of data at our disposal, we can only imagine how other samples would vary with respect to it. Since all samples come from the same population, we can safely assume that some order, some structure, some invariant information will be present in all of them. However, we also know that each sample will be idiosyncratic to some degree. If we manage to fit a great model to our one sample only, we will probably start to explain that idiosyncratic aspect of the sample too: a part of the information contained in the data that is a consequence of the sampling procedure, not a general pattern present in the population, and thus probably not present in most other samples that we could have drawn from it. This problem is called overfitting and needs to be handled very carefully in Data Science.
Cross-validation (CV) is one method to deal with the possibility of overfitting. In all forms of CV the idea is basically the same: train (i.e. estimate) the model on a subset (or subsets) of the data and test it on the remaining, "unseen" part (or parts). We will introduce two basic CV approaches here: the Leave-One-Out CV and the K-fold CV.
I will use iris to demonstrate the LOOCV procedure. We want to compare two models of Sepal.Width:
- one encompassing Sepal.Length and Petal.Length only, and then
- one encompassing Sepal.Length, Petal.Length, and Petal.Width.
The LOOCV procedure is used when the dataset is really small - which iris definitely is.
Here is the LOOCV approach:
- leave out one observation from the dataset;
- fit the model on all remaining observations;
- predict the left-out observation from the fitted model;
- compute the squared prediction error: (data point - prediction)^2;
- repeat for each observation in the dataset, and average the squared errors to obtain the mean squared prediction error (MSPE).
The selected model would be the one with the lower mean prediction error, of course. In R:
head(iris)
Compute the MSPE for the simpler model first:
dataSet <- dplyr::select(iris,
                         Sepal.Length, Petal.Length, Sepal.Width)
mspe1 <- sapply(1:dim(dataSet)[1], function(x) {
  # leave out observation x...
  fitData <- dataSet[-x, ]
  leftOut <- dataSet[x, ]
  # ...fit the model on the remaining observations...
  model <- lm(Sepal.Width ~ Sepal.Length + Petal.Length,
              data = fitData)
  # ...predict the left-out observation...
  prediction <- predict(model,
                        newdata = leftOut[, c('Sepal.Length', 'Petal.Length')])
  # ...and return its squared prediction error
  spe <- (leftOut$Sepal.Width - prediction)^2
})
print(mean(mspe1))
[1] 0.1069126
And now compute the MSPE for the model encompassing the additional predictor:
dataSet <- dplyr::select(iris,
Sepal.Length, Petal.Length, Petal.Width, Sepal.Width)
mspe2 <- sapply(1:dim(dataSet)[1], function(x) {
fitData <- dataSet[-x, ]
leftOut <- dataSet[x, ]
model <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width,
data = fitData)
prediction <- predict(model,
newdata = leftOut[, c('Sepal.Length', 'Petal.Length', 'Petal.Width')])
spe <- (leftOut$Sepal.Width - prediction)^2
})
print(mean(mspe2))
[1] 0.09509893
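N.B. We wrote the LOOCV loops by hand for transparency, but the same estimates can be obtained with cv.glm() from the {boot} package (a sketch, assuming {boot} is installed; with the default cost function, delta[1] is the raw mean squared prediction error, and glm() with its default gaussian family is equivalent to lm() here):
library(boot)
m1 <- glm(Sepal.Width ~ Sepal.Length + Petal.Length, data = iris)
m2 <- glm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width, data = iris)
# K defaults to the number of observations, i.e. LOOCV
cv.glm(iris, m1)$delta[1]
cv.glm(iris, m2)$delta[1]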
In K-fold CV we randomly place all available observations in K folds (i.e. groups). If we decide to go for three folds, for example, we proceed in the following way:
- hold out fold 1, fit the model on folds 2 and 3, predict the observations in fold 1, and compute the mean squared prediction error for that fold;
- repeat the procedure with fold 2 held out, and then with fold 3 held out;
- average the three fold-level errors to obtain the cross-validated estimate of the prediction error.
For the simpler Sepal.Width ~ Sepal.Length + Petal.Length model first:
iris$Species <- NULL
iris$fold <- sample(1:3, 150, replace = TRUE)
table(iris$fold)
1 2 3
59 39 52
N.B. The folds should be at least approximately equal in size. I could have opted for an exact 50/50/50 split as well, e.g. by shuffling a balanced vector of fold labels, as in the sketch below.
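# A minimal sketch of an exactly balanced assignment: 50 labels per fold, shuffled
balancedFold <- sample(rep(1:3, length.out = nrow(iris)))
table(balancedFold)
However, the random folds generated above will serve our purpose just fine: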
dataSet <- dplyr::select(iris,
Sepal.Length, Petal.Length, Sepal.Width, fold)
mspe1 <- sapply(1:3, function(x) {
  # train on the two folds other than x...
  fitData <- dataSet[dataSet$fold != x, ]
  leftOut <- dataSet[dataSet$fold == x, ]
  model <- lm(Sepal.Width ~ Sepal.Length + Petal.Length,
              data = fitData)
  # ...predict the held-out fold x...
  prediction <- predict(model,
                        newdata = leftOut[, c('Sepal.Length', 'Petal.Length')])
  # ...and return that fold's mean squared prediction error
  spe <- mean((leftOut$Sepal.Width - prediction)^2)
})
print(mspe1)
[1] 0.09788912 0.09628600 0.12544086
And the mean of MSPEs for the simpler model is:
print(mean(mspe1))
[1] 0.1065387
Now for the model encompassing Petal.Width too:
dataSet <- dplyr::select(iris,
Sepal.Length, Petal.Length, Petal.Width, Sepal.Width, fold)
mspe2 <- sapply(1:3, function(x) {
fitData <- dataSet[dataSet$fold != x, ]
leftOut <- dataSet[dataSet$fold == x, ]
model <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width,
data = fitData)
prediction <- predict(model,
newdata = leftOut[, c('Sepal.Length', 'Petal.Length', 'Petal.Width')])
spe <- mean((leftOut$Sepal.Width - prediction)^2)
})
print(mspe2)
[1] 0.08141191 0.09419004 0.10977537
And the mean MSPE from three folds is:
print(mean(mspe2))
[1] 0.09512577
N.B. Cross-validation procedures do not have to be based on MSPE. Other criteria like AIC or log-likelihood are used as well. For classification problems, the ROC analysis for each prediction can be used to select the best available model.
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.