Feedback should be sent to goran.milovanovic@datakolektiv.com.
These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
We introduce the concepts of partial and part correlation, which are of essential importance in multiple linear regression, and take a look at what R has to offer in that respect in the {ppcor} package. Then we go back to the simple linear regression model and discuss its assumptions. All statistical models make some assumptions about the data, and only if those assumptions are satisfied is the application of the model to the data justified. We proceed by introducing the concept of the bias of a statistical estimate and a way to estimate it via numerical simulation: the parametric bootstrap.
Install:
install.packages('car')
install.packages('ppcor')
Setup:
library(tidyverse)
library(data.table)
library(Hmisc)
library(ppcor)
library(car)
dataDir <- paste0(getwd(), "/_data/")
The concepts of partial and part correlation are useful in the description of mediation. We have two RVs, \(X\) and \(Y\), and we are interested in the strength of their linear relationship. However, there is also another variable (or a set of variables), \(Z\), that is related to both \(X\) and \(Y\), and we ask: how does this additional \(Z\) variable affect the relationship between \(X\) and \(Y\)?
Partial correlation presents the most straightforward answer to this question. It is the coefficient of linear correlation that one obtains between \(X\) and \(Y\) after removing the shared variance of \(X\) and \(Z\), and of \(Y\) and \(Z\).
We will use the {ppcor} package to compute partial correlations in the following example. Before that: can we explain partial correlation conceptually? It turns out that it is not difficult to explain what partial correlation is once the simple linear regression model has been introduced.
linFit <- lm(data = iris,
Petal.Length ~ Sepal.Length)
linFitPlot <- data.frame(
x = iris$Sepal.Length,
y = iris$Petal.Length,
predicted = linFit$fitted.values,
residuals = linFit$residuals
)
ggplot(data = linFitPlot,
aes(x = x, y = y)) +
geom_smooth(method = lm, se = F, color = "red", size = .25) +
geom_segment(aes(x = x,
y = predicted,
xend = x,
yend = predicted + residuals),
color = "black", size = .2, linetype = "dashed") +
geom_point(aes(x = x, y = y), color = "black", size = 2) +
geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
xlab("Sepal.Length") + ylab("Petal.Length") +
theme_bw() +
theme(panel.border = element_blank())
We have Sepal.Length on the x-axis and Petal.Length on the y-axis, producing a scatter plot of a sort that we have already seen many times before. If the relationship between the two variables were perfectly linear, all points would fall on the regression line. In this plot, the red dots represent the predictions of the best-fitting line in the plane spanned by the two variables of interest. However, the relationship is not perfectly linear, as can be observed from the vertical deviations of the data points from the line. The distance between what the linear model predicts (red points) and the actual data (white points) is called a residual. Residuals are represented in this plot by the dashed vertical lines connecting the model predictions and the actual data points. The best-fitting line is exactly the one that minimizes these residuals, which are treated as model errors.
Take \(X\) to be Sepal.Length, \(Y\) to be Petal.Length, and \(Z\) to be Sepal.Width: how does the correlation between \(Z\) and \(X\), on one hand, and \(Z\) and \(Y\), on the other, affect the correlation between \(X\) and \(Y\)? Let’s plot \(Z\) vs. \(X\) and \(Z\) vs. \(Y\):
linFit <- lm(data = iris,
Sepal.Length ~ Sepal.Width)
linFitPlot1 <- data.frame(
x = iris$Sepal.Width,
y = iris$Sepal.Length,
predicted = linFit$fitted.values,
residuals = linFit$residuals
)
linFit <- lm(data = iris,
Petal.Length ~ Sepal.Width)
linFitPlot2 <- data.frame(
x = iris$Sepal.Width,
y = iris$Petal.Length,
predicted = linFit$fitted.values,
residuals = linFit$residuals
)
linFitPlot <- rbind(linFitPlot1, linFitPlot2)
linFitPlot$Plot <- factor(c(rep("Sepal.Length",150), rep("Petal.Length",150)),
levels = c("Sepal.Length", "Petal.Length"))
ggplot(data = linFitPlot,
aes(x = x, y = y)) +
geom_smooth(method = lm, se = F, color = "red", size = .25) +
geom_segment(aes(x = x,
y = predicted,
xend = x,
yend = predicted + residuals),
color = "black", size = .2, linetype = "dashed") +
geom_point(aes(x = x, y = y), color = "black", size = 2) +
geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
xlab("Sepal.Width") + ylab("") +
theme_classic() +
facet_grid(. ~ Plot) +
theme_bw() +
theme(panel.border = element_blank()) +
theme(strip.background = element_blank()) +
theme(axis.text.x = element_text(size = 6)) +
theme(axis.text.y = element_text(size = 6))
Sepal.Width has some correlation with both Sepal.Length and Petal.Length; upon plotting the best-fitting lines, we can observe some residuals in both panels too. The partial correlation of Sepal.Length and Petal.Length, while controlling for Sepal.Width, is nothing other than the correlation between the residuals of Sepal.Length and Petal.Length obtained after regressing each of them on Sepal.Width:
partialCor <- cor(linFitPlot$residuals[1:150], # Sepal.Length residuals
linFitPlot$residuals[151:300] # Petal.Length residuals
)
partialCor
[1] 0.9153894
In comparison to:
cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
we can conclude that the coefficient of linear correlation between these two variables increases after controlling for the effect of Sepal.Width
!
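The same value can also be obtained analytically from the three pairwise correlations via the standard formula for a first-order partial correlation, \(r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}\). A quick check, using nothing beyond base R’s cor():
# - partial correlation from the three pairwise correlations
r_xy <- cor(iris$Sepal.Length, iris$Petal.Length)
r_xz <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_yz <- cor(iris$Petal.Length, iris$Sepal.Width)
(r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)) # - compare with partialCor above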
In {ppcor}, the same partial correlation would be computed in the following way:
# partial correlation w. {ppcor}
dataSet <- iris
dataSet$Species <- NULL
partialCor1 <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
dataSet$Sepal.Width,
method = "pearson")
partialCor1$estimate
[1] 0.9153894
And of course:
partialCor1$p.value
[1] 5.847914e-60
partialCor1$statistic
[1] 27.56916
For the matrix of partial correlations, where the correlation of each pair of variables is computed after controlling for the effects of all the remaining variables, {ppcor} offers:
#### partial correlation in R
dataSet <- iris
dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method = "pearson")
irisPCor$estimate # partial correlations
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
irisPCor$p.value # results of significance tests
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.000000e+00 1.199846e-17 7.656980e-25 2.412876e-05
Sepal.Width 1.199846e-17 0.000000e+00 8.753029e-17 1.104958e-05
Petal.Length 7.656980e-25 8.753029e-17 0.000000e+00 7.332477e-47
Petal.Width 2.412876e-05 1.104958e-05 7.332477e-47 0.000000e+00
A t-test on \(n - 2 - k\) degrees of freedom; \(k\) = number of variables conditioned on:
irisPCor$statistic
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.000000 9.765380 12.502483 -4.362929
Sepal.Width 9.765380 0.000000 -9.431189 4.553279
Petal.Length 12.502483 -9.431189 0.000000 21.398708
Petal.Width -4.362929 4.553279 21.398708 0.000000
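For instance, the t-value reported for the partial correlation of Sepal.Length and Petal.Length can be recovered directly from \(t = r\sqrt{\frac{n - 2 - k}{1 - r^2}}\), with \(n = 150\) observations and \(k = 2\) conditioned variables:
# - recover the t-statistic for one partial correlation by hand
r <- irisPCor$estimate["Sepal.Length", "Petal.Length"]
n <- dim(dataSet)[1]
k <- 2 # - number of variables conditioned on
r * sqrt((n - 2 - k) / (1 - r^2)) # - compare with irisPCor$statistic above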
Good. And now, what would a part - also known as semi-partial - correlation be? Take a look again at the previous plot, where Sepal.Width predicts Sepal.Length in the left panel and Petal.Length in the right panel; residuals from both linear regressions are present. The partial correlation of Sepal.Length and Petal.Length was obtained by removing the effect of Sepal.Width from both variables; in effect, all that we had to do to obtain it was to compute the correlation coefficient of the residuals - of what remains of these two variables after removing what Sepal.Width predicts. A semi-partial, or part, correlation would be obtained if we had removed the effect of Sepal.Width from the second variable only: Petal.Length in this case. It results in a correlation between (a) Sepal.Length itself and (b) what is left of Petal.Length (the residuals) after controlling for the effect of Sepal.Width:
partCor <- cor(iris$Sepal.Length, # Sepal.Length in itself
linFitPlot$residuals[151:300] # Petal.Length residuals
)
partCor
[1] 0.9090408
In {ppcor}, this part correlation is obtained by:
partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
dataSet$Sepal.Width,
method = "pearson")
partCor$estimate
[1] 0.9090408
As ever, the p-value:
partCor$p.value
[1] 9.407918e-58
and the t-test:
partCor$statistic
[1] 26.44911
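Analogously to the partial correlation, the part correlation has a closed form in terms of the pairwise correlations, \(sr = \frac{r_{XY} - r_{XZ}r_{YZ}}{\sqrt{1 - r_{YZ}^2}}\), where the effect of \(Z\) is removed from \(Y\) only; a quick check:
# - part (semi-partial) correlation from the pairwise correlations
r_xy <- cor(iris$Sepal.Length, iris$Petal.Length)
r_xz <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_yz <- cor(iris$Petal.Length, iris$Sepal.Width)
(r_xy - r_xz * r_yz) / sqrt(1 - r_yz^2) # - compare with partCor$estimate above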
If we’re interested in a matrix of semi-partial correlations, where the first variable - the one from which the effects of no other variables will be removed - is found in the rows, and the second variable - the one from which the effects of all the remaining variables in the data set will be removed - is found in the columns:
irisSPCor <- spcor(dataSet, method = "pearson")
irisSPCor$estimate
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.00000000 0.30389212 0.3890689 -0.1357714
Sepal.Width 0.55758743 1.00000000 -0.5385056 0.2599849
Petal.Length 0.18506103 -0.13959991 1.0000000 0.3167424
Petal.Width -0.09001634 0.09394365 0.4415000 1.0000000
irisSPCor$p.value
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.000000e+00 0.0001734384 1.023577e-06 9.989683e-02
Sepal.Width 1.823304e-13 0.0000000000 1.672234e-12 1.417359e-03
Petal.Length 2.433732e-02 0.0906073062 0.000000e+00 8.776581e-05
Petal.Width 2.765857e-01 0.2560833579 1.944768e-08 0.000000e+00
irisSPCor$statistic
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.000000 3.854222 5.103228 -1.655866
Sepal.Width 8.116139 0.000000 -7.722074 3.253281
Petal.Length 2.275404 -1.703473 0.000000 4.034967
Petal.Width -1.092105 1.140168 5.945498 0.000000
We will spend some time inspecting the validity of this linear regression model as a whole. Usually termed model diagnostics, the following procedures are carried out to ensure that the model assumptions hold. Unfortunately, even for a model as simple as simple linear regression, testing for model assumptions tends to get nasty all the way down… Most of the criteria cannot be judged by simply assessing the values of the respective statistics, and one should generally consider the model diagnostics step a mini-study of its own - one that can easily take more time than reaching the conclusions about the model’s performance on the data set, because none of those conclusions truly hold for a model whose assumptions are not met. Sadly, this fact is overlooked too often in contemporary applied statistics.
In fact, we have already tested for this:
The linearity assumption is obviously not satisfied!
We have already played with this too, only in a different way:
What is rstandard()? See: standardized residual. We will discuss this in the session as we introduce the concept of leverage.
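The model object reg used in the diagnostics below is the simple linear regression of Petal.Length on Sepal.Length (its call is also visible in the influence.measures() output further down); resStReg is assumed to hold its standardized residuals, obtained via rstandard():
# - the simple linear regression model used in the diagnostics below
reg <- lm(Petal.Length ~ Sepal.Length,
          data = iris)
# - standardized residuals
resStReg <- rstandard(reg)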
Let’s see what the Shapiro-Wilk test tells us:
shapiro.test(reg$residuals)
Shapiro-Wilk normality test
data: reg$residuals
W = 0.99437, p-value = 0.831
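The Shapiro-Wilk test does not reject the normality of the residuals (p = .831). A normal Q-Q plot is a useful visual complement; here is one possible {ggplot2} sketch:
# - Normal Q-Q plot of the model residuals {ggplot2}
qqFrame <- data.frame(residual = reg$residuals)
ggplot(data = qqFrame,
       aes(sample = residual)) +
  stat_qq(size = 1.5, color = 'blue') +
  stat_qq_line(size = .25, color = 'red') +
  ggtitle("Normal Q-Q Plot of the Residuals") +
  theme_bw() +
  theme(panel.border = element_blank()) +
  theme(plot.title = element_text(hjust = .5))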
The model error (i.e. the variance, computed from the model residuals) should be constant at all levels of the criterion:
# Predicted vs. residuals {ggplot2}
predReg <- predict(reg) # get predictions from reg
resReg <- residuals(reg) # get residuals from reg
plotFrame <- data.frame(predicted = predReg,
residual = resReg)
# plot w. {ggplot2}
ggplot(data = plotFrame,
aes(x = predicted, y = residual)) +
geom_point(size = 1.5,
colour = 'blue') +
geom_smooth(method = 'lm',
size = .25,
alpha = .25,
color = "red") +
ggtitle("Predicted vs Residual") +
xlab("Predicted") + ylab("Residual") +
theme_bw() +
theme(legend.position = "none") +
theme(panel.border = element_blank())
This can be confusing if one does not recall that we have fitted the model from only one sample of observations. Imagine drawing samples of Sepal.Length and Petal.Length values repeatedly and trying to predict one from the other with a previously fitted simple linear regression model. It is quite probable that we would observe varying residuals (model errors) across different draws of Petal.Length observed at the same level of Sepal.Length. However, the distribution of these residuals - more precisely, its variance - must be constant across all possible levels of the criterion. That is why we inspect the scatter plot of the predicted Petal.Length values vs. their respective residuals. Essentially, one should not be able to observe any regularity in this plot; if any pattern emerges, i.e. if it is possible to predict the residuals from the levels of the criterion, the simple linear model should be abandoned.
Our simple linear regression model obviously suffers from heteroscedasticity, or a lack of constant variance across the measurements. The cause of the heteroscedasticity in this case is the existence of clusters of related observations, determined by the flower species in the iris data set.
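The visual impression can be backed up by a formal test; for example, ncvTest() from {car} (already loaded above) performs a score test of the constant error variance hypothesis, and a small p-value would confirm the heteroscedasticity seen in the plot:
# - score test for non-constant error variance {car}
ncvTest(reg)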
There are plenty of proposed procedures to detect influential cases in simple linear regression. The influence.measures() function will return most of the interesting statistics in this respect:
infMeas <- influence.measures(reg)
class(infMeas)
[1] "infl"
str(infMeas)
List of 3
$ infmat: num [1:150, 1:6] -0.09604 -0.07337 -0.04819 0.00823 -0.08682 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. ..$ : chr [1:6] "dfb.1_" "dfb.Sp.L" "dffit" "cov.r" ...
$ is.inf: logi [1:150, 1:6] FALSE FALSE FALSE FALSE FALSE FALSE ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. ..$ : chr [1:6] "dfb.1_" "dfb.Sp.L" "dffit" "cov.r" ...
$ call : language lm(formula = Petal.Length ~ Sepal.Length, data = iris)
- attr(*, "class")= chr "infl"
What you need to extract from infMeas now is the $infmat field:
# as data.frame
infReg <- as.data.frame(influence.measures(reg)$infmat)
head(infReg)
Sometimes people focus on Cook’s distances: they are used to detect influential cases by inspecting the effect of the removal of each data point on the linear model. Cook and Weisberg (1982) consider values greater than 1 to be troublesome:
wCook <- which(infReg$cook.d >1)
wCook # we seem to be fine here
integer(0)
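Cook’s distances are also available directly from base R via cooks.distance(), so the same check can be run without going through influence.measures():
# - Cook's distances from base R
cookD <- cooks.distance(reg)
which(cookD > 1) # - same criterion as above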
The leverage tells us how influential a data point is by measuring how unusual or atypical the combination of predictor values for that observation is - in the case of multiple linear regression; in the case of simple linear regression, think of it simply as a measure of how “lonely” a particular point is on the predictor scale. For an introductory text I recommend: Influential Observations, from David M. Lane’s Online Statistics Education: An Interactive Multimedia Course of Study; for details, see: Leverage in simple linear regression.
# - Leverage: hat values
# - Average Leverage = (k+1)/n, k - num. of predictors, n - num. observations
# - Also termed: hat values, range: 0 - 1
# - see: https://en.wikipedia.org/wiki/Leverage_%28statistics%29
# - Various criteria (twice the average leverage, three times the average leverage...)
# - Say, twice the leverage:
k <- 1 # - number of predictors
n <- dim(iris)[1] # - number of observations
wLev <- which(infReg$hat > 2*((k + 1)/n))
print(wLev)
[1] 9 14 39 43 106 108 118 119 123 131 132 136
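In simple linear regression the hat values have a simple closed form, \(h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\), which makes it clear why “lonely” points on the predictor scale receive high leverage. A quick check against the base R hatvalues() function:
# - leverage in simple linear regression, by hand
x <- iris$Sepal.Length
hByHand <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
head(cbind(hByHand, hatvalues(reg))) # - the two columns should match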
Finally, to inspect the influential cases visually, we can produce the Influence Plot, combining information on standardized residuals, leverage, and Cook’s distances:
## Influence plot
plotFrame <- data.frame(residual = resStReg,
leverage = infReg$hat,
cookD = infReg$cook.d)
ggplot(plotFrame,
aes(y = residual,
x = leverage)) +
geom_point(size = plotFrame$cookD*100, shape = 1, color = 'blue') +
ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
theme(plot.title = element_text(size=8, face="bold")) +
ylab("Standardized Residual") + xlab("Leverage") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
The final twist is related to the assumption that the model errors are not autocorrelated. Autocorrelation of a variable exists when its previously measured values are correlated with its subsequent measurements. It can be computed for different values of the lag, which defines how far apart the “present” and “past” observations of the variable are assumed to be. For example, given \(X = x_1, x_2, ..., x_n\), one can compute the autocorrelation at a lag of 1 by correlating \(X_i\) with \(X_{i-1}\), at a lag of 2 by correlating \(X_i\) with \(X_{i-2}\), etc.
In the setting of simple linear regression, the autocorrelation test that is most frequently met is the Durbin-Watson statistic:
# - D-W Statistic < 1 --> problematic {car}
durbinWatsonTest(reg)
lag Autocorrelation D-W Statistic p-value
1 0.3869188 1.203885 0
Alternative hypothesis: rho != 0
The Durbin-Watson statistic will have a value of 2 if no autocorrelation at all is present in the model residuals. It tests the null hypothesis that the autocorrelation is zero, so that its statistical significance (\(p < .05\), conventionally) indicates that the autocorrelation in residuals is present.
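The statistic itself is easily computed by hand from the residuals: \(D = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2}\), which is approximately \(2(1 - \hat{\rho})\) for a lag-1 autocorrelation of \(\hat{\rho}\):
# - Durbin-Watson statistic by hand
e <- residuals(reg)
sum(diff(e)^2) / sum(e^2) # - compare with the {car} output above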
Let’s estimate the linear regression model with Petal.Length as the criterion and Sepal.Length as the predictor again:
regModel <- lm(Petal.Length ~ Sepal.Length,
data = iris)
coefs <- coefficients(regModel)
print(coefs)
(Intercept) Sepal.Length
-7.101443 1.858433
Ok. Now, we need to learn about the variation in the model error. Let’s compute the standard deviation of the model residuals:
residuals <- residuals(regModel)
std_res <- sd(residuals)
print(std_res)
[1] 0.8648977
Enter the sim-fit loop: the parametric bootstrap for the simple linear regression model. We will use 10,000 bootstrap samples. In each iteration we use the original values of the predictor - iris$Sepal.Length - and the estimated model coefficients to compute the predictions: the y_hat values. Then we introduce randomness from the estimated standard deviation of the residuals - std_res - by drawing one observation from Normal(mean = y_hat, sd = std_res) and pairing that value with the respective value of the predictor. Then we estimate the linear regression model from the bootstrapped sample and pick up its coefficients. The distribution of the bootstrapped model coefficients is quite telling, as we will observe.
The population is to the sample as the sample is to the bootstrap samples. – John Fox, 2002, Bootstrapping Regression Models, Appendix to An R and S-PLUS Companion to Applied Regression.
n_bootstrap_samples <- 1:10000
bootstrap_estimates <- lapply(n_bootstrap_samples, function(x) {
# - bootstrap!
y_hats <- coefs[2] * iris$Sepal.Length + coefs[1]
y_hats <- sapply(y_hats, function(x) {
rnorm(1, x, std_res)
})
# - bootstrapped data:
boot_frame <- data.frame(Sepal.Length = iris$Sepal.Length,
Petal.Length = y_hats)
# - estimate linear model
boot_regModel <- lm(Petal.Length ~ Sepal.Length,
data = boot_frame)
# - extract coefficients
boot_coefs <- coefficients(boot_regModel)
return(as.data.frame(t(boot_coefs)))
})
bootstrap_estimates <- rbindlist(bootstrap_estimates)
head(bootstrap_estimates)
ggplot(bootstrap_estimates,
aes(x = Sepal.Length)) +
geom_histogram(binwidth = .001,
fill = 'darkorange',
color = 'darkorange') +
ggtitle("Boostrap estimates of the slope") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
coefs[2]
Sepal.Length
1.858433
mean(bootstrap_estimates$Sepal.Length)
[1] 1.858593
The bias of the slope estimate is then:
coefs[2] - mean(bootstrap_estimates$Sepal.Length)
Sepal.Length
-0.0001605159
And the variance of the slope estimate is:
var(bootstrap_estimates$Sepal.Length)
[1] 0.007342718
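The square root of this variance is the bootstrap estimate of the standard error of the slope; it should land close to the analytic standard error reported by summary():
# - bootstrap vs. analytic standard error of the slope
sqrt(var(bootstrap_estimates$Sepal.Length))
summary(regModel)$coefficients["Sepal.Length", "Std. Error"]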
Let’s see about the intercept then:
ggplot(bootstrap_estimates,
aes(x = `(Intercept)`)) +
geom_histogram(binwidth = .001,
fill = 'darkred',
color = 'darkred') +
ggtitle("Boostrap estimates of the intercept") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
coefs[1]
(Intercept)
-7.101443
mean(bootstrap_estimates$`(Intercept)`)
[1] -7.102706
var(bootstrap_estimates$`(Intercept)`)
[1] 0.255604
And the bias of the intercept estimate is then:
coefs[1] - mean(bootstrap_estimates$`(Intercept)`)
(Intercept)
0.00126272
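The bootstrap distributions also provide percentile confidence intervals for both coefficients; a minimal sketch using quantile():
# - 95% percentile bootstrap confidence intervals
apply(bootstrap_estimates, 2, quantile, probs = c(.025, .975))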
Session 14 will introduce the Multiple Linear Regression Model. You can rely on David M. Lane’s online tutorial to refresh your math:
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemund.
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.