1. Simulate correlated variables from a Multivariate Normal Distribution
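The chunks in this section assume that the packages used below are already attached, presumably in an earlier setup chunk of the full notebook. If you run this section on its own, a minimal setup sketch would be:
library(MASS)        # provides mvrnorm()
library(data.table)  # provides rbindlist()
library(ggplot2)     # used for the density plots
# optionally, call set.seed() here if you want the random draws to be reproducible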
Random draws from the Multivariate Normal Distribution in R can be performed with MASS::mvrnorm(). In the following chunk, meansVector defines a vector of the two variables' means, and covMat is their variance-covariance matrix:
meansVector <- c(15, 25)
print(meansVector)
[1] 15 25
covMat <- matrix(
  c(0.6856935,
    1.274315,
    1.2743154,
    3.116278),
  nrow = 2
)
print(covMat)
[,1] [,2]
[1,] 0.6856935 1.274315
[2,] 1.2743150 3.116278
The covariance between the two variables is set to 1.274315. Let's take a sample of size 150 from the Multivariate Normal Distribution defined by meansVector and covMat:
mvn_sample <- as.data.frame(mvrnorm(n = 150,
mu = meansVector,
Sigma = covMat
)
)
head(mvn_sample)
What is the observed correlation? (Since the matrix returned by mvrnorm() carries no column names, as.data.frame() labels the columns V1 and V2.)
cor(mvn_sample$V1,
mvn_sample$V2)
[1] 0.873424
NOTE. Keep in mind that meansVector and covMat are population parameters.
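As a quick added check (not part of the original chunk), the population correlation implied by covMat can be obtained with cov2cor() and compared with the observed value above:
cov2cor(covMat)
# the off-diagonal entry, 1.274315 / sqrt(0.6856935 * 3.116278), is roughly 0.87,
# so the observed correlation of about 0.873 sits close to its population counterpart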
Linear regression: predict mvn_sample$V2 from mvn_sample$V1:
reg_model <- lm(V2 ~ V1,
data = mvn_sample)
summary(reg_model)
Call:
lm(formula = V2 ~ V1, data = mvn_sample)
Residuals:
Min 1Q Median 3Q Max
-2.0706 -0.6106 -0.0777 0.6661 2.6332
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.1800 1.3014 -2.444 0.0157 *
V1 1.8831 0.0863 21.820 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9181 on 148 degrees of freedom
Multiple R-squared: 0.7629, Adjusted R-squared: 0.7613
F-statistic: 476.1 on 1 and 148 DF, p-value: < 2.2e-16
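As an added aside: in a simple linear regression, Multiple R-squared is just the squared Pearson correlation between the two variables, which the objects above let you verify directly:
cor(mvn_sample$V1, mvn_sample$V2)^2
# 0.873424^2 is approximately 0.7629, matching the Multiple R-squared reported above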
The coefficients are:
summary_reg_model <- summary(reg_model)
summary_reg_model$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.180016 1.30138920 -2.443555 1.571951e-02
V1 1.883055 0.08629802 21.820373 4.197941e-48
So the intercept estimate is -3.180016 with a standard error of 1.30138920, while the slope estimate is 1.883055 with a standard error of 0.08629802; the respective t-tests against zero and their Type I Error probabilities (the p-values) are also reported.
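If you also want interval estimates (an added aside, not part of the original code), confint() turns these coefficients and their standard errors into confidence intervals:
# 95% confidence intervals for the intercept and the slope of reg_model
confint(reg_model, level = .95)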
Next we simulate 10,000 samples from the population parameters and each time perform a linear regression, recording the values of the parameter estimates:
mvn_samples <- function(meansVector, covMat) {
  # draw one sample of size 150 from the Multivariate Normal
  # defined by meansVector and covMat
  sample <- mvrnorm(n = 150,
                    mu = meansVector,
                    Sigma = covMat)
  return(as.data.frame(sample))
}
lmSimulations <- lapply(1:10000, function(x) {
  # simulate a fresh data set and fit the simple linear regression
  newData <- mvn_samples(meansVector, covMat)
  model <- lm(V2 ~ V1,
              data = newData)
  # keep only the estimated intercept and slope
  return(data.frame(intercept = coefficients(model)[1],
                    slope = coefficients(model)[2]))
})
lmSimulations <- rbindlist(lmSimulations)
head(lmSimulations)
The distribution of the intercept estimates:
ggplot(lmSimulations,
aes(x = intercept)) +
geom_density(alpha = .15, color = "black") +
ggtitle("Distrubution of Model Intercept") +
xlab('Intercept') +
ylab('Density') +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
The estimated intercept from the first regression was -3.180016: no wonder the mean of this distribution falls close to -3, then, right?
mean(lmSimulations$intercept)
[1] -2.863263
The distribution of the slope estimates:
ggplot(lmSimulations,
aes(x = slope)) +
geom_density(alpha = .15, color = "black") +
ggtitle("Distrubution of Model Slope") +
xlab('Slope') +
ylab('Density') +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
mean(lmSimulations$slope)
[1] 1.857532
… while the initial regression estimated a slope of 1.883055!
And what about the standard errors? Initially, their values were 1.30138920 for the intercept and 0.08629802 for the slope. Let's see:
sd(lmSimulations$intercept)
[1] 1.274814
sd(lmSimulations$slope)
[1] 0.0848986
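To put the two side by side, here is a small added sketch that only reuses objects created above:
# analytic standard errors from the first model vs. standard deviations of the simulated estimates
data.frame(
  coefficient   = c("intercept", "slope"),
  analytic_SE   = unname(summary_reg_model$coefficients[, "Std. Error"]),
  simulation_SD = c(sd(lmSimulations$intercept), sd(lmSimulations$slope))
)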
Do you trust the Simple Linear Regression Model now? : ) Note. If you wonder why mean(lmSimulations$intercept) is -2.863263 while the initial regression estimated the intercept at -3.180016: compare the standard errors of the coefficients in the initial regression with the standard deviations of the estimates obtained from the numerical simulations; the discrepancy is small relative to the sampling variability of the intercept.
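To make that concrete, here is a small added sketch (not part of the original notebook) that computes the population regression coefficients implied by meansVector and covMat: for a bivariate normal, the population slope is the covariance divided by the variance of the predictor, and the intercept follows from the two means.
# population slope and intercept implied by the parameters:
# slope = Cov(V1, V2) / Var(V1); intercept = mu2 - slope * mu1
popSlope <- covMat[1, 2] / covMat[1, 1]
popIntercept <- meansVector[2] - popSlope * meansVector[1]
popSlope      # roughly 1.858
popIntercept  # roughly -2.88
The simulation means (1.857532 and -2.863263) land right next to these population values, while the initial estimates (1.883055 and -3.180016) fall well within one standard error of them.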
R Markdown
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here's a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.