1. The Student’s t-distribution
We begin by defining one Normal distribution with mean == 10 and variance == 5. We draw 100,000 samples of size == 1000 from it, obtain the mean of each sample, and visualize the distribution of the sample means.
# - libraries used throughout this section
library(ggplot2)
library(data.table)
# - number of samples
nsamples <- 100000
# - sample size
n <- 1000
# - normal parameters
# - mean:
mu <- 10
# - variance:
sigma2 <- 5
# - standard deviation:
std_dev <- sqrt(sigma2)
# - nsamples random draws from Normal(mu, std_dev), each of size n = 1000,
# - take the mean *and the variance* of each sample:
normalSamples <- lapply(1:nsamples, function(x) {
  sp <- rnorm(n, mu, std_dev)
  m <- mean(sp)
  v <- var(sp)
  return(
    data.frame(mean = m,
               variance = v)
  )
})
# - remember data.table::rbindlist from Session09?
normalSamples <- rbindlist(normalSamples)
# - The distribution of sample means:
ggplot(normalSamples,
       aes(x = mean)) +
  geom_histogram(binwidth = .001,
                 fill = 'darkred',
                 color = 'darkred') +
  ggtitle("The sampling distribution of the mean") +
  theme_bw() +
  theme(panel.border = element_blank()) +
  theme(plot.title = element_text(hjust = .5))
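A quick sanity check - this snippet is an addition to the walkthrough, relying on the objects defined above: the standard deviation of the sample means should be close to the theoretical standard error \(\sigma/\sqrt{n}\) of the mean.
# - the empirical standard deviation of the sample means vs. the
# - theoretical standard error sigma/sqrt(n):
print(paste0("Empirical SE: ", round(sd(normalSamples$mean), 5)))
print(paste0("Theoretical SE: ", round(std_dev/sqrt(n), 5)))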
Consider the following variable:
\[t = \frac{\overline{X} - \mu}{{S}/{\sqrt{n}}}\]
where \(\overline{X}\) is the sample mean, \(\mu\) the population mean, \(S\) the sample standard deviation, and \(n\) the sample size. This quantity is known to follow a t-distribution with \(n-1\) degrees of freedom:
tdist <- (normalSamples$mean - mu)/(sqrt(normalSamples$variance)/sqrt(n))
tdist <- data.frame(t = tdist)
# - The Student's t-distribution:
ggplot(tdist,
       aes(x = t)) +
  geom_histogram(binwidth = .001,
                 fill = 'darkorange',
                 color = 'darkorange') +
  ggtitle("The Student's t-distribution") +
  theme_bw() +
  theme(panel.border = element_blank()) +
  theme(plot.title = element_text(hjust = .5))
The probability density function of the t-distribution presents a really nice exercise in \(\LaTeX\):
\[f(t)=\frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}\] where \(\nu\) represents the degrees of freedom, and \(\Gamma\) is the Gamma function (just forget about it).
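As a quick check of the simulation - a minimal sketch, assuming the tdist data frame from above is still available and that your ggplot2 is version 3.4 or later (for after_stat()) - we can overlay the theoretical density, computed by R's dt() function, over the simulated values:
# - put the histogram on the density scale and overlay the theoretical
# - t density with n - 1 degrees of freedom:
ggplot(tdist, aes(x = t)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = .01,
                 fill = 'darkorange',
                 color = 'darkorange') +
  stat_function(fun = dt, args = list(df = n - 1), color = 'black') +
  ggtitle("Simulated t values vs. the theoretical density") +
  theme_bw() +
  theme(panel.border = element_blank()) +
  theme(plot.title = element_text(hjust = .5))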
2. The t-test
Assume that we want to test whether the mean of a sample drawn from some (presumably) Normal distribution is different from zero. We do not care if it is larger or smaller than zero, we just want to test if it is zero or not (or, in the lingo of mathematical statistics: if it is statistically different from zero). I will use a Normal distribution with mean == 10 and variance == 5 to draw a sample from, and then ask: is the sample mean significantly different from zero?
# - population (test) mean
real_mean <- 0
# - sample size
n <- 10000
# - normal parameters
# - mean:
mu <- 10
# - variance:
sigma2 <- 5
# - standard deviation:
std_dev <- sqrt(sigma2)
# - one random draw from Normal(mu, std_dev), sample size = n,
# - and take the mean
normalSample <- rnorm(n, mu, std_dev)
sample_mean <- mean(normalSample)
print(paste0("The sample mean is: ", sample_mean))
[1] "The sample mean is: 10.0274701328529"
sample_std_dev <- sd(normalSample)
print(paste0("The standard deviation is: ", sample_std_dev))
[1] "The standard deviation is: 2.22917426175989"
# - test statistic:
tStatistic <- (sample_mean - real_mean)/(sample_std_dev/sqrt(n))
print(paste0("The t-statistic is: ", tStatistic))
[1] "The t-statistic is: 449.828903234168"
# - degrees of freedom:
df <- n - 1
print(paste0("The number of degrees of freedom is: ", df))
[1] "The number of degrees of freedom is: 9999"
Now, let’s take a look again at:
\[t = \frac{\overline{X} - \mu}{{S}/{\sqrt{n}}}\]
We know that in our current experiment \(\overline{X}\) is around 10.03, \(\mu\) - the population mean - is zero (because we want to test against zero), \(S\) is around 2.23, and \(n\) is 10,000; we also know that this quantity follows a t-distribution with \(n-1\) degrees of freedom. What is the probability of obtaining a value of \(t\) of 449.83 from a t-distribution with 9,999 degrees of freedom?
2.1 The probability of obtaining some t-test statistic from a t-distribution
This might confuse you:
pt(abs(tStatistic), df, lower.tail = FALSE) * 2
[1] 0
But it is really easy: pt() is the cumulative distribution function for the t-distribution in R (remember the dpqr notation: dt() is its probability density function, pt() its cumulative distribution function, qt() its quantile function, and rt() its random number generator, similar to dnorm(), pnorm(), qnorm(), and rnorm()). But why did we multiply the probability of observing the value of the tStatistic from a t-distribution with df degrees of freedom by two? Because we do not care whether the sample mean that we are testing against zero is lower or higher than zero - so we have to consider the possibility of obtaining a positive as well as a negative value of the t-test statistic!
Of course, R has a handy t.test() function to perform t-tests…
t.test(normalSample, mu = real_mean, alternative = "two.sided")
One Sample t-test
data: normalSample
t = 449.83, df = 9999, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9.983774 10.071166
sample estimates:
mean of x
10.02747
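A side note that is not part of the original code: t.test() returns an object of class htest, and its components can be accessed directly when you need the numbers downstream.
# - store the test result and pull out its components:
tt <- t.test(normalSample, mu = real_mean, alternative = "two.sided")
tt$statistic  # - the t-statistic
tt$p.value    # - the p-value
tt$conf.int   # - the 95 percent confidence interval
tt$estimate   # - the sample mean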
Hint. Think: alternative = "two.sided" in the t.test() call above. Please: study the documentation, read something about the t-test, and figure out for yourself - it is easy - why we can have (and what sense it makes to have) all three of alternative = "two.sided", alternative = "less", and alternative = "greater"!
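To nudge you along, a minimal sketch reusing the normalSample from above: run all three alternatives and compare the p-values.
# - the same sample, three different alternative hypotheses:
t.test(normalSample, mu = real_mean, alternative = "two.sided")$p.value
t.test(normalSample, mu = real_mean, alternative = "greater")$p.value
t.test(normalSample, mu = real_mean, alternative = "less")$p.value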
To test any sample mean against any hypothesized population value, just change the real_mean value in the code above. It does not have to be zero, of course:
t.test(normalSample, mu = 9.98, alternative = "two.sided")
One Sample t-test
data: normalSample
t = 2.1295, df = 9999, p-value = 0.03324
alternative hypothesis: true mean is not equal to 9.98
95 percent confidence interval:
9.983774 10.071166
sample estimates:
mean of x
10.02747
Or, by hand:
# - population (test) mean
real_mean <- 9.98
# - sample size, unchanged
n <- 10000
# - we reuse the sample we already drew (normalSample), together with its
# - mean (sample_mean) and standard deviation (sample_std_dev), so that
# - the result matches the t.test() output above
# - test statistic:
tStatistic <- (sample_mean - real_mean)/(sample_std_dev/sqrt(n))
print(paste0("The t-statistic is: ", tStatistic))
[1] "The t-statistic is: 2.12949403136401"
# - degrees of freedom:
df <- n - 1
print(paste0("The number of degrees of freedom is: ", df))
[1] "The number of degrees of freedom is: 9999"
# - p-value
pvalue <- pt(abs(tStatistic), df, lower.tail = FALSE) * 2
print(paste0("The p-value is: ", pvalue))
[1] "The p-value is: 0.0332377653132108"
Remember: p-value < .05 is the conventional threshold - well, one of the two conventional thresholds, the other being .01 - below which we call the result of statistical hypothesis testing significant. Let’s discuss the interpretation of this probability in exact terms.
2.2 Statistical Hypothesis Testing: the null and the alternative hypothesis
First, let’s take a look at the t-test statistic once again:
\[t = \frac{\overline{X} - \mu}{{S}/{\sqrt{n}}}\]
What we can say about it is that it represents the difference between the sample mean \(\overline{X}\) and the hypothesized population mean \(\mu\), scaled by \({S}/{\sqrt{n}}\). The scaling is present only to make the difference follow the t-distribution. Now, let’s take a look at the t-distribution again:
# - The Student's t-distribution:
ggplot(tdist,
       aes(x = t)) +
  geom_histogram(binwidth = .001,
                 fill = 'darkorange',
                 color = 'darkorange') +
  ggtitle("The Student's t-distribution") +
  theme_bw() +
  theme(panel.border = element_blank()) +
  theme(plot.title = element_text(hjust = .5))
and note that its mean is zero (yes, you are looking at the result of a numerical simulation, but trust me: it is zero). So the probability density around zero is very high. Let’s assume now that we expect to see no difference between the sample mean and the population mean: we expect to observe a t-test statistic of zero or close to zero. We call this assumption the null hypothesis in statistical hypothesis testing, and note how the t-distribution, centered around zero, nicely represents the belief that there is no difference between the sample mean and the population mean!
Now, the hypothesis that we adopt when we reject the null hypothesis is called the alternative hypothesis, and it is very simple in the t-test: it states that the sample mean really is different from the population mean. Ok, how do we know when to reject the null hypothesis and when not? Simply put, by doing:
pt(abs(tStatistic), df, lower.tail = FALSE) * 2
which tells us the probability of obtaining a given value of the t-test statistic from a distribution that represents our null hypothesis. So, if this probability - the p-value in the output of t.test() - is low, we understand that it is unlikely that we would have observed the value of the t-test statistic computed from our sample if the sample had indeed been drawn from a population with the hypothesized (test) mean. And then, by convention, we say: if that probability is lower than .05, the finding is statistically significant.
The p-value is tied to the probability of committing a Type I Error (a.k.a. “a false positive”) in statistics: rejecting a null hypothesis when the null hypothesis is in fact true. Committing that error would be the same as claiming that some result is significant while in fact it occurred by chance.
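To make this tangible, here is a small simulation - a sketch under made-up settings (1,000 replications, samples of size 100 from a standard Normal): when the null hypothesis is true, a test at the .05 level should reject in roughly 5% of the replications.
# - simulate many t-tests under a true null hypothesis: the population
# - mean really is zero, so every rejection is a false positive
set.seed(42)
false_positive <- sapply(1:1000, function(i) {
  s <- rnorm(100, mean = 0, sd = 1)
  t.test(s, mu = 0)$p.value < .05
})
# - the false positive rate should be close to .05:
mean(false_positive)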
2.3 Building an intuition on the value of a t-test, the respective p-value, and the sample size
Remember that we have used the population mean of real_mean == 0 and then drawn a sample from a Normal with a mean of mu == 10 and variance of sigma2 == 5 to exemplify the t-test? Let’s see what happens if we vary the sample mean as c(10, 5, 1, .5, .1, .01, .001, .0001):
options(scipen = 999)
# - population (test) mean
real_mean <- 0
# - sample size
n <- 10000
# - normal parameters
# - mean:
mu <- c(10, 5, 1, .5, .1, .01, .001, .0001)
# - variance:
sigma2 <- 5
# - standard deviation:
std_dev <- sqrt(sigma2)
# - random draws from Normal(mu, std_dev), sample size = n,
# - and then take the mean
t_tests <- lapply(mu, function(x) {
  test_sample <- rnorm(n, x, std_dev)
  test_result <- t.test(test_sample, mu = real_mean, alternative = "two.sided")
  return(
    data.frame(sample_mean = x,
               population_mean = real_mean,
               t = round(test_result$statistic, 3),
               p = round(test_result$p.value, 3))
  )
})
t_tests <- rbindlist(t_tests)
print(t_tests)
So, when the sample mean was taken to be .01 or lower, the t-test was no longer able to differentiate it from zero - given the sample size of n = 10000 that we used. Please observe how the value of the t-test statistic decreased as the difference between the sample and the population mean decreased, and how, at the same time, the probability of obtaining a particular value of the t-test statistic from a t-distribution that represents the null hypothesis increased. What happens if we set the sample size, n, to one million?
options(scipen = 999)
# - population (test) mean
real_mean <- 0
# - sample size
n <- 1e06
# - normal parameters
# - mean:
mu <- c(10, 5, 1, .5, .1, .01, .001, .0001)
# - variance:
sigma2 <- 5
# - standard deviation:
std_dev <- sqrt(sigma2)
# - random draws from Normal(mu, std_dev), sample size = n,
# - and then take the mean
t_tests <- lapply(mu, function(x) {
  test_sample <- rnorm(n, x, std_dev)
  test_result <- t.test(test_sample, mu = real_mean, alternative = "two.sided")
  return(
    data.frame(sample_mean = x,
               population_mean = real_mean,
               t = round(test_result$statistic, 3),
               p = round(test_result$p.value, 3))
  )
})
t_tests <- rbindlist(t_tests)
print(t_tests)
Now we needed to go as low as sample_mean == 0.001 before the t-test became unable to tell that it is statistically significantly different from zero. Do not ever forget this exercise.
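Why does this happen? The standard error \({S}/{\sqrt{n}}\) shrinks with the square root of the sample size, so the same mean difference produces a larger t-statistic from a larger sample. A minimal back-of-the-envelope sketch, assuming the population standard deviation of \(\sqrt{5}\) that we used above:
# - the expected t-statistic for a mean difference of .01,
# - at the two sample sizes used above:
sd_pop <- sqrt(5)
for (n_demo in c(1e04, 1e06)) {
  se <- sd_pop / sqrt(n_demo)
  print(paste0("n = ", n_demo,
               ": expected t for a difference of .01 is ",
               round(.01 / se, 3)))
}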