Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
The Relational Data Model: we will work to build an understanding of join operations in relational data representations (e.g. sets of R dataframes and similar structures). Statistical Hypothesis Testing begins: we will learn about the \(\chi^2\) distribution and the related statistical test. Can you tell one distribution from another? Examples on real-world data.
We will use the Wikimedia Foundation’s Product Analytics/Comparison datasets in this session. In order to download the datasets you will need to open the following Google Spreadsheet: Wiki comparison [public].
Then download
Wiki comparison [public] - Dec 2020.csv
and Wiki comparison [public] - Dec 2019.csv
into your _data folder for this session. Of course:
library(tidyverse)
set.seed(9999)
The two new datasets that we will be using in this session were produced by the Wikimedia Foundation’s Product Analytics team and are publicly shared via a Google Spreadsheet. Both datasets describe each of the Wikimedia Foundation’s wikis - notice that only some of them are Wikipedias, since we also have Wikivoyage, Wikidata, the Wiktionaries, etc. - through a number of quantitative variables related mostly to editor and reader behavior. Let’s load the data and see what is there!
dataDir <- "C:/Users/goran/___DataKolektiv/__EDU/01_IntroDataScience_Non-Tech/_Code/IntroDataScience_NonTech_S09/_data/"
lF <- list.files(dataDir)
print(lF)
[1] "Wiki comparison [public] - Dec 2019.csv" "Wiki comparison [public] - Dec 2020.csv"
Ok, load:
wikiComparison2019 <- read.csv(paste0(dataDir, "Wiki comparison [public] - Dec 2019.csv"),
header = T,
check.names = F,
stringsAsFactors = F)
wikiComparison2020 <- read.csv(paste0(dataDir, "Wiki comparison [public] - Dec 2020.csv"),
header = T,
check.names = F,
stringsAsFactors = F)
The variables present in wikiComparison2019 and wikiComparison2020:
colnames(wikiComparison2019)
[1] ""
[2] "overall size rank"
[3] "monthly unique devices"
[4] "mobile unique devices"
[5] "mobile web pageviews"
[6] "mobile app pageviews"
[7] "unique devices per editor"
[8] "monthly editors"
[9] "monthly active editors"
[10] "monthly active administrators"
[11] "majority-mobile editors"
[12] "monthly new active editors"
[13] "second-month new editor retention"
[14] "monthly nonbot edits"
[15] "bot edits"
[16] "mobile edits"
[17] "visual edits"
[18] "anonymous edits"
[19] "revert rate"
[20] "edits Gini coefficient"
[21] "monthly structured discussions (Flow) messages"
[22] "content pages"
[23] "cumulative content edits"
[24] "edits per content page"
[25] "script direction"
[26] "database code"
[27] "project code"
[28] "language code"
[29] "domain name"
[30] "language name"
[31] "wiki name"
colnames(wikiComparison2020)
[1] "" "overall SIZE rank"
[3] "monthly unique devices" "mobile unique devices"
[5] "mobile web pageviews" "mobile app pageviews"
[7] "unique devices per editor" "monthly editors"
[9] "monthly active editors" "monthly active administrators"
[11] "majority mobile editors" "monthly new active editors"
[13] "second month editor retention" "monthly nonbot edits"
[15] "bot edits" "mobile edits"
[17] "visual edits" "anonymous edits"
[19] "revert rate" "edits Gini coefficient"
[21] "monthly structured discussions messages" "content pages"
[23] "cumulative content edits" "edits per content page"
[25] "script direction" "database code"
[27] "project code" "project code"
[29] "language code" "domain"
[31] "language name"
Note. Be very careful about the following fact:
glimpse(wikiComparison2019)
Rows: 732
Columns: 31
$ `` <chr> "English Wikipedia", "Spanish Wiki...
$ `overall size rank` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,...
$ `monthly unique devices` <chr> "857,711,498", "180,324,565", "108...
$ `mobile unique devices` <chr> "71.8%", "73.7%", "61.0%", "66.8%"...
$ `mobile web pageviews` <chr> "55.8%", "67.1%", "49.4%", "58.2%"...
$ `mobile app pageviews` <chr> "1.8%", "0.7%", "4.1%", "1.5%", "0...
$ `unique devices per editor` <chr> "6,315", "9,868", "5,340", "5,017"...
$ `monthly editors` <chr> "135,822", "18,274", "20,254", "19...
$ `monthly active editors` <chr> "30,924", "4,375", "5,442", "4,966...
$ `monthly active administrators` <int> 421, 54, 128, 108, 30, 165, 100, 9...
$ `majority-mobile editors` <chr> "26.2%", "31.6%", "14.6%", "18.4%"...
$ `monthly new active editors` <chr> "5,078", "1,043", "449", "735", "7...
$ `second-month new editor retention` <chr> "7.0%", "4.4%", "6.6%", "5.8%", "1...
$ `monthly nonbot edits` <chr> "4,126,576", "586,863", "757,880",...
$ `bot edits` <chr> "13.4%", "25.3%", "12.9%", "20.9%"...
$ `mobile edits` <chr> "7.4%", "11.6%", "2.3%", "4.2%", "...
$ `visual edits` <chr> "5.7%", "12.0%", "5.4%", "11.2%", ...
$ `anonymous edits` <chr> "19.1%", "30.9%", "9.8%", "14.3%",...
$ `revert rate` <chr> "11.8%", "19.7%", "9.2%", "7.6%", ...
$ `edits Gini coefficient` <dbl> 0.95, 0.96, 0.96, 0.96, 0.93, 0.97...
$ `monthly structured discussions (Flow) messages` <chr> "0", "0", "0", "3,075", "0", "0", ...
$ `content pages` <chr> "5,996,691", "1,572,302", "2,386,9...
$ `cumulative content edits` <chr> "403,123,640", "50,537,842", "94,2...
$ `edits per content page` <int> 67, 32, 39, 34, 30, 4, 29, 24, 23,...
$ `script direction` <chr> "left-to-right", "left-to-right", ...
$ `database code` <chr> "enwiki", "eswiki", "dewiki", "frw...
$ `project code` <chr> "wikipedia", "wikipedia", "wikiped...
$ `language code` <chr> "en", "es", "de", "fr", "ja", "en"...
$ `domain name` <chr> "https://en.wikipedia.org", "https...
$ `language name` <chr> "English", "Spanish", "German", "F...
$ `wiki name` <chr> "English Wikipedia", "Spanish Wiki...
You can use dplyr::glimpse() in a way similar to how we have previously used str() on dataframes in R: it just has a more user-friendly output. And the fact that I had in mind is that many numerical values were read in as members of the character() class in R!
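For example, here is a small sketch of the kind of clean-up such columns call for (the example values are copied from the glimpse() output above):
# - strip the thousands separators, then convert to numeric
as.numeric(gsub(",", "", "857,711,498"))
# - strip the percent sign and rescale to a proportion
as.numeric(gsub("%", "", "71.8%"))/100
We will apply exactly this gsub() + as.numeric() trick to the monthly active editors column a bit later.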
We can see that both dataframes have a column titled "" in the first position, carrying the name of the Wiki project that identifies a particular row. That is not good, and the solution is to call read.csv() with row.names = 1:
wikiComparison2019 <- read.csv(paste0(dataDir, "Wiki comparison [public] - Dec 2019.csv"),
header = T,
check.names = F,
row.names = 1,
stringsAsFactors = F)
wikiComparison2020 <- read.csv(paste0(dataDir, "Wiki comparison [public] - Dec 2020.csv"),
header = T,
check.names = F,
row.names = 1,
stringsAsFactors = F)
However, row names are not - trust me on this one - particularly useful in data analysis with R. Let me introduce a new column, produced from the row names in both dataframes, that will serve as our unique identifier for each Wiki in the data:
wikiComparison2019$wiki <- rownames(wikiComparison2019)
wikiComparison2020$wiki <- rownames(wikiComparison2020)
Note. You might have spotted the presence of the wiki name column in wikiComparison2019, which is the same as the wiki column that I have just produced. However, having an id column whose name is exactly the same across the dataframes under analysis is a bit more consistent and makes life easier. We will nevertheless keep the wiki name column in wikiComparison2019 to demonstrate something later on.
Do the column names in the two dataframes match?
identical(tolower(colnames(wikiComparison2019)),
tolower(colnames(wikiComparison2020)))
[1] FALSE
Not even after tolower()! What about the rownames?
identical(tolower(rownames(wikiComparison2019)),
tolower(rownames(wikiComparison2020)))
[1] FALSE
And it is not possible for them to be identical(), of course, as we know that dim(wikiComparison2019) is:
print(dim(wikiComparison2019))
[1] 732 31
while dim(wikiComparison2020) is:
print(dim(wikiComparison2020))
[1] 738 31
Be aware that situations like this are more than common in our line of work. The team that has produced these data - look at their Github repo - is a very good one indeed, but the complexity that they need to struggle with can be overwhelming from time to time, and it is no wonder that a small inconsistency like non-matching column names appears here and there. However, most of such things are easily fixed. Remember: nothing is perfect, and because that is so, most of your work as a Data Scientist/Analyst falls in the Data Wrangling arena - which means that mastering things like {dplyr}, {tidyr}, {data.table} and others is a must.
And I still want to compare Wikipedias. For example, I would like to be able to answer the following questions: is the number of active monthly editors
across the Wikipedias any different in 2019 and 2020? What do I need to do to find out?
We have already used dplyr::select() and dplyr::filter(), and we now want to introduce the join operations in {dplyr}. Let’s begin by selecting exactly what we need from the two dataframes in order to compare the numbers of active monthly editors in 2019 and 2020:
monthlyActive2019 <- wikiComparison2019 %>%
select(wiki,
`monthly active editors`) %>%
filter(str_detect(wiki, "Wikipedia"))
# - fix `monthly active editors`: from character() to numeric()
monthlyActive2019$`monthly active editors` <- as.numeric(
gsub(",", "", monthlyActive2019$`monthly active editors`)
)
monthlyActive2019 <- arrange(monthlyActive2019,
desc(`monthly active editors`))
The select() part of the pipeline should be self-explanatory - we need to know which Wikipedias we have in the dataset and which count of monthly active editors stands for which Wikipedia - while the filter() in combination with str_detect() from {stringr} serves to keep Wikipedias only (remember: there is more than Wikipedia in the Wikimedia Universe). The arrange(desc()) piece sorts the dataset by a decreasing count of editors, and before that we had to use a few lines to fix monthly active editors from character to dbl. The same for wikiComparison2020:
monthlyActive2020 <- wikiComparison2020 %>%
select(wiki,
`monthly active editors`) %>%
filter(str_detect(wiki, "Wikipedia"))
# - fix `monthly active editors`: from character() to numeric()
monthlyActive2020$`monthly active editors` <- as.numeric(
gsub(",", "", monthlyActive2020$`monthly active editors`)
)
monthlyActive2020 <- arrange(monthlyActive2020,
desc(`monthly active editors`))
And now I should just cbind() the two dataframes, right? Well, no. Q. Is the order of Wikipedias the same in the two dataframes? Are they of the same dimension (i.e. do they both encompass the same number of observations, given that they are made to have the same columns)? The answer to the second question is no. As for the first one: it is not necessarily so. So what do we do? We need to join these two dataframes.
colnames(monthlyActive2019)[2] <- 'monthlyActive2019'
colnames(monthlyActive2020)[2] <- 'monthlyActive2020'
monthlyActiveComparison <- left_join(monthlyActive2020,
monthlyActive2019,
by = 'wiki')
What does dplyr::left_join() do?
- In left_join() we have first specified monthlyActive2020, and only then monthlyActive2019, which means that monthlyActive2020 will be the left table in the join and monthlyActive2019 will be the right table in the join;
- by = 'wiki' - choosing a key that is present in both tables;
- the left_join() operation proceeds as follows: (1) look in the right table and find everything that has its match in the left table on the defined key column, (2) grab all the values from the columns in the right table and copy them into the left table wherever a corresponding match on the key column is found, and (3) keep everything from the left table and use NA to indicate that no match was found.
Now, my decision to use monthlyActive2020 as the left table was motivated by the fact that left_join() keeps everything from the left table, i.e. it does not eliminate the non-matching rows from it, and monthlyActive2020 has more observations than monthlyActive2019:
dim(monthlyActive2020)
[1] 302 2
dim(monthlyActive2019)
[1] 298 2
and I am thus certain that
sum(is.na(monthlyActiveComparison$monthlyActive2019))
[1] 4
is exactly four, because 302 - 298 == 4 is TRUE.
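If the mechanics of left_join() are still not fully transparent, here is a minimal toy sketch (the wikis and counts below are made up for illustration; they are not part of our data):
left <- data.frame(wiki = c("A", "B", "C"), editors2020 = c(10, 20, 30))
right <- data.frame(wiki = c("B", "A"), editors2019 = c(19, 9))
left_join(left, right, by = "wiki")
# - wiki "C" has no match in the right table, so its editors2019 value is NA,
# - while "A" and "B" get their 2019 counts copied over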
There is also something called right_join()
that accomplishes the same if you switch the order of your dataframes:
monthlyActiveComparisonRight <- right_join(monthlyActive2019,
monthlyActive2020,
by = 'wiki')
except that the order of columns will be different in the resulting dataframes obtained from left_join() and right_join().
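A quick sanity check on our data (a sketch; try it yourself) to confirm that the two results carry the same information, only arranged differently:
# - same set of wikis in both join results?
setequal(monthlyActiveComparison$wiki, monthlyActiveComparisonRight$wiki)
# - same set of columns, possibly in a different order?
setequal(colnames(monthlyActiveComparison), colnames(monthlyActiveComparisonRight))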
Every entry in monthlyActiveComparison$wiki contains " Wikipedia"; let’s fix that:
monthlyActiveComparison$wiki <-
gsub(" Wikipedia", "", monthlyActiveComparison$wiki)
Let’s visualize the number of active monthly editors in all Wikipedias which had >=50
of them either in 2019 or 2020:
plotMonthlyActive <- monthlyActiveComparison %>%
filter(monthlyActive2020 >= 50|monthlyActive2019 >= 50) %>%
pivot_longer(cols = -wiki,
names_to = 'observation',
values_to = 'editorCount')
ggplot(plotMonthlyActive, aes(x = wiki,
y = editorCount,
group = observation,
color = observation,
fill = observation)) +
geom_line(size = .25) +
geom_point(size = 1.5) +
scale_color_manual(values = c("darkorange", "darkred")) +
scale_y_continuous(trans = 'log') +
ggtitle("Wikipedia Comparison: Active Monthly Editors 2019/2020") +
ylab("log(Active Monthly Editors)") + xlab("Wikipedia") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5)) +
theme(axis.text.x = element_text(angle = 90,
hjust = 0.95,
vjust = 0.2,
size = 9)) +
theme(legend.position = "top")
The 2020 observations seem to follow the 2019 observations quite closely, but we will use a stricter method to test whether that is true later on. We need to focus more on join operations: there is more to left_join() and right_join(), of course!
You might recall how we decided to position monthlyActive2019 to the right and monthlyActive2020 to the left in the left_join() previously? The decision was motivated by the fact that there were four more observations present in monthlyActive2020. But I was incorrectly assuming - incorrectly only because I did not check - that the fact that there are more observations present in the later dataset implies that it at the same time encompasses everything found in the earlier dataset. That is not necessarily so. What if I want to make sure to collect only the observations for Wikipedias that are certainly found in both datasets? Enter inner_join():
monthlyActiveComparisonInner <- inner_join(monthlyActive2020,
monthlyActive2019,
by = 'wiki')
Let’s compare monthlyActiveComparisonInner with monthlyActiveComparison (the latter was produced from a left_join(), remember):
dim(monthlyActiveComparisonInner)
[1] 298 3
dim(monthlyActiveComparison)
[1] 302 3
And:
sum(is.na(monthlyActiveComparisonInner$monthlyActive2019))
[1] 0
sum(is.na(monthlyActiveComparisonInner$monthlyActive2020))
[1] 0
because only matching observations (i.e. records) from both tables were kept. In other words:
“The most important property of an inner join is that unmatched rows are not included in the result. This means that generally inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.” – Hadley Wickham & Garrett Grolemund, R for Data Science.
In contrast to inner_join(), the left_join(), right_join(), and full_join() operations are called outer joins. After learning about inner_join(), which in effect takes an intersection of the data, you might have wondered whether there is a join operation that keeps everything, since left_join() filters out from the right table and right_join() filters out from the left table. There is, and it is called full_join():
monthlyActiveComparisonInner <- full_join(monthlyActive2020,
monthlyActive2019,
by = 'wiki')
dim(monthlyActiveComparisonInner)
[1] 302 3
dim(monthlyActiveComparison)
[1] 302 3
sum(is.na(monthlyActiveComparisonInner$monthlyActive2019))
[1] 4
sum(is.na(monthlyActiveComparisonInner$monthlyActive2020))
[1] 0
There is something that we have learned about our data now. Remember how I thought the following:
But I was incorrectly assuming - incorrectly only because I did not check - that the fact that there are more observations present in the later dataset implies that it at the same time encompasses everything found in the former dataset.
Well, now we are sure that monthlyActive2020 has all the Wikipedias found in monthlyActive2019! Otherwise, following a full_join() operation, it would not be possible to observe 0 NAs after sum(is.na(monthlyActiveComparisonInner$monthlyActive2020))!
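We can also verify the same claim directly with a one-line membership check (a sketch):
# - is every Wikipedia from 2019 also present in 2020? This should return TRUE.
all(monthlyActive2019$wiki %in% monthlyActive2020$wiki)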
Note. Playing with joins across the tables that you are still inspecting, trying to learn as much as you can about their characteristics, while performing small post hoc tricks and checks like the ones I am doing here, is a very good way to get to know your data before even entering the EDA phase. I call it Exploratory Data Wrangling (EDW).
There is a class of join operations that we call filtering joins: semi_join() and anti_join(). They are very useful indeed. Back to my dilemma: say I want to find out which Wikipedias from monthlyActive2020 are present in monthlyActive2019:
monthlyActive2020_2019 <- semi_join(monthlyActive2020,
monthlyActive2019,
by = "wiki")
Now:
dim(monthlyActive2020_2019)
[1] 298 2
dim(monthlyActive2019)
[1] 298 2
sum(monthlyActive2020_2019$wiki %in% monthlyActive2019$wiki)
[1] 298
And then we have anti_join(), the complement of semi_join():
monthlyActive2019_2020 <- anti_join(monthlyActive2020,
monthlyActive2019,
by = "wiki")
monthlyActive2019_2020
And finally we’ve found the four observations in monthlyActive2020 that are not present in monthlyActive2019.
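If you only need the identifiers of those four wikis, and not the complete rows, base R’s setdiff() gives the same answer (a sketch):
# - wikis present in monthlyActive2020 but not in monthlyActive2019
setdiff(monthlyActive2020$wiki, monthlyActive2019$wiki)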
Let me show you something really interesting and useful that we can do with anti_join(). Consider monthlyActive2020:
head(monthlyActive2020)
What if one would want to compare the number of active monthly editors in each pair of Wikipedias? Look:
wikis <- monthlyActive2020$wiki
cmpData <- lapply(wikis, function(x) {
ltab <- filter(monthlyActive2020, wiki == x)
tab <- anti_join(monthlyActive2020, ltab, by = "wiki")
tab$wiki2 <- ltab$wiki
tab$monthlyActive2020_2 <- ltab$monthlyActive2020
return(tab)
})
cmpData <- reduce(cmpData, rbind)
head(cmpData)
But there is one problem with this approach: it produces duplicates… Let’s make a new key, uniquePair
, which is an example of what Wickham and Grolemund call “a surrogate key” in their book:
cmpData$uniquePair <- apply(cbind(cmpData$wiki, cmpData$wiki2),
1,
function(x) {
canonical <- sort(x)
if (x[1] == canonical[1] & x[2] == canonical[2]) {
return(paste(x, collapse = "-"))
} else {
return(paste(x[2], x[1], sep = "-"))
}
})
w <- which(duplicated(cmpData$uniquePair))
# - N.B. If length(w) == 0, then cmpData[-w, ] deletes everything...
if (length(w) > 0) {
  cmpData <- cmpData[-w, ]
}
Now the duplicates are gone and the Analyst can proceed by performing pair-wise comparisons of Wikipedias.
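For instance, here is a sketch of what such a pair-wise comparison could look like; the editorDiff column is introduced here only for illustration:
# - no duplicated pairs should remain (anyDuplicated() returns 0 if so)
anyDuplicated(cmpData$uniquePair)
# - e.g. the difference in active editors within each pair of Wikipedias
cmpData$editorDiff <- cmpData$monthlyActive2020 - cmpData$monthlyActive2020_2
head(cmpData[, c("uniquePair", "editorDiff")])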
One more thing: reduce() with rbind(), which we have used many times before, seems to be slow. Here’s a preview of the {data.table} package and its super-useful and super-fast rbindlist() function. First, produce the cmpData list again:
wikis <- monthlyActive2020$wiki
cmpData <- lapply(wikis, function(x) {
ltab <- filter(monthlyActive2020, wiki == x)
tab <- anti_join(monthlyActive2020, ltab, by = "wiki")
tab$wiki2 <- ltab$wiki
tab$monthlyActive2020_2 <- ltab$monthlyActive2020
return(tab)
})
How long does reduce(cmpData, rbind) take?
system.time(reduce(cmpData, rbind))
user system elapsed
11.21 0.04 11.26
And how long does data.table::rbindlist(cmpData) take?
library(data.table)
system.time(rbindlist(cmpData))
user system elapsed
0 0 0
The documentation says:
The definition of ‘user’ and ‘system’ times is from your OS. Typically it is something like The ‘user time’ is the CPU time charged for the execution of user instructions of the calling process. The ‘system time’ is the CPU time charged for execution by the system on behalf of the calling process.
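Also, a single system.time() call can be quite noisy. If you want a more reliable comparison, and assuming the {microbenchmark} package is installed, you could repeat each call several times (a sketch):
library(microbenchmark)
# - repeat each expression 5 times and summarize the timings
microbenchmark(
  reduce = reduce(cmpData, rbind),
  rbindlist = data.table::rbindlist(cmpData),
  times = 5
)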
Theory: say X follows a Standard Normal Distribution (\(\mathcal{N}(0,1)\)). Take k = 3 such variables, square them, sum up the squares, and repeat the experiment 100,000 times.
stdNormals3 <- sapply(seq(1, 100000), function(x) {
  sum((rnorm(3, mean = 0, sd = 1))^2)
})
Q: How are these sums of squared standard normal variables distributed?
# - plot a histogram of the simulated sums
hist(stdNormals3, 50, main = "k = 3",
xlab = "Sums of squared Gaussians",
ylab = "Frequency",
col = "steelblue")
Repeat for k = 30:
stdNormals30 <- sapply(seq(1, 100000), function(x) {
  sum((rnorm(30, mean = 0, sd = 1))^2)
})
hist(stdNormals30, 50, main = "k = 30",
xlab = "Sums of squared Gaussians",
ylab = "Frequency",
col = "steelblue")
Here it is: the sum of k squared IID random variables - each of them distributed as \(\mathcal{N}(0,1)\) - follows a \(\chi^2\) distribution with k degrees of freedom.
par(mfrow = c(1, 2))
curve(dchisq(x, 3),
from = 0, to = 40,
main = "k = 3",
col = "blue",
xlab = "x", ylab = "Density")
curve(dchisq(x, 30),
from = 0, to = 120,
main = "k = 30",
col = "blue",
xlab = "x", ylab = "Density")
This probability distribution plays a very important role in statistical hypothesis testing; its domain encompasses strictly positive real numbers, and the probability density is given by:
\[f(x;k) = \begin{cases}\frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\Gamma(\frac{k}{2})}, & \text{for}\ x > 0;\\ 0, & \text{otherwise}\end{cases}\]
where \(\Gamma\) is the gamma function, with a property of \(\Gamma(n)=(n-1)!\) for any positive integer \(n\).
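To connect the simulation with this density, we can overlay dchisq() over the simulated k = 3 histogram (a sketch; note freq = FALSE, which puts the histogram on the density scale):
# - reset the plotting layout first
par(mfrow = c(1, 1))
hist(stdNormals3, breaks = 50, freq = FALSE,
     main = "Simulated sums of 3 squared Gaussians vs. dchisq(x, 3)",
     xlab = "x", col = "steelblue")
curve(dchisq(x, df = 3), from = 0, to = 40,
      add = TRUE, col = "darkred", lwd = 2)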
Assume the following: an urn contains white, blue, and red balls in the proportion 5:3:2, and we draw a sample of size n = 100 from the urn. Thus,
n <- 100
Our task is to determine whether the sample reflects the hypothesized distribution of balls in the urn. Let’s simulate this experiment in R:
# Step 1: Population parameters (probabilities)
populationP <- c(.5, .3, .2)
expectedCounts <- n * populationP
expectedCounts
[1] 50 30 20
Of course, 50 white, 30 blue, and 20 red balls is the expected outcome of the experiment, given the population parameters. We are dealing with a multinomial distribution here, obviously. Let’s start sampling from it, by first providing the population parameters to it:
# Step 2: Sampling
# random draw from a multinomial distribution of three events:
sample <- as.numeric(rmultinom(1, 100, prob = populationP))
sample
[1] 48 27 25
Does this sample deviate significantly from the expected counts (i.e. does our “theory” of the population parameters fit the empirical data well)? We use the \(\chi^2\) test to check it out:
# Step 3: Chi-Square Statistic:
chiSq <- sum(((sample - expectedCounts)^2)/expectedCounts)
print(paste0("The chi-Square statistic is: ", chiSq))
[1] "The chi-Square statistic is: 1.63"
df <- 3 - 1 # k == 3 == number of events
print(paste0("Degrees of freedom: ", df))
[1] "Degrees of freedom: 2"
sig <- pchisq(chiSq, df, lower.tail = F) # upper tail
print(paste0("Type I Error probability is: ", sig))
[1] "Type I Error probability is: 0.442639327361351"
print(paste0("Type I Error < .05: ", sig < .05))
[1] "Type I Error < .05: FALSE"
The \(\chi^2\) statistic, let us refresh our Stats 101, would be…
\[\chi^2 = \sum_{i=1}^{n}\frac{(Observed\:Counts_i - Expected\:Counts_i)^2}{Expected\: Counts_i}\]
And the probability that we are about to commit a \(Type\:I\: Error\) (i.e. accepting that the Observed Counts are different from the Expected Counts while in the population they are not) must be assessed from the cumulative \(\chi^2\) distribution, provided by pchisq() in R: pchisq(chiSq, df, lower.tail = F) - going for the upper tail to figure out how improbable a particular test value would be under a \(\chi^2\) distribution with df degrees of freedom.
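An equivalent way to reach the same decision (a sketch) is to compare the obtained statistic against the critical value of the \(\chi^2\) distribution at the chosen \(\alpha\) level:
# - critical value at alpha = .05 for df = 2
qchisq(.95, df = 2)
# - reject the null hypothesis only if the statistic exceeds the critical value
chiSq > qchisq(.95, df = 2)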
In this case, the conclusion is: the obtained sample really looks like it comes from the specified population.
Now with a sample from a different distribution:
# random draw from a multinomial distribution of three events:
populationP <- c(.3, .3, .4)
sample <- as.numeric(rmultinom(1, 100, prob = populationP))
sample
[1] 29 27 44
# Step 3: Chi-Square Statistic:
expectedCounts <- populationP * 100
chiSq <- sum(((sample - expectedCounts)^2)/expectedCounts)
print(paste0("The chi-Square statistic is: ", chiSq))
[1] "The chi-Square statistic is: 0.733333333333333"
df <- 3 - 1 # k == 3 == number of events
print(paste0("Degrees of freedom: ", df))
[1] "Degrees of freedom: 2"
sig <- pchisq(chiSq, df, lower.tail = F) # upper tail
print(paste0("Type I Error probability is: ", sig))
[1] "Type I Error probability is: 0.693040620086442"
print(paste0("Type I Error < .05: ", sig < .05))
[1] "Type I Error < .05: FALSE"
To perform this kind of hypothesis test in R, use the chisq.test() function. Note that the call below passes a second numeric vector as y, so chisq.test() treats x and y as factors and tests the independence of rows and columns of their contingency table - hence the warning about the approximation. The goodness-of-fit form that matches our hand computation is sketched right after the output:
# Testing the independence of rows and columns
chisq.test(x = sample,
y = populationP)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: sample and populationP
X-squared = 3, df = 2, p-value = 0.2231
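For comparison, the goodness-of-fit form of the same call - the one that matches our hand computation above - passes the hypothesized probabilities via the p argument (a sketch; output not shown):
# - goodness-of-fit test of the observed counts against populationP
chisq.test(x = sample, p = populationP)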
Q. Is the distribution of the numbers of Active Monthly Editors in Wikipedias in 2020 different from what it was in 2019?
We will focus only on Wikipedias with >=50
active monthly editors. Assume that the expected (theoretical, population) distribution is the one observed in 2019:
mac <- monthlyActiveComparison %>%
filter(monthlyActive2020 >= 50|monthlyActive2019 >= 50)
colnames(mac)[2:3] <- c('Observed', 'Expected')
head(mac)
mac$populationP <- mac$Expected/sum(mac$Expected)
mac$expectedCounts <- mac$populationP * sum(mac$Observed)
chiSq <- sum(((mac$Observed - mac$expectedCounts)^2)/mac$expectedCounts)
print(paste0("The chi-Square statistic is: ", chiSq))
[1] "The chi-Square statistic is: 3076.82364144078"
df <- dim(mac)[1] - 1 # k == number of Wikipedias == number of events
print(paste0("Degrees of freedom: ", df))
[1] "Degrees of freedom: 64"
sig <- pchisq(chiSq, df, lower.tail = F) # upper tail
print(paste0("Type I Error probability is: ", sig))
[1] "Type I Error probability is: 0"
print(paste0("Type I Error < .05: ", sig < .05))
[1] "Type I Error < .05: TRUE"
Let’s check with chisq.test():
populationP <- mac$Expected/sum(mac$Expected)
chisq.test(x = mac$Observed,
p = populationP)
Chi-squared test for given probabilities
data: mac$Observed
X-squared = 3076.8, df = 64, p-value < 2.2e-16
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.