Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
Getting to know {data.table}, the essential package for efficient processing of large datasets in RAM in R, and a comparison of {dplyr} and {data.table}: that would be the “data” part of today’s session. The “science” part considers conditional probabilities and the famous Bayes’ Theorem.
{nycflights13} should be there already; also:
library(tidyverse)
library(data.table)
dataDir <- paste0(getwd(), "/_data/")
data.table::fread()
We will use the flights dataset from the {nycflights13} package in this Session:
flightsFrame <- nycflights13::flights
dim(flightsFrame)
Let’s be clear: a data frame with 336776 rows and 19 columns is anything but “big” in any sense nowadays. However:
write.csv(flightsFrame, paste0(dataDir, "flights.csv"))
And now it is just a .csv file in our local filesystem.
system.time(flightsFrame_baseR <- read.csv(paste0(dataDir, "flights.csv"),
header = T,
row.names = 1,
stringsAsFactors = F,
check.names = F)
)
With data.table::fread():
system.time(flightsFrame_DT <- fread(paste0(dataDir, "flights.csv"),
header = T)
)
How do they compare?
cmp1 <- system.time(flightsFrame_baseR <- read.csv(paste0(dataDir, "flights.csv"),
header = T,
row.names = 1,
stringsAsFactors = F,
check.names = F)
)
cmp2 <- system.time(flightsFrame_DT <- fread(paste0(dataDir, "flights.csv"),
header = T)
)
as.numeric(cmp1[1])/as.numeric(cmp2[1])
The result may vary for various reasons; however, data.table::fread() does it an order of magnitude faster than base R’s read.csv(). There is no reason to abandon read.csv() (or the tidyverse {readr} package functions). But when it comes to really large datasets… fread().
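Beyond speed, fread() offers conveniences that read.csv() lacks; for instance, its select argument reads only the named columns, and nrows caps the number of parsed rows. A minimal sketch (the column names are those written to flights.csv above):
# - read only two columns and the first 1000 rows of the file:
sample_DT <- fread(paste0(dataDir, "flights.csv"),
                   select = c("arr_time", "dest"),
                   nrows = 1000)
dim(sample_DT)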
Please be reminded of the following:
class(flightsFrame_baseR)
class(flightsFrame_DT)
There is a specific data.table class attached to R objects used as data.tables.
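If a data.frame is already in memory, there is no need to re-read it from disk to obtain a data.table: as.data.table() makes a converted copy, while setDT() converts in place, by reference. A minimal sketch:
# - convert a copy of iris to a data.table, in place, by reference:
irisCopy <- iris
setDT(irisCopy)
class(irisCopy)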
rm(flightsFrame_baseR)
rm(flightsFrame)
Subsetting a data.table can be similar to what we have already learned about base R.
flightsFrame_DT$V1 <- NULL
flightsFrame_DT[3:4, ]
Exclude rows by negative indexing (taking a small dataset to illustrate):
irisDT <- data.table(iris)
irisDT[!3:7, ]
Filter rows in {data.table}:
flightsFrame_DT[month > 6]
Let’s compare with {dplyr}:
cmp1 <- system.time(flightsFrame_DT[month > 6])
cmp2 <- system.time(filter(flightsFrame_DT, month > 6))
print("{data.table}:")
print(cmp1)
print("{dplyr}:")
print(cmp2)
Filter using multiple conditions:
cmp1 <- system.time(flightsFrame_DT[month == 1 & dep_delay > 0])
cmp2 <- system.time(filter(flightsFrame_DT, month == 1 & dep_delay > 0))
print("{data.table}:")
print(cmp1)
print("{dplyr}:")
print(cmp2)
Sorting rows:
cmp1 <- system.time(flightsFrame_DT[order(dep_delay)])
cmp2 <- system.time(arrange(flightsFrame_DT, dep_delay))
print("{data.table}:")
print(cmp1)
print("{dplyr}:")
print(cmp2)
Enter data.table::setkey(), which sorts the table by the chosen column(s), by reference, and marks them as the table’s key:
setkey(flightsFrame_DT, dep_delay)
key(flightsFrame_DT)
cmp1 <- system.time(flightsFrame_DT[order(dep_delay)])
cmp2 <- system.time(arrange(flightsFrame_DT, dep_delay))
print("{data.table}:")
print(cmp1)
print("{dplyr}:")
print(cmp2)
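Besides speeding up ordering, a key enables fast binary-search subsetting: a value wrapped in .() on the left of the comma performs a keyed lookup instead of a full vector scan. A minimal sketch, given that dep_delay is currently the key:
# - binary search for all flights with dep_delay == 0:
onTime <- flightsFrame_DT[.(0)]
nrow(onTime)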
Please be aware that the system.time() results will vary, and that our flightsFrame_DT is far from the size at which comparisons between {data.table}, {dplyr}, and base R become truly meaningful.
Selecting columns is done in the following way. To return a data.table object from an existing data.table:
arr_time <- flightsFrame_DT[ , list(arr_time)]
class(arr_time)
The following accomplishes the same with . used as an alias for list():
arr_time <- flightsFrame_DT[, .(arr_time)]
class(arr_time)
To obtain a vector from a data.table column:
arr_time <- flightsFrame_DT[, arr_time]
class(arr_time)
Or:
arr_time <- flightsFrame_DT[['arr_time']]
class(arr_time)
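To select several columns at once, pass a character vector of names together with with = FALSE, or prefix a vector defined outside the table with .. so that {data.table} looks it up in the calling scope. A minimal sketch:
# - a character vector of column names; with = FALSE returns a data.table:
timeCols <- c("arr_time", "dep_time")
head(flightsFrame_DT[ , timeCols, with = FALSE])
# - equivalent, using the .. prefix:
head(flightsFrame_DT[ , ..timeCols])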
Now we can try a simple conjunction of filtering and selection operations in {data.table}, for example:
selectedFlights <-
flightsFrame_DT[arr_time > 100, .(arr_time, arr_delay, dest)]
head(selectedFlights)
The {dplyr} equivalent is:
selectedFlights_dplyr <- flightsFrame_DT %>%
select(arr_time, arr_delay, dest) %>%
filter(arr_time > 100)
head(selectedFlights_dplyr)
Let’s create new variables:
arr_time_hours <-
flightsFrame_DT[air_time > 20, .(air_time, air_time_hours = air_time/60)]
head(arr_time_hours)
Enter := - the column assignment operator in {data.table}, which allows us to modify a data.table object by reference (to be explained in the Session). We will first filter out all rows with an NA value in air_time:
dim(flightsFrame_DT)
flightsFrame_DT <- flightsFrame_DT[!is.na(air_time)]
dim(flightsFrame_DT)
Using := we can modify a data.table object without having to make a copy of it - which happens with both base R data.frame objects and in {dplyr}:
flightsFrame_DT[ , air_time_hours := air_time/60]
flightsFrame_DT[ , .(air_time, air_time_hours)]
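The := operator can also create several columns in one pass (via its functional form, `:=`()) and remove columns by reference by assigning NULL. A minimal sketch; speed_mph and long_haul are illustrative names, not part of the original dataset:
# - add two columns by reference in a single pass:
flightsFrame_DT[ , `:=`(speed_mph = distance/air_time_hours,
                        long_haul = air_time_hours > 6)]
# - remove the helper columns by reference (no copy is made):
flightsFrame_DT[ , c("speed_mph", "long_haul") := NULL]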
Grouping (by = ), aggregation, and joins
A simple aggregation:
flightsFrame_DT[ , .(avg_air_time = mean(air_time)), by = dest]
This is equivalent to {dplyr}:
flightsFrame_DT %>%
select(dest, air_time) %>%
group_by(dest) %>%
summarise(avg_air_time = mean(air_time))
Simple aggregation with filtering:
flightsFrame_DT[dep_delay <= 0,
.(avg_dep_delay = mean(dep_delay), avg_air_time = mean(air_time)),
by = dest]
Count rows (observations) by groups:
flightsFrame_DT[dep_delay <= 0, .N, by = dest]
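.N is a special symbol holding the number of rows in the current group. We can also group by several columns at once, and keyby sorts the result by the grouping columns. A minimal sketch:
# - count early or on-time departures by origin and destination, sorted:
flightsFrame_DT[dep_delay <= 0, .N, keyby = .(origin, dest)]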
To demonstrate a join operation in {data.table}, load the planes data.frame from nycflights13:
planes <- nycflights13::planes
head(planes)
As a reminder, this is how it would be done in {dplyr} with left_join() over tailnum (see Session 10):
flights_relations <- dplyr::left_join(flightsFrame_DT,
planes,
by = "tailnum")
In {data.table}, first use setkey(), then promote planes to a data.table object, and finally use merge():
setkey(flightsFrame_DT, tailnum)
planes <- data.table(planes)
setkey(planes, tailnum)
flightPlanes <- merge(flightsFrame_DT,
planes,
by = "tailnum",
all.x = T)
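With the keys in place, {data.table} also supports its own bracket join syntax: X[Y] looks up each row of Y in X, joining on the key (or on columns named via on =). The following sketch is equivalent to the merge() call above, up to row and column ordering:
# - for every flight, look up its plane; on = is stated explicitly for clarity:
flightPlanes2 <- planes[flightsFrame_DT, on = "tailnum"]
dim(flightPlanes2)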
Oh, one final thing… data.table::fwrite().
system.time(
write.csv(flightsFrame_DT, paste0(dataDir, "flightsFrame_DT_writecsv.csv"))
)
system.time(
fwrite(flightsFrame_DT, paste0(dataDir, "flightsFrame_DT_fwrite.csv"))
)
Imagine the following situation: there are two popular social groups, A and B, and we know how many men and women are members of each: in group A we find 87 men and 57 women, while in group B we find 44 men and 96 women.
probs <- data.frame(Group = c('A', 'B'),
Male = c(87, 44),
Female = c(57, 96))
probs
Now imagine that we want to make one random draw from this sample and identify one single individual. What is the probability to randomly select a man vs a woman?
Let’s see: we have
men <- sum(probs$Male)
print(men)
[1] 131
men, and
women <- sum(probs$Female)
print(women)
[1] 153
women, so the probability P(man) would be:
p_man <- men/(men + women)
p_man
[1] 0.4612676
where men + women is also the total sample size. The probability to randomly draw a woman from the sample is:
p_woman <- women/(men + women)
p_woman
[1] 0.5387324
Of course,
p_man + p_woman
[1] 1
Now: what would be the probability to randomly draw a man from the sample if we already know that we will be picking someone from group A? Let’s take a look at the sample once again:
probs
Obviously, we would focus our attention on the first row only, where we find 87 men and 57 women, neglecting the second row because we already know that the person is a member of group A.
cp_man_A <- probs$Male[probs$Group == 'A']/(probs$Male[probs$Group == 'A'] + probs$Female[probs$Group == 'A'])
cp_man_A
[1] 0.6041667
cp_woman_A <-
probs$Female[probs$Group == 'A']/(probs$Male[probs$Group == 'A'] + probs$Female[probs$Group == 'A'])
cp_woman_A
[1] 0.3958333
And again we have:
cp_woman_A + cp_man_A
[1] 1
The probabilities cp_man_A and cp_woman_A are called conditional probabilities and play a very important role in many mathematical models used in Data Science and Machine Learning:
\(P(Y|X) = \frac{P(Y{\cap}X)}{P(X)}\)
where \(P(Y|X)\) is the conditional probability of observing Y given that we already know that X obtains, while \(P(X{\cap}Y)\) is the joint probability of observing both X and Y. Let’s see: what is the joint probability of observing both a person from group A and a man? From the table we see that there are 87 men in group A, so that probability must be 87 divided by the total sample size, which is 284:
probs$Male[1]/sum(probs[, 2:3])
[1] 0.306338
Now we need to divide this joint probability by P(A) - the probability of drawing someone from group A at all:
sum(probs[1, 2:3])/sum(probs[, 2:3])
[1] 0.5070423
and the desired conditional probability is:
jointP <- probs$Male[1]/sum(probs[, 2:3])
totalP <- sum(probs[1, 2:3])/sum(probs[, 2:3])
jointP/totalP
[1] 0.6041667
Two events, \(X\) and \(Y\), are said to be statistically independent iff:
\[P(X{\cap}Y) = P(X)P(Y)\]
If \(P(X)\) is not zero, and given that
\[P(Y|X) = \frac{P(Y{\cap}X)}{P(X)}\],
that means that statistical independence implies something very intuitive, namely:
\[P(Y|X) = P(Y)\]
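We can test this on our sample: if gender and group membership were independent, the joint probability P(man ∩ A) would equal P(man)P(A). Reusing the quantities computed above:
# - joint probability P(man and A), computed above as jointP:
jointP
# - product of the marginals, P(man) * P(A):
p_man * totalP
# - approx. .306 vs .234: not equal, so gender and group are not independent here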
It can be easily shown that the following holds - since \(P(Y|X)P(X) = P(X{\cap}Y) = P(X|Y)P(Y)\), divide both sides by \(P(X)\):
\[P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}\]
The expression is called Bayes’ Theorem and plays a role of immense importance in contemporary mathematical statistics, Data Science and Machine Learning.
This is how we speak of its terms:
\(P(Y|X)\) is the posterior probability of obtaining \(Y\) from knowing \(X\),
\(P(X|Y)\) is the likelihood of obtaining \(X\) from knowing \(Y\), and
\(P(Y)\) is the prior probability of obtaining \(Y\).
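Let’s apply the theorem to our two groups: what is the posterior probability that a person comes from group A, given that we have drawn a man? The likelihood is P(man|A) (our cp_man_A), the prior is P(A) (our totalP), and the denominator is P(man) (our p_man):
# - Bayes' Theorem: P(A|man) = P(man|A) * P(A) / P(man)
cp_man_A * totalP / p_man
# - sanity check: 87 men in group A out of 131 men in total
probs$Male[1]/sum(probs$Male)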
Imagine that we wish to express our belief about the parameter \(p\) of a Binomial Distribution by another function. It can be shown that it makes sense to use the Beta Distribution, whose support is constrained to \([0, 1]\), to express such beliefs:
\[P(x;\alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\]
where \(B(\alpha,\beta)\) is the Beta function
\[B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}\]
and serves the purposes of normalization only.
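Base R’s beta() and gamma() functions let us verify this identity numerically, for, say, \(\alpha = 7\) and \(\beta = 13\):
# - B(alpha, beta) equals gamma(alpha) * gamma(beta) / gamma(alpha + beta):
all.equal(beta(7, 13), gamma(7) * gamma(13) / gamma(7 + 13))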
For example, a \(Beta(x;\alpha,\beta)\) distribution with \(\alpha = 7\) and \(\beta=13\) looks like this:
library(ggplot2)
x <- seq(.01, .99, by = .01)
density <- dbeta(x, 7, 13)
densFrame <- data.frame(x = x,
density = density)
ggplot(densFrame, aes(x = x, y = density)) +
geom_path(size = .15, group = 1, color = "red") +
ggtitle("Beta(3,7)") +
theme_bw() +
theme(panel.border = element_blank())
And this is how the \(Beta(x;\alpha,\beta)\) distribution with \(\alpha = 1\) and \(\beta=1\) (i.e. the Uniform prior) looks:
library(ggplot2)
x <- seq(.01, .99, by = .01)
density <- dbeta(x, 1, 1)
densFrame <- data.frame(x = x,
density = density)
ggplot(densFrame, aes(x = x, y = density)) +
geom_path(size = .15, group = 1, color = "red") +
ggtitle("Beta(1,1)") +
theme_bw() +
theme(panel.border = element_blank())
While the \(Beta(x;\alpha,\beta)\) distribution with \(\alpha = 1/2\) and \(\beta=1/2\) looks like this (this is also known as the Jeffreys prior):
library(ggplot2)
x <- seq(.01, .99, by = .01)
density <- dbeta(x, 1/2, 1/2)
densFrame <- data.frame(x = x,
density = density)
ggplot(densFrame, aes(x = x, y = density)) +
geom_path(size = .15, group = 1, color = "red") +
ggtitle("Beta(1/2,1/2)") +
theme_bw() +
theme(panel.border = element_blank())
It turns out that the \(Beta(\alpha, \beta)\) distribution has a nice property that makes it a suitable conjugate prior distribution in Bayesian inference for the Binomial Distribution. Namely, if our prior beliefs about the Binomial \(p\) parameter are expressed as \(Beta(p; \alpha, \beta)\), and we then observe \(x\) successes in \(n\) trials of a Binomial experiment, our posterior belief about the Binomial \(p\) parameter can be expressed as:
\[Beta(p;\alpha',\beta')\]
where
\[\alpha' = \alpha + x\]
and
\[\beta' = \beta + n - x\]
Let’s illustrate. Say that in the beginning we know nothing about the possible value of \(p\). We express this absence of knowledge by a uniform prior, \(Beta(\alpha=1, \beta=1)\):
x <- seq(.01, .99, by = .01)
density <- dbeta(x, 1, 1)
densFrame <- data.frame(p = x,
density = density)
ggplot(densFrame, aes(x = p, y = density)) +
geom_path(size = .35, group = 1, color = "red") +
ggtitle("Our a priori is: Beta(1,1)") +
theme_bw() +
theme(panel.border = element_blank())
Let’s assume that we then observe 1,000 coin tosses, of which 275 resulted in success (i.e. \(Head\)); we update our prior beliefs accordingly:
library(tidyr)
x <- seq(.01, .99, by = .01)
# - prior
alpha <- 1
beta <- 1
density <- dbeta(x, alpha, beta)
# - update
post_alpha <- alpha + 275
post_beta <- beta + 1000 - 275
post_density <- dbeta(x, post_alpha, post_beta)
densFrame <- data.frame(p = x,
prior = density,
posterior = post_density) %>%
tidyr::pivot_longer(-p,
names_to = "beliefs",
values_to = "value")
ggplot(densFrame, aes(x = p,
y = value,
group = beliefs,
color = beliefs)) +
geom_path(size = .35) +
ggtitle("Prior and Posterior") +
theme_bw() +
theme(panel.border = element_blank())
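Because the posterior is itself a Beta distribution, our updated beliefs can be summarized analytically: the posterior mean is \(\alpha'/(\alpha' + \beta')\), and qbeta() gives a central 95% credible interval for \(p\):
# - posterior mean of p:
post_alpha/(post_alpha + post_beta)
# - central 95% credible interval for p:
qbeta(c(.025, .975), post_alpha, post_beta)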
And what if we have some prior knowledge and do not wish to begin from a uniform prior distribution? What if we have already observed 1,000 coin tosses that resulted in 275 heads, and only then observe another 500 tosses resulting in 150 heads (weird, but still)?
library(tidyr)
x <- seq(.01, .99, by = .01)
# - prior
alpha <- 276
beta <- 726
density <- dbeta(x, alpha, beta)
# - update
post_alpha <- alpha + 150
post_beta <- beta + 500 - 150
post_density <- dbeta(x, post_alpha, post_beta)
densFrame <- data.frame(p = x,
prior = density,
posterior = post_density) %>%
tidyr::pivot_longer(-p,
names_to = "beliefs",
values_to = "value")
ggplot(densFrame, aes(x = p,
y = value,
group = beliefs,
color = beliefs)) +
geom_path(size = .35) +
ggtitle("Prior and Posterior") +
theme_bw() +
theme(panel.border = element_blank())
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemund.
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.