apply family of functions
Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
We want to provide an overview of the apply family of functions in R and focus on their similarities and differences.
lapply() and sapply()
Let’s create a vector of 100 uniformly distributed random numbers on the [0, 1] interval:
unifs <- runif(n = 100, min = 0, max = 1)
head(unifs)
[1] 0.429210843 0.605104316 0.181764671 0.814155390 0.008703795
[6] 0.329078791
We can find out which elements in unifs are > .5 by a simple R expression:
probable <- unifs > .5
sum(probable)
[1] 52
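The expression above works because comparison operators in R are vectorized, and because TRUE counts as 1 when summed. A minimal sketch of a few related one-liners follows; the names unifs_demo, above_half, n_above, share_above and idx_above are my own, and the seed is an arbitrary choice made only so the sketch is reproducible:

```r
set.seed(42)  # arbitrary seed, an assumption for reproducibility only
unifs_demo <- runif(n = 100, min = 0, max = 1)

above_half <- unifs_demo > .5     # logical vector, same length as the input
n_above <- sum(above_half)        # TRUE counts as 1, FALSE as 0
share_above <- mean(above_half)   # proportion of TRUE values
idx_above <- which(above_half)    # positions of the TRUE values
```

No loop and no apply-style function is needed for this; the functions below become useful once the operation is not already vectorized.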
But we can do the same with lapply():
probable <- lapply(unifs, function(x) {x > .5})
head(probable)
[[1]]
[1] FALSE
[[2]]
[1] TRUE
[[3]]
[1] FALSE
[[4]]
[1] TRUE
[[5]]
[1] FALSE
[[6]]
[1] FALSE
Except that probable is now a list, because lapply() returns a list by design. We want a vector, and there are two ways to get one. The first is to use unlist():
probable <- unlist(lapply(unifs, function(x) {x > .5}))
head(probable)
[1] FALSE TRUE FALSE TRUE FALSE FALSE
unlist() is easy to use:
a <- list(a = 5,
          b = 10,
          c = 17)
unlist(a)
a b c
5 10 17
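unlist() returns an atomic vector, so the elements of the list must end up sharing one type. A minimal sketch with a hypothetical list mixed_num (not from the notebook), in which a logical and an integer value are promoted to double while the names are carried over:

```r
# hypothetical example list: one logical, one integer, one double
mixed_num <- list(a = TRUE, b = 10L, c = 2.5)

as_num <- unlist(mixed_num)
typeof(as_num)   # "double": TRUE became 1, 10L became 10
names(as_num)    # "a" "b" "c"
```

R picks the most general type present in the list, so it pays to check typeof() on the result when the list elements are not all of the same kind.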
unlist() automatically converts a named list to a named vector, as we can see. Beware of implicit type conversion in R, though: if even one element is a character string, all elements are coerced to character:
a <- list(a = "5",
          b = 10,
          c = 17)
unlist(a)
a b c
"5" "10" "17"
The other way to obtain a vector in place of a list is to use the sapply() function:
probable <- sapply(unifs, function(x) {x > .5})
head(probable)
[1] FALSE TRUE FALSE TRUE FALSE FALSE
sapply() will try to simplify the lapply() result whenever it is possible to do so. Since our input was very simple (a vector of random numbers) and the output was also very simple (a list with each element holding a single logical value), sapply() was able to complete the task exactly as expected.
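For this simple case the two routes to a vector give exactly the same object. A small sketch to check that, using a fresh vector x (my own name, with an arbitrary seed) rather than the notebook's unifs:

```r
set.seed(1)  # arbitrary seed, an assumption for reproducibility only
x <- runif(10)

via_lapply <- unlist(lapply(x, function(v) {v > .5}))
via_sapply <- sapply(x, function(v) {v > .5})

identical(via_lapply, via_sapply)   # TRUE
identical(via_sapply, x > .5)       # TRUE: same as the vectorized expression
```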
However, you need to be careful when using sapply(). Let’s implement a slightly more complicated function:
probable_57 <- sapply(unifs, function(x) {
  point_5 <- x > .5
  point_75 <- x > .75
  return(c(point_5,
           point_75))
})
head(probable_57)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
[2,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
[,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33]
[1,] FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
[2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44]
[1,] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
[2,] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
[,45] [,46] [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55]
[1,] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
[2,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
[,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66]
[1,] FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[,67] [,68] [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76] [,77]
[1,] TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
[2,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[,78] [,79] [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88]
[1,] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
[2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[,89] [,90] [,91] [,92] [,93] [,94] [,95] [,96] [,97] [,98] [,99]
[1,] TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
[,100]
[1,] FALSE
[2,] FALSE
sapply() has returned the result as a matrix:
class(probable_57)
[1] "matrix" "array"
dim(probable_57)
[1] 2 100
The results of the point_5 <- x > .5 comparison are found in the first row, while the results of the point_75 <- x > .75 comparison are found in the second row. Each column corresponds to one element of unifs.
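One way to be careful about the shape of the result is vapply(), a base R relative of sapply() that this notebook does not otherwise cover. It takes a FUN.VALUE template and stops with an error if any result does not match it. A minimal sketch, with x, two_flags, per_element and shape_check being names of my own:

```r
set.seed(1)  # arbitrary seed, an assumption for reproducibility only
x <- runif(5)

# every call must return exactly two logical values
two_flags <- vapply(x, function(v) {
  c(v > .5, v > .75)
}, FUN.VALUE = logical(2))
dim(two_flags)                # 2 rows, one column per element of x

# transpose to get one row per element of x instead
per_element <- t(two_flags)

# a function returning one value does not match logical(2): vapply() stops
shape_check <- tryCatch(
  vapply(x, function(v) {v > .5}, FUN.VALUE = logical(2)),
  error = function(e) "shape mismatch"
)
```

With sapply() the shape of the output silently depends on what the function happens to return; with vapply() the expected shape is written down in the call.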
mapply() and Map()
Let’s now turn to mapply() and Map(). We have already seen Map() in action, e.g.:
v1 <- 1:10
v2 <- seq(2, 20, by = 2)
exps <- Map("^", v1, v2)
print(exps)
[[1]]
[1] 1
[[2]]
[1] 16
[[3]]
[1] 729
[[4]]
[1] 65536
[[5]]
[1] 9765625
[[6]]
[1] 2176782336
[[7]]
[1] 678223072849
[[8]]
[1] 2.81475e+14
[[9]]
[1] 1.500946e+17
[[10]]
[1] 1e+20
A list of results is returned, as in lapply(). We can unlist() that, of course:
exps <- unlist(Map("^", v1, v2))
print(exps)
[1] 1.000000e+00 1.600000e+01 7.290000e+02 6.553600e+04 9.765625e+06
[6] 2.176782e+09 6.782231e+11 2.814750e+14 1.500946e+17 1.000000e+20
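Map() walks through its vector arguments in parallel and hands the elements at the same position to the function, one pair at a time. A short sketch that spells this out with an anonymous function of two arguments; the argument names base and exponent are my own:

```r
v1 <- 1:10
v2 <- seq(2, 20, by = 2)

# the first call receives v1[1] and v2[1], the second v1[2] and v2[2], etc.
exps_named <- Map(function(base, exponent) {base^exponent}, v1, v2)
identical(exps_named, Map("^", v1, v2))   # TRUE

# a shorter argument is recycled: every element of 1:3 is squared
squares <- unlist(Map("^", 1:3, 2))
squares   # 1 4 9
```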
Now, we can accomplish exactly the same with lapply() or sapply() by rewriting our code in the following way:
l <- Map(list, v1, v2)
exps <- unlist(
  lapply(l, function(x) {
    x[[1]]^x[[2]]
  })
)
print(exps)
[1] 1.000000e+00 1.600000e+01 7.290000e+02 6.553600e+04 9.765625e+06
[6] 2.176782e+09 6.782231e+11 2.814750e+14 1.500946e+17 1.000000e+20
What have I done? I have first used Map() to create a list of two-element lists, where the first element of each pair comes from v1 and the second from v2. Look:
l <- Map(list, v1, v2)
head(l)
[[1]]
[[1]][[1]]
[1] 1
[[1]][[2]]
[1] 2
[[2]]
[[2]][[1]]
[1] 2
[[2]][[2]]
[1] 4
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 6
[[4]]
[[4]][[1]]
[1] 4
[[4]][[2]]
[1] 8
[[5]]
[[5]][[1]]
[1] 5
[[5]][[2]]
[1] 10
[[6]]
[[6]][[1]]
[1] 6
[[6]][[2]]
[1] 12
And then I have used lapply() to compute the powers:
exps <- unlist(
  lapply(l, function(x) {
    x[[1]]^x[[2]]
  })
)
print(exps)
Because lapply() takes a vector (or a list, but a list is a vector too) as its input, I can only use a function(x) of one argument; I have created a list of lists so that I can compute x[[1]]^x[[2]] in the function call.
And of course I could have used sapply() to simplify the result:
exps <- sapply(l, function(x) {
  x[[1]]^x[[2]]
})
print(exps)
[1] 1.000000e+00 1.600000e+01 7.290000e+02 6.553600e+04 9.765625e+06
[6] 2.176782e+09 6.782231e+11 2.814750e+14 1.500946e+17 1.000000e+20
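The one-argument restriction of lapply() and sapply() can also be worked around by iterating over positions instead of values. A minimal sketch of that idiom, where exps_idx is a name of my own and the function picks the matching elements of v1 and v2 by index:

```r
v1 <- 1:10
v2 <- seq(2, 20, by = 2)

# seq_along(v1) is 1, 2, ..., 10; i is used to index both vectors
exps_idx <- sapply(seq_along(v1), function(i) {
  v1[i]^v2[i]
})

# for this particular operation the vectorized form gives the same numbers
all.equal(exps_idx, v1^v2)   # TRUE
```

This relies on v1 and v2 having the same length; unlike Map(), nothing is recycled for you here.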
Another way to accomplish the same and avoid creating a list of lists to pass to lapply() or sapply() is to create a 2D array (a matrix) and pass it to apply():
# v is assumed to be built by binding v1 and v2 as columns
v <- cbind(v1, v2)
v
v1 v2
[1,] 1 2
[2,] 2 4
[3,] 3 6
[4,] 4 8
[5,] 5 10
[6,] 6 12
[7,] 7 14
[8,] 8 16
[9,] 9 18
[10,] 10 20
apply(v, 1, function(x) {
  x[1]^x[2]
})
[1] 1.000000e+00 1.600000e+01 7.290000e+02 6.553600e+04 9.765625e+06
[6] 2.176782e+09 6.782231e+11 2.814750e+14 1.500946e+17 1.000000e+20
Remember: the second argument to apply() (1 in this example) specifies the array dimension over which function(x) will operate: 1 for rows, 2 for columns, etc.
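To see the difference between the two margins, here is a small sketch on the same kind of matrix. I am assuming the matrix is built with cbind(v1, v2); row_sums and col_sums are names of my own:

```r
v1 <- 1:10
v2 <- seq(2, 20, by = 2)
v <- cbind(v1, v2)   # assumption: a 10 x 2 matrix with columns v1 and v2

row_sums <- apply(v, 1, sum)   # one value per row: 10 values
col_sums <- apply(v, 2, sum)   # one value per column: 2 values, named v1 and v2

length(row_sums)   # 10
col_sums           # v1 = 55, v2 = 110
```

With margin 1 the function receives each row as a vector of length 2; with margin 2 it receives each column as a vector of length 10.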
Now, what is mapply()? We have not used this function before. Its relationship to Map() is similar to the relationship of sapply() to lapply().
While Map() returns a list…
l <- Map(list, v1, v2)
head(l)
[[1]]
[[1]][[1]]
[1] 1
[[1]][[2]]
[1] 2
[[2]]
[[2]][[1]]
[1] 2
[[2]][[2]]
[1] 4
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 6
[[4]]
[[4]][[1]]
[1] 4
[[4]][[2]]
[1] 8
[[5]]
[[5]][[1]]
[1] 5
[[5]][[2]]
[1] 10
[[6]]
[[6]][[1]]
[1] 6
[[6]][[2]]
[1] 12
… mapply() tries to simplify:
l <- mapply(list, v1, v2)
head(l)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 2 3 4 5 6 7 8 9 10
[2,] 2 4 6 8 10 12 14 16 18 20
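When the function returns a single value per call, the simplification goes one step further and mapply() gives back a plain vector. A short sketch with the power example from before; exps_vec is a name of my own:

```r
v1 <- 1:10
v2 <- seq(2, 20, by = 2)

# each call returns one number, so the result simplifies to a numeric vector
exps_vec <- mapply(function(x, y) {x^y}, v1, v2)

is.list(exps_vec)            # FALSE
length(exps_vec)             # 10
all.equal(exps_vec, v1^v2)   # TRUE
```

So there is no need to unlist() here, which is the same convenience that sapply() offers over lapply().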
Map() is really just a wrapper around mapply(). mapply() accepts a SIMPLIFY argument that can be set to either TRUE or FALSE. Look:
l <- mapply(list, v1, v2, SIMPLIFY = FALSE)
head(l)
[[1]]
[[1]][[1]]
[1] 1
[[1]][[2]]
[1] 2
[[2]]
[[2]][[1]]
[1] 2
[[2]][[2]]
[1] 4
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 6
[[4]]
[[4]][[1]]
[1] 4
[[4]][[2]]
[1] 8
[[5]]
[[5]][[1]]
[1] 5
[[5]][[2]]
[1] 10
[[6]]
[[6]][[1]]
[1] 6
[[6]][[2]]
[1] 12
So Map() is really just the same as mapply(fun, fun_arguments, SIMPLIFY = FALSE).
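The wrapper relationship can be checked directly with identical(). A quick sketch, which also shows the parallel pair of sapply() and lapply(); the names from_Map, from_mapply, from_lapply and from_sapply are my own:

```r
v1 <- 1:10
v2 <- seq(2, 20, by = 2)

# Map() versus mapply() with simplification switched off
from_Map <- Map("^", v1, v2)
from_mapply <- mapply("^", v1, v2, SIMPLIFY = FALSE)
identical(from_Map, from_mapply)      # TRUE

# lapply() versus sapply() with simplification switched off
from_lapply <- lapply(v1, sqrt)
from_sapply <- sapply(v1, sqrt, simplify = FALSE)
identical(from_lapply, from_sapply)   # TRUE
```

Note the spelling: the argument is SIMPLIFY in mapply() but simplify in sapply().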
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.