Session 04: Functions and vectorization. Overview: R programming. Some non-tabular data representations: XML and JSON.

Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.


What do we want to do today?

Exactly as the Course Overview states: “Week 4. Serious programming in R begins: functions, vectorization. More data formats and structures (simple things in JSON and XML). Overview: data types and structures, control flow, functions and functional programming. Why do we have different and rich data structures in R: the philosophy of functional programming.” This session encompasses three “units”. In the first one (sections 1, 2) we will learn about user-defined functions in R, introduce matrices, and discuss vectorization in R. We will use the second third of our time in this session to provide an overview of (almost) everything learned about R thus far: data types and structures (i.e. classes), control flow, and functions and functional programming - in an attempt to start synthesizing our knowledge of R programming. Finally, while we are still at the beginning of our journey, we introduce the XML and JSON data formats in order to discover the world of non-tabular data representation. These two formats will be of utmost importance to us when we start contacting REST APIs from our R environments to obtain data.

0. Prerequisites.

Install the following packages:

install.packages('XML')
install.packages('jsonlite')
install.packages('httr')

{XML} is a well-known R package to manipulate, load, and store data in the XML format, while {jsonlite} does the same for JSON. We will use {httr} to send requests to a REST API.

1. Functions

We have already used a myriad of R functions: lapply(), sapply(), print(), paste0(), unique(), c()… And now we want to become able to write our own, user-defined functions! Think of a function as a piece of reusable code that can be called upon whenever we need it to perform the same action. Also, functions can vary in their application, and that variability is controlled by their arguments: paste0(a, b) will do one thing when a <- 5; b <- "Alpha" and another when a <- "Beta"; b <- 3. It is really easy to define a function in R:

sumAB <- function(a, b) {
  return(a + b)
}

1.1 Functions: their arguments and behaviour

Upon executing this code, we find a new object in our R environment, a function sumAB:

ls()
[1] "m"     "sumAB"

So, let’s start getting used to the fact that functions in R are first-class citizens and can perform many things that variables and other objects can: for example, a function can be an argument to another function, a function can return a function as its output… Let’s call sumAB() on some values:

result <- sumAB(a = 10, b = 2)
print(result)
[1] 12
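
To make the remark about functions being first-class citizens concrete, here is a minimal sketch in which one function is passed as an argument and another function is returned as a result. The names applyTwice() and makeAdder() are hypothetical and introduced only for this illustration:

```r
# A function that takes another function as an argument
# (applyTwice is a hypothetical name used only here)
applyTwice <- function(f, x) {
  return(f(f(x)))
}

# A function that returns a function
# (makeAdder is also a hypothetical name)
makeAdder <- function(n) {
  return(function(x) x + n)
}

addTen <- makeAdder(10)
applyTwice(addTen, 1)   # 1 + 10 + 10
# [1] 21
applyTwice(sqrt, 16)    # sqrt(sqrt(16))
# [1] 2
```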

The function return() used inside the sumAB() function tells R what the function's output is. In the assignment result <- sumAB(a = 10, b = 2), that output will be assigned to result. You do not have to use return(): without it, R returns the result of the last evaluation inside the function as its output, so

sumAB <- function(a, b) {
  a + b
}

works just fine. Still, many R programmers follow the convention of always using return() explicitly.

In the previous call to sumAB we have explicitly used the argument names, a and b. However, it would also work if we did this:

result <- sumAB(10, 2)
print(result)
[1] 12

How did R know that a should be 10 and b set to 2? There are three ways in which R binds the actual arguments - the ones supplied in a function call - to the formal arguments - the ones supplied in the definition of the function. sumAB(10, 2) worked by interpreting 10 and 2 as positional arguments - because the formal argument names a and b were not provided - and thus bound a = 10 and b = 2. What else can happen?

First, match by argument names:

sumAB(a = 5, b = 2)
[1] 7

Second, match by position:

sumAB(5, 2)
[1] 7

Then, we can mix the two: the arguments supplied by name are matched first, and the remaining ones are matched by position:

sumAB(2, a = 5)
[1] 7

And finally, partial matching by argument name: if we have long formal argument names, we can abbreviate them (only when the match is unambiguous):

sumSquares <- function(a_Argument, b_Argument) {
  return(a_Argument^2 + b_Argument^2)
}
sumSquares(a_ = 10, 3)
[1] 109
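
The matching rules can be tried side by side. The sketch below uses two hypothetical functions, divide() and scale2(), that are not part of the course material. The second one shows what happens when an abbreviation fits more than one formal argument: R refuses to guess and signals an error, which we catch here with tryCatch():

```r
# A hypothetical function with two clearly different formal arguments
divide <- function(numerator, denominator) {
  return(numerator / denominator)
}

divide(10, 2)                  # positional matching
# [1] 5
divide(denominator = 2, 10)    # exact name first, the rest by position
# [1] 5
divide(num = 10, den = 2)      # partial matching, unambiguous
# [1] 5

# Two formal arguments that share a prefix: `val` is ambiguous
scale2 <- function(value_a, value_b) {
  return(value_a * value_b)
}
ambiguous <- tryCatch(scale2(val = 2, 3),
                      error = function(e) "error")
ambiguous
# [1] "error"
```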

1.2 Functions, vectorization, and functional programming

Do not be surprised by the following:

squares <- function(x) {
  return(x^2)
}
someVector <- 1:10
squares(someVector)
 [1]   1   4   9  16  25  36  49  64  81 100

“What - my functions are vectorized?” Well, if you use vectorized base R functions to compute their value - such as "^"() - yes, they are :) Vector in, vector out. Such a beautiful programming language. Look:

sumSquares <- function(x, y) {
  return(x^2 + y^2)
}
a <- 1:10
b <- 11:20
sumSquares(a, b)
 [1] 122 148 178 212 250 292 338 388 442 500

Note. Do not ever forget about recycling in R. Never.
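
As a reminder of what recycling does, here is a small sketch with made-up vectors: the shorter vector is repeated until it is as long as the longer one. When the lengths do not divide evenly, R still recycles but issues a warning, which is suppressed here only to keep the output short:

```r
x <- 1:6
y <- c(10, 20)
x + y    # y is recycled three times: 10 20 10 20 10 20
# [1] 11 22 13 24 15 26

# 5 is not a multiple of 2: R recycles anyway, with a warning
z <- suppressWarnings(1:5 + c(10, 20))
z
# [1] 11 22 13 24 15
```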

Remember Map(), only mentioned in our Session03? Look:

summa <- function(x, y) {
  return(x + y)
}
a <- 1:10
b <- 11:20
unlist(Map(summa, x = a, y = b))
 [1] 12 14 16 18 20 22 24 26 28 30

Or:

summa <- function(x) {
  return(x[1] + x[2])
}
arguments <- list(c(1, 11), 
                  c(2, 12), 
                  c(3, 13), 
                  c(4, 14), 
                  c(5, 15),
                  c(6, 16), 
                  c(7, 17), 
                  c(8, 18), 
                  c(9, 19), 
                  c(10, 20))
unlist(lapply(arguments, summa))
 [1] 12 14 16 18 20 22 24 26 28 30

In other words: of course you can do with your own functions whatever you were able to do with the built-in R functions.

1.3 Scoping

Take a look at the following example:

b <- 10
myFun <- function(a) {
  b <- 5
  return(a + b)
}
result <- myFun(3)

What will be the value of result: 13, or 8? Let’s see:

print(result)
[1] 8

Ok: how did R know that b should be 5 (as defined inside the myFun function) and not 10 (as defined outside the function, before the function itself was defined)?

Variable Scoping in R simplified. If there is a variable outside the function bearing the same name as the variable inside the function, the variable inside the function will have priority. In other words, define b to be 10 outside the function, and define b to be 5 inside the function, and when you use b to compute something inside the function it will have the value of 5, i.e. the one assigned to it inside the function.
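
Two consequences of this rule are worth checking, in a minimal sketch that reuses the example above. First, assigning to b inside the function does not change the b outside of it. Second, if the function does not define b at all, R looks for b where the function was defined. The name myFun2 is hypothetical:

```r
b <- 10
myFun <- function(a) {
  b <- 5            # a new, local b; the outer b is untouched
  return(a + b)
}
myFun(3)
# [1] 8
b                   # the b outside the function still holds 10
# [1] 10

# No b is defined inside myFun2, so R finds the outer b
myFun2 <- function(a) {
  return(a + b)
}
myFun2(3)
# [1] 13
```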

Functions in R are a serious topic indeed, and we can only provide a basic introduction here. To learn about the nuances and observe R functions in all their beauty, refer to Chapter 6, Functions, in Hadley Wickham's Advanced R.

1.4 Creating functions on the fly

Imagine that we need a function that can be customized for specific needs every now and then. For example, a function that returns x^n for a given n:

createPower <- function(exp) {
  f <- function(x) {x^exp}
  return(f)
}

createPower() is a function factory: it returns a function after customizing it for one given argument, exp in this case. Now we can easily create functions that return x^2, x^3, etc:

square <- createPower(2)
square(4)
[1] 16
cube <- createPower(3)
cube(10)
[1] 1000
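
Function factories combine nicely with functionals. The sketch below builds a whole list of power functions with lapply() and then applies each of them to the same value. The call to force() is an addition that is not in the definition above: it makes sure exp is evaluated when the function is created, a common precaution in function factories, although this small example would likely work without it in current versions of R:

```r
createPower <- function(exp) {
  force(exp)  # evaluate exp now, so each function keeps its own exponent
  f <- function(x) {x^exp}
  return(f)
}

# A list of three functions: x^1, x^2, x^3
powers <- lapply(1:3, createPower)
powers[[2]](4)
# [1] 16

# Apply each function in the list to the same value
sapply(powers, function(f) f(2))
# [1] 2 4 8
```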

2. Vectorization, matrices, and sequences

Here is a glimpse at what follows in Session05 on vectorization and vector arithmetic. We have met various beings in R already, but I think that we have not been introduced to matrices yet:

square <- matrix(1:4, 
                 nrow = 2)
square
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Good. Now, what happens if:

square <- matrix(1:4, 
                 nrow = 2)
constant <- 10
constant*square
     [,1] [,2]
[1,]   10   30
[2,]   20   40

Vectorization. Look:

square <- matrix(1:4, 
                 ncol = 2)
constant <- 2
square^constant
     [,1] [,2]
[1,]    1    9
[2,]    4   16

Similarly:

constant <- 2
(1:10)^constant
 [1]   1   4   9  16  25  36  49  64  81 100

Do not confuse vectorization with linear algebra:

square1 <- matrix(1:4, 
                 nrow = 2)
square2 <- matrix(rep(10, 4), 
                  nrow  = 2)
square1 * square2
     [,1] [,2]
[1,]   10   30
[2,]   20   40

But that is not how matrices are multiplied:

square1 <- matrix(1:4, 
                 nrow = 2)
square2 <- matrix(rep(10, 4), 
                  nrow  = 2)
square1 %*% square2
     [,1] [,2]
[1,]   40   40
[2,]   60   60

Right? Check it out manually to confirm. So, * is a vectorized operator (a function, in fact) in R, while %*% is the matrix multiplication operator. We will cover more about vector arithmetic in our next session, Session05.
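
The manual check can also be written in R. In this sketch each entry of the product is computed as the sum of a row of the first matrix multiplied, element by element, by a column of the second; the variable name manual is introduced only for this illustration:

```r
square1 <- matrix(1:4, nrow = 2)
square2 <- matrix(rep(10, 4), nrow = 2)

# Entry [i, j] of the product: row i of square1 times column j of square2
manual <- matrix(0, nrow = 2, ncol = 2)
for (i in 1:2) {
  for (j in 1:2) {
    manual[i, j] <- sum(square1[i, ] * square2[, j])
  }
}
manual
#      [,1] [,2]
# [1,]   40   40
# [2,]   60   60
all(manual == square1 %*% square2)
# [1] TRUE
```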

rep() and seq() both produce vectors in R. In its basic use, rep() takes two arguments: x, which is a vector, and times, which says how many times x should be repeated:

myVec <- c("Paris", "London", "New York")
rep(x = myVec, times = 4)
 [1] "Paris"    "London"   "New York" "Paris"    "London"   "New York" "Paris"    "London"  
 [9] "New York" "Paris"    "London"   "New York"

But also:

square <- matrix(1:4,
                 nrow = 2)
rep(square, 3)
 [1] 1 2 3 4 1 2 3 4 1 2 3 4

Note how square was stripped to a numeric vector when rep() was applied to it. Also, times can play a different role if a vector is provided:

myVec <- c("Paris", "London", "New York")
rep(x = myVec, times = 1:3)
[1] "Paris"    "London"   "London"   "New York" "New York" "New York"

seq():

seq(from = 10, to = 100, by = 10)
 [1]  10  20  30  40  50  60  70  80  90 100

Easy.
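
Both functions have a few more arguments that come in handy. A short sketch of the ones used most often: each and length.out for rep(), length.out for seq(), and the helper seq_along():

```r
rep(c("a", "b"), each = 2)          # repeat every element in turn
# [1] "a" "a" "b" "b"
rep(c("a", "b"), length.out = 5)    # repeat until the length is 5
# [1] "a" "b" "a" "b" "a"
seq(from = 0, to = 1, length.out = 5)
# [1] 0.00 0.25 0.50 0.75 1.00
seq_along(c("Paris", "London", "New York"))   # one index per element
# [1] 1 2 3
```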

Matrix subsetting is similar to what can be done with a dataframe:

square <- matrix(1:9,
                 nrow = 3)
square
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
square[1, 1]
[1] 1
square[2, 2]
[1] 5

Subsetting rows:

square[2, ]
[1] 2 5 8

Subsetting columns:

square[ , 2]
[1] 4 5 6

Rows and columns of a matrix can also be named:

square <- matrix(1:9,
                 nrow = 3)
colnames(square) <- paste0("c_", 1:3)
rownames(square) <- paste0("r_", 1:3)
square
    c_1 c_2 c_3
r_1   1   4   7
r_2   2   5   8
r_3   3   6   9

This is possible; however, square is a matrix and not a data.frame, so square$c_1 will not work.
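
What does work on a named matrix is subsetting by name inside the square brackets. A minimal sketch follows; the conversion with as.data.frame() at the end is one possible way to get $ back, and squareFrame is a name made up for this example:

```r
square <- matrix(1:9, nrow = 3)
colnames(square) <- paste0("c_", 1:3)
rownames(square) <- paste0("r_", 1:3)

square[, "c_1"]          # a column by name
# r_1 r_2 r_3 
#   1   2   3 
square["r_2", "c_3"]     # a single cell by row and column name
# [1] 8

# After converting to a data.frame, $ becomes available
squareFrame <- as.data.frame(square)
squareFrame$c_1
# [1] 1 2 3
```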

The matrix main diagonal:

diag(square)
[1] 1 5 9

The matrix transpose:

t(square)
    r_1 r_2 r_3
c_1   1   2   3
c_2   4   5   6
c_3   7   8   9

Sums:

colSums(square)
c_1 c_2 c_3 
  6  15  24 
rowSums(square)
r_1 r_2 r_3 
 12  15  18 

Let’s take a look at the following example: we will multiply the square matrix by c(2, 3):

square <- matrix(1:9,
                 nrow = 3)
square
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Now:

square * c(2, 3)
     [,1] [,2] [,3]
[1,]    2   12   14
[2,]    6   10   24
[3,]    6   18   18

Question for you: how does R order the elements in a matrix? Do you understand recycling in R fully?
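
A hint, as a sketch: R stores the elements of a matrix column by column, and c(2, 3) is recycled along that same order. Rebuilding the product by hand makes this visible. Since 9 is not a multiple of 2, R will typically also warn about the lengths when evaluating square * c(2, 3), so the warning is suppressed in the comparison below. The names recycled and byHand are introduced only for this illustration:

```r
square <- matrix(1:9, nrow = 3)

# A matrix is stored column by column
as.vector(square)
# [1] 1 2 3 4 5 6 7 8 9

# c(2, 3) is recycled along that column-wise order
recycled <- rep_len(c(2, 3), length(square))
recycled
# [1] 2 3 2 3 2 3 2 3 2

# Multiply element by element and fold back into three rows
byHand <- matrix(as.vector(square) * recycled, nrow = 3)
all(byHand == suppressWarnings(square * c(2, 3)))
# [1] TRUE
```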

3. Overview: R programming

A long, elaborated synthesis of what we have learned by now:

  • data types and structures,

  • control flow,

  • functions, functionals (lapply(), for example: a vector and a function enter -> a list of results comes out), and functional programming,

  • vectorization.
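
All four ingredients fit into a few lines. The following is only an illustrative sketch with made-up data; inputs and describe() are hypothetical names:

```r
# Data structure: a list holding vectors of different types
inputs <- list(a = 1:5, b = c("x", "y"), c = c(2.5, 3.5))

# Function + control flow: what to do with one element of the list
describe <- function(x) {
  if (is.numeric(x)) {
    return(sum(x^2))     # vectorization: x^2 squares every element
  } else {
    return(NA_real_)     # nothing sensible to compute for characters
  }
}

# Functional: a list and a function go in, a vector of results comes out
sapply(inputs, describe)
#    a    b    c 
# 55.0   NA 18.5 
```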

What is missing in this (big) picture are the principles of Object-Oriented Programming (OOP) in R, which are not covered by this introductory course. Anyway, what is important to understand is that at this point you already have a nice toolbox of R principles and techniques to develop serious Data Science projects - while our journey has just started!

4. Not all data are tabular: a glimpse at XML and JSON

4.1 Various data formats: why? Theory

Until now, we have used only data stored as .csv or .xlsx files. As we have seen, it was possible to map such files 1:1 onto dataframes in R: columns were defined, with the first row holding column names by convention… However, in Data Science, many times we face the situation in which we have to obtain some data from a source that does not necessarily deliver “tabular” data structures. Also, many times we will be facing data structures that are essentially not “tabular”: for example, data structures in which some elements contain certain fields which other elements do not - and not because there are missing data, but because sometimes it does not make sense for something to be described by a certain attribute at all. Imagine describing David Bowie, a famous English musician, by a data structure: we can know his date of birth, and the release dates of his albums perhaps, but should we place all that data into one single column? No, because the semantics of such a field would be weird, of course. How would we organize the rows in a dataframe describing David Bowie: the first row describes the person, while other rows describe his works of art? So, we have a column for dateOfBirth, and that column has a value in the first row of the dataframe only, and then we have a column for releaseDate, and that column holds NA in the first row and then a timestamp in all other rows that refer to his albums? Wait, what about his spouse, children, collaborators: assign a row in a dataframe to each one?

No. Of course, a list in R would do, correct?
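
Here is what such a list could look like, as a minimal sketch. The field names (name, dateOfBirth, albums, title, releaseYear) are chosen for this illustration only, and just two albums are listed:

```r
# A nested list: the person has a dateOfBirth, each album has a releaseYear,
# and no field is forced onto an element it does not describe
davidBowie <- list(
  name = "David Bowie",
  dateOfBirth = "1947-01-08",
  albums = list(
    list(title = "Low", releaseYear = 1977),
    list(title = "Blackstar", releaseYear = 2016)
  )
)

davidBowie$dateOfBirth
# [1] "1947-01-08"
sapply(davidBowie$albums, function(album) album$title)
# [1] "Low"       "Blackstar"
```

Adding his spouse, children, or collaborators would then mean adding new named elements to the list, not new rows or columns.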

JSON (abbr. JavaScript Object Notation) and XML (abbr. eXtensible Markup Language) are two data-interchange formats that are often used to store and transport data between different systems online. While it is not possible to dive into any details of JSON and XML representations in this session, it is important to make you aware of the fact that you can obtain any data from any online (or not) source in JSON or XML from within your R environment, transform them into R lists (or dataframes, if the transformation makes sense), and use them in your work.

In the following examples we will use the {httr} package to get in touch with one REST API on the Internet and obtain some data from it. Think of REST APIs as systems that can understand user requests for data formulated in a URL pointing to an online server, process such requests, and send the data back to the user. In the following examples we will be using the Wikibase API to obtain data from Wikidata, the world’s largest open knowledge base that comprises structured information from Wikipedia and many other sources.

NOTE. Not all code in this section will be immediately transparent to you in this Session. We will work with APIs more in the future sessions and learn how to obtain data from them in a step by step fashion. Our immediate goal is to prepare you for what follows in the weeks ahead of us: you need to see JSON and XML and start feeling comfortable with transferring data from these two data formats into R data structures. Lists, of course. We will analyze some JSON representations online in this Session to get a glimpse of how it is used to describe just any data at hand.
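
Before contacting any API, it may help to see the JSON to list transfer on a tiny, hand-written example. The JSON text below is made up for this illustration and is not an API response; only fromJSON() from {jsonlite} is used:

```r
library(jsonlite)

# A hand-written JSON document (illustrative only)
jsonText <- '{
  "name": "David Bowie",
  "occupations": ["singer", "songwriter"],
  "birth": {"year": 1947, "place": "London"}
}'

parsed <- fromJSON(jsonText)
class(parsed)
# [1] "list"
parsed$occupations      # a JSON array of strings becomes a character vector
# [1] "singer"     "songwriter"
parsed$birth$year       # a nested JSON object becomes a nested list
# [1] 1947
```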

4.2 A JSON response from a REST API

Load {httr}, {XML}, {jsonlite}:

library(httr)
library(XML)
library(jsonlite)

We will contact the Wikibase API to obtain all data stored in Wikidata on David Bowie (who has a Q identifier of Q5383 in this knowledge base). We will ask the Wikibase API to use JSON to describe its response. Here is what the JSON response will look like: Wikibase API response.

query <- 
  'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5383&languages=en&format=json'
response <- GET(URLencode(query))
response <- rawToChar(response$content)
response <- fromJSON(response)
class(response)
[1] "list"

Now, Wikidata is very complex (and thus very powerful as a descriptive system; after all, its goal is to be able to describe just anything that we can imagine, talk, and write about), so what fromJSON() returns is a nasty, nasty, nested list:

instanceOf_DavidBowie <- response$entities$Q5383$claims$P31
instanceOf_DavidBowie$mainsnak$datavalue$value

It is necessary to study the Wikidata DataModel carefully in order to be able to navigate the knowledge structures that it describes:

labelOf_DavidBowie <- response$entities$Q5383$labels$en$value
labelOf_DavidBowie
[1] "David Bowie"

However, once you do learn about Wikidata’s data model, more than 100 million highly structured items and the relations among them will become accessible to you. Order emerges from chaos in this case, I assure you. Besides JSON, there is XML (and many more, but we will focus on these two formats).
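
Since the same request pattern is repeated several times in this Session, it could be wrapped in two small helpers. This is only a sketch: wbQuery() and wbFetch() are hypothetical names, not part of {httr} or of the Wikibase API. wbFetch() needs {httr} and a network connection, and it adds a check that the course code above does not perform: httr::stop_for_status() stops with an error if the server did not answer successfully.

```r
# Build a wbgetentities URL for one item
# (wbQuery is a hypothetical helper name)
wbQuery <- function(id, format = "json") {
  paste0("https://www.wikidata.org/w/api.php",
         "?action=wbgetentities",
         "&ids=", id,
         "&format=", format)
}

# Fetch the response body as text, stopping early on an HTTP error
# (wbFetch is also hypothetical; it requires {httr} and network access)
wbFetch <- function(id, format = "json") {
  response <- httr::GET(wbQuery(id, format))
  httr::stop_for_status(response)
  httr::content(response, as = "text", encoding = "UTF-8")
}

wbQuery("Q5383")
# [1] "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5383&format=json"
```

With a connection available, fromJSON(wbFetch("Q5383")) should then give the same kind of nested list as before.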

4.3 An XML response from a REST API

Let’s get back to the Wikibase API and ask for the same data on David Bowie wrapped in an XML response:

query <- 
  'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5383&languages=en&format=xml'
response <- GET(URLencode(query))
response <- rawToChar(response$content)
response <- xmlParse(response)
response <- xmlToList(response)

The English label for David Bowie in Wikidata:

response$entities$entity$labels$label
     language         value 
         "en" "David Bowie" 
class(response$entities$entity$labels$label)
[1] "character"
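
The same transfer can be tried offline on a hand-written XML fragment. The fragment below is made up to resemble the label element from the response; it is not copied from the API. The sketch assumes that xmlToList() treats it the way it treated the real response, i.e. that an element carrying only attributes comes back as a named character vector:

```r
library(XML)

# A hand-written XML fragment (illustrative, not an API response)
xmlText <- '<labels><label language="en" value="David Bowie"/></labels>'

# asText = TRUE tells xmlParse() that it receives XML text, not a file name
parsed <- xmlToList(xmlParse(xmlText, asText = TRUE))

# Pick one attribute of the label element by its name
parsed$label[["value"]]
```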

4.4 All names of David Bowie

Now, David Bowie, in all languages available in Wikidata. First, we get the data.

query <- 
  'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5383&format=json'
response <- GET(URLencode(query))
response <- rawToChar(response$content)
response <- fromJSON(response)

Second: study the structure of the response, and then lapply() across the appropriate set of lists:

labels <- lapply(response$entities$Q5383$labels, function(x) {
  paste0(x$value, "(", x$language, ")")
})
labels <- paste(labels, collapse = ", ")
print(labels)
[1] "David Bowie(fr), David Bowie(de), David Bowie(en-ca), David Bowie(en-gb), ديفيد بوي(ar), Devid Boui(az), Дэвід Боўі(be), Дейвид Боуи(bg), David Bowie(br), David Bowie(bs), David Bowie(ca), David Bowie(co), David Bowie(cs), David Bowie(cy), David Bowie(da), David Bowie(diq), Ντέιβιντ Μπόουι(el), David Bowie(eo), David Bowie(es), David Bowie(et), David Bowie(eu), دیوید بویی(fa), David Bowie(fi), David Bowie(ga), David Bowie(gl), דייוויד בואי(he), डेविड बोवी(hi), David Bowie(hr), David Bowie(hu), David Bowie(id), David Bowie(io), David Bowie(is), David Bowie(it), デヴィッド・ボウイ(ja), David Bowie(jv), დევიდ ბოუი(ka), 데이비드 보위(ko), David Bowie(li), David Bowie(lt), Deivids Bovijs(lv), Дејвид Боуви(mk), David Bowie(nl), David Bowie(nn), David Bowie(oc), David Bowie(pl), David Bowie(pms), David Bowie(pt), David Bowie(pt-br), David Bowie(ro), Дэвид Боуи(ru), David Bowie(scn), David Bowie(sh), David Bowie(sk), David Bowie(sl), David Bowie(sq), Дејвид Боуи(sr), David Bowie(sv), డేవిడ్ బౌవీ(te), เดวิด โบอี(th), David Bowie(tr), Девід Боуї(uk), David Bowie(uz), David Bowie(vi), David Bowie(vls), 大卫·鲍伊(zh), 大衛寶兒(yue), David Bowie(de-ch), David Bowie(qu), David Bowie(la), Дэйвід Боўі(be-tarask), David Bowie(nb), David Bowie(mg), Դեյվիդ Բոուի(hy), David Bowie(ast), David Bowie(sco), David Bowie(lb), Дэвид Боуи(kk), David Bowie(ia), David Bowie(sd), David Bowie(gsw), डेविड बोवी(bho), David Bowie(fo), David Bowie(hsb), ਡੇਵਿਡ ਬੋਵੀ(pa), David Bowie(szl), David Bowie(af), David Bowie(an), David Bowie(bar), David Bowie(de-at), David Bowie(frp), David Bowie(fur), David Bowie(gd), David Bowie(ie), David Bowie(kg), David Bowie(lij), David Bowie(min), David Bowie(ms), David Bowie(nap), David Bowie(nds), David Bowie(nds-nl), David Bowie(nrm), David Bowie(pcd), David Bowie(rm), David Bowie(sc), David Bowie(sr-el), David Bowie(sw), David Bowie(vec), David Bowie(vo), David Bowie(wa), David Bowie(wo), David Bowie(zu), دیوید بویی(azb), ডেভিড বোয়ি(bn), David Bowie(eml), David Bowie(bcl), دەیڤد 
بویی(ckb), David Bowie(fy), David Bowie(lmo), David Bowie(nah), Devid Bowi(tt), Боуи Дэвид(cv), ഡേവിഡ് ബോയി(ml), David Bowie(nan), David Bowie(war), David Bowie(ilo), Дэвид Боуи(ba), 大衛寶兒(zh-hk), 大卫·鲍伊(zh-cn), 大卫·鲍伊(zh-hans), 大衛·鮑伊(zh-hant), 大衛·鮑伊(zh-mo), 大卫·鲍伊(zh-my), 大卫·鲍伊(zh-sg), 大衛·鮑伊(zh-tw), დევიდ ბოუი(xmf), David Bowie(ext), ديفيد باوى(arz), David Bowie(lfn), டேவிட் போவி(ta), ڈیوڈ بوئی(ur), David Bowie(en), 大卫·鲍伊(wuu), David Bowie(vro), ዴቪድ ቦሊ(am), Տէյվիտ Պոուի(hyw), David Bowie(kl), David Bowie(als), David Bowie(simple), David Bowie(tl)"

Didn’t I tell you how important lists and functional programming are in R?
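
The same step can be practiced without a network connection. The sketch below uses a small stand-in list, mockLabels (a made-up name), holding three entries copied from the output above, and vapply() in place of lapply(). vapply() is a stricter relative of lapply(): it returns a vector and checks that every result has the declared type and length.

```r
# A small stand-in for response$entities$Q5383$labels
mockLabels <- list(
  fr = list(language = "fr", value = "David Bowie"),
  az = list(language = "az", value = "Devid Boui"),
  lv = list(language = "lv", value = "Deivids Bovijs")
)

# Every result must be exactly one character string
labels <- vapply(mockLabels, function(x) {
  paste0(x$value, " (", x$language, ")")
}, FUN.VALUE = character(1))

paste(labels, collapse = ", ")
# [1] "David Bowie (fr), Devid Boui (az), Deivids Bovijs (lv)"
```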

Important sources, documentation, etc.

R Markdown

R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemund.

Exercises

  • E1. Write a function enlistVectors() that takes three numeric() vectors as its arguments and returns a named list encompassing them.

  • E2. Write a function mySumSquares that takes one argument, x, and decides (a) if x is numeric, returns a square of sum(x), but (b) if x is a matrix, returns a numeric encompassing the squared sums of all its rows. Carefully: if m <- matrix(1:4, nrow = 2), what is class(m)?

  • E3. Write a function that returns a function! Create a function fixCase(string, how) that returns (1) a function which transforms all characters in a string to upper case if how == 'upper' and (2) transforms all characters in a string to lower case if how == 'lower'. Hint: study base R functions tolower() and toupper().

  • E4. Do you use any online service like Spotify, or a social network like Twitter? Ok, here is what you need to do: find out if your service provides a REST API to obtain some data from it, read the API documentation and try to register as an API user. Hint: you will probably need to set a username and obtain your credentials (probably an authorization token, or a password, or something similar). In the following sessions we will be using APIs a lot to fetch data from the Internet.


Goran S. Milovanović

DataKolektiv, 2020/21

contact:


License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.


