Lab 05: {httr}, REST API calls, JSON, and ggplot2 in action

Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.


What do we want to do today?

We want to practice REST API access from R, a topic covered in Session 04. In the following example we will access a free REST API from within our R environment, collect the API response as JSON, convert it to an R list, and play with the data.

1. Set up the basic API access parameters

In this example we will rely on the free https://datausa.io/ API to obtain statistical data. Here is the intro to their API: datausa.io API (https://datausa.io/about/api/).

  • You will find the API base endpoint there:
baseEndPoint <- "https://datausa.io/api/data"

2. Make a simple API call

We will use {httr} to get in touch with the API. It is installed together with {tidyverse}, but library(tidyverse) does not attach it, so we load it explicitly.

library(httr)

Step 1. Define API parameters.

First we define the API parameters.

### --- compose API call
# - use base API endpoint
# - and concatenate with API parameters
# - from the following example: https://datausa.io/about/api/
# - parameter: drilldowns
drilldowns <- paste0("drilldowns=", "Nation")
# - parameter: measures
measures <- paste0("measures=", "Population")
# - parameters:
params <- paste("&", c(drilldowns, measures),
                sep = "", collapse = "")
cat(params)
&drilldowns=Nation&measures=Population

Step 2. Compose API call.

We put together the baseEndPoint with the API call parameters:

api_call <- paste0(baseEndPoint, "?", params)
cat(api_call)
https://datausa.io/api/data?&drilldowns=Nation&measures=Population

Step 3. Make API call.

We use httr::GET() to contact the API, ask for data, and fetch the result:

response <- GET(URLencode(api_call))
class(response)
[1] "response"

The URLencode(api_call) call uses the base R URLencode() function to take care of percent-encoding where necessary. Hint: always use URLencode(your_api_call).

We can see that response is an object of class response. It is a richly structured list:

str(response)
List of 10
 $ url        : chr "https://datausa.io/api/data?&drilldowns=Nation&measures=Population"
 $ status_code: int 200
 $ headers    :List of 27
  ..$ date                            : chr "Sat, 05 Feb 2022 11:29:49 GMT"
  ..$ content-type                    : chr "application/json; charset=utf-8"
  ..$ x-dns-prefetch-control          : chr "off"
  ..$ strict-transport-security       : chr "max-age=15552000; includeSubDomains"
  ..$ x-download-options              : chr "noopen"
  ..$ x-content-type-options          : chr "nosniff"
  ..$ x-xss-protection                : chr "1; mode=block"
  ..$ content-language                : chr "en"
  ..$ etag                            : chr "W/\"55b-jEIUyvQphH/gM3DVlQl2pEdoLeo\""
  ..$ vary                            : chr "Accept-Encoding"
  ..$ content-encoding                : chr "gzip"
  ..$ last-modified                   : chr "Sat, 05 Feb 2022 10:25:59 GMT"
  ..$ x-cache-status                  : chr "MISS"
  ..$ x-frame-options                 : chr "SAMEORIGIN"
  ..$ access-control-allow-origin     : chr "*"
  ..$ access-control-allow-credentials: chr "true"
  ..$ access-control-allow-methods    : chr "GET, POST, OPTIONS"
  ..$ access-control-allow-headers    : chr "DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type"
  ..$ x-cache-key                     : chr "GET/api/data?&drilldowns=Nation&measures=Population"
  ..$ cache-control                   : chr "max-age=1800"
  ..$ cf-cache-status                 : chr "HIT"
  ..$ age                             : chr "3830"
  ..$ expect-ct                       : chr "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\""
  ..$ report-to                       : chr "{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=FM8spWRDTZInGwYuO8rleDDLFu2qRN5h2Xy"| __truncated__
  ..$ nel                             : chr "{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}"
  ..$ server                          : chr "cloudflare"
  ..$ cf-ray                          : chr "6d8bcd9c7b3878ac-VIE"
  ..- attr(*, "class")= chr [1:2] "insensitive" "list"
 $ all_headers:List of 1
  ..$ :List of 3
  .. ..$ status : int 200
  .. ..$ version: chr "HTTP/2"
  .. ..$ headers:List of 27
  .. .. ..$ date                            : chr "Sat, 05 Feb 2022 11:29:49 GMT"
  .. .. ..$ content-type                    : chr "application/json; charset=utf-8"
  .. .. ..$ x-dns-prefetch-control          : chr "off"
  .. .. ..$ strict-transport-security       : chr "max-age=15552000; includeSubDomains"
  .. .. ..$ x-download-options              : chr "noopen"
  .. .. ..$ x-content-type-options          : chr "nosniff"
  .. .. ..$ x-xss-protection                : chr "1; mode=block"
  .. .. ..$ content-language                : chr "en"
  .. .. ..$ etag                            : chr "W/\"55b-jEIUyvQphH/gM3DVlQl2pEdoLeo\""
  .. .. ..$ vary                            : chr "Accept-Encoding"
  .. .. ..$ content-encoding                : chr "gzip"
  .. .. ..$ last-modified                   : chr "Sat, 05 Feb 2022 10:25:59 GMT"
  .. .. ..$ x-cache-status                  : chr "MISS"
  .. .. ..$ x-frame-options                 : chr "SAMEORIGIN"
  .. .. ..$ access-control-allow-origin     : chr "*"
  .. .. ..$ access-control-allow-credentials: chr "true"
  .. .. ..$ access-control-allow-methods    : chr "GET, POST, OPTIONS"
  .. .. ..$ access-control-allow-headers    : chr "DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type"
  .. .. ..$ x-cache-key                     : chr "GET/api/data?&drilldowns=Nation&measures=Population"
  .. .. ..$ cache-control                   : chr "max-age=1800"
  .. .. ..$ cf-cache-status                 : chr "HIT"
  .. .. ..$ age                             : chr "3830"
  .. .. ..$ expect-ct                       : chr "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\""
  .. .. ..$ report-to                       : chr "{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=FM8spWRDTZInGwYuO8rleDDLFu2qRN5h2Xy"| __truncated__
  .. .. ..$ nel                             : chr "{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}"
  .. .. ..$ server                          : chr "cloudflare"
  .. .. ..$ cf-ray                          : chr "6d8bcd9c7b3878ac-VIE"
  .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
 $ cookies    :'data.frame':    0 obs. of  7 variables:
  ..$ domain    : logi(0) 
  ..$ flag      : logi(0) 
  ..$ path      : logi(0) 
  ..$ secure    : logi(0) 
  ..$ expiration: 'POSIXct' num(0) 
  ..$ name      : logi(0) 
  ..$ value     : logi(0) 
 $ content    : raw [1:1371] 7b 22 64 61 ...
 $ date       : POSIXct[1:1], format: "2022-02-05 11:29:49"
 $ times      : Named num [1:6] 0 0.0207 0.0442 0.1067 0.1485 ...
  ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
 $ request    :List of 7
  ..$ method    : chr "GET"
  ..$ url       : chr "https://datausa.io/api/data?&drilldowns=Nation&measures=Population"
  ..$ headers   : Named chr "application/json, text/xml, application/xml, */*"
  .. ..- attr(*, "names")= chr "Accept"
  ..$ fields    : NULL
  ..$ options   :List of 2
  .. ..$ useragent: chr "libcurl/7.77.0 r-curl/4.3.2 httr/1.4.2"
  .. ..$ httpget  : logi TRUE
  ..$ auth_token: NULL
  ..$ output    : list()
  .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
  ..- attr(*, "class")= chr "request"
 $ handle     :Class 'curl_handle' <externalptr> 
 - attr(*, "class")= chr "response"

You need to check one thing: the server status response.

response$status_code
[1] 200

200 means that your request was processed successfully. Familiarize yourself with HTTP status codes; the following source is a good place to start: HTTP response status codes (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).

The result is found in response$content, but…

class(response$content)
[1] "raw"

What is raw? It means that your data were obtained as raw binary data and they need to be decoded into an R character class representation. It is easy:

resp <- rawToChar(response$content)
class(resp)
[1] "character"

Is resp lengthy?

nchar(resp)
[1] 1371
cat(resp)
{"data":[{"ID Nation":"01000US","Nation":"United States","ID Year":2019,"Year":"2019","Population":328239523,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2018,"Year":"2018","Population":327167439,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2017,"Year":"2017","Population":325719178,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2016,"Year":"2016","Population":323127515,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2015,"Year":"2015","Population":321418821,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2014,"Year":"2014","Population":318857056,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2013,"Year":"2013","Population":316128839,"Slug Nation":"united-states"}],"source":[{"measures":["Population"],"annotations":{"source_name":"Census Bureau","source_description":"The American Community Survey (ACS) is conducted by the US Census and sent to a portion of the population every year.","dataset_name":"ACS 1-year Estimate","dataset_link":"http://www.census.gov/programs-surveys/acs/","table_id":"B01003","topic":"Diversity","subtopic":"Demographics"},"name":"acs_yg_total_population_1","substitutions":[]}]}

Now we can see that the API response is indeed JSON. To work with JSON in R, we need to convert it into a data structure that R understands, for example a list.

Step 4. Convert JSON data to an R list.

We will use jsonlite, which is also installed with {tidyverse}, to convert from JSON to an R list:

library(jsonlite)
resp_list <- fromJSON(resp)
str(resp_list)
List of 2
 $ data  :'data.frame': 7 obs. of  6 variables:
  ..$ ID Nation  : chr [1:7] "01000US" "01000US" "01000US" "01000US" ...
  ..$ Nation     : chr [1:7] "United States" "United States" "United States" "United States" ...
  ..$ ID Year    : int [1:7] 2019 2018 2017 2016 2015 2014 2013
  ..$ Year       : chr [1:7] "2019" "2018" "2017" "2016" ...
  ..$ Population : int [1:7] 328239523 327167439 325719178 323127515 321418821 318857056 316128839
  ..$ Slug Nation: chr [1:7] "united-states" "united-states" "united-states" "united-states" ...
 $ source:'data.frame': 1 obs. of  4 variables:
  ..$ measures     :List of 1
  .. ..$ : chr "Population"
  ..$ annotations  :'data.frame':   1 obs. of  7 variables:
  .. ..$ source_name       : chr "Census Bureau"
  .. ..$ source_description: chr "The American Community Survey (ACS) is conducted by the US Census and sent to a portion of the population every year."
  .. ..$ dataset_name      : chr "ACS 1-year Estimate"
  .. ..$ dataset_link      : chr "http://www.census.gov/programs-surveys/acs/"
  .. ..$ table_id          : chr "B01003"
  .. ..$ topic             : chr "Diversity"
  .. ..$ subtopic          : chr "Demographics"
  ..$ name         : chr "acs_yg_total_population_1"
  ..$ substitutions:List of 1
  .. ..$ : list()

Step 5. Inspect the result and play with the data.

What is the length of resp_list?

length(resp_list)
[1] 2

Let’s discover what is inside:

class(resp_list$data)
[1] "data.frame"

What does the resp_list$data data.frame look like?

head(resp_list$data)

Oh, nice! Let’s plot the time series of the US population over the years then:

library(ggplot2)
library(ggrepel)
ggplot(data = resp_list$data, 
       aes(x = Year,
           y = Population, 
           label = Population)) + 
  geom_path(size = .25, color = "blue", group = 1) + 
  geom_point(size = 2, color = "blue") + 
  geom_label_repel(size = 3) + 
  ggtitle("US Population") +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(plot.title = element_text(hjust = .5))

What is the second element of resp_list?

class(resp_list$source)
[1] "data.frame"

Let’s see what is inside:

head(resp_list$source)

Oh, no: there is a nested data.frame in resp_list$source. Nested data.frames are awkward to work with in R, but they turn up often in API responses. There is a nice function that takes care of such cases: jsonlite::flatten():

source <- flatten(resp_list$source, recursive = TRUE)
colnames(source)
 [1] "measures"                       "name"                           "substitutions"                 
 [4] "annotations.source_name"        "annotations.source_description" "annotations.dataset_name"      
 [7] "annotations.dataset_link"       "annotations.table_id"           "annotations.topic"             
[10] "annotations.subtopic"          

I understand now: resp_list$source holds the metadata! The API informed us about the sources of the data that it delivered:

source$measures
[[1]]
[1] "Population"

And then:

source$annotations.source_description
[1] "The American Community Survey (ACS) is conducted by the US Census and sent to a portion of the population every year."

Awesome: we get the data and the documentation for it!

3. Make another API call and inspect the data

For each API that you want to use you will need to read its documentation and learn about the parameters that you may pass to it.

I have taken this API call from https://datausa.io/profile/soc/education-legal-community-service-arts-media-occupations: just click on View data in the top-right corner.

You can copy and paste the entire API call into your browser's address bar to obtain the JSON response directly.

The data are on education, legal, community service, arts, & media occupations in the USA.

Make a call and check the server response status:

api_call <- paste0(baseEndPoint, 
                   "?", 
                   paste("PUMS Occupation=210000-270000", 
                         "measure=Total Population,Total Population MOE Appx,Record Count",
                         "drilldowns=Wage Bin",
                         "Workforce Status=true",
                         "Record Count>=5", 
                         sep = "&"))
response <- GET(URLencode(api_call))
response$status_code
[1] 200

Convert the response to a JSON string, then to a list, and extract the data.frame:

response <- rawToChar(response$content)
response <- fromJSON(response)
data <- response$data
head(data)

Visualize with {ggplot2}:

R Markdown

R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemund.


Goran S. Milovanović

DataKolektiv, 2020/21

contact: goran.milovanovic@datakolektiv.com


License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.


---
title: Intro to Data Science (Non-Technical Background, R) - Lab05
author:
- name: Goran S. Milovanović, PhD
  affiliation: DataKolektiv, Chief Scientist & Owner; Data Scientist for Wikidata, WMDE
abstract: 
output:
  html_notebook:
    code_folding: show
    theme: spacelab
    toc: yes
    toc_float: yes
    toc_depth: 5
  html_document:
    toc: yes
    toc_depth: 5
---

![](../_img/DK_Logo_100.png)

***
# Lab 05: {httr}, REST API calls, JSON, and ggplot2 in action
 
**Feedback** should be sent to `goran.milovanovic@datakolektiv.com`. 
These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.

***

### What do we want to do today?

We want to practice REST API access from R, a topic covered in Session 04. In the following example we will access a free REST API from within our R environment, collect the API response as [JSON](https://www.json.org/json-en.html), convert it to an R list, and play with the data.

### 1. Set up the basic API access parameters

In this example we will rely on the free [https://datausa.io/](https://datausa.io/) API to obtain statistical data. Here is the intro to their API: [datausa.io API](https://datausa.io/about/api/).

- You will find the API base endpoint there:

```{r echo = T}
baseEndPoint <- "https://datausa.io/api/data"
```

### 2. Make a simple API call

We will use [{httr}](https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html) to get in touch with the API. It is installed together with [{tidyverse}](https://www.tidyverse.org/), but `library(tidyverse)` does not attach it, so we load it explicitly.

```{r echo = T}
library(httr)
```

**Step 1. Define API parameters.**

First we define the API parameters.

```{r echo = T}
### --- compose API call
# - use base API endpoint
# - and concatenate with API parameters
# - from the following example: https://datausa.io/about/api/
# - parameter: drilldowns
drilldowns <- paste0("drilldowns=", "Nation")
# - parameter: measures
measures <- paste0("measures=", "Population")
# - parameters:
params <- paste("&", c(drilldowns, measures),
                sep = "", collapse = "")
cat(params)
```

**Step 2. Compose API call.**

We put together the `baseEndPoint` with the API call parameters:

```{r echo = T}
api_call <- paste0(baseEndPoint, "?", params)
cat(api_call)
```
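
The call composed above contains a harmless but stray `&` right after the `?`. As a minimal sketch, here are two ways to build the same call without it: joining the `key=value` pairs with `&` in base R, or letting `httr::modify_url()` assemble the query string. The names `query_pairs`, `api_call_clean`, and `api_call_httr` are illustrative and not part of the lab.

```r
library(httr)

# Same endpoint as in the lab
baseEndPoint <- "https://datausa.io/api/data"

# Option 1: base R only. Join "key=value" pairs with "&" so that
# no stray "&" follows the "?"
query_pairs <- c(drilldowns = "Nation", measures = "Population")
params_clean <- paste(names(query_pairs), query_pairs,
                      sep = "=", collapse = "&")
api_call_clean <- paste0(baseEndPoint, "?", params_clean)
cat(api_call_clean, "\n")

# Option 2: let {httr} assemble (and percent-encode) the query string
api_call_httr <- modify_url(baseEndPoint,
                            query = list(drilldowns = "Nation",
                                         measures = "Population"))
cat(api_call_httr, "\n")
```

Both variants should describe the same request as the hand-built `api_call`; which one to prefer is mostly a matter of taste.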

**Step 3. Make API call.**

We use `httr::GET()` to contact the API, ask for data, and fetch the result:

```{r echo = T}
response <- GET(URLencode(api_call))
class(response)
```

The `URLencode(api_call)` call uses the base R `URLencode()` function to take care of [percent-encoding](https://en.wikipedia.org/wiki/Percent-encoding) where necessary. Hint: always use `URLencode(your_api_call)`.

We can see that `response` is an object of class `response`. It is a richly structured list:
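
To see what percent-encoding does, here is a small offline sketch. The query string is only an illustration; it contains spaces and a `>` character, which are not allowed in a URL as they stand.

```r
# A query string with characters that are not allowed in a URL as-is
raw_query <- "Workforce Status=true&Record Count>=5"

# By default URLencode() keeps reserved characters such as "&" and "="
# and encodes everything else that is unsafe (here: spaces and ">")
encoded <- URLencode(raw_query)
print(encoded)

# With reserved = TRUE the reserved characters are encoded as well;
# this is useful for a single value, not for a whole query string
print(URLencode("Total Population,Record Count", reserved = TRUE))

# URLdecode() reverses the encoding
decoded <- URLdecode(encoded)
print(decoded)
```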

```{r echo = T}
str(response)
```

You need to check one thing: the server status response.

```{r echo = T}
response$status_code
```

`200` means that your request was processed successfully. Familiarize yourself with HTTP status codes; the following source is a good place to start: [HTTP response status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).

The result is found in `response$content`, but...
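
{httr} also offers helpers for interpreting status codes. The sketch below runs offline because it passes plain integer codes to `http_status()` and `http_error()`; both functions accept a `response` object as well. The helper `check_status()` is a hypothetical name introduced only for this illustration.

```r
library(httr)

# Describe a few common status codes
for (code in c(200L, 404L, 500L)) {
  status <- http_status(code)
  cat(code, "->", status$message, "| error:", http_error(code), "\n")
}

# Hypothetical helper: stop with a readable message on 4xx/5xx codes
check_status <- function(code) {
  if (http_error(code)) {
    stop("Request failed: ", http_status(code)$message)
  }
  invisible(TRUE)
}

check_status(200L)  # passes silently
```

In a real script, `stop_for_status(response)` does much the same job in one call.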

```{r echo = T}
class(response$content)
```

What is `raw`? It means that your data were obtained as *raw binary data* and they need to be decoded into an R `character` class representation. It is easy:

```{r echo = T}
resp <- rawToChar(response$content)
class(resp)
```
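
The relationship between `raw` and `character` can be tried out without any API call. This sketch converts a short, made-up JSON string to raw bytes and back; `json_text` and `raw_bytes` are illustrative names.

```r
# A short JSON string standing in for an API response body
json_text <- '{"data":[{"Nation":"United States","Year":"2019"}]}'

# charToRaw() gives the underlying bytes, which is how the response
# body is stored in response$content
raw_bytes <- charToRaw(json_text)
print(class(raw_bytes))
print(head(raw_bytes, 4))  # the bytes for the characters {"da

# rawToChar() decodes the bytes back into a character string
decoded <- rawToChar(raw_bytes)
print(identical(decoded, json_text))

# Alternative with {httr}, given a real response object:
# content(response, as = "text", encoding = "UTF-8")
```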

Is `resp` lengthy?

```{r echo = T}
nchar(resp)
```

```{r echo = T}
cat(resp)
```
Now we can see that the API response is indeed JSON. To work with JSON in R, we need to convert it into a data structure that R understands, for example a list.

**Step 4. Convert JSON data to an R list.**

We will use [jsonlite](https://cran.r-project.org/web/packages/jsonlite/index.html), which is also installed with {tidyverse}, to convert from JSON to an R list:

```{r echo = T}
library(jsonlite)
resp_list <- fromJSON(resp)
str(resp_list)
```
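
Note that `fromJSON()` does more than build a list: by default it simplifies a JSON array of objects into a data.frame. The following offline sketch, using a shortened JSON string modeled on the response above, contrasts the default with `simplifyVector = FALSE`, which keeps plain nested lists.

```r
library(jsonlite)

# A shortened JSON string modeled on the API response
json_text <- '{"data":[
  {"Nation":"United States","Year":"2019","Population":328239523},
  {"Nation":"United States","Year":"2018","Population":327167439}
]}'

# Default: arrays of objects are simplified into a data.frame
parsed <- fromJSON(json_text)
print(class(parsed$data))
print(parsed$data)

# simplifyVector = FALSE keeps everything as nested lists
nested <- fromJSON(json_text, simplifyVector = FALSE)
print(class(nested$data))
print(nested$data[[1]]$Year)
```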

**Step 5. Inspect the result and play with the data.**

What is the length of `resp_list`?

```{r echo = T}
length(resp_list)
```

Let's discover what is inside:

```{r echo = T}
class(resp_list$data)
```

What does the `resp_list$data` data.frame look like?

```{r echo = T}
head(resp_list$data)
```
Oh, nice! Let's plot the time series of the US population over the years then:

```{r echo = T}
library(ggplot2)
library(ggrepel)
ggplot(data = resp_list$data, 
       aes(x = Year,
           y = Population, 
           label = Population)) + 
  geom_path(size = .25, color = "blue", group = 1) + 
  geom_point(size = 2, color = "blue") + 
  geom_label_repel(size = 3) + 
  ggtitle("US Population") +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(plot.title = element_text(hjust = .5))
```
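
In the API response `Year` is a character column, which is why the plot above needs `group = 1` to connect the points. One possible variation, sketched here with three population values for 2017 to 2019 hard-coded so that it runs offline, is to convert `Year` to integer and use a numeric x axis. The data.frame `population` is a stand-in for `resp_list$data`.

```r
library(ggplot2)

# Stand-in for resp_list$data (three rows only, hard-coded)
population <- data.frame(
  Year = c("2019", "2018", "2017"),
  Population = c(328239523, 327167439, 325719178),
  stringsAsFactors = FALSE
)

# Year arrives as character; convert it to integer for a numeric axis
population$Year <- as.integer(population$Year)

# With a numeric x axis geom_line() connects the points without group = 1
p <- ggplot(population, aes(x = Year, y = Population)) +
  geom_line(color = "blue") +
  geom_point(size = 2, color = "blue") +
  scale_x_continuous(breaks = population$Year) +
  ggtitle("US Population") +
  theme_bw()
print(p)
```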

What is the second element of `resp_list`?

```{r echo = T}
class(resp_list$source)
```

Let's see what is inside:

```{r echo = T}
head(resp_list$source)
```
Oh, no: there is a nested data.frame in `resp_list$source`. Nested data.frames are awkward to work with in R, but they turn up often in API responses. There is a nice function that takes care of such cases: `jsonlite::flatten()`:

```{r echo = T}
source <- flatten(resp_list$source, recursive = TRUE)
colnames(source)
```
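
Here is a small offline sketch of what `flatten()` does, using a shortened JSON string modeled on the `source` element. It calls the function as `jsonlite::flatten()` on purpose: {purrr} also exports a function named `flatten()`, so the explicit prefix avoids surprises when both packages are attached.

```r
library(jsonlite)

# A shortened JSON string modeled on the "source" part of the response
json_text <- '{"source":[{
  "name":"acs_yg_total_population_1",
  "annotations":{"source_name":"Census Bureau","topic":"Diversity"}
}]}'

src <- fromJSON(json_text)$source

# "annotations" is a data.frame stored inside a data.frame column
print(class(src$annotations))
print(colnames(src))

# flatten() lifts the nested columns to the top level and prefixes
# their names with the name of the parent column
flat <- jsonlite::flatten(src)
print(colnames(flat))
```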

I understand now: `resp_list$source` holds the *metadata*! The API informed us about the sources of the data that it delivered:

```{r echo = T}
source$measures
```

And then:

```{r echo = T}
source$annotations.source_description
```

Awesome: we get the data and the documentation for it!
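
The steps of this section (call the API, check the status, decode the raw content, parse the JSON) can be collected into one function. This is only a sketch: `fetch_json()` is a hypothetical helper, not something that {httr} or {jsonlite} provide, and the usage lines are commented out because they need an internet connection.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: call an API URL and return the parsed JSON
fetch_json <- function(api_call) {
  response <- GET(URLencode(api_call))
  # stop_for_status() raises an R error for 4xx and 5xx status codes
  stop_for_status(response)
  # decode the raw bytes to text, then parse the JSON
  json_text <- rawToChar(response$content)
  fromJSON(json_text)
}

# Usage (requires an internet connection):
# result <- fetch_json(
#   "https://datausa.io/api/data?drilldowns=Nation&measures=Population"
# )
# head(result$data)
```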

### 3. Make another API call and inspect the data

For each API that you want to use you will need to read its documentation and learn about the parameters that you may pass to it. 

I have taken this API call from [https://datausa.io/profile/soc/education-legal-community-service-arts-media-occupations](https://datausa.io/profile/soc/education-legal-community-service-arts-media-occupations): just click on **View data** in the top-right corner.

You can copy and paste [the entire API call](https://datausa.io/api/data?measure=Average%20Wage,Average%20Wage%20Appx%20MOE,Record%20Count&drilldowns=Minor%20Occupation%20Group&Workforce%20Status=true&Record%20Count%3E=5) into your browser's address bar to obtain the JSON response directly.

The data are on education, legal, community service, arts, & media occupations in the USA.

Make a call and check the server response status:

```{r echo = T}
api_call <- paste0(baseEndPoint, 
                   "?", 
                   paste("PUMS Occupation=210000-270000", 
                         "measure=Total Population,Total Population MOE Appx,Record Count",
                         "drilldowns=Wage Bin",
                         "Workforce Status=true",
                         "Record Count>=5", 
                         sep = "&"))
response <- GET(URLencode(api_call))
response$status_code
```
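
A side note on reading the status code: R's `$` operator matches list names partially, so on a response object `response$status` and `response$status_code` return the same value. Spelling out the full name is the safer habit, because `[[ ]]` matches exactly. The sketch below shows the difference on `resp_like`, a hypothetical stand-in list.

```r
# A plain list standing in for a response object
resp_like <- list(status_code = 200L,
                  url = "https://datausa.io/api/data")

# `$` matches names partially: "status" finds "status_code"
print(resp_like$status)

# `[[` matches exactly by default, so the short name finds nothing
print(resp_like[["status"]])

# The full name works with both operators
print(resp_like[["status_code"]])
```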

Convert the response to a JSON string, then to a list, and extract the data.frame:

```{r echo = T}
response <- rawToChar(response$content)
response <- fromJSON(response)
data <- response$data
head(data)
```
Visualize with {ggplot2}:

```{r echo = T, fig.height=18}
data$`Wage Bin` <- factor(data$`Wage Bin`, 
                          levels = unique(data$`Wage Bin`))
ggplot(data = data, 
       aes(x = Year,
           y = log(`Total Population`), 
           color = `Wage Bin`,
           fill = `Wage Bin`)) + 
  geom_path(size = 1.5, group = 1) + 
  geom_point(size = 13) + 
  facet_wrap(~`Wage Bin`, ncol = 3) +
  ggtitle("US: Education, legal, community service, arts, & media occupations") +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(plot.title = element_text(hjust = .5, size = 70)) + 
  theme(axis.text.x = element_text(angle = 90, size = 40)) +
  theme(axis.title.x = element_text(size = 50)) + 
  theme(axis.text.y = element_text(size = 40)) +
  theme(axis.title.y = element_text(size = 50)) + 
  theme(legend.text = element_text(size = 50)) +
  theme(legend.title = element_text(size = 50)) +
  theme(strip.text = element_text(size = 45)) +
  theme(strip.background = element_blank()) +
  theme(legend.position = "top")
```
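
The first two lines of the chunk above turn `Wage Bin` into a factor whose levels follow the order in which the bins appear in the data, and that in turn controls the order of the facets. A minimal sketch with made-up bin labels (the real labels come from the API) shows the effect:

```r
# Made-up wage bin labels, in the order they might arrive from an API
bins <- c("< $10K", "$10-20K", "$100K+", "$10-20K", "< $10K")

# Default: levels are sorted, which rarely matches the natural order
default_factor <- factor(bins)
print(levels(default_factor))

# levels = unique(bins) keeps the order of first appearance
ordered_factor <- factor(bins, levels = unique(bins))
print(levels(ordered_factor))
```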

### R Markdown

[R Markdown](https://rmarkdown.rstudio.com/) is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here's a book: [R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemund.](https://bookdown.org/yihui/rmarkdown/) 


***
Goran S. Milovanović

DataKolektiv, 2020/21

contact: goran.milovanovic@datakolektiv.com

![](../_img/DK_Logo_100.png)

***
License: [GPLv3](http://www.gnu.org/licenses/gpl-3.0.txt)
This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this Notebook. If not, see <http://www.gnu.org/licenses/>.

***

