class: center, middle

## API calls with `httr` and Web Scraping with `rvest`

<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 5 | Spring 2021]

---

## Announcements/Reminders

* Lab 4 posted.
* Mini-Project 1 due next Thursday (Mar 5th) at 8:30am
    + Presentations during class.
    + Add slides [here](https://docs.google.com/presentation/d/1Vn_YwO9hqDhDELJ-TqLiHyoLFZ4vroA-FX1aLem3c9U/edit?usp=sharing).

---

## Looking Ahead...

Thursday's class: 5 minute presentations about your R package

+ I will cut you off at the 5-minute mark.
+ Each group member needs to participate.

--

> "How long does it take you to prepare one of your speeches?" asked a friend of President Wilson.

> "That depends on the length of the speech," answered the President. "If it is a 10-minute speech, it takes me all of two weeks to prepare it; if it is a half hour speech, it takes me a week; if I can talk as long as I want to, it requires no preparation at all. I am ready now."

--

Make sure to include:

* A clear and concise description of the data.
* A hook to get people interested in using the data themselves.
    + We will vote for the data packages we are most excited to use!

---

## Goals for Today

* Web data
    + Talking to APIs
    + Web scraping

---

## Grabbing Data From The Web

Four main categories (listed from easiest to hardest):

* **Download and Go**: Flat files, such as csvs, that you can download and then import with something like `readr`.

--

* **Package API Wrapper**: R packages that talk to APIs.

--

* **API**: Talking to the APIs directly.

--

* **Scrape**: Scraping directly from a website.

---

## Data from APIs: What To Do

* Ask the internet if there is an R package for a particular API.
* If so, read the vignette/help files.
* If not, you must talk to the API directly.

---

# But first...
## Lists in R

The most general way to store things:

```r
groceries <- list()
groceries$new_seasons <- c("apples", "chocolate", "kale", "garlic")
groceries$safeway <- c("vinegar", "soap")
groceries$salt_n_straw <- c("almond_brittle", "double_fold_vanilla", "tanzanian_tin_roof")
groceries$budget <- data.frame(stores = c("new_seasons", "safeway", "salt_n_straw"),
                               fund = c(100, 25, 200))
```

---

```r
groceries
```

```
## $new_seasons
## [1] "apples" "chocolate" "kale" "garlic"
##
## $safeway
## [1] "vinegar" "soap"
##
## $salt_n_straw
## [1] "almond_brittle" "double_fold_vanilla" "tanzanian_tin_roof"
##
## $budget
##         stores fund
## 1  new_seasons  100
## 2      safeway   25
## 3 salt_n_straw  200
```

---

Nested structure

```r
outings <- c("trivia", "pizza")
feb <- list(groceries = groceries, outings = outings)
feb
```

```
## $groceries
## $groceries$new_seasons
## [1] "apples" "chocolate" "kale" "garlic"
##
## $groceries$safeway
## [1] "vinegar" "soap"
##
## $groceries$salt_n_straw
## [1] "almond_brittle" "double_fold_vanilla" "tanzanian_tin_roof"
##
## $groceries$budget
##         stores fund
## 1  new_seasons  100
## 2      safeway   25
## 3 salt_n_straw  200
##
##
## $outings
## [1] "trivia" "pizza"
```

---

## Grabbing items or the container and the items

```r
thing1 <- feb$groceries[3]
thing1
```

```
## $salt_n_straw
## [1] "almond_brittle" "double_fold_vanilla" "tanzanian_tin_roof"
```

```r
class(thing1)
```

```
## [1] "list"
```

```r
thing2 <- feb$groceries[[3]]
thing2
```

```
## [1] "almond_brittle" "double_fold_vanilla" "tanzanian_tin_roof"
```

```r
class(thing2)
```

```
## [1] "character"
```

---

## Lists in R

Data frame is a special case of a list

* What must be true to be a data frame?
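One way to see the answer, using the `budget` element of the `groceries` list built earlier (a sketch; a data frame is a list whose elements are all vectors of the same length):

```r
# A data frame is still a list under the hood...
is.list(groceries$budget)   # TRUE

# ...whose columns are equal-length vectors
lengths(groceries$budget)   # each column has length 3
```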
```r
groceries_df <- data.frame(stores = c("new_seasons", "safeway", "salt_n_straw"),
                           budget = c(100, 25, 200))
groceries_df
```

```
##         stores budget
## 1  new_seasons    100
## 2      safeway     25
## 3 salt_n_straw    200
```

---

## Web Data

* Two common languages of web services:
    + JavaScript Object Notation (JSON)
    + eXtensible Markup Language (XML)
* We won't be deep diving into JSON/XML today.
    + Will use functions to convert to R objects (lists!).
* Learning to interact more directly with JSON/XML is a great option for your final project.

---

## APIs

* We will use `httr` to access data via APIs.
    + `tidyverse` adjacent
    + Play on HTTP: Hyper-Text Transfer Protocol

```r
library(httr)
```

--

* For our example, let's grab the College Scorecard data from [data.gov](https://www.data.gov/).
    + You will need to first sign up for an API key [here](https://api.data.gov/signup/).

```r
# Store API key (Change to your personal key!)
my_key <- "insert key"
```

```r
# URL of interest
url <- "https://api.data.gov/ed/collegescorecard/v1/schools?"

# Download available data for Reed
reed <- GET(url, query = list(api_key = my_key,
                              school.name = "Reed College"))
```

---

```r
# Look at type
http_type(reed)
```

```
## [1] "application/json"
```

```r
# Examine components
names(reed)
```

```
## [1] "url" "status_code" "headers" "all_headers" "cookies"
## [6] "content" "date" "times" "request" "handle"
```

---

## Let's start with `status_code`

Key:

* 2xx: Success
* 3xx: Redirection
* 4xx: Client error (something's not right on your end)
* 5xx: Server error (something's not right on their end)

```r
status_code(reed)
```

```
## [1] 200
```

* If not 200, check that you got the correct url.
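In a script you may want R to fail loudly rather than quietly carry on with a bad response. A sketch, reusing the `reed` object from above (`stop_for_status()` is part of `httr`):

```r
# Convert a failed request (status >= 300) into an R error
stop_for_status(reed)

# Or branch on the status code yourself
if (status_code(reed) != 200) {
  warning("Request failed with status ", status_code(reed))
}
```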
---

## Want to pull out the `content`

```r
# Convert data into an R object
# JSON automatically parsed into named list
dat <- content(reed, as = "parsed", type = "application/json")

# Look at structure
class(dat)
```

```
## [1] "list"
```

---

```r
# Continue looking at structure
names(dat)
```

```
## [1] "metadata" "results"
```

```r
glimpse(dat)
```

```
## List of 2
## $ metadata:List of 3
## ..$ total : int 2
## ..$ page : int 0
## ..$ per_page: int 20
## $ results :List of 2
## ..$ :List of 30
## .. ..$ 2012 :List of 8
## .. ..$ 2011 :List of 8
## .. ..$ 2010 :List of 8
## .. ..$ 2009 :List of 8
## .. ..$ 1998 :List of 8
## .. ..$ 2008 :List of 8
## .. ..$ 1997 :List of 8
## .. ..$ 2007 :List of 8
## .. ..$ 1996 :List of 8
## .. ..$ 2006 :List of 8
## .. ..$ 2005 :List of 8
## .. ..$ school :List of 36
## .. ..$ 2004 :List of 8
## .. ..$ 2003 :List of 8
## .. ..$ 2002 :List of 8
## .. ..$ id : int 209922
## .. ..$ latest :List of 9
## .. ..$ 1999 :List of 8
## .. ..$ 2001 :List of 8
## .. ..$ 2000 :List of 8
## .. ..$ fed_sch_cd: chr "003217"
## .. ..$ 2018 :List of 8
## .. ..$ ope6_id : chr "003217"
## .. ..$ 2017 :List of 9
## .. ..$ 2016 :List of 9
## .. ..$ 2015 :List of 9
## .. ..$ 2014 :List of 8
## .. ..$ 2013 :List of 8
## .. ..$ ope8_id : chr "00321700"
## .. ..$ location :List of 2
## ..$ :List of 30
## .. ..$ 2012 :List of 8
## .. ..$ 2011 :List of 8
## .. ..$ 2010 :List of 8
## .. ..$ 2009 :List of 8
## .. ..$ 1998 :List of 8
## .. ..$ 2008 :List of 8
## .. ..$ 1997 :List of 8
## .. ..$ 2007 :List of 8
## .. ..$ 1996 :List of 8
## .. ..$ 2006 :List of 8
## .. ..$ 2005 :List of 8
## .. ..$ school :List of 36
## .. ..$ 2004 :List of 8
## .. ..$ 2003 :List of 8
## .. ..$ 2002 :List of 8
## .. ..$ id : int 117052
## .. ..$ latest :List of 9
## .. ..$ 1999 :List of 8
## .. ..$ 2001 :List of 8
## .. ..$ 2000 :List of 8
## .. ..$ fed_sch_cd: chr "001308"
## .. ..$ 2018 :List of 8
## .. ..$ ope6_id : chr "001308"
## .. ..$ 2017 :List of 9
## .. ..$ 2016 :List of 9
## .. ..$ 2015 :List of 9
## .. ..$ 2014 :List of 8
## .. ..$ 2013 :List of 8
## .. ..$ ope8_id : chr "00130800"
## .. ..$ location :List of 2
```

---

```r
names(dat$results[[1]])
```

```
## [1] "2012" "2011" "2010" "2009" "1998"
## [6] "2008" "1997" "2007" "1996" "2006"
## [11] "2005" "school" "2004" "2003" "2002"
## [16] "id" "latest" "1999" "2001" "2000"
## [21] "fed_sch_cd" "2018" "ope6_id" "2017" "2016"
## [26] "2015" "2014" "2013" "ope8_id" "location"
```

```r
names(dat$results[[1]][[1]])
```

```
## [1] "completion" "earnings" "cost" "student" "academics"
## [6] "admissions" "aid" "repayment"
```

---

```r
# Pulling out useful data takes some work
clean_dat <- dat$results[[1]][c(as.character(2000:2017))] %>%
  sapply(function(x) x$aid$median_debt$completers$overall) %>%
  unlist()
clean_dat
```

```
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 11500 11558 11750 12000 11750 11000 10500 10500 12000 13206 14350 16000 16147
## 2013 2014 2015 2016 2017
## 16000 16000 16000 16000 13465
```

* Important last step: Save the data
    + Don't have R run `GET()` each time you knit!
    + Load the data you wrote to a csv.

```r
# clean_dat is a named vector, so convert it to a data frame before writing
clean_dat %>%
  enframe(name = "year", value = "median_debt") %>%
  write_csv("median_debt_reed.csv")
```

---

## Another Example: ColourLovers

```r
colour_lover <- GET("http://www.colourlovers.com/api/palette/292482?format=json")
http_type(colour_lover)
```

```
## [1] "application/json"
```

```r
dat <- content(colour_lover, as = "parsed", type = "application/json")
dat_df <- dat %>% as.data.frame()
dat_df
```

```
##       id title   userName numViews numVotes numComments numHearts rank
## 1 292482 Terra? GlueStudio   391994     5520         583       4.5    7
##           dateCreated colors..E8DDCB. colors..CDB380. colors..036564.
## 1 2008-02-29 08:37:21          E8DDCB          CDB380          036564
##   colors..033649. colors..031634.
## 1          033649          031634
## description
## 1 I-MOO\r\n<div style="width: 300px; text-align: center;"><a href="http://www.colourlovers.com/contests/moo/minicard/2291466" target="_blank" style="display: block; margin-bottom: 5px; width: 300px; height: 120px; -moz-box-shadow: 0 1px 4px #d1d1d1; -webkit-box-shadow: 0 1px 4px #d1d1d1; box-shadow: 0 1px 4px #d1d1d1; filter: progid:DXImageTransform.Microsoft.Shadow(Strength=1, Direction=180, Color=
## url
## 1 http://www.colourlovers.com/palette/292482/Terra
## imageUrl
## 1 http://www.colourlovers.com/paletteImg/E8DDCB/CDB380/036564/033649/031634/Terra.png
## badgeUrl
## 1 http://www.colourlovers.com/images/badges/pw/292/292482_Terra.png
## apiUrl
## 1 http://www.colourlovers.com/api/palette/292482
```

---

## ColourLovers API Wrapper

```r
library(colourlovers)
pal <- clpalette(id = 292482)
pal
```

```
## Palette ID: 292482
## Title: Terra?
## Created by user: GlueStudio
## Date created: 2008-02-29 08:37:21
## Views: 391994
## Votes: 5520
## Comments: 583
## Hearts: 4.5
## Rank: 7
## URL: http://www.colourlovers.com/palette/292482/Terra
## Image URL:
## Colors: #E8DDCB, #CDB380, #036564, #033649, #031634
```

```r
class(pal)
```

```
## [1] "clpalette" "list"
```

---

## Web Scraping

* Found data on a website but there isn't an API.

--

* How easy it is to grab that data depends on the quality of the website!
    + And be prepared to do a LOT of cleaning once you have the data.

---

## HyperText Markup Language (HTML)

* Most of the data on the web is available as HTML.
* It is structured (hierarchical) but often not available in a useful, tidy format.
```
<html>
<head>
  <title>This is a title</title>
</head>
<body>
  <p align="center">Hello world!</p>
</body>
</html>
```

---

## Web Scraping

`rvest`: package for basic processing and manipulation of HTML data

* Designed to work with `%>%`

<img src="img/rvest.png" width="40%" style="display: block; margin: auto;" />

---

## Key `rvest` functions

* `read_html()`: Read in HTML data from a URL
* `html_node()`: Select a specified node from the HTML document
* `html_nodes()`: Select specified nodes from the HTML document
* `html_table()`: Parse an HTML table into a data frame

---

## Web Scraping

**Steps:**

0. Check for permission to scrape the data with `robotstxt::paths_allowed()`
1. Read the HTML page into R with `read_html()`
2. Extract the nodes of the page that correspond to elements of interest
    + Use web tools to help identify these nodes
3. Clean up the extracted text fields.
4. Write the data to a csv so that you are not scraping it every time, in case the website goes down, or if the data changes.

---

class: inverse, center, middle

## Let's go through the webScraping.Rmd handout in the Handouts folder!

---

#### Scraping Challenges

1. Reproducibility
    + The data are not static.
    + Websites change their structure over time.

--

2. Some website structures are much harder to scrape, especially when the data you want is spread over many nodes.

--

3. Making many requests to a web server may cause it to ban your requests or throttle your retrieval speed.

--

4. The quality of the data
    + Is it provided by users of the website or staff affiliated with the website?

--

5. Privacy and consent considerations
    + Ex: OkCupid data scraped and provided on PsychNet

--

6. Legality issues
    + Ex: Dispute between eBay and Bidder's Edge (2000): Court banned Bidder's Edge from scraping data from eBay
    + Ex: Dispute between LinkedIn and HiQ (2019): scraping publicly available info `\(\neq\)` hacking but may involve copyright infringement
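---

## The scraping steps, together

A minimal sketch of the workflow above. The URL and the `"table"` selector here are placeholders; substitute a real page and use your browser's web tools to find the right node.

```r
library(rvest)
library(robotstxt)
library(readr)

url <- "https://example.com/tables"   # placeholder URL

# Step 0: check for permission to scrape
paths_allowed(url)

# Steps 1-2: read the page and extract the node of interest
tbl <- read_html(url) %>%
  html_node("table") %>%   # selector found via web tools
  html_table()

# Step 3 (cleaning) depends on the page; then
# Step 4: save so you aren't re-scraping every time you knit
write_csv(tbl, "scraped_table.csv")
```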