class: center, middle

## API calls with `httr` and Web Scraping with `rvest`

<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 5 | Spring 2021]

---

## Announcements/Reminders

* Lab 4 posted.
* Mini-Project 1 due next Thursday (Mar 5th) at 8:30am
    + Presentations during class.
    + Add slides [here](https://docs.google.com/presentation/d/1Vn_YwO9hqDhDELJ-TqLiHyoLFZ4vroA-FX1aLem3c9U/edit?usp=sharing).

---

## Looking Ahead...

Thursday's class: 5 minute presentations about your R package

+ I will cut you off at the 5-minute mark.
+ Each group member needs to participate.

--

> "How long does it take you to prepare one of your speeches?" asked a friend of President Wilson.

> "That depends on the length of the speech," answered the President. "If it is a 10-minute speech, it takes me all of two weeks to prepare it; if it is a half hour speech, it takes me a week; if I can talk as long as I want to, it requires no preparation at all. I am ready now."

--

Make sure to include:

* A clear and concise description of the data.
* A hook to get people interested in using the data themselves.
    + We will vote for the data packages we are most excited to use!

---

## Goals for Today

* Web data
    + Talking to APIs
    + Web scraping

---

## Grabbing Data From The Web

Four main categories (listed from easiest to hardest):

* **Download and Go**: Flat files, such as csvs, that you can download and then import with something like `readr`.

--

* **Package API Wrapper**: R packages that talk to APIs.

--

* **API**: Talking to the APIs directly.

--

* **Scrape**: Scraping directly from a website.

---

## Data from APIs: What To Do

* Ask the internet if there is an R package for a particular API.
* If so, read the vignette/help files.
* If not, you must talk to the API directly.

---

# But first...
## Lists in R

The most general way to store things:

```r
groceries <- list()
groceries$new_seasons <- c("apples", "chocolate", "kale", "garlic")
groceries$safeway <- c("vinegar", "soap")
groceries$salt_n_straw <- c("almond_brittle", "double_fold_vanilla", "tanzanian_tin_roof")
groceries$budget <- data.frame(stores = c("new_seasons", "safeway", "salt_n_straw"),
                               fund = c(100, 25, 200))
```

---

```r
groceries
```

```
## $new_seasons
## [1] "apples" "chocolate" "kale" "garlic"
##
## $safeway
## [1] "vinegar" "soap"
##
## $salt_n_straw
## [1] "almond_brittle" "double_fold_vanilla" "tanzanian_tin_roof"
##
## $budget
##         stores fund
## 1  new_seasons  100
## 2      safeway   25
## 3 salt_n_straw  200
```

---

Nested structure

```r
outings <- c("trivia", "pizza")
feb <- list(groceries = groceries, outings = outings)
feb
```

```
## $groceries
## $groceries$new_seasons
## [1] "apples" "chocolate" "kale" "garlic"
##
## $groceries$safeway
## [1] "vinegar" "soap"
##
## $groceries$salt_n_straw
## [1] "almond_brittle" "double_fold_vanilla" "tanzanian_tin_roof"
##
## $groceries$budget
##         stores fund
## 1  new_seasons  100
## 2      safeway   25
## 3 salt_n_straw  200
##
##
## $outings
## [1] "trivia" "pizza"
```

---

## Grabbing items or the container and the items

```r
thing1 <- feb$groceries[3]
thing1
```

```
## $salt_n_straw
## [1] "almond_brittle" "double_fold_vanilla" "tanzanian_tin_roof"
```

```r
class(thing1)
```

```
## [1] "list"
```

```r
thing2 <- feb$groceries[[3]]
thing2
```

```
## [1] "almond_brittle" "double_fold_vanilla" "tanzanian_tin_roof"
```

```r
class(thing2)
```

```
## [1] "character"
```

---

## Lists in R

Data frame is a special case of a list

* What must be true to be a data frame?
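One way to see the answer, using the `budget` element of the `groceries` list built earlier (a sketch; a data frame is a list whose elements are all vectors of the same length):

```r
# A data frame is still a list under the hood...
is.list(groceries$budget)   # TRUE

# ...whose columns are equal-length vectors
lengths(groceries$budget)   # each column has length 3
```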
```r
groceries_df <- data.frame(stores = c("new_seasons", "safeway", "salt_n_straw"),
                           budget = c(100, 25, 200))
groceries_df
```

```
##         stores budget
## 1  new_seasons    100
## 2      safeway     25
## 3 salt_n_straw    200
```

---

## Web Data

* Two common languages of web services:
    + JavaScript Object Notation (JSON)
    + eXtensible Markup Language (XML)
* We won't be deep diving into JSON/XML today.
    + Will use functions to convert to R objects (lists!).
* Learning to interact more directly with JSON/XML is a great option for your final project.

---

## APIs

* We will use `httr` to access data via APIs.
    + `tidyverse` adjacent
    + Play on HTTP: Hyper-Text Transfer Protocol

```r
library(httr)
```

--

* For our example, let's grab the College Scorecard data from [data.gov](https://www.data.gov/).
    + You will need to first sign up for an API key [here](https://api.data.gov/signup/).

```r
# Store API key (Change to your personal key!)
my_key <- "insert key"
```

```r
# URL of interest
url <- "https://api.data.gov/ed/collegescorecard/v1/schools?"

# Download available data for Reed
reed <- GET(url, query = list(api_key = my_key,
                              school.name = "Reed College"))
```

---

```r
# Look at type
http_type(reed)
```

```
## [1] "application/json"
```

```r
# Examine components
names(reed)
```

```
## [1] "url" "status_code" "headers" "all_headers" "cookies"
## [6] "content" "date" "times" "request" "handle"
```

---

## Let's start with `status_code`

Key:

* 2xx: Success
* 3xx: Redirection
* 4xx: Client error (something's not right on your end)
* 5xx: Server error (something's not right on their end)

```r
status_code(reed)
```

```
## [1] 200
```

* If not 200, check that you got the correct url.
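In a script you may want R to fail loudly rather than quietly carry on with a bad response. A sketch, reusing the `reed` object from above (`stop_for_status()` is part of `httr`):

```r
# Convert a failed request (status >= 300) into an R error
stop_for_status(reed)

# Or branch on the status code yourself
if (status_code(reed) != 200) {
  warning("Request failed with status ", status_code(reed))
}
```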
---

## Want to pull out the `content`

```r
# Convert data into an R object
# JSON automatically parsed into named list
dat <- content(reed, as = "parsed", type = "application/json")

# Look at structure
class(dat)
```

```
## [1] "list"
```

---

```r
# Continue looking at structure
names(dat)
```

```
## [1] "metadata" "results"
```

```r
glimpse(dat)
```

```
## List of 2
## $ metadata:List of 3
## ..$ total : int 2
## ..$ page : int 0
## ..$ per_page: int 20
## $ results :List of 2
## ..$ :List of 30
## .. ..$ 2012 :List of 8
## .. ..$ 2011 :List of 8
## .. ..$ 2010 :List of 8
## .. ..$ 2009 :List of 8
## .. ..$ 1998 :List of 8
## .. ..$ 2008 :List of 8
## .. ..$ 1997 :List of 8
## .. ..$ 2007 :List of 8
## .. ..$ 1996 :List of 8
## .. ..$ 2006 :List of 8
## .. ..$ 2005 :List of 8
## .. ..$ school :List of 36
## .. ..$ 2004 :List of 8
## .. ..$ 2003 :List of 8
## .. ..$ 2002 :List of 8
## .. ..$ id : int 209922
## .. ..$ latest :List of 9
## .. ..$ 1999 :List of 8
## .. ..$ 2001 :List of 8
## .. ..$ 2000 :List of 8
## .. ..$ fed_sch_cd: chr "003217"
## .. ..$ 2018 :List of 8
## .. ..$ ope6_id : chr "003217"
## .. ..$ 2017 :List of 9
## .. ..$ 2016 :List of 9
## .. ..$ 2015 :List of 9
## .. ..$ 2014 :List of 8
## .. ..$ 2013 :List of 8
## .. ..$ ope8_id : chr "00321700"
## .. ..$ location :List of 2
## ..$ :List of 30
## .. ..$ 2012 :List of 8
## .. ..$ 2011 :List of 8
## .. ..$ 2010 :List of 8
## .. ..$ 2009 :List of 8
## .. ..$ 1998 :List of 8
## .. ..$ 2008 :List of 8
## .. ..$ 1997 :List of 8
## .. ..$ 2007 :List of 8
## .. ..$ 1996 :List of 8
## .. ..$ 2006 :List of 8
## .. ..$ 2005 :List of 8
## .. ..$ school :List of 36
## .. ..$ 2004 :List of 8
## .. ..$ 2003 :List of 8
## .. ..$ 2002 :List of 8
## .. ..$ id : int 117052
## .. ..$ latest :List of 9
## .. ..$ 1999 :List of 8
## .. ..$ 2001 :List of 8
## .. ..$ 2000 :List of 8
## .. ..$ fed_sch_cd: chr "001308"
## .. ..$ 2018 :List of 8
## .. ..$ ope6_id : chr "001308"
## .. ..$ 2017 :List of 9
## .. ..$ 2016 :List of 9
## .. ..$ 2015 :List of 9
## .. ..$ 2014 :List of 8
## .. ..$ 2013 :List of 8
## .. ..$ ope8_id : chr "00130800"
## .. ..$ location :List of 2
```

---

```r
names(dat$results[[1]])
```

```
## [1] "2012" "2011" "2010" "2009" "1998"
## [6] "2008" "1997" "2007" "1996" "2006"
## [11] "2005" "school" "2004" "2003" "2002"
## [16] "id" "latest" "1999" "2001" "2000"
## [21] "fed_sch_cd" "2018" "ope6_id" "2017" "2016"
## [26] "2015" "2014" "2013" "ope8_id" "location"
```

```r
names(dat$results[[1]][[1]])
```

```
## [1] "completion" "earnings" "cost" "student" "academics"
## [6] "admissions" "aid" "repayment"
```

---

```r
# Pulling out useful data takes some work
clean_dat <- dat$results[[1]][c(as.character(2000:2017))] %>%
  sapply(function(x) x$aid$median_debt$completers$overall) %>%
  unlist()
clean_dat
```

```
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 11500 11558 11750 12000 11750 11000 10500 10500 12000 13206 14350 16000 16147
## 2013 2014 2015 2016 2017
## 16000 16000 16000 16000 13465
```

* Important last step: Save the data
    + Don't have R run `GET()` each time you knit!
    + Load the data you wrote to a csv.

```r
# clean_dat is a named vector, so convert it to a data frame before writing
clean_dat %>%
  enframe(name = "year", value = "median_debt") %>%
  write_csv("median_debt_reed.csv")
```

---

## Another Example: ColourLovers

```r
colour_lover <- GET("http://www.colourlovers.com/api/palette/292482?format=json")
http_type(colour_lover)
```

```
## [1] "application/json"
```

```r
dat <- content(colour_lover, as = "parsed", type = "application/json")
dat_df <- dat %>% as.data.frame()
dat_df
```

```
##       id title   userName numViews numVotes numComments numHearts rank
## 1 292482 Terra? GlueStudio   391994     5520         583       4.5    7
##           dateCreated colors..E8DDCB. colors..CDB380. colors..036564.
## 1 2008-02-29 08:37:21          E8DDCB          CDB380          036564
##   colors..033649. colors..031634.
## 1          033649          031634
## description
## 1 I-MOO\r\n<div style="width: 300px; text-align: center;"><a href="http://www.colourlovers.com/contests/moo/minicard/2291466" target="_blank" style="display: block; margin-bottom: 5px; width: 300px; height: 120px; -moz-box-shadow: 0 1px 4px #d1d1d1; -webkit-box-shadow: 0 1px 4px #d1d1d1; box-shadow: 0 1px 4px #d1d1d1; filter: progid:DXImageTransform.Microsoft.Shadow(Strength=1, Direction=180, Color=
## url
## 1 http://www.colourlovers.com/palette/292482/Terra
## imageUrl
## 1 http://www.colourlovers.com/paletteImg/E8DDCB/CDB380/036564/033649/031634/Terra.png
## badgeUrl
## 1 http://www.colourlovers.com/images/badges/pw/292/292482_Terra.png
## apiUrl
## 1 http://www.colourlovers.com/api/palette/292482
```

---

## ColourLovers API Wrapper

```r
library(colourlovers)
pal <- clpalette(id = 292482)
pal
```

```
## Palette ID: 292482
## Title: Terra?
## Created by user: GlueStudio
## Date created: 2008-02-29 08:37:21
## Views: 391994
## Votes: 5520
## Comments: 583
## Hearts: 4.5
## Rank: 7
## URL: http://www.colourlovers.com/palette/292482/Terra
## Image URL:
## Colors: #E8DDCB, #CDB380, #036564, #033649, #031634
```

```r
class(pal)
```

```
## [1] "clpalette" "list"
```

---

## Web Scraping

* Found data on a website but there isn't an API.

--

* How easy it is to grab that data depends on the quality of the website!
    + And be prepared to do a LOT of cleaning once you have the data.

---

## HyperText Markup Language (HTML)

* Most of the data on the web is available as HTML.
* It is structured (hierarchical) but often not available in a useful, tidy format.
```
<html>
<head>
  <title>This is a title</title>
</head>
<body>
  <p align="center">Hello world!</p>
</body>
</html>
```

---

## Web Scraping

`rvest`: package for basic processing and manipulation of HTML data

* Designed to work with `%>%`

<img src="img/rvest.png" width="40%" style="display: block; margin: auto;" />

---

## Key `rvest` functions

* `read_html()`: Read in HTML data from a URL
* `html_node()`: Select a specified node from the HTML document
* `html_nodes()`: Select specified nodes from the HTML document
* `html_table()`: Parse an HTML table into a data frame

---

## Web Scraping

**Steps:**

0. Check for permission to scrape the data with `robotstxt::paths_allowed()`
1. Read the HTML page into R with `read_html()`
2. Extract the nodes of the page that correspond to elements of interest
    + Use web tools to help identify these nodes
3. Clean up the extracted text fields.
4. Write the data to a csv so that you are not scraping it every time, in case the website goes down, or if the data changes.

---

class: inverse, center, middle

## Let's go through the webScraping.Rmd handout in the Handouts folder!

---

#### Scraping Challenges

1. Reproducibility
    + The data are not static.
    + Websites change their structure over time.

--

2. Some website structures are much harder to scrape, especially when the data you want is spread over many nodes.

--

3. Making many requests to a web server may cause it to ban your requests or throttle your retrieval speed.

--

4. The quality of the data
    + Is it provided by users of the website or staff affiliated with the website?

--

5. Privacy and consent considerations
    + Ex: OkCupid data scraped and provided on PsychNet

--

6. Legality issues
    + Ex: Dispute between eBay and Bidder's Edge (2000): Court banned Bidder's Edge from scraping data from eBay
    + Ex: Dispute between LinkedIn and HiQ (2019): scraping publicly available info `\(\neq\)` hacking but may involve copyright infringement
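---

## The scraping steps, together

A minimal sketch of the workflow above. The URL and the `"table"` selector here are placeholders; substitute a real page and use your browser's web tools to find the right node.

```r
library(rvest)
library(robotstxt)
library(readr)

url <- "https://example.com/tables"   # placeholder URL

# Step 0: check for permission to scrape
paths_allowed(url)

# Steps 1-2: read the page and extract the node of interest
tbl <- read_html(url) %>%
  html_node("table") %>%   # selector found via web tools
  html_table()

# Step 3 (cleaning) depends on the page; then
# Step 4: save so you aren't re-scraping every time you knit
write_csv(tbl, "scraped_table.csv")
```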