class: center, middle

## Data Joins with `dplyr` and Data Reshaping with `tidyr`

<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 4 | Spring 2021]

---

## Adjusting for the 🌨 Storm

* Moving Mini-Project 1 due date and presentations to one week later: ~~Th Feb 25th~~ **Th Mar 4th**
    + Updated on the [website](https://reed-statistics.github.io/math241s21/projects.html)
* Pushed back the dates for Mini-Project 2
* Modified Office Hours this week:
    + Th 3:30pm - 4:30pm
    + Fri 10:00am - noon

---

# Announcements/Reminders

* [Additional office hours added to the schedule](https://reed-statistics.github.io/math241s21/office_hours.html)
* Stats Visitor Job Talks
    + This week and next!
    + Will post talk info in the #outside channel
* In terms of GitHub, make sure you:
    + Joined the Reed-Math241 Org
    + Joined your `labwork_username` repo
    + Joined your `pkgGrpX` repo
* Project questions?
    + Mahria Lebow, Data Librarian, here to help!
    + Bad package name??

---

## What if we don't want our package to be called `pkgGrpX`?

* No problem!
* Decide if you would like to share the post on the course website.
    + If so, determine the `name` of the URL you would like me to use to host your package (e.g., https://github.com/Reed-Math241/INSERT_NAME).
    + Post-grading, I will rename your repo on GitHub (and then you will need to update the package name in a few places on the repo).

---

## Looking Ahead...

* Receive Lab 3 this week
    + Due next Thursday (Feb 25th) at 8:30am
* Mini-Project 1 due the following Thursday (Mar 4th) at 8:30am
    + Presentations during class.
    + Add slides [here](https://docs.google.com/presentation/d/1Vn_YwO9hqDhDELJ-TqLiHyoLFZ4vroA-FX1aLem3c9U/edit?usp=sharing).

---

## Goals for Today

* R Data Package License
* Git: Merge conflicts
* Data joins with `dplyr`
* Tidy data
* Tidying and reshaping data with `tidyr`

---

## Git Collaboration: Merge conflicts

* What if my partners and I both make changes?
    + Scenario: Your partner makes changes to a file, commits, and pushes to GitHub. You also modify that file, commit, and push.
    + Result: Your push will fail because there's a commit on GitHub that you don't have.
    + Usual solution: Pull, and *usually* Git will merge their work nicely with yours. Then push. If that doesn't work, you have a **merge conflict**. Let's cross that bridge when we get there.
* How to avoid merge conflicts?
    + Always pull when you are going to work on your project.
    + Always commit and push when you are done, even if you only made small changes.

---

## Data Joins

* Often in the data analysis workflow, we have more than one data source, which means more than one dataframe, and we want to combine these dataframes.
* Need a principled way to combine them.
    + Need a **key** that links the two dataframes together.
* These multiple dataframes are called **relational data**.
    + Will talk about relational database management systems later in the semester.

<!-- --- -->
<!-- ## Data Joins: Language -->
<!-- Three families of verbs for working with relational data: -->
<!-- -- -->
<!-- * **Mutating joins**: Add variables to one data frame from matching observations in another. -->
<!-- -- -->
<!-- * **Filtering joins**: Filter observations from one data frame based on whether or not they match an observation in the other table. -->
<!-- -- -->
<!-- * **Set operations**: Treat observations as if they were set elements. -->

---

## Example: BLS Consumer Expenditure Data

* Household survey, but data are also collected on individuals
    + `fmli`: household data
    + `memi`: household member-level data

```r
# Read in data with the readr package
library(tidyverse)
fmli <- read_csv("/home/courses/math241s21/Data/fmli.csv", na = c("NA", "."))
memi <- read_csv("/home/courses/math241s21/Data/memi.csv", na = c("NA", "."))
```

* Want the principal earner's variables to be added to the household data

---

## CE Data

* Key variable(s)?
```r library(dplyr) glimpse(fmli) ``` ``` ## Rows: 6,301 ## Columns: 51 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", … ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", … ## $ FINLWT21 <dbl> 25985, 6581, 20208, 18078, 20112, 19907, 11705, 24431, 42859… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1… ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, … ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", … ## $ EDUCA2 <chr> "15", "15", "13", "", "", "", "", "15", "15", "14", "12", "1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, … ## $ AGE2 <dbl> 50, 47, 46, NA, NA, NA, NA, 36, 53, 67, 44, 62, NA, NA, NA, … ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, … ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, … ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, … ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, … ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, … ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, … ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", … ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, … ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ STUDFINX <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ IRAX <dbl> 1000000, 10000, 0, NA, NA, 0, 0, 15000, NA, 477000, NA, NA, … ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, … ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, … ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, … ## $ ROOMSQ <dbl> 8, 5, 6, 4, 4, 4, 7, 5, 4, 9, 6, 10, 4, 7, 5, 6, 6, 8, 18, 4… ## $ INC_HRS1 <dbl> 40, 40, 40, 44, 40, NA, NA, 40, 40, NA, 40, NA, NA, NA, NA, … ## $ INC_HRS2 <dbl> 30, 40, 52, NA, NA, NA, NA, 40, 40, NA, 65, NA, NA, NA, NA, … ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, … ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, … ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", "", "", "12", "04", "", "01", … ## $ OCCUCOD2 <chr> "04", "02", "01", "", "", "", "", "02", "03", "", "11", "", … ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", "", "27", "2… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9,… ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 14… ## $ CREDFINX <dbl> 0, NA, 0, NA, 5, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 2, 35… ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ CREDITX <dbl> 4000, 5000, 2000, NA, 7000, 1800, NA, 6000, NA, 719, NA, 120… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", … ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ``` --- ## CE Data * Key variables? ```r glimpse(memi) ``` ``` ## Rows: 15,412 ## Columns: 14 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", … ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, … ## $ AGE <dbl> 58, 54, 49, 39, 10, 32, 7, 9, 38, 34, 11, 8, 6, 3, 65, 61, 1… ## $ SEX <dbl> 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, … ## $ EARNER <dbl> 1, 1, 2, 2, NA, 2, NA, NA, 1, 2, NA, NA, NA, NA, 1, 2, NA, N… ## $ EARNTYPE <dbl> 2, 1, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA, 4, NA, … ## $ INC_HRSQ <dbl> 20, 56, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 25, … ## $ INCOMEY <dbl> 4, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 4, NA, … ## $ OCCUCODE <chr> "10", "05", "", "", "", "", "", "", "03", "", "", "", "", ""… ## $ HISPANIC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ MEMBRACE <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, … ## $ PAYSTUB <dbl> 1, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, NA, … ## $ SALARYX <dbl> 8982, NA, NA, NA, NA, NA, NA, NA, 280500, NA, NA, NA, NA, NA… ## $ WKSTATUS <dbl> 1, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1, NA, … ``` --- ## CE Data * Key variables? + Problem with class? ```r glimpse(select(fmli, 1,2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", … ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", … ``` ```r glimpse(select(memi, 1:2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "0… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3,… ``` --- ## CE Data * Key variables? 
+ Problem with class? ```r fmli <- mutate(fmli, PRINEARN = as.integer(PRINEARN)) glimpse(select(fmli, 1,2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", … ## $ PRINEARN <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, … ``` ```r glimpse(select(memi, 1:2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "0… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3,… ``` --- ## CE Data * Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data + What type of join is that? --- ## The World of Joins * Mutating joins: Add new variables to one dataset from matching observations in another. + `left_join()` (and `right_join()`) + `inner_join()` + `full_join()` * There are also *filtering* joins but we won't cover those today. --- ## Example Dataframes ```r d1 <- tibble(V = 1:5, X = c(1, 2, 2, 3, 1), Y = c(14, 3, 6, 1, 4)) d2 <- tibble(X = c(2, 4, 1), S = c(4, 13, 8)) d1 ``` ``` ## # A tibble: 5 x 3 ## V X Y ## <int> <dbl> <dbl> ## 1 1 1 14 ## 2 2 2 3 ## 3 3 2 6 ## 4 4 3 1 ## 5 5 1 4 ``` ```r d2 ``` ``` ## # A tibble: 3 x 2 ## X S ## <dbl> <dbl> ## 1 2 4 ## 2 4 13 ## 3 1 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2) ``` ``` ## Joining, by = "X" ``` ```r d1_new ``` ``` ## # A tibble: 5 x 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("X" = "X")) d1_new ``` ``` ## # A tibble: 5 x 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("V" = "X")) d1_new ``` ``` ## # A tibble: 5 x 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## `inner_join()` ```r 
d1_d2 <- inner_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 4 x 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 5 1 4 8 ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 3 x 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 4 3 1 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 6 x 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ## 6 NA 4 NA 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 5 x 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## Back to our Example * What kind of join do we want for the Consumer Expenditure data? + Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data --- ## Look at the Possible Joins ```r left_join(fmli, memi) %>% arrange(NEWID) ``` ``` ## Joining, by = "NEWID" ``` ``` ## # A tibble: 15,412 x 64 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 15 63 ## 2 0332… 1 25985. 116920 1 2 16 15 63 ## 3 0332… 1 25985. 116920 1 2 16 15 63 ## 4 0332… 1 25985. 116920 1 2 16 15 63 ## 5 0332… 1 6581. 200 1 3 15 15 50 ## 6 0332… 1 6581. 200 1 3 15 15 50 ## 7 0332… 1 6581. 200 1 3 15 15 50 ## 8 0332… 1 6581. 200 1 3 15 15 50 ## 9 0332… 1 6581. 200 1 3 15 15 50 ## 10 0332… 1 6581. 
200 1 3 15 15 50 ## # … with 15,402 more rows, and 55 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, MEMBNO <dbl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, ## # EARNTYPE <dbl>, INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, ## # HISPANIC <dbl>, MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, ## # WKSTATUS <dbl> ``` --- ## Look at the Possible Joins * Be careful. This erroneous example made my R crash! ```r left_join(fmli, memi, by = c("PRINEARN" = "MEMBNO")) %>% arrange(MEMBNO) ``` --- ## Look at the Possible Joins ```r left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 x 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 "15" 63 ## 2 0332… 1 6581. 200 1 3 15 "15" 50 ## 3 0332… 1 20208. 117000 1 4 16 "13" 47 ## 4 0332… 1 18078. 0 1 2 15 "" 37 ## 5 0332… 2 20112. 2000 1 2 14 "" 51 ## 6 0332… 1 19907. 942 1 2 11 "" 63 ## 7 0332… 1 11705. 0 1 1 10 "" 77 ## 8 0332… 1 24431. 91000 1 2 13 "15" 37 ## 9 0332… 2 42859. 95000 2 5 12 "15" 51 ## 10 0332… 1 17481. 
40037 1 2 12 "14" 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, EARNTYPE <dbl>, ## # INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, HISPANIC <dbl>, ## # MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, WKSTATUS <dbl> ``` --- ## Look at the Possible Joins ```r inner_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 x 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 "15" 63 ## 2 0332… 1 6581. 200 1 3 15 "15" 50 ## 3 0332… 1 20208. 117000 1 4 16 "13" 47 ## 4 0332… 1 18078. 0 1 2 15 "" 37 ## 5 0332… 2 20112. 2000 1 2 14 "" 51 ## 6 0332… 1 19907. 942 1 2 11 "" 63 ## 7 0332… 1 11705. 0 1 1 10 "" 77 ## 8 0332… 1 24431. 91000 1 2 13 "15" 37 ## 9 0332… 2 42859. 95000 2 5 12 "15" 51 ## 10 0332… 1 17481. 
40037 1 2 12 "14" 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, EARNTYPE <dbl>, ## # INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, HISPANIC <dbl>, ## # MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, WKSTATUS <dbl> ``` * Why does this give us the same answer as `left_join` for this situation? --- ## Look at the Possible Joins ```r full_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 15,412 x 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 15 63 ## 2 0332… 2 NA NA NA NA <NA> <NA> NA ## 3 0332… 3 NA NA NA NA <NA> <NA> NA ## 4 0332… 4 NA NA NA NA <NA> <NA> NA ## 5 0332… 1 6581. 
200 1 3 15 15 50 ## 6 0332… 2 NA NA NA NA <NA> <NA> NA ## 7 0332… 3 NA NA NA NA <NA> <NA> NA ## 8 0332… 4 NA NA NA NA <NA> <NA> NA ## 9 0332… 5 NA NA NA NA <NA> <NA> NA ## 10 0332… 6 NA NA NA NA <NA> <NA> NA ## # … with 15,402 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, EARNTYPE <dbl>, ## # INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, HISPANIC <dbl>, ## # MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, WKSTATUS <dbl> ``` --- ## Joining Tips ```r fmli <- left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` * FIRST: conceptualize for yourself what you think you want the final dataset to look like! * Check initial dimensions and final dimensions. * Use variable names when joining even if they are the same. --- ## Data Wrangling in the `tidyverse` **Question**: Why do we need TWO different data wrangling packages in the `tidyverse`? -- **Question**: What does `dplyr` do? -- * Performs various data manipulation tasks, such as extracting and summarizing -- **Question**: So what is left for `tidyr` to do? * Reshape data into different formats (e.g. long and wide) --- ## Key Idea: Reshape Data to a Tidy Data Format > "Happy families are all alike; every unhappy family is unhappy in its own way." 
-- Leo Tolstoy -- > "Tidy datasets are all alike, but every messy dataset is messy in its own way." -- Hadley Wickham -- * By tidy, we don't mean neat. * Tidy data satisfy rules that make it easy to work with the data. --- ## Tidy Data Rules <img src="img/tidyRules.png" width="80%" /> * Each column is a single variable. * Each row is a unique observation. * Each value must have its own cell. --- ## Tidy Data ```r fmli ``` ``` ## # A tibble: 6,301 x 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 "15" 63 ## 2 0332… 1 6581. 200 1 3 15 "15" 50 ## 3 0332… 1 20208. 117000 1 4 16 "13" 47 ## 4 0332… 1 18078. 0 1 2 15 "" 37 ## 5 0332… 2 20112. 2000 1 2 14 "" 51 ## 6 0332… 1 19907. 942 1 2 11 "" 63 ## 7 0332… 1 11705. 0 1 1 10 "" 77 ## 8 0332… 1 24431. 91000 1 2 13 "15" 37 ## 9 0332… 2 42859. 95000 2 5 12 "15" 51 ## 10 0332… 1 17481. 40037 1 2 12 "14" 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, EARNTYPE <dbl>, ## # INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, HISPANIC <dbl>, ## # MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, WKSTATUS <dbl> ``` --- ## Un-Tidy Data * Let's look at a dataset of CDC health stats for PA, NY, and TX ``` ## 
# A tibble: 5 x 6 ## YearStart PA NY TX DataValueType Question ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2014 87.7 66.8 96.4 Age-adjusted Rate Mortality from heart failure ## 2 2013 87.3 67.8 95.6 Age-adjusted Rate Mortality from heart failure ## 3 2012 83.2 67.7 93.5 Age-adjusted Rate Mortality from heart failure ## 4 2011 83.8 69.9 92.1 Age-adjusted Rate Mortality from heart failure ## 5 2010 83 69.4 96.5 Age-adjusted Rate Mortality from heart failure ``` * Which tidy rule was broken? --- ## Tidy Version of the Data ``` ## # A tibble: 15 x 5 ## YearStart LocationAbbr DataValue Question DataValueType ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 NY 66.8 Mortality from heart failu… Age-adjusted Ra… ## 2 2014 PA 87.7 Mortality from heart failu… Age-adjusted Ra… ## 3 2014 TX 96.4 Mortality from heart failu… Age-adjusted Ra… ## 4 2013 NY 67.8 Mortality from heart failu… Age-adjusted Ra… ## 5 2013 PA 87.3 Mortality from heart failu… Age-adjusted Ra… ## 6 2013 TX 95.6 Mortality from heart failu… Age-adjusted Ra… ## 7 2012 NY 67.7 Mortality from heart failu… Age-adjusted Ra… ## 8 2012 PA 83.2 Mortality from heart failu… Age-adjusted Ra… ## 9 2012 TX 93.5 Mortality from heart failu… Age-adjusted Ra… ## 10 2011 NY 69.9 Mortality from heart failu… Age-adjusted Ra… ## 11 2011 PA 83.8 Mortality from heart failu… Age-adjusted Ra… ## 12 2011 TX 92.1 Mortality from heart failu… Age-adjusted Ra… ## 13 2010 NY 69.4 Mortality from heart failu… Age-adjusted Ra… ## 14 2010 PA 83 Mortality from heart failu… Age-adjusted Ra… ## 15 2010 TX 96.5 Mortality from heart failu… Age-adjusted Ra… ``` --- ## Un-Tidy Data ``` ## # A tibble: 15 x 5 ## YearStart LocationAbbr DataValue DataValueType Question ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 PA 87.7 Age-adjusted Rate Mortality from heart fail… ## 2 2014 PA 126. Crude Rate Mortality from heart fail… ## 3 2014 PA 16091 Number Mortality from heart fail… ## 4 2013 PA 87.3 Age-adjusted Rate Mortality from heart fail… ## 5 2013 PA 124. 
Crude Rate Mortality from heart fail… ## 6 2013 PA 15894 Number Mortality from heart fail… ## 7 2012 PA 83.2 Age-adjusted Rate Mortality from heart fail… ## 8 2012 PA 117. Crude Rate Mortality from heart fail… ## 9 2012 PA 14893 Number Mortality from heart fail… ## 10 2011 PA 83.8 Age-adjusted Rate Mortality from heart fail… ## 11 2011 PA 116. Crude Rate Mortality from heart fail… ## 12 2011 PA 14827 Number Mortality from heart fail… ## 13 2010 PA 83 Age-adjusted Rate Mortality from heart fail… ## 14 2010 PA 114. Crude Rate Mortality from heart fail… ## 15 2010 PA 14411 Number Mortality from heart fail… ``` * Which tidy rule was broken? --- ## Tidy Data ``` ## # A tibble: 5 x 6 ## YearStart LocationAbbr Question `Age-adjusted Ra… `Crude Rate` Number ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> ## 1 2014 PA Mortality from h… 87.7 126. 16091 ## 2 2013 PA Mortality from h… 87.3 124. 15894 ## 3 2012 PA Mortality from h… 83.2 117. 14893 ## 4 2011 PA Mortality from h… 83.8 116. 14827 ## 5 2010 PA Mortality from h… 83 114. 14411 ``` --- ## `tidyr` * `tidyr` is an R package which will allow us to switch a dataset from a wide to a long format and vice versa. ```r library(tidyverse) ``` -- * **Problem**: When column names are not variables but values of a variable. Each row represents multiple observations. + **Solution**: Use `pivot_longer()`. -- * **Problem**: When an observation is scattered across multiple rows. + **Solution**: Use `pivot_wider()`. -- * **Problem**: Two observations are in one cell. + **Solution**: Use `separate()`. --- ## `pivot_longer()` * What columns do we want to gather? 
+ names_to = name of variable whose values are the column names + values_to = name of variable whose values are spread over the cells ```r CDC_wide ``` ``` ## # A tibble: 5 x 6 ## YearStart PA NY TX DataValueType Question ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2014 87.7 66.8 96.4 Age-adjusted Rate Mortality from heart failure ## 2 2013 87.3 67.8 95.6 Age-adjusted Rate Mortality from heart failure ## 3 2012 83.2 67.7 93.5 Age-adjusted Rate Mortality from heart failure ## 4 2011 83.8 69.9 92.1 Age-adjusted Rate Mortality from heart failure ## 5 2010 83 69.4 96.5 Age-adjusted Rate Mortality from heart failure ``` --- ## `pivot_longer()` ```r CDC_narrow_tidy <- pivot_longer(CDC_wide, cols = c(PA, NY, TX), names_to = "State", values_to = "MortalityRate") %>% select(YearStart, State, MortalityRate) CDC_narrow_tidy ``` ``` ## # A tibble: 15 x 3 ## YearStart State MortalityRate ## <dbl> <chr> <dbl> ## 1 2014 PA 87.7 ## 2 2014 NY 66.8 ## 3 2014 TX 96.4 ## 4 2013 PA 87.3 ## 5 2013 NY 67.8 ## 6 2013 TX 95.6 ## 7 2012 PA 83.2 ## 8 2012 NY 67.7 ## 9 2012 TX 93.5 ## 10 2011 PA 83.8 ## 11 2011 NY 69.9 ## 12 2011 TX 92.1 ## 13 2010 PA 83 ## 14 2010 NY 69.4 ## 15 2010 TX 96.5 ``` --- ## `pivot_wider()` * Which columns do we want to spread across multiple rows? + `names_from` = column that contains the variable names + `values_from` = column that contains the values of the variables ```r CDC_narrow_untidy ``` ``` ## # A tibble: 15 x 5 ## YearStart LocationAbbr DataValue DataValueType Question ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 PA 87.7 Age-adjusted Rate Mortality from heart fail… ## 2 2014 PA 126. Crude Rate Mortality from heart fail… ## 3 2014 PA 16091 Number Mortality from heart fail… ## 4 2013 PA 87.3 Age-adjusted Rate Mortality from heart fail… ## 5 2013 PA 124. Crude Rate Mortality from heart fail… ## 6 2013 PA 15894 Number Mortality from heart fail… ## 7 2012 PA 83.2 Age-adjusted Rate Mortality from heart fail… ## 8 2012 PA 117. 
Crude Rate Mortality from heart fail… ## 9 2012 PA 14893 Number Mortality from heart fail… ## 10 2011 PA 83.8 Age-adjusted Rate Mortality from heart fail… ## 11 2011 PA 116. Crude Rate Mortality from heart fail… ## 12 2011 PA 14827 Number Mortality from heart fail… ## 13 2010 PA 83 Age-adjusted Rate Mortality from heart fail… ## 14 2010 PA 114. Crude Rate Mortality from heart fail… ## 15 2010 PA 14411 Number Mortality from heart fail… ``` --- ## `pivot_wider()` ```r CDC_wide_tidy <- pivot_wider(CDC_narrow_untidy, names_from = DataValueType, values_from = DataValue) CDC_wide_tidy ``` ``` ## # A tibble: 5 x 6 ## YearStart LocationAbbr Question `Age-adjusted Ra… `Crude Rate` Number ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> ## 1 2014 PA Mortality from h… 87.7 126. 16091 ## 2 2013 PA Mortality from h… 87.3 124. 15894 ## 3 2012 PA Mortality from h… 83.2 117. 14893 ## 4 2011 PA Mortality from h… 83.8 116. 14827 ## 5 2010 PA Mortality from h… 83 114. 14411 ``` --- ## `separate()` * Separate one column into multiple columns. * What column might we want to separate into two? ```r read_csv("/home/courses/math241s21/Data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% glimpse() ``` ``` ## Rows: 74,811 ## Columns: 5 ## $ GeoLocation <chr> NA, "(32.84057112200048, -86.63186076199969)", "(64.8450… ## $ YearStart <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20… ## $ LocationAbbr <chr> "US", "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "D… ## $ DataValue <dbl> 16.9, 13.0, 18.2, 15.6, 15.0, 16.3, 19.0, 16.7, 17.0, 25… ## $ Question <chr> "Binge drinking prevalence among adults aged >= 18 years… ``` --- ## `separate()` * Separate one column into multiple columns. 
+ `col` = name of column to separate + `into` = character vector with names of new columns + `sep` = character to split on ```r CDC_sep <- read_csv("/home/courses/math241s21/Data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% separate(col = GeoLocation, into = c("lat", "long"), sep = ", ") CDC_sep ``` ``` ## # A tibble: 74,811 x 6 ## lat long YearStart LocationAbbr DataValue Question ## <chr> <chr> <dbl> <chr> <dbl> <chr> ## 1 <NA> <NA> 2016 US 16.9 Binge drinking preval… ## 2 (32.84057… -86.63186… 2016 AL 13 Binge drinking preval… ## 3 (64.84507… -147.7220… 2016 AK 18.2 Binge drinking preval… ## 4 (34.86597… -111.7638… 2016 AZ 15.6 Binge drinking preval… ## 5 (34.74865… -92.27449… 2016 AR 15 Binge drinking preval… ## 6 (37.63864… -120.9999… 2016 CA 16.3 Binge drinking preval… ## 7 (38.84384… -106.1336… 2016 CO 19 Binge drinking preval… ## 8 (41.56266… -72.64984… 2016 CT 16.7 Binge drinking preval… ## 9 (39.00883… -75.57774… 2016 DE 17 Binge drinking preval… ## 10 (38.907192 -77.03687… 2016 DC 25.6 Binge drinking preval… ## # … with 74,801 more rows ``` --- ## `separate()` ```r library(stringr) CDC_sep <- read_csv("/home/courses/math241s21/Data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% separate(col = GeoLocation, into = c("lat", "long"), sep = ", ") %>% mutate(lat = parse_number(lat), long = parse_number(long)) ``` --- ## Another Example: Live Coding * Also a chance to see some additional `dplyr` functions. 
```r library(readxl) url <- "https://www.bts.gov/sites/bts.dot.gov/files/table_01_33_102020.xlsx" destfile <- "table_01_33_102020.xlsx" curl::curl_download(url, destfile) Amtrak <- read_excel(destfile) Amtrak ``` ``` ## # A tibble: 20 x 35 ## `Table 1-33: A… ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 <NA> 1972 1975 1980 1985 1990 1991 1992 1993 1994 ## 2 Locomotives <NA> NA NA NA NA NA NA NA NA ## 3 Percent availab… U 87 83 93 84 86 83 84 85 ## 4 Average age (ye… 22.3 14.4 7.4 7 12 13 13 13.2 13.4 ## 5 Passenger and o… <NA> NA NA NA NA NA NA NA NA ## 6 Percent availab… U 82 77 90 90 92 90 89 88 ## 7 Average age (ye… 22 24.7 14.3 14.2 20 21 21.5 22.6 22.4 ## 8 KEY: U = data … <NA> NA NA NA NA NA NA NA NA ## 9 <NA> <NA> NA NA NA NA NA NA NA NA ## 10 a Year-end dail… <NA> NA NA NA NA NA NA NA NA ## 11 b Fiscal Year-e… <NA> NA NA NA NA NA NA NA NA ## 12 <NA> <NA> NA NA NA NA NA NA NA NA ## 13 NOTES <NA> NA NA NA NA NA NA NA NA ## 14 1972 was Amtrak… <NA> NA NA NA NA NA NA NA NA ## 15 Roadrailers are… <NA> NA NA NA NA NA NA NA NA ## 16 <NA> <NA> NA NA NA NA NA NA NA NA ## 17 SOURCES <NA> NA NA NA NA NA NA NA NA ## 18 1972-80: Amtrak… <NA> NA NA NA NA NA NA NA NA ## 19 1985-2000: Ibid… <NA> NA NA NA NA NA NA NA NA ## 20 2001-19: Amtrak… <NA> NA NA NA NA NA NA NA NA ## # … with 25 more variables: ...11 <dbl>, ...12 <dbl>, ...13 <dbl>, ...14 <dbl>, ## # ...15 <dbl>, ...16 <dbl>, ...17 <chr>, ...18 <chr>, ...19 <dbl>, ## # ...20 <dbl>, ...21 <dbl>, ...22 <dbl>, ...23 <dbl>, ...24 <dbl>, ## # ...25 <dbl>, ...26 <dbl>, ...27 <dbl>, ...28 <dbl>, ...29 <dbl>, ## # ...30 <dbl>, ...31 <dbl>, ...32 <chr>, ...33 <chr>, ...34 <chr>, ## # ...35 <chr> ```
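---

## One Possible Cleaning Sketch

* Notice that the `Amtrak` sheet stores the years in its *first data row*, not in the column names.
* Here is one way we *might* handle that pattern, shown on a tiny made-up tibble (the values and column names `col2`/`col3` are hypothetical stand-ins, not the real BTS numbers, which arrive as `...2`, `...3`, etc.):

```r
library(tidyverse)

# Miniature stand-in for the Amtrak sheet: years live in the first row
raw <- tibble(measure = c(NA, "Percent available", "Average age (years)"),
              col2 = c("1972", "87", "22.3"),
              col3 = c("1975", "83", "14.4"))

# Grab the year labels from row 1, promote them to column names,
# then pivot the year columns into a long format
years <- raw %>% slice(1) %>% select(-measure) %>% unlist()
cleaned <- raw %>%
  slice(-1) %>%
  set_names(c("measure", years)) %>%
  pivot_longer(-measure, names_to = "year", values_to = "value") %>%
  mutate(across(c(year, value), parse_number))
# cleaned now has one row per (measure, year) combination
```

* Same `pivot_longer()` idea as the CDC example, just with a row-promotion step first.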