class: center, middle

## Data Joins with `dplyr` and Data Reshaping with `tidyr`

<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 4 | Spring 2021]

---

## Adjusting for the 🌨 Storm

* Moving Mini-Project 1 due date and presentations to one week later: ~~Th Feb 25th~~ **Th Mar 4th**
    + Updated on the [website](https://reed-statistics.github.io/math241s21/projects.html)
* Pushed back the dates for Mini-Project 2
* Modified Office Hours this week:
    + Th 3:30pm - 4:30pm
    + Fri 10:00am - noon

---

# Announcements/Reminders

* [Additional office hours added to the schedule](https://reed-statistics.github.io/math241s21/office_hours.html)
* Stats Visitor Job Talks
    + This week and next!
    + Will post talk info in the #outside channel
* In terms of GitHub, make sure you:
    + Joined the Reed-Math241 Org
    + Joined your `labwork_username` repo
    + Joined your `pkgGrpX` repo
* Project questions?
    + Mahria Lebow, Data Librarian, here to help!
    + Bad package name??

---

## What if we don't want our package to be called `pkgGrpX`?

* No problem!
* Decide if you would like to share the post on the course website.
    + If so, determine the `name` of the URL you would like me to use to host your package (e.g., https://github.com/Reed-Math241/INSERT_NAME).
    + Post-grading, I will rename your repo on GitHub (and then you will need to update the package name in a few places on the repo).

---

## Looking Ahead...

* Receive Lab 3 this week
    + Due next Thursday (Feb 25th) at 8:30am
* Mini-Project 1 due the following Thursday (Mar 4th) at 8:30am
    + Presentations during class.
    + Add slides [here](https://docs.google.com/presentation/d/1Vn_YwO9hqDhDELJ-TqLiHyoLFZ4vroA-FX1aLem3c9U/edit?usp=sharing).

---

## Goals for Today

* R Data Package License
* Git: Merge conflicts
* Data joins with `dplyr`
* Tidy data
* Tidying and reshaping data with `tidyr`

---

## Git Collaboration: Merge conflicts

* What if my partners and I both make changes?
    + Scenario: Your partner makes changes to a file, commits, and pushes to GitHub. You also modify that file, commit, and push.
    + Result: Your push will fail because there's a commit on GitHub that you don't have.
    + Usual solution: Pull, and *usually* Git will merge their work nicely with yours. Then push. If that doesn't work, you have a **merge conflict**. Let's cross that bridge when we get there.
* How to avoid merge conflicts?
    + Always pull when you are going to work on your project.
    + Always commit and push when you are done, even if you only made small changes.

---

## Data Joins

* Often in the data analysis workflow, we have more than one data source, which means more than one dataframe, and we want to combine these dataframes.
* Need a principled way to combine them.
    + Need a **key** that links the two dataframes together.
* These multiple dataframes are called **relational data**.
    + Will talk about relational database management systems later in the semester.

<!-- --- -->
<!-- ## Data Joins: Language -->
<!-- Three families of verbs for working with relational data: -->
<!-- -- -->
<!-- * **Mutating joins**: Add variables to one data frame from matching observations in another. -->
<!-- -- -->
<!-- * **Filtering joins**: Filter observations from one data frame based on whether or not they match an observation in the other table. -->
<!-- -- -->
<!-- * **Set operations**: Treat observations as if they were set elements. -->

---

## Example: BLS Consumer Expenditure Data

* Household survey, but data are also collected on individuals
    + `fmli`: household data
    + `memi`: household member-level data

```r
# Read in data with the readr package
library(tidyverse)
fmli <- read_csv("/home/courses/math241s21/Data/fmli.csv", na = c("NA", "."))
memi <- read_csv("/home/courses/math241s21/Data/memi.csv", na = c("NA", "."))
```

* Want the principal earner's variables to be added to the household data

---

## CE Data

* Key variable(s)?
```r library(dplyr) glimpse(fmli) ``` ``` ## Rows: 6,301 ## Columns: 51 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", … ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", … ## $ FINLWT21 <dbl> 25985, 6581, 20208, 18078, 20112, 19907, 11705, 24431, 42859… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1… ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, … ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", … ## $ EDUCA2 <chr> "15", "15", "13", "", "", "", "", "15", "15", "14", "12", "1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, … ## $ AGE2 <dbl> 50, 47, 46, NA, NA, NA, NA, 36, 53, 67, 44, 62, NA, NA, NA, … ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, … ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, … ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, … ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, … ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, … ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, … ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", … ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, … ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ STUDFINX <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ IRAX <dbl> 1000000, 10000, 0, NA, NA, 0, 0, 15000, NA, 477000, NA, NA, … ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, … ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, … ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, … ## $ ROOMSQ <dbl> 8, 5, 6, 4, 4, 4, 7, 5, 4, 9, 6, 10, 4, 7, 5, 6, 6, 8, 18, 4… ## $ INC_HRS1 <dbl> 40, 40, 40, 44, 40, NA, NA, 40, 40, NA, 40, NA, NA, NA, NA, … ## $ INC_HRS2 <dbl> 30, 40, 52, NA, NA, NA, NA, 40, 40, NA, 65, NA, NA, NA, NA, … ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, … ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, … ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", "", "", "12", "04", "", "01", … ## $ OCCUCOD2 <chr> "04", "02", "01", "", "", "", "", "02", "03", "", "11", "", … ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", "", "27", "2… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9,… ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 14… ## $ CREDFINX <dbl> 0, NA, 0, NA, 5, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 2, 35… ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ CREDITX <dbl> 4000, 5000, 2000, NA, 7000, 1800, NA, 6000, NA, 719, NA, 120… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", … ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ``` --- ## CE Data * Key variables? ```r glimpse(memi) ``` ``` ## Rows: 15,412 ## Columns: 14 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", … ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, … ## $ AGE <dbl> 58, 54, 49, 39, 10, 32, 7, 9, 38, 34, 11, 8, 6, 3, 65, 61, 1… ## $ SEX <dbl> 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, … ## $ EARNER <dbl> 1, 1, 2, 2, NA, 2, NA, NA, 1, 2, NA, NA, NA, NA, 1, 2, NA, N… ## $ EARNTYPE <dbl> 2, 1, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA, 4, NA, … ## $ INC_HRSQ <dbl> 20, 56, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 25, … ## $ INCOMEY <dbl> 4, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 4, NA, … ## $ OCCUCODE <chr> "10", "05", "", "", "", "", "", "", "03", "", "", "", "", ""… ## $ HISPANIC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ MEMBRACE <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, … ## $ PAYSTUB <dbl> 1, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, NA, … ## $ SALARYX <dbl> 8982, NA, NA, NA, NA, NA, NA, NA, 280500, NA, NA, NA, NA, NA… ## $ WKSTATUS <dbl> 1, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1, NA, … ``` --- ## CE Data * Key variables? + Problem with class? ```r glimpse(select(fmli, 1,2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", … ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", … ``` ```r glimpse(select(memi, 1:2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "0… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3,… ``` --- ## CE Data * Key variables? 
+ Problem with class? ```r fmli <- mutate(fmli, PRINEARN = as.integer(PRINEARN)) glimpse(select(fmli, 1,2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", … ## $ PRINEARN <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, … ``` ```r glimpse(select(memi, 1:2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "0… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3,… ``` --- ## CE Data * Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data + What type of join is that? --- ## The World of Joins * Mutating joins: Add new variables to one dataset from matching observations in another. + `left_join()` (and `right_join()`) + `inner_join()` + `full_join()` * There are also *filtering* joins but we won't cover those today. --- ## Example Dataframes ```r d1 <- tibble(V = 1:5, X = c(1, 2, 2, 3, 1), Y = c(14, 3, 6, 1, 4)) d2 <- tibble(X = c(2, 4, 1), S = c(4, 13, 8)) d1 ``` ``` ## # A tibble: 5 x 3 ## V X Y ## <int> <dbl> <dbl> ## 1 1 1 14 ## 2 2 2 3 ## 3 3 2 6 ## 4 4 3 1 ## 5 5 1 4 ``` ```r d2 ``` ``` ## # A tibble: 3 x 2 ## X S ## <dbl> <dbl> ## 1 2 4 ## 2 4 13 ## 3 1 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2) ``` ``` ## Joining, by = "X" ``` ```r d1_new ``` ``` ## # A tibble: 5 x 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("X" = "X")) d1_new ``` ``` ## # A tibble: 5 x 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("V" = "X")) d1_new ``` ``` ## # A tibble: 5 x 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## `inner_join()` ```r 
d1_d2 <- inner_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 4 x 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 5 1 4 8 ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 3 x 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 4 3 1 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 6 x 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ## 6 NA 4 NA 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 5 x 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## Back to our Example * What kind of join do we want for the Consumer Expenditure data? + Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data --- ## Look at the Possible Joins ```r left_join(fmli, memi) %>% arrange(NEWID) ``` ``` ## Joining, by = "NEWID" ``` ``` ## # A tibble: 15,412 x 64 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 15 63 ## 2 0332… 1 25985. 116920 1 2 16 15 63 ## 3 0332… 1 25985. 116920 1 2 16 15 63 ## 4 0332… 1 25985. 116920 1 2 16 15 63 ## 5 0332… 1 6581. 200 1 3 15 15 50 ## 6 0332… 1 6581. 200 1 3 15 15 50 ## 7 0332… 1 6581. 200 1 3 15 15 50 ## 8 0332… 1 6581. 200 1 3 15 15 50 ## 9 0332… 1 6581. 200 1 3 15 15 50 ## 10 0332… 1 6581. 
200 1 3 15 15 50 ## # … with 15,402 more rows, and 55 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, MEMBNO <dbl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, ## # EARNTYPE <dbl>, INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, ## # HISPANIC <dbl>, MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, ## # WKSTATUS <dbl> ``` --- ## Look at the Possible Joins * Be careful. This erroneous example made my R crash! ```r left_join(fmli, memi, by = c("PRINEARN" = "MEMBNO")) %>% arrange(MEMBNO) ``` --- ## Look at the Possible Joins ```r left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 x 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 "15" 63 ## 2 0332… 1 6581. 200 1 3 15 "15" 50 ## 3 0332… 1 20208. 117000 1 4 16 "13" 47 ## 4 0332… 1 18078. 0 1 2 15 "" 37 ## 5 0332… 2 20112. 2000 1 2 14 "" 51 ## 6 0332… 1 19907. 942 1 2 11 "" 63 ## 7 0332… 1 11705. 0 1 1 10 "" 77 ## 8 0332… 1 24431. 91000 1 2 13 "15" 37 ## 9 0332… 2 42859. 95000 2 5 12 "15" 51 ## 10 0332… 1 17481. 
40037 1 2 12 "14" 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, EARNTYPE <dbl>, ## # INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, HISPANIC <dbl>, ## # MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, WKSTATUS <dbl> ``` --- ## Look at the Possible Joins ```r inner_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 x 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 "15" 63 ## 2 0332… 1 6581. 200 1 3 15 "15" 50 ## 3 0332… 1 20208. 117000 1 4 16 "13" 47 ## 4 0332… 1 18078. 0 1 2 15 "" 37 ## 5 0332… 2 20112. 2000 1 2 14 "" 51 ## 6 0332… 1 19907. 942 1 2 11 "" 63 ## 7 0332… 1 11705. 0 1 1 10 "" 77 ## 8 0332… 1 24431. 91000 1 2 13 "15" 37 ## 9 0332… 2 42859. 95000 2 5 12 "15" 51 ## 10 0332… 1 17481. 
40037 1 2 12 "14" 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, EARNTYPE <dbl>, ## # INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, HISPANIC <dbl>, ## # MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, WKSTATUS <dbl> ``` * Why does this give us the same answer as `left_join` for this situation? --- ## Look at the Possible Joins ```r full_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 15,412 x 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 15 63 ## 2 0332… 2 NA NA NA NA <NA> <NA> NA ## 3 0332… 3 NA NA NA NA <NA> <NA> NA ## 4 0332… 4 NA NA NA NA <NA> <NA> NA ## 5 0332… 1 6581. 
200 1 3 15 15 50 ## 6 0332… 2 NA NA NA NA <NA> <NA> NA ## 7 0332… 3 NA NA NA NA <NA> <NA> NA ## 8 0332… 4 NA NA NA NA <NA> <NA> NA ## 9 0332… 5 NA NA NA NA <NA> <NA> NA ## 10 0332… 6 NA NA NA NA <NA> <NA> NA ## # … with 15,402 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, EARNTYPE <dbl>, ## # INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, HISPANIC <dbl>, ## # MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, WKSTATUS <dbl> ``` --- ## Joining Tips ```r fmli <- left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` * FIRST: conceptualize for yourself what you think you want the final dataset to look like! * Check initial dimensions and final dimensions. * Use variable names when joining even if they are the same. --- ## Data Wrangling in the `tidyverse` **Question**: Why do we need TWO different data wrangling packages in the `tidyverse`? -- **Question**: What does `dplyr` do? -- * Performs various data manipulation tasks, such as extracting and summarizing -- **Question**: So what is left for `tidyr` to do? * Reshape data into different formats (e.g. long and wide) --- ## Key Idea: Reshape Data to a Tidy Data Format > "Happy families are all alike; every unhappy family is unhappy in its own way." 
-- Leo Tolstoy -- > "Tidy datasets are all alike, but every messy dataset is messy in its own way." -- Hadley Wickham -- * By tidy, we don't mean neat. * Tidy data satisfy rules that make it easy to work with the data. --- ## Tidy Data Rules <img src="img/tidyRules.png" width="80%" /> * Each column is a single variable. * Each row is a unique observation. * Each value must have its own cell. --- ## Tidy Data ```r fmli ``` ``` ## # A tibble: 6,301 x 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 0332… 1 25985. 116920 1 2 16 "15" 63 ## 2 0332… 1 6581. 200 1 3 15 "15" 50 ## 3 0332… 1 20208. 117000 1 4 16 "13" 47 ## 4 0332… 1 18078. 0 1 2 15 "" 37 ## 5 0332… 2 20112. 2000 1 2 14 "" 51 ## 6 0332… 1 19907. 942 1 2 11 "" 63 ## 7 0332… 1 11705. 0 1 1 10 "" 77 ## 8 0332… 1 24431. 91000 1 2 13 "15" 37 ## 9 0332… 2 42859. 95000 2 5 12 "15" 51 ## 10 0332… 1 17481. 40037 1 2 12 "14" 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <dbl>, CREDITB <dbl>, ## # CREDITX <dbl>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl>, AGE <dbl>, SEX <dbl>, EARNER <dbl>, EARNTYPE <dbl>, ## # INC_HRSQ <dbl>, INCOMEY <dbl>, OCCUCODE <chr>, HISPANIC <dbl>, ## # MEMBRACE <dbl>, PAYSTUB <dbl>, SALARYX <dbl>, WKSTATUS <dbl> ``` --- ## Un-Tidy Data * Let's look at a dataset of CDC health stats for PA, NY, and TX ``` ## 
# A tibble: 5 x 6 ## YearStart PA NY TX DataValueType Question ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2014 87.7 66.8 96.4 Age-adjusted Rate Mortality from heart failure ## 2 2013 87.3 67.8 95.6 Age-adjusted Rate Mortality from heart failure ## 3 2012 83.2 67.7 93.5 Age-adjusted Rate Mortality from heart failure ## 4 2011 83.8 69.9 92.1 Age-adjusted Rate Mortality from heart failure ## 5 2010 83 69.4 96.5 Age-adjusted Rate Mortality from heart failure ``` * Which tidy rule was broken? --- ## Tidy Version of the Data ``` ## # A tibble: 15 x 5 ## YearStart LocationAbbr DataValue Question DataValueType ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 NY 66.8 Mortality from heart failu… Age-adjusted Ra… ## 2 2014 PA 87.7 Mortality from heart failu… Age-adjusted Ra… ## 3 2014 TX 96.4 Mortality from heart failu… Age-adjusted Ra… ## 4 2013 NY 67.8 Mortality from heart failu… Age-adjusted Ra… ## 5 2013 PA 87.3 Mortality from heart failu… Age-adjusted Ra… ## 6 2013 TX 95.6 Mortality from heart failu… Age-adjusted Ra… ## 7 2012 NY 67.7 Mortality from heart failu… Age-adjusted Ra… ## 8 2012 PA 83.2 Mortality from heart failu… Age-adjusted Ra… ## 9 2012 TX 93.5 Mortality from heart failu… Age-adjusted Ra… ## 10 2011 NY 69.9 Mortality from heart failu… Age-adjusted Ra… ## 11 2011 PA 83.8 Mortality from heart failu… Age-adjusted Ra… ## 12 2011 TX 92.1 Mortality from heart failu… Age-adjusted Ra… ## 13 2010 NY 69.4 Mortality from heart failu… Age-adjusted Ra… ## 14 2010 PA 83 Mortality from heart failu… Age-adjusted Ra… ## 15 2010 TX 96.5 Mortality from heart failu… Age-adjusted Ra… ``` --- ## Un-Tidy Data ``` ## # A tibble: 15 x 5 ## YearStart LocationAbbr DataValue DataValueType Question ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 PA 87.7 Age-adjusted Rate Mortality from heart fail… ## 2 2014 PA 126. Crude Rate Mortality from heart fail… ## 3 2014 PA 16091 Number Mortality from heart fail… ## 4 2013 PA 87.3 Age-adjusted Rate Mortality from heart fail… ## 5 2013 PA 124. 
Crude Rate Mortality from heart fail… ## 6 2013 PA 15894 Number Mortality from heart fail… ## 7 2012 PA 83.2 Age-adjusted Rate Mortality from heart fail… ## 8 2012 PA 117. Crude Rate Mortality from heart fail… ## 9 2012 PA 14893 Number Mortality from heart fail… ## 10 2011 PA 83.8 Age-adjusted Rate Mortality from heart fail… ## 11 2011 PA 116. Crude Rate Mortality from heart fail… ## 12 2011 PA 14827 Number Mortality from heart fail… ## 13 2010 PA 83 Age-adjusted Rate Mortality from heart fail… ## 14 2010 PA 114. Crude Rate Mortality from heart fail… ## 15 2010 PA 14411 Number Mortality from heart fail… ``` * Which tidy rule was broken? --- ## Tidy Data ``` ## # A tibble: 5 x 6 ## YearStart LocationAbbr Question `Age-adjusted Ra… `Crude Rate` Number ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> ## 1 2014 PA Mortality from h… 87.7 126. 16091 ## 2 2013 PA Mortality from h… 87.3 124. 15894 ## 3 2012 PA Mortality from h… 83.2 117. 14893 ## 4 2011 PA Mortality from h… 83.8 116. 14827 ## 5 2010 PA Mortality from h… 83 114. 14411 ``` --- ## `tidyr` * `tidyr` is an R package which will allow us to switch a dataset from a wide to a long format and vice versa. ```r library(tidyverse) ``` -- * **Problem**: When column names are not variables but values of a variable. Each row represents multiple observations. + **Solution**: Use `pivot_longer()`. -- * **Problem**: When an observation is scattered across multiple rows. + **Solution**: Use `pivot_wider()`. -- * **Problem**: Two observations are in one cell. + **Solution**: Use `separate()`. --- ## `pivot_longer()` * What columns do we want to gather? 
+ names_to = name of variable whose values are the column names + values_to = name of variable whose values are spread over the cells ```r CDC_wide ``` ``` ## # A tibble: 5 x 6 ## YearStart PA NY TX DataValueType Question ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2014 87.7 66.8 96.4 Age-adjusted Rate Mortality from heart failure ## 2 2013 87.3 67.8 95.6 Age-adjusted Rate Mortality from heart failure ## 3 2012 83.2 67.7 93.5 Age-adjusted Rate Mortality from heart failure ## 4 2011 83.8 69.9 92.1 Age-adjusted Rate Mortality from heart failure ## 5 2010 83 69.4 96.5 Age-adjusted Rate Mortality from heart failure ``` --- ## `pivot_longer()` ```r CDC_narrow_tidy <- pivot_longer(CDC_wide, cols = c(PA, NY, TX), names_to = "State", values_to = "MortalityRate") %>% select(YearStart, State, MortalityRate) CDC_narrow_tidy ``` ``` ## # A tibble: 15 x 3 ## YearStart State MortalityRate ## <dbl> <chr> <dbl> ## 1 2014 PA 87.7 ## 2 2014 NY 66.8 ## 3 2014 TX 96.4 ## 4 2013 PA 87.3 ## 5 2013 NY 67.8 ## 6 2013 TX 95.6 ## 7 2012 PA 83.2 ## 8 2012 NY 67.7 ## 9 2012 TX 93.5 ## 10 2011 PA 83.8 ## 11 2011 NY 69.9 ## 12 2011 TX 92.1 ## 13 2010 PA 83 ## 14 2010 NY 69.4 ## 15 2010 TX 96.5 ``` --- ## `pivot_wider()` * Which columns do we want to spread across multiple rows? + `names_from` = column that contains the variable names + `values_from` = column that contains the values of the variables ```r CDC_narrow_untidy ``` ``` ## # A tibble: 15 x 5 ## YearStart LocationAbbr DataValue DataValueType Question ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 PA 87.7 Age-adjusted Rate Mortality from heart fail… ## 2 2014 PA 126. Crude Rate Mortality from heart fail… ## 3 2014 PA 16091 Number Mortality from heart fail… ## 4 2013 PA 87.3 Age-adjusted Rate Mortality from heart fail… ## 5 2013 PA 124. Crude Rate Mortality from heart fail… ## 6 2013 PA 15894 Number Mortality from heart fail… ## 7 2012 PA 83.2 Age-adjusted Rate Mortality from heart fail… ## 8 2012 PA 117. 
Crude Rate Mortality from heart fail… ## 9 2012 PA 14893 Number Mortality from heart fail… ## 10 2011 PA 83.8 Age-adjusted Rate Mortality from heart fail… ## 11 2011 PA 116. Crude Rate Mortality from heart fail… ## 12 2011 PA 14827 Number Mortality from heart fail… ## 13 2010 PA 83 Age-adjusted Rate Mortality from heart fail… ## 14 2010 PA 114. Crude Rate Mortality from heart fail… ## 15 2010 PA 14411 Number Mortality from heart fail… ``` --- ## `pivot_wider()` ```r CDC_wide_tidy <- pivot_wider(CDC_narrow_untidy, names_from = DataValueType, values_from = DataValue) CDC_wide_tidy ``` ``` ## # A tibble: 5 x 6 ## YearStart LocationAbbr Question `Age-adjusted Ra… `Crude Rate` Number ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> ## 1 2014 PA Mortality from h… 87.7 126. 16091 ## 2 2013 PA Mortality from h… 87.3 124. 15894 ## 3 2012 PA Mortality from h… 83.2 117. 14893 ## 4 2011 PA Mortality from h… 83.8 116. 14827 ## 5 2010 PA Mortality from h… 83 114. 14411 ``` --- ## `separate()` * Separate one column into multiple columns. * What column might we want to separate into two? ```r read_csv("/home/courses/math241s21/Data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% glimpse() ``` ``` ## Rows: 74,811 ## Columns: 5 ## $ GeoLocation <chr> NA, "(32.84057112200048, -86.63186076199969)", "(64.8450… ## $ YearStart <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20… ## $ LocationAbbr <chr> "US", "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "D… ## $ DataValue <dbl> 16.9, 13.0, 18.2, 15.6, 15.0, 16.3, 19.0, 16.7, 17.0, 25… ## $ Question <chr> "Binge drinking prevalence among adults aged >= 18 years… ``` --- ## `separate()` * Separate one column into multiple columns. 
+ `col` = name of column to separate + `into` = character vector with names of new columns + `sep` = character to split on ```r CDC_sep <- read_csv("/home/courses/math241s21/Data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% separate(col = GeoLocation, into = c("lat", "long"), sep = ", ") CDC_sep ``` ``` ## # A tibble: 74,811 x 6 ## lat long YearStart LocationAbbr DataValue Question ## <chr> <chr> <dbl> <chr> <dbl> <chr> ## 1 <NA> <NA> 2016 US 16.9 Binge drinking preval… ## 2 (32.84057… -86.63186… 2016 AL 13 Binge drinking preval… ## 3 (64.84507… -147.7220… 2016 AK 18.2 Binge drinking preval… ## 4 (34.86597… -111.7638… 2016 AZ 15.6 Binge drinking preval… ## 5 (34.74865… -92.27449… 2016 AR 15 Binge drinking preval… ## 6 (37.63864… -120.9999… 2016 CA 16.3 Binge drinking preval… ## 7 (38.84384… -106.1336… 2016 CO 19 Binge drinking preval… ## 8 (41.56266… -72.64984… 2016 CT 16.7 Binge drinking preval… ## 9 (39.00883… -75.57774… 2016 DE 17 Binge drinking preval… ## 10 (38.907192 -77.03687… 2016 DC 25.6 Binge drinking preval… ## # … with 74,801 more rows ``` --- ## `separate()` ```r library(stringr) CDC_sep <- read_csv("/home/courses/math241s21/Data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% separate(col = GeoLocation, into = c("lat", "long"), sep = ", ") %>% mutate(lat = parse_number(lat), long = parse_number(long)) ``` --- ## Another Example: Live Coding * Also a chance to see some additional `dplyr` functions. 
```r library(readxl) url <- "https://www.bts.gov/sites/bts.dot.gov/files/table_01_33_102020.xlsx" destfile <- "table_01_33_102020.xlsx" curl::curl_download(url, destfile) Amtrak <- read_excel(destfile) Amtrak ``` ``` ## # A tibble: 20 x 35 ## `Table 1-33: A… ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 <NA> 1972 1975 1980 1985 1990 1991 1992 1993 1994 ## 2 Locomotives <NA> NA NA NA NA NA NA NA NA ## 3 Percent availab… U 87 83 93 84 86 83 84 85 ## 4 Average age (ye… 22.3 14.4 7.4 7 12 13 13 13.2 13.4 ## 5 Passenger and o… <NA> NA NA NA NA NA NA NA NA ## 6 Percent availab… U 82 77 90 90 92 90 89 88 ## 7 Average age (ye… 22 24.7 14.3 14.2 20 21 21.5 22.6 22.4 ## 8 KEY: U = data … <NA> NA NA NA NA NA NA NA NA ## 9 <NA> <NA> NA NA NA NA NA NA NA NA ## 10 a Year-end dail… <NA> NA NA NA NA NA NA NA NA ## 11 b Fiscal Year-e… <NA> NA NA NA NA NA NA NA NA ## 12 <NA> <NA> NA NA NA NA NA NA NA NA ## 13 NOTES <NA> NA NA NA NA NA NA NA NA ## 14 1972 was Amtrak… <NA> NA NA NA NA NA NA NA NA ## 15 Roadrailers are… <NA> NA NA NA NA NA NA NA NA ## 16 <NA> <NA> NA NA NA NA NA NA NA NA ## 17 SOURCES <NA> NA NA NA NA NA NA NA NA ## 18 1972-80: Amtrak… <NA> NA NA NA NA NA NA NA NA ## 19 1985-2000: Ibid… <NA> NA NA NA NA NA NA NA NA ## 20 2001-19: Amtrak… <NA> NA NA NA NA NA NA NA NA ## # … with 25 more variables: ...11 <dbl>, ...12 <dbl>, ...13 <dbl>, ...14 <dbl>, ## # ...15 <dbl>, ...16 <dbl>, ...17 <chr>, ...18 <chr>, ...19 <dbl>, ## # ...20 <dbl>, ...21 <dbl>, ...22 <dbl>, ...23 <dbl>, ...24 <dbl>, ## # ...25 <dbl>, ...26 <dbl>, ...27 <dbl>, ...28 <dbl>, ...29 <dbl>, ## # ...30 <dbl>, ...31 <dbl>, ...32 <chr>, ...33 <chr>, ...34 <chr>, ## # ...35 <chr> ```
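---

## One Possible Cleaning Sketch

* Notice that the `Amtrak` sheet stores the years in its *first data row*, not in the column names.
* Here is one way we *might* handle that pattern, shown on a tiny made-up tibble (the values and column names `col2`/`col3` are hypothetical stand-ins, not the real BTS numbers, which arrive as `...2`, `...3`, etc.):

```r
library(tidyverse)

# Miniature stand-in for the Amtrak sheet: years live in the first row
raw <- tibble(measure = c(NA, "Percent available", "Average age (years)"),
              col2 = c("1972", "87", "22.3"),
              col3 = c("1975", "83", "14.4"))

# Grab the year labels from row 1, promote them to column names,
# then pivot the year columns into a long format
years <- raw %>% slice(1) %>% select(-measure) %>% unlist()
cleaned <- raw %>%
  slice(-1) %>%
  set_names(c("measure", years)) %>%
  pivot_longer(-measure, names_to = "year", values_to = "value") %>%
  mutate(across(c(year, value), parse_number))
# cleaned now has one row per (measure, year) combination
```

* Same `pivot_longer()` idea as the CDC example, just with a row-promotion step first.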