Data Wrangling with dplyr

class: center, middle

## Data Wrangling with `dplyr` and Collaborating with GitHub
<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 3 | Spring 2021]

---

# Announcements

* `reprex` issues?

* Uploading to Gradescope.
    + For Lab 2, make sure to assign each problem to a page!

* For our GitHub Intro today:
    + Make sure you have your GitHub username and password handy for an activity we will do in class today. 
    + Make sure you accepted the email invite to the Reed-Math241 GitHub Organization.  
    + Make sure you accepted the email invite t a GitHub repository (we will define this today), named `labwork_yourGitHubUsername`.
    
---

## Looking Ahead...

* Lab 2 due Thursday morning on Gradescope.

* Will receive Mini-Project 1 on Thursday.

---

## Goals for Today

* R Chunk options

* Just a couple `ggplot2` loose ends

* Data wrangling with `dplyr`

* R Projects

* GitHub (for 1)
    + GitHub for collaboration on Th.

---

## R Chunk Options

* The r chunks allow for LOADS of useful [arguments](https://yihui.name/knitr/options/#chunk_options).
    + eval
    + echo
    + include
    + fig.width
    + fig.height
    + fig.cap

---

## Inheriting `aes` from `ggplot()`

```r
ggplot(data = Births2015, mapping = aes(x = date, y = births,
                                        color = wday)) + 
  geom_point() +
  geom_point(data = holidays,  color = "black", size = 3)
```

* What `aes`thetics did the second `geom_point()` inherit?  What didn't it inherit?

---

```r
glimpse(Births2015)
```

```
## Rows: 365
## Columns: 8
## $ date         <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01…
## $ births       <dbl> 8068, 10850, 8328, 7065, 11892, 12425, 12141, 12094, 118…
## $ wday         <ord> Thu, Fri, Sat, Sun, Mon, Tue, Wed, Thu, Fri, Sat, Sun, M…
## $ year         <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ month        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day_of_year  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ day_of_month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ day_of_week  <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2,…
```

```r
glimpse(holidays)
```

```
## Rows: 7
## Columns: 9
## $ date         <date> 2015-01-01, 2015-05-25, 2015-07-04, 2015-12-25, 2015-11…
## $ occasion     <chr> "New Year", "Memorial Day", "Independence Day", "Christm…
## $ births       <dbl> 8068, 7746, 7944, 6515, 7332, 8714, 8127
## $ wday         <ord> Thu, Mon, Sat, Fri, Thu, Thu, Mon
## $ year         <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015
## $ month        <dbl> 1, 5, 7, 12, 11, 12, 9
## $ day_of_year  <int> 1, 145, 185, 359, 330, 358, 250
## $ day_of_month <dbl> 1, 25, 4, 25, 26, 24, 7
## $ day_of_week  <dbl> 5, 2, 7, 6, 5, 5, 2
```

---

## Inheriting `aes` from `ggplot()`

* What `aes`thetics did the second `geom_point()` inherit?  What didn't it inherit?

```r
holidays <- rename(holidays, Dates = date)

ggplot(data = Births2015, mapping = aes(x = date, y = births, 
                                        color = wday)) + 
  geom_point() +
  geom_point(data = holidays,  color = "black", size = 3)
```

```
## Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = date. 
## Did you mistype the name of a data column or forget to add after_stat()?
```

---

## Inheriting `aes` from `ggplot()`

* What `aes`thetics did the second `geom_point()` inherit?  What didn't it inherit?

```r
ggplot(data = Births2015, mapping = aes(x = date, y = births, 
                                        color = wday)) + 
  geom_point() +
  geom_point(data = holidays,  color = "black", size = 3, 
             mapping = aes(x = Dates))
```

---

## Inheriting `aes` from `ggplot()`

* What `aes`thetics did the second `geom_point()` inherit?  What didn't it inherit?

```r
ggplot(data = Births2015, mapping = aes(x = date, y = births, 
                                        color = wday)) + 
  geom_point()  +
  geom_point(data = holidays,  color = "black", size = 3, 
             mapping = aes(x = Dates), inherit.aes = FALSE)
```

```
## Error: geom_point requires the following missing aesthetics: y
```

---

## Inheriting `aes` from `ggplot()`

* What `aes`thetics did the second `geom_point()` inherit?  What didn't it inherit?

```r
ggplot(data = Births2015, mapping = aes(x = date, y = births, 
                                        color = wday)) + 
  geom_point() +
  geom_point(data = holidays,  color = "black", size = 3, 
             mapping = aes(x = Dates, y = births),
             inherit.aes = FALSE)
```

---

## Inheriting `aes` from `ggplot()`

* Problem: non-matching `aes` arguments

```r
#Add a box around Thanksgiving to Christmas
holidays_season <- data.frame(start = as_date("2015-11-26"), 
                              end = as_date("2015-12-24"))

ggplot(data = Births2015, mapping = aes(x = date, y = births,
                                        color = wday)) +
  geom_rect(data = holidays_season, mapping = aes(xmin = start, 
                                    xmax = end, ymin = 6000, 
                                    ymax = 14000)) + 
  geom_point()
```

```
## Error in FUN(X[[i]], ...): object 'births' not found
```

---

## Inheriting `aes` from `ggplot()`

* Problem: non-matching aes options

```r
ggplot(data = Births2015, mapping = aes(x = date, y = births, 
                                        color = wday))  + 
  geom_rect(data = holidays_season, mapping = aes(xmin = start,
                                    xmax = end, ymin = 6000,
                                    ymax = 14000), 
            inherit.aes = FALSE) + 
  geom_point()
```

---

## Data Wrangling

* What is it?
    + Any processing you have to do to the data to summarize, visualize, model it.
    + EX: 
        + Recoding 999 as "NA"
        + Removing rows
        + Creating new variables
        + Recoding category variables
        + Fixing variable types
        + Reshaping data into a format that satisfies the tidy data principles
        
---

## [Data In the Wild](https://www.bts.gov/content/age-and-availability-amtrak-locomotive-and-car-fleets)

Unfortunately, many datasets on the internet are often in **Display Format**, not **Analysis Format**.

---

## Data Wrangling

Unfortunately, many datasets on the internet are often in **Display Format**, not **Analysis Format**.

```r
library(readxl)
url <- "https://www.bts.gov/sites/bts.dot.gov/files/table_01_33_102020.xlsx"
destfile <- "table_01_33_102020.xlsx"
curl::curl_download(url, destfile)
table_01_33_102020 <- read_excel(destfile)
table_01_33_102020
```

```
## # A tibble: 20 x 35
##    `Table 1-33:  A… ...2    ...3   ...4   ...5  ...6  ...7   ...8   ...9  ...10
##    <chr>            <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
##  1 <NA>             1972  1975   1980   1985    1990  1991 1992   1993   1994  
##  2 Locomotives      <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
##  3 Percent availab… U       87     83     93      84    86   83     84     85  
##  4 Average age (ye… 22.3    14.4    7.4    7      12    13   13     13.2   13.4
##  5 Passenger and o… <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
##  6 Percent availab… U       82     77     90      90    92   90     89     88  
##  7 Average age (ye… 22      24.7   14.3   14.2    20    21   21.5   22.6   22.4
##  8 KEY:  U = data … <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
##  9 <NA>             <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 10 a Year-end dail… <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 11 b Fiscal Year-e… <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 12 <NA>             <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 13 NOTES            <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 14 1972 was Amtrak… <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 15 Roadrailers are… <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 16 <NA>             <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 17 SOURCES          <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 18 1972-80: Amtrak… <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 19 1985-2000: Ibid… <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## 20 2001-19: Amtrak… <NA>    NA     NA     NA      NA    NA   NA     NA     NA  
## # … with 25 more variables: ...11 <dbl>, ...12 <dbl>, ...13 <dbl>, ...14 <dbl>,
## #   ...15 <dbl>, ...16 <dbl>, ...17 <chr>, ...18 <chr>, ...19 <dbl>,
## #   ...20 <dbl>, ...21 <dbl>, ...22 <dbl>, ...23 <dbl>, ...24 <dbl>,
## #   ...25 <dbl>, ...26 <dbl>, ...27 <dbl>, ...28 <dbl>, ...29 <dbl>,
## #   ...30 <dbl>, ...31 <dbl>, ...32 <chr>, ...33 <chr>, ...34 <chr>,
## #   ...35 <chr>
```

---

## Math 241 `dplyr` Experience

---

## dplyr for Data Wrangling

* Five common wrangling verbs:
    + `select()`
    + `mutate()`
    + `filter()`
    + `arrange()`
    + `summarize()`
    
* One action:
    + `group_by()`

---

## Data: BLS Consumer Expenditure Survey

Last quarter of the 2016 Bureau of Labor Statistics Consumer Expenditure Survey.

```r
library(dplyr)

ce <- read_csv("/home/courses/math241s21/Data/ce.csv")
glimpse(ce)
```

```
## Rows: 6,301
## Columns: 49
## $ FINLWT21 <dbl> 25985, 6581, 20208, 18078, 20112, 19907, 11705, 24431, 42859…
## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1…
## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ POPSIZE  <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, …
## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", …
## $ EDUCA2   <chr> "15", "15", "13", NA, NA, NA, NA, "15", "15", "14", "12", "1…
## $ AGE_REF  <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, …
## $ AGE2     <chr> "50", "47", "46", ".", ".", ".", ".", "36", "53", "67", "44"…
## $ SEX_REF  <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, …
## $ SEX2     <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1…
## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, …
## $ RACE2    <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1…
## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, …
## $ HISP2    <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2…
## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, …
## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, …
## $ REGION   <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, …
## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", …
## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ FOODCQ   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TRANSCQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EDUCACQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ STUDFINX <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", …
## $ IRAX     <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "…
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, …
## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, …
## $ VEHQ     <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, …
## $ ROOMSQ   <chr> "8", "5", "6", "4", "4", "4", "7", "5", "4", "9", "6", "10",…
## $ INC_HRS1 <chr> "40", "40", "40", "44", "40", ".", ".", "40", "40", ".", "40…
## $ INC_HRS2 <chr> "30", "40", "52", ".", ".", ".", ".", "40", "40", ".", "65",…
## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, …
## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, …
## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", NA, NA, "12", "04", NA, "01", …
## $ OCCUCOD2 <chr> "04", "02", "01", NA, NA, NA, NA, "02", "03", NA, "11", NA, …
## $ STATE    <chr> "41", "15", "48", "06", "06", "48", "06", "42", NA, "27", "2…
## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9,…
## $ TOTXEST  <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 14…
## $ CREDFINX <chr> "0", ".", "0", ".", "5", ".", ".", ".", ".", "0", ".", "0", …
## $ CREDITB  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ CREDITX  <chr> "4000", "5000", "2000", ".", "7000", "1800", ".", "6000", ".…
## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", …
## $ ST_HOUS  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
```

---

## select

* If I know that I only want to work with a certain set of variables, I can use `select()` to grab those specific variables:

```r
ce1 <- select(ce, BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC,
              FOODCQ, HIGH_EDU, CUTENURE, IRAX)
glimpse(ce1)
```

```
## Rows: 6,301
## Columns: 8
## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1…
## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ FOODCQ   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", …
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, …
## $ IRAX     <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "…
```

---

## mutate

* Add columns that are functions of the previous columns:

```r
ce2 <- mutate(ce1, SAVE = 1 - TOTEXPCQ*4/FINCBTAX, 
              FOODHOUS = SAVE + EHOUSNGC, 
              RETSAVE = SAVE*TOTEXPCQ)
glimpse(ce2)
```

```
## Rows: 6,301
## Columns: 11
## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1…
## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ FOODCQ   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", …
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, …
## $ IRAX     <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "…
## $ SAVE     <dbl> 1, 1, 1, NaN, 1, 1, NaN, 1, 1, 1, 1, 1, NaN, 1, 1, 1, 1, 1, …
## $ FOODHOUS <dbl> 1, 1, 1, NaN, 1, 1, NaN, 1, 1, 1, 1, 1, NaN, 1, 1, 1, 1, 1, …
## $ RETSAVE  <dbl> 0, 0, 0, NaN, 0, 0, NaN, 0, 0, 0, 0, 0, NaN, 0, 0, 0, 0, 0, …
```

---

## filter

* Keep rows/observations that match certain criteria

```r
ce3 <- filter(ce2, TOTEXPCQ > 0, CUTENURE == 1)
glimpse(ce3)
```

```
## Rows: 1,471
## Columns: 11
## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ FINCBTAX <dbl> 745136, 36000, 45000, 0, 85000, 44982, 0, 60000, 63616, 7500…
## $ TOTEXPCQ <dbl> 8693, 3733, 3627, 7671, 3561, 3306, 2911, 3145, 9340, 4271, …
## $ EHOUSNGC <dbl> 1345, 2222, 1349, 4171, 1996, 1729, 1314, 1315, 4354, 2496, …
## $ FOODCQ   <dbl> 1820, 1300, 1937, 1300, 408, 650, 455, 668, 585, 152, 737, 9…
## $ HIGH_EDU <chr> "16", "16", "15", "16", "12", "13", "15", "13", "12", "15", …
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ IRAX     <chr> "280000", ".", "3000", ".", "0", "0", "150000", "0", "410", …
## $ SAVE     <dbl> 0.95, 0.59, 0.68, -Inf, 0.83, 0.71, -Inf, 0.79, 0.41, 0.77, …
## $ FOODHOUS <dbl> 1346, 2222, 1350, -Inf, 1997, 1730, -Inf, 1315, 4354, 2496, …
## $ RETSAVE  <dbl> 8288, 2185, 2458, -Inf, 2964, 2334, -Inf, 2485, 3855, 3298, …
```

---

## arrange

* Sort the dataframe by specific columns

```r
ce4 <- arrange(ce3, desc(FINCBTAX))
ce4
```

```
## # A tibble: 1,471 x 11
##    BLS_URBN FINCBTAX TOTEXPCQ EHOUSNGC FOODCQ HIGH_EDU CUTENURE IRAX   SAVE
##       <dbl>    <dbl>    <dbl>    <dbl>  <dbl> <chr>       <dbl> <chr> <dbl>
##  1        1  1167106    9139.    3019.  1673. 15              1 .     0.969
##  2        1   779690   26688.   13434   5937. 16              1 50000 0.863
##  3        1   745136    8693.    1345   1820  16              1 2800… 0.953
##  4        1   700000   19892.    7869.  1777. 16              1 .     0.886
##  5        1   571000   13906.    5640    981  15              1 .     0.903
##  6        1   565000   21002.   10860.  2341. 16              1 5000… 0.851
##  7        1   561000    9225.    4263.  1192. 16              1 0     0.934
##  8        1   560158    3855.    1122.   722. 15              1 2080… 0.972
##  9        1   553859   41319.   12216.  5887. 16              1 .     0.702
## 10        1   527962    3158.    1395.   585  16              1 .     0.976
## # … with 1,461 more rows, and 2 more variables: FOODHOUS <dbl>, RETSAVE <dbl>
```

---

## summarize

* Collapse all observations into a single row with new variables
    + Can compute multiple measures

```r
ce5 <- summarize(ce4, meanFINCBTAX = mean(FINCBTAX),
                 medianFINCBTAX = median(FINCBTAX), 
                 size = n())
ce5
```

```
## # A tibble: 1 x 3
##   meanFINCBTAX medianFINCBTAX  size
##          <dbl>          <dbl> <int>
## 1       92034.          73500  1471
```

---

## summarize + group_by

* Collapse all observations by a specified grouping

```r
ce6 <- group_by(ce4, BLS_URBN)
glimpse(ce6)
```

```
## Rows: 1,471
## Columns: 11
## Groups: BLS_URBN [2]
## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ FINCBTAX <dbl> 1167106, 779690, 745136, 700000, 571000, 565000, 561000, 560…
## $ TOTEXPCQ <dbl> 9139, 26688, 8693, 19892, 13906, 21002, 9225, 3855, 41319, 3…
## $ EHOUSNGC <dbl> 3019, 13434, 1345, 7869, 5640, 10860, 4263, 1122, 12216, 139…
## $ FOODCQ   <dbl> 1673, 5937, 1820, 1777, 981, 2341, 1192, 722, 5887, 585, 216…
## $ HIGH_EDU <chr> "15", "16", "16", "16", "15", "16", "16", "15", "16", "16", …
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ IRAX     <chr> ".", "50000", "280000", ".", ".", "500000", "0", "208000", "…
## $ SAVE     <dbl> 0.97, 0.86, 0.95, 0.89, 0.90, 0.85, 0.93, 0.97, 0.70, 0.98, …
## $ FOODHOUS <dbl> 3020, 13435, 1346, 7870, 5641, 10861, 4264, 1123, 12217, 139…
## $ RETSAVE  <dbl> 8853, 23034, 8288, 17631, 12551, 17879, 8618, 3749, 28989, 3…
```

---

## summarize + group_by

* Collapse all observations by a specified grouping

```r
ce6 <- group_by(ce4, BLS_URBN)
ce7 <- summarize(ce6, meanFINCBTAX = mean(FINCBTAX),
                 medianFINCBTAX = median(FINCBTAX), 
                 size = n())
ce7
```

```
## # A tibble: 2 x 4
##   BLS_URBN meanFINCBTAX medianFINCBTAX  size
## *    <dbl>        <dbl>          <dbl> <int>
## 1        1       93148.          74479  1404
## 2        2       68690.          48000    67
```

---

## count

* `count()` is a short-cut of `group_by()` + `summarize(n())`

```r
ce8 <- group_by(ce4, BLS_URBN)
summarize(ce8, n())
```

```
## # A tibble: 2 x 2
##   BLS_URBN `n()`
## *    <dbl> <int>
## 1        1  1404
## 2        2    67
```

```r
count(ce4, BLS_URBN)
```

```
## # A tibble: 2 x 2
##   BLS_URBN     n
## *    <dbl> <int>
## 1        1  1404
## 2        2    67
```

---

## group_by + mutate

```r
ce9 <- count(ce, BLS_URBN, HISP_REF)
ce9
```

```
## # A tibble: 4 x 3
##   BLS_URBN HISP_REF     n
##      <dbl>    <dbl> <int>
## 1        1        1   842
## 2        1        2  5110
## 3        2        1     3
## 4        2        2   346
```

```r
ce10 <- group_by(ce9, BLS_URBN)
mutate(ce10, p = n/sum(n))
```

```
## # A tibble: 4 x 4
## # Groups:   BLS_URBN [2]
##   BLS_URBN HISP_REF     n       p
##      <dbl>    <dbl> <int>   <dbl>
## 1        1        1   842 0.141  
## 2        1        2  5110 0.859  
## 3        2        1     3 0.00860
## 4        2        2   346 0.991
```

---

## The pipe!

---

## The pipe!

* Usually we want to run *several* data wrangling operations.

* It is annoying to save them the dataset at each intermediate step.

* In comes the pipe from the library `magrittr` which is loaded when you load `dplyr`.

* What does the pipe do?

```r
ce1 <- select(ce, BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC,
              FOODCQ, HIGH_EDU, CUTENURE, IRAX)
```

is equivalent to

```r
ce1 <- ce %>%
  select(BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC, 
         FOODCQ, HIGH_EDU, CUTENURE, IRAX)
```

---

## Pipe in action

```r
ce1 <- ce %>%
  select(BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC, FOODCQ,
         HIGH_EDU, CUTENURE, IRAX) %>%
  mutate(SAVE = 1 - TOTEXPCQ*4/FINCBTAX, 
         FOODHOUS = FOODCQ  + EHOUSNGC) %>%
  filter(TOTEXPCQ > 0, CUTENURE == 1) %>%
  group_by(BLS_URBN) %>%
  summarize(meanFINCBTAX = mean(FINCBTAX), 
            medianFINCBTAX = median(FINCBTAX), 
            size = n()) %>%
  arrange(meanFINCBTAX) 
ce1
```

```
## # A tibble: 2 x 4
##   BLS_URBN meanFINCBTAX medianFINCBTAX  size
##      <dbl>        <dbl>          <dbl> <int>
## 1        2       68690.          48000    67
## 2        1       93148.          74479  1404
```

---

## Selecting a Set of Columns

* All columns that pertain to the reference person (i.e. the person in the household who was surveyed):

```r
select(ce, starts_with("REF"))
```

```
## # A tibble: 6,301 x 1
##    REF_RACE
##       <dbl>
##  1        1
##  2        4
##  3        2
##  4        1
##  5        1
##  6        1
##  7        1
##  8        1
##  9        1
## 10        1
## # … with 6,291 more rows
```

---

## Selecting a Set of Columns

* Unfortunately, `ends_with()` misses `REF_RACE`

```r
select(ce, contains("REF"))
```

```
## # A tibble: 6,301 x 5
##    EDUC_REF AGE_REF SEX_REF REF_RACE HISP_REF
##    <chr>      <dbl>   <dbl>    <dbl>    <dbl>
##  1 16            63       1        1        2
##  2 15            50       1        4        2
##  3 16            47       2        2        2
##  4 15            37       1        1        2
##  5 14            51       2        1        2
##  6 11            63       1        1        1
##  7 10            77       2        1        1
##  8 13            37       1        1        2
##  9 12            51       1        1        2
## 10 12            64       2        1        2
## # … with 6,291 more rows
```

---

## Selecting a Set of Columns

* Select columns whose names are given in a list

```r
int <- c("FINCBTAX", "BLS_URBN", "AGE")
select(ce, one_of(int))
```

```
## Warning: Unknown columns: `AGE`
```

```
## # A tibble: 6,301 x 2
##    FINCBTAX BLS_URBN
##       <dbl>    <dbl>
##  1   116920        1
##  2      200        1
##  3   117000        1
##  4        0        1
##  5     2000        1
##  6      942        1
##  7        0        1
##  8    91000        1
##  9    95000        2
## 10    40037        1
## # … with 6,291 more rows
```

---

## Logical Operators

```r
filter(ce, BLS_URBN == 1, HIGH_EDU %in% c("15", "16"),
       FINCBTAX >= 40000, 
       FAM_SIZE != 1)
```

```
## # A tibble: 1,452 x 49
##    FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF AGE2  SEX_REF
##       <dbl>    <dbl>    <dbl>   <dbl> <chr>    <chr>    <dbl> <chr>   <dbl>
##  1   25985.   116920        1       2 16       15          63 50          1
##  2   20208.   117000        1       4 16       13          47 46          2
##  3   24431.    91000        1       2 13       15          37 36          1
##  4   24925.   190000        1       1 13       12          53 55          2
##  5   30449.   261440        1       3 16       15          53 55          2
##  6   23266.    75000        1       1 15       12          45 33          1
##  7   23024.   135000        1       1 15       <NA>        32 .           2
##  8   24301.   169000        1       1 15       <NA>        39 .           2
##  9   21651.   155000        1       1 15       15          38 38          2
## 10   22551.    68420        1       2 16       16          57 87          2
## # … with 1,442 more rows, and 40 more variables: SEX2 <dbl>, REF_RACE <dbl>,
## #   RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>,
## #   REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, EHOUSNGC <dbl>,
## #   TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, HEALTHCQ <dbl>,
## #   ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>,
## #   CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>,
## #   INC_HRS2 <chr>, EARNCOMP <dbl>, NO_EARNR <dbl>, OCCUCOD1 <chr>,
## #   OCCUCOD2 <chr>, STATE <chr>, DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>,
## #   CREDITB <dbl>, CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>,
## #   INT_PHON <lgl>, INT_HOME <lgl>
```

---

##Removing NAs

```r
filter(ce, is.na(EDUCA2))
```

```
## # A tibble: 3,198 x 49
##    FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF AGE2  SEX_REF
##       <dbl>    <dbl>    <dbl>   <dbl> <chr>    <chr>    <dbl> <chr>   <dbl>
##  1   18078.        0        1       2 15       <NA>        37 .           1
##  2   20112.     2000        1       2 14       <NA>        51 .           2
##  3   19907.      942        1       2 11       <NA>        63 .           1
##  4   11705.        0        1       1 10       <NA>        77 .           2
##  5   23039.        0        1       2 12       <NA>        81 .           2
##  6   26750.     2242        1       3 14       <NA>        51 .           2
##  7   24100.     1462        1       4 15       <NA>        67 .           2
##  8   16369.    78062        1       3 13       <NA>        70 .           2
##  9   20420.   113000        1       4 16       <NA>        61 .           1
## 10   23485.    70000        1       2 13       <NA>        37 .           1
## # … with 3,188 more rows, and 40 more variables: SEX2 <dbl>, REF_RACE <dbl>,
## #   RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>,
## #   REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, EHOUSNGC <dbl>,
## #   TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, HEALTHCQ <dbl>,
## #   ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>,
## #   CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>,
## #   INC_HRS2 <chr>, EARNCOMP <dbl>, NO_EARNR <dbl>, OCCUCOD1 <chr>,
## #   OCCUCOD2 <chr>, STATE <chr>, DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>,
## #   CREDITB <dbl>, CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>,
## #   INT_PHON <lgl>, INT_HOME <lgl>
```

---

##Removing NAs

```r
filter(ce, !is.na(EDUCA2)) 
```

```
## # A tibble: 3,103 x 49
##    FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF AGE2  SEX_REF
##       <dbl>    <dbl>    <dbl>   <dbl> <chr>    <chr>    <dbl> <chr>   <dbl>
##  1   25985.   116920        1       2 16       15          63 50          1
##  2    6581.      200        1       3 15       15          50 47          1
##  3   20208.   117000        1       4 16       13          47 46          2
##  4   24431.    91000        1       2 13       15          37 36          1
##  5   42859.    95000        2       5 12       15          51 53          1
##  6   17481.    40037        1       2 12       14          64 67          2
##  7   26019.   109000        1       3 12       12          26 44          2
##  8   16087.     4104        1       2 12       12          59 62          2
##  9   23049.    80300        1       3 12       12          45 42          2
## 10   24925.   190000        1       1 13       12          53 55          2
## # … with 3,093 more rows, and 40 more variables: SEX2 <dbl>, REF_RACE <dbl>,
## #   RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>,
## #   REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, EHOUSNGC <dbl>,
## #   TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, HEALTHCQ <dbl>,
## #   ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>,
## #   CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>,
## #   INC_HRS2 <chr>, EARNCOMP <dbl>, NO_EARNR <dbl>, OCCUCOD1 <chr>,
## #   OCCUCOD2 <chr>, STATE <chr>, DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>,
## #   CREDITB <dbl>, CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>,
## #   INT_PHON <lgl>, INT_HOME <lgl>
```

---

##Removing NAs

```r
drop_na(ce, EDUCA2)
```

---

##Removing NAs

* Careful!

```r
ce %>%
  na.omit()
```

```
## # A tibble: 0 x 49
## # … with 49 variables: FINLWT21 <dbl>, FINCBTAX <dbl>, BLS_URBN <dbl>,
## #   POPSIZE <dbl>, EDUC_REF <chr>, EDUCA2 <chr>, AGE_REF <dbl>, AGE2 <chr>,
## #   SEX_REF <dbl>, SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>,
## #   HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>,
## #   HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>,
## #   TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>,
## #   TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>, CUTENURE <dbl>, FAM_SIZE <dbl>,
## #   VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>, INC_HRS2 <chr>, EARNCOMP <dbl>,
## #   NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>,
## #   DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>, CREDITB <dbl>,
## #   CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>,
## #   INT_HOME <lgl>
```

---

##Removing NAs

```r
select(ce, BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC, FOODCQ,
       HIGH_EDU, CUTENURE, IRAX) %>%
  na.omit()
```

```
## # A tibble: 6,301 x 8
##    BLS_URBN FINCBTAX TOTEXPCQ EHOUSNGC FOODCQ HIGH_EDU CUTENURE IRAX   
##       <dbl>    <dbl>    <dbl>    <dbl>  <dbl> <chr>       <dbl> <chr>  
##  1        1   116920        0        0      0 16              1 1000000
##  2        1      200        0        0      0 15              1 10000  
##  3        1   117000        0        0      0 16              1 0      
##  4        1        0        0        0      0 15              1 .      
##  5        1     2000        0        0      0 14              1 .      
##  6        1      942        0        0      0 11              2 0      
##  7        1        0        0        0      0 10              4 0      
##  8        1    91000        0        0      0 15              1 15000  
##  9        2    95000        0        0      0 15              1 .      
## 10        1    40037        0        0      0 14              2 477000 
## # … with 6,291 more rows
```

---

## Recoding variables

* The CE dataset is coded for SAS, which supposedly can't handle non-numeric vectors.
    + So all of the categorical variables are coded as numerical.

```r
ce <- ce %>%
  mutate(BLS_URBN = case_when(BLS_URBN == '1' ~ "Urban", 
                              BLS_URBN == '2' ~ "Rural"))
glimpse(ce)
```

```
## Rows: 6,301
## Columns: 49
## $ FINLWT21 <dbl> 25985, 6581, 20208, 18078, 20112, 19907, 11705, 24431, 42859…
## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1…
## $ BLS_URBN <chr> "Urban", "Urban", "Urban", "Urban", "Urban", "Urban", "Urban…
## $ POPSIZE  <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, …
## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", …
## $ EDUCA2   <chr> "15", "15", "13", NA, NA, NA, NA, "15", "15", "14", "12", "1…
## $ AGE_REF  <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, …
## $ AGE2     <chr> "50", "47", "46", ".", ".", ".", ".", "36", "53", "67", "44"…
## $ SEX_REF  <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, …
## $ SEX2     <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1…
## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, …
## $ RACE2    <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1…
## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, …
## $ HISP2    <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2…
## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, …
## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, …
## $ REGION   <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, …
## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", …
## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ FOODCQ   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TRANSCQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EDUCACQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ STUDFINX <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", …
## $ IRAX     <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "…
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, …
## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, …
## $ VEHQ     <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, …
## $ ROOMSQ   <chr> "8", "5", "6", "4", "4", "4", "7", "5", "4", "9", "6", "10",…
## $ INC_HRS1 <chr> "40", "40", "40", "44", "40", ".", ".", "40", "40", ".", "40…
## $ INC_HRS2 <chr> "30", "40", "52", ".", ".", ".", ".", "40", "40", ".", "65",…
## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, …
## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, …
## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", NA, NA, "12", "04", NA, "01", …
## $ OCCUCOD2 <chr> "04", "02", "01", NA, NA, NA, NA, "02", "03", NA, "11", NA, …
## $ STATE    <chr> "41", "15", "48", "06", "06", "48", "06", "42", NA, "27", "2…
## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9,…
## $ TOTXEST  <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 14…
## $ CREDFINX <chr> "0", ".", "0", ".", "5", ".", ".", ".", ".", "0", ".", "0", …
## $ CREDITB  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ CREDITX  <chr> "4000", "5000", "2000", ".", "7000", "1800", ".", "6000", ".…
## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", …
## $ ST_HOUS  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
```

---

## Multiple Groupings

```r
ce %>% 
  group_by(BLS_URBN, HIGH_EDU) %>%
  summarize(meanFINCBTAX = mean(FINCBTAX), 
            medianFINCBTAX = median(FINCBTAX), 
            size = n()) 
```

```
## # A tibble: 15 x 5
## # Groups:   BLS_URBN [2]
##    BLS_URBN HIGH_EDU meanFINCBTAX medianFINCBTAX  size
##    <chr>    <chr>           <dbl>          <dbl> <int>
##  1 Rural    10              9249.          5100     21
##  2 Rural    11             21092.         14528.    18
##  3 Rural    12             25965.         20662    121
##  4 Rural    13             39817.         31100     71
##  5 Rural    14             39511.         34019     33
##  6 Rural    15             64959.         52905     48
##  7 Rural    16             85111.         85862     37
##  8 Urban    00              6422.          2226      8
##  9 Urban    10             14798.         12284     89
## 10 Urban    11             26099.         17020    284
## 11 Urban    12             33993.         25340   1151
## 12 Urban    13             43986.         33082   1226
## 13 Urban    14             51841.         44000    681
## 14 Urban    15             84154.         62000   1480
## 15 Urban    16            114123.         90000   1033
```

---

## Naming Wrangled Data

* Should I name my new dataframe `ce` or `ce1`?
    + *My* answer:
        + Is your new dataset structurally different? If so, give it a new name.
        + Are you removing values you will need for a future analysis within the same document? If so, give it a new name.
        + Are you just adding to or cleaning the data?  If so, then write over the original.

---

## R Projects

* Where does your analysis live?
    + Working directory

```r
getwd()
```

```
## [1] "/home/mcconville/math241s21/slides"
```

* For a given project, your analysis should live in the folder where you store the files associated with the project.
    
---

## RStudio Projects

**RStudio Projects**: RStudio feature that helps you organize your work.

* We will create a Math 241 project shortly.

* For RStudio Projects, the working directory is the home directory of the project.

```r
getwd()
```

```
## [1] "/home/mcconville/math241s21/slides"
```

---

## RStudio Projects

* To access the project, you can
    + Go to the upper right and select "Open Project"
    + Click on the `___.Rproj`

* Notice that when we open the project, all the files and command history are still there.

---

## RStudio Projects

* To access files in this project, I use relative paths:

```r
include_graphics("img/hexStickers.png") 
```

---

## Projects and Workflow

* Create an RStudio project for each analysis project.
    + I will force you to have a project for the course and each project.

* Keep data files that you need for the project in a `data` folder.

* Keep your `Rmd`s and script files there.

* Save outputs there.

* Use relative paths whenever possible.

---

## Math 241 Git Experience

---

## `git`

What is *git*?

* Version control system for storing your work.
* It manages the evolution of a set of files in a structured way.
    + Think "Track Changes" in Microsoft Word

---

## Manual Version Control

* But I already do version control: 
    + draft.rmd, draft2.rmd, final.rmd, realFinal.rmd, REALLYRealFinal.rmd...
    + Or maybe more like [this](http://phdcomics.com/comics/archive.php?comicid=1531)

* Issues with this version of version control:
    + Hard to know how related these files are.
    + How do you extend this to working with others?

---
    
## GitHub

What is *GitHub*?

* Slick user interface
* Distribution mechanism for git repositories
* Think "Google Drive"
* **Place to share your work!**
    + Ex: R package

---

## Github Repo = RStudio Project

* A **repo**, short for repository, is the folder that contains all of the files for the project on [github.com](github.com).

* For each repo, you should create an RStudio Project (with version control).
    + We will all do this together in a moment.
 
* Under the **Reed-Math241** GitHub Organization you will have 4 repos:
    + `labwork_username`: Just you (and course helpers)
    + `pa1_grp#`: You, your group members (and course helpers)
    + `pa2_grp#`: You, your group members (and course helpers)    
    + `pa_final_grp#`: You, your group members (and course helpers)
    
---
    
## Workflow

![Jenny Bryan (2017)](img/push_pull.png)

---

## Workflow

* Do some work on your project in RStudio.

* **Pull** the most recent version of the project from GitHub to your account on the RStudio Server.

* **Commit** that work.
    + Committing takes a snapshot of all the files in the project.
    + Look over the **diff**: which shows what has changed since your last update.
    + Include a quick note, **commit message** to summarize the motivation for the changes.

* **Push** your commit to GitHub from your account on the RStudio Server.

---

## Collaboration: Git Style

* Git is a *decentralized* version control system.
    + Each collaborator has a complete version of the repo.
    + Everyone can work offline and simultaneously.
    + GitHub holds the master copy.
        + Pull regularly to receive and integrate changes.

* **Issues**: The primary method to communicate with your group members.
    + Make an issue if you have a question or comment or want to make a to do list for the project.
    + Remember that I am part of the repo... though I won't normally read the issues.

---

## Git Real

* git is not friendly and can be frustrating.
      + BUT, the version control and collaborative rewards are big!
      
* GitHub.com is a great place to develop an online presence.
    + For now, we will use private repos.

* If you end up with a mess of errors, then don't worry.  Come see me and we will make a new repo with your most recent copy of the project.  
    + It happens to [everyone](https://xkcd.com/1597/).

---

## Git Language

* **repo**:

* **commit**:

* **push**:

* **pull**:

* **issue**:

---

## Now:

* Get out your laptops and sign-in to both github.com and rstudio.reed.edu.

* Let's *git* our RStudio account on the server synced with our github account and then make a change to our personal repo!
    + Make sure you accepted the repo request sent to your email.

---

## Introduce Yourself to Git

* Run the following code to introduce yourself to git

```r
library(usethis)
use_git_config(user.name = "mcconvil", 
               user.email = "mcconville@reed.edu")
```

---

## Sync GitHub.com repo and an RStudio Project repo

**In your repo on github.com**:

* Click on the green clone or download button.
* Copy the given url for "Clone with HTTPS".

**On the RStudio Server**:

* In the upper left, go to File > New Project > Version Control.
* Select Git.
* Paste in the url.  It should automatically give it a name. Select where you want the project to live in your home directory. Then click okay.

---

## Ignoring Files

* There are several files that we want to **NOT** push to GitHub.

* These include:
    + `.gitignore`
    + `___.Rproj`
    + `.DS_Store`

* Add these files to the `.gitignore`.

---

## Test the waters: Let's go through the workflow.

* Pull. (Yes, there is nothing to pull yet but it is always good practice to start here.)
* Click on the readme.
* Add something to the readme.
* Click on the git tab.  Check the box next to the readme.md. Hit commit.
* Put in a commit message.  Look over the diff.  
* Push.

**Look for updates in the readme on github.com.**

---

## Cache credentials

So that we don't have to type in our username and password every time we want to push or pull from GitHub, run the following **in the Terminal** not **in the Console**:

`git config --global credential.helper 'cache --timeout=10000000'`