class: center, middle ## Data Wrangling with `dplyr` and Collaborating with GitHub <img src="img/hero_wall_pink.png" width="800px"/> ### Kelly McConville .large[Math 241 | Week 3 | Spring 2021] --- # Announcements * `reprex` issues? * Uploading to Gradescope. + For Lab 2, make sure to assign each problem to a page! * For our GitHub Intro today: + Make sure you have your GitHub username and password handy for an activity we will do in class today. + Make sure you accepted the email invite to the Reed-Math241 GitHub Organization. + Make sure you accepted the email invite t a GitHub repository (we will define this today), named `labwork_yourGitHubUsername`. --- ## Looking Ahead... * Lab 2 due Thursday morning on Gradescope. * Will receive Mini-Project 1 on Thursday. --- ## Goals for Today * R Chunk options * Just a couple `ggplot2` loose ends * Data wrangling with `dplyr` * R Projects * GitHub (for 1) + GitHub for collaboration on Th. --- ## R Chunk Options * The r chunks allow for LOADS of useful [arguments](https://yihui.name/knitr/options/#chunk_options). + eval + echo + include + fig.width + fig.height + fig.cap --- ## Inheriting `aes` from `ggplot()` ```r ggplot(data = Births2015, mapping = aes(x = date, y = births, color = wday)) + geom_point() + geom_point(data = holidays, color = "black", size = 3) ``` <img src="slidesWk3Tu_files/figure-html/unnamed-chunk-3-1.png" width="504" /> * What `aes`thetics did the second `geom_point()` inherit? What didn't it inherit? --- ```r glimpse(Births2015) ``` ``` ## Rows: 365 ## Columns: 8 ## $ date <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01… ## $ births <dbl> 8068, 10850, 8328, 7065, 11892, 12425, 12141, 12094, 118… ## $ wday <ord> Thu, Fri, Sat, Sun, Mon, Tue, Wed, Thu, Fri, Sat, Sun, M… ## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20… ## $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ day_of_year <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1… ## $ day_of_month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1… ## $ day_of_week <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2,… ``` ```r glimpse(holidays) ``` ``` ## Rows: 7 ## Columns: 9 ## $ date <date> 2015-01-01, 2015-05-25, 2015-07-04, 2015-12-25, 2015-11… ## $ occasion <chr> "New Year", "Memorial Day", "Independence Day", "Christm… ## $ births <dbl> 8068, 7746, 7944, 6515, 7332, 8714, 8127 ## $ wday <ord> Thu, Mon, Sat, Fri, Thu, Thu, Mon ## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015 ## $ month <dbl> 1, 5, 7, 12, 11, 12, 9 ## $ day_of_year <int> 1, 145, 185, 359, 330, 358, 250 ## $ day_of_month <dbl> 1, 25, 4, 25, 26, 24, 7 ## $ day_of_week <dbl> 5, 2, 7, 6, 5, 5, 2 ``` --- ## Inheriting `aes` from `ggplot()` * What `aes`thetics did the second `geom_point()` inherit? What didn't it inherit? ```r holidays <- rename(holidays, Dates = date) ggplot(data = Births2015, mapping = aes(x = date, y = births, color = wday)) + geom_point() + geom_point(data = holidays, color = "black", size = 3) ``` ``` ## Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = date. ## Did you mistype the name of a data column or forget to add after_stat()? ``` <img src="slidesWk3Tu_files/figure-html/unnamed-chunk-5-1.png" width="504" /> --- ## Inheriting `aes` from `ggplot()` * What `aes`thetics did the second `geom_point()` inherit? What didn't it inherit? ```r ggplot(data = Births2015, mapping = aes(x = date, y = births, color = wday)) + geom_point() + geom_point(data = holidays, color = "black", size = 3, mapping = aes(x = Dates)) ``` <img src="slidesWk3Tu_files/figure-html/unnamed-chunk-6-1.png" width="504" /> --- ## Inheriting `aes` from `ggplot()` * What `aes`thetics did the second `geom_point()` inherit? What didn't it inherit? ```r ggplot(data = Births2015, mapping = aes(x = date, y = births, color = wday)) + geom_point() + geom_point(data = holidays, color = "black", size = 3, mapping = aes(x = Dates), inherit.aes = FALSE) ``` ``` ## Error: geom_point requires the following missing aesthetics: y ``` <img src="slidesWk3Tu_files/figure-html/unnamed-chunk-7-1.png" width="504" /> --- ## Inheriting `aes` from `ggplot()` * What `aes`thetics did the second `geom_point()` inherit? What didn't it inherit? ```r ggplot(data = Births2015, mapping = aes(x = date, y = births, color = wday)) + geom_point() + geom_point(data = holidays, color = "black", size = 3, mapping = aes(x = Dates, y = births), inherit.aes = FALSE) ``` <img src="slidesWk3Tu_files/figure-html/unnamed-chunk-8-1.png" width="504" /> --- ## Inheriting `aes` from `ggplot()` * Problem: non-matching `aes` arguments ```r #Add a box around Thanksgiving to Christmas holidays_season <- data.frame(start = as_date("2015-11-26"), end = as_date("2015-12-24")) ggplot(data = Births2015, mapping = aes(x = date, y = births, color = wday)) + geom_rect(data = holidays_season, mapping = aes(xmin = start, xmax = end, ymin = 6000, ymax = 14000)) + geom_point() ``` ``` ## Error in FUN(X[[i]], ...): object 'births' not found ``` <img src="slidesWk3Tu_files/figure-html/unnamed-chunk-9-1.png" width="504" /> --- ## Inheriting `aes` from `ggplot()` * Problem: non-matching aes options ```r ggplot(data = Births2015, mapping = aes(x = date, y = births, color = wday)) + geom_rect(data = holidays_season, mapping = aes(xmin = start, xmax = end, ymin = 6000, ymax = 14000), inherit.aes = FALSE) + geom_point() ``` <img src="slidesWk3Tu_files/figure-html/unnamed-chunk-10-1.png" width="504" /> --- ## Data Wrangling * What is it? + Any processing you have to do to the data to summarize, visualize, model it. + EX: + Recoding 999 as "NA" + Removing rows + Creating new variables + Recoding category variables + Fixing variable types + Reshaping data into a format that satisfies the tidy data principles --- ## [Data In the Wild](https://www.bts.gov/content/age-and-availability-amtrak-locomotive-and-car-fleets) Unfortunately, many datasets on the internet are often in **Display Format**, not **Analysis Format**. <img src="img/amtrak.png" width="80%" /> --- ## Data Wrangling Unfortunately, many datasets on the internet are often in **Display Format**, not **Analysis Format**. ```r library(readxl) url <- "https://www.bts.gov/sites/bts.dot.gov/files/table_01_33_102020.xlsx" destfile <- "table_01_33_102020.xlsx" curl::curl_download(url, destfile) table_01_33_102020 <- read_excel(destfile) table_01_33_102020 ``` ``` ## # A tibble: 20 x 35 ## `Table 1-33: A… ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 <NA> 1972 1975 1980 1985 1990 1991 1992 1993 1994 ## 2 Locomotives <NA> NA NA NA NA NA NA NA NA ## 3 Percent availab… U 87 83 93 84 86 83 84 85 ## 4 Average age (ye… 22.3 14.4 7.4 7 12 13 13 13.2 13.4 ## 5 Passenger and o… <NA> NA NA NA NA NA NA NA NA ## 6 Percent availab… U 82 77 90 90 92 90 89 88 ## 7 Average age (ye… 22 24.7 14.3 14.2 20 21 21.5 22.6 22.4 ## 8 KEY: U = data … <NA> NA NA NA NA NA NA NA NA ## 9 <NA> <NA> NA NA NA NA NA NA NA NA ## 10 a Year-end dail… <NA> NA NA NA NA NA NA NA NA ## 11 b Fiscal Year-e… <NA> NA NA NA NA NA NA NA NA ## 12 <NA> <NA> NA NA NA NA NA NA NA NA ## 13 NOTES <NA> NA NA NA NA NA NA NA NA ## 14 1972 was Amtrak… <NA> NA NA NA NA NA NA NA NA ## 15 Roadrailers are… <NA> NA NA NA NA NA NA NA NA ## 16 <NA> <NA> NA NA NA NA NA NA NA NA ## 17 SOURCES <NA> NA NA NA NA NA NA NA NA ## 18 1972-80: Amtrak… <NA> NA NA NA NA NA NA NA NA ## 19 1985-2000: Ibid… <NA> NA NA NA NA NA NA NA NA ## 20 2001-19: Amtrak… <NA> NA NA NA NA NA NA NA NA ## # … with 25 more variables: ...11 <dbl>, ...12 <dbl>, ...13 <dbl>, ...14 <dbl>, ## # ...15 <dbl>, ...16 <dbl>, ...17 <chr>, ...18 <chr>, ...19 <dbl>, ## # ...20 <dbl>, ...21 <dbl>, ...22 <dbl>, ...23 <dbl>, ...24 <dbl>, ## # ...25 <dbl>, ...26 <dbl>, ...27 <dbl>, ...28 <dbl>, ...29 <dbl>, ## # ...30 <dbl>, ...31 <dbl>, ...32 <chr>, ...33 <chr>, ...34 <chr>, ## # ...35 <chr> ``` --- ## Math 241 `dplyr` Experience <img src="img/dplyr.png" width="90%" /> --- ## dplyr for Data Wrangling * Five common wrangling verbs: + `select()` + `mutate()` + `filter()` + `arrange()` + `summarize()` * One action: + `group_by()` --- ## Data: BLS Consumer Expenditure Survey Last quarter of the 2016 Bureau of Labor Statistics Consumer Expenditure Survey. ```r library(dplyr) ce <- read_csv("/home/courses/math241s21/Data/ce.csv") glimpse(ce) ``` ``` ## Rows: 6,301 ## Columns: 49 ## $ FINLWT21 <dbl> 25985, 6581, 20208, 18078, 20112, 19907, 11705, 24431, 42859… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1… ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, … ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", … ## $ EDUCA2 <chr> "15", "15", "13", NA, NA, NA, NA, "15", "15", "14", "12", "1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, … ## $ AGE2 <chr> "50", "47", "46", ".", ".", ".", ".", "36", "53", "67", "44"… ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, … ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, … ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, … ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, … ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, … ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, … ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", … ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ STUDFINX <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", … ## $ IRAX <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "… ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, … ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, … ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, … ## $ ROOMSQ <chr> "8", "5", "6", "4", "4", "4", "7", "5", "4", "9", "6", "10",… ## $ INC_HRS1 <chr> "40", "40", "40", "44", "40", ".", ".", "40", "40", ".", "40… ## $ INC_HRS2 <chr> "30", "40", "52", ".", ".", ".", ".", "40", "40", ".", "65",… ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, … ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, … ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", NA, NA, "12", "04", NA, "01", … ## $ OCCUCOD2 <chr> "04", "02", "01", NA, NA, NA, NA, "02", "03", NA, "11", NA, … ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", NA, "27", "2… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9,… ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 14… ## $ CREDFINX <chr> "0", ".", "0", ".", "5", ".", ".", ".", ".", "0", ".", "0", … ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ CREDITX <chr> "4000", "5000", "2000", ".", "7000", "1800", ".", "6000", ".… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", … ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ``` --- ## select * If I know that I only want to work with a certain set of variables, I can use `select()` to grab those specific variables: ```r ce1 <- select(ce, BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC, FOODCQ, HIGH_EDU, CUTENURE, IRAX) glimpse(ce1) ``` ``` ## Rows: 6,301 ## Columns: 8 ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1… ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", … ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, … ## $ IRAX <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "… ``` --- ## mutate * Add columns that are functions of the previous columns: ```r ce2 <- mutate(ce1, SAVE = 1 - TOTEXPCQ*4/FINCBTAX, FOODHOUS = SAVE + EHOUSNGC, RETSAVE = SAVE*TOTEXPCQ) glimpse(ce2) ``` ``` ## Rows: 6,301 ## Columns: 11 ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1… ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", … ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, … ## $ IRAX <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "… ## $ SAVE <dbl> 1, 1, 1, NaN, 1, 1, NaN, 1, 1, 1, 1, 1, NaN, 1, 1, 1, 1, 1, … ## $ FOODHOUS <dbl> 1, 1, 1, NaN, 1, 1, NaN, 1, 1, 1, 1, 1, NaN, 1, 1, 1, 1, 1, … ## $ RETSAVE <dbl> 0, 0, 0, NaN, 0, 0, NaN, 0, 0, 0, 0, 0, NaN, 0, 0, 0, 0, 0, … ``` --- ## filter * Keep rows/observations that match certain criteria ```r ce3 <- filter(ce2, TOTEXPCQ > 0, CUTENURE == 1) glimpse(ce3) ``` ``` ## Rows: 1,471 ## Columns: 11 ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ FINCBTAX <dbl> 745136, 36000, 45000, 0, 85000, 44982, 0, 60000, 63616, 7500… ## $ TOTEXPCQ <dbl> 8693, 3733, 3627, 7671, 3561, 3306, 2911, 3145, 9340, 4271, … ## $ EHOUSNGC <dbl> 1345, 2222, 1349, 4171, 1996, 1729, 1314, 1315, 4354, 2496, … ## $ FOODCQ <dbl> 1820, 1300, 1937, 1300, 408, 650, 455, 668, 585, 152, 737, 9… ## $ HIGH_EDU <chr> "16", "16", "15", "16", "12", "13", "15", "13", "12", "15", … ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ IRAX <chr> "280000", ".", "3000", ".", "0", "0", "150000", "0", "410", … ## $ SAVE <dbl> 0.95, 0.59, 0.68, -Inf, 0.83, 0.71, -Inf, 0.79, 0.41, 0.77, … ## $ FOODHOUS <dbl> 1346, 2222, 1350, -Inf, 1997, 1730, -Inf, 1315, 4354, 2496, … ## $ RETSAVE <dbl> 8288, 2185, 2458, -Inf, 2964, 2334, -Inf, 2485, 3855, 3298, … ``` --- ## arrange * Sort the dataframe by specific columns ```r ce4 <- arrange(ce3, desc(FINCBTAX)) ce4 ``` ``` ## # A tibble: 1,471 x 11 ## BLS_URBN FINCBTAX TOTEXPCQ EHOUSNGC FOODCQ HIGH_EDU CUTENURE IRAX SAVE ## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 1 1167106 9139. 3019. 1673. 15 1 . 0.969 ## 2 1 779690 26688. 13434 5937. 16 1 50000 0.863 ## 3 1 745136 8693. 1345 1820 16 1 2800… 0.953 ## 4 1 700000 19892. 7869. 1777. 16 1 . 0.886 ## 5 1 571000 13906. 5640 981 15 1 . 0.903 ## 6 1 565000 21002. 10860. 2341. 16 1 5000… 0.851 ## 7 1 561000 9225. 4263. 1192. 16 1 0 0.934 ## 8 1 560158 3855. 1122. 722. 15 1 2080… 0.972 ## 9 1 553859 41319. 12216. 5887. 16 1 . 0.702 ## 10 1 527962 3158. 1395. 585 16 1 . 0.976 ## # … with 1,461 more rows, and 2 more variables: FOODHOUS <dbl>, RETSAVE <dbl> ``` --- ## summarize * Collapse all observations into a single row with new variables + Can compute multiple measures ```r ce5 <- summarize(ce4, meanFINCBTAX = mean(FINCBTAX), medianFINCBTAX = median(FINCBTAX), size = n()) ce5 ``` ``` ## # A tibble: 1 x 3 ## meanFINCBTAX medianFINCBTAX size ## <dbl> <dbl> <int> ## 1 92034. 73500 1471 ``` --- ## summarize + group_by * Collapse all observations by a specified grouping ```r ce6 <- group_by(ce4, BLS_URBN) glimpse(ce6) ``` ``` ## Rows: 1,471 ## Columns: 11 ## Groups: BLS_URBN [2] ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ FINCBTAX <dbl> 1167106, 779690, 745136, 700000, 571000, 565000, 561000, 560… ## $ TOTEXPCQ <dbl> 9139, 26688, 8693, 19892, 13906, 21002, 9225, 3855, 41319, 3… ## $ EHOUSNGC <dbl> 3019, 13434, 1345, 7869, 5640, 10860, 4263, 1122, 12216, 139… ## $ FOODCQ <dbl> 1673, 5937, 1820, 1777, 981, 2341, 1192, 722, 5887, 585, 216… ## $ HIGH_EDU <chr> "15", "16", "16", "16", "15", "16", "16", "15", "16", "16", … ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ IRAX <chr> ".", "50000", "280000", ".", ".", "500000", "0", "208000", "… ## $ SAVE <dbl> 0.97, 0.86, 0.95, 0.89, 0.90, 0.85, 0.93, 0.97, 0.70, 0.98, … ## $ FOODHOUS <dbl> 3020, 13435, 1346, 7870, 5641, 10861, 4264, 1123, 12217, 139… ## $ RETSAVE <dbl> 8853, 23034, 8288, 17631, 12551, 17879, 8618, 3749, 28989, 3… ``` --- ## summarize + group_by * Collapse all observations by a specified grouping ```r ce6 <- group_by(ce4, BLS_URBN) ce7 <- summarize(ce6, meanFINCBTAX = mean(FINCBTAX), medianFINCBTAX = median(FINCBTAX), size = n()) ce7 ``` ``` ## # A tibble: 2 x 4 ## BLS_URBN meanFINCBTAX medianFINCBTAX size ## * <dbl> <dbl> <dbl> <int> ## 1 1 93148. 74479 1404 ## 2 2 68690. 48000 67 ``` --- ## count * `count()` is a short-cut of `group_by()` + `summarize(n())` ```r ce8 <- group_by(ce4, BLS_URBN) summarize(ce8, n()) ``` ``` ## # A tibble: 2 x 2 ## BLS_URBN `n()` ## * <dbl> <int> ## 1 1 1404 ## 2 2 67 ``` ```r count(ce4, BLS_URBN) ``` ``` ## # A tibble: 2 x 2 ## BLS_URBN n ## * <dbl> <int> ## 1 1 1404 ## 2 2 67 ``` --- ## group_by + mutate ```r ce9 <- count(ce, BLS_URBN, HISP_REF) ce9 ``` ``` ## # A tibble: 4 x 3 ## BLS_URBN HISP_REF n ## <dbl> <dbl> <int> ## 1 1 1 842 ## 2 1 2 5110 ## 3 2 1 3 ## 4 2 2 346 ``` ```r ce10 <- group_by(ce9, BLS_URBN) mutate(ce10, p = n/sum(n)) ``` ``` ## # A tibble: 4 x 4 ## # Groups: BLS_URBN [2] ## BLS_URBN HISP_REF n p ## <dbl> <dbl> <int> <dbl> ## 1 1 1 842 0.141 ## 2 1 2 5110 0.859 ## 3 2 1 3 0.00860 ## 4 2 2 346 0.991 ``` --- ## The pipe! <img src="img/pipe.png" width="60%" style="display: block; margin: auto;" /> --- ## The pipe! * Usually we want to run *several* data wrangling operations. * It is annoying to save them the dataset at each intermediate step. * In comes the pipe from the library `magrittr` which is loaded when you load `dplyr`. * What does the pipe do? ```r ce1 <- select(ce, BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC, FOODCQ, HIGH_EDU, CUTENURE, IRAX) ``` is equivalent to ```r ce1 <- ce %>% select(BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC, FOODCQ, HIGH_EDU, CUTENURE, IRAX) ``` --- ## Pipe in action ```r ce1 <- ce %>% select(BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC, FOODCQ, HIGH_EDU, CUTENURE, IRAX) %>% mutate(SAVE = 1 - TOTEXPCQ*4/FINCBTAX, FOODHOUS = FOODCQ + EHOUSNGC) %>% filter(TOTEXPCQ > 0, CUTENURE == 1) %>% group_by(BLS_URBN) %>% summarize(meanFINCBTAX = mean(FINCBTAX), medianFINCBTAX = median(FINCBTAX), size = n()) %>% arrange(meanFINCBTAX) ce1 ``` ``` ## # A tibble: 2 x 4 ## BLS_URBN meanFINCBTAX medianFINCBTAX size ## <dbl> <dbl> <dbl> <int> ## 1 2 68690. 48000 67 ## 2 1 93148. 74479 1404 ``` --- ## Selecting a Set of Columns * All columns that pertain to the reference person (i.e. the person in the household who was surveyed): ```r select(ce, starts_with("REF")) ``` ``` ## # A tibble: 6,301 x 1 ## REF_RACE ## <dbl> ## 1 1 ## 2 4 ## 3 2 ## 4 1 ## 5 1 ## 6 1 ## 7 1 ## 8 1 ## 9 1 ## 10 1 ## # … with 6,291 more rows ``` --- ## Selecting a Set of Columns * Unfortunately, `ends_with()` misses `REF_RACE` ```r select(ce, contains("REF")) ``` ``` ## # A tibble: 6,301 x 5 ## EDUC_REF AGE_REF SEX_REF REF_RACE HISP_REF ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 16 63 1 1 2 ## 2 15 50 1 4 2 ## 3 16 47 2 2 2 ## 4 15 37 1 1 2 ## 5 14 51 2 1 2 ## 6 11 63 1 1 1 ## 7 10 77 2 1 1 ## 8 13 37 1 1 2 ## 9 12 51 1 1 2 ## 10 12 64 2 1 2 ## # … with 6,291 more rows ``` --- ## Selecting a Set of Columns * Select columns whose names are given in a list ```r int <- c("FINCBTAX", "BLS_URBN", "AGE") select(ce, one_of(int)) ``` ``` ## Warning: Unknown columns: `AGE` ``` ``` ## # A tibble: 6,301 x 2 ## FINCBTAX BLS_URBN ## <dbl> <dbl> ## 1 116920 1 ## 2 200 1 ## 3 117000 1 ## 4 0 1 ## 5 2000 1 ## 6 942 1 ## 7 0 1 ## 8 91000 1 ## 9 95000 2 ## 10 40037 1 ## # … with 6,291 more rows ``` --- ## Logical Operators ```r filter(ce, BLS_URBN == 1, HIGH_EDU %in% c("15", "16"), FINCBTAX >= 40000, FAM_SIZE != 1) ``` ``` ## # A tibble: 1,452 x 49 ## FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF AGE2 SEX_REF ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl> ## 1 25985. 116920 1 2 16 15 63 50 1 ## 2 20208. 117000 1 4 16 13 47 46 2 ## 3 24431. 91000 1 2 13 15 37 36 1 ## 4 24925. 190000 1 1 13 12 53 55 2 ## 5 30449. 261440 1 3 16 15 53 55 2 ## 6 23266. 75000 1 1 15 12 45 33 1 ## 7 23024. 135000 1 1 15 <NA> 32 . 2 ## 8 24301. 169000 1 1 15 <NA> 39 . 2 ## 9 21651. 155000 1 1 15 15 38 38 2 ## 10 22551. 68420 1 2 16 16 57 87 2 ## # … with 1,442 more rows, and 40 more variables: SEX2 <dbl>, REF_RACE <dbl>, ## # RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>, ## # REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, EHOUSNGC <dbl>, ## # TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, HEALTHCQ <dbl>, ## # ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>, ## # CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>, ## # INC_HRS2 <chr>, EARNCOMP <dbl>, NO_EARNR <dbl>, OCCUCOD1 <chr>, ## # OCCUCOD2 <chr>, STATE <chr>, DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>, ## # CREDITB <dbl>, CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>, ## # INT_PHON <lgl>, INT_HOME <lgl> ``` --- ##Removing NAs ```r filter(ce, is.na(EDUCA2)) ``` ``` ## # A tibble: 3,198 x 49 ## FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF AGE2 SEX_REF ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl> ## 1 18078. 0 1 2 15 <NA> 37 . 1 ## 2 20112. 2000 1 2 14 <NA> 51 . 2 ## 3 19907. 942 1 2 11 <NA> 63 . 1 ## 4 11705. 0 1 1 10 <NA> 77 . 2 ## 5 23039. 0 1 2 12 <NA> 81 . 2 ## 6 26750. 2242 1 3 14 <NA> 51 . 2 ## 7 24100. 1462 1 4 15 <NA> 67 . 2 ## 8 16369. 78062 1 3 13 <NA> 70 . 2 ## 9 20420. 113000 1 4 16 <NA> 61 . 1 ## 10 23485. 70000 1 2 13 <NA> 37 . 1 ## # … with 3,188 more rows, and 40 more variables: SEX2 <dbl>, REF_RACE <dbl>, ## # RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>, ## # REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, EHOUSNGC <dbl>, ## # TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, HEALTHCQ <dbl>, ## # ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>, ## # CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>, ## # INC_HRS2 <chr>, EARNCOMP <dbl>, NO_EARNR <dbl>, OCCUCOD1 <chr>, ## # OCCUCOD2 <chr>, STATE <chr>, DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>, ## # CREDITB <dbl>, CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>, ## # INT_PHON <lgl>, INT_HOME <lgl> ``` --- ##Removing NAs ```r filter(ce, !is.na(EDUCA2)) ``` ``` ## # A tibble: 3,103 x 49 ## FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF AGE2 SEX_REF ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl> ## 1 25985. 116920 1 2 16 15 63 50 1 ## 2 6581. 200 1 3 15 15 50 47 1 ## 3 20208. 117000 1 4 16 13 47 46 2 ## 4 24431. 91000 1 2 13 15 37 36 1 ## 5 42859. 95000 2 5 12 15 51 53 1 ## 6 17481. 40037 1 2 12 14 64 67 2 ## 7 26019. 109000 1 3 12 12 26 44 2 ## 8 16087. 4104 1 2 12 12 59 62 2 ## 9 23049. 80300 1 3 12 12 45 42 2 ## 10 24925. 190000 1 1 13 12 53 55 2 ## # … with 3,093 more rows, and 40 more variables: SEX2 <dbl>, REF_RACE <dbl>, ## # RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>, ## # REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, EHOUSNGC <dbl>, ## # TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, HEALTHCQ <dbl>, ## # ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>, ## # CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>, ## # INC_HRS2 <chr>, EARNCOMP <dbl>, NO_EARNR <dbl>, OCCUCOD1 <chr>, ## # OCCUCOD2 <chr>, STATE <chr>, DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>, ## # CREDITB <dbl>, CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>, ## # INT_PHON <lgl>, INT_HOME <lgl> ``` --- ##Removing NAs ```r drop_na(ce, EDUCA2) ``` ``` ## # A tibble: 3,103 x 49 ## FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF AGE2 SEX_REF ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl> ## 1 25985. 116920 1 2 16 15 63 50 1 ## 2 6581. 200 1 3 15 15 50 47 1 ## 3 20208. 117000 1 4 16 13 47 46 2 ## 4 24431. 91000 1 2 13 15 37 36 1 ## 5 42859. 95000 2 5 12 15 51 53 1 ## 6 17481. 40037 1 2 12 14 64 67 2 ## 7 26019. 109000 1 3 12 12 26 44 2 ## 8 16087. 4104 1 2 12 12 59 62 2 ## 9 23049. 80300 1 3 12 12 45 42 2 ## 10 24925. 190000 1 1 13 12 53 55 2 ## # … with 3,093 more rows, and 40 more variables: SEX2 <dbl>, REF_RACE <dbl>, ## # RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>, ## # REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, EHOUSNGC <dbl>, ## # TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, HEALTHCQ <dbl>, ## # ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>, ## # CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>, ## # INC_HRS2 <chr>, EARNCOMP <dbl>, NO_EARNR <dbl>, OCCUCOD1 <chr>, ## # OCCUCOD2 <chr>, STATE <chr>, DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>, ## # CREDITB <dbl>, CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>, ## # INT_PHON <lgl>, INT_HOME <lgl> ``` --- ##Removing NAs * Careful! ```r ce %>% na.omit() ``` ``` ## # A tibble: 0 x 49 ## # … with 49 variables: FINLWT21 <dbl>, FINCBTAX <dbl>, BLS_URBN <dbl>, ## # POPSIZE <dbl>, EDUC_REF <chr>, EDUCA2 <chr>, AGE_REF <dbl>, AGE2 <chr>, ## # SEX_REF <dbl>, SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, ## # HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <chr>, IRAX <chr>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <chr>, INC_HRS1 <chr>, INC_HRS2 <chr>, EARNCOMP <dbl>, ## # NO_EARNR <dbl>, OCCUCOD1 <chr>, OCCUCOD2 <chr>, STATE <chr>, ## # DIVISION <dbl>, TOTXEST <dbl>, CREDFINX <chr>, CREDITB <dbl>, ## # CREDITX <chr>, BUILDING <chr>, ST_HOUS <dbl>, INT_PHON <lgl>, ## # INT_HOME <lgl> ``` --- ##Removing NAs ```r select(ce, BLS_URBN, FINCBTAX, TOTEXPCQ, EHOUSNGC, FOODCQ, HIGH_EDU, CUTENURE, IRAX) %>% na.omit() ``` ``` ## # A tibble: 6,301 x 8 ## BLS_URBN FINCBTAX TOTEXPCQ EHOUSNGC FOODCQ HIGH_EDU CUTENURE IRAX ## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> ## 1 1 116920 0 0 0 16 1 1000000 ## 2 1 200 0 0 0 15 1 10000 ## 3 1 117000 0 0 0 16 1 0 ## 4 1 0 0 0 0 15 1 . ## 5 1 2000 0 0 0 14 1 . ## 6 1 942 0 0 0 11 2 0 ## 7 1 0 0 0 0 10 4 0 ## 8 1 91000 0 0 0 15 1 15000 ## 9 2 95000 0 0 0 15 1 . ## 10 1 40037 0 0 0 14 2 477000 ## # … with 6,291 more rows ``` --- ## Recoding variables * The CE dataset is coded for SAS, which supposedly can't handle non-numeric vectors. + So all of the categorical variables are coded as numerical. ```r ce <- ce %>% mutate(BLS_URBN = case_when(BLS_URBN == '1' ~ "Urban", BLS_URBN == '2' ~ "Rural")) glimpse(ce) ``` ``` ## Rows: 6,301 ## Columns: 49 ## $ FINLWT21 <dbl> 25985, 6581, 20208, 18078, 20112, 19907, 11705, 24431, 42859… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 1… ## $ BLS_URBN <chr> "Urban", "Urban", "Urban", "Urban", "Urban", "Urban", "Urban… ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, … ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", … ## $ EDUCA2 <chr> "15", "15", "13", NA, NA, NA, NA, "15", "15", "14", "12", "1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, … ## $ AGE2 <chr> "50", "47", "46", ".", ".", ".", ".", "36", "53", "67", "44"… ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, … ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, … ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, … ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, … ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, … ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, … ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", … ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ STUDFINX <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", … ## $ IRAX <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "… ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, … ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, … ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, … ## $ ROOMSQ <chr> "8", "5", "6", "4", "4", "4", "7", "5", "4", "9", "6", "10",… ## $ INC_HRS1 <chr> "40", "40", "40", "44", "40", ".", ".", "40", "40", ".", "40… ## $ INC_HRS2 <chr> "30", "40", "52", ".", ".", ".", ".", "40", "40", ".", "65",… ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, … ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, … ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", NA, NA, "12", "04", NA, "01", … ## $ OCCUCOD2 <chr> "04", "02", "01", NA, NA, NA, NA, "02", "03", NA, "11", NA, … ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", NA, "27", "2… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9,… ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 14… ## $ CREDFINX <chr> "0", ".", "0", ".", "5", ".", ".", ".", ".", "0", ".", "0", … ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ CREDITX <chr> "4000", "5000", "2000", ".", "7000", "1800", ".", "6000", ".… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", … ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ``` --- ## Multiple Groupings ```r ce %>% group_by(BLS_URBN, HIGH_EDU) %>% summarize(meanFINCBTAX = mean(FINCBTAX), medianFINCBTAX = median(FINCBTAX), size = n()) ``` ``` ## # A tibble: 15 x 5 ## # Groups: BLS_URBN [2] ## BLS_URBN HIGH_EDU meanFINCBTAX medianFINCBTAX size ## <chr> <chr> <dbl> <dbl> <int> ## 1 Rural 10 9249. 5100 21 ## 2 Rural 11 21092. 14528. 18 ## 3 Rural 12 25965. 20662 121 ## 4 Rural 13 39817. 31100 71 ## 5 Rural 14 39511. 34019 33 ## 6 Rural 15 64959. 52905 48 ## 7 Rural 16 85111. 85862 37 ## 8 Urban 00 6422. 2226 8 ## 9 Urban 10 14798. 12284 89 ## 10 Urban 11 26099. 17020 284 ## 11 Urban 12 33993. 25340 1151 ## 12 Urban 13 43986. 33082 1226 ## 13 Urban 14 51841. 44000 681 ## 14 Urban 15 84154. 62000 1480 ## 15 Urban 16 114123. 90000 1033 ``` --- ## Naming Wrangled Data * Should I name my new dataframe `ce` or `ce1`? + *My* answer: + Is your new dataset structurally different? If so, give it a new name. + Are you removing values you will need for a future analysis within the same document? If so, give it a new name. + Are you just adding to or cleaning the data? If so, then write over the original. --- ## R Projects * Where does your analysis live? + Working directory ```r getwd() ``` ``` ## [1] "/home/mcconville/math241s21/slides" ``` * For a given project, your analysis should live in the folder where you store the files associated with the project. --- ## RStudio Projects **RStudio Projects**: RStudio feature that helps you organize your work. * We will create a Math 241 project shortly. * For RStudio Projects, the working directory is the home directory of the project. ```r getwd() ``` ``` ## [1] "/home/mcconville/math241s21/slides" ``` --- ## RStudio Projects * To access the project, you can + Go to the upper right and select "Open Project" + Click on the `___.Rproj` * Notice that when we open the project, all the files and command history are still there. --- ## RStudio Projects * To access files in this project, I use relative paths: ```r include_graphics("img/hexStickers.png") ``` <img src="img/hexStickers.png" width="40%" style="display: block; margin: auto;" /> --- ## Projects and Workflow * Create an RStudio project for each analysis project. + I will force you to have a project for the course and each project. -- * Keep data files that you need for the project in a `data` folder. -- * Keep your `Rmd`s and script files there. -- * Save outputs there. -- * Use relative paths whenever possible. --- ## Math 241 Git Experience <img src="img/git.png" width="1136" /> --- ## `git` What is *git*? -- * Version control system for storing your work. * It manages the evolution of a set of files in a structured way. + Think "Track Changes" in Microsoft Word --- ## Manual Version Control * But I already do version control: + draft.rmd, draft2.rmd, final.rmd, realFinal.rmd, REALLYRealFinal.rmd... + Or maybe more like [this](http://phdcomics.com/comics/archive.php?comicid=1531) -- * Issues with this version of version control: + Hard to know how related these files are. + How do you extend this to working with others? --- ## GitHub What is *GitHub*? -- * Slick user interface * Distribution mechanism for git repositories * Think "Google Drive" * **Place to share your work!** + Ex: R package <img src="https://avatars1.githubusercontent.com/u/583231?s=400&v=4" width="30%" style="display: block; margin: auto;" /> --- ## Github Repo = RStudio Project * A **repo**, short for repository, is the folder that contains all of the files for the project on [github.com](github.com). * For each repo, you should create an RStudio Project (with version control). + We will all do this together in a moment. * Under the **Reed-Math241** GitHub Organization you will have 4 repos: + `labwork_username`: Just you (and course helpers) + `pa1_grp#`: You, your group members (and course helpers) + `pa2_grp#`: You, your group members (and course helpers) + `pa_final_grp#`: You, your group members (and course helpers) --- ## Workflow ![Jenny Bryan (2017)](img/push_pull.png) --- ## Workflow * Do some work on your project in RStudio. * **Pull** the most recent version of the project from GitHub to your account on the RStudio Server. * **Commit** that work. + Committing takes a snapshot of all the files in the project. + Look over the **diff**: which shows what has changed since your last update. + Include a quick note, **commit message** to summarize the motivation for the changes. * **Push** your commit to GitHub from your account on the RStudio Server. --- ## Collaboration: Git Style * Git is a *decentralized* version control system. + Each collaborator has a complete version of the repo. + Everyone can work offline and simultaneously. + GitHub holds the master copy. + Pull regularly to receive and integrate changes. * **Issues**: The primary method to communicate with your group members. + Make an issue if you have a question or comment or want to make a to do list for the project. + Remember that I am part of the repo... though I won't normally read the issues. --- ## Git Real * git is not friendly and can be frustrating. + BUT, the version control and collaborative rewards are big! * GitHub.com is a great place to develop an online presence. + For now, we will use private repos. * If you end up with a mess of errors, then don't worry. Come see me and we will make a new repo with your most recent copy of the project. + It happens to [everyone](https://xkcd.com/1597/). --- ## Git Language * **repo**: * **commit**: * **push**: * **pull**: * **issue**: --- ## Now: * Get out your laptops and sign-in to both github.com and rstudio.reed.edu. * Let's *git* our RStudio account on the server synced with our github account and then make a change to our personal repo! + Make sure you accepted the repo request sent to your email. --- ## Introduce Yourself to Git * Run the following code to introduce yourself to git ```r library(usethis) use_git_config(user.name = "mcconvil", user.email = "mcconville@reed.edu") ``` --- ## Sync GitHub.com repo and an RStudio Project repo **In your repo on github.com**: * Click on the green clone or download button. * Copy the given url for "Clone with HTTPS". **On the RStudio Server**: * In the upper left, go to File > New Project > Version Control. * Select Git. * Paste in the url. It should automatically give it a name. Select where you want the project to live in your home directory. Then click okay. --- ## Ignoring Files * There are several files that we want to **NOT** push to GitHub. * These include: + `.gitignore` + `___.Rproj` + `.DS_Store` * Add these files to the `.gitignore`. --- ## Test the waters: Let's go through the workflow. * Pull. (Yes, there is nothing to pull yet but it is always good practice to start here.) * Click on the readme. * Add something to the readme. * Click on the git tab. Check the box next to the readme.md. Hit commit. * Put in a commit message. Look over the diff. * Push. **Look for updates in the readme on github.com.** --- ## Cache credentials So that we don't have to type in our username and password every time we want to push or pull from GitHub, run the following **in the Terminal** not **in the Console**: `git config --global credential.helper 'cache --timeout=10000000'`