### Dates/Times with `lubridate` and Factors with `forcats`

### Kelly McConville .large[Math 241 | Week 8 | Spring 2021]

---

## A Few Thoughts on R Programming

* For what you want to do, start with the minimal viable product.
* Think about your inputs and outputs.
    + Class?
    + Size?
    + Indexing?
* Sometimes boxed mac and cheese is better than homemade. Sometimes homemade is better.
* Reduce redundancies with functions and iteration.
* Good names can be as helpful as good comments.
* Consider how you are handling missingness.
* It's okay to start with smelly, working code.
    + And then refactor.

---

## Why do we need to talk about dates and times?

**Question:** When did the crashes happen in Portland in 2018?

```r
library(tidyverse)
crashes <- read_csv("/home/courses/math241s21/Data/pdx_crash_2018_CRASH.csv")

# Count crashes per value of the raw (character) date column
crashes %>%
  count(CRASH_DT) %>%
  ggplot(mapping = aes(x = CRASH_DT, y = n)) +
  geom_point()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-1-1.png" width="288" />

---

## Dates

```r
class(crashes$CRASH_DT)
```

```
## [1] "character"
```

--

What class should it be?

---

## Converting Strings to Dates

* Identify the order of year, month, day, hour, minute, second.
* Pick the `lubridate` function that replicates that order.

```r
sample(crashes$CRASH_DT, size = 10)
```

```
## [1] "09/07/18 00:00:00" "03/28/18 00:00:00" "08/19/18 00:00:00"
## [4] "02/25/18 00:00:00" "06/26/18 00:00:00" "11/02/18 00:00:00"
## [7] "09/24/18 00:00:00" "08/08/18 00:00:00" "11/17/18 00:00:00"
## [10] "08/28/18 00:00:00"
```

```r
library(lubridate)

# Month/day/year hour:minute:second, so use mdy_hms()
crashes <- crashes %>%
  mutate(CRASH_DT = mdy_hms(CRASH_DT),
         CRASH_D = date(CRASH_DT))

class(crashes$CRASH_DT)
```

```
## [1] "POSIXct" "POSIXt"
```

```r
class(crashes$CRASH_D)
```

```
## [1] "Date"
```

---

## Why do we need to talk about dates and times?

**Question:** When did the crashes happen in Portland in 2018?

```r
crashes %>%
  count(CRASH_D) %>%
  ggplot(mapping = aes(x = CRASH_D, y = n)) +
  geom_point()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-4-1.png" width="288" />

---

## What else makes dates and times unique?

--

* Minutes have 60 seconds. (Well, some have 61.)
* Not all years have 365 days.
* Daylight saving time caused us to lose an hour on Sunday, but not folks in Arizona.

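
A minimal sketch of how `lubridate` copes with these irregularities. The dates and the `America/Los_Angeles` and `America/Phoenix` time zones are picked for illustration and do not come from the crash data. Periods such as `days(1)` follow the clock, while durations such as `ddays(1)` count elapsed seconds.

```r
library(lubridate)

# Not all years have 365 days.
leap_year(2020)
leap_year(2021)

# Noon on the day before the 2021 spring-forward (illustrative dates).
pdx <- ymd_hms("2021-03-13 12:00:00", tz = "America/Los_Angeles")
phx <- ymd_hms("2021-03-13 12:00:00", tz = "America/Phoenix")

# Adding a period of one day lands on noon of the next calendar day.
pdx_next <- pdx + days(1)
phx_next <- phx + days(1)

# Elapsed hours between the two noons: Portland loses an hour, Phoenix does not.
pdx_elapsed <- as.numeric(difftime(pdx_next, pdx, units = "hours"))
phx_elapsed <- as.numeric(difftime(phx_next, phx, units = "hours"))
pdx_elapsed  # 23
phx_elapsed  # 24

# Adding a duration of one day always adds 86,400 seconds,
# so the Portland clock reads 13:00 rather than 12:00.
pdx_duration <- pdx + ddays(1)
hour(pdx_duration)  # 13
```

Which of the two you want depends on the question: "same time tomorrow" is a period, "24 hours from now" is a duration.
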
---

## Let's Look at Portland's Biketown Data

```r
biketown <- read_csv("/home/courses/math141f19/Data/biketown_2017_07_09.csv") %>%
  filter(Distance_Miles < 1000)

biketown_dt <- biketown %>%
  select(StartDate, StartTime, EndDate, EndTime, Distance_Miles, BikeID)

glimpse(biketown_dt)
```

```
## Rows: 134,838
## Columns: 6
## $ StartDate      <chr> "7/1/2017", "7/1/2017", "7/1/2017", "7/1/2017", "7/1/20…
## $ StartTime      <time> 00:00:00, 00:00:00, 00:00:00, 00:01:00, 00:03:00, 00:0…
## $ EndDate        <chr> "7/1/2017", "7/1/2017", "7/1/2017", "7/1/2017", "7/1/20…
## $ EndTime        <time> 00:06:00, 00:16:00, 00:02:00, 00:33:00, 00:06:00, 00:0…
## $ Distance_Miles <dbl> 0.55, 2.03, 0.17, 2.75, 0.40, 0.40, 5.08, 0.95, 2.39, 2…
## $ BikeID         <dbl> 7375, 6191, 6321, 6434, 6850, 6420, 6593, 6160, 7380, 6…
```

---

## Let's Look at Portland's Biketown Data

* Fix the class of the date columns.
* Create date-time columns.

```r
library(lubridate)

biketown_dt <- biketown_dt %>%
  mutate(StartDate = mdy(StartDate),
         EndDate = mdy(EndDate)) %>%
  mutate(StartDateTime = ymd_hms(paste(StartDate, StartTime, sep = " ")),
         EndDateTime = ymd_hms(paste(EndDate, EndTime, sep = " ")))

glimpse(biketown_dt)
```

```
## Rows: 134,838
## Columns: 8
## $ StartDate      <date> 2017-07-01, 2017-07-01, 2017-07-01, 2017-07-01, 2017-0…
## $ StartTime      <time> 00:00:00, 00:00:00, 00:00:00, 00:01:00, 00:03:00, 00:0…
## $ EndDate        <date> 2017-07-01, 2017-07-01, 2017-07-01, 2017-07-01, 2017-0…
## $ EndTime        <time> 00:06:00, 00:16:00, 00:02:00, 00:33:00, 00:06:00, 00:0…
## $ Distance_Miles <dbl> 0.55, 2.03, 0.17, 2.75, 0.40, 0.40, 5.08, 0.95, 2.39, 2…
## $ BikeID         <dbl> 7375, 6191, 6321, 6434, 6850, 6420, 6593, 6160, 7380, 6…
## $ StartDateTime  <dttm> 2017-07-01 00:00:00, 2017-07-01 00:00:00, 2017-07-01 0…
## $ EndDateTime    <dttm> 2017-07-01 00:06:00, 2017-07-01 00:16:00, 2017-07-01 0…
```

---

## Grabbing Components

```r
biketown_dt$StartDateTime[40008]
```

```
## [1] "2017-07-23 13:44:00 UTC"
```

```r
year(biketown_dt$StartDateTime[40008])
```

```
## [1] 2017
```

```r
month(biketown_dt$StartDateTime[40008], label = TRUE)
```

```
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
```

```r
day(biketown_dt$StartDateTime[40008])
```

```
## [1] 23
```

---

## Grabbing Components

```r
week(biketown_dt$StartDateTime[40008])
```

```
## [1] 30
```

```r
wday(biketown_dt$StartDateTime[40008], label = TRUE)
```

```
## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
```

```r
hour(biketown_dt$StartDateTime[40008])
```

```
## [1] 13
```

```r
minute(biketown_dt$StartDateTime[40008])
```

```
## [1] 44
```

---

## Grabbing Components

```r
ggplot(data = biketown_dt,
       mapping = aes(month(StartDateTime, label = TRUE))) +
  geom_bar()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-9-1.png" width="360" />

---

## Grabbing Components

```r
ggplot(data = biketown_dt,
       mapping = aes(wday(StartDateTime, label = TRUE))) +
  geom_bar()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-10-1.png" width="360" />

---

## Grabbing Components

```r
ggplot(data = biketown_dt, mapping = aes(hour(StartDateTime))) +
  geom_bar()

ggplot(data = biketown_dt, mapping = aes(hour(EndDateTime))) +
  geom_bar()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-11-1.png" width="288" /><img src="slidesWk8Th_files/figure-html/unnamed-chunk-11-2.png" width="288" />

---

## Grabbing Components

```r
biketown_dt %>%
  mutate(hour = hour(StartDateTime),
         month = month(StartDateTime, label = TRUE)) %>%
  group_by(hour, month) %>%
  summarise(mean_dist = mean(Distance_Miles, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = hour, y = mean_dist, color = month)) +
  geom_line(size = 2)
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-12-1.png" width="360" />

---

# And if you are in R and want to know the date/time:

```r
today()
```

```
## [1] "2021-03-18"
```

```r
now()
```

```
## [1] "2021-03-18 08:14:27 PDT"
```

---
class: inverse, middle, center

## Topic Shift!

---

## Motivation: Imposing Structure on Categorical Variables

```r
library(pdxTrees)
pdxTrees <- get_pdxTrees_parks()

five_most_common <- c("Douglas-Fir", "Norway Maple", "Western Redcedar",
                      "Northern Red Oak", "Pin Oak")

pdxCommon <- pdxTrees %>%
  filter(Common_Name %in% five_most_common)
```

---

## Motivation: Imposing Structure on Categorical Variables

```r
ggplot(data = pdxCommon, mapping = aes(x = Common_Name)) +
  geom_bar() +
  coord_flip()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-15-1.png" width="360" />

--

How might we want to restructure this graph?

---

```r
levels(pdxCommon$Common_Name)
```

```
## NULL
```

```r
class(pdxCommon$Common_Name)
```

```
## [1] "character"
```

```r
pdxCommon <- pdxCommon %>%
  mutate(Common_Name = factor(Common_Name))

levels(pdxCommon$Common_Name)
```

```
## [1] "Douglas-Fir"      "Northern Red Oak" "Norway Maple"     "Pin Oak"
## [5] "Western Redcedar"
```

```r
class(pdxCommon$Common_Name)
```

```
## [1] "factor"
```

* What is the order of the levels?

---

## What Are the Classes?

```r
pdxCommon$Common_Name %>%
  fct_unique() %>%
  unclass()
```

```
## [1] 1 2 3 4 5
## attr(,"levels")
## [1] "Douglas-Fir"      "Northern Red Oak" "Norway Maple"     "Pin Oak"
## [5] "Western Redcedar"
```

```r
unique(pdxCommon$Common_Name)
```

```
## [1] Douglas-Fir      Northern Red Oak Norway Maple     Pin Oak
## [5] Western Redcedar
## 5 Levels: Douglas-Fir Northern Red Oak Norway Maple ... Western Redcedar
```

---

# Simple Frequency

```r
pdxCommon$Common_Name %>%
  fct_count()
```

```
## # A tibble: 5 x 2
##   f                    n
##   <fct>            <int>
## 1 Douglas-Fir       6783
## 2 Northern Red Oak   736
## 3 Norway Maple      1502
## 4 Pin Oak            619
## 5 Western Redcedar   964
```

```r
count(pdxCommon, Common_Name) %>%
  arrange(desc(n))
```

```
## # A tibble: 5 x 2
##   Common_Name          n
##   <fct>            <int>
## 1 Douglas-Fir       6783
## 2 Norway Maple      1502
## 3 Western Redcedar   964
## 4 Northern Red Oak   736
## 5 Pin Oak            619
```

---

## Reorder the Levels

* Add the `levels` argument

```r
pdxCommon <- pdxCommon %>%
  mutate(Common_Name = factor(Common_Name, levels = five_most_common))

levels(pdxCommon$Common_Name)
```

```
## [1] "Douglas-Fir"      "Norway Maple"     "Western Redcedar" "Northern Red Oak"
## [5] "Pin Oak"
```

---

## Reorder the Levels

* Order levels by when they show up in the dataset

```r
pdxCommon <- pdxCommon %>%
  mutate(Common_Name = fct_inorder(Common_Name))

levels(pdxCommon$Common_Name)
```

```
## [1] "Douglas-Fir"      "Northern Red Oak" "Norway Maple"     "Pin Oak"
## [5] "Western Redcedar"
```

---

## Reorder the Levels

+ Note: This code didn't permanently change the order in `pdxCommon`.
+ Why?

```r
pdxCommon %>%
  mutate(Common_Name = fct_infreq(Common_Name)) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-21-1.png" width="360" />

---

## Reorder the Levels

```r
pdxCommon %>%
  mutate(Common_Name = fct_infreq(Common_Name),
         Common_Name = fct_rev(Common_Name)) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-22-1.png" width="360" />

---

## If You Love the Pipe...

```r
pdxCommon %>%
  mutate(Common_Name = fct_infreq(Common_Name) %>% fct_rev()) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-23-1.png" width="360" />

---

## Reorder the Levels

* Can also relevel manually after the fact

```r
pdxCommon %>%
  mutate(Common_Name = fct_relevel(Common_Name, five_most_common)) %>%
  ggplot(mapping = aes(x = Common_Name)) +
  geom_bar() +
  coord_flip()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-24-1.png" width="360" />

---

## Reorder the Levels

* Maybe I just want to bring one or two categories to the front

```r
pdxCommon %>%
  mutate(Common_Name = fct_relevel(Common_Name, "Norway Maple", "Pin Oak")) %>%
  ggplot(mapping = aes(x = Common_Name)) +
  geom_bar() +
  coord_flip()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-25-1.png" width="360" />

---

## What Have We Wrangled Here?

```r
# Mean DBH per species, plus bounds at mean +/- 2 standard errors
DBH_by_name <- pdxCommon %>%
  group_by(Common_Name) %>%
  summarize(mean_DBH = mean(DBH),
            lb_DBH = mean_DBH - 2 * sd(DBH) / sqrt(n()),
            ub_DBH = mean_DBH + 2 * sd(DBH) / sqrt(n()))

DBH_by_name
```

```
## # A tibble: 5 x 4
##   Common_Name      mean_DBH lb_DBH ub_DBH
##   <fct>               <dbl>  <dbl>  <dbl>
## 1 Douglas-Fir          29.6   29.3   29.8
## 2 Northern Red Oak     29.4   28.3   30.5
## 3 Norway Maple         20.3   19.9   20.8
## 4 Pin Oak              25.6   24.8   26.4
## 5 Western Redcedar     18.1   17.3   18.9
```

---

## Reordering by Another Variable

* How might we want to reorder `Common_Name`?

```r
ggplot(data = DBH_by_name,
       mapping = aes(y = mean_DBH, x = Common_Name)) +
  geom_point() +
  geom_errorbar(aes(ymin = lb_DBH, ymax = ub_DBH), width = 0.4)
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-27-1.png" width="360" />

---

## Reordering by Another Variable

```r
thing <- pdxCommon %>%
  mutate(Common_Name = fct_reorder(Common_Name, DBH))

levels(thing$Common_Name)
```

```
## [1] "Western Redcedar" "Norway Maple"     "Pin Oak"          "Northern Red Oak"
## [5] "Douglas-Fir"
```

```r
DBH_by_name %>%
  mutate(Common_Name = fct_reorder(Common_Name, -mean_DBH)) %>%
  ggplot(mapping = aes(y = mean_DBH, x = Common_Name)) +
  geom_point() +
  geom_errorbar(aes(ymin = lb_DBH, ymax = ub_DBH), width = 0.4)
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-28-1.png" width="360" />

---

## Reordering by Another Variable

```r
ggplot(data = pdxCommon,
       mapping = aes(x = DBH, y = Total_Annual_Services, color = Condition)) +
  geom_smooth()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-29-1.png" width="360" />

---

## Reordering by Another Variable

```r
pdxCommon %>%
  mutate(Condition = fct_reorder2(Condition, DBH, Total_Annual_Services)) %>%
  ggplot(mapping = aes(x = DBH, y = Total_Annual_Services, color = Condition)) +
  geom_smooth()
```

<img src="slidesWk8Th_files/figure-html/unnamed-chunk-30-1.png" width="360" />

---

## Recode

* How might we want to change these categories?

```r
levels(pdxCommon$Common_Name)
```

```
## [1] "Douglas-Fir"      "Northern Red Oak" "Norway Maple"     "Pin Oak"
## [5] "Western Redcedar"
```

---

## Recode

```r
# New name on the left, old name on the right
pdxCommon <- pdxCommon %>%
  mutate(Common_Name = fct_recode(Common_Name, "Douglas Fir" = "Douglas-Fir"))

count(pdxCommon, Common_Name)
```

```
## # A tibble: 5 x 2
##   Common_Name          n
##   <fct>            <int>
## 1 Douglas Fir       6783
## 2 Northern Red Oak   736
## 3 Norway Maple      1502
## 4 Pin Oak            619
## 5 Western Redcedar   964
```

---

## Collapsing Levels

```r
pdxCommon <- pdxCommon %>%
  mutate(Common_Name2 = fct_collapse(Common_Name,
                                     Oak = c("Northern Red Oak", "Pin Oak")))

count(pdxCommon, Common_Name2)
```

```
## # A tibble: 4 x 2
##   Common_Name2         n
##   <fct>            <int>
## 1 Douglas Fir       6783
## 2 Oak               1355
## 3 Norway Maple      1502
## 4 Western Redcedar   964
```

---

## Dropping Unused Levels

* Filtering rows does not drop factor levels: every species in `pdxTrees` is still a level, not just the five we kept.

```r
# Convert to a factor before filtering, then count the levels that remain
pdxCommon <- pdxTrees %>%
  mutate(Common_Name = factor(Common_Name)) %>%
  filter(Common_Name %in% five_most_common)

length(levels(pdxCommon$Common_Name))
```
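
To see the idea on something small, here is a sketch with a made-up factor (the `pets` vector is hypothetical and not part of `pdxTrees`). Subsetting a factor keeps all of its original levels. `fct_drop()` from `forcats`, or base R's `droplevels()`, removes the ones that no longer occur.

```r
library(forcats)

# Hypothetical toy factor, used only for illustration.
pets <- factor(c("cat", "dog", "mouse", "dog", "cat"))

# Remove the "mouse" observation; the "mouse" level is still there.
no_mice <- pets[pets != "mouse"]
levels(no_mice)          # "cat" "dog" "mouse"
length(levels(no_mice))  # 3

# fct_drop() removes levels that have no observations.
no_mice_dropped <- fct_drop(no_mice)
levels(no_mice_dropped)  # "cat" "dog"

# Base R's droplevels() leaves the same levels.
levels(droplevels(no_mice))
```

Dropping matters mostly for plots and tables, where unused levels can otherwise show up as empty bars or zero-count rows.
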