class: center, middle, inverse

## R Objects and Functions

<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 6 | Spring 2021]

---

## Announcements/Reminders

* Lab 4 due Thursday.
* Mini-Project 1 due Thursday (Mar 5th) at 8:30am
    + Presentations during class.
    + Add slides [here](https://docs.google.com/presentation/d/1Vn_YwO9hqDhDELJ-TqLiHyoLFZ4vroA-FX1aLem3c9U/edit?usp=sharing).
    + Please add the title page (with the package name) ASAP as I will grab those names tomorrow for the voting forms.

---

## Looking Ahead...

* After you have finished Mini-Project 1, make sure to provide feedback [here](https://docs.google.com/forms/d/e/1FAIpQLSfaQarOzh3X_O9znuxjEV7koy8nangCDFRP0Ck35nR1W-fzQQ/viewform?usp=sf_link).
* Will receive Mini-Project 2 next week.

---

## Looking Behind...

What have we done so far?

* Data viz principles and construction with `ggplot2`
* Reproducible examples with `reprex`
* Data wrangling with `dplyr` and `tidyr`
* R (data) packages with `usethis`, `devtools`, and `roxygen2`
* Version control and project collaboration with `git` and GitHub
* Ingesting data with `readr`
* Web data with API wrappers, `httr` for talking to APIs directly, and `rvest` for scraping web data

---

<img src="img/function creator highest.001.jpeg" width="100%" style="display: block; margin: auto;" />

---

<img src="img/function creator highest.002.jpeg" width="100%" style="display: block; margin: auto;" />

---

<img src="img/function creator highest.003.jpeg" width="100%" style="display: block; margin: auto;" />

---

<img src="img/function creator highest.004.jpeg" width="100%" style="display: block; margin: auto;" />

---

<img src="img/function creator highest.005.jpeg" width="100%" style="display: block; margin: auto;" />

---

<img src="img/function creator highest.006.jpeg" width="100%" style="display: block; margin: auto;" />

---

<img src="img/function creator highest.007.jpeg" width="100%" style="display: block; margin: auto;" />

---

<img src="img/function creator highest.008.jpeg" width="100%" style="display: block; margin: auto;" />

---

## Goals for Today

* Strengthening our R programming skills
* Exploring R objects
* Learning how to create functions
    + Motivation
    + Basic components
    + Testing, testing, testing
    + Input decisions
    + Generalizing

---

## Let's Go Back to Square One

* We talked about `list()`s in R last time.

```r
groceries <- list()
groceries$new_seasons <- c("apples", "chocolate", "kale", "garlic")
groceries$safeway <- c("vinegar", "soap")
groceries$salt_n_straw <- c("almond_brittle", "double_fold_vanilla",
                            "tanzanian_tin_roof")
groceries$budget <- data.frame(stores = c("new_seasons", "safeway",
                                          "salt_n_straw"),
                               fund = c(100, 25, 200))
```

---

## Let's Go Back to Square One

* Square one: `vector()`s
    + `c()` is a simple way to create a vector of length greater than 1.

```r
x <- 5
is.vector(x)
```

```
## [1] TRUE
```

```r
y <- c(5, 1, 7)
is.vector(y)
```

```
## [1] TRUE
```

---

## Flavors of Vectors

```r
str(x)
```

```
##  num 5
```

```r
str(y)
```

```
##  num [1:3] 5 1 7
```

```r
z <- c("cat", "dog", "mouse", "snail")
str(z)
```

```
##  chr [1:4] "cat" "dog" "mouse" "snail"
```

```r
a <- c(TRUE, FALSE, FALSE)
str(a)
```

```
##  logi [1:3] TRUE FALSE FALSE
```

---

## Flavors of Vectors

```r
b <- c(x, z)
str(b)
```

```
##  chr [1:5] "5" "cat" "dog" "mouse" "snail"
```

```r
z <- as.factor(z)
str(z)
```

```
##  Factor w/ 4 levels "cat","dog","mouse",..: 1 2 3 4
```

```r
levels(z)
```

```
## [1] "cat"   "dog"   "mouse" "snail"
```

---

## Flavors of Vectors

**Logical**: TRUEs and FALSEs

**Numeric**: numbers, integers, double-precision floating point numbers

**Character**: strings (containing 1 or more characters)

**Factors**: strings with order

---

## Vectorized

* R is built to work with vectors.
* Many operations are **vectorized**: they happen component-wise when given a vector as input.

```r
y + 4
```

```
## [1]  9  5 11
```

```r
y * 2
```

```
## [1] 10  2 14
```

```r
rnorm(n = 3, mean = c(-5, 0, 5))
```

```
## [1] -5.31  0.79  5.54
```

---

* But we need to be careful!

```r
dat <- data.frame(x = c(1, 2, 2, 4, 1, 3),
                  y = c(8, 7, 6, 5, 8, 6))
dat
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 2 6
## 4 4 5
## 5 1 8
## 6 3 6
```

* Want the rows where `x` equals 1 or 2
    + What happened?

```r
library(tidyverse)
dat %>%
  filter(x == c(1, 2))
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 1 8
```

---

## Recycling

* R recycles vectors if they are not the necessary length.
    + Notice we got NO error but that wasn't what we wanted.

```r
library(tidyverse)
dat %>%
  filter(x == c(1, 2))
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 1 8
```

---

## Recycling

* R recycles vectors if they are not the necessary length.
    + Notice we got NO error but that wasn't what we wanted.

```r
library(tidyverse)
dat %>%
  filter(x %in% c(1, 2))
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 2 6
## 4 1 8
```

---

## Indexing a Vector

```r
x[1]
```

```
## [1] 5
```

```r
y[-1]
```

```
## [1] 1 7
```

```r
b[c(2, 4)]
```

```
## [1] "cat"   "mouse"
```

```r
y[a]
```

```
## [1] 5
```

---

## Now Back to Lists

* `list()`s are just super vectors

```r
groceries <- list()
groceries$new_seasons <- c("apples", "chocolate", "kale", "garlic")
groceries$safeway <- c("vinegar", "soap")
groceries$salt_n_straw <- c("almond_brittle", "double_fold_vanilla",
                            "tanzanian_tin_roof")
groceries$budget <- data.frame(stores = c("new_seasons", "safeway",
                                          "salt_n_straw"),
                               fund = c(100, 25, 200))
```

---

## Now Back to Lists

```r
groceries
```

```
## $new_seasons
## [1] "apples"    "chocolate" "kale"      "garlic"
##
## $safeway
## [1] "vinegar" "soap"
##
## $salt_n_straw
## [1] "almond_brittle"      "double_fold_vanilla" "tanzanian_tin_roof"
##
## $budget
##         stores fund
## 1  new_seasons  100
## 2      safeway   25
## 3 salt_n_straw  200
```

---

## [Better Explanation of `[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x` is a list: the pepper
shaker containing packets of pepper.

<img src="img/list.png" width="30%" style="display: block; margin: auto;" />

---

## [Better Explanation of `[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[1]` is a pepper shaker containing the first packet of pepper.

<img src="img/innerlist.png" width="30%" style="display: block; margin: auto;" />

---

## [Better Explanation of `[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[2]` is what?

--

<img src="img/innerlist.png" width="30%" style="display: block; margin: auto;" />

---

## [Better Explanation of `[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[[1]]` is what?

--

<img src="img/innerobject.png" width="30%" style="display: block; margin: auto;" />

---

## [Better Explanation of `[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[[1]][[1]]` is what?

--

<img src="img/innerobject2.png" width="30%" style="display: block; margin: auto;" />

---

## Data Frames

Let's relate this to our favorite R object: `data.frame()`s!

* `data.frame()`s are `list()`s.
* Each variable of a `data.frame()` is a `vector()`.
* The `vector()`s all have the same length but not necessarily the same class.

---

## Data Frames

```r
dat
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 2 6
## 4 4 5
## 5 1 8
## 6 3 6
```

```r
dat$x
```

```
## [1] 1 2 2 4 1 3
```

---

## Data Frames

```r
str(dat[1])
```

```
## 'data.frame':	6 obs. of  1 variable:
##  $ x: num  1 2 2 4 1 3
```

```r
str(dat[[1]])
```

```
##  num [1:6] 1 2 2 4 1 3
```

```r
dat[1, 2]
```

```
## [1] 8
```

```r
dat[1, ]
```

```
##   x y
## 1 1 8
```

---

class: inverse, center, middle

## Back By Popular Demand:

### Tidy Stretches!

---

class: inverse, center, middle

## Practice

---

## Functions

* Write functions to **automate** common tasks instead of using copy-and-paste.

--

```r
library(pdxTrees)
pdxTrees <- get_pdxTrees_parks()
pdxTrees %>%
  rename(ht = Tree_Height, pol = Pollution_Removal_oz) %>%
  summarize(DBH_k = sd(DBH, na.rm = TRUE)/mean(DBH, na.rm = TRUE),
            ht_k = sd(ht, na.rm = TRUE)/mean(ht, na.rm = TRUE),
            pol_k = sd(pol, na.rm = TRUE)/mean(pol, na.rm = TRUE))
```

```
## # A tibble: 1 x 3
##   DBH_k  ht_k pol_k
##   <dbl> <dbl> <dbl>
## 1 0.650 0.613 0.884
```

**Question**: What would we like to automate here?

---

## Functions

**Question**: Why write a function when my code's fine without one?

--

- Well-named functions make code more readable.

--

- If your analysis changes, you have fewer places to update code.

--

- You eliminate copy-and-paste errors.

```r
library(pdxTrees)
pdxTrees %>%
  rename(ht = Tree_Height, pol = Pollution_Removal_oz) %>%
  summarize(DBH_k = sd(DBH, na.rm = TRUE)/mean(DBH, na.rm = TRUE),
            ht_k = sd(ht, na.rm = TRUE)/mean(ht, na.rm = TRUE),
            pol_k = sd(pol, na.rm = TRUE)/mean(ht, na.rm = TRUE))
```

---

## Function Writing

**First:** Determine what you want the function to do.

--

**Question**: For our example, what do we want?

```r
library(pdxTrees)
pdxTrees <- get_pdxTrees_parks()
pdxTrees %>%
  rename(ht = Tree_Height, pol = Pollution_Removal_oz) %>%
  summarize(DBH_k = sd(DBH, na.rm = TRUE)/mean(DBH, na.rm = TRUE),
            ht_k = sd(ht, na.rm = TRUE)/mean(ht, na.rm = TRUE),
            pol_k = sd(pol, na.rm = TRUE)/mean(pol, na.rm = TRUE))
```

```
## # A tibble: 1 x 3
##   DBH_k  ht_k pol_k
##   <dbl> <dbl> <dbl>
## 1 0.650 0.613 0.884
```

---

## Function Writing

**Next**: Get Something that Works

```r
sd(pdxTrees$DBH, na.rm = TRUE)/mean(pdxTrees$DBH, na.rm = TRUE)
```

```
## [1] 0.65
```

---

## [Minimum Viable Product](https://blog.fastmonkeys.com/2014/06/18/Minimum-viable-product-your-ultimate-guide-to-mvp-great-examples/)

<img src="img/mvp.png" width="60%" style="display: block; margin: auto;" />

---

## First Function

```r
coef_of_var <- function(x){
  sd(x)/mean(x)
}

# Test it
coef_of_var(x = pdxTrees$DBH)
```

```
## [1] 0.65
```

```r
summarize(pdxTrees, coef_of_var(x = DBH))
```

```
## # A tibble: 1 x 1
##   `coef_of_var(x = DBH)`
##                    <dbl>
## 1                  0.650
```

---

## Structure of Functions

* Name
* Inputs/Arguments
* Body

```r
coef_of_var <- function(x){
  sd(x)/mean(x)
}
```

---

## Test It More

* New variable

```r
coef_of_var(x = pdxTrees$Tree_Height)
```

```
## [1] NA
```

---

## Test It More

```r
coef_of_var <- function(x){
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)
}

coef_of_var(x = pdxTrees$Tree_Height)
```

```
## [1] 0.61
```

---

## Test It More

* New dataset

```r
library(mosaicData)
coef_of_var(x = HELPrct$age)
```

```
## [1] 0.22
```

---

## Test It More

```r
ggplot(pdxTrees, aes(DBH)) + geom_histogram()
ggplot(HELPrct, aes(age)) + geom_histogram()
```

<img src="slidesWk6Tu_files/figure-html/unnamed-chunk-38-1.png" width="288" /><img src="slidesWk6Tu_files/figure-html/unnamed-chunk-38-2.png" width="288" />

---

## Test It More

* Data where you know the answer

```r
coef_of_var(x = rnorm(n = 10000, mean = 1, sd = 4))
```

```
## [1] 3.9
```

---

## Test It More

* Weird data

```r
DBH_new <- c(pdxTrees$DBH, Inf)
coef_of_var(DBH_new)
```

```
## [1] NaN
```

```r
coef_of_var(pdxTrees)
```

```
## Error in is.data.frame(x): 'list' object cannot be coerced to type 'double'
```

```r
coef_of_var(pdxTrees$Condition)
```

```
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
```

```
## Warning in mean.default(x, na.rm = TRUE): argument is not numeric or logical:
## returning NA
```

```
## [1] NA
```

---

## Test It More

* Weird data

```r
coef_of_var(c(TRUE, FALSE, FALSE))
```

```
## [1] 1.7
```

* To learn about formal, automated testing, check out [unit testing](http://r-pkgs.had.co.nz/tests.html).

---

## Check the Inputs

From the Unix Philosophy:

> "Rule of Repair: When you must fail, fail noisily and as soon as possible."

---

## Check the Inputs: `stopifnot()`

```r
coef_of_var <- function(x){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)
}

coef_of_var(pdxTrees$Condition)
```

```
## Error in coef_of_var(pdxTrees$Condition): is.numeric(x) is not TRUE
```

```r
coef_of_var(c(TRUE, FALSE, FALSE))
```

```
## Error in coef_of_var(c(TRUE, FALSE, FALSE)): is.numeric(x) is not TRUE
```

* But we want a more useful error message.

---

## Check the Inputs: `if()` then `stop()`

```r
coef_of_var <- function(x) {
  if(!is.numeric(x)) {
    stop('Unfortunately this function only works for numeric input.\n',
         'You have provided an object of class:', class(x))
  }
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)
}

coef_of_var(pdxTrees$Condition)
```

```
## Error in coef_of_var(pdxTrees$Condition): Unfortunately this function only works for numeric input.
## You have provided an object of class:character
```

```r
coef_of_var(c(TRUE, FALSE, FALSE))
```

```
## Error in coef_of_var(c(TRUE, FALSE, FALSE)): Unfortunately this function only works for numeric input.
## You have provided an object of class:logical
```

--

* Look at the [Tidyverse Style Guide](https://style.tidyverse.org/error-messages.html) for more advice on writing **helpful** error messages.

---

## Generalizing our Function

```r
coef_of_var <- function(x){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)
}
```

* Allow user to trim the mean

```r
mean(pdxTrees$Tree_Height, trim = 0.0, na.rm = TRUE)
```

```
## [1] 66
```

```r
mean(pdxTrees$Tree_Height, trim = 0.1, na.rm = TRUE)
```

```
## [1] 63
```

---

## MVP: Functional Code

* Old code:

```r
sd(pdxTrees$DBH, na.rm = TRUE)/mean(pdxTrees$DBH, na.rm = TRUE)
```

```
## [1] 0.65
```

--

* MVP: Code with trimming

```r
trim <- 0.1
sd(pdxTrees$DBH, na.rm = TRUE)/mean(pdxTrees$DBH, na.rm = TRUE, trim = trim)
```

```
## [1] 0.68
```

---

## Write the Function Version

```r
coef_of_var <- function(x, trim){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE, trim = trim)
}
```

---

## Test it

```r
coef_of_var(x = pdxTrees$DBH, trim = 0.1)
```

```
## [1] 0.68
```

```r
coef_of_var(x = pdxTrees$DBH)
```

```
## Error in mean(x, na.rm = TRUE, trim = trim): argument "trim" is missing, with no default
```

--

* Why are we getting an error?

---

## Set Default Values

```r
coef_of_var <- function(x, trim = 0){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE, trim = trim)
}
```

--

* Test it:

```r
coef_of_var(x = pdxTrees$DBH, trim = 0.1)
```

```
## [1] 0.68
```

```r
coef_of_var(x = pdxTrees$DBH)
```

```
## [1] 0.65
```

--

* Why not set a default for `x`?

--

* Should also put in validity checks for `trim` and check those.

---

## Set Default Values

* What are our required arguments? Optional arguments?

```r
coef_of_var <- function(x, trim = 0){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE, trim = trim)
}
```

--

* Required: No default
* Optional: Include default

---

## Naming the Function

* Make the name short but explanatory

--

* Generally:
    + Functions = verbs
    + Arguments/inputs = nouns

--

* Break the verb rule if you would need a very broad verb:

```r
# Bad
compute_coef_of_var()

# Better
coef_of_var()
```

---

## Naming the Function

* Don't override existing functions and variables

```r
# Bad
T <- FALSE
c <- 2
mean <- function(x, trim){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE, trim = trim)
}
```

---

## Naming the Arguments

* Avoid meaningless names

```r
coef_of_var_bad <- function(thing1, thing2 = 0){
  stopifnot(is.numeric(thing1))
  sd(thing1, na.rm = TRUE)/mean(thing1, na.rm = TRUE, trim = thing2)
}
```

--

* Painting the walls is a simple way to decorate a room.
    + The same goes for function writing: arguments with clear, self-explanatory names quickly increase the readability of your function.

> "Programs must be written for people to read, and only incidentally for machines to execute." -- [Hal Abelson](https://en.wikipedia.org/wiki/Hal_Abelson)

---

## Naming Arguments

* If you are going to pass arguments to a built-in function, consider using the same names (unless they are awful).
    + EX: `trim`

```r
coef_of_var <- function(x, trim = 0){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE, trim = trim)
}
```

---

## Ordering the Arguments

* Generally:
    + Data
    + Details

```r
coef_of_var <- function(x, trim = 0){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE, trim = trim)
}
```

--

* Required arguments should come before optional arguments

---

## Readability

* If you have a complicated calculation, break it up into intermediate steps.

```r
coef_of_var <- function(x, trim = 0, na.rm = FALSE){
  stopifnot(is.numeric(x))
  std_dev <- sd(x, na.rm = na.rm)
  avg <- mean(x, na.rm = na.rm, trim = trim)
  std_dev/avg
}
```

---

## Function Output

* Default: Last line of code

```r
coef_of_var <- function(x, trim = 0){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE, trim = trim)
}
```

--

* Can also be explicit

```r
coef_of_var <- function(x, trim = 0){
  stopifnot(is.numeric(x))
  return(sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE, trim = trim))
}
```

---

## Function Output

* If you want to output more than the last line, use a list:

```r
coef_of_var <- function(x, trim = 0, na.rm = FALSE){
  stopifnot(is.numeric(x))
  std_dev <- sd(x, na.rm = na.rm)
  avg <- mean(x, na.rm = na.rm, trim = trim)
  stat <- std_dev/avg
  return(list(stat = stat, avg = avg, std_dev = std_dev))
}

coef_of_var(pdxTrees$DBH)
```

```
## $stat
## [1] 0.65
##
## $avg
## [1] 21
##
## $std_dev
## [1] 13
```

---

## Function Output

* If possible, provide a data.frame:

```r
coef_of_var <- function(x, trim = 0, na.rm = FALSE){
  stopifnot(is.numeric(x))
  std_dev <- sd(x, na.rm = na.rm)
  avg <- mean(x, na.rm = na.rm, trim = trim)
  stat <- std_dev/avg
  return(data.frame(stat = stat, avg = avg, std_dev = std_dev))
}

coef_of_var(pdxTrees$DBH)
```

```
##   stat avg std_dev
## 1 0.65  21      13
```

---

## Handling Missingness

* General R function behavior:
    + Many base R functions have an `na.rm = ` argument with default of `na.rm = FALSE`

```r
sd(pdxTrees$Tree_Height)
```

```
## [1] NA
```

```r
mean(pdxTrees$Tree_Height)
```

```
## [1] NA
```

* Why is it dangerous that we hard-wired our function with `na.rm = TRUE`?
    + What should we do instead?

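---

## Handling Missingness

* One way to see the danger, as a minimal sketch. The names `heights` and `cv_hardwired()` are made up for this illustration; they are not from `pdxTrees`.

```r
# Hypothetical data (an assumption for illustration): half the values are missing
heights <- c(10, 12, NA, NA, 14, NA)

# Hard-wired version: always drops NAs and never tells the user
cv_hardwired <- function(x){
  stopifnot(is.numeric(x))
  sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)
}

# Returns a single number that looks trustworthy
cv_result <- cv_hardwired(heights)

# The user has to check separately to learn how many values were dropped
n_missing <- sum(is.na(heights))
```

* Here the result rests on only 3 of the 6 values, and nothing in the output says so.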
---

## Handling Missingness

```r
coef_of_var <- function(x, trim = 0, na.rm = FALSE){
  stopifnot(is.numeric(x))
  sd(x, na.rm = na.rm)/mean(x, na.rm = na.rm, trim = trim)
}
```

--

```r
coef_of_var(pdxTrees$Tree_Height)
```

```
## [1] NA
```

```r
coef_of_var(pdxTrees$Tree_Height, na.rm = TRUE)
```

```
## [1] 0.61
```

---

## Inheriting the Arguments of a Function

```r
coef_of_var <- function(x, trim = 0, ...){
  stopifnot(is.numeric(x))
  sd(x, ...)/mean(x, trim = trim, ...)
}
```

```r
coef_of_var(pdxTrees$Tree_Height)
```

```
## [1] NA
```

```r
coef_of_var(pdxTrees$Tree_Height, na.rm = TRUE)
```

```
## [1] 0.61
```

```r
coef_of_var(pdxTrees$Tree_Height, na.rm = TRUE, trim = .2)
```

```
## [1] 0.66
```

---

class: inverse, center, middle

## More Practice Time!
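
---

## Inheriting the Arguments of a Function

* A small sketch of how `...` forwards arguments, using a made-up vector `vals` and a made-up argument name `foo` (both are assumptions for illustration, not course data). Because `...` is handed to both `sd()` and `mean()`, an argument that `sd()` does not accept makes the call fail.

```r
# Same definition as on the earlier "Inheriting the Arguments" slide
coef_of_var <- function(x, trim = 0, ...){
  stopifnot(is.numeric(x))
  sd(x, ...)/mean(x, trim = trim, ...)
}

# Hypothetical vector with one missing value
vals <- c(2, 4, 6, NA)

# Nothing is forwarded, so sd() and mean() keep their default na.rm = FALSE
cv_default <- coef_of_var(vals)

# na.rm = TRUE travels through ... to both sd() and mean()
cv_no_na <- coef_of_var(vals, na.rm = TRUE)

# `foo` is not an argument of sd(), so the forwarded call errors;
# tryCatch() captures that so the script keeps running
cv_bad_arg <- tryCatch(coef_of_var(vals, foo = 1),
                       error = function(e) "error")
```

* Design note: `...` keeps the signature short, but the user has to know which arguments the inner functions accept.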