class: center, middle

### `distill` Blogs, Code Smells, and Refactoring Code

<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 7 | Spring 2021]

---

## Announcements/Reminders

* No labs due this week!
* Reflecting on last week's presentations
* Finally finished grading the hand-drawn data visualizations.

---

## Goals for Today

* Talk data blogs
* Code discussions today inspired by
    + Jenny Bryan's [Code Smells and Feels](https://www.youtube.com/watch?reload=9&v=7oyiPBjLAWY) talk
    + Martin Fowler's [Refactoring](https://martinfowler.com/books/refactoring.html)

> "Any fool can write code that a computer can understand. Good programmers write code that humans can understand." -- Martin Fowler

---

## Blog Posts

* Why write a data blog?

--

* [A blog post about why you should have a data science blog](http://varianceexplained.org/r/start-blog/)

---

## Blog Posts

**Requirements for your post:**

* Write a short (~ 1000 words +/- 200 words) blog post. The post should:
    + **Tell a story/seek to answer a question with data.**
    + **Explain the process of going from raw data to answer.**
    + Include some non-trivial data wrangling, graphs, and a table.
    + More details in `mini-project-2-template.Rmd`.

--

**Other components:**

+ Come up with a catchy title.
+ Try to engage and entertain the reader so that they will make it to the end.
+ Avoid jargon.
+ Address data origins.

---

## Potential Components

**The Hook**: Start with a hook, a way into the story. Examples:

+ Provocative statement or question
+ Unexpected phrase
+ Opening anecdote

--

**Angle**: The post is short so it is okay to have a narrow focus/approach to the story and to give it personality.

--

**Content**: The post should have some content but make sure to keep it succinct. Feel free to:

+ Explain by example
+ Reveal your thought process

--

**Wrap Up**: End by recapping important takeaways.

--

**Teaser/Resources**: Include links throughout and especially to other resources for the invested reader.

---

## Blog Posts: Communicating Science/Stats

> "Short words are best", said Winston Churchill, "and old words when short are the best of all".

> "short words are best. Plain they may be, but that is their strength. They are clear, sharp and to the point. You can get your tongue round them. You can spell them. Eye, brain and mouth work as one to greet them as friends, not foes." -- The Economist (2004)

--

* Why?
* Trying to write with one-beat words requires you to slow down and really search for ways to say what you mean simply.
* But it isn't about *dumbing down* or lying.
* It is about getting rid of unnecessary details and confronting the [curse of knowledge](https://en.wikipedia.org/wiki/Curse_of_knowledge).

---

## Blog Posts: Communicating Science/Stats

<img src="img/cells.png" width="1088" />

---

## Blog Posts: Communicating Science/Stats

Attempt to use only (mostly) common words.

* Great Example
    + [Up Goer Five](https://xkcd.com/1133/)
* Check your work with [https://xkcd.com/simplewriter/](https://xkcd.com/simplewriter/).
    + [Corresponding book](https://xkcd.com/thing-explainer/)

---

## Blog Posts

> "Taking responsibility for the impression that the reader comes away with requires an understanding of how people integrate different types of information. And generally, examples are much more persuasive than statistics." -- Jonathan Stray (CJGD)

* Stray advocates for the use of a *well-chosen example*.
* What makes a *well-chosen example*?
* Are [these](https://www.reed.edu/what-is-a-reedie/index.php?type=default)?

---

## Blog Posts

* Examples:
    + [Our Simon Explains `stacks`](https://blog.simonpcouch.com/blog/gentle-intro-stacks/)
    + [Julia Silge's modeling of salary and gender in tech](https://juliasilge.com/blog/salary-gender/)
    + [David Robinson's analysis of Love Actually](http://varianceexplained.org/r/love-actually-network/)
    + [Alex Cookson's data exploration of climbing mountains](https://www.alexcookson.com/post/analyzing-himalayan-peaks-first-ascents/)
* Disclaimers:
    + Use the Instructions/Rubric, not these examples, to make sure your post meets all the project requirements.
    + Many of these posts are much LONGER and more INVOLVED than your post should be.

---

## Let's orient ourselves to a `distill` website!

Your project repo is based on [this template](https://github.com/Reed-Math241/blog-template).

* `index.Rmd`: Homepage; the output HTML is populated by previews of the posts

--

* `about.Rmd`: Descriptive page

--

* `_site.yml`: Controls site set-up

--

* `docs/` folder: Stores knitted versions of the `Rmd`s

--

* `_posts/` folder: Stores blog post `Rmd`s

---

## Data Scientists

* Are data scientists *programmers*?
* Are data scientists *statisticians*?
* Are data scientists *computer scientists*?

--

* My thoughts:
    + Data scientists are a mix of these identities.
    + The mixture can vary wildly from one job to another.
    + Today we are focusing on **writing better code**.

---

## Code Smells and Refactoring

* Code Smells: Structures in code that suggest **refactoring** is needed.

--

* Refactor: Make code
    + Easier to understand
    + Easier to contribute to
    + But without changing the observable behavior

> [Wikipedia](https://en.wikipedia.org/wiki/Code_refactoring) defines code refactoring as "the process of restructuring existing computer code—changing the factoring—without changing its external behavior."

--

* Not talking about `factor()`s!

---

## R Coding Hard Rules

Do you have any hard rules?

--

* Don't use
    + `attach()`.
    + `setwd()`.
    + A workflow that requires `rm(list=ls())`.
* Don't write code directly in the console. (Almost) always write it in an R script or Rmd.
* Use consistent styling.

---

## Refactoring

Loads of R coding *soft rules*.

--

Today we are going to:

* Detect a code smell.
    + A signal that more elegant code is needed.

--

* Apply a particular refactoring.

--

* Focus on code **revising**.
    + Even for experienced users, their first draft of new code will have smells.
    + To refactor, we must first work to understand what the code does. This will help us see what assumptions we've made along the way.

---

* Notice any code smells?

```r
thing <- function(na.rm = FALSE, x = c(7, 1), y = c(3, 4)) {
  var1 <- mean(x, na.rm = TRUE)
  var2 <- mean(y, na.rm = TRUE)
  xx <- var(x, na.rm = TRUE)
  yy <- var(y, na.rm = TRUE)
  na <- length(x)
  nb <- length(y)
  df <- min(na, nb) - 1
  important_bit <- (var1 - var2 - 0)/sqrt(xx/na + yy/nb)
  pt(q = abs(important_bit), df = df, lower.tail = FALSE)*2
}

# Generate data
x <- rnorm(n = 10)
y <- rnorm(n = 20, mean = 2)

# Test it
thing(na.rm = TRUE, x = x, y = y)
```

```
## [1] 0.000081
```

```r
thing(na.rm = FALSE)
```

```
## [1] 0.9
```

---

* How should we refactor?

```r
thing <- function(na.rm = FALSE, x = c(7, 1), y = c(3, 4)) {
  var1 <- mean(x, na.rm = TRUE)
  var2 <- mean(y, na.rm = TRUE)
  xx <- var(x, na.rm = TRUE)
  yy <- var(y, na.rm = TRUE)
  na <- length(x)
  nb <- length(y)
  df <- min(na, nb) - 1
  important_bit <- (var1 - var2 - 0)/sqrt(xx/na + yy/nb)
  pt(q = abs(important_bit), df = df, lower.tail = FALSE)*2
}

# Generate data
x <- rnorm(n = 10)
y <- rnorm(n = 20, mean = 2)

# Test it
thing(na.rm = TRUE, x = x, y = y)
```

```
## [1] 0.00045
```

```r
thing(na.rm = FALSE)
```

```
## [1] 0.9
```

---

```r
two_sample_t_test <- function(x, y, na.rm = FALSE, null = 0) {
  mean_x <- mean(x, na.rm = na.rm)
  mean_y <- mean(y, na.rm = na.rm)
  var_x <- var(x, na.rm = na.rm)
  var_y <- var(y, na.rm = na.rm)
  n_x <- sum(!is.na(x))
  n_y <- sum(!is.na(y))
  df <- min(n_x, n_y) - 1
  test_stat <- (mean_x - mean_y - null)/sqrt(var_x/n_x + var_y/n_y)
  p_value <- pt(q = abs(test_stat), df = df, lower.tail = FALSE)*2
  return(data.frame(test_stat = test_stat, df = df, p_value = p_value))
}

# Test it
two_sample_t_test(na.rm = TRUE, x = x, y = y)
```

```
##   test_stat df p_value
## 1      -5.4  9 0.00045
```

```r
two_sample_t_test(na.rm = FALSE)
```

```
## Error in mean(x, na.rm = na.rm): argument "x" is missing, with no default
```

---

## Jenny's Bizarro Function

```r
x <- 1:5
#x <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

cat(
  "The bizarro version of x is",
  -x,
  #!x,
  "\n"
)
```

```
## The bizarro version of x is -1 -2 -3 -4 -5
```

---

## Jenny's Bizarro Function

```r
#x <- 1:5
x <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

cat(
  "The bizarro version of x is",
  #-x,
  !x,
  "\n"
)
```

```
## The bizarro version of x is FALSE TRUE TRUE FALSE TRUE
```

* What is the code smell?

---

## Commenting Changes Behavior

* Tip: Do not comment and uncomment sections of code to alter the behavior.

```r
#x <- 1:5
x <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

cat(
  "The bizarro version of x is",
  #-x,
  !x,
  "\n"
)
```

```
## The bizarro version of x is FALSE TRUE TRUE FALSE TRUE
```

---

## Commenting Changes Behavior

* Fix with `if () else ()`

```r
x <- 1:5
#x <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

cat(
  "The bizarro version of x is",
  if (is.numeric(x)) {
    -x
  } else {
    !x
  },
  "\n"
)
```

```
## The bizarro version of x is -1 -2 -3 -4 -5
```

---

## Commenting Changes Behavior

* Fix with `if () else ()` but...
    + Use `if () else ()` in moderation

```r
#x <- 1:5
x <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

cat(
  "The bizarro version of x is",
  if (is.numeric(x)) {
    -x
  } else {
    !x
  },
  "\n"
)
```

```
## The bizarro version of x is FALSE TRUE TRUE FALSE TRUE
```

* Now that we have a working example, how can we fix the first comment/uncomment code smell?

---

## Tip: Use Functions!

```r
bizarro <- function(x) {
  if (is.numeric(x)) {
    -x
  } else {
    !x
  }
}

# Test it!
bizarro(x = 1:5)
```

```
## [1] -1 -2 -3 -4 -5
```

```r
bizarro(x = c(TRUE, FALSE, FALSE, TRUE, FALSE))
```

```
## [1] FALSE  TRUE  TRUE FALSE  TRUE
```

---

* Code smell?

```r
bizarro <- function(x) {
  if (class(x)[[1]] == "numeric" || class(x)[[1]] == "integer") {
    -x
  } else if (class(x)[[1]] == "logical") {
    !x
  } else {
    stop(
      "Don't know how to make bizarro <",
      class(x)[[1]], ">")
  }
}

bizarro(c(TRUE, FALSE, FALSE, TRUE, FALSE))
```

```
## [1] FALSE  TRUE  TRUE FALSE  TRUE
```

```r
bizarro(1:5)
```

```
## [1] -1 -2 -3 -4 -5
```

```r
bizarro(c("abc", "def"))
```

```
## Error in bizarro(c("abc", "def")): Don't know how to make bizarro <character>
```

---

## Tips:

* Use proper functions for handling class.
* Use simple conditions.
* Use well-written existing functions.

```r
bizarro <- function(x) {
  if (is.numeric(x)) {
    -x
  } else if (is.logical(x)) {
    !x
  } else {
    stop(
      "Don't know how to make bizarro <",
      class(x)[[1]], ">")
  }
}

bizarro(c(TRUE, FALSE, FALSE, TRUE, FALSE))
```

```
## [1] FALSE  TRUE  TRUE FALSE  TRUE
```

```r
bizarro(1:5)
```

```
## [1] -1 -2 -3 -4 -5
```

---

## Tip: Stop Early

```r
bizarro <- function(x) {
  stopifnot(is.numeric(x) || is.logical(x))
  if (is.numeric(x)) {
    -x
  } else {
    !x
  }
}

bizarro(c(TRUE, FALSE, FALSE, TRUE, FALSE))
```

```
## [1] FALSE  TRUE  TRUE FALSE  TRUE
```

```r
bizarro(1:5)
```

```
## [1] -1 -2 -3 -4 -5
```

```r
bizarro(c("abc", "def"))
```

```
## Error in bizarro(c("abc", "def")): is.numeric(x) || is.logical(x) is not TRUE
```

---

## Create some demo data

```r
library(pdxTrees)
library(tidyverse)

pdxTrees_demo <- get_pdxTrees_parks() %>%
  select(Species, Tree_Height) %>%
  head(n = 10)

pdxTrees_demo
```

```
## # A tibble: 10 x 2
##    Species Tree_Height
##    <chr>         <dbl>
##  1 PSME            105
##  2 PSME             94
##  3 CRLA             23
##  4 QURU             28
##  5 PSME            102
##  6 PSME             95
##  7 PSME            103
##  8 PSME            105
##  9 PSME             97
## 10 PSME            112
```

---

* Code smell?

```r
pdxTrees_demo <- pdxTrees_demo %>%
  mutate(ht_cat = if_else(Tree_Height < 30, "short",
                    if_else(Tree_Height < 100, "medium",
                      if_else(Tree_Height < 110, "tall",
                              "very tall"))))

pdxTrees_demo
```

```
## # A tibble: 10 x 3
##    Species Tree_Height ht_cat
##    <chr>         <dbl> <chr>
##  1 PSME            105 tall
##  2 PSME             94 medium
##  3 CRLA             23 short
##  4 QURU             28 short
##  5 PSME            102 tall
##  6 PSME             95 medium
##  7 PSME            103 tall
##  8 PSME            105 tall
##  9 PSME             97 medium
## 10 PSME            112 very tall
```

---

## Oniony Code

```r
pdxTrees_demo <- pdxTrees_demo %>%
  mutate(ht_cat = if_else(Tree_Height < 30, "short",
                    if_else(Tree_Height < 100, "medium",
                      if_else(Tree_Height < 110, "tall",
                              "very tall"))))

pdxTrees_demo
```

```
## # A tibble: 10 x 3
##    Species Tree_Height ht_cat
##    <chr>         <dbl> <chr>
##  1 PSME            105 tall
##  2 PSME             94 medium
##  3 CRLA             23 short
##  4 QURU             28 short
##  5 PSME            102 tall
##  6 PSME             95 medium
##  7 PSME            103 tall
##  8 PSME            105 tall
##  9 PSME             97 medium
## 10 PSME            112 very tall
```

---

## Tip: Fewer indents are easier to read

```r
pdxTrees_demo <- pdxTrees_demo %>%
  mutate(ht_cat = case_when(
    Tree_Height < 30 ~ "short",
    Tree_Height < 100 ~ "medium",
    Tree_Height < 110 ~ "tall",
    TRUE ~ "very tall"
  ))

pdxTrees_demo
```

```
## # A tibble: 10 x 3
##    Species Tree_Height ht_cat
##    <chr>         <dbl> <chr>
##  1 PSME            105 tall
##  2 PSME             94 medium
##  3 CRLA             23 short
##  4 QURU             28 short
##  5 PSME            102 tall
##  6 PSME             95 medium
##  7 PSME            103 tall
##  8 PSME            105 tall
##  9 PSME             97 medium
## 10 PSME            112 very tall
```

* Note: Like a chain of `if () else if ()`, the conditions are evaluated in order, so the first one that is `TRUE` determines the result.

---

* Code Smell?

```r
avg <- 5
std_dev <- 2

z_score <- function(x){
  (x - avg)/std_dev
}

z_score(2)
```

```
## [1] -1.5
```

---

## Tip: Beware of Global Data

```r
z_score <- function(x, avg, std_dev){
  (x - avg)/std_dev
}

z_score(x = 2, avg = 5, std_dev = 2)
```

```
## [1] -1.5
```

---

* Code Smells?

```r
two_sample_t_test <- function(x, y, na.rm = FALSE, null = 0) {
  mean_x <- sum(x)/length(x)
  mean_y <- sum(y)/length(y)
  var_x <- sum((x - sum(x)/length(x))^2/(length(x)-1))
  var_y <- sum((y - sum(y)/length(y))^2/(length(y)-1))
  n_x <- length(x)
  n_y <- length(y)
  df <- min(n_x, n_y) - 1
  test_stat <- (mean_x - mean_y - null)/sqrt(var_x/n_x + var_y/n_y)
  p_value <- pt(q = abs(test_stat), df = df, lower.tail = FALSE)*2
  return(data.frame(test_stat = test_stat, p_value = p_value))
}

# Test it
x <- rnorm(n = 10)
y <- rnorm(n = 20)
two_sample_t_test(na.rm = TRUE, x = x, y = y)
```

```
##   test_stat p_value
## 1       1.6    0.15
```

---

## Tip: Aggressively Decompose Code into Little Functions

* Don't write monster functions!
* Small well-named helper functions are better than one BIG function with loads of commented code.

```r
# null defaults to 0; a default of `null = null` would refer to itself
# and fail whenever the argument is not supplied.
test_stat_t <- function(x, y, na.rm = FALSE, null = 0){
  mean_x <- mean(x, na.rm = na.rm)
  mean_y <- mean(y, na.rm = na.rm)
  var_x <- var(x, na.rm = na.rm)
  var_y <- var(y, na.rm = na.rm)
  n_x <- sum(!is.na(x))
  n_y <- sum(!is.na(y))
  (mean_x - mean_y - null)/sqrt(var_x/n_x + var_y/n_y)
}

df_t <- function(n_x, n_y) min(n_x, n_y) - 1

p_value_t <- function(test_stat, df){
  pt(q = abs(test_stat), df = df, lower.tail = FALSE)*2
}
```

---

## Tip: Aggressively Decompose Code into Little Functions

* If the helper functions are sufficiently well-named, then you likely don't need to look at the body of these functions to understand the code.

```r
two_sample_t_test <- function(x, y, na.rm = FALSE, null = 0) {
  test_stat <- test_stat_t(x = x, y = y, na.rm = na.rm, null = null)
  df <- df_t(sum(!is.na(x)), sum(!is.na(y)))
  p_value <- p_value_t(test_stat, df)
  return(data.frame(test_stat = test_stat, df = df, p_value = p_value))
}

# Test it
two_sample_t_test(na.rm = TRUE, x = x, y = y)
```

```
##   test_stat df p_value
## 1       1.6  9    0.15
```
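---

## Tip: Check That Behavior Did Not Change

Refactoring should leave the observable behavior alone, so one way to gain confidence is to run the old and new versions on the same data and compare the results. The sketch below is only an illustration under some assumptions: `t_test_before()` and `t_test_after()` are hypothetical names for a one-function version and a decomposed version in the spirit of the earlier slides, the seed is arbitrary, and the comparison with `t.test()` relies on its default (Welch) test statistic using the same formula.

```r
# Hypothetical "before" version: every step inside one function.
t_test_before <- function(x, y, na.rm = FALSE, null = 0) {
  n_x <- sum(!is.na(x))
  n_y <- sum(!is.na(y))
  test_stat <- (mean(x, na.rm = na.rm) - mean(y, na.rm = na.rm) - null) /
    sqrt(var(x, na.rm = na.rm) / n_x + var(y, na.rm = na.rm) / n_y)
  df <- min(n_x, n_y) - 1
  p_value <- pt(q = abs(test_stat), df = df, lower.tail = FALSE) * 2
  data.frame(test_stat = test_stat, df = df, p_value = p_value)
}

# Hypothetical "after" version: the same steps as small helpers.
test_stat_t <- function(x, y, na.rm = FALSE, null = 0) {
  standard_error <- sqrt(var(x, na.rm = na.rm) / sum(!is.na(x)) +
                           var(y, na.rm = na.rm) / sum(!is.na(y)))
  (mean(x, na.rm = na.rm) - mean(y, na.rm = na.rm) - null) / standard_error
}

df_t <- function(n_x, n_y) min(n_x, n_y) - 1

p_value_t <- function(test_stat, df) {
  pt(q = abs(test_stat), df = df, lower.tail = FALSE) * 2
}

t_test_after <- function(x, y, na.rm = FALSE, null = 0) {
  test_stat <- test_stat_t(x = x, y = y, na.rm = na.rm, null = null)
  df <- df_t(sum(!is.na(x)), sum(!is.na(y)))
  data.frame(test_stat = test_stat, df = df,
             p_value = p_value_t(test_stat, df))
}

# Use the same data for both versions (the seed is arbitrary).
set.seed(241)
x <- rnorm(n = 10)
y <- rnorm(n = 20, mean = 2)

# The refactored version should return the same data frame.
same_result <- isTRUE(all.equal(t_test_before(x, y), t_test_after(x, y)))

# The test statistic should also agree with base R's Welch statistic.
same_stat <- isTRUE(all.equal(
  t_test_after(x, y)$test_stat,
  unname(t.test(x, y)$statistic)
))

stopifnot(same_result, same_stat)
```

* If either check fails, `stopifnot()` raises an error, which signals that the refactoring changed more than the structure.
* The p-values are not compared with `t.test()` because these functions use the conservative `min(n_x, n_y) - 1` degrees of freedom rather than the Welch approximation.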