class: center, middle

# Graphing with `ggplot2`

<img src="img/hero_wall_pink.png" width="800px"/>

## Kelly McConville

.large[Math 241 | Week 2 | Spring 2021]

---

# Announcements

* Sign up at [GitHub](https://github.com/) (it is free) and enter your GitHub username into [this spreadsheet](https://docs.google.com/spreadsheets/d/1nvM8jJDUvp8H5iYF59aKNdnJlqy_aW18R6d6DBUPeSU/edit?usp=sharing) by end of day **today**.
    + Link can also be found in our class Slack channel.
* Make sure [to sign in](https://docs.google.com/document/d/1QMXSF9TxsXj3j8M42mwTatGnB_-GayYvGmHWnnXd-sg/edit?usp=sharing).
* Week 2 readings posted on the [Schedule](https://reed-statistics.github.io/math241s21/schedule.html).
* Lab 1 due on Gradescope on Monday!
    + Colored pencils outside my office (end of 3rd floor of Library).

> "The use of statistics in journalism, like the use of statistics in general, will always involve artistry." -- Jonathan Stray

---

## ["One Dataset, Visualized 25 Ways"](https://flowingdata.com/2017/01/24/one-dataset-visualized-25-ways/#jp-carousel-47350)

* There are many ways to visualize a dataset.
    + What geometric object do I use?
    + How do I map the variables to the aesthetics?
    + What context should I provide?
    + Should I include animation? Interactivity?
* Questions:
    + Which graphs are most effective at telling a story?
    + Any problematic graphs?
* Important messages
    + Try many graphs before you settle on one.
    + Some of these are the **start** of an excellent graph.
    + Revise, revise, revise!

---

## Goals for Today

* Review `ggplot2`.
* Continue to consider best practices for data viz.

---

## Components of Data Graphics

* **data**: dataset that contains the raw data
* **geom**: geometric shape that the data are mapped to.
    + point, line, bar, text, ...
* **aes**thetic: visual properties of the **geom**
    + x position, y position, color, fill, shape
* **coord**: coordinate system
    + Cartesian, polar
* **scale**: controls how data are mapped to the visual values of the aesthetic.
    + EX: particular colors, linear
* **guide**: legend to help user convert visual display back to the data

---

## ggplot2 example code

```r
ggplot(data = ---, mapping = aes(---)) +
  geom_---(---) +
  coord_---() +
  scale_---_---() +
  ---
```

---

#### Example: Over the course of a year, how does the daily number of births vary?

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-2-1.png" width="504" />

--

* Cycles?

---

#### Example

```r
# Load library that has dataset of interest
library(mosaicData)

# Grab data
data(Births2015)

# Load tidyverse (which contains ggplot2)
library(tidyverse)
```

---

#### Example

```r
# Create plot
ggplot(data = Births2015,
       mapping = aes(x = date, y = births)) +
  geom_point()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-4-1.png" width="504" />

---

#### Example: Cycles related to day of the week?

```r
# Look at structure of data with dplyr
glimpse(Births2015)
```

```
## Rows: 365
## Columns: 8
## $ date         <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01…
## $ births       <dbl> 8068, 10850, 8328, 7065, 11892, 12425, 12141, 12094, 118…
## $ wday         <ord> Thu, Fri, Sat, Sun, Mon, Tue, Wed, Thu, Fri, Sat, Sun, M…
## $ year         <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ month        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day_of_year  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ day_of_month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ day_of_week  <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2,…
```

---

#### Example: Cycles related to day of the week?

```r
# Create plot
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = day_of_week)) +
  geom_point()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-6-1.png" width="504" />

* Additional aesthetic?
    + Why is this not what we want? What do we want?

---

#### Example: Cycles related to day of the week?

```r
# Create plot
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = wday)) +
  geom_point()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-7-1.png" width="504" />

* What happened to the aspect ratio when we added the legend?

---

#### Example

```r
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = wday)) +
  geom_point() +
  theme(legend.position = "bottom")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-8-1.png" width="504" />

---

#### Example

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-9-1.png" width="504" />

* What if we want to see the *direction* that the number of births takes over time for each day of the week?
    + New visual cue/**geom**?

---

#### Example

```r
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = wday)) +
  geom_line() +
  theme(legend.position = "bottom")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-10-1.png" width="504" />

* What if we want both *position* and *direction*?

---

#### Example

```r
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = wday)) +
  geom_point() +
  geom_line() +
  theme(legend.position = "bottom")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-11-1.png" width="504" />

---

### Coordinate System Layer

```r
library(lubridate)

ggplot(data = Births2015,
       mapping = aes(x = date, y = births)) +
  geom_point() +
  coord_cartesian(xlim = as_date(c("2015-01-01", "2015-01-31")))
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-12-1.png" width="504" />

* What did this do?

---

### Setting instead of Mapping

* What if we want all the points to be colored `midnightblue`?

```r
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = midnightblue)) +
  geom_point()
```

```
## Error in FUN(X[[i]], ...): object 'midnightblue' not found
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-13-1.png" width="504" />

---

### Setting instead of Mapping

```r
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = "midnightblue")) +
  geom_point()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-14-1.png" width="504" />

---

### Setting instead of Mapping

```r
ggplot(data = Births2015,
       mapping = aes(x = date, y = births)) +
  geom_point(color = "midnightblue")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-15-1.png" width="504" />

---

### Setting instead of Mapping

```r
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = wday)) +
  geom_line() +
  geom_point(color = "midnightblue") +
  theme(legend.position = "bottom")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-16-1.png" width="504" />

---

### Setting instead of Mapping

* Consider the order of your ggplot layers!

```r
ggplot(data = Births2015,
       mapping = aes(x = date, y = births, color = wday)) +
  geom_point(color = "midnightblue") +
  geom_line() +
  theme(legend.position = "bottom")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-17-1.png" width="504" />

---

### Let's explore other **geom**s

* Many are listed on the first page of the `ggplot2` cheatsheet.
* Can also ask R:

```r
apropos("geom_")
```

```
##  [1] "geom_abline"            "geom_area"              "geom_bar"
##  [4] "geom_bin2d"             "geom_blank"             "geom_boxplot"
##  [7] "geom_col"               "geom_contour"           "geom_contour_filled"
## [10] "geom_count"             "geom_crossbar"          "geom_curve"
## [13] "geom_density"           "geom_density_2d"        "geom_density_2d_filled"
## [16] "geom_density2d"         "geom_density2d_filled"  "geom_dotplot"
## [19] "geom_errorbar"          "geom_errorbarh"         "geom_freqpoly"
## [22] "geom_function"          "geom_hex"               "geom_histogram"
## [25] "geom_hline"             "geom_jitter"            "geom_label"
## [28] "geom_line"              "geom_linerange"         "geom_map"
## [31] "geom_path"              "geom_point"             "geom_pointrange"
## [34] "geom_polygon"           "geom_qq"                "geom_qq_line"
## [37] "geom_quantile"          "geom_raster"            "geom_rect"
## [40] "geom_ribbon"            "geom_rug"               "geom_segment"
## [43] "geom_sf"                "geom_sf_label"          "geom_sf_text"
## [46] "geom_smooth"            "geom_spoke"             "geom_step"
## [49] "geom_text"              "geom_tile"              "geom_violin"
## [52] "geom_vline"             "update_geom_defaults"
```

---

### Adding Curve(s)

```r
# Births78 (daily US births in 1978) also comes from mosaicData
ggplot(data = Births78,
       mapping = aes(x = date, y = births, color = wday)) +
  geom_point() +
  geom_smooth(se = FALSE)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-19-1.png" width="504" />

---

### Adding Curve(s)

```r
ggplot(data = Births78,
       mapping = aes(x = date, y = births, color = wday)) +
  geom_point() +
  geom_smooth(color = "black", se = FALSE)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-20-1.png" width="504" />

* What happened?

---

### New Example: HELPrct

* Need a new dataset
    + `HELPrct`: Health Evaluation and Linkage to Primary Care Randomized Clinical Trial
    + Subjects admitted for treatment for addiction to one of three substances.

```r
# Library with data
library(mosaicData)

# Grab data
data(HELPrct)
```

---

### HELPrct

```r
glimpse(HELPrct)
```

```
## Rows: 453
## Columns: 30
## $ age              <int> 37, 37, 26, 39, 32, 47, 49, 28, 50, 39, 34, 58, 58, …
## $ anysubstatus     <int> 1, 1, 1, 1, 1, 1, NA, 1, 1, 1, NA, 0, 1, 1, 1, 1, 1,…
## $ anysub           <fct> yes, yes, yes, yes, yes, yes, NA, yes, yes, yes, NA,…
## $ cesd             <int> 49, 30, 39, 15, 39, 6, 52, 32, 50, 46, 46, 49, 22, 3…
## $ d1               <int> 3, 22, 0, 2, 12, 1, 14, 1, 14, 4, 0, 3, 5, 10, 2, 6,…
## $ daysanysub       <int> 177, 2, 3, 189, 2, 31, NA, 47, 31, 115, NA, 192, 6, …
## $ dayslink         <int> 225, NA, 365, 343, 57, 365, 334, 365, 365, 382, 365,…
## $ drugrisk         <int> 0, 0, 20, 0, 0, 0, 0, 7, 18, 20, 8, 0, 0, 0, 0, 0, 0…
## $ e2b              <int> NA, NA, NA, 1, 1, NA, 1, 8, 7, 3, NA, NA, NA, 1, NA,…
## $ female           <int> 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0…
## $ sex              <fct> male, male, male, female, male, female, female, male…
## $ g1b              <fct> yes, yes, no, no, no, no, yes, yes, no, no, no, no, …
## $ homeless         <fct> housed, homeless, housed, housed, homeless, housed, …
## $ i1               <int> 13, 56, 0, 5, 10, 4, 13, 12, 71, 20, 0, 13, 20, 13, …
## $ i2               <int> 26, 62, 0, 5, 13, 4, 20, 24, 129, 27, 0, 13, 31, 20,…
## $ id               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 1…
## $ indtot           <int> 39, 43, 41, 28, 38, 29, 38, 44, 44, 44, 34, 11, 40, …
## $ linkstatus       <int> 1, NA, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, …
## $ link             <fct> yes, NA, no, no, yes, no, no, no, no, no, no, no, no…
## $ mcs              <dbl> 25.1, 26.7, 6.8, 44.0, 21.7, 55.5, 21.8, 9.2, 22.0, …
## $ pcs              <dbl> 58, 36, 75, 62, 37, 46, 25, 65, 38, 23, 60, 42, 39, …
## $ pss_fr           <int> 0, 1, 13, 11, 10, 5, 1, 4, 5, 0, 0, 13, 13, 1, 1, 7,…
## $ racegrp          <fct> black, white, black, white, black, black, black, whi…
## $ satreat          <fct> no, no, no, yes, no, no, yes, yes, no, yes, no, yes,…
## $ sexrisk          <int> 4, 7, 2, 4, 6, 5, 8, 6, 8, 0, 2, 0, 1, 4, 8, 3, 4, 4…
## $ substance        <fct> cocaine, alcohol, heroin, heroin, cocaine, cocaine, …
## $ treat            <fct> yes, yes, no, no, no, yes, no, yes, no, yes, yes, no…
## $ avg_drinks       <int> 13, 56, 0, 5, 10, 4, 13, 12, 71, 20, 0, 13, 20, 13, …
## $ max_drinks       <int> 26, 62, 0, 5, 13, 4, 20, 24, 129, 27, 0, 13, 31, 20,…
## $ hospitalizations <int> 3, 22, 0, 2, 12, 1, 14, 1, 14, 4, 0, 3, 5, 10, 2, 6,…
```

---

### Amounts

* What is a common geom for size/amount?

---

### `geom_bar`

```r
ggplot(data = HELPrct, mapping = aes(x = substance)) +
  geom_bar()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-23-1.png" width="504" />

* Aesthetic?
* Verbalize the mapping of **data** to **geom**.
    + How is this mapping different from the scatterplot?

---

### Another option

```r
# First wrangle with dplyr
HELPrct %>%
  count(substance) %>%
  ggplot(mapping = aes(x = substance, y = n)) +
  geom_col()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-24-1.png" width="504" />

* What changed?

---

### `geom_point`: New interpretation

```r
HELPrct %>%
  count(substance) %>%
  ggplot(mapping = aes(x = substance, y = n)) +
  geom_point(size = 4)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-25-1.png" width="504" />

* Pros/Cons of `geom_point` versus `geom_bar`?

---

### `geom_point`: New interpretation

```r
HELPrct %>%
  count(substance) %>%
  ggplot(mapping = aes(x = substance, y = n)) +
  geom_segment(mapping = aes(xend = substance), yend = 0) +
  geom_point(size = 4, color = "orange")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-26-1.png" width="504" />

* Pros/Cons of `geom_point` versus `geom_bar`?

---

### `geom_bar`: Two variables

* Relationship between substance and racegrp?

```r
ggplot(data = HELPrct,
       mapping = aes(x = substance, fill = racegrp)) +
  geom_bar()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-27-1.png" width="504" />

* Describe the mapping.

---

### `geom_bar`: Two variables

* Relationship between substance and racegrp?

```r
ggplot(data = HELPrct,
       mapping = aes(x = substance, fill = racegrp)) +
  geom_bar(position = "fill")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-28-1.png" width="504" />

* Describe the mapping.

---

### `geom_bar`: Two variables

* Relationship between substance and racegrp?

```r
ggplot(data = HELPrct,
       mapping = aes(x = substance, fill = racegrp)) +
  geom_bar(position = "dodge")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-29-1.png" width="504" />

* Describe the mapping.

---

### `geom_tile`: Two variables

* Relationship between substance and racegrp?

```r
HELPrct %>%
  count(substance, racegrp) %>%
  ggplot(mapping = aes(x = substance, y = racegrp, fill = n)) +
  geom_tile()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-30-1.png" width="504" />

* Describe the mapping.

---

### Distributions

* What are useful graphs/geoms for visualizing distributions?

---

### `geom_histogram`

* How old are participants in the study?

```r
ggplot(HELPrct, aes(x = age)) +
  geom_histogram()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-31-1.png" width="504" />

* Verbalize the mapping.

---

### `geom_histogram`

* Can modify the mapping via the `binwidth` argument

```r
ggplot(HELPrct, aes(x = age)) +
  geom_histogram(binwidth = 2)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-32-1.png" width="504" />

---

### `geom_histogram`

* Does the age distribution vary by substance?

```r
ggplot(HELPrct, aes(x = age, fill = substance)) +
  geom_histogram()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-33-1.png" width="504" />

--

* Bad

---

### `geom_histogram`

* Does the age distribution vary by substance?

```r
ggplot(HELPrct, aes(x = age, fill = substance)) +
  geom_histogram(alpha = 0.4, position = "identity")
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-34-1.png" width="504" />

--

* Still bad

---

### Faceting

* Does the age distribution vary by substance?

```r
ggplot(HELPrct, aes(x = age)) +
  geom_histogram() +
  facet_wrap(~substance)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-35-1.png" width="504" />

---

### `geom_density`

```r
ggplot(HELPrct, aes(x = age)) +
  geom_density()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-36-1.png" width="504" />

---

### `geom_density`

* Does the age distribution vary by substance?

```r
ggplot(HELPrct, aes(x = age, color = substance)) +
  geom_density()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-37-1.png" width="504" />

---

### `geom_density`

* Does the age distribution vary by substance?

```r
ggplot(HELPrct, aes(x = age, fill = substance)) +
  geom_density(alpha = 0.3)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-38-1.png" width="504" />

---

### `geom_density`

* Does the age vary by substance?

```r
ggplot(HELPrct, aes(x = age, fill = substance)) +
  geom_density(position = "fill", adjust = 2)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-39-1.png" width="504" />

---

### Faceting

* Does the age vary by substance?

```r
ggplot(HELPrct, aes(x = age)) +
  geom_line(stat = "density") +
  facet_grid(.~substance)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-40-1.png" width="504" />

---

### `geom_boxplot`

* Does the age distribution vary by substance?

```r
ggplot(HELPrct, aes(x = substance, y = age)) +
  geom_boxplot()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-41-1.png" width="504" />

---

### `geom_boxplot`

* Does the age distribution vary by substance?

```r
ggplot(HELPrct, aes(x = substance, y = age)) +
  geom_boxplot(varwidth = TRUE)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-42-1.png" width="504" />

* What does `varwidth` do?

---

### `geom_boxplot`

* Does the age distribution vary by substance?

```r
ggplot(HELPrct, aes(x = substance, y = age)) +
  geom_boxplot(notch = TRUE)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-43-1.png" width="504" />

* Why might we add `notch = TRUE`?

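
---

### `geom_boxplot`: `varwidth` and `notch` by hand

A rough sketch of what the two arguments encode, assuming the rules stated in the `geom_boxplot()` documentation: notches extend `1.58 * IQR / sqrt(n)` from the median, and `varwidth = TRUE` draws box widths proportional to `sqrt(n)`. The object and column names (`box_summary`, `notch_lower`, `notch_upper`, `rel_width`) are made up for illustration, and the widths are rescaled here so the widest box is 1.

```r
library(mosaicData)  # provides HELPrct
library(dplyr)

# Rebuild, per substance, the quantities that notch and varwidth display
box_summary <- HELPrct %>%
  group_by(substance) %>%
  summarize(
    n = sum(!is.na(age)),
    median_age = median(age, na.rm = TRUE),
    iqr_age = IQR(age, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    # notch: median +/- 1.58 * IQR / sqrt(n)
    notch_lower = median_age - 1.58 * iqr_age / sqrt(n),
    notch_upper = median_age + 1.58 * iqr_age / sqrt(n),
    # varwidth: width proportional to sqrt(n), rescaled so the widest box is 1
    rel_width = sqrt(n) / max(sqrt(n))
  )

box_summary
```

* Notches that do not overlap suggest the medians differ.
* Narrower boxes flag groups with fewer observations.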
---

### `geom_boxplot`

* Does the age distribution vary by substance and racegrp?

```r
ggplot(HELPrct, aes(x = substance, y = age, color = racegrp)) +
  geom_boxplot()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-44-1.png" width="504" />

---

### `geom_boxplot`

* Does the age distribution vary by substance and racegrp?

```r
ggplot(HELPrct, aes(x = substance, y = age, color = racegrp)) +
  geom_boxplot() +
  coord_flip()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-45-1.png" width="504" />

---

### `geom_violin`

* Does the age distribution vary by substance?
    + Want to see distribution more clearly!

```r
ggplot(HELPrct, aes(x = substance, y = age)) +
  geom_violin()
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-46-1.png" width="504" />

* Utility of the violin over the box?

---

### `geom_violin`

* Does the age distribution vary by substance?
    + Want to see distribution more clearly!

```r
ggplot(HELPrct, aes(x = substance, y = age)) +
  geom_violin() +
  geom_jitter(alpha = .3, width = .1)
```

<img src="slidesWk2Tu_files/figure-html/unnamed-chunk-47-1.png" width="504" />

---

## Thursday

* Start adding context!
    + Labels
    + Highlighting
    + Useful text
* Look at more `geom`s.
* Explore further customizations.
    + Color
    + Themes
* Learn how to ask coding questions well.