class: center, middle

# Data Visualization: `ggplot2`

<img src="img/DAW.png" width="500px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 2 | Fall 2020]
</span>

---

## Announcements

* Lab 01 is due BEFORE your Week 2 lab (regardless of what Gradescope says).
    + Questions on uploading to Gradescope?
    + Questions on exporting the pdf from RStudio?

* Don't forget to come by office hours twice during the first four weeks of the semester!

---

## Week 2 Topics

* **Creating** Data Visualizations

---

# Goals for Today

* Discuss standard graphs for numerical data:
    + **Histogram**: one numerical variable
    + **Side-by-side boxplot**: one numerical variable and one categorical variable
    + **Side-by-side violin plot**: one numerical variable and one categorical variable
    + **Scatterplot**: two numerical variables
    + **Linegraph**: two numerical variables

--

* Learn how to build these graphs with `ggplot2`.

--

* On Friday:
    + Bar plots for categorical data.
    + How to incorporate more than two variables into a plot.
    + Context!
    + We won't worry today about axis labels, titles, etc.

---

# Load packages

* `ggplot2` is part of the `tidyverse` collection of data science packages.

```r
# Load necessary packages
library(tidyverse)
```

---

## Load the [Portland Biketown data](https://www.biketownpdx.com/system-data)

```r
# Import the data
biketown <- read_csv("/home/courses/math141f20/Data/biketown_spring1920.csv")

# Look at the data
glimpse(biketown)
```

```
## Rows: 124,726
## Columns: 19
## $ RouteID          <dbl> 10889392, 10889395, 10889396, 10889399, 10889407, 10…
## $ PaymentPlan      <chr> "Casual", "Subscriber", "Subscriber", "Subscriber", …
## $ StartHub         <chr> "N Mississippi at Beech", NA, NA, "SW Moody at Aeria…
## $ StartLatitude    <dbl> 46, 45, 46, 45, 46, 46, 46, 46, 46, 46, 46, 46, 46, …
## $ StartLongitude   <dbl> -123, -123, -123, -123, -123, -123, -123, -123, -123…
## $ StartDate        <chr> "3/1/2019", "3/1/2019", "3/1/2019", "3/1/2019", "3/1…
## $ StartTime        <time> 00:22:00, 00:30:00, 00:33:00, 00:35:00, 00:46:00, 0…
## $ EndHub           <chr> "SE 29th at Stark", "NE Multnomah at Grand", NA, "NE…
## $ EndLatitude      <dbl> 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, …
## $ EndLongitude     <dbl> -123, -123, -123, -123, -123, -123, -123, -123, -123…
## $ EndDate          <chr> "3/1/2019", "3/1/2019", "3/1/2019", "3/1/2019", "3/1…
## $ EndTime          <time> 00:48:00, 01:04:00, 01:03:00, 01:06:00, 00:55:00, 0…
## $ TripType         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BikeID           <dbl> 6023, 7313, 6538, 6424, 6576, 7324, 6668, 7266, 7320…
## $ BikeName         <chr> "0323 BIKETOWN", "0985 DESIGN BIKE, N PAUL BUNYAN", …
## $ Distance_Miles   <dbl> 4.07, 3.61, 0.98, 3.28, 1.25, 2.70, 0.82, 0.11, 1.43…
## $ Duration         <time> 00:26:21, 00:34:18, 00:30:22, 00:30:58, 00:08:56, 0…
## $ RentalAccessPath <chr> "mobile", "keypad", "keypad", "keypad", "mobile", "k…
## $ MultipleRental   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
```

---

```r
# Look at the first six rows
head(biketown)
```

```
## # A tibble: 6 x 19
##   RouteID PaymentPlan StartHub StartLatitude StartLongitude StartDate StartTime
##     <dbl> <chr>       <chr>            <dbl>          <dbl> <chr>     <time>
## 1  1.09e7 Casual      N Missi…          45.5          -123. 3/1/2019  22'00"
## 2  1.09e7 Subscriber  <NA>              45.5          -123. 3/1/2019  30'00"
## 3  1.09e7 Subscriber  <NA>              45.5          -123. 3/1/2019  33'00"
## 4  1.09e7 Subscriber  SW Mood…          45.5          -123. 3/1/2019  35'00"
## 5  1.09e7 Subscriber  NW 18th…          45.5          -123. 3/1/2019  46'00"
## 6  1.09e7 Subscriber  <NA>              45.6          -123. 3/1/2019  53'00"
## # … with 12 more variables: EndHub <chr>, EndLatitude <dbl>,
## #   EndLongitude <dbl>, EndDate <chr>, EndTime <time>, TripType <lgl>,
## #   BikeID <dbl>, BikeName <chr>, Distance_Miles <dbl>, Duration <time>,
## #   RentalAccessPath <chr>, MultipleRental <lgl>
```

---

```r
# To access one variable: dataset$variable
head(biketown$Distance_Miles)
```

```
## [1] 4.07 3.61 0.98 3.28 1.25 2.70
```

```r
# Determine type
class(biketown$Distance_Miles)
```

```
## [1] "numeric"
```

```r
class(biketown)
```

```
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
```

```r
# Variable names
names(biketown)
```

```
##  [1] "RouteID"          "PaymentPlan"      "StartHub"         "StartLatitude"
##  [5] "StartLongitude"   "StartDate"        "StartTime"        "EndHub"
##  [9] "EndLatitude"      "EndLongitude"     "EndDate"          "EndTime"
## [13] "TripType"         "BikeID"           "BikeName"         "Distance_Miles"
## [17] "Duration"         "RentalAccessPath" "MultipleRental"
```

---

# Biketown data

```r
# Remove suspect points: keep only trips shorter than 1,000 miles
biketown <- filter(biketown, Distance_Miles < 1000)

# Display variables
select(biketown, Distance_Miles, StartTime, PaymentPlan, Duration)
```

```
## # A tibble: 124,721 x 4
##    Distance_Miles StartTime PaymentPlan Duration
##             <dbl> <time>    <chr>       <time>
##  1           4.07 00:22     Casual      26'21"
##  2           3.61 00:30     Subscriber  34'18"
##  3           0.98 00:33     Subscriber  30'22"
##  4           3.28 00:35     Subscriber  30'58"
##  5           1.25 00:46     Subscriber  08'56"
##  6           2.7  00:53     Subscriber  46'44"
##  7           0.82 01:33     Subscriber  06'55"
##  8           0.11 01:57     Subscriber  02'14"
##  9           1.43 02:00     Subscriber  27'53"
## 10           0.53 02:04     Casual      05'07"
## # … with 124,711 more rows
```

---

# Grammar of Graphics

* **data**: data frame that contains the raw data
    + Variables

* **geom**: geometric shape that the data are mapped to.
    + point, line, bar, text, ...

* **aesthetic**: visual properties of the **geom**
    + x position, y position, color, fill, shape

* **scale**: controls how data are mapped to the visual values of the aesthetic.
    + EX: particular colors

* **guide**: legend to help user convert visual display back to the data

---

# `ggplot2` example code

**Guiding Principle**: We will map variables from the **data** to the **aes**thetic attributes (e.g. location, size, shape, color) of **geom**etric objects (e.g. points, lines, bars).

```r
ggplot(data = ---, mapping = aes(---)) +
  geom_---(---)
```

* There are other layers, such as `scale_---_---()` and `labs()`, but we will wait on those.

---

# Histograms

<img src="wk02_wed_files/figure-html/unnamed-chunk-7-1.png" width="360" />

* Binned counts of data.

* Great for assessing shape.

---

# Data Shapes

<img src="wk02_wed_files/figure-html/unnamed-chunk-8-1.png" width="180" /><img src="wk02_wed_files/figure-html/unnamed-chunk-8-2.png" width="180" /><img src="wk02_wed_files/figure-html/unnamed-chunk-8-3.png" width="180" />

* Shapes:
    + Right skewed
    + Bell shaped and symmetric
    + Left skewed

---

# Histograms

```r
# Create histogram
ggplot(data = biketown, mapping = aes(x = Distance_Miles)) +
  geom_histogram()
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-9-1.png" width="360" />

---

# Histograms

```r
# Create histogram
ggplot(data = biketown, mapping = aes(x = Distance_Miles)) +
  geom_histogram(color = "white", fill = "blue", bins = 70)
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-10-1.png" width="360" />

* We can also *set* (instead of *map*) some of the aesthetics. If an aesthetic is not mapped to a variable, specify it directly in `geom_histogram()`, outside of `aes()`.

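---

# Histograms: setting vs. mapping

To make the distinction between setting and mapping concrete, here is a minimal sketch. It assumes a small made-up data frame, `toy`, with invented columns `miles` and `plan`; these are placeholders for illustration and are not the Biketown data.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration (not the Biketown data)
toy <- data.frame(
  miles = c(0.5, 1.2, 1.8, 2.4, 3.1, 0.9, 1.5, 4.2),
  plan  = c("Casual", "Subscriber", "Subscriber", "Casual",
            "Casual", "Subscriber", "Subscriber", "Casual")
)

# Setting: fill is a single fixed value, so it goes in the geom, outside aes()
p_set <- ggplot(data = toy, mapping = aes(x = miles)) +
  geom_histogram(fill = "blue", color = "white", bins = 5)

# Mapping: fill depends on a variable, so it goes inside aes()
p_map <- ggplot(data = toy, mapping = aes(x = miles, fill = plan)) +
  geom_histogram(color = "white", bins = 5)

p_set
p_map
```

* In the second plot, `ggplot2` picks a fill color for each value of `plan` and adds a legend (the **guide**).

* Writing `aes(fill = "blue")` would treat `"blue"` as a data value rather than a color, so the bars would get a default color and a legend labeled "blue".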
---

# Boxplots

<img src="wk02_wed_files/figure-html/unnamed-chunk-11-1.png" width="360" />

* Five number summary: minimum, first quartile (Q1), median, third quartile (Q3), maximum

* Interquartile range (IQR) `\(=\)` Q3 `\(-\)` Q1

* Outliers: unusual points
    + Boxplot defines unusual as being more than `\(1.5*IQR\)` below `\(Q1\)` or above `\(Q3\)`.

* Whiskers: reach out to the furthest point that is NOT an outlier

---

# Boxplots

```r
ggplot(data = biketown, mapping = aes(x = PaymentPlan, y = Distance_Miles)) +
  geom_boxplot()
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-12-1.png" width="360" />

---

# Boxplots

```r
ggplot(data = biketown, mapping = aes(x = PaymentPlan, y = Distance_Miles)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 8))
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-13-1.png" width="360" />

---

# Violin plots

```r
ggplot(data = biketown, mapping = aes(x = PaymentPlan, y = Distance_Miles)) +
  geom_violin() +
  coord_cartesian(ylim = c(0, 8))
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-14-1.png" width="360" />

* Shows the shape of the distribution within each group.

---

### Scatterplots

<img src="wk02_wed_files/figure-html/unnamed-chunk-15-1.png" width="360" />

* Explore relationships between numerical variables.
    + We will be especially interested in **linear** relationships.

---

### Scatterplots

```r
ggplot(data = biketown, mapping = aes(x = Duration, y = Distance_Miles)) +
  geom_point()
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-16-1.png" width="360" />

---

# Fixing Over-plotting

```r
ggplot(data = biketown, mapping = aes(x = Duration, y = Distance_Miles)) +
  geom_point(alpha = 0.1)
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-17-1.png" width="360" />

---

# Scatterplots

```r
# Summarize use by time
biketown2 <- count(biketown, StartTime)
biketown2
```

```
## # A tibble: 1,440 x 2
##    StartTime     n
##    <time>    <int>
##  1 00'00"       28
##  2 01'00"       38
##  3 02'00"       40
##  4 03'00"       26
##  5 04'00"       31
##  6 05'00"       33
##  7 06'00"       28
##  8 07'00"       41
##  9 08'00"       33
## 10 09'00"       35
## # … with 1,430 more rows
```

---

# Scatterplots

```r
ggplot(data = biketown2, mapping = aes(x = StartTime, y = n)) +
  geom_point()
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-19-1.png" width="360" />

---

# Linegraphs

<img src="wk02_wed_files/figure-html/unnamed-chunk-20-1.png" width="360" />

* Also called time series plot if time is represented on the x axis.

---

# Linegraphs

```r
ggplot(data = biketown2, mapping = aes(x = StartTime, y = n)) +
  geom_line()
```

<img src="wk02_wed_files/figure-html/unnamed-chunk-21-1.png" width="360" />

* Also called time series plot if time is represented on the x axis.

---

# Recap: `ggplot2`

```r
library(tidyverse)
ggplot(data = ---, mapping = aes(---)) +
  geom_---(---)
```
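
---

# Recap: filling in the template

One way the blanks in the template might be filled in is sketched below. The data frame `rides` and its columns `minutes`, `miles`, and `plan` are hypothetical, invented here so the code runs on its own; swap in your own data and variables.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration (not the Biketown data)
rides <- data.frame(
  minutes = c(5, 12, 20, 26, 31, 8, 15, 45),
  miles   = c(0.5, 1.2, 1.8, 2.4, 3.1, 0.9, 1.5, 4.2),
  plan    = rep(c("Casual", "Subscriber"), times = 4)
)

# One numerical variable -> histogram
p_hist <- ggplot(data = rides, mapping = aes(x = miles)) +
  geom_histogram(bins = 4)

# One categorical and one numerical variable -> side-by-side boxplot
p_box <- ggplot(data = rides, mapping = aes(x = plan, y = miles)) +
  geom_boxplot()

# Two numerical variables -> scatterplot
p_point <- ggplot(data = rides, mapping = aes(x = minutes, y = miles)) +
  geom_point()

p_hist
p_box
p_point
```

* The **data** argument stays the same across all three plots; only the **aes** mapping and the **geom** change.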