class: center, middle

# Data Summarization

<img src="img/DAW.png" width="500px"/>

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 3 | Fall 2020] </span>

---

## Announcements

* Invited to attend a talk I am giving about data-related team science research, entitled "Reed Forestry Data Science"
    + This Thursday, Sept 17th 4:45 - 5:30pm PT on Zoom
    + For the Zoom link, see the message in the #outside-stats channel of our Slack Workspace

--

* Don't forget that Lab 2 is due before your lab session this week!

--

* At the end of class today, will go through the "summarizingData.Rmd" handout. Have three options:
    + Listen and take notes as I go through the handout
    + Print out the PDF and take notes as I go through the handout (posted to Slack #in-class)
    + Run the code with me (grab the handout from `/home/courses/math141f20/Handouts`)

---

## Week 3 Topics

* **Data summarization**
* Data wrangling
* Data collection

---

# Goals for Today

* Measures for summarizing quantitative data
    + Center
    + Spread/variability

--

* Measures for summarizing categorical data

---

## Load the [Portland Biketown data](https://www.biketownpdx.com/system-data)

```r
# Load necessary package
library(tidyverse)

# Import the data
biketown <- read_csv("/home/courses/math141f20/Data/biketown_spring1920.csv")

# Remove suspect points
biketown <- filter(biketown, Distance_Miles < 1000)
```

---

## Summarizing Data

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> <th style="text-align:left;"> PaymentPlan </th> <th style="text-align:left;"> StartHub </th> <th style="text-align:right;"> StartLatitude </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 8.06 </td> <td style="text-align:left;"> Casual </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 46 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> <td 
style="text-align:left;"> Casual </td> <td style="text-align:left;"> SW 5th at Oak </td> <td style="text-align:right;"> 46 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> <td style="text-align:left;"> Subscriber </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 46 </td> </tr> <tr> <td style="text-align:right;"> 0.48 </td> <td style="text-align:left;"> Casual </td> <td style="text-align:left;"> NW Station at Irving </td> <td style="text-align:right;"> 46 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> <td style="text-align:left;"> Subscriber </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 46 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> <td style="text-align:left;"> Subscriber </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 46 </td> </tr> </tbody> </table>

--

* Hard to do by eyeballing a spreadsheet with many rows!

---

## Summarizing Data Visually

<img src="wk03_mon_files/figure-html/unnamed-chunk-3-1.png" width="360" />

For a quantitative variable, want to answer:

--

* What is an average value?

--

* What is the trend/shape of the variable?

--

* How much variation is there from case to case?

---

## Summarizing Quantitative Variables

For a quantitative variable, want to answer:

* What is an average value?
* What is the trend/shape of the variable?
* How much variation is there from case to case?

--

Need to learn some **summary statistics**: Numerical values computed based on the observed cases.
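---

## Summary Statistics in R (sketch)

A minimal sketch of what "numerical values computed based on the observed cases" can look like in R, assuming we only use the six distances shown in the table earlier. The vector name `distance_miles` is made up for illustration and is not a column of `biketown`; `mean()`, `median()`, and `summary()` are base R functions.

```r
# Six example trip distances (miles), copied from the table shown earlier.
# `distance_miles` is a hypothetical name used only for this sketch.
distance_miles <- c(8.06, 2.00, 1.59, 0.48, 2.71, 0.94)

# Center: what is an average value?
avg <- mean(distance_miles)
mid <- median(distance_miles)

# A quick overview: minimum, quartiles, median, mean, and maximum
overview <- summary(distance_miles)

avg
mid
overview
```

Each result is a single number (or a short vector of numbers) that stands in for the whole column, which is the point of a summary statistic.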
---

## Measures of Center

.pull-left[

**Mean: average of all the observations**

* `\(n\)` = Number of cases (sample size)
* `\(x_i\)` = value of the i-th observation
* Denote by `\(\bar{x}\)`

$$ \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i $$

]

.pull-right[

{{content}}

]

--

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 8.06 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> </tbody> </table>

{{content}}

--

```r
# Mean
(8.06 + 2.00 + 1.59 + 0.48 + 2.71 + 0.94)/6
```

```
## [1] 2.6
```

{{content}}

---

## Measures of Center

.pull-left[

#### Median: Middle value of the sorted observations (50th percentile)

* Denote by `\(m\)`
* If `\(n\)` is even, then it is the average of the middle two values

]

.pull-right[

{{content}}

]

--

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 8.06 </td> </tr> </tbody> </table>

{{content}}

--

```r
# Median
(1.59 + 2.00)/2
```

```
## [1] 1.8
```

{{content}}

---

## Measures of Center

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td 
style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 8.06 </td> </tr> </tbody> </table>

]

.pull-right[

```r
# Mean
(0.48 + 0.94 + 1.59 + 2.00 + 2.71 + 8.06)/6
```

```
## [1] 2.6
```

```r
# Median
(1.59 + 2.00)/2
```

```
## [1] 1.8
```

]

* Suppose the 8.06 miles was replaced with 80.6 miles. How would these summary statistics change?

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

--

.pull-left[

Here is my proposal:

* Find how much each observation deviates from the mean (2.63).
* Compute the average of the deviations.

$$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}) $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 8.06 </td> </tr> </tbody> </table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -2.15 </td> </tr> <tr> <td style="text-align:right;"> -1.69 </td> </tr> <tr> <td style="text-align:right;"> -1.04 </td> </tr> <tr> <td style="text-align:right;"> -0.63 </td> </tr> <tr> <td style="text-align:right;"> 0.08 </td> </tr> <tr> <td style="text-align:right;"> 5.43 </td> </tr> </tbody> 
</table>

]

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

Here is my proposal:

* Find how much each observation deviates from the mean (2.63).
* Compute the average of the deviations.

$$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}) $$

**Problem?**

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 8.06 </td> </tr> </tbody> </table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -2.15 </td> </tr> <tr> <td style="text-align:right;"> -1.69 </td> </tr> <tr> <td style="text-align:right;"> -1.04 </td> </tr> <tr> <td style="text-align:right;"> -0.63 </td> </tr> <tr> <td style="text-align:right;"> 0.08 </td> </tr> <tr> <td style="text-align:right;"> 5.43 </td> </tr> </tbody> </table>

]

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

Here is my <span style="color: orange;">NEW</span> proposal:

* Find how much each observation deviates from the mean (2.63).
* Compute the average of the <span style="color: orange;">squared</span> deviations.
$$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2 $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 8.06 </td> </tr> </tbody> </table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -2.15 </td> <td style="text-align:right;"> 4.62 </td> </tr> <tr> <td style="text-align:right;"> -1.69 </td> <td style="text-align:right;"> 2.86 </td> </tr> <tr> <td style="text-align:right;"> -1.04 </td> <td style="text-align:right;"> 1.08 </td> </tr> <tr> <td style="text-align:right;"> -0.63 </td> <td style="text-align:right;"> 0.40 </td> </tr> <tr> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.01 </td> </tr> <tr> <td style="text-align:right;"> 5.43 </td> <td style="text-align:right;"> 29.48 </td> </tr> </tbody> </table>

]

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

Here is my <span style="color: orange;">NEW</span> proposal:

* Find how much each observation deviates from the mean (2.63).
* Compute the average of the <span style="color: orange;">squared</span> deviations.
$$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2 $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 8.06 </td> </tr> </tbody> </table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -2.15 </td> <td style="text-align:right;"> 4.62 </td> </tr> <tr> <td style="text-align:right;"> -1.69 </td> <td style="text-align:right;"> 2.86 </td> </tr> <tr> <td style="text-align:right;"> -1.04 </td> <td style="text-align:right;"> 1.08 </td> </tr> <tr> <td style="text-align:right;"> -0.63 </td> <td style="text-align:right;"> 0.40 </td> </tr> <tr> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.01 </td> </tr> <tr> <td style="text-align:right;"> 5.43 </td> <td style="text-align:right;"> 29.48 </td> </tr> </tbody> </table>

]

]

```r
# Calculate the measure of variability
(4.62 + 2.86 + 1.08 + 0.40 + 0.01 + 29.48)/6
```

```
## [1] 6.4
```

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

Here is the <span style="color: orange;">ACTUAL measure</span>:

* Find how much each observation deviates from the mean (2.63).
* Compute (nearly) the average of the <span style="color: orange;">squared</span> deviations: divide by `\(n - 1\)` instead of `\(n\)`.
* <span style="color: orange;">Called the sample variance, </span> `\(s^2\)`.

$$ s^2 = \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2 $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 8.06 </td> </tr> </tbody> </table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -2.15 </td> <td style="text-align:right;"> 4.62 </td> </tr> <tr> <td style="text-align:right;"> -1.69 </td> <td style="text-align:right;"> 2.86 </td> </tr> <tr> <td style="text-align:right;"> -1.04 </td> <td style="text-align:right;"> 1.08 </td> </tr> <tr> <td style="text-align:right;"> -0.63 </td> <td style="text-align:right;"> 0.40 </td> </tr> <tr> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.01 </td> </tr> <tr> <td style="text-align:right;"> 5.43 </td> <td style="text-align:right;"> 29.48 </td> </tr> </tbody> </table>

]

]

```r
# Calculate the measure of variability
(4.62 + 2.86 + 1.08 + 0.40 + 0.01 + 29.48)/5
```

```
## [1] 7.7
```

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

* Find how much each observation deviates from the mean (2.63).
* Compute (nearly) the average of the <span style="color: orange;">squared</span> deviations: divide by `\(n - 1\)` instead of `\(n\)`.
* Called the sample variance, `\(s^2\)`.
* The square root of the sample variance is called <span style="color: orange;">the sample standard deviation, </span> `\(s\)`.

$$ s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2} $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Distance_Miles </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> </tr> <tr> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:right;"> 2.71 </td> </tr> <tr> <td style="text-align:right;"> 8.06 </td> </tr> </tbody> </table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -2.15 </td> <td style="text-align:right;"> 4.62 </td> </tr> <tr> <td style="text-align:right;"> -1.69 </td> <td style="text-align:right;"> 2.86 </td> </tr> <tr> <td style="text-align:right;"> -1.04 </td> <td style="text-align:right;"> 1.08 </td> </tr> <tr> <td style="text-align:right;"> -0.63 </td> <td style="text-align:right;"> 0.40 </td> </tr> <tr> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.01 </td> </tr> <tr> <td style="text-align:right;"> 5.43 </td> <td style="text-align:right;"> 29.48 </td> </tr> </tbody> </table>

]

]

```r
# Calculate the measure of variability
sqrt((4.62 + 2.86 + 1.08 + 0.40 + 0.01 + 29.48)/5)
```

```
## [1] 2.8
```

---

## Measures of Variability

* In addition to the sample standard deviation and the sample variance, there is the Interquartile Range (IQR):

$$ \mbox{IQR} = \mbox{Q}_3 - \mbox{Q}_1 $$

* `\(\mbox{Q}_1\)` and `\(\mbox{Q}_3\)` are the first and third quartiles (the 25th and 75th percentiles)
* Which is more robust to outliers, 
the IQR or `\(s\)`?
* Which is more commonly used, the IQR or `\(s\)`?

---

class: center, middle, inverse

# Now let's go through the Data Summarization handout!
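
---

## Checking the Variability Measures in R (sketch)

The block below is a rough sketch, not part of the handout, that recomputes the variability measures from the earlier slides on the six example distances. It assumes base R only; the names `distance_miles`, `s2_by_hand`, and `with_outlier` are made up for illustration. Note that `IQR()` relies on R's default quantile method, so other software may report slightly different quartiles for such a small sample.

```r
# Six example trip distances (miles) from the tables on the earlier slides.
# All object names here are hypothetical, chosen only for this sketch.
distance_miles <- c(0.48, 0.94, 1.59, 2.00, 2.71, 8.06)
n <- length(distance_miles)

# Sample variance by hand: sum of squared deviations divided by n - 1
deviations <- distance_miles - mean(distance_miles)
s2_by_hand <- sum(deviations^2) / (n - 1)

# Built-in equivalents
s2 <- var(distance_miles)    # sample variance (n - 1 denominator)
s <- sd(distance_miles)      # sample standard deviation, sqrt of the variance
iqr <- IQR(distance_miles)   # Q3 - Q1 under R's default quantile method

# Replace the largest ride (8.06 miles) with 80.6 miles and recompute
with_outlier <- distance_miles
with_outlier[which.max(with_outlier)] <- 80.6

c(mean = mean(with_outlier), median = median(with_outlier),
  s = sd(with_outlier), IQR = IQR(with_outlier))
```

Comparing the last line with the same statistics on `distance_miles` shows that, for this sample, the median and IQR stay the same while the mean and `\(s\)` grow.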