class: center, middle

### Inference Examples and Probability Calculations

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 11 | Fall 2020]
</span>

---

## Announcements/Reminders

Spring Schedule is updated.

→ Time to talk about that next stats class.

--

If you want to build more flexible models:

--

→ Take [Math 243: Statistical Learning](https://reed-stat-learning-fall-2020.github.io/syllabus.html) in the fall.

--

If you want to take your data wrangling, data viz, and R skills to the next level:

--

→ Take [Math 241: Data Science](https://www.reed.edu/math/241/post/).

This spring's offerings of Math 241:

* T/TH 8:50 - 10:10 am (Online)
* T/TH 10:25 - 11:45 am (Online)

--

If you want to prove some of the results we have relied on in this class:

--

→ Take **Math 391: Probability Theory** in the fall. It has additional pre-reqs.

---

## Week 11 Topics

* Practice with Statistical Inference via Probability Models
* Chi-Squared Test

*********************************************

### Goals for Today

* Probability model calculations in R
* Motivate sample size calculations
* Paired data
* See more inference examples

---

### Probability Calculations in R

P% CI for parameter:

$$ \mbox{statistic} \pm z^* SE $$

**Question**: How do I find the correct critical values `\((z^* \mbox{ or } t^*)\)` for the confidence interval?

<img src="wk11_wed_files/figure-html/unnamed-chunk-1-1.png" width="360" style="display: block; margin: auto;" />

---

P% CI for parameter:

$$ \mbox{statistic} \pm z^* SE $$

**Question**: How do I find the correct critical values `\((z^* \mbox{ or } t^*)\)` for the confidence interval?

<img src="wk11_wed_files/figure-html/unnamed-chunk-2-1.png" width="360" style="display: block; margin: auto;" />

--

```r
qnorm(p = 0.975, mean = 0, sd = 1)
```

```
## [1] 1.96
```

```r
qt(p = 0.975, df = 52)
```

```
## [1] 2.007
```

---

P% CI for parameter:

$$ \mbox{statistic} \pm z^* SE $$

**Question**: What percentile/quantile do I need for a 90% CI?

<img src="wk11_wed_files/figure-html/unnamed-chunk-4-1.png" width="360" style="display: block; margin: auto;" />

--

```r
qnorm(p = 0.95, mean = 0, sd = 1)
```

```
## [1] 1.645
```

```r
qt(p = 0.95, df = 52)
```

```
## [1] 1.675
```

---

### Probability Calculations in R

**Question**: How do I compute probabilities in R?

<img src="wk11_wed_files/figure-html/unnamed-chunk-6-1.png" width="360" style="display: block; margin: auto;" />

--

```r
pnorm(q = 1, mean = 0, sd = 1)
```

```
## [1] 0.8413
```

```r
pt(q = 1, df = 52)
```

```
## [1] 0.839
```

**Doesn't seem quite right**...

---

### Probability Calculations in R

**Question**: How do I compute probabilities in R?

<img src="wk11_wed_files/figure-html/unnamed-chunk-8-1.png" width="360" style="display: block; margin: auto;" />

--

```r
pnorm(q = 1, mean = 0, sd = 1, lower.tail = FALSE)
```

```
## [1] 0.1587
```

```r
pt(q = 1, df = 52, lower.tail = FALSE)
```

```
## [1] 0.161
```

---

### Probability Calculations in R

**To help you remember**:

Want a **P**robability?

--

→ use `pnorm()`, `pt()`, ...

--

Want a **Q**uantile (i.e. percentile)?

--

→ use `qnorm()`, `qt()`, ...

---

### Probability Calculations in R

**Question**: When might I want to do probability calculations in R?

--

→ Computed a test statistic that is approximated by a named random variable. Want to compute the p-value.

--

→ Compute a confidence interval.

--

→ To do a **Sample Size Calculation**.

---

### Sample Size Calculations

* Very important part of the data analysis process!

--

* Happens BEFORE you collect data.

--

* You determine how large your sample size needs to be for a desired precision in your CI.
    + Will do sample size calculations in lab this week!
    + (There is also a hypothesis test version that we won't be covering in Math 141.)

---

### Sample Size Calculations

**Question**: Why do we need sample size calculations?

--

**Example**: Let's return to the example of swimming with dolphins to treat depression.

--

With a sample size of 30 and 95% confidence, we estimate that the improvement rate for depression is between 14.5 percentage points and 75 percentage points higher if you swim with a dolphin instead of doing yoga.

--

With a width of 60.5 percentage points, this 95% CI is a **wide**/very imprecise interval.

--

**Question**: How could we make it narrower? How could we decrease the Margin of Error (ME)?

--

→ Decrease the confidence level!

--

→ Increase the sample size!

---

### Paired Data: Mean Difference

**Example**: Is the mean number of free throw attempts awarded to the Miami Heat during games different from the mean number attempted by their opponents?

```r
library(tidyverse)
library(Lock5Data)

# Data
data("MiamiHeat")
select(MiamiHeat, Game, Location, Opp, FTA, OppFTA) %>%
  slice(1:6)
```

```
##   Game Location Opp FTA OppFTA
## 1    1     Away BOS  25     25
## 2    2     Away PHI  31     11
## 3    3     Home ORL  27     34
## 4    4     Away NJN  34     23
## 5    5     Home MIN  31     38
## 6    6     Away NOH  24     17
```

* Variables of interest:

<br>

* Parameter of interest:

---

### Paired Data: Mean Difference

```r
select(MiamiHeat, Game, Location, Opp, FTA, OppFTA) %>%
  slice(1:6)
```

```
##   Game Location Opp FTA OppFTA
## 1    1     Away BOS  25     25
## 2    2     Away PHI  31     11
## 3    3     Home ORL  27     34
## 4    4     Away NJN  34     23
## 5    5     Home MIN  31     38
## 6    6     Away NOH  24     17
```

What are the cases?

--

How could we control for case-to-case variability?

---

### Paired Data: Mean Difference

* Paired data: repeated observations on the same case

* Paired data should not be treated as independent observations
    + Why not?

* By accounting for the case-to-case variability, any differences we see are more directly related to the explanatory variable.

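To see why the pairing matters, here is a minimal sketch on simulated data. The numbers and the names `case_effect`, `before`, and `after` are made up for illustration and are not taken from `MiamiHeat`; the sketch simply assumes each case is measured twice and that cases differ a lot from one another.

```r
# Hypothetical simulated data: 40 cases, each measured twice
set.seed(141)
n <- 40
case_effect <- rnorm(n, mean = 25, sd = 6)   # large case-to-case variability
before <- case_effect + rnorm(n, mean = 0, sd = 2)
after  <- case_effect + 1.5 + rnorm(n, mean = 0, sd = 2)

# Paired analysis: same as a one-sample t-test on the differences
paired_test <- t.test(x = after, y = before, paired = TRUE)
diff_test   <- t.test(x = after - before)

# Unpaired analysis (Welch): treats the two columns as independent samples
unpaired_test <- t.test(x = after, y = before)

# Standard errors used by each approach
se_paired   <- sd(after - before) / sqrt(n)
se_unpaired <- sqrt(var(after) / n + var(before) / n)

c(se_paired = se_paired, se_unpaired = se_unpaired)
```

When the two measurements on a case move together, we would expect `se_paired` to be noticeably smaller than `se_unpaired`, because taking differences removes the case-to-case variability.
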
---

### Paired Data: Mean Difference

```r
# Calculate Difference
MiamiHeat <- MiamiHeat %>%
  mutate(diff_FTA = FTA - OppFTA)

# Visualize
ggplot(data = MiamiHeat, mapping = aes(x = diff_FTA)) +
  geom_histogram()
```

<img src="wk11_wed_files/figure-html/unnamed-chunk-12-1.png" width="360" style="display: block; margin: auto;" />

---

```r
# One-sample t-test
t.test(x = MiamiHeat$diff_FTA)
```

```
## 
##  One Sample t-test
## 
## data:  MiamiHeat$diff_FTA
## t = 3.5, df = 81, p-value = 0.0008
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.618 5.895
## sample estimates:
## mean of x 
##     3.756
```

```r
# Ignoring the pairing
t.test(x = MiamiHeat$FTA, y = MiamiHeat$OppFTA)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  MiamiHeat$FTA and MiamiHeat$OppFTA
## t = 3.3, df = 159, p-value = 0.001
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.483 6.029
## sample estimates:
## mean of x mean of y 
##     27.90     24.15
```

---

### Assumptions

All of these methods, which rely on the CLT, assume:

* The sample size is large.

--

* The sample is a random sample.
    + Observations are independent of each other.

---

class: inverse, center, middle

### Let's finish going through more examples!