class: center, middle

### Inference for Categorical Variables

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 11 | Fall 2020]
</span>

---

## Announcements/Reminders

* Extra Credit Assignment: Write a stats poem.
    + Due December 2nd
* Lab Next Week:
    + If you have a Friday afternoon lab, you can attend a Thursday or Friday morning session.
    + You can see the times at https://solar.reed.edu/class_schedule/
    + You MUST inform both lab instructors of which lab you are in and which one you will be attending.

---

## Week 11 Topics

* Practice with Statistical Inference via Probability Models
* Chi-Squared Test

*********************************************

### Goals for Today

* The Chi-Squared Test

---

### Inference for Categorical Variables

Consider the situation where:

* Response variable: categorical
* Explanatory variable: categorical

--

* Parameter of interest: `\(p_1 - p_2\)`

--

This parameter of interest only makes sense if both variables are restricted to two categories.

--

It is time to learn how to study the relationship between two categorical variables when at least one has more than two categories.

---

### Hypotheses

Consider the situation where:

* Response variable: categorical
* Explanatory variable: categorical

--

`\(H_o\)`: The two variables are independent.

`\(H_a\)`: The two variables are dependent.

---

### Let's Return to the Eyesight Example

Near-sightedness typically develops during the childhood years. Quinn, Shin, Maguire, and Stone (1999) examined the type of light children were exposed to and their eye health based on questionnaires filled out by the children's parents at a university pediatric ophthalmology clinic.
```r library(tidyverse) library(infer) # Import data eye_data <- read_csv("/home/courses/math141f19/Data/eye_lighting.csv") ``` --- ### Eyesight Example ```r eye_data %>% count(Lighting, Eye) ``` ``` ## # A tibble: 9 x 3 ## Lighting Eye n ## <chr> <chr> <int> ## 1 dark Far 40 ## 2 dark Near 18 ## 3 dark Normal 114 ## 4 night Far 39 ## 5 night Near 78 ## 6 night Normal 115 ## 7 room Far 12 ## 8 room Near 41 ## 9 room Normal 22 ``` * **Cases**: * **Variables of interest**: * **Hypotheses**: --- ### Eyesight Example Does there appear to be a relationship/dependence? ```r ggplot(data = eye_data, mapping = aes(x = Lighting, fill = Eye)) + geom_bar(position = "fill") ``` <img src="wk11_fri_files/figure-html/unnamed-chunk-3-1.png" width="360" style="display: block; margin: auto;" /> --- ### Eyesight Example Need a test statistic! -- * Won't be a single sample statistic. -- * Needs to measure the discrepancy between the observed sample and the sample we'd expect to see if `\(H_o\)` were true -- * Would be nice if its null distribution could be approximated by a known probability model --- #### Table of Observed Results ```r table(eye_data$Eye, eye_data$Lighting) %>% addmargins() %>% kable(format = "html") ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 78 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 114 </td> <td 
style="text-align:right;"> 115 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 251 </td> </tr>
<tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> 479 </td> </tr>
</tbody> </table>

--

**Question**: If `\(H_o\)` were correct, what table would we expect to see?

--

<table> <thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr>
</thead> <tbody>
<tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 159 </td> </tr>
<tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 159 </td> </tr>
<tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 159 </td> </tr>
<tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 159 </td> <td style="text-align:right;"> 159 </td> <td style="text-align:right;"> 159 </td> <td style="text-align:right;"> 477 </td> </tr>
</tbody> </table>

---

#### Table of Observed Results

<table> <thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr>
</thead> <tbody>
<tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;">
12 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 78 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> **Question**: If `\(H_o\)` were correct, what table would we expect to see? Want a `\(H_o\)` table that respects the marginal proportions: `$$\hat{p}_{far} = 91/479$$` `$$\hat{p}_{nor} = 251/479$$` `$$\hat{p}_{nea} = 137/479$$` --- #### Table of Observed Results <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 78 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 232 </td> <td 
style="text-align:right;"> 75 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> **Question**: If `\(H_o\)` were correct, what table would we expect to see? <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> (91/479)172 </td> <td style="text-align:right;"> (91/479)232 </td> <td style="text-align:right;"> (91/479)75 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> (137/479)172 </td> <td style="text-align:right;"> (137/479)232 </td> <td style="text-align:right;"> (137/479)75 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> (251/479)172 </td> <td style="text-align:right;"> (251/479)232 </td> <td style="text-align:right;"> (251/479)75 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> -- * Still have the same totals but distributed the values differently within the table --- #### Table of Observed Results <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> 
Near </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 78 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> **Question**: If `\(H_o\)` were correct, what table would we expect to see? <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 32.68 </td> <td style="text-align:right;"> 44.08 </td> <td style="text-align:right;"> 14.25 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 49.19 </td> <td style="text-align:right;"> 66.35 </td> <td style="text-align:right;"> 21.45 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 90.13 </td> <td style="text-align:right;"> 121.57 </td> <td style="text-align:right;"> 39.30 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172.00 </td> <td style="text-align:right;"> 232.00 </td> <td style="text-align:right;"> 75.00 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> --- ### Expected Table * How does this table represent `\(H_o\)`? 
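
The expected counts can be reproduced from the margins alone, using expected = (row total × column total) / grand total. The block below is a minimal sketch: the matrix `observed` is re-typed from the table of observed results rather than read from `eye_data`, so it runs on its own, and the names `observed` and `expected` are illustrative.

```r
# Observed counts, re-typed from the table of observed results
observed <- matrix(
  c( 40,  39, 12,
     18,  78, 41,
    114, 115, 22),
  nrow = 3, byrow = TRUE,
  dimnames = list(Eye = c("Far", "Near", "Normal"),
                  Lighting = c("dark", "night", "room"))
)

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# The margins are unchanged: same row and column totals as the observed table
rowSums(expected)
colSums(expected)
```

Each row of `expected` has the same split across lighting conditions (172 : 232 : 75), which is what independence means for this table.
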
<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 32.68 </td> <td style="text-align:right;"> 44.08 </td> <td style="text-align:right;"> 14.25 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 49.19 </td> <td style="text-align:right;"> 66.35 </td> <td style="text-align:right;"> 21.45 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 90.13 </td> <td style="text-align:right;"> 121.57 </td> <td style="text-align:right;"> 39.30 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172.00 </td> <td style="text-align:right;"> 232.00 </td> <td style="text-align:right;"> 75.00 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> -- <img src="wk11_fri_files/figure-html/unnamed-chunk-11-1.png" width="360" style="display: block; margin: auto;" /> --- ### Test Statistic Want the test statistic to quantify the difference between the observed table and the expected table. 
<table style="display: inline-block;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 78 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> <table style="display: inline-block;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 32.68 </td> <td style="text-align:right;"> 44.08 </td> <td style="text-align:right;"> 14.25 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 49.19 </td> <td style="text-align:right;"> 66.35 </td> <td style="text-align:right;"> 21.45 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 90.13 </td> <td style="text-align:right;"> 
121.57 </td> <td style="text-align:right;"> 39.30 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172.00 </td> <td style="text-align:right;"> 232.00 </td> <td style="text-align:right;"> 75.00 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> -- For each cell: Compute a Z-score! -- `\begin{align*} \mbox{Z-score} &= \frac{\mbox{stat - mean}}{\mbox{SE}} \\ & = \frac{\mbox{observed - expected}}{\sqrt{\mbox{expected}}} \end{align*}` --- ### Test Statistic Want the test statistic to quantify the difference between the observed table and the expected table. <table style="display: inline-block;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 78 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> <table style="display: inline-block;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th 
style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 32.68 </td> <td style="text-align:right;"> 44.08 </td> <td style="text-align:right;"> 14.25 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 49.19 </td> <td style="text-align:right;"> 66.35 </td> <td style="text-align:right;"> 21.45 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 90.13 </td> <td style="text-align:right;"> 121.57 </td> <td style="text-align:right;"> 39.30 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172.00 </td> <td style="text-align:right;"> 232.00 </td> <td style="text-align:right;"> 75.00 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> **Test Statistic:** `\begin{align*} \chi^2 = \sum \left(\frac{\mbox{observed - expected}}{\sqrt{\mbox{expected}}} \right)^2 \end{align*}` -- → Large test statistics signify that results are unusual if `\(H_o\)` is true. --- ### Test Statistic Want the test statistic to quantify the difference between the observed table and the expected table. 
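
The sum of squared Z-scores can also be computed by hand, which shows which cells drive the statistic. This is a sketch with the counts re-typed from the observed table; `contributions` and `chi_sq` are illustrative names.

```r
# Observed counts, re-typed from the table of observed results
observed <- matrix(
  c( 40,  39, 12,
     18,  78, 41,
    114, 115, 22),
  nrow = 3, byrow = TRUE,
  dimnames = list(Eye = c("Far", "Near", "Normal"),
                  Lighting = c("dark", "night", "room"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Squared Z-score for each cell: (observed - expected)^2 / expected
contributions <- (observed - expected)^2 / expected
round(contributions, 2)

# The chi-squared statistic is the sum over all nine cells
chi_sq <- sum(contributions)
chi_sq
```

The Near/dark and Near/room cells contribute the most: there are far fewer near-sighted children than expected in the dark group and far more in the room-light group.
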
<table style="display: inline-block;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 78 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> <table style="display: inline-block;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> dark </th> <th style="text-align:right;"> night </th> <th style="text-align:right;"> room </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Far </td> <td style="text-align:right;"> 32.68 </td> <td style="text-align:right;"> 44.08 </td> <td style="text-align:right;"> 14.25 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> Near </td> <td style="text-align:right;"> 49.19 </td> <td style="text-align:right;"> 66.35 </td> <td style="text-align:right;"> 21.45 </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> Normal </td> <td style="text-align:right;"> 90.13 </td> <td style="text-align:right;"> 
121.57 </td> <td style="text-align:right;"> 39.30 </td> <td style="text-align:right;"> 251 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 172.00 </td> <td style="text-align:right;"> 232.00 </td> <td style="text-align:right;"> 75.00 </td> <td style="text-align:right;"> 479 </td> </tr> </tbody> </table> ```r library(infer) #Compute Chi-square test stat test_stat <- eye_data %>% specify(Eye ~ Lighting) %>% calculate(stat = "Chisq") test_stat ``` ``` ## # A tibble: 1 x 1 ## stat ## <dbl> ## 1 56.5 ``` -- Is 56.5 large? Is 56.5 unusual under `\(H_o\)`? --- ### Generating the Null Distribution ``` ## # A tibble: 10 x 2 ## Eye Lighting ## <chr> <chr> ## 1 Normal dark ## 2 Normal dark ## 3 Normal dark ## 4 Near room ## 5 Near room ## 6 Near night ## 7 Normal night ## 8 Far night ## 9 Normal night ## 10 Normal dark ``` -- **Steps**: 1. Shuffle lighting. 2. Compute the new observed table. 3. Compute the test statistic. 4. Repeat 1 - 3 many times. --- ### Generating the Null Distribution ```r # Construct null distribution null_dist <- eye_data %>% specify(Eye ~ Lighting) %>% hypothesize(null = "independence") %>% generate(reps = 1000, type = "permute") %>% calculate(stat = "Chisq") visualize(null_dist) ``` <img src="wk11_fri_files/figure-html/unnamed-chunk-14-1.png" width="360" style="display: block; margin: auto;" /> --- ### The Null Distribution <img src="wk11_fri_files/figure-html/unnamed-chunk-15-1.png" width="360" style="display: block; margin: auto;" /> **Key Observations**: * Smallest possible value? <br> * Shape? --- ### The Null Distribution <img src="wk11_fri_files/figure-html/unnamed-chunk-16-1.png" width="360" style="display: block; margin: auto;" /> **Key Observations**: * Smallest possible value? <br> * Shape? <br> * Is our observed test statistic of 56.5 unusual? 
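
---

### Generating the Null Distribution by Hand

The four steps (shuffle, re-tabulate, compute the statistic, repeat) can be sketched in base R without `infer`. This is an illustration under stated assumptions, not the code `infer` runs internally: the data frame `eye_long` is rebuilt from the counts shown earlier instead of being read from the CSV, and `chisq_stat()` is a hypothetical helper.

```r
set.seed(141)  # arbitrary seed so the shuffles are reproducible

# Rebuild one row per child from the nine observed counts
counts <- data.frame(
  Lighting = rep(c("dark", "night", "room"), each = 3),
  Eye      = rep(c("Far", "Near", "Normal"), times = 3),
  n        = c(40, 18, 114, 39, 78, 115, 12, 41, 22)
)
eye_long <- counts[rep(seq_len(nrow(counts)), counts$n), c("Lighting", "Eye")]

# Hypothetical helper: chi-squared statistic for two categorical vectors
chisq_stat <- function(eye, lighting) {
  observed <- table(eye, lighting)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

# Statistic for the data as collected
obs_stat <- chisq_stat(eye_long$Eye, eye_long$Lighting)

# Steps 1-4: shuffle lighting, re-tabulate, compute the statistic, repeat
null_stats <- replicate(
  1000,
  chisq_stat(eye_long$Eye, sample(eye_long$Lighting))
)

# Proportion of shuffled statistics at least as large as the observed one
p_value <- mean(null_stats >= obs_stat)
p_value
```

Shuffling `Lighting` breaks any link with `Eye` while keeping both sets of marginal totals fixed, so every shuffled table is a draw from a world where `\(H_o\)` is true.
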
---

### The P-value

```r
# Compute p-value
null_dist %>%
  get_pvalue(obs_stat = test_stat, direction = "greater")
```

```
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0
```

A reported p-value of 0 means that none of the 1000 shuffled test statistics was at least as large as 56.5; the true p-value is very small, not exactly zero.

---

### Approximating the Null Distribution

```r
visualize(null_dist, method = "both")
```

<img src="wk11_fri_files/figure-html/unnamed-chunk-18-1.png" width="360" style="display: block; margin: auto;" />

If the expected count in each cell is at least 5, then

$$ \mbox{test statistic} \sim \chi^2(df = (\mbox{# of rows} - 1)(\mbox{# of columns} - 1)) $$

For the eyesight table, `\(df = (3 - 1)(3 - 1) = 4\)`.

--

The `\(df\)` controls the center and spread of the distribution.

---

### The Chi-Squared Test

```r
chisq.test(table(eye_data$Eye, eye_data$Lighting))
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  table(eye_data$Eye, eye_data$Lighting)
## X-squared = 57, df = 4, p-value = 2e-11
```

--

Conclusions?

--

* Causation?

--

* Decisions, decisions...
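
---

### Checking the Approximation by Hand

A short sketch that double-checks the pieces of the `chisq.test()` output: the expected counts behind the at-least-5 rule of thumb, the degrees of freedom, and the upper-tail area from `pchisq()`. The counts are re-typed from the observed table so the block is self-contained; `result`, `df`, and `p_manual` are illustrative names.

```r
# Observed counts, re-typed from the table of observed results
observed <- matrix(
  c( 40,  39, 12,
     18,  78, 41,
    114, 115, 22),
  nrow = 3, byrow = TRUE,
  dimnames = list(Eye = c("Far", "Near", "Normal"),
                  Lighting = c("dark", "night", "room"))
)

result <- chisq.test(observed)

# Rule of thumb: every expected count should be at least 5
all(result$expected >= 5)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
df

# p-value: area to the right of the observed statistic under chi-squared(df)
p_manual <- pchisq(unname(result$statistic), df = df, lower.tail = FALSE)
p_manual
```

The unrounded statistic is about 56.5, so the `X-squared = 57` in the printed output appears to be the same value shown with fewer digits.
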