class: center, middle ### Logistic Regression <img src="img/DAW.png" width="450px"/> <span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 13 | Fall 2020] </span> --- ## Announcements/Reminders * Due Dates: + Wed, Dec 9th: Final Project Assignment due. Submit to [this Ensemble dropbox folder](https://ensemble.reed.edu/Dropbox/Math141F20StudentSubmissions). + Thur/Fri, Dec 10th/11th: Final Exam Takehome and Oral. + Sign up for an oral slot [here](https://docs.google.com/spreadsheets/d/1UvAKO7UzxWm0m5mKv-VzMLawDeJZdzr1q73VCAMqH8Y/edit?usp=sharing). + Takehome due on Gradescope. --- ## Week 13 + Monday Topics * Inference for Linear Regression * Logistic Regression * Wrap-up ********************************************* ### Goals for Today Logistic Regression Model * How do we fit the model in R? * How do we interpret the coefficients? * How do we determine the quality of the model? --- ### Logistic Regression **Response variable**: A categorical variable with two categories -- `\begin{align*} Y = \left\{ \begin{array}{ll} 1 & \mbox{success} \\ 0 & \mbox{failure} \\ \end{array} \right. \end{align*}` -- `\(Y \sim\)` Bernoulli `\((p)\)` where `\(p = P(Y = 1) = P(\mbox{success})\)`. -- **Explanatory variables**: Can be a mix of categorical and quantitative -- Goal: Build a model for `\(P(Y = 1)\)`. #### Logistic Regression Model `\begin{align*} \log\left(\frac{P(Y = 1)}{1 - P(Y = 1)} \right) &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \end{align*}` and so -- `\begin{align*} P(Y = 1) = \frac{\exp(\beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}{1 + \exp(\beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )} \end{align*}` --- ### Example: Palmer Penguins. **Goal**: Build a model that predicts whether a penguin is a Chinstrap or a Adelie. <img src="img/penguins.png" width="70%" style="display: block; margin: auto;" /> ```r library(palmerpenguins) penguins2 <- penguins %>% filter(species != "Gentoo") %>% mutate(species_num = if_else(species == "Chinstrap", 1, 0)) count(penguins2, species) ``` ``` ## # A tibble: 2 x 2 ## species n ## <fct> <int> ## 1 Adelie 152 ## 2 Chinstrap 68 ``` --- ### Model: Predict species with flipper length ```r ggplot(data = penguins2, mapping = aes(x = flipper_length_mm, y = species_num)) + geom_jitter(size = 3, alpha = .4, width = 0, height = .05) + labs(y = "P(Y = 1)") ``` <img src="wk13_fri_files/figure-html/unnamed-chunk-3-1.png" width="360" style="display: block; margin: auto;" /> --- ### Model: Predict species with flipper length ```r ggplot(data = penguins2, mapping = aes(x = flipper_length_mm, y = species_num)) + geom_jitter(size = 3, alpha = .4, width = 0, height = .05) + labs(y = "P(Y = 1)") + geom_smooth(method = 'glm', method.args = list(family = "binomial")) ``` <img src="wk13_fri_files/figure-html/unnamed-chunk-4-1.png" width="360" style="display: block; margin: auto;" /> **Question**: How do we fit the model (the blue curve) in R? --- ### Model: Predict species with flipper length ```r mod <- glm(species ~ flipper_length_mm, family = "binomial", data = penguins2) library(broom) tidy(mod) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -25.8 4.90 -5.26 0.000000141 ## 2 flipper_length_mm 0.129 0.0253 5.12 0.000000299 ``` -- But what do the coefficient estimates `\(\hat{\beta}_o = -25.8\)` and `\(\hat{\beta}_1 = 0.129\)` represent? --- ### Interpretation of Coefficients #### Define two concepts: (1) Odds: $$ \mbox{odds} = \frac{\mbox{prob of success}}{\mbox{prob of failure}} = \frac{P(Y = 1)}{1 - P(Y = 1)} $$ -- Recall: `\begin{align*} \log\left(\frac{P(Y = 1)}{1 - P(Y = 1)} \right) &= \beta_o + \beta_1 x_1 \end{align*}` -- So then $$ \mbox{odds} = \exp \left( \beta_o + \beta_1 x_1 \right) $$ --- ### Interpretation of Coefficients #### Define two concepts: (2) Odds ratio: Comparison of odds between two groups $$ \mbox{odds ratio} = \frac{\mbox{odds of group 1}}{\mbox{odds of group 2}} $$ -- **Interpretation of odds ratio:** The odds of success in group 1 are _insert #_ times the odds of success in group 2. -- **Note:** Also useful property: $$ \exp(a + b) = \exp(a) \exp(b) $$ --- ### Interpretation of Coefficients (estimated) odds for `\(x_1 = t\)` mm: `$$\exp \left( \hat{\beta}_o + \hat{\beta}_1 t \right)$$` -- (estimated) odds for `\(x_1 = t + 1\)` mm: `$$\exp \left( \hat{\beta}_o + \hat{\beta}_1 (t + 1) \right)$$` -- odds ratio: `\begin{align*} \frac{\exp \left( \hat{\beta}_o + \hat{\beta}_1 (t + 1) \right)}{\exp \left( \hat{\beta}_o + \hat{\beta}_1 t \right)} &= \frac{\exp \left( \hat{\beta}_o\right)\exp \left( \hat{\beta}_1 (t + 1) \right)}{\exp \left( \hat{\beta}_o\right)\exp \left( \hat{\beta}_1 t \right)} \\ & = \exp \left( \hat{\beta}_1 \right) \end{align*}` --- ### Interpretation of Coefficients How do we interpret `\(\hat{\beta}_1 = 0.129\)`? ```r tidy(mod) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -25.8 4.90 -5.26 0.000000141 ## 2 flipper_length_mm 0.129 0.0253 5.12 0.000000299 ``` -- $$ \exp \left( \hat{\beta}_1 \right) = \frac{\mbox{odds of being Chinstrap for flipper length t + 1 mm}}{\mbox{odds of being Chinstrap for flipper length t mm}} $$ -- ```r exp(0.129) ``` ``` ## [1] 1.138 ``` -- The odds that a penguin in Chinstrap instead of Adelie is 1.138 times the odds of a penguin whose flipper is 1 mm shorter. --- ### Prediction How can I use this model to now predict, for a given flipper_length, whether a penguin is a Chinstrap or an Adelie? <img src="wk13_fri_files/figure-html/unnamed-chunk-8-1.png" width="360" style="display: block; margin: auto;" /> `\begin{align*} \widehat{P(Y = 1)} \geq 0.5 &\rightarrow \hat{y} = 1 \\ \widehat{P(Y = 1)} < 0.5 &\rightarrow \hat{y} = 0 \\ \end{align*}` --- ### Prediction Accuracy ```r penguins2 <- penguins2 %>% add_predictions(mod, type = "response") penguins2 %>% select(species, pred) ``` ``` ## # A tibble: 220 x 2 ## species pred ## <fct> <dbl> ## 1 Adelie 0.0886 ## 2 Adelie 0.157 ## 3 Adelie 0.373 ## 4 Adelie NA ## 5 Adelie 0.315 ## 6 Adelie 0.238 ## 7 Adelie 0.0886 ## 8 Adelie 0.373 ## 9 Adelie 0.315 ## 10 Adelie 0.238 ## # … with 210 more rows ``` --- ### Prediction Accuracy ```r penguins2 <- penguins2 %>% mutate(pred_species = case_when( pred >= 0.5 ~ "Chinstrap", pred < 0.5 ~ "Adelie" )) penguins2 %>% count(species, pred_species) ``` ``` ## # A tibble: 5 x 3 ## species pred_species n ## <fct> <chr> <int> ## 1 Adelie Adelie 138 ## 2 Adelie Chinstrap 13 ## 3 Adelie <NA> 1 ## 4 Chinstrap Adelie 46 ## 5 Chinstrap Chinstrap 22 ``` --- ### Multiple Logistic Regression * Another good explanatory variable: `bill_length_mm` <img src="wk13_fri_files/figure-html/unnamed-chunk-11-1.png" width="360" style="display: block; margin: auto;" /> ```r mod2 <- glm(species ~ flipper_length_mm + bill_length_mm, family = "binomial", data = penguins2) tidy(mod2) ``` ``` ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -21.8 11.2 -1.95 0.0515 ## 2 flipper_length_mm -0.171 0.0740 -2.32 0.0205 ## 3 bill_length_mm 1.25 0.228 5.47 0.0000000451 ``` --- ### Multiple Logistic Regression ```r penguins2 <- penguins2 %>% add_predictions(mod2, type = "response") %>% mutate(pred_species = case_when( pred >= 0.5 ~ "Chinstrap", pred < 0.5 ~ "Adelie" )) ``` -- ```r penguins2 %>% count(species, pred_species) ``` ``` ## # A tibble: 5 x 3 ## species pred_species n ## <fct> <chr> <int> ## 1 Adelie Adelie 148 ## 2 Adelie Chinstrap 3 ## 3 Adelie <NA> 1 ## 4 Chinstrap Adelie 6 ## 5 Chinstrap Chinstrap 62 ``` -- * Still problems with over-fitting! + Need to learn about splitting your data into testing and training sets. --- ### Notes for Monday: Wrap-Up! Two activities: * Crowd source summarization of our data analysis process * Practice problems -- (Logistic regression won't be on the final exam.)