class: center, middle

### Logistic Regression

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 13 | Fall 2020]
</span>

---

## Announcements/Reminders

* Due Dates:
    + Wed, Dec 2nd: Extra Credit on Gradescope
    + Thur/Fri, Dec 3rd/4th: Lab 11 (last lab due!) on Gradescope
        + You will receive a practice lab assignment in Lab this week.
    + Wed, Dec 9th: Final Project Assignment due. Submit to [this Ensemble dropbox folder](https://ensemble.reed.edu/Dropbox/Math141F20StudentSubmissions).
    + Thur/Fri, Dec 10th/11th: Final Exam Takehome and Oral.
        + Sign up for an oral slot [here](https://docs.google.com/spreadsheets/d/1UvAKO7UzxWm0m5mKv-VzMLawDeJZdzr1q73VCAMqH8Y/edit?usp=sharing).
        + Takehome due on Gradescope.

---

## Week 13 + Monday Topics

* Inference for Linear Regression
* Logistic Regression
* Wrap-up

*********************************************

### Goals for Today

* Motivate Logistic Regression
* Logistic Regression Model

---

### Comparing Models with the Adjusted `\(R^2\)`

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

`\begin{align*}
\mbox{adj} R^2 &= 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \left(\frac{n - 1}{n - p - 1} \right)
\end{align*}`

--

Let's look at one more example.

---

### Comparing Models with the Adjusted `\(R^2\)`

<img src="wk13_wed_files/figure-html/unnamed-chunk-1-1.png" width="720" style="display: block; margin: auto;" />

```r
mod1 <- lm(Margin ~ Days*factor(Charlie), data = Pollster08)
mod2 <- lm(Margin ~ Days*factor(Meltdown), data = Pollster08)
mod3 <- lm(Margin ~ poly(Days, degree = 2, raw = TRUE), data = Pollster08)
mod4 <- lm(Margin ~ poly(Days, degree = 15, raw = TRUE), data = Pollster08)
```

---

<style type="text/css">
.tiny .remark-code {
  font-size: 80% !important;
}
</style>

.tiny[

```r
library(broom)
glance(mod1)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.417         0.399  2.87      23.4 1.71e-11     3  -250.  510.  523.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod2)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.324         0.303  3.09      15.7  2.16e-8     3  -258.  525.  539.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod3)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.349         0.336  3.01      26.6 5.71e-10     2  -256.  519.  530.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod4)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.483         0.407  2.85      6.33  2.89e-8    13  -244.  518.  557.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

]
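---

### Comparing Models with the Adjusted `\(R^2\)`

To put the four adjusted `\(R^2\)` values side by side, one option is to stack the `glance()` summaries. Here is a minimal sketch using `purrr` and `dplyr`, assuming the `mod1`-`mod4` fits from the earlier slide:

```r
library(broom)
library(purrr)
library(dplyr)

# glance() each fit, keep the adjusted R^2, and rank the models
list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4) %>%
  map_dfr(glance, .id = "model") %>%
  select(model, adj.r.squared) %>%
  arrange(desc(adj.r.squared))
```

Based on the output on the previous slide, `mod4` comes out on top (adjusted `\(R^2 \approx\)` 0.407), just ahead of `mod1` (`\(\approx\)` 0.399).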
---

### Logistic Regression

**Response variable**: A categorical variable with two categories

--

`\begin{align*}
Y = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

--

`\(Y \sim\)` Bernoulli `\((p)\)` where `\(p = P(Y = 1) = P(\mbox{success})\)`.

--

**Explanatory variables**: Can be a mix of categorical and quantitative

--

**Goal**: Build a model for `\(P(Y = 1)\)`.

---

### Why not use linear regression?

<img src="wk13_wed_files/figure-html/unnamed-chunk-5-1.png" width="360" style="display: block; margin: auto;" />

* Regression line = estimated probability of success

--

* For some valid values of `\(x\)`, the line predicts a probability less than 0 or greater than 1.

---

### Why not use linear regression?

<img src="wk13_wed_files/figure-html/unnamed-chunk-6-1.png" width="360" style="display: block; margin: auto;" />

* Linear regression line = estimated probability of success

* For some valid values of `\(x\)`, the line predicts a probability less than 0 or greater than 1.

* The logistic regression curve is bounded between 0 and 1.

--

What does the model look like?

---

### Logistic Regression Model

`\begin{align*}
\log\left(\frac{P(Y = 1)}{1 - P(Y = 1)} \right) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p
\end{align*}`

--

The left-hand side has several equivalent interpretations:

--

`\begin{align*}
\mbox{LHS} &= \log\left(\frac{P(Y = 1)}{1 - P(Y = 1)} \right)\\
&= \log \left( \mbox{odds (of success)} \right)\\
&= \mbox{logit}(P(Y = 1))
\end{align*}`

Note:

$$
\mbox{odds} = \frac{\mbox{prob of success}}{\mbox{prob of failure}}
$$

--

The odds will be important when we interpret the coefficients!

---

### Probability of Success

But I don't want an equation for the log odds!

--

I want an equation for `\(P(Y = 1)\)`!

--

Take

`\begin{align*}
\log\left(\frac{P(Y = 1)}{1 - P(Y = 1)} \right) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,
\end{align*}`

exponentiate both sides to turn the log odds into odds, and then solve for `\(P(Y = 1)\)`:

--

`\begin{align*}
P(Y = 1) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}
\end{align*}`

---

### Probability of Success

Have:

`\begin{align*}
P(Y = 1) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}
\end{align*}`

Notes:

--

* Bounded between 0 and 1.

--

* Estimate the `\(\beta\)`'s with a method called "maximum likelihood estimation".

--

**Fitted Model**:

`\begin{align*}
\widehat{P(Y = 1)} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p )}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p )}
\end{align*}`

---

### Remaining Topics

* How do we fit the model in R? (See the sketch on the next slide.)
* How do we interpret the coefficients?
* How do we determine the quality of the model?
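---

### Preview: Fitting the Model in R

As a preview of the first remaining topic, here is a minimal sketch of the fitting step. The data frame `dat`, its 0/1 response `y`, and its predictors `x1` and `x2` are hypothetical placeholders, not data from this class:

```r
# dat, y, x1, and x2 are hypothetical placeholders
# family = binomial tells glm() to fit a logistic regression (logit link)
mod <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Coefficients are on the log odds scale
coef(mod)

# Estimated probabilities of success, P(Y = 1), for the observed data
head(predict(mod, type = "response"))
```

We will walk through fitting, interpreting, and assessing a real model next.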