class: center, middle

### Logistic Regression

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 13 | Fall 2020]
</span>

---

## Announcements/Reminders

* Due Dates:
    + Wed, Dec 2nd: Extra Credit on Gradescope
    + Thur/Fri, Dec 3rd/4th: Lab 11 (last lab due!) on Gradescope
        + You will receive a practice lab assignment in Lab this week.
    + Wed, Dec 9th: Final Project Assignment due. Submit to [this Ensemble dropbox folder](https://ensemble.reed.edu/Dropbox/Math141F20StudentSubmissions).
    + Thur/Fri, Dec 10th/11th: Final Exam Takehome and Oral.
        + Sign up for an oral slot [here](https://docs.google.com/spreadsheets/d/1UvAKO7UzxWm0m5mKv-VzMLawDeJZdzr1q73VCAMqH8Y/edit?usp=sharing).
        + Takehome due on Gradescope.

---

## Week 13 + Monday Topics

* Inference for Linear Regression
* Logistic Regression
* Wrap-up

*********************************************

### Goals for Today

* Motivate Logistic Regression
* Logistic Regression Model

---

### Comparing Models with the Adjusted `\(R^2\)`

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

`\begin{align*}
\mbox{adj} R^2 &= 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \left(\frac{n - 1}{n - p - 1} \right)
\end{align*}`

--

Let's look at one more example.

---

### Comparing Models with the Adjusted `\(R^2\)`

<img src="wk13_wed_files/figure-html/unnamed-chunk-1-1.png" width="720" style="display: block; margin: auto;" />

```r
mod1 <- lm(Margin ~ Days*factor(Charlie), data = Pollster08)
mod2 <- lm(Margin ~ Days*factor(Meltdown), data = Pollster08)
mod3 <- lm(Margin ~ poly(Days, degree = 2, raw = TRUE), data = Pollster08)
mod4 <- lm(Margin ~ poly(Days, degree = 15, raw = TRUE), data = Pollster08)
```

---

<style type="text/css">
.tiny .remark-code {
  font-size: 80% !important;
}
</style>

.tiny[

```r
library(broom)
glance(mod1)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.417         0.399  2.87      23.4 1.71e-11     3  -250.  510.  523.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod2)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.324         0.303  3.09      15.7  2.16e-8     3  -258.  525.  539.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod3)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.349         0.336  3.01      26.6 5.71e-10     2  -256.  519.  530.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod4)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.483         0.407  2.85      6.33  2.89e-8    13  -244.  518.  557.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

]
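---

### Comparing Models with the Adjusted `\(R^2\)`

To put the four adjusted `\(R^2\)` values side by side, one option is to stack the `glance()` summaries. Here is a minimal sketch using `purrr` and `dplyr`, assuming the `mod1`-`mod4` fits from the earlier slide:

```r
library(broom)
library(purrr)
library(dplyr)

# glance() each fit, keep the adjusted R^2, and rank the models
list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4) %>%
  map_dfr(glance, .id = "model") %>%
  select(model, adj.r.squared) %>%
  arrange(desc(adj.r.squared))
```

Based on the output on the previous slide, `mod4` comes out on top (adjusted `\(R^2 \approx\)` 0.407), just ahead of `mod1` (`\(\approx\)` 0.399).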
---

### Logistic Regression

**Response variable**: A categorical variable with two categories

--

`\begin{align*}
Y = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

--

`\(Y \sim\)` Bernoulli `\((p)\)` where `\(p = P(Y = 1) = P(\mbox{success})\)`.

--

**Explanatory variables**: Can be a mix of categorical and quantitative

--

**Goal**: Build a model for `\(P(Y = 1)\)`.

---

### Why not use linear regression?

<img src="wk13_wed_files/figure-html/unnamed-chunk-5-1.png" width="360" style="display: block; margin: auto;" />

* Regression line = estimated probability of success

--

* For some valid values of `\(x\)`, the line predicts a probability less than 0 or greater than 1.

---

### Why not use linear regression?

<img src="wk13_wed_files/figure-html/unnamed-chunk-6-1.png" width="360" style="display: block; margin: auto;" />

* Linear regression line = estimated probability of success

* For some valid values of `\(x\)`, the line predicts a probability less than 0 or greater than 1.

* The logistic regression curve is bounded between 0 and 1.

--

What does the model look like?

---

### Logistic Regression Model

`\begin{align*}
\log\left(\frac{P(Y = 1)}{1 - P(Y = 1)} \right) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p
\end{align*}`

--

The left-hand side has several equivalent interpretations:

--

`\begin{align*}
\mbox{LHS} &= \log\left(\frac{P(Y = 1)}{1 - P(Y = 1)} \right)\\
&= \log \left( \mbox{odds (of success)} \right)\\
&= \mbox{logit}(P(Y = 1))
\end{align*}`

Note:

$$
\mbox{odds} = \frac{\mbox{prob of success}}{\mbox{prob of failure}}
$$

--

The odds will be important when we interpret the coefficients!

---

### Probability of Success

But I don't want an equation for the log odds!

--

I want an equation for `\(P(Y = 1)\)`!

--

Take

`\begin{align*}
\log\left(\frac{P(Y = 1)}{1 - P(Y = 1)} \right) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,
\end{align*}`

exponentiate both sides to turn the log odds into odds, and then solve for `\(P(Y = 1)\)`:

--

`\begin{align*}
P(Y = 1) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}
\end{align*}`

---

### Probability of Success

Have:

`\begin{align*}
P(Y = 1) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p )}
\end{align*}`

Notes:

--

* Bounded between 0 and 1.

--

* Estimate the `\(\beta\)`'s with a method called "maximum likelihood estimation".

--

**Fitted Model**:

`\begin{align*}
\widehat{P(Y = 1)} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p )}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p )}
\end{align*}`

---

### Remaining Topics

* How do we fit the model in R? (See the sketch on the next slide.)
* How do we interpret the coefficients?
* How do we determine the quality of the model?
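---

### Preview: Fitting the Model in R

As a preview of the first remaining topic, here is a minimal sketch of the fitting step. The data frame `dat`, its 0/1 response `y`, and its predictors `x1` and `x2` are hypothetical placeholders, not data from this class:

```r
# dat, y, x1, and x2 are hypothetical placeholders
# family = binomial tells glm() to fit a logistic regression (logit link)
mod <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Coefficients are on the log odds scale
coef(mod)

# Estimated probabilities of success, P(Y = 1), for the observed data
head(predict(mod, type = "response"))
```

We will walk through fitting, interpreting, and assessing a real model next.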