class: center, middle

### Inference for Linear Regression

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 13 | Fall 2020] </span>

---

## Announcements/Reminders

* Discuss the final exam.
* Due Dates:
    + Wed, Dec 2nd: Extra Credit on Gradescope
    + Thur/Fri, Dec 3rd/4th: Lab 11 (last lab due!) on Gradescope
    + Will receive a practice lab assignment in Lab this week.
    + Wed, Dec 9th: Final Project Assignment due. Submit to [this Ensemble dropbox folder](https://ensemble.reed.edu/Dropbox/Math141F20StudentSubmissions).
    + Thur/Fri, Dec 10th/11th: Final Exam Takehome and Oral.
    + Sign up for an oral slot [here](https://docs.google.com/spreadsheets/d/1UvAKO7UzxWm0m5mKv-VzMLawDeJZdzr1q73VCAMqH8Y/edit?usp=sharing).
    + Takehome due on Gradescope.

---

## Week 13 + Monday Topics

* Inference for Linear Regression
* Logistic Regression
* Wrap-up

*********************************************

### Goals for Today

* Finish up discussing hypothesis testing for linear regression
* Estimation and prediction inference for linear regression

---

### Recap: Multiple Linear Regression

**Form of the Model:**

`\begin{align*} y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}`

**Fitted Model:** Using the Method of Least Squares,

$$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$

#### Typical Inferential Questions:

(1) Should `\(x_2\)` be in the model that already contains `\(x_1\)` and `\(x_3\)`?

$$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$

In other words, is `\(\beta_2 = 0\)`?

---

### Multiple Linear Regression

**Form of the Model:**

`\begin{align*} y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}`

**Fitted Model:** Using the Method of Least Squares,

$$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_3\)` with a confidence interval?

$$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$

---

### Multiple Linear Regression

**Form of the Model:**

`\begin{align*} y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}`

**Fitted Model:** Using the Method of Least Squares,

$$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

$$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**In General**:

$$ H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model} $$

$$ H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model} $$

Test Statistic:

--

$$ t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t(df = n - \mbox{# of predictors} - 1) $$

when `\(H_o\)` is true and the model assumptions are met.
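---

### Hypothesis Testing

To see the arithmetic behind one row of `get_regression_table()`, here is a minimal sketch in R. It uses the rounded meadowfoam values from the next slide, so the hand computation gives `\(t \approx -8\)` rather than the `\(-7.89\)` obtained from the unrounded estimates.

```r
# Sketch: reproduce the t test for the Intensity slope by hand,
# using the rounded values reported on the next slide.
b_hat <- -0.04   # estimated slope for Intensity
se    <- 0.005   # its standard error
n     <- 24      # number of plots in case0901
p     <- 2       # predictors: Intensity and TimeCat

t_stat <- (b_hat - 0) / se   # = -8 with these rounded inputs

# Two-sided p-value from the t distribution with n - p - 1 df
2 * pt(-abs(t_stat), df = n - p - 1)
```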
---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**For the Meadowfoam Example**:

**Row 2**:

$$ H_o: \beta_1 = 0 \quad \mbox{given timing is already in the model} $$

$$ H_a: \beta_1 \neq 0 \quad \mbox{given timing is already in the model} $$

Test Statistic:

--

$$ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{-0.04 - 0}{0.005} = -7.89 $$

with p-value `\(= P(t \leq -7.89) + P(t \geq 7.89) \approx 0.\)`

--

There is evidence that the lighting intensity adds useful information to the linear regression model for flower production that already contains the timing of the lighting.

---

### Meadowfoam: Different Slopes Model?

Do we have evidence that we should allow the slopes to vary?

`\begin{align*} y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}`

```r
# Different slopes model
modFlowersInt <- lm(Flowers ~ Intensity * TimeCat, data = case0901)
get_regression_table(modFlowersInt)
```

```
## # A tibble: 4 x 7
##   term                  estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept               83.1       4.34     19.1     0       74.1     92.2
## 2 Intensity               -0.04      0.007    -5.36    0       -0.055   -0.024
## 3 TimeCatLate            -11.5       6.14     -1.88    0.075  -24.3      1.29
## 4 Intensity:TimeCatLate   -0.001     0.011    -0.115   0.91    -0.023    0.021
```

---

### One more Example: COVID and Candle Ratings

[Kate Petrova created a dataset](https://twitter.com/kate_ptrv/status/1332398768659050496) that has been making the rounds on Twitter:

<img src="img/kate_petrova_candles.jpg" width="550px"/>

---

### One more Example: COVID and Candle Ratings

She posted all her data and code to GitHub, and I did some light wrangling so that we could answer the question:

→ Do we have evidence that we should allow the slopes to vary?

--

```r
ggplot(data = all, mapping = aes(x = as.Date(Date), y = Rating, color = Type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = lm)
```

<img src="wk13_mon_files/figure-html/unnamed-chunk-3-1.png" width="360" style="display: block; margin: auto;" />

---

Do we have evidence that we should allow the slopes to vary?

<img src="wk13_mon_files/figure-html/unnamed-chunk-4-1.png" width="360" style="display: block; margin: auto;" />

```r
mod <- lm(Rating ~ as.Date(Date)*Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 4 x 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                 52.3       9.14       5.72   0       34.4     70.3
## 2 as.Date(Date)             -0.003     0         -5.31   0       -0.004   -0.002
## 3 Typeunscented            -32.3      13.0       -2.48   0.013  -57.8     -6.72
## 4 as.Date(Date):Typeunsc…    0.002     0.001      2.54   0.011    0        0.003
```
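---

### One more Example: COVID and Candle Ratings

The t test on the interaction row answers the question. As a side note (a sketch, assuming the `all` data frame from the previous slides), the same comparison can be run as a nested-model F test; with a single interaction term, `\(F = t^2\)`, so the p-value matches the 0.011 above.

```r
# Equal-slopes (no interaction) vs. different-slopes (interaction) model
mod_equal <- lm(Rating ~ as.Date(Date) + Type, data = all)
mod_diff  <- lm(Rating ~ as.Date(Date) * Type, data = all)

# F test for the added interaction term
anova(mod_equal, mod_diff)
```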
---

### Estimation

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_j\)` with a confidence interval?

--

Confidence Interval Formula:

--

`\begin{align*} \mbox{statistic} & \pm ME \\ \hat{\beta}_j & \pm t^* SE(\hat{\beta}_j) \end{align*}`

--

```r
# Equal slopes model
modFlowers <- lm(Flowers ~ Intensity + TimeCat, data = case0901)
get_regression_table(modFlowers)
```

```
## # A tibble: 3 x 7
##   term        estimate std_error statistic p_value lower_ci upper_ci
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept      83.5      3.27      25.5        0   76.7     90.3
## 2 Intensity      -0.04     0.005     -7.89       0   -0.051   -0.03
## 3 TimeCatLate   -12.2      2.63      -4.62       0  -17.6     -6.69
```

---

### Prediction

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

#### Two Types:

--

.pull-left[
**Confidence Interval for the Mean Response**

→ Defined at given values of the explanatory variables

→ Estimates the <span style="color: orange;">average</span> response

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Smaller</span> SE
]

.pull-right[
**Prediction Interval for an Individual Response**

→ Defined at given values of the explanatory variables

→ Predicts the response of a <span style="color: orange;">single,</span> new observation

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Larger</span> SE
]

---

### CI for mean response at a given level of X:

We want to construct a 95% CI for the <span style="color: orange;">average</span> number of flowers when the lighting intensity is <span style="color: orange;">500</span> `\(mmol/m^2/sec\)` and the timing is <span style="color: orange;">early</span>.

--

```r
new <- data.frame(Intensity = 500, TimeCat = "Early")
predict.lm(modFlowers, new, interval = "confidence", level = 0.95)
```

```
##     fit   lwr  upr
## 1 63.23 59.35 67.1
```

--

**Interpretation:** We are 95% confident that the average number of flowers will be between 59 and 67 for meadowfoam plants raised at a lighting intensity of 500 `\(mmol/m^2/sec\)` and at an early onset of the lighting.

---

### PI for a new Y at a given level of X:

We want to construct a 95% PI for the <span style="color: orange;">number of flowers</span> of a single meadowfoam plant when the lighting intensity is <span style="color: orange;">500</span> `\(mmol/m^2/sec\)` and the timing is <span style="color: orange;">early</span>.

--

```r
predict.lm(modFlowers, new, interval = "prediction", level = 0.95)
```

```
##     fit   lwr   upr
## 1 63.23 49.28 77.17
```

--

**Interpretation:** For meadowfoam plants raised at a lighting intensity of 500 `\(mmol/m^2/sec\)` and at an early onset of the lighting, we expect 95% of the plants to have between 49 and 77 flowers.

---

### Comparing Models

Suppose I built 5 different models. **Which is best?**

--

* Big question! Take Math 243: Statistical Learning to learn systematic model selection techniques.

--

* We will explore one approach. (But there are many possible approaches!)

---

### Comparing Models

Suppose I built 5 different models. **Which is best?**

--

→ Pick the best model based on some measure of quality.

--

**Measure of quality**: `\(R^2\)` (Coefficient of Determination)

`\begin{align*} R^2 &= \mbox{% of total variation in y explained by the model}\\ &= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \end{align*}`

--

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.
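---

### Comparing Models with `\(R^2\)`

Before reaching for `glance()`, here is a minimal sketch of the `\(R^2\)` formula computed by hand (assuming `modFlowers` and `case0901` from earlier):

```r
# R^2 by hand: 1 - SSE/SST for the equal-slopes model
y     <- case0901$Flowers
y_hat <- fitted(modFlowers)
1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
```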
---

### Comparing Models with `\(R^2\)`

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

```r
library(broom)
glance(modFlowers)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.799         0.780  6.44      41.8 4.79e-8     2  -77.2  162.  167.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(modFlowersInt)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.799         0.769  6.60      26.5 3.55e-7     3  -77.1  164.  170.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

--

**Problem:** As we add predictors, the `\(R^2\)` value will only increase.

---

### Comparing Models with `\(R^2\)`

**Problem:** As we add predictors, the `\(R^2\)` value will only increase.

And in [Week 6, we said](https://reed-statistics.github.io/math141f20/wk06_mon_wed.html#27):

**Guiding Principle**: Occam's Razor for Modeling

> "All other things being equal, simpler models are to be preferred over complex ones." -- ModernDive

---

### Comparing Models with the Adjusted `\(R^2\)`

**New Measure of quality**: Adjusted `\(R^2\)` (Adjusted Coefficient of Determination)

`\begin{align*} \mbox{adj } R^2 &= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \left(\frac{n - 1}{n - p - 1} \right) \end{align*}`

where `\(p\)` is the number of explanatory variables in the model.

--

Now we will penalize larger models.

--

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

---

### Comparing Models with the Adjusted `\(R^2\)`

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

```r
glance(modFlowers)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.799         0.780  6.44      41.8 4.79e-8     2  -77.2  162.  167.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(modFlowersInt)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.799         0.769  6.60      26.5 3.55e-7     3  -77.1  164.  170.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
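---

### Comparing Models with the Adjusted `\(R^2\)`

As a check on the formula, here is a minimal sketch that recomputes `adj.r.squared` by hand (`adj_r2()` is a hypothetical helper, not part of the lecture code):

```r
# Adjusted R^2 by hand, matching the adj.r.squared column from glance()
adj_r2 <- function(mod) {
  y <- model.response(model.frame(mod))  # observed response
  n <- length(y)
  p <- length(coef(mod)) - 1             # number of explanatory terms
  1 - (sum(resid(mod)^2) / sum((y - mean(y))^2)) * (n - 1) / (n - p - 1)
}

adj_r2(modFlowers)     # equal slopes: higher adjusted R^2
adj_r2(modFlowersInt)  # penalized for the extra interaction term
```

Both models explain essentially the same share of the variation (`\(R^2 = 0.799\)`), so the penalty for the extra interaction term tips the adjusted `\(R^2\)` toward the simpler, equal-slopes model (0.780 vs. 0.769).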