class: center, middle

### Inference for Linear Regression

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 12 | Fall 2020] </span>

---

## Announcements/Reminders

* Extra Credit Assignment: Write a stats poem.
    + Due December 2nd
* Project Assignment 3 due on Friday!
* Final Project Assignment (pafinal.Rmd) is in the shared folder.
    + Due Wednesday, December 9th

---

## Week 12 Topics

* ANOVA Test
* Simulation Methods versus Probability Model Methods for Inference
* Inference for Linear Regression

*********************************************

### Goals for Today

* Recap linear regression
* Check the linear regression model assumptions
* Conduct hypothesis tests for the regression coefficients

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allows for:

* Both quantitative and categorical explanatory variables.
* Multiple explanatory variables.
* Curved relationships between the response variable and the explanatory variable.

--

* BUT the response variable is quantitative.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

--

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

--

#### Typical Inferential Questions:

(1) Should `\(x_2\)` be in the model that already contains `\(x_1\)` and `\(x_3\)`?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

In other words, should `\(\beta_2 = 0\)`?
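Question (1) can be explored directly in R. The sketch below is hypothetical (simulated data, made-up coefficient values): it generates data in which `\(x_2\)` truly has no effect, fits the three-predictor model by least squares, and reads the test of `\(\beta_2 = 0\)` off the coefficient table. (`get_regression_table()` from **moderndive** reports the same quantities in tidy form.)

```r
# Hypothetical data: y depends on x1 and x3, but NOT on x2 (true beta_2 = 0)
set.seed(141)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 + 0 * x2 - 0.8 * x3 + rnorm(n, sd = 1)

# Fit the full model by the Method of Least Squares
fit <- lm(y ~ x1 + x2 + x3)

# Each row tests H_o: beta_j = 0, given the other predictors are in the model;
# the x2 row should show a large p-value here
summary(fit)$coefficients
```

With a large p-value in the `x2` row, we would lack evidence that `\(x_2\)` adds useful information to a model already containing `\(x_1\)` and `\(x_3\)`.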
---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_3\)` with a confidence interval?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

--

To answer these questions, we need to add some assumptions to our linear regression model.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Additional Assumptions:**

$$
\epsilon \overset{\mbox{ind}}{\sim} N (\mu = 0, \sigma = \sigma_{\epsilon})
$$

`\(\sigma_{\epsilon}\)` = typical deviations from the model

--

Let's unpack these assumptions!

--

→ Observations are independent.

→ The errors are normally distributed.

→ The errors average to zero.

→ The standard deviation of the errors is constant.

→ The model form is appropriate.
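These error assumptions can be made concrete with a small simulation (hypothetical numbers throughout): draw errors independently from `\(N(0, \sigma_{\epsilon})\)`, build `\(y\)` from a known line, and confirm that the least-squares residuals behave the way the assumptions describe.

```r
# Hypothetical simulation: errors drawn iid from N(0, sigma_eps)
set.seed(2020)
n         <- 500
sigma_eps <- 2
x1        <- runif(n, 0, 10)
eps       <- rnorm(n, mean = 0, sd = sigma_eps)
y         <- 5 + 3 * x1 + eps     # true line: beta_o = 5, beta_1 = 3

fit <- lm(y ~ x1)

mean(resid(fit))   # essentially 0: least squares forces the residuals to sum to 0
sd(resid(fit))     # should be close to sigma_eps = 2
```

The residual mean is zero by construction, but the residual standard deviation only tracks `\(\sigma_{\epsilon}\)` when the model form and constant-variance assumptions actually hold, which is why we check them.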
---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\color{orange}{\mbox{ind}}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

→ The cases are independent of each other.

--

**Question**: How do we check this assumption?

--

Look at how the data were collected. Checking this assumption generally comes down to whether random sampling or random assignment was used.

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}\color{orange}{N}\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

→ The errors are normally distributed.

--

**Question**: How do we check this assumption?

--

Recall the residual: `\(e = y - \hat{y}\)`

--

**QQ-plot:** Plot the residuals against the quantiles of a normal distribution!

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-1-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-2-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(\color{orange}{0}, \sigma_{\epsilon} \right)
\end{align*}`

--

→ The points will, on average, fall on the line.

--

**Question**: How do we check this assumption?

--

If you use the Method of Least Squares, then you don't have to check.
It will be true by construction:

$$
\sum e = 0
$$

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

--

→ The variability in the errors is constant.

--

**Question**: How do we check this assumption?

--

**One option**: Scatterplot

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-3-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-4-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

→ The variability in the errors is constant.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): Residual Plot

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-5-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-6-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

→ The model form is appropriate.

--

**Question**: How do we check this assumption?
--

**One option**: Scatterplot

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-7-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-8-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

→ The model form is appropriate.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): Residual Plot

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-9-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-10-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

**Question**: What if the assumptions aren't all satisfied?

--

→ Try transforming the data and building the model again.

--

→ Use a modeling technique beyond linear regression.

---

**Question**: What if the assumptions are all (roughly) satisfied?

--

→ You can now start answering your inference questions!

---

class: inverse, center, middle

### Let's practice checking linear regression assumptions with the "inferenceModeling.Rmd" handout!

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?
--

**In General**:

$$
H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model}
$$

$$
H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model}
$$

**For the Meadowfoam Example**:

**Row 2**:

$$
H_o: \beta_1 = 0 \quad \mbox{given timing is already in the model}
$$

$$
H_a: \beta_1 \neq 0 \quad \mbox{given timing is already in the model}
$$

--

**Row 3**:

$$
H_o: \beta_2 = 0 \quad \mbox{given intensity is already in the model}
$$

$$
H_a: \beta_2 \neq 0 \quad \mbox{given intensity is already in the model}
$$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**In General**:

$$
H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model}
$$

$$
H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model}
$$

Test Statistic:

--

$$
t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t\left(df = n - (\mbox{# of predictors} + 1)\right)
$$

when `\(H_o\)` is true and the model assumptions are met. (The `\(+ 1\)` accounts for the intercept.)

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**For the Meadowfoam Example**:

**Row 2**:

$$
H_o: \beta_1 = 0 \quad \mbox{given timing is already in the model}
$$

$$
H_a: \beta_1 \neq 0 \quad \mbox{given timing is already in the model}
$$

Test Statistic:

--

$$
t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{-0.04 - 0}{0.005} = -7.89
$$

with p-value `\(= P(t \leq -7.89) + P(t \geq 7.89) \approx 0.\)`

--

There is evidence that the lighting intensity adds useful information to the linear regression model for flower production that already contains the timing of the lighting.

---

### MeadowFoam: Different Slopes Model?

Do we have evidence that we should allow the slopes to vary?
`\begin{align*}
y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

```r
# Different slopes model
modFlowersInt <- lm(Flowers ~ Intensity * TimeCat, data = case0901)
get_regression_table(modFlowersInt)
```

```
## # A tibble: 4 x 7
##   term                  estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept               83.1       4.34     19.1     0       74.1     92.2
## 2 Intensity               -0.04      0.007    -5.36    0       -0.055   -0.024
## 3 TimeCatLate            -11.5       6.14     -1.88    0.075  -24.3      1.29
## 4 Intensity:TimeCatLate   -0.001     0.011    -0.115   0.91    -0.023    0.021
```

---

class: center, inverse, middle

## We will discuss estimation-related inference questions for linear regression after the break!

## Travel Safely Everyone!