class: center, middle

### The Simple Linear Regression Model

.large[Kelly McConville | Math 141 | Week 5 | Fall 2020]

---

## Announcements/Reminders

* Project Assignment 1 is due on Friday, October 2nd (end of day) on Gradescope.

* Lab 4 due before your lab this week.
    + No coding, just narrative.

---

## Week 5 Topics

* **Modeling**

---

# Goals for Today

* Simple linear regression model
    + Estimating the slope and intercept terms
    + Prediction

---

### Simple Linear Regression

Consider this model when:

* Response variable `$(y)$`: quantitative

* Explanatory variable `$(x)$`: quantitative
    + Have only ONE explanatory variable.

--
    
* AND, `$f()$` can be approximated by a line.

---

### Simple Linear Regression

Let's return to the Example: Trees at the Woodstock CC

* A line is a reasonable model form.

* Where should the line be?
    + Slope? Intercept?
    
---

###  Form of the SLR Model

$$ 
`\begin{align}
y &= f(x) + \epsilon \\
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

**Need to determine the best estimates of `$\beta_o$` and `$\beta_1$`.**

*****************************

#### Distinguishing between the population and the sample

* Parameters: 
    + Based on the population
    + Unknown unless we have data on the whole population
    + EX: `$\beta_o$` and `$\beta_1$`

* Statistics: 
    + Based on the sample data
    + Known
    + Usually estimate a population parameter
    + EX: `$\hat{\beta}_o$` and `$\hat{\beta}_1$`

---

### Method of Least Squares

Need two key definitions:

* Fitted value: The *estimated* value of the response for the `$i$`-th case

$$
\hat{y}_i = \hat{\beta}_o + \hat{\beta}_1 x_i
$$
--

* Residual: The *observed* error term for the `$i$`-th case

$$
e_i = y_i - \hat{y}_i
$$

**Goal**: Pick values for `$\hat{\beta}_o$` and `$\hat{\beta}_1$`  so that the residuals are small!
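
As a minimal sketch of these two definitions, the code below computes fitted values and residuals for one candidate line. The data (`x`, `y`) and the candidate intercept and slope (`b0`, `b1`) are made up for illustration; they are not from the Woodstock data and are not least squares estimates.

```r
# Hypothetical toy data (not the Woodstock trees)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# A candidate line: these values are guesses, not least squares estimates
b0 <- 1
b1 <- 1

# Fitted values: y-hat_i = b0 + b1 * x_i
y_hat <- b0 + b1 * x

# Residuals: e_i = y_i - y-hat_i
e <- y - y_hat

data.frame(x, y, y_hat, e)
```

A different choice of `b0` and `b1` gives different residuals, which is why we need a rule for picking the "best" pair.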

---

### Method of Least Squares

* Let's focus on the orange line.

* Want residuals to be small.

* Minimize some function of the residuals.

---

### Method of Least Squares

Minimize:

$$
\sum_{i = 1}^n e^2_i
$$

Get the following equations:

$$ 
`\begin{align}
\hat{\beta}_1 &= \frac{ \sum_{i = 1}^n (x_i - \bar{x}) (y_i - \bar{y})}{ \sum_{i = 1}^n (x_i - \bar{x})^2} \\
\hat{\beta}_o &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{align}`
$$
where

$$
`\begin{align}
\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i \quad \mbox{and} \quad \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i
\end{align}`
$$
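
To see the formulas in action, here is a small sketch that computes the slope and intercept estimates by hand and compares them with `lm()`. The toy vectors `x` and `y` are assumptions made up for illustration.

```r
# Hypothetical toy data (not the Woodstock trees)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Sample means
x_bar <- mean(x)
y_bar <- mean(y)

# Least squares estimates, term by term from the formulas
b1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
b0_hat <- y_bar - b1_hat * x_bar

c(intercept = b0_hat, slope = b1_hat)

# lm() minimizes the same sum of squared residuals,
# so its coefficients should agree with the hand computation
coef(lm(y ~ x))
```

For this toy data the hand computation gives an intercept of 2.2 and a slope of 0.6.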

---

### Method of Least Squares

Once we have the estimated intercept `$(\hat{\beta}_o)$` and the estimated slope `$(\hat{\beta}_1)$`, we can estimate the whole function:

$$
\hat{y} = \hat{\beta}_o + \hat{\beta}_1 x
$$

Called the **least squares line** or the **line of best fit**.

---

### Method of Least Squares

`ggplot2` will compute the line and add it to your plot using `geom_smooth(method = "lm")`
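
A minimal sketch of that call is below. The data frame `trees_toy` is hypothetical (its column names mirror the Woodstock example, but the values are made up), and `se = FALSE` is an optional choice to hide the shaded band around the line.

```r
library(ggplot2)

# Hypothetical toy data; column names mirror the Woodstock example
trees_toy <- data.frame(
  DBH = c(12, 15, 18, 22, 26, 30),
  Tree_Height = c(40, 48, 55, 66, 74, 85)
)

p <- ggplot(trees_toy, aes(x = DBH, y = Tree_Height)) +
  geom_point() +
  # method = "lm" draws the least squares line; se = FALSE hides the band
  geom_smooth(method = "lm", se = FALSE)

p
```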

But what are the **exact** values of `$\hat{\beta}_o$` and `$\hat{\beta}_1$`?

---

### Constructing the Simple Linear Regression Model in R

```r
mod <- lm(Tree_Height ~ DBH, data = woodstock_cc)

library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    11.0     12.7       0.865   0.42   -20.1      42.1 
## 2 DBH           2.46     0.681     3.62    0.011    0.797     4.13
```
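
Reading the `estimate` column, the fitted line can be written out as a worked equation. This uses the rounded estimates from the table, so any values computed from it are approximate.

$$
\widehat{\mbox{Tree_Height}} = 11.0 + 2.46 \times \mbox{DBH}
$$

For example, a tree with a DBH of 25 has a predicted height of about `$11.0 + 2.46 \times 25 = 72.5$`.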

---

### Interpretation

Slope:

Intercept:

---

### Prediction

```r
new_cases <- data.frame(DBH = c(10, 25, 40))
predict(mod, newdata = new_cases)
```

```
##   1   2   3 
##  36  73 110
```

We didn't have any trees in our sample with a diameter of 10 inches.  Can we still make this prediction?

&rarr;  Called *interpolation*

We didn't have any trees in our sample with a diameter above 30 inches. Can we make a prediction at 40 inches?

&rarr;  Called *extrapolation*
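
One way to tell the two situations apart is to compare each new `$x$` value with the range of `$x$` values in the sample. The sketch below does this on a hypothetical data frame `trees_toy` (made-up values, column names borrowed from the Woodstock example); the `extrapolation` column is an illustrative name, not something `predict()` produces.

```r
# Hypothetical toy data; column names mirror the Woodstock example
trees_toy <- data.frame(
  DBH = c(12, 15, 18, 22, 26, 30),
  Tree_Height = c(40, 48, 55, 66, 74, 85)
)
mod_toy <- lm(Tree_Height ~ DBH, data = trees_toy)

new_cases <- data.frame(DBH = c(20, 25, 40))

# Flag new cases whose DBH falls outside the observed range of the sample
dbh_range <- range(trees_toy$DBH)
new_cases$extrapolation <- new_cases$DBH < dbh_range[1] |
  new_cases$DBH > dbh_range[2]

# predict() returns a value either way; the flag is the warning sign
new_cases$predicted <- predict(mod_toy, newdata = new_cases)
new_cases
```

Note that `predict()` does not warn about extrapolation, so the range check has to be done separately.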

---

### Cautions

1. Be careful to predict only at `$x$` values within the range of `$x$` values in the sample.

2. Make sure to investigate **influential points**.
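
As a rough sketch of what "influential" can mean, the code below fits the line with and without one unusual case and compares the slopes. The data are made up for illustration, and refitting without a point is only one informal way to investigate.

```r
# Hypothetical toy data: the last case has an unusually large x and a low y
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 15),
  y = c(2, 4, 5, 4, 5, 3)
)

# Slope using all cases
slope_all <- coef(lm(y ~ x, data = toy))["x"]

# Slope after dropping the unusual case
slope_without <- coef(lm(y ~ x, data = toy[toy$x < 15, ]))["x"]

c(all_cases = unname(slope_all), without_case = unname(slope_without))
```

In this toy example, a single case is enough to change the sign of the estimated slope.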

What is an **outlier**?