class: center, middle

### The Simple Linear Regression Model

<img src="img/DAW.png" width="500px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 5 | Fall 2020]
</span>

---

## Announcements/Reminders

* Project Assignment 1 is due on Friday, October 2nd (end of day) on Gradescope.

--

* Lab 4 is due before your lab this week.
    + No coding, just narrative.

---

## Week 5 Topics

* **Modeling**

---

# Goals for Today

* Simple linear regression model
    + Estimating the slope and intercept terms
    + Prediction

---

### Simple Linear Regression

Consider this model when:

--

* Response variable `\((y)\)`: quantitative

--

* Explanatory variable `\((x)\)`: quantitative
    + Have only ONE explanatory variable.

--

* AND, `\(f()\)` can be approximated by a line.

---

### Simple Linear Regression

Let's return to the Example: Trees at the Woodstock CC

<img src="wk05_mon_files/figure-html/unnamed-chunk-1-1.png" width="360" />

* A line is a reasonable model form.
* Where should the line be?
    + Slope? Intercept?

---

### Form of the SLR Model

$$
`\begin{align}
y &= f(x) + \epsilon \\
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

**Need to determine the best estimates of `\(\beta_o\)` and `\(\beta_1\)`.**

--

*****************************

#### Distinguishing between the population and the sample

--

* Parameters:
    + Based on the population
    + Unknown if we don't have data on the whole population
    + EX: `\(\beta_o\)` and `\(\beta_1\)`

--

* Statistics:
    + Based on the sample data
    + Known
    + Usually estimate a population parameter
    + EX: `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)`

---

### Method of Least Squares

Need two key definitions:

--

* Fitted value: The *estimated* value of the response for the `\(i\)`-th case

$$
\hat{y}_i = \hat{\beta}_o + \hat{\beta}_1 x_i
$$

--

* Residual: The *observed* error term for the `\(i\)`-th case

$$
e_i = y_i - \hat{y}_i
$$

**Goal**: Pick values for `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)` so that the residuals are small!
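---

### Fitted Values and Residuals in R

A minimal sketch of the two definitions above. The `toy` data frame and the candidate intercept and slope are invented for illustration (they are not the Woodstock CC data or its estimates); only the column names `DBH` and `Tree_Height` follow the example. Summing the squared residuals is one possible way to judge how "small" they are, used here as an assumption.

```r
# Hypothetical toy data: values are made up for illustration only
toy <- data.frame(
  DBH         = c(8, 12, 15, 20, 26),
  Tree_Height = c(30, 42, 47, 60, 75)
)

# An arbitrary candidate line (not the least squares estimates)
b0_hat <- 11
b1_hat <- 2.5

# Fitted values: y-hat_i = b0_hat + b1_hat * x_i
fitted_vals <- b0_hat + b1_hat * toy$DBH

# Residuals: e_i = y_i - y-hat_i
resids <- toy$Tree_Height - fitted_vals

# Show each case next to its fitted value and residual
data.frame(toy, fitted = fitted_vals, residual = resids)

# One possible single-number summary of how small the residuals are
sum(resids^2)
```

Trying a different `b0_hat` or `b1_hat` and re-running shows how the residuals change with the candidate line.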
---

### Method of Least Squares

* Let's focus on the orange line.

<img src="wk05_mon_files/figure-html/unnamed-chunk-2-1.png" width="360" />

* Want residuals to be small.
* Minimize some function of the residuals.

---

### Method of Least Squares

Minimize:

$$
\sum_{i = 1}^n e^2_i
$$

--

Get the following equations:

$$
`\begin{align}
\hat{\beta}_1 &= \frac{ \sum_{i = 1}^n (x_i - \bar{x}) (y_i - \bar{y})}{ \sum_{i = 1}^n (x_i - \bar{x})^2} \\
\hat{\beta}_o &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{align}`
$$

where

$$
`\begin{align}
\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i \quad \mbox{and} \quad \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i
\end{align}`
$$

---

## Method of Least Squares

Once we have the estimated intercept `\((\hat{\beta}_o)\)` and the estimated slope `\((\hat{\beta}_1)\)`, we can estimate the whole function:

--

$$
\hat{y} = \hat{\beta}_o + \hat{\beta}_1 x
$$

Called the **least squares line** or the **line of best fit**.

---

### Method of Least Squares

`ggplot2` will compute the line and add it to your plot using `geom_smooth(method = "lm")`.

<img src="wk05_mon_files/figure-html/unnamed-chunk-3-1.png" width="360" />

--

But what are the **exact** values of `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)`?

---

### Constructing the Simple Linear Regression Model in R

```r
# Fit the SLR model: Tree_Height is the response, DBH is the explanatory variable
mod <- lm(Tree_Height ~ DBH, data = woodstock_cc)
library(moderndive)
# Table of the estimated intercept and slope
get_regression_table(mod)
```

```
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    11.0     12.7       0.865   0.42   -20.1      42.1
## 2 DBH           2.46     0.681     3.62    0.011    0.797     4.13
```

---

### Interpretation

Slope:

<br><br><br>
<br>
<br>
<br>

Intercept:

---

### Prediction

```r
# New diameters (DBH) at which to predict tree height
new_cases <- data.frame(DBH = c(10, 25, 40))
predict(mod, newdata = new_cases)
```

```
##   1   2   3
##  36  73 110
```

We didn't have any trees in our sample with a diameter of 10 inches. Can we still make this prediction?
--

→ Called *interpolation*

We didn't have any trees in our sample with a diameter above 30 inches. Can we make a prediction at 40 inches?

--

→ Called *extrapolation*

---

### Cautions

1. Be careful to only predict values within the range of `\(x\)` values in the sample.

--

2. Make sure to investigate **influential points**.

--

What is an **outlier**?

--

<img src="wk05_mon_files/figure-html/unnamed-chunk-6-1.png" width="360" />
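---

### Checking the Estimates and the Prediction Range in R

A hedged sketch that ties the least squares equations to the first caution. The `toy` data frame is invented for illustration (it is not the Woodstock CC data), and `toy_mod` and the `extrapolating` flag are hypothetical names. The code computes the slope and intercept from the equations, compares them with `lm()`, and flags new `DBH` values that fall outside the observed range before predicting.

```r
# Hypothetical toy data: values are made up for illustration only
toy <- data.frame(
  DBH         = c(8, 12, 15, 20, 26),
  Tree_Height = c(30, 42, 47, 60, 75)
)
x <- toy$DBH
y <- toy$Tree_Height

# Least squares estimates computed directly from the equations
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

# The same estimates from lm(); the two should agree
toy_mod <- lm(Tree_Height ~ DBH, data = toy)
all.equal(unname(coef(toy_mod)), c(b0_hat, b1_hat))

# Predict at new diameters, then flag any outside the observed range of x
new_cases <- data.frame(DBH = c(10, 25, 40))
new_cases$predicted <- predict(toy_mod, newdata = new_cases)
x_range <- range(toy$DBH)
new_cases$extrapolating <- new_cases$DBH < x_range[1] |
  new_cases$DBH > x_range[2]
new_cases
```

Rows flagged as `extrapolating` still get a predicted value from `predict()`, so the flag is only a reminder to treat those predictions with caution.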