class: center, middle

### Inference for Linear Regression

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 12 | Fall 2020] </span>

---

## Announcements/Reminders

* Extra Credit Assignment: Write a stats poem.
    + Due December 2nd
* Project Assignment 3 due on Friday!
* Final Project Assignment (pafinal.Rmd) is in the shared folder.
    + Due Wednesday, December 9th

---

## Week 12 Topics

* ANOVA Test
* Simulation Methods versus Probability Model Methods for Inference
* Inference for Linear Regression

*********************************************

### Goals for Today

* Recap linear regression
* Check the linear regression model assumptions
* Conduct hypothesis tests for the regression coefficients

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allows for:

* Both quantitative and categorical explanatory variables.
* Multiple explanatory variables.
* Curved relationships between the response variable and the explanatory variable.

--

* BUT the response variable is quantitative.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

--

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

--

#### Typical Inferential Questions:

(1) Should `\(x_2\)` be in the model that already contains `\(x_1\)` and `\(x_3\)`?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

In other words, should `\(\beta_2 = 0\)`?
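Question (1) can be explored directly in R. The sketch below is hypothetical (simulated data, made-up coefficient values): it generates data in which `\(x_2\)` truly has no effect, fits the three-predictor model by least squares, and reads the test of `\(\beta_2 = 0\)` off the coefficient table. (`get_regression_table()` from **moderndive** reports the same quantities in tidy form.)

```r
# Hypothetical data: y depends on x1 and x3, but NOT on x2 (true beta_2 = 0)
set.seed(141)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 + 0 * x2 - 0.8 * x3 + rnorm(n, sd = 1)

# Fit the full model by the Method of Least Squares
fit <- lm(y ~ x1 + x2 + x3)

# Each row tests H_o: beta_j = 0, given the other predictors are in the model;
# the x2 row should show a large p-value here
summary(fit)$coefficients
```

With a large p-value in the `x2` row, we would lack evidence that `\(x_2\)` adds useful information to a model already containing `\(x_1\)` and `\(x_3\)`.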
---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_3\)` with a confidence interval?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

--

To answer these questions, we need to add some assumptions to our linear regression model.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Additional Assumptions:**

$$
\epsilon \overset{\mbox{ind}}{\sim} N (\mu = 0, \sigma = \sigma_{\epsilon})
$$

`\(\sigma_{\epsilon}\)` = typical deviations from the model

--

Let's unpack these assumptions!

--

→ Observations are independent.

→ The errors are normally distributed.

→ The errors average to zero.

→ The standard deviation of the errors is constant.

→ The model form is appropriate.
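These error assumptions can be made concrete with a small simulation (hypothetical numbers throughout): draw errors independently from `\(N(0, \sigma_{\epsilon})\)`, build `\(y\)` from a known line, and confirm that the least-squares residuals behave the way the assumptions describe.

```r
# Hypothetical simulation: errors drawn iid from N(0, sigma_eps)
set.seed(2020)
n         <- 500
sigma_eps <- 2
x1        <- runif(n, 0, 10)
eps       <- rnorm(n, mean = 0, sd = sigma_eps)
y         <- 5 + 3 * x1 + eps     # true line: beta_o = 5, beta_1 = 3

fit <- lm(y ~ x1)

mean(resid(fit))   # essentially 0: least squares forces the residuals to sum to 0
sd(resid(fit))     # should be close to sigma_eps = 2
```

The residual mean is zero by construction, but the residual standard deviation only tracks `\(\sigma_{\epsilon}\)` when the model form and constant-variance assumptions actually hold, which is why we check them.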
---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\color{orange}{\mbox{ind}}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

→ The cases are independent of each other.

--

**Question**: How do we check this assumption?

--

Look at how the data were collected. Checking this assumption generally comes down to whether random sampling or random assignment was used.

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}\color{orange}{N}\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

→ The errors are normally distributed.

--

**Question**: How do we check this assumption?

--

Recall the residual: `\(e = y - \hat{y}\)`

--

**QQ-plot:** Plot the residuals against the quantiles of a normal distribution!

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-1-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-2-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(\color{orange}{0}, \sigma_{\epsilon} \right)
\end{align*}`

--

→ The points will, on average, fall on the line.

--

**Question**: How do we check this assumption?

--

If you use the Method of Least Squares, then you don't have to check.
It will be true by construction:

$$
\sum e = 0
$$

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

--

→ The variability in the errors is constant.

--

**Question**: How do we check this assumption?

--

**One option**: Scatterplot

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-3-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-4-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

→ The variability in the errors is constant.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): Residual Plot

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-5-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-6-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

→ The model form is appropriate.

--

**Question**: How do we check this assumption?
--

**One option**: Scatterplot

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-7-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-8-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

→ The model form is appropriate.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): Residual Plot

.pull-left[
<img src="wk12_fri_files/figure-html/unnamed-chunk-9-1.png" width="360" style="display: block; margin: auto;" />
]

.pull-right[
<img src="wk12_fri_files/figure-html/unnamed-chunk-10-1.png" width="360" style="display: block; margin: auto;" />
]

---

### Assumptions

**Question**: What if the assumptions aren't all satisfied?

--

→ Try transforming the data and building the model again.

--

→ Use a modeling technique beyond linear regression.

---

**Question**: What if the assumptions are all (roughly) satisfied?

--

→ You can now start answering your inference questions!

---

class: inverse, center, middle

### Let's practice checking linear regression assumptions with the "inferenceModeling.Rmd" handout!

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?
--

**In General**:

$$
H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model}
$$

$$
H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model}
$$

**For the Meadowfoam Example**:

**Row 2**:

$$
H_o: \beta_1 = 0 \quad \mbox{given timing is already in the model}
$$

$$
H_a: \beta_1 \neq 0 \quad \mbox{given timing is already in the model}
$$

--

**Row 3**:

$$
H_o: \beta_2 = 0 \quad \mbox{given intensity is already in the model}
$$

$$
H_a: \beta_2 \neq 0 \quad \mbox{given intensity is already in the model}
$$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**In General**:

$$
H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model}
$$

$$
H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model}
$$

Test Statistic:

--

$$
t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t\left(df = n - (\mbox{# of predictors} + 1)\right)
$$

when `\(H_o\)` is true and the model assumptions are met. (The `\(+ 1\)` accounts for the intercept.)

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**For the Meadowfoam Example**:

**Row 2**:

$$
H_o: \beta_1 = 0 \quad \mbox{given timing is already in the model}
$$

$$
H_a: \beta_1 \neq 0 \quad \mbox{given timing is already in the model}
$$

Test Statistic:

--

$$
t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{-0.04 - 0}{0.005} = -7.89
$$

with p-value `\(= P(t \leq -7.89) + P(t \geq 7.89) \approx 0.\)`

--

There is evidence that the lighting intensity adds useful information to the linear regression model for flower production that already contains the timing of the lighting.

---

### MeadowFoam: Different Slopes Model?

Do we have evidence that we should allow the slopes to vary?
`\begin{align*}
y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

```r
# Different slopes model
modFlowersInt <- lm(Flowers ~ Intensity * TimeCat, data = case0901)
get_regression_table(modFlowersInt)
```

```
## # A tibble: 4 x 7
##   term                  estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept               83.1       4.34     19.1     0       74.1     92.2
## 2 Intensity               -0.04      0.007    -5.36    0       -0.055   -0.024
## 3 TimeCatLate            -11.5       6.14     -1.88    0.075  -24.3      1.29
## 4 Intensity:TimeCatLate   -0.001     0.011    -0.115   0.91    -0.023    0.021
```

---

class: center, inverse, middle

## We will discuss estimation-related inference questions for linear regression after the break!

## Travel Safely Everyone!