class: center, middle

### Inference for Linear Regression

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 13 | Fall 2020] </span>

---

## Announcements/Reminders

* Discuss the final exam.
* Due Dates:
    + Wed, Dec 2nd: Extra Credit on Gradescope
    + Thur/Fri, Dec 3rd/4th: Lab 11 (last lab due!) on Gradescope
    + Will receive a practice lab assignment in Lab this week.
    + Wed, Dec 9th: Final Project Assignment due. Submit to [this Ensemble dropbox folder](https://ensemble.reed.edu/Dropbox/Math141F20StudentSubmissions).
    + Thur/Fri, Dec 10th/11th: Final Exam Takehome and Oral.
    + Sign up for an oral slot [here](https://docs.google.com/spreadsheets/d/1UvAKO7UzxWm0m5mKv-VzMLawDeJZdzr1q73VCAMqH8Y/edit?usp=sharing).
    + Takehome due on Gradescope.

---

## Week 13 + Monday Topics

* Inference for Linear Regression
* Logistic Regression
* Wrap-up

*********************************************

### Goals for Today

* Finish up discussing hypothesis testing for linear regression
* Estimation and prediction inference for linear regression

---

### Recap: Multiple Linear Regression

**Form of the Model:**

`\begin{align*} y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}`

**Fitted Model:** Using the Method of Least Squares,

$$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$

#### Typical Inferential Questions:

(1) Should `\(x_2\)` be in the model that already contains `\(x_1\)` and `\(x_3\)`?

$$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$

In other words, is `\(\beta_2 = 0\)`?

---

### Multiple Linear Regression

**Form of the Model:**

`\begin{align*} y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}`

**Fitted Model:** Using the Method of Least Squares,

$$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_3\)` with a confidence interval?

$$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$

---

### Multiple Linear Regression

**Form of the Model:**

`\begin{align*} y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}`

**Fitted Model:** Using the Method of Least Squares,

$$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

$$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**In General**:

$$ H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model} $$

$$ H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model} $$

Test Statistic:

--

$$ t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t(df = n - \mbox{# of predictors} - 1) $$

when `\(H_o\)` is true and the model assumptions are met.
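---

### Hypothesis Testing

To see the arithmetic behind one row of `get_regression_table()`, here is a minimal sketch in R. It uses the rounded meadowfoam values from the next slide, so the hand computation gives `\(t \approx -8\)` rather than the `\(-7.89\)` obtained from the unrounded estimates.

```r
# Sketch: reproduce the t test for the Intensity slope by hand,
# using the rounded values reported on the next slide.
b_hat <- -0.04   # estimated slope for Intensity
se    <- 0.005   # its standard error
n     <- 24      # number of plots in case0901
p     <- 2       # predictors: Intensity and TimeCat

t_stat <- (b_hat - 0) / se   # = -8 with these rounded inputs

# Two-sided p-value from the t distribution with n - p - 1 df
2 * pt(-abs(t_stat), df = n - p - 1)
```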
---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**For the Meadowfoam Example**:

**Row 2**:

$$ H_o: \beta_1 = 0 \quad \mbox{given timing is already in the model} $$

$$ H_a: \beta_1 \neq 0 \quad \mbox{given timing is already in the model} $$

Test Statistic:

--

$$ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{-0.04 - 0}{0.005} = -7.89 $$

with p-value `\(= P(t \leq -7.89) + P(t \geq 7.89) \approx 0.\)`

--

There is evidence that the lighting intensity adds useful information to the linear regression model for flower production that already contains the timing of the lighting.

---

### Meadowfoam: Different Slopes Model?

Do we have evidence that we should allow the slopes to vary?

`\begin{align*} y = \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}`

```r
# Different slopes model
modFlowersInt <- lm(Flowers ~ Intensity * TimeCat, data = case0901)
get_regression_table(modFlowersInt)
```

```
## # A tibble: 4 x 7
##   term                  estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept               83.1       4.34     19.1     0       74.1     92.2
## 2 Intensity               -0.04      0.007    -5.36    0       -0.055   -0.024
## 3 TimeCatLate            -11.5       6.14     -1.88    0.075  -24.3      1.29
## 4 Intensity:TimeCatLate   -0.001     0.011    -0.115   0.91    -0.023    0.021
```

---

### One more Example: COVID and Candle Ratings

[Kate Petrova created a dataset](https://twitter.com/kate_ptrv/status/1332398768659050496) that has been making the rounds on Twitter:

<img src="img/kate_petrova_candles.jpg" width="550px"/>

---

### One more Example: COVID and Candle Ratings

She posted all her data and code to GitHub, and I did some light wrangling so that we could answer the question:

→ Do we have evidence that we should allow the slopes to vary?

--

```r
ggplot(data = all, mapping = aes(x = as.Date(Date), y = Rating, color = Type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = lm)
```

<img src="wk13_mon_files/figure-html/unnamed-chunk-3-1.png" width="360" style="display: block; margin: auto;" />

---

Do we have evidence that we should allow the slopes to vary?

<img src="wk13_mon_files/figure-html/unnamed-chunk-4-1.png" width="360" style="display: block; margin: auto;" />

```r
mod <- lm(Rating ~ as.Date(Date)*Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 4 x 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                 52.3       9.14       5.72   0       34.4     70.3
## 2 as.Date(Date)             -0.003     0         -5.31   0       -0.004   -0.002
## 3 Typeunscented            -32.3      13.0       -2.48   0.013  -57.8     -6.72
## 4 as.Date(Date):Typeunsc…    0.002     0.001      2.54   0.011    0        0.003
```
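---

### One more Example: COVID and Candle Ratings

The t test on the interaction row answers the question. As a side note (a sketch, assuming the `all` data frame from the previous slides), the same comparison can be run as a nested-model F test; with a single interaction term, `\(F = t^2\)`, so the p-value matches the 0.011 above.

```r
# Equal-slopes (no interaction) vs. different-slopes (interaction) model
mod_equal <- lm(Rating ~ as.Date(Date) + Type, data = all)
mod_diff  <- lm(Rating ~ as.Date(Date) * Type, data = all)

# F test for the added interaction term
anova(mod_equal, mod_diff)
```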
---

### Estimation

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_j\)` with a confidence interval?

--

Confidence Interval Formula:

--

`\begin{align*} \mbox{statistic} & \pm ME \\ \hat{\beta}_j & \pm t^* SE(\hat{\beta}_j) \end{align*}`

--

```r
# Equal slopes model
modFlowers <- lm(Flowers ~ Intensity + TimeCat, data = case0901)
get_regression_table(modFlowers)
```

```
## # A tibble: 3 x 7
##   term        estimate std_error statistic p_value lower_ci upper_ci
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept      83.5      3.27      25.5        0   76.7     90.3
## 2 Intensity      -0.04     0.005     -7.89       0   -0.051   -0.03
## 3 TimeCatLate   -12.2      2.63      -4.62       0  -17.6     -6.69
```

---

### Prediction

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

#### Two Types:

--

.pull-left[
**Confidence Interval for the Mean Response**

→ Defined at given values of the explanatory variables

→ Estimates the <span style="color: orange;">average</span> response

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Smaller</span> SE
]

.pull-right[
**Prediction Interval for an Individual Response**

→ Defined at given values of the explanatory variables

→ Predicts the response of a <span style="color: orange;">single,</span> new observation

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Larger</span> SE
]

---

### CI for mean response at a given level of X:

We want to construct a 95% CI for the <span style="color: orange;">average</span> number of flowers when the lighting intensity is <span style="color: orange;">500</span> `\(mmol/m^2/sec\)` and the timing is <span style="color: orange;">early</span>.

--

```r
new <- data.frame(Intensity = 500, TimeCat = "Early")
predict.lm(modFlowers, new, interval = "confidence", level = 0.95)
```

```
##     fit   lwr  upr
## 1 63.23 59.35 67.1
```

--

**Interpretation:** We are 95% confident that the average number of flowers will be between 59 and 67 for meadowfoam plants raised at a lighting intensity of 500 `\(mmol/m^2/sec\)` and at an early onset of the lighting.

---

### PI for a new Y at a given level of X:

We want to construct a 95% PI for the <span style="color: orange;">number of flowers</span> of a single meadowfoam plant when the lighting intensity is <span style="color: orange;">500</span> `\(mmol/m^2/sec\)` and the timing is <span style="color: orange;">early</span>.

--

```r
predict.lm(modFlowers, new, interval = "prediction", level = 0.95)
```

```
##     fit   lwr   upr
## 1 63.23 49.28 77.17
```

--

**Interpretation:** For meadowfoam plants raised at a lighting intensity of 500 `\(mmol/m^2/sec\)` and at an early onset of the lighting, we expect 95% of the plants to have between 49 and 77 flowers.

---

### Comparing Models

Suppose I built 5 different models. **Which is best?**

--

* Big question! Take Math 243: Statistical Learning to learn systematic model selection techniques.

--

* We will explore one approach. (But there are many possible approaches!)

---

### Comparing Models

Suppose I built 5 different models. **Which is best?**

--

→ Pick the best model based on some measure of quality.

--

**Measure of quality**: `\(R^2\)` (Coefficient of Determination)

`\begin{align*} R^2 &= \mbox{% of total variation in y explained by the model}\\ &= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \end{align*}`

--

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.
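---

### Comparing Models with `\(R^2\)`

Before reaching for `glance()`, here is a minimal sketch of the `\(R^2\)` formula computed by hand (assuming `modFlowers` and `case0901` from earlier):

```r
# R^2 by hand: 1 - SSE/SST for the equal-slopes model
y     <- case0901$Flowers
y_hat <- fitted(modFlowers)
1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
```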
---

### Comparing Models with `\(R^2\)`

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

```r
library(broom)
glance(modFlowers)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.799         0.780  6.44      41.8 4.79e-8     2  -77.2  162.  167.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(modFlowersInt)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.799         0.769  6.60      26.5 3.55e-7     3  -77.1  164.  170.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

--

**Problem:** As we add predictors, the `\(R^2\)` value will only increase.

---

### Comparing Models with `\(R^2\)`

**Problem:** As we add predictors, the `\(R^2\)` value will only increase.

And in [Week 6, we said](https://reed-statistics.github.io/math141f20/wk06_mon_wed.html#27):

**Guiding Principle**: Occam's Razor for Modeling

> "All other things being equal, simpler models are to be preferred over complex ones." -- ModernDive

---

### Comparing Models with the Adjusted `\(R^2\)`

**New Measure of quality**: Adjusted `\(R^2\)` (Adjusted Coefficient of Determination)

`\begin{align*} \mbox{adj } R^2 &= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \left(\frac{n - 1}{n - p - 1} \right) \end{align*}`

where `\(p\)` is the number of explanatory variables in the model.

--

Now we will penalize larger models.

--

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

---

### Comparing Models with the Adjusted `\(R^2\)`

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

```r
glance(modFlowers)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.799         0.780  6.44      41.8 4.79e-8     2  -77.2  162.  167.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(modFlowersInt)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.799         0.769  6.60      26.5 3.55e-7     3  -77.1  164.  170.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
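---

### Comparing Models with the Adjusted `\(R^2\)`

As a check on the formula, here is a minimal sketch that recomputes `adj.r.squared` by hand (`adj_r2()` is a hypothetical helper, not part of the lecture code):

```r
# Adjusted R^2 by hand, matching the adj.r.squared column from glance()
adj_r2 <- function(mod) {
  y <- model.response(model.frame(mod))  # observed response
  n <- length(y)
  p <- length(coef(mod)) - 1             # number of explanatory terms
  1 - (sum(resid(mod)^2) / sum((y - mean(y))^2)) * (n - 1) / (n - p - 1)
}

adj_r2(modFlowers)     # equal slopes: higher adjusted R^2
adj_r2(modFlowersInt)  # penalized for the extra interaction term
```

Both models explain essentially the same share of the variation (`\(R^2 = 0.799\)`), so the penalty for the extra interaction term tips the adjusted `\(R^2\)` toward the simpler, equal-slopes model (0.780 vs. 0.769).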