class: center, middle

### Multiple Linear Regression

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 5 | Fall 2020] </span>

---

## Announcements/Reminders

* Project Assignment 1 is due on Friday, October 2nd (end of day) on Gradescope.

--

* Please fill out [this feedback survey](https://docs.google.com/forms/d/e/1FAIpQLSev9ZsXp7P9vIpwH66EZ_yVbm8_v3v_wnS_8Ts5aL7Jm39NLA/viewform?usp=sf_link) (also on the website and in the announcements channel) by next Wednesday.
    + We will give groups one set of feedback on Gradescope but will take your feedback into account when doing our final assessments.

--

* In Lab 5, you are building linear regression models.

---

## Week 5 Topics

* **Modeling**

---

# Goals for Today

* Broaden our idea of linear regression models
* Discuss multiple linear regression models
* Explore interaction terms

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allows for:

* Both quantitative and categorical explanatory variables.

--

* Multiple explanatory variables.

--

* Curved relationships between the response variable and an explanatory variable.

--

* BUT the response variable is quantitative.

---

### Multiple Linear Regression

**Form of the Model:**

--

$$
`\begin{align}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

--

**How does extending to more predictors change our process?**

--

**What doesn't change:**

--

&rarr; Still use the **Method of Least Squares** to estimate the coefficients.

--

&rarr; Still use `lm()` to fit the model and `predict()` for prediction.

--

**What does change:**

--

&rarr; The meaning of each coefficient is more complicated and depends on the other variables in the model.

--

&rarr; Need to decide which variables to include and how (linear term, squared term, ...).

---

### Multiple Linear Regression

* We are going to see a few examples today and Monday.
--

* We will need to return to modeling later in the course to more definitively answer questions about **model selection**.

---

## Example

Meadowfoam is a plant that grows in the Pacific Northwest and is harvested for its seed oil. In a randomized experiment, researchers at Oregon State University looked at how two light-related factors influenced the number of flowers per meadowfoam plant, the primary measure of productivity for this plant. The two light measures were light intensity (in mmol/ `\(m^2\)` /sec) and the timing of onset of the light (early or late in terms of photoperiodic floral induction).

* **Response variable**:

* **Explanatory variables**:

<br>
<br>
<br>

**Model Form:**

---

### Data Loading and Wrangling

```r
library(tidyverse)
library(Sleuth3)
data(case0901)

# Recode the timing variable
case0901 <- case0901 %>%
  mutate(TimeCat = factor(case_when(
    Time == 1 ~ "Late",
    Time == 2 ~ "Early"
  )))
```

---

### Visualizing the Data

```r
ggplot(case0901, aes(x = Intensity, y = Flowers, color = TimeCat)) +
  geom_point()
```

<img src="wk05_fri_files/figure-html/unnamed-chunk-2-1.png" width="360" />

---

### Building the Linear Regression Model

```r
modFlowers <- lm(Flowers ~ Intensity + TimeCat, data = case0901)
library(moderndive)
get_regression_table(modFlowers)
```

```
## # A tibble: 3 x 7
##   term        estimate std_error statistic p_value lower_ci upper_ci
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept      83.5      3.27      25.5        0   76.7     90.3
## 2 Intensity      -0.04     0.005     -7.89       0   -0.051   -0.03
## 3 TimeCatLate   -12.2      2.63      -4.62       0  -17.6     -6.69
```

---

### Interpreting the Coefficients

```r
ggplot(case0901, aes(x = Intensity, y = Flowers, color = TimeCat)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

<img src="wk05_fri_files/figure-html/unnamed-chunk-4-1.png" width="360" />

<br>
<br>
<br>
<br>

--

Is the assumption of equal slopes reasonable here?
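---

### Probing the Equal Slopes Assumption

One way to probe the equal-slopes assumption is to fit a model that *allows* the slopes to differ and see whether the extra term matters. The sketch below uses simulated data as a stand-in for `case0901` (all values are hypothetical, chosen only to roughly mimic the flowers data), so it runs in base R alone:

```r
# Simulated stand-in for case0901 (hypothetical values, illustration only)
set.seed(141)
toy <- data.frame(
  Intensity = rep(c(150, 300, 450, 600, 750, 900), times = 4),
  TimeCat   = factor(rep(c("Early", "Late"), each = 12))
)
toy$Flowers <- 83 - 0.04 * toy$Intensity -
  12 * (toy$TimeCat == "Late") + rnorm(24, sd = 3)

# Interaction model: gives each TimeCat level its own Intensity slope
modToyInt <- lm(Flowers ~ Intensity * TimeCat, data = toy)
coef(modToyInt)  # 4 coefficients; the last is the difference in slopes
```

If the estimated `Intensity:TimeCatLate` term is small relative to its uncertainty, the equal-slopes (parallel lines) model is reasonable.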
---

### Prediction

```r
flowersNew <- data.frame(Intensity = 700, TimeCat = "Early")
predict(modFlowers, newdata = flowersNew)
```

```
##  1
## 55
```

---

### New Example

For this example, we will use data collected by the website pollster.com, which aggregated 102 presidential polls from August 29th, 2008 through the end of September. We want to determine the best model to explain the variable `Margin`, measured by the difference in preference between Barack Obama and John McCain. Our potential predictors are `Days` (the number of days after the Democratic Convention) and `Charlie` (an indicator variable for whether the poll was conducted before or after the first ABC interview of Sarah Palin with Charlie Gibson).

* **Response variable**:

* **Explanatory variables**:

---

### Loading and Visualizing the Data

```r
Pollster08 <- read_csv("/home/courses/math141f18/Data/Pollster08.csv")

ggplot(Pollster08, aes(x = Days, y = Margin, color = factor(Charlie))) +
  geom_point()
```

<img src="wk05_fri_files/figure-html/unnamed-chunk-6-1.png" width="360" />

--

Is the assumption of equal slopes reasonable here?
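---

### Comparing the Two Model Forms

The same-slopes and different-slopes models are *nested*, so they can be compared with an F test via `anova()`. The sketch below simulates data as a stand-in for `Pollster08` (the generating coefficients are hypothetical, loosely echoing the poll example), so it runs in base R alone:

```r
# Simulated stand-in for Pollster08 (hypothetical values, illustration only)
set.seed(2008)
toy <- data.frame(
  Days    = rep(1:30, times = 2),
  Charlie = rep(c(0, 1), each = 30)
)
toy$Margin <- 5.6 - 0.6 * toy$Days - 10 * toy$Charlie +
  0.9 * toy$Days * toy$Charlie + rnorm(60, sd = 1.5)

# Same slopes: Charlie shifts only the intercept
modSame <- lm(Margin ~ Days + factor(Charlie), data = toy)
# Different slopes: the interaction lets the Days slope change with Charlie
modDiff <- lm(Margin ~ Days * factor(Charlie), data = toy)

# Nested-model F test: a small p-value favors the different-slopes model
anova(modSame, modDiff)
```

Because the simulated data were built with a genuine interaction, the test here should strongly favor the different-slopes model.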
---

### Model Forms

**Same Slopes Model:**

<br>
<br>
<br>

**Different Slopes Model:**

* Line for `\(x_2 = 1\)`:

<br>
<br>
<br>

* Line for `\(x_2 = 0\)`:

---

### Fitting the Linear Regression Model

```r
modPoll <- lm(Margin ~ Days * factor(Charlie), data = Pollster08)
get_regression_table(modPoll)
```

```
## # A tibble: 4 x 7
##   term                  estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                5.57      1.09       5.11       0    3.40     7.73
## 2 Days                    -0.598     0.121     -4.96       0   -0.838   -0.359
## 3 factor(Charlie)1       -10.1       1.92      -5.25       0  -13.9     -6.29
## 4 Days:factor(Charlie)1    0.921     0.136      6.75       0    0.65     1.19
```

---

### Adding the Regression Model to the Plot

```r
ggplot(Pollster08, aes(x = Days, y = Margin, color = factor(Charlie))) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE)
```

<img src="wk05_fri_files/figure-html/unnamed-chunk-8-1.png" width="360" />

--

Is our modeling goal here predictive or descriptive?
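---

### Reading the Two Lines off the Coefficients

The fitted interaction model describes two separate lines, one per level of `Charlie`. A quick sketch (plugging in the estimates from the regression table, rounded as reported there) recovers each line's intercept and slope:

```r
# Coefficient estimates copied from the regression table above
b0 <- 5.57    # intercept
b1 <- -0.598  # Days slope
b2 <- -10.1   # shift in intercept when Charlie == 1
b3 <- 0.921   # shift in the Days slope when Charlie == 1

# Line for Charlie == 0: Margin-hat = b0 + b1 * Days
c(intercept = b0, slope = b1)

# Line for Charlie == 1: both the intercept and the slope shift
c(intercept = b0 + b2, slope = b1 + b3)  # roughly -4.53 and 0.323
```

So before the Gibson interview the estimated margin falls with `Days`, while afterward the fitted slope is positive.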