class: center, middle

### More Multiple Linear Regression

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 6 | Fall 2020]
</span>

---

## Announcements/Reminders

* For Project Assignment 1, please fill out [this feedback survey](https://docs.google.com/forms/d/e/1FAIpQLSev9ZsXp7P9vIpwH66EZ_yVbm8_v3v_wnS_8Ts5aL7Jm39NLA/viewform?usp=sf_link) (also on the website and in the announcements channel) by next Wednesday.
    + We will give groups one set of feedback on Gradescope but will take your feedback into account when doing our final assessments.

--

* In Lab 5 you are building linear regression models.

---

## Week 6 Topics

* **Modeling**
* Sampling Distributions

---

## Goals for Today and Wednesday

* Discuss PA 2.
* Practice interpreting model coefficients.
* Continue discussing multiple linear regression models.
* Explore polynomial terms.
* Consider categorical explanatory variables with more than 2 categories.
* Discuss guiding principles for model building.

---

## Project Assignment 2

* Create a *data biography* by answering the following key questions about the data:
    + Where did the data come from?
    + When were the data collected?
    + Why were the data collected?
    + How were the data collected?
    + Who are the data supposed to represent?
    + Who is present? Who is absent?
    + What evidence is there that the data are representative? What evidence is there that the data are not representative?

--

* **Goal:** Better understand the context of our data to reduce the assumptions and biases we are placing on the data.

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allow for:

* Both quantitative and categorical explanatory variables.
* Multiple explanatory variables.
* Curved relationships between the response variable and the explanatory variable.
* BUT the response variable is quantitative.
**Form of the Model:**

$$
`\begin{align}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

---

### New Example: Movies

Let's model a movie's critic rating using the audience rating and the movie's genre.

```r
library(tidyverse)
library(Lock5Data)
movies <- HollywoodMovies

# Restrict our attention to drama, horror, and action movies
movies2 <- movies %>%
  filter(Genre %in% c("Drama", "Horror", "Action")) %>%
  drop_na(Genre, AudienceScore, RottenTomatoes)
```

* **Response variable:**

* **Explanatory variables:**

---

#### How should we encode a categorical variable with more than 2 categories?

---

```r
ggplot(data = movies2,
       mapping = aes(x = AudienceScore,
                     y = RottenTomatoes,
                     color = Genre)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, se = FALSE) +
  geom_abline(slope = 1, intercept = 0)
```

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-2-1.png" width="468" />

* Trends?

* Include interaction terms?

---

#### Side-bar: Identify Outliers on a Graph

```r
outliers <- movies2 %>%
  mutate(DiffScore = AudienceScore - RottenTomatoes) %>%
  filter(DiffScore > 50 | DiffScore < -30) %>%
  select(Movie, DiffScore, AudienceScore, RottenTomatoes, Genre)
outliers
```

```
##                                 Movie DiffScore AudienceScore RottenTomatoes
## 1                              Saw IV        52            70             18
## 2              Step Up 2: The Streets        55            81             26
## 3     Kit Kittredge: An American Girl       -52            26             78
## 4                           Stop-Loss       -38            27             65
## 5 Transformers: Revenge of the Fallen        56            76             20
## 6         The Twilight Saga: New Moon        51            78             27
## 7                     Drag Me to Hell       -31            61             92
## 8                   The Last Exorcism       -41            32             73
## 9                             Haywire       -40            40             80
##    Genre
## 1 Horror
## 2  Drama
## 3  Drama
## 4  Drama
## 5 Action
## 6  Drama
## 7 Horror
## 8  Drama
## 9 Action
```

---

#### Side-bar: Identify Outliers on a Graph

```r
library(ggrepel)
ggplot(data = movies2,
       mapping = aes(x = AudienceScore,
                     y = RottenTomatoes,
                     color = Genre)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, se = FALSE) +
  geom_abline(slope = 1, intercept = 0) +
  geom_text_repel(data = outliers,
                  mapping = aes(label = Movie), force = 10)
```

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-4-1.png" width="540" />

---

### Building the Model:

Full model form:

```r
mod <- lm(RottenTomatoes ~ AudienceScore*Genre, data = movies2)
library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 6 x 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                  -15.0      5.27     -2.85   0.005    -25.4    -4.67
## 2 AudienceScore               1.01     0.085      11.8       0     0.84     1.18
## 3 GenreDrama                  22.8      8.94      2.55   0.011     5.23     40.4
## 4 GenreHorror                -15.2      11.0     -1.39   0.165    -36.8     6.32
## 5 AudienceScore:GenreDra…   -0.253     0.136     -1.86   0.065   -0.522    0.015
## 6 AudienceScore:GenreHor…    0.365     0.206      1.77   0.078    -0.04    0.771
```

---

```r
library(ggrepel)
ggplot(data = movies2,
       mapping = aes(x = AudienceScore,
                     y = RottenTomatoes,
                     color = Genre)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, se = FALSE) +
  geom_abline(slope = 1, intercept = 0) +
  geom_text_repel(data = outliers,
                  mapping = aes(label = Movie), force = 10)
```

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-6-1.png" width="468" />

* Evidence of curvature?
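---

#### Side-bar: Writing Out the Interaction Model

One way to write the model behind `mod`, as a sketch rather than the official answer to the "Full model form" prompt. The regression table lists `GenreDrama` and `GenreHorror` terms, so Action is the baseline genre. The symbols below are our own labels (assumptions for illustration): x_AS is the audience score, and x_D and x_H are 0/1 indicators for Drama and Horror.

$$
`\begin{align}
y &= \beta_0 + \beta_1 x_{AS} + \beta_2 x_{D} + \beta_3 x_{H} + \beta_4 x_{AS} x_{D} + \beta_5 x_{AS} x_{H} + \epsilon
\end{align}`
$$

Plugging in the rounded estimates from the regression table gives roughly one fitted line per genre:

$$
`\begin{align}
\mbox{Action: } \hat{y} &= -15.0 + 1.01 x_{AS} \\
\mbox{Drama: } \hat{y} &= (-15.0 + 22.8) + (1.01 - 0.253) x_{AS} = 7.8 + 0.757 x_{AS} \\
\mbox{Horror: } \hat{y} &= (-15.0 - 15.2) + (1.01 + 0.365) x_{AS} = -30.2 + 1.375 x_{AS}
\end{align}`
$$

Because the table values are rounded, these intercepts and slopes are approximate. Each interaction coefficient is the difference between that genre's slope and the Action slope.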
---

```r
ggplot(data = movies2,
       mapping = aes(x = AudienceScore,
                     y = RottenTomatoes,
                     color = Genre)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, se = FALSE,
              formula = y ~ poly(x, degree = 2)) +
  geom_text_repel(data = outliers,
                  mapping = aes(label = Movie), force = 10)
```

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-7-1.png" width="540" />

---

### Fitting the New Model

```r
mod2 <- lm(RottenTomatoes ~ poly(AudienceScore, degree = 2, raw = TRUE)*Genre,
           data = movies2)
get_regression_table(mod2)
```

```
## # A tibble: 9 x 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                   9.92      14.9     0.668   0.505    -19.3     39.1
## 2 poly(AudienceScore, de…    0.098     0.515     0.191   0.849   -0.916     1.11
## 3 poly(AudienceScore, de…    0.008     0.004      1.79   0.075   -0.001    0.016
## 4 GenreDrama                  88.9      24.5      3.62       0     40.6     137.
## 5 GenreHorror                -23.8      31.1    -0.765   0.445    -84.9     37.3
## 6 poly(AudienceScore, de…    -2.61      0.84     -3.11   0.002    -4.26   -0.956
## 7 poly(AudienceScore, de…    0.019     0.007      2.78   0.006    0.006    0.032
## 8 poly(AudienceScore, de…    0.574      1.22     0.469   0.639    -1.83     2.98
## 9 poly(AudienceScore, de…   -0.001     0.012    -0.061   0.951   -0.024    0.022
```

---

### Considering Other Explanatory Variables

```r
movies2 %>%
  select(RottenTomatoes, AudienceScore, OpeningWeekend, DomesticGross) %>%
  na.omit() %>%
  cor()
```

```
##                RottenTomatoes AudienceScore OpeningWeekend DomesticGross
## RottenTomatoes           1.00          0.68           0.13          0.23
## AudienceScore            0.68          1.00           0.31          0.42
## OpeningWeekend           0.13          0.31           1.00          0.88
## DomesticGross            0.23          0.42           0.88          1.00
```

---

class: center, middle, inverse

### Let's Practice!

---

### Model Building Guidance

We often have several potential explanatory variables. How do we determine which to include in the model and in what form?

--

**Guiding Principle**: Capture the general trend, not the noise.
$$
`\begin{align}
y &= f(x) + \epsilon \\
y &= \mbox{TREND} + \mbox{NOISE}
\end{align}`
$$

--

Returning to the 2008 Election Example:

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-10-1.png" width="360" />

---

### Model Building Guidance

We often have several potential explanatory variables. How do we determine which to include in the model and in what form?

**Guiding Principle**: Capture the general trend, not the noise.

$$
`\begin{align}
y &= f(x) + \epsilon \\
y &= \mbox{TREND} + \mbox{NOISE}
\end{align}`
$$

Returning to the 2008 Election Example:

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-11-1.png" width="360" />

---

### Model Building Guidance

We often have several potential explanatory variables. How do we determine which to include in the model and in what form?

**Guiding Principle**: Capture the general trend, not the noise.

$$
`\begin{align}
y &= f(x) + \epsilon \\
y &= \mbox{TREND} + \mbox{NOISE}
\end{align}`
$$

Returning to the 2008 Election Example:

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-12-1.png" width="360" />

---

### Model Building Guidance

We often have several potential explanatory variables. How do we determine which to include in the model and in what form?

**Guiding Principle**: Capture the general trend, not the noise.

$$
`\begin{align}
y &= f(x) + \epsilon \\
y &= \mbox{TREND} + \mbox{NOISE}
\end{align}`
$$

Returning to the 2008 Election Example:

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-13-1.png" width="360" />

---

### Model Building Guidance

We often have several potential explanatory variables. How do we determine which to include in the model and in what form?

--

**Guiding Principle**: Occam's Razor for Modeling

> "All other things being equal, simpler models are to be preferred over complex ones." -- ModernDive

--

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-14-1.png" width="360" />

---

### Model Building Guidance

We often have several potential explanatory variables.
How do we determine which to include in the model and in what form?

--

**Guiding Principle**: Include explanatory variables that attempt to explain **different** aspects of the variation in the response variable.

```
##                RottenTomatoes AudienceScore OpeningWeekend DomesticGross
## RottenTomatoes           1.00          0.68           0.13          0.23
## AudienceScore            0.68          1.00           0.31          0.42
## OpeningWeekend           0.13          0.31           1.00          0.88
## DomesticGross            0.23          0.42           0.88          1.00
```

---

### Model Building Guidance

We often have several potential explanatory variables. How do we determine which to include in the model and in what form?

--

**Guiding Principle**: Use your modeling motivation to determine how much you weight **interpretability** versus **prediction accuracy**.

<img src="wk06_mon_wed_files/figure-html/unnamed-chunk-16-1.png" width="720" />

---

### Model Building

* We will come back to methods for model selection.

* Key ideas:
    + Determining the response variable and the potential explanatory variable(s)
    + Writing out the model form and understanding the terms
    + Building and visualizing linear regression models in R
    + Comparing different potential models
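---

### Model Building: Comparing Models (Sketch)

A minimal sketch of the last key idea, comparing candidate models. It uses simulated data, not the movies data, so it runs with base R alone. The names `sim`, `fit_line`, `fit_quad`, and `comparison` are hypothetical, and adjusted R-squared is only one possible yardstick; it is not necessarily the model selection method the course will use later.

```r
# Simulated data (NOT HollywoodMovies): a gently curved trend plus noise
set.seed(141)
n <- 200
sim <- data.frame(x = runif(n, min = 20, max = 90))
sim$y <- 10 + 0.1 * sim$x + 0.008 * sim$x^2 + rnorm(n, sd = 10)

# Candidate 1: a straight line
fit_line <- lm(y ~ x, data = sim)

# Candidate 2: adds a quadratic term, as mod2 does for AudienceScore
fit_quad <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = sim)

# Adjusted R-squared penalizes extra coefficients, so a more complex
# model has to earn its added terms (Occam's Razor in numeric form)
comparison <- data.frame(
  model         = c("line", "quadratic"),
  n_coef        = c(length(coef(fit_line)), length(coef(fit_quad))),
  adj_r_squared = c(summary(fit_line)$adj.r.squared,
                    summary(fit_quad)$adj.r.squared)
)
comparison
```

Plain R-squared can never decrease when a term is added to a nested model, which is why the sketch reports the adjusted version. With real data, a small gain in adjusted R-squared still has to be weighed against the loss of interpretability.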