class: center, middle

### More Multiple Linear Regression

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 6 | Fall 2020] </span>

---

## Announcements/Reminders

* For Project Assignment 1, please fill out [this feedback survey](https://docs.google.com/forms/d/e/1FAIpQLSev9ZsXp7P9vIpwH66EZ_yVbm8_v3v_wnS_8Ts5aL7Jm39NLA/viewform?usp=sf_link) (also on the website and in the announcements channel) by next Wednesday.
    + We will give groups one set of feedback on Gradescope but will take your feedback into account when doing our final assessments.

* In Lab 5 you are building linear regression models.

---

## Week 6 Topics

* **Modeling**

* Sampling Distributions

---

## Goals for Today and Wednesday

* Discuss PA 2.

* Practice interpreting model coefficients.

* Continue discussing multiple linear regression models.

* Explore polynomial terms.

* Consider categorical explanatory variables with more than 2 categories.

* Discuss guiding principles for model building.

---

## Project Assignment 2

* Create a *data biography* by answering the following key questions about the data:
    + Where did the data come from?
    + When were the data collected?
    + Why were the data collected?
    + How were the data collected?
    + Who are the data supposed to represent?
        + Who is present?  Who is absent?
        + What evidence is there that the data are representative?  What evidence is there that the data are not representative?

* **Goal:** Better understand the context of our data to reduce the assumptions and biases we are placing on the data.

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allows for:

* Both quantitative and categorical explanatory variables.

* Multiple explanatory variables.

* Curved relationships between the response variable and the explanatory variable.

* BUT the response variable must be quantitative.

**Form of the Model:**

$$ 
`\begin{align}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

---

### New Example: Movies

Let's model a movie's critic rating using the audience rating and the movie's genre.

```r
library(tidyverse)
library(Lock5Data)
movies <- HollywoodMovies

# Restrict our attention to three genres: Drama, Horror, and Action
movies2 <- movies %>%
  filter(Genre %in% c("Drama", "Horror", "Action")) %>%
  drop_na(Genre, AudienceScore, RottenTomatoes)
```

* **Response variable:**

* **Explanatory variables:**

---

#### How should we encode a categorical variable with more than 2 categories?
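
One common answer is indicator (dummy) variables: with k categories, R builds k - 1 indicator columns and treats the remaining category as the baseline. Here is a minimal sketch on a made-up toy data frame (the `toy` object and its values are hypothetical, not taken from `HollywoodMovies`):

```r
# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  Genre = factor(c("Action", "Drama", "Horror", "Drama")),
  AudienceScore = c(60, 75, 40, 82)
)

# model.matrix() shows the columns lm() would build from this formula.
# Factor levels sort alphabetically, so "Action" is the baseline
# and gets no indicator column of its own.
X <- model.matrix(~ AudienceScore + Genre, data = toy)
X
```

Each `Genre...` column equals 1 for rows in that genre and 0 otherwise, so an Action movie has 0 in both indicator columns.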

---

```r
ggplot(data = movies2, mapping = aes(x = AudienceScore, 
                                     y = RottenTomatoes,
                                     color = Genre)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, se = FALSE) +
  geom_abline(slope = 1, intercept = 0)
```

* Trends?

* Include interaction terms?

---

#### Side-bar: Identify Outliers on a Graph

```r
outliers <- movies2 %>%
  mutate(DiffScore = AudienceScore - RottenTomatoes) %>%
  filter(DiffScore > 50 | DiffScore < -30) %>%
  select(Movie, DiffScore, AudienceScore, RottenTomatoes, Genre)
outliers
```

```
##                                 Movie DiffScore AudienceScore RottenTomatoes
## 1                              Saw IV        52            70             18
## 2              Step Up 2: The Streets        55            81             26
## 3     Kit Kittredge: An American Girl       -52            26             78
## 4                           Stop-Loss       -38            27             65
## 5 Transformers: Revenge of the Fallen        56            76             20
## 6         The Twilight Saga: New Moon        51            78             27
## 7                     Drag Me to Hell       -31            61             92
## 8                   The Last Exorcism       -41            32             73
## 9                             Haywire       -40            40             80
##    Genre
## 1 Horror
## 2  Drama
## 3  Drama
## 4  Drama
## 5 Action
## 6  Drama
## 7 Horror
## 8  Drama
## 9 Action
```

---

#### Side-bar: Identify Outliers on a Graph

```r
library(ggrepel)
ggplot(data = movies2, mapping = aes(x = AudienceScore, 
                                     y = RottenTomatoes, 
                                     color = Genre)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, se = FALSE) +
  geom_abline(slope = 1, intercept = 0) +
  geom_text_repel(data = outliers, mapping = aes(label = Movie),
                  force = 10)
```

---

### Building the Model:

Full model form:

```r
mod <- lm(RottenTomatoes ~ AudienceScore*Genre, data = movies2)
library(moderndive)
get_regression_table(mod) 
```

```
## # A tibble: 6 x 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                -15.0       5.27      -2.85   0.005  -25.4     -4.67 
## 2 AudienceScore              1.01      0.085     11.8    0        0.84     1.18 
## 3 GenreDrama                22.8       8.94       2.55   0.011    5.23    40.4  
## 4 GenreHorror              -15.2      11.0       -1.39   0.165  -36.8      6.32 
## 5 AudienceScore:GenreDra…   -0.253     0.136     -1.86   0.065   -0.522    0.015
## 6 AudienceScore:GenreHor…    0.365     0.206      1.77   0.078   -0.04     0.771
```
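
Because Action is the baseline genre, the table implies a separate fitted line for each genre: the `GenreDrama` and `GenreHorror` rows shift the intercept, and the interaction rows shift the slope. Here is a minimal sketch of that arithmetic, with the rounded estimates copied by hand from the table above (the object names are made up for illustration):

```r
# Rounded estimates copied by hand from the regression table
b <- c(intercept = -15.0, audience = 1.01,
       drama = 22.8, horror = -15.2,
       audience_drama = -0.253, audience_horror = 0.365)

# Intercept and slope of each genre's fitted line;
# [[ ]] extracts the value without carrying its name along
genre_lines <- data.frame(
  Genre = c("Action", "Drama", "Horror"),
  intercept = c(b[["intercept"]],
                b[["intercept"]] + b[["drama"]],
                b[["intercept"]] + b[["horror"]]),
  slope = c(b[["audience"]],
            b[["audience"]] + b[["audience_drama"]],
            b[["audience"]] + b[["audience_horror"]])
)
genre_lines
```

With these rounded values, the Drama line is roughly 7.8 + 0.757 × AudienceScore and the Horror line is roughly -30.2 + 1.375 × AudienceScore.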

---

* Evidence of curvature?

---

```r
ggplot(data = movies2, mapping = aes(x = AudienceScore, 
                                     y = RottenTomatoes, 
                                     color = Genre)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, se = FALSE, 
              formula = y ~ poly(x, degree = 2)) +
  geom_text_repel(data = outliers, mapping = aes(label = Movie),
                  force = 10)
```

---

### Fitting the New Model

```r
mod2 <- lm(RottenTomatoes ~ poly(AudienceScore, 
                                 degree = 2, 
                                 raw = TRUE)*Genre, 
           data = movies2)
get_regression_table(mod2) 
```

```
## # A tibble: 9 x 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                  9.92     14.9       0.668   0.505  -19.3     39.1  
## 2 poly(AudienceScore, de…    0.098     0.515     0.191   0.849   -0.916    1.11 
## 3 poly(AudienceScore, de…    0.008     0.004     1.79    0.075   -0.001    0.016
## 4 GenreDrama                88.9      24.5       3.62    0       40.6    137.   
## 5 GenreHorror              -23.8      31.1      -0.765   0.445  -84.9     37.3  
## 6 poly(AudienceScore, de…   -2.61      0.84     -3.11    0.002   -4.26    -0.956
## 7 poly(AudienceScore, de…    0.019     0.007     2.78    0.006    0.006    0.032
## 8 poly(AudienceScore, de…    0.574     1.22      0.469   0.639   -1.83     2.98 
## 9 poly(AudienceScore, de…   -0.001     0.012    -0.061   0.951   -0.024    0.022
```
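
With `raw = TRUE`, `poly()` uses the plain powers of the variable instead of orthogonal polynomials, so the estimates can be read as coefficients on the score and the squared score. Here is a minimal sketch on simulated data (the data, seed, and object names below are made up, not the movies data) suggesting that this matches writing the squared term by hand with `I(x^2)`:

```r
# Simulated data with a mild curve; all values are hypothetical
set.seed(141)
sim <- data.frame(x = runif(100, min = 20, max = 95))
sim$y <- 10 + 0.1 * sim$x + 0.008 * sim$x^2 + rnorm(100, sd = 5)

# Quadratic fit with raw polynomial terms
fit_poly <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = sim)

# Same quadratic fit with the squared term written out by hand
fit_hand <- lm(y ~ x + I(x^2), data = sim)

# Compare the estimates, ignoring the differing term labels
all.equal(unname(coef(fit_poly)), unname(coef(fit_hand)))
```

Without `raw = TRUE`, the fitted curve would be the same but the individual coefficients would refer to orthogonal polynomial terms and would be harder to interpret directly.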

---

### Considering Other Explanatory Variables

```r
movies2 %>%
  select(RottenTomatoes, AudienceScore, OpeningWeekend, 
         DomesticGross) %>%
  na.omit() %>%
  cor()
```

```
##                RottenTomatoes AudienceScore OpeningWeekend DomesticGross
## RottenTomatoes           1.00          0.68           0.13          0.23
## AudienceScore            0.68          1.00           0.31          0.42
## OpeningWeekend           0.13          0.31           1.00          0.88
## DomesticGross            0.23          0.42           0.88          1.00
```

---

class: center, middle, inverse

### Let's Practice!

---

### Model Building Guidance