class: center, middle

### Linear Regression With a Categorical Explanatory Variable

<img src="img/DAW.png" width="450px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 5 | Fall 2020]
</span>

---

## Announcements/Reminders

* Project Assignment 1 is due on Friday October 2nd (end of day) on Gradescope.

--

* Lab 4 due before your lab this week.
    + No coding, just narrative.

---

## Week 5 Topics

* **Modeling**

---

# Goals for Today

* Data Ethics
* Linear regression model when the predictor is categorical
    + Impact on the meaning of the coefficients

---

### Data Ethics

> "Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations." -- Committee on Professional Ethics of the American Statistical Association (ASA)

--

The ASA have created ["Ethical Guidelines for Statistical Practice"](https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx)

--

→ These guidelines are for EVERYONE doing statistical work.

--

→ There are ethical decisions at all steps of the Data Analysis Process.

--

→ We will periodically refer to specific guidelines throughout this class.

--

> "Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical."

---

class: inverse, center, middle

## Responsibilities to Research Subjects

> "The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research."
---

## Responsibilities to Research Subjects

> "Protects the privacy and confidentiality of research subjects and data concerning them, whether obtained from the subjects directly, other persons, or existing records."

### NHANES

<img src="wk05_wed_files/figure-html/unnamed-chunk-1-1.png" width="360" />

---

### Simple Linear Regression

Consider this model when:

--

* Response variable `\((y)\)`: quantitative

--

* Explanatory variable `\((x)\)`: quantitative
    + Have only ONE explanatory variable.

--

* AND, `\(f()\)` can be approximated by a line.

---

### Linear Regression

Linear regression is a flexible class of models that allow for:

* Both quantitative and categorical explanatory variables.

--

* Multiple explanatory variables.

--

* Curved relationships between the response variable and the explanatory variable.

--

* BUT the response variable is quantitative.

********************

### Explore today: Linear Regression where

--

* Response variable `\((y)\)`: quantitative

--

* Have 1 categorical explanatory variable `\((x)\)` with two categories.

---

### Example: The Smile-Leniency Effect

**Can a simple smile have an effect on punishment assigned following an infraction?**

In a 1995 study, Hecht and LaFrance examined the effect of a smile on the leniency of disciplinary action for wrongdoers. Participants in the experiment took on the role of members of a college disciplinary panel judging students accused of cheating. For each suspect, along with a description of the offense, a picture was provided with either a smile or neutral facial expression. A leniency score was calculated based on the disciplinary decisions made by the participants.

* Response variable:
* Explanatory variable:

---

### Model Form

$$
`\begin{align}
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

--

First, we need to convert the categories of `\(x\)` to numbers.

--

Before building the model, let's explore and visualize the data!
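A minimal sketch of what "converting the categories to numbers" can look like, assuming R's default indicator (0/1) coding. The `toy` data frame and its values are hypothetical, made up only for illustration; they are not the study data.

```r
# Toy data: hypothetical values, NOT the study data
toy <- data.frame(
  Leniency = c(5.0, 3.5, 4.0, 6.0),
  Group    = c("smile", "neutral", "neutral", "smile")
)

# model.matrix() shows the numeric columns that lm() would build from
# this formula: a column of 1s for the intercept and a 0/1 indicator
# column for Group. Categories are ordered alphabetically by default,
# so "neutral" becomes x = 0 and "smile" becomes x = 1.
X <- model.matrix(Leniency ~ Group, data = toy)
X
```

The indicator column is named `Groupsmile`: the variable name followed by the category coded as 1.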
```r
library(tidyverse)

# Load data
smiles <- read_csv("/home/courses/math141f18/Data/Smiles.csv")
glimpse(smiles)
```

```
## Rows: 68
## Columns: 2
## $ Leniency <dbl> 7.0, 3.0, 6.0, 4.5, 3.5, 4.0, 3.0, 3.0, 3.5, 4.5, 7.0, 5.0, …
## $ Group    <chr> "smile", "smile", "smile", "smile", "smile", "smile", "smile…
```

* What `dplyr` functions should I use to find the mean and sd of `Leniency` by the categories of `Group`?
* What graph should we use to visualize the `Leniency` scores by `Group`?

---

```r
# Summarize
smiles %>%
  group_by(Group) %>%
  summarize(count = n(),
            mean_len = mean(Leniency),
            sd_len = sd(Leniency))
```

```
## # A tibble: 2 x 4
##   Group   count mean_len sd_len
##   <chr>   <int>    <dbl>  <dbl>
## 1 neutral    34     4.12   1.52
## 2 smile      34     4.91   1.68
```

```r
# Visualize
ggplot(smiles, aes(x = Group, y = Leniency)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", color = "purple")
```

<img src="wk05_wed_files/figure-html/unnamed-chunk-4-1.png" width="360" />

---

## Side-bar: Double Encoding

```r
# Visualize
ggplot(smiles, aes(x = Group, y = Leniency, fill = Group)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", color = "purple") +
  # Group is already on the x-axis, so drop the redundant fill legend
  guides(fill = "none")
```

<img src="wk05_wed_files/figure-html/unnamed-chunk-5-1.png" width="360" />

---

### Fit the Linear Regression Model

Model Form:

$$
`\begin{align}
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

--

When `\(x = 0\)`:

--

<br>

When `\(x = 1\)`:

--

```r
mod <- lm(Leniency ~ Group, data = smiles)
library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     4.12      0.275     15.0    0        3.57      4.67
## 2 Groupsmile    0.794     0.389      2.04   0.045    0.017     1.57
```

---

### Notes

1. When the explanatory variable is categorical, `\(\beta_o\)` and `\(\beta_1\)` no longer represent the intercept and slope.

--

2. Now `\(\beta_o\)` represents the (population) mean of the response variable when `\(x = 0\)`.

--

3. And, `\(\beta_1\)` represents the change in the (population) mean response going from `\(x = 0\)` to `\(x = 1\)`.

--

4. Can also do prediction:

```r
new <- data.frame(Group = c("smile", "neutral"))
predict(mod, newdata = new)
```

```
##   1   2 
## 4.9 4.1
```

---

## Survey

You should have received an email from Lauren that contains

* A randomly generated number
* A link to a short (less than 2 minute) survey

Please complete this survey by **3pm today**.

Please do NOT ask anyone or the internet for help with these questions. It is okay if you don't know the answers. We want your guesses. (You will in no way be graded on the accuracy of your guesses!)
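---

### Checking the Coefficients Against the Group Means

A quick arithmetic check of the notes on `\(\beta_o\)` and `\(\beta_1\)`, using only the rounded values reported in the summary table and the regression table. The hat notation for the estimates is an addition here, not from the slides, and the check assumes `\(x = 1\)` means `smile`, as the term name `Groupsmile` suggests.

$$
`\begin{align}
\hat{\beta}_o &= 4.12 = \text{mean Leniency for neutral } (x = 0) \\
\hat{\beta}_o + \hat{\beta}_1 &= 4.12 + 0.794 = 4.914 \approx 4.91 = \text{mean Leniency for smile } (x = 1)
\end{align}`
$$

So the estimated `Groupsmile` coefficient, 0.794, matches the difference between the two group means up to rounding.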