3 - Linear Regression Continued

Alex John Quijano

09/17/2021

Previously on Linear Regression…


Previously on Least Squares…


Least-Squares Example Visualization: Shown here is some data (orange dots) and the best-fit linear model (red line), \(y = 5.37 + 0.62x\). You can try this [least-squares regression interactive demo](https://phet.colorado.edu/sims/html/least-squares-regression/latest/least-squares-regression_en.html) to visualize how it works.


Linear Regression Continued…



In this lecture, we will learn about:

Outliers in Linear Regression


Three plots, each with a least squares line and corresponding residual plot. Each dataset has at least one outlier. Image Source: [OpenIntro: IMS Section 7.3](https://openintro-ims.netlify.app/model-slr.html#outliers-in-regression).


Outliers in Linear Regression


Types of outliers: points that fall horizontally far from the center of the cloud of points are called high leverage points, and high leverage points that actually pull on the slope of the regression line are called influential points.

We must be cautious about removing outliers in our modeling. Sometimes outliers are interesting cases that might be worth investigating, and they might even make a model much better.


Try out this [least-squares regression interactive demo](https://phet.colorado.edu/sims/html/least-squares-regression/latest/least-squares-regression_en.html) to play around with outliers in least squares regression.
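To see this concretely, here is a small R sketch (using made-up data, not data from the lecture) that fits a least squares line with and without a single high-leverage point and compares the two slopes:

```r
# Simulated illustration: one high-leverage point can change the fit
set.seed(42)
x <- runif(30, 0, 10)
y <- 5 + 0.6 * x + rnorm(30, sd = 1)

fit_clean <- lm(y ~ x)

# Add a single point far from the cloud in the x-direction
x_out <- c(x, 25)
y_out <- c(y, 2)
fit_outlier <- lm(y_out ~ x_out)

coef(fit_clean)    # slope near the true value 0.6
coef(fit_outlier)  # slope pulled toward the influential point
```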

Categorical Predictors with Two Levels

Example:

Total auction prices for the video game Mario Kart, divided into used (\(x = 0\)) and new (\(x = 1\)) condition games. The least squares regression line is also shown.

\[\widehat{\texttt{price}} = 42.87 + 10.9 \times \texttt{condnew}\]

Categorical Predictors with Two Levels


Interpreting the slope and intercept for a two-level categorical predictor:

Example:

\[\widehat{\texttt{price}} = 42.87 + 10.9 \times \texttt{condnew}\]

The intercept indicates that the average selling price of a used version of the game is \(\$42.87\). The slope indicates that, on average, new games sell for about \(\$10.90\) more than used games.
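In R, `lm()` handles a two-level categorical predictor automatically by creating the 0/1 indicator for us. Here is a minimal sketch assuming the `mariokart` data from the `openintro` package (the textbook analysis sets aside two extreme auctions before fitting; the $100 cutoff below is an assumption used to mirror that):

```r
# Two-level categorical predictor: lm() encodes the factor as a 0/1 indicator
library(openintro)  # assumes the openintro package is installed

mk <- subset(mariokart, total_pr < 100)           # drop the extreme auctions
mk$cond <- relevel(factor(mk$cond), ref = "used") # make "used" the baseline

fit <- lm(total_pr ~ cond, data = mk)
coef(fit)
# (Intercept) : average price of used games (about 42.87)
# condnew     : average premium for new games (about 10.90)
```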

Linear Regression Model with Multiple Predictors


Since introducing simple linear regression (SLR) and the least squares method, we have been using one predictor and one outcome.


A multiple linear regression (MLR) model is a linear model with two or more predictors. In general, we write the model as

\[\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k\]

when there are \(k\) predictors. In practice, we always calculate the coefficients \(b_i\) using R.
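In R, additional predictors are simply added to the model formula. A hypothetical call (variable and data names are placeholders, not from the lecture) looks like this:

```r
# Hypothetical multiple regression with k = 3 predictors;
# lm() computes b0, b1, ..., bk by least squares
fit <- lm(y ~ x1 + x2 + x3, data = mydata)
summary(fit)  # coefficients, R-squared, adjusted R-squared
```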

Linear Regression Model with Multiple Predictors


Key points:

R-Squared vs Adjusted R-Squared

The R-squared is a measure of how well a simple linear regression fits the data, but it is a biased estimate of the amount of variability explained by the model when applied to a model with more than one predictor.

\[R^2 = 1 - \frac{SSE}{SST}\]

The adjusted R-squared describes the power of a regression model while accounting for the number of predictors, making it possible to compare models that contain different numbers of predictors.

This is computed as

\[ \begin{aligned} R_{adj}^{2} &= 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)} {s_{\text{outcome}}^2 / (n-1)} \\ &= 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1} \end{aligned} \]

where \(n\) is the number of observations used to fit the model and \(k\) is the number of predictor variables in the model.

The quantity \(n-k-1\) is called the degrees of freedom of the residuals.
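As a check on these formulas, the sketch below (simulated data, hypothetical variable names) computes \(R^2\) and \(R_{adj}^{2}\) by hand and compares them to the values reported by `summary()`:

```r
# Verify the R-squared and adjusted R-squared formulas on simulated data
set.seed(1)
n <- 100; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

SSE <- sum(residuals(fit)^2)   # sum of squared errors
SST <- sum((y - mean(y))^2)    # total sum of squares

r2     <- 1 - SSE / SST
r2_adj <- 1 - (SSE / (n - k - 1)) / (SST / (n - 1))

c(r2, summary(fit)$r.squared)          # should agree
c(r2_adj, summary(fit)$adj.r.squared)  # should agree
```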

R-Squared vs Adjusted R-Squared

Key Points:

Additional Thoughts:

Summary

In this lecture, we talked about the following:

Looking ahead: In the next few weeks, we will use the concept of probability to assess whether sample statistics (e.g., mean, slope, intercept, coefficients) are likely not due to chance at a given statistical significance level.


In the next lecture, we will talk about:

Today’s Activity

Work with your group to discuss answers to the following problems. Consider an SLR model with one categorical predictor with two levels. We want to build a model that predicts a numerical response variable called “complexity” from a predictor that is either group “A” or group “B”.

  1. We can simplify the model by indicating that an observation is either group B or not group B (which means group A). Write the linear model equation in this context using the variable \(x_B\), intercept \(b_0\), and slope \(b_1\). Explain the meaning of the variable \(x_B\) and its inputs, and interpret the slope and intercept in this particular context.

  2. Suppose that we fit this model to the data, shown as boxplots on the right. The least squares regression method yields the intercept \(b_0 =\) 150.174 and the slope \(b_1 =\) 20.557. If \(x_{B} = 0\), meaning the input is group A, what is the estimated mean complexity? If \(x_{B} = 1\), meaning the input is group B, what is the estimated mean complexity? Do these estimates fall close to the actual means?
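After working these out by hand, you can use a short R sketch like this one to check your arithmetic (the numbers are the fitted values given in the problem):

```r
# Check of activity question 2: predicted mean complexity for each group
b0 <- 150.174   # intercept from the problem
b1 <- 20.557    # slope from the problem

x_B <- c(0, 1)  # 0 = group A, 1 = group B
b0 + b1 * x_B   # estimated mean complexity for groups A and B
```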