Alex John Quijano
09/17/2021
The Simple Linear Regression model, which we fit to the data as \[ y = b_0 + b_1 x + e \] where \(\hat{y} = b_0 + b_1 x\) is the predicted/estimated outcome, \(x\) is the predictor, \(b_0\) is the intercept, \(b_1\) is the slope, and \(e\) is the error.
The Residuals, which are the differences between the observed values \(y_i\) and the estimated/predicted values \(\hat{y}_i\): \[ e_i = y_i - \hat{y}_i \]
The Correlation, a metric describing the strength of the linear association between two numerical variables, taking values between -1 (perfect negative) and +1 (perfect positive).
Evaluating linear models using residual scatter plots, with idealized examples of biased vs. unbiased patterns and heteroscedastic vs. homoscedastic errors.
The derivation of the ordinary least squares method for finding the best-fit linear model for the data.
The coefficient of determination \(R^2\) for evaluating the linear model.
Least-Squares Example Visualization: Shown here is some data (orange dots) and the best-fit linear model (red line) y = 5.37 + 0.62x. You can try this least-squares regression interactive demo to visualize how it works.
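To make the recap concrete, here is a minimal R sketch that fits a simple linear regression with lm(). The data simulated below are hypothetical placeholders (they will not reproduce the exact line y = 5.37 + 0.62x from the demo); the point is only to show where the fitted coefficients, residuals, and correlation come from.

```r
# Hypothetical data: x and y with a roughly linear relationship
set.seed(1)
x <- runif(50, min = 0, max = 20)
y <- 5 + 0.6 * x + rnorm(50, sd = 2)

# Fit the simple linear regression: y-hat = b0 + b1 * x
fit <- lm(y ~ x)
coef(fit)            # b0 (intercept) and b1 (slope)

# Residuals: e_i = y_i - y-hat_i
head(residuals(fit))

# Correlation between the two numerical variables (always between -1 and +1)
cor(x, y)
```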
In this lecture, we will learn about:
Weaknesses of the linear regression model - involving outliers.
Linear regression with categorical predictor.
Linear regression with multiple predictors.
Three plots, each with a least squares line and corresponding residual plot. Each dataset has at least one outlier. Image Source: OpenIntro: IMS Section 7.3.
A: There is one outlier far from the other points, though it only appears to slightly influence the line.
B: There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn’t very influential.
C: There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud doesn’t appear to fit very well.
Note that the residual plots here show the predicted values \(\hat{y}\) vs. the residuals instead of the explanatory variable vs. the residuals. These are two different ways to look at residuals when evaluating a linear model.
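The sketch below shows both ways of looking at residuals; it assumes an lm object named fit and its predictor x (for example, the hypothetical fit from the recap sketch above).

```r
# Predicted values vs. residuals (as in the residual plots above)
plot(fitted(fit), residuals(fit),
     xlab = "Predicted y", ylab = "Residuals")
abline(h = 0, lty = 2)

# Explanatory variable vs. residuals
plot(x, residuals(fit),
     xlab = "Explanatory variable x", ylab = "Residuals")
abline(h = 0, lty = 2)
```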
Types of outliers.
A point (or a group of points) that stands out from the rest of the data is called an outlier.
Outliers that fall horizontally away from the center of the cloud of points are called leverage points.
Outliers that influence the slope of the line are called influential points.
We must be cautious about removing outliers in our modeling. Sometimes outliers are interesting cases that might be worth investigating, and they might even make a model much better.
Try out this least-squares regression interactive demo to play around with outliers in least squares regression.
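To see how a single influential point can pull the fitted slope, here is a small hypothetical sketch that fits the same data with and without one extreme point and compares the coefficients.

```r
# Hypothetical data: 30 points in a cloud plus one extreme point
set.seed(2)
x_out <- c(runif(30, 0, 10), 25)                    # last point has high leverage
y_out <- c(2 + 1.5 * x_out[1:30] + rnorm(30), 5)    # ...and sits far below the trend

fit_with    <- lm(y_out ~ x_out)                    # all 31 points
fit_without <- lm(y_out[-31] ~ x_out[-31])          # outlier removed

coef(fit_with)      # slope pulled down by the influential point
coef(fit_without)   # slope estimated from the main cloud only
```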
Example:
Total auction prices for the video game Mario Kart, divided into used (\(x = 0\)) and new (\(x = 1\)) condition games. The least squares regression line is also shown.
Categorical variables are also useful in predicting outcomes. Here we consider a categorical predictor with two levels (recall that a level is the same as a category).
Since the predictor is a categorical variable with text as levels, we can simplify the model by indicating whether the game is new or not new (i.e., used). This indicator variable takes a value of 1 (new) or 0 (used).
Using this indicator variable and using least squares to estimate the intercept and slope, the linear model may be written as
\[\widehat{\texttt{price}} = 42.87 + 10.9 \times \texttt{condnew}\]
Interpreting the slope and intercept for two-level categorical predictor:
The estimated intercept is the average value of the outcome variable for the reference category (i.e., the category corresponding to an indicator value of 0).
The estimated slope is the average change in the outcome variable between the two categories.
Example:
\[\widehat{\texttt{price}} = 42.87 + 10.9 \times \texttt{condnew}\]
The average selling price of a used version of the game is \(42.9\). The slope indicates that, on average, new games sell for about $10.9 more than used games.
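Here is a minimal R sketch of the indicator-variable idea using a small hypothetical data set (the numbers are made up and will not reproduce the Mario Kart coefficients). R's lm() builds the 0/1 indicator automatically when the predictor is a factor.

```r
# Hypothetical auction data: price in dollars, condition "used" or "new"
games <- data.frame(
  price = c(40, 45, 38, 52, 55, 49, 41, 57),
  cond  = factor(c("used", "used", "used", "new",
                   "new", "new", "used", "new"),
                 levels = c("used", "new"))   # "used" is the reference level (indicator = 0)
)

# lm() creates the indicator variable condnew (1 = new, 0 = used) internally
fit_games <- lm(price ~ cond, data = games)
coef(fit_games)
# (Intercept) = average price of used games
# condnew     = average price difference, new minus used
```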
Since introducing the Simple Linear Regression (SLR) and least squares regression, we have been using one predictor and one outcome.
A multiple linear regression (MLR) model is a linear model with many predictors. In general, we write the model as
\[\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k\]
when there are \(k\) predictors. We always calculate \(b_i\) using R.
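As a hedged sketch of how such a model is fit in R with lm(): the data frame dat and the predictor names x1, x2, x3 below are hypothetical placeholders, not variables from a real data set.

```r
# Hypothetical data frame with one outcome and three numerical predictors
set.seed(3)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + 0.3 * dat$x3 + rnorm(100)

# Multiple linear regression: y-hat = b0 + b1*x1 + b2*x2 + b3*x3
fit_mlr <- lm(y ~ x1 + x2 + x3, data = dat)
coef(fit_mlr)   # b0, b1, b2, b3 estimated by least squares
```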
Key points:
Multicollinearity is a common issue, and it complicates model estimation. If two or more explanatory variables are correlated (or collinear), it can cause problems when fitting the model (the fit becomes very sensitive to small changes) and when interpreting the results. Collinear explanatory variables reduce the precision of the estimated coefficients, which weakens the statistical power of your regression model.
However, MLR can be a good model for a more complicated data set that needs more than one explanatory variable to model and interpret the underlying system.
The interpretation of the slope coefficients (\(b_1\), \(\cdots\), \(b_k\)): each is the mean change in the outcome variable for a 1-unit change in the corresponding explanatory variable, holding all of the other explanatory variables constant.
We will revisit Multiple Linear Regression (MLR) later in the course after we learn about probability, which can help us decide which explanatory variables to keep.
The R-squared is a measure of how well the simple linear regression fits the data, but it is a biased estimate of the amount of variability explained by the model when applied to a model with more than one predictor.
\[R^2 = 1 - \frac{SSE}{SST}\]
The adjusted R-squared describes the strength of the regression model fit and is useful for comparing models that contain different numbers of predictors.
This is computed as
\[ \begin{aligned} R_{adj}^{2} &= 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)} {s_{\text{outcome}}^2 / (n-1)} \\ &= 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1} \end{aligned} \]
where \(n\) is the number of observations used to fit the model and \(k\) is the number of predictor variables in the model.
The term \(n-k-1\) is called the degrees of freedom (of the residuals).
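In R, both quantities are reported by summary() for a fitted lm object. The sketch below assumes the hypothetical fit_mlr from the MLR example above.

```r
s <- summary(fit_mlr)   # fit_mlr <- lm(y ~ x1 + x2 + x3, data = dat)
s$r.squared             # R^2 = 1 - SSE/SST
s$adj.r.squared         # adjusted R^2, penalized for the number of predictors k
```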
Key Points:
Both \(R^2\) and \(R_{adj}^{2}\) tell you the percentage of variation in the data explained by the model.
The choice between \(R^2\) and \(R_{adj}^{2}\) depends on the number of observations and the number of predictors.
If you have enough observations and a small number of predictors, the difference between \(R^2\) and \(R_{adj}^{2}\) is small. If you have a lot of predictors, it is common to use \(R_{adj}^{2}\).
Additional Thoughts:
\(R^2\) and \(R_{adj}^{2}\) do not tell you whether the model's errors are homoscedastic or heteroscedastic. You can look at the residual scatterplots to check for homoscedasticity or heteroscedasticity.
\(R^2\) and \(R_{adj}^{2}\) do not tell you whether two or more explanatory variables are collinear. You can use the correlation coefficient to check whether two or more variables are collinear.
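Both checks can be done quickly in R. The sketch below again assumes the hypothetical fit_mlr and data frame dat from the MLR example.

```r
# Residual scatterplot to eyeball homoscedastic vs. heteroscedastic errors
plot(fitted(fit_mlr), residuals(fit_mlr),
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Pairwise correlations among explanatory variables to screen for collinearity
cor(dat[, c("x1", "x2", "x3")])
```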
In this lecture, we talked about the following:
Types of outliers in the data and how they can affect the least squares regression model.
Categorical Predictors with Two Levels in a simple regression model and its interpretations.
A short exposure to the Multiple Linear Regression (MLR) model.
The difference between R-squared and the Adjusted R-squared.
Looking ahead: In the next few weeks, we will talk about using the concept of probability to determine whether sample statistics (e.g., mean, slope, intercept, coefficients) are likely not due to chance at a given statistical significance level.
In the next lecture, we will talk about:
Work with your group to discuss answers to the following problems. Consider an SLR with one categorical predictor with two levels. We want to build a model that predicts a numerical response variable called “complexity” using a predictor that is either group “A” or group “B”.
We can simplify the model by indicating that the group is either group B or not group B (which means group A). Write the linear model equation in this context using the variable \(x_B\), with intercept \(b_0\) and slope \(b_1\). Explain the meaning of the variable \(x_B\) and its inputs, and interpret the slope and intercept in this particular context.
Suppose that we fit this model to the data, shown as boxplots on the right. The least squares regression method yields the intercept \(b_0 =\) 150.174 and the slope \(b_1 =\) 20.557. If \(x_{B} = 0\), which means the input is group A, what is the estimated mean complexity? If \(x_{B} = 1\), which means the input is group B, what is the estimated mean complexity? Do these fall close to the actual means?
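If you want to check your arithmetic afterwards, the fitted equation can be evaluated directly; this sketch simply plugs the two indicator values into the given intercept and slope.

```r
b0 <- 150.174
b1 <- 20.557

b0 + b1 * 0   # estimated mean complexity for group A (x_B = 0)
b0 + b1 * 1   # estimated mean complexity for group B (x_B = 1)
```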