Alex John Quijano
09/15/2021
A linear model is written as
\[ y = \beta_0 + \beta_1 x + e \]
where \(y\) is the outcome, \(x\) is the predictor, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope. The notation \(e\) is the model’s error.
Notation:
Population Parameters: \(\beta_0\) and \(\beta_1\)
Sample statistics (point estimates for the parameters): \(b_0\) and \(b_1\)
Estimated/Predicted outcome: \(\hat{y} = b_0 + b_1 x\)
Residuals are the leftover variation in the data after accounting for the model fit:
The residual of the \(i^{th}\) observation \((x_i,y_i)\) is the difference between the observed outcome (\(y_i\)) and the estimated/predicted outcome (\(\hat{y}_i\)).
\[ e_i = y_i - \hat{y}_i \]
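As a quick illustration, here is a minimal Python sketch of this definition; the data values and the point estimates \(b_0\) and \(b_1\) are made up for this example only, not taken from the lecture.

```python
import numpy as np

# Hypothetical example data (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Assumed point estimates b0 and b1, chosen only for illustration.
b0, b1 = 0.2, 1.95

y_hat = b0 + b1 * x      # estimated/predicted outcomes
residuals = y - y_hat    # e_i = y_i - y_hat_i
print(residuals)
```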
In this lecture, we will learn about:
Objectively finding the linear regression model that best fits the data using the ordinary least squares method.
Least-squares derivation.
Residual analysis describing the strength of the linear regression fit.
Sample data with their best fitting lines (top row) and their corresponding residual plots (bottom row).
Terms:
Homoscedastic residuals \(\longrightarrow\) constant variance.
Heteroscedastic residuals \(\longrightarrow\) non-constant variance.
The data consists of randomly sampled, independent observations.
The residuals are normally distributed and exhibit homoscedasticity (a quick visual check is sketched after this list).
The independent variables are uncorrelated with the error term.
The regression model is linear in the coefficients and the error term.
The independent variables should not be collinear, that is, correlated with each other.
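A common way to eyeball the residual assumptions is to plot a histogram of the residuals and a residuals-versus-fitted scatterplot. The sketch below uses simulated data and a numpy polynomial fit purely for illustration; in practice you would use the residuals from your own fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 5.0 + 0.6 * x + rng.normal(0, 1, 100)

b1, b0 = np.polyfit(x, y, 1)      # np.polyfit returns [slope, intercept] for degree 1
y_hat = b0 + b1 * x
residuals = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(residuals, bins=15)      # roughly bell-shaped -> normality is plausible
ax1.set_title("Histogram of residuals")
ax2.scatter(y_hat, residuals)     # no funnel shape -> homoscedasticity is plausible
ax2.axhline(0, color="red")
ax2.set_title("Residuals vs. fitted values")
plt.show()
```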
The Sum of Squared Errors (SSE) is a metric of the leftover variability in the \(y\) values when we know \(x\).
\[ SSE = \sum_{i=1}^n (e_i)^2 \]
The Total Sum of Squares (SST) is a metric that measures the variability in the \(y\) values by how far they tend to fall from their mean, \(\bar{y}\).
\[ SST = \sum_{i=1}^n (y_i - \bar{y})^2 \]
where \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\), and \(n\) is the number of observations.
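A minimal Python sketch of these two sums, assuming you already have observed values \(y\) and fitted values \(\hat{y}\) (the numbers here are hypothetical):

```python
import numpy as np

# Hypothetical observed and fitted values for illustration.
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_hat = np.array([2.15, 4.10, 6.05, 8.00, 9.95])

sse = np.sum((y - y_hat) ** 2)       # leftover variability after the fit
sst = np.sum((y - y.mean()) ** 2)    # total variability around the mean of y
print(sse, sst)
```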
To find the best linear fit, we minimize the SSE.
\[ SSE = \sum_{i=1}^n (e_i)^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \] Plugging in the linear equation \(\hat{y}_i = b_0 + b_1 x_i\), we have
\[ SSE = \sum_{i=1}^n (y_i - (b_0 + b_1 x_i))^2. \] Minimizing the above equation over all possible values of \(b_0\) and \(b_1\) is a calculus problem. Take the partial derivatives of SSE with respect to \(b_0\) and \(b_1\), set them equal to zero, and solve for \(b_0\) and \(b_1\).
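For those who want the intermediate step, setting both partial derivatives equal to zero gives the pair of normal equations
\[ \begin{aligned} \frac{\partial \, SSE}{\partial b_0} = & -2 \sum_{i=1}^n \left( y_i - (b_0 + b_1 x_i) \right) = 0 \\ \frac{\partial \, SSE}{\partial b_1} = & -2 \sum_{i=1}^n x_i \left( y_i - (b_0 + b_1 x_i) \right) = 0 \end{aligned} \]
which can be solved simultaneously for \(b_0\) and \(b_1\).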
Long story short,
\[ \begin{aligned} b_1 = & \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ b_0 = & \bar{y} - b_1 \bar{x} \end{aligned} \] where \(\bar{x}\) and \(\bar{y}\) are the mean of \(x\) and \(y\) respectively.
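As a sanity check, here is a minimal Python sketch that computes \(b_1\) and \(b_0\) directly from these formulas and compares them with numpy's built-in least-squares polynomial fit. The data is simulated, so the numbers are illustrative only.

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 5.37 + 0.62 * x + rng.normal(0, 1, 50)

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# np.polyfit minimizes the same SSE, so the estimates should match.
b1_np, b0_np = np.polyfit(x, y, 1)
print(b0, b1)
print(b0_np, b1_np)
```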
We can rewrite the slope as follows \[ \begin{aligned} b_1 & = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ & = \frac{\sqrt{\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}} \frac{\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} \\ b_1 & = \frac{s_y}{s_x} r \end{aligned} \] where \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\) respectively, and \(r\) is the Pearson correlation coefficient of \(x\) and \(y\).
Least-Squares Example Visualization: Shown here is some data (orange dots) and the best fit linear model (red line) \(y = 5.37 + 0.62x\). You can try this least-squares regression interactive demo to visualize how it works.
To find the best fit linear model to data, we compute the slope and intercept by using the correlation and standard deviations.
\[ \begin{aligned} \text{mean of x} \longrightarrow & \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \\ \text{mean of y} \longrightarrow & \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \\ \text{standard deviation of x} \longrightarrow & s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} \\ \text{standard deviation of y} \longrightarrow & s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2} \\ \text{correlation of x and y} \longrightarrow & r = \frac{\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} \\ \text{best fit slope} \longrightarrow & b_1 = \frac{s_y}{s_x} r \\ \text{best fit intercept} \longrightarrow & b_0 = \bar{y} - b_1 \bar{x} \end{aligned} \] Note that the \(n-1\) term (known as Bessel’s correction) makes the sample variance \(s^2\) an unbiased estimator of the population variance \(\sigma^2\).
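The same recipe can be written out step by step in Python. This sketch (again on simulated data, so the numbers are illustrative only) computes the means, standard deviations, and correlation, and then builds the slope and intercept from them:

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 5.37 + 0.62 * x + rng.normal(0, 1, 50)

s_x = x.std(ddof=1)              # ddof=1 applies Bessel's correction (n - 1)
s_y = y.std(ddof=1)
r = np.corrcoef(x, y)[0, 1]      # Pearson correlation coefficient

b1 = (s_y / s_x) * r             # best fit slope
b0 = y.mean() - b1 * x.mean()    # best fit intercept
print(b0, b1)
```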
Using least-squares regression is typically the most practical way to fit a linear model to data.
Writing the slope in terms of the standard deviations and the correlation helps us interpret it more precisely.
The math helps us understand more thoroughly how linear regression works.
The least squares slope \(b_1\) tells us the average change in the predicted value of the response variable when the explanatory variable increases by 1 unit. In terms of the standard deviations and the correlation, the slope is
\[ b_1 = \frac{s_y}{s_x} r \]
The coefficient of determination can then be calculated as
\[ R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} \]
where
\[ SSE = \sum_{i=1}^n (e_i)^2 \hspace{10px} \text{ and } \hspace{10px} SST = \sum_{i=1}^n (y_i - \bar{y})^2. \] The range of \(R^2\) is from 0 to 1. \(R^2\) is a measure of how well the linear regression fits the data.
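A minimal sketch of the \(R^2\) computation, assuming you already have the observed values \(y\) and the fitted values \(\hat{y}\) (the numbers here are hypothetical):

```python
import numpy as np

# Hypothetical observed and fitted values for illustration.
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_hat = np.array([2.15, 4.10, 6.05, 8.00, 9.95])

sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1 - sse / sst
print(r_squared)
```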
Interpretation:
A value of 0 means 0% of the variance in the data is captured by the model (not a good fit).
A value of 1 means 100% of the variance in the data is captured by the model (this can indicate over-fitting, which impacts the model’s ability to generalize to the population).
Ideally, you want an \(R^2\) value close to 1.
In the case of a linear model with one predictor and one outcome, the relationship between the correlation and the coefficient of determination is \(R^2 = r^2\).
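You can check this relationship numerically. The following sketch (simulated data again, with a numpy polynomial fit standing in for the least-squares computation) confirms that \(R^2\) computed from the sums of squares matches the squared Pearson correlation, up to floating-point error:

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 5.0 + 0.6 * x + rng.normal(0, 1, 100)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_squared, r ** 2))   # True: R^2 equals r^2 for one predictor
```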
In this lecture, we talked about the following:
Evaluating linear models by using residual scatter plots with idealized examples of biased vs. unbiased patterns, and heteroscedastic vs. homoscedastic errors.
The derivation of the ordinary least squares method to find the best fit linear model into data.
The coefficient of determination \(R^2\) for evaluating the linear model.
You can try this least-squares regression interactive demo to visualize how it works.
In the next lecture, we will talk about:
Weaknesses of the linear regression model - involving outliers.
Linear regression with categorical predictors.
Linear regression with multiple predictors.
Work within your group to answer the following problems. Consider the plots below.
Scatterplot of variables x and y. The red line is the best fit linear model of the data.
Histogram and scatterplot of the residuals.
The correlation coefficient of the two numerical variables \(x\) and \(y\) is \(r = -0.8355\). The standard deviation of \(x\) is \(s_x = 0.3061\). The standard deviation of \(y\) is \(s_y = 1.2469\). Compute and interpret the slope estimate of the linear fit.
The mean of \(x\) is \(\bar{x} = 0.5240\). The mean of \(y\) is \(\bar{y} = -1.7337\). Apply the point-slope equation using the means \((\bar{x},\bar{y})\) and the slope to write the equation of the red line shown in the scatter plot. What is the equation of the linear model? What is the intercept of the linear model?
The SST of the model is \(153.9204\) and the SSE of the model is \(46.48605\). Compute the coefficient of determination (\(R^2\)). Interpret this value in the context of the model.
Consider the histogram of residuals and the scatterplot of residuals. Explain why a linear regression model is appropriate for the data it represents.