3 - Introduction to Linear Regression

Alex John Quijano

09/13/2021

Previously on the Terminologies…

Modeling Numerical Variables


In this lecture, we will learn about:

High School Graduation and Poverty

The scatterplot on the right shows the relationship between the high school (HS) graduation rate in all 50 US states and DC and the percentage of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

Which variable is the response, and which is the explanatory variable? What relationship does the plot show?

The Linear Model


A linear model is written as

\[ y = \beta_0 + \beta_1 x + e \]

where \(y\) is the outcome, \(x\) is the predictor, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope. The notation \(e\) is the model’s error.
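To make the roles of \(\beta_0\), \(\beta_1\), and \(e\) concrete, here is a minimal simulation sketch. The parameter values, sample size, and error spread are hypothetical choices made only for illustration, and the line is estimated with least squares via `np.polyfit`, which is one common fitting method and is assumed here rather than taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "true" parameters, chosen only for illustration.
beta_0 = 2.0  # intercept
beta_1 = 0.5  # slope

x = rng.uniform(0, 10, size=200)  # predictor values
e = rng.normal(0, 1, size=200)    # error term of the model
y = beta_0 + beta_1 * x + e       # outcome generated by the linear model

# Least-squares estimates from the simulated sample.
# np.polyfit returns coefficients from highest degree to lowest.
b_1, b_0 = np.polyfit(x, y, deg=1)
print(f"b_0 = {b_0:.2f}, b_1 = {b_1:.2f}")
```

The estimates \(b_0\) and \(b_1\) should land near, but not exactly on, the parameters used to generate the data, because of the error term.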

Notation:

The Linear Model



Key Points:

Using a Linear Regression to Predict Poverty

Example Model:

The linear model for predicting poverty from high school graduation rate in the US is

\[ \widehat{\text{poverty}} = 64.78 - 0.62 \times \text{HS}_{\text{grad}} \]

where the sample statistics are the slope \(b_1 = -0.62\) and the intercept \(b_0 = 64.78\).

The “hat” in \(\widehat{\text{poverty}}\) indicates an estimated/predicted outcome.

Example Question:

The high school graduation rate in Georgia is 85.1%.

What poverty rate does the model predict for this state?
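The prediction can be worked out by plugging 85.1 into the fitted equation. A minimal sketch (the function name `predict_poverty` is a hypothetical helper, not part of the lecture):

```python
def predict_poverty(hs_grad_pct):
    """Predicted % in poverty from % HS graduates, using the lecture's fitted line."""
    b_0 = 64.78  # intercept from the lecture
    b_1 = -0.62  # slope from the lecture
    return b_0 + b_1 * hs_grad_pct

georgia = predict_poverty(85.1)
print(round(georgia, 2))  # 12.02
```

That is, the model predicts a poverty rate of about 12% for Georgia.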

Interpreting the Linear Model

The linear model for predicting poverty from high school graduation rate in the US is

\[ \widehat{\text{poverty}} = 64.78 - 0.62 \times \text{HS}_{\text{grad}} \]

where the sample statistics are the slope \(b_1 = -0.62\) and the intercept \(b_0 = 64.78\).
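One way to read the slope, sketched from the fitted equation above. Here \(x\) stands for the HS graduation rate and \(\hat{y}(x)\) for the predicted poverty rate; this shorthand is introduced only for this derivation.

\[ \hat{y}(x+1) - \hat{y}(x) = \left[ b_0 + b_1 (x+1) \right] - \left[ b_0 + b_1 x \right] = b_1 = -0.62 \]

So for each additional percentage point in HS graduation rate, the model predicts a poverty rate that is 0.62 percentage points lower, on average. Setting \(x = 0\) gives \(\hat{y}(0) = b_0 = 64.78\), the predicted poverty rate for a state with a 0% HS graduation rate. No state is anywhere near that value, so the intercept mainly fixes the height of the line and has no practical meaning on its own.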

Eyeballing the line


Which of the following appears to be the line that best fits the linear relationship between % in poverty and % HS grad? Choose one.

Residuals


Residuals are the leftover variation in the data after accounting for the model fit:

Data = Fit + Residual

The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference between the observed outcome \(y_i\) and the estimated/predicted outcome \(\hat{y}_i\):

\[ e_i = y_i - \hat{y}_i \]
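The residual formula can be applied point by point. The sketch below uses the lecture's fitted line for poverty, but the three observations are hypothetical values made up for illustration, not actual state data.

```python
# Fitted line from the lecture: predicted poverty = 64.78 - 0.62 * HS_grad
b_0, b_1 = 64.78, -0.62

# Hypothetical observations (% HS grad, % in poverty), made up for illustration.
observations = [(80.0, 17.0), (85.0, 11.0), (90.0, 9.5)]

residuals = []
for x_i, y_i in observations:
    y_hat_i = b_0 + b_1 * x_i  # predicted outcome on the line
    e_i = y_i - y_hat_i        # residual = observed - predicted
    residuals.append(e_i)
    print(f"x = {x_i}: observed = {y_i}, predicted = {y_hat_i:.2f}, residual = {e_i:+.2f}")
```

A positive residual means the observed point lies above the line (the model underestimates); a negative residual means it lies below the line (the model overestimates).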

Residuals


Example:

Error/Residuals Terminologies


Quantifying the Relationship with Correlation

The relationship between the two numerical variables shown on the right is linear, negative, and moderately strong.

Correlation (notation: \(r\)) describes the strength of the linear association between two numerical variables.
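As a sketch of how \(r\) can be computed, the code below uses one common formula, the sum of products of standardized values divided by \(n - 1\), and checks it against NumPy's `np.corrcoef`. The data are hypothetical values made up for illustration.

```python
import numpy as np

# Hypothetical paired data, made up for illustration.
x = np.array([78.0, 81.0, 84.0, 86.0, 89.0, 91.0])
y = np.array([17.5, 15.0, 13.5, 12.0, 11.0, 8.5])

# Standardize each variable using the sample standard deviation (ddof=1).
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)

# Correlation: sum of products of standardized values, divided by n - 1.
r_manual = (z_x * z_y).sum() / (len(x) - 1)

# Same quantity from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))
```

Both values agree, and the sign is negative because \(y\) decreases as \(x\) increases in this made-up data.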

Quantifying the Relationship with Correlation

Example:

Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) \(r = 0.6\)
(b) \(r = -0.75\)
(c) \(r = -0.1\)
(d) \(r = 0.02\)
(e) \(r = -1.5\)

Assessing the Correlation



Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or -1?

More Correlation Examples


Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a lower value in the other.


More Correlation Examples



 Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, because the relationship is not linear, the correlation is relatively weak.

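A quick way to see why a strong nonlinear relationship can still have a weak correlation: in the sketch below, \(y\) is completely determined by \(x\), yet \(r\) is essentially zero. The data are synthetic and chosen only for illustration.

```python
import numpy as np

# Synthetic data: y is an exact function of x, but the relationship is a parabola.
x = np.linspace(-3, 3, 61)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")  # very close to zero
```

Because the upward trend on the right cancels the downward trend on the left, the linear association is essentially absent even though the relationship itself is perfect.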

Summary

In this lecture, we talked about the following:

Today’s Activity

Discuss your answers to the following exercise with your group. The exercise is adapted, with slight modifications, from OpenIntro: Introduction to Modern Statistics, Section 7.5.

The Coast Starlight, regression. The Coast Starlight Amtrak train runs from Seattle to Los Angeles. The scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes). The correlation between travel time and distance is 0.636. The plot also shows the best linear fit (red line) to the data, with an estimated slope of 0.7259 and an intercept of 50.8842.

  1. Write the equation of the regression line for predicting travel time.
  2. Interpret the slope and the intercept in this context. Explain why the interpretation of the intercept does not make sense here and should not be used.
  3. Explain the correlation in this context.
  4. The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.
  5. It actually takes about 168 minutes to travel from Santa Barbara to Los Angeles. Calculate the residual and explain what this residual value means.