Previously on the Terminologies…

Population and Sample
- A parameter refers to the population
- A statistic refers to the sample
Association vs Independence
Explanatory and Response variables
Numerical and Categorical Variables
- Subtypes of numerical variables are Discrete and Continuous
- Subtypes of categorical variables are Ordinal and Nominal
Linear vs Non-Linear associations of two numerical variables
Experimental vs Observational studies

Modeling Numerical Variables

In this lecture, we will learn about:

Simple Linear Regression where we quantify the relationship between two numerical variables.
Statistical modeling of a numerical response variable using a numerical explanatory variable.
The linear model of one predictor and one outcome.
The concept of correlation and residuals.

High School Graduation and Poverty

The scatterplot on the right shows the relationship between the high school graduation rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

What are the response and explanatory variables? What is the relationship shown?

Response Variable (outcome): % in poverty

Explanatory Variable (predictor): % HS grad

Relationship: linear, negative, moderately strong

The Linear Model

A linear model is written as

\[ y = \beta_0 + \beta_1 x + e \]

where $y$ is the outcome, $x$ is the predictor, $\beta_0$ is the intercept, and $\beta_1$ is the slope. The notation $e$ is the model’s error.

Notation:

Population Parameters: $\beta_0$ and $\beta_1$
Sample statistics (point estimates for the parameters): $b_0$ and $b_1$
Estimated/Predicted outcome: $\hat{y} = b_0 + b_1 x$ (the fitted line has no error term)

The Linear Model

Key Points:

Inference: We can use the sample statistics $b_0$ and $b_1$ as point estimates to infer the true values of the population parameters $\beta_0$ and $\beta_1$. We will discuss inference for linear regression in detail in a few weeks.
For now, let’s focus on understanding linear regression.

Using a Linear Regression to Predict Poverty

Example Model:

The linear model for predicting poverty from high school graduation rate in the US is

\[ \hat{poverty} = 64.78 - 0.62 * HS_{grad} \]

where the sample statistics are the slope $b_1 = -0.62$ and the intercept $b_0 = 64.78$.

The “hat” in the $\hat{poverty}$ indicates an estimated/predicted outcome.

Example Question:

The high school graduation rate in Georgia is 85.1%.

What poverty level does the model predict for this state?

Answer: The poverty estimate/prediction for Georgia, with a graduation rate of 85.1%, is \[ \hat{poverty} = 64.78 - 0.62 * 85.1 = 12.018 \] that is, about 12.02% of residents living in poverty.
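The same arithmetic can be checked with a few lines of Python. This is only a sketch: the coefficients come from the example model above, while the function name `predict_poverty` is made up for illustration.

```python
# Coefficients from the example model in this lecture.
B0 = 64.78   # intercept b_0
B1 = -0.62   # slope b_1

def predict_poverty(hs_grad):
    """Return the predicted % in poverty for a given % HS grad."""
    return B0 + B1 * hs_grad

# Georgia: high school graduation rate of 85.1%
georgia = predict_poverty(85.1)
print(round(georgia, 3))  # 12.018
```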

Interpreting the Linear Model

The linear model for predicting poverty from high school graduation rate in the US is

\[ \hat{poverty} = 64.78 - 0.62 * HS_{grad} \]

where the sample statistics are the slope $b_1 = -0.62$ and the intercept $b_0 = 64.78$.

Interpreting the slope: If the high school graduation rate increases by 1 percentage point, then the model predicts that the poverty rate will decrease by approximately 0.62 percentage points.

Interpreting the intercept: If the high school graduation rate is 0, then the model predicts that the poverty rate is approximately 64.78%.

It is necessary to understand - at least partially - the units in which the variables are measured in order to correctly interpret the slope and intercept.

It is good to understand data thoroughly and to understand the structure of the linear model.
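Both interpretations can be read directly off the model's predictions. The sketch below reuses the example coefficients; the graduation rates 80 and 81 are arbitrary values chosen only to show a one-point increase, and `predict_poverty` is a hypothetical helper name.

```python
B0, B1 = 64.78, -0.62  # intercept and slope from the example model

def predict_poverty(hs_grad):
    """Return the predicted % in poverty for a given % HS grad."""
    return B0 + B1 * hs_grad

# Slope: change in the prediction when % HS grad rises by one point.
change = predict_poverty(81.0) - predict_poverty(80.0)

# Intercept: the prediction when % HS grad is 0.
at_zero = predict_poverty(0.0)

print(round(change, 2))   # -0.62
print(round(at_zero, 2))  # 64.78
```

Any pair of graduation rates one point apart gives the same change, which is what it means for the slope to be constant in a linear model.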

Eyeballing the line

Which of the following appears to be the line that best fits the linear relationship between % in poverty and % HS grad? Choose one.

Answer: (a) because this line appears to be minimizing most of the distances between the data points and the line.

These distances between the data points and the linear model are called residuals.

Residuals

Residuals are the leftover variation in the data after accounting for the model fit:

Data = Fit + Residual

The residual of the $i^{th}$ observation $(x_i,y_i)$ is the difference between the observed value ($y_i$) and the estimated/predicted value ($\hat{y}_i$).

\[ e_i = y_i - \hat{y}_i \]

Residuals

Example:

The % living in poverty in DC is 5.44 percentage points more than predicted (positive residual).
The % living in poverty in RI is 4.16 percentage points less than predicted (negative residual).
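The sign convention of a residual can be sketched in code. The observed and predicted values below are hypothetical numbers chosen for illustration, not the actual DC or RI data.

```python
def residual(observed, predicted):
    """Residual e_i = y_i - y_hat_i."""
    return observed - predicted

# Hypothetical state A: observed 15.0% in poverty, model predicts 12.0%.
above = residual(15.0, 12.0)   # point lies above the line

# Hypothetical state B: observed 10.0% in poverty, model predicts 12.0%.
below = residual(10.0, 12.0)   # point lies below the line

print(above, below)  # 3.0 -2.0
```

A positive residual means the model under-predicted the observation; a negative residual means it over-predicted.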

Error/Residuals Terminologies

The error - denoted as $e$ in the general form of the linear model below - refers to the deviation of an observed value (sample) from the true line in the population, which is usually unobserved. \[ y = \beta_0 + \beta_1 x + e \]
The residual - the error of the fitted model - refers to the deviation of the observed data (samples) from the estimated/predicted value.

Data = Fit + Residual

The residual is the difference between the observed value ($y_i$) and the estimated/predicted value ($\hat{y}_i$). \[ e_i = y_i - \hat{y}_i \]
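A small simulation can make the distinction concrete. This is a sketch under stated assumptions: the "true" parameters `BETA0` and `BETA1` and the simulated data are invented, and the line is fit by ordinary least squares, a method this lecture has not formally introduced.

```python
import random

random.seed(0)

# Hypothetical population parameters (known here only because we simulate).
BETA0, BETA1 = 2.0, 0.5

x = [float(i) for i in range(1, 21)]
errors = [random.gauss(0, 1) for _ in x]            # true errors e
y = [BETA0 + BETA1 * xi + ei for xi, ei in zip(x, errors)]

# Ordinary least squares estimates b_1 and b_0 (assumed fitting method).
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals: deviations from the fitted line, not from the true line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("estimates:", round(b0, 3), round(b1, 3))
print("first error vs first residual:", errors[0], residuals[0])
```

The errors are measured from the population line (`BETA0`, `BETA1`), which is unknown in practice; the residuals are measured from the fitted line (`b0`, `b1`), so the two lists are close but not identical.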

Quantifying the Relationship with Correlation

The relationship between the two numerical variables shown on the right is a moderately strong, negative, linear relationship.

Correlation (notation: $r$) describes the strength of the linear association between two numerical variables.

It can have values between -1 (perfect negative) and +1 (perfect positive).
A value of 0 indicates no linear association.
Correlation has no units.
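These properties can be checked with a short sketch that computes the correlation from its standard definition (the Pearson formula, which the slides do not spell out). The toy data and the helper name `pearson_r` are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]      # perfect positive line
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]    # perfect negative line

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

# Changing units (e.g. multiplying x by 100) leaves r unchanged.
x_rescaled = [100.0 * xi for xi in x]
r_rescaled = pearson_r(x_rescaled, y_up)

print(r_up, r_down, r_rescaled)
```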

Quantifying the Relationship with Correlation

Example:

Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) $r=0.6$
(b) $r=-0.75$
(c) $r=-0.1$
(d) $r=0.02$
(e) $r=-1.5$

Answer: (b) $r=-0.75$ because the association appears to be negative and fairly strong. Option (e) is impossible because $r$ cannot be less than -1.

Assessing the Correlation

Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or -1?

Answer: (b) $\rightarrow$ correlation means linear association and - when fitting a linear model to data - we try to minimize the residuals.

More Correlation Examples

Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a lower value in the other.

More Correlation Examples

Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, because the relationship is not linear, the correlation is relatively weak.
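A minimal sketch of this point, using made-up data: here $y$ is completely determined by $x$ through $y = x^2$, yet the correlation is 0 because the relationship is not linear. The helper `pearson_r` is a hypothetical name implementing the standard Pearson formula.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# A perfect U-shaped (quadratic) relationship, symmetric around x = 0.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

r_curve = pearson_r(x, y)
print(r_curve)
```

The downward trend on the left half cancels the upward trend on the right half, so a single straight line captures none of the pattern.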

Summary

In this lecture, we talked about the following:

Simple Linear Regression, where the equation we fit to the data is given as \[ y = \beta_0 + \beta_1 x + e \] where $y$ is the outcome, $x$ is the predictor, $\beta_0$ is the intercept, and $\beta_1$ is the slope.
- Population Parameters: $\beta_0$ and $\beta_1$
- Sample statistics (point estimates for the parameters): $b_0$ and $b_1$
- Estimated/Predicted outcome: $\hat{y} = b_0 + b_1 x$
Residuals, which are the differences between the observed values ($y_i$) and the estimated/predicted values ($\hat{y}_i$). \[ e_i = y_i - \hat{y}_i \]
Correlation, which is a metric describing the strength of the linear association between two numerical variables, with values between -1 (perfect negative) and +1 (perfect positive).

Today’s Activity

Discuss the answers with your group for the following exercise. This exercise problem is partially taken and slightly modified from the OpenIntro: Introduction to Modern Statistics Section 7.5.

The Coast Starlight, regression. The Coast Starlight Amtrak train runs from Seattle to Los Angeles. The scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes). The correlation between travel time and distance is 0.636. The plot also shows the best linear fit (red line) of the data where the slope is estimated to be 0.7259 and the intercept is 50.8842.

Write the equation of the regression line for predicting travel time.
Interpret the slope and the intercept in this context. Explain why the intercept interpretation does not make sense in this context and should not be used.
Explain the correlation in this context.
The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.
It actually takes about 168 mins to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.

3 - Introduction to Linear Regression
