Alex John Quijano
09/13/2021
Population and Sample
A parameter refers to the population
A statistic refers to the sample
Association vs Independence
Explanatory and Response variables
Numerical and Categorical Variables
Subtypes of numerical variables are Discrete and Continuous
Subtypes of categorical variables are Ordinal and Nominal
Linear vs Non-Linear associations of two numerical variables
Experimental vs Observational studies
In this lecture, we will learn about:
Simple Linear Regression where we quantify the relationship between two numerical variables.
Statistical modeling of a numerical response variable using a numerical explanatory variable.
The linear model of one predictor and one outcome.
The concept of correlation and residuals.
The scatterplot on the right shows the relationship between the high school (HS) graduation rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).
What are the response and explanatory variables? What relationship is shown?
A linear model is written as
\[ y = \beta_0 + \beta_1 x + e \]
where \(y\) is the outcome, \(x\) is the predictor, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope. The notation \(e\) is the model’s error.
Notation:
Population Parameters: \(\beta_0\) and \(\beta_1\)
Sample statistics (point estimates for the parameters): \(b_0\) and \(b_1\)
Estimated/Predicted outcome: \(\hat{y} = b_0 + b_1 x\)
Key Points:
Inference: We can use the sample statistics \(b_0\) and \(b_1\) as point estimates to infer the true values of the population parameters \(\beta_0\) and \(\beta_1\). We will discuss inference for linear regression in detail in a few weeks.
For now, let’s focus on understanding linear regression.
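To make the distinction between population parameters and sample statistics concrete, here is a minimal sketch that simulates data from a linear model and then estimates the line from the sample. The parameter values (`beta0 = 60`, `beta1 = -0.5`), the noise level, and the sample size are made up for illustration and are not taken from the poverty data. The sketch uses NumPy's `polyfit`, which fits a least-squares line; this lecture has not yet said how the line is fitted, so treat that choice as an assumption.

```python
import numpy as np

# Hypothetical population parameters (made up for illustration only).
beta0, beta1 = 60.0, -0.5

rng = np.random.default_rng(seed=1)   # fixed seed so the run is reproducible
x = rng.uniform(70, 95, size=200)     # simulated predictor values
e = rng.normal(0, 2, size=200)        # simulated model error
y = beta0 + beta1 * x + e             # y = beta0 + beta1 * x + e

# A degree-1 polyfit returns the least-squares slope first, then the intercept.
b1, b0 = np.polyfit(x, y, deg=1)

print(f"b0 = {b0:.2f} (true beta0 = {beta0})")
print(f"b1 = {b1:.2f} (true beta1 = {beta1})")
```

The estimates \(b_0\) and \(b_1\) should land near, but not exactly on, the parameters used to generate the data; a different sample would give slightly different estimates.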
Example Model:
The linear model for predicting poverty from high school graduation rate in the US is
\[ \hat{poverty} = 64.78 - 0.62 \times HS_{grad} \]
where the sample statistics are the slope \(b_1 = -0.62\) and the intercept \(b_0 = 64.78\).
The “hat” in the \(\hat{poverty}\) indicates an estimated/predicted outcome.
Example Question:
The high school graduation rate in Georgia is 85.1%.
What poverty level does the model predict for this state?
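One way to work this out is to plug the graduation rate into the fitted line from the example model. The sketch below does that in Python; the function name `predict_poverty` is a hypothetical helper introduced for illustration.

```python
# Sample statistics from the fitted model in this lecture.
b0 = 64.78   # intercept
b1 = -0.62   # slope

def predict_poverty(hs_grad):
    """Predicted % in poverty for a given % HS graduation rate."""
    return b0 + b1 * hs_grad

georgia = predict_poverty(85.1)   # 64.78 - 0.62 * 85.1
print(round(georgia, 2))          # 12.02
```

Under this model, the predicted poverty rate for Georgia is about 12.02%.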
Which of the following appears to be the line that best fits the linear relationship between % in poverty and % HS grad? Choose one.
Residuals are the leftover variation in the data after accounting for the model fit:
The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference between the observed outcome (\(y_i\)) and the estimated/predicted outcome (\(\hat{y}_i\)).
\[ e_i = y_i - \hat{y}_i \]
Example:
The % living in poverty in DC is 5.44 percentage points higher than predicted.
The % living in poverty in RI is 4.16 percentage points lower than predicted.
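The residual formula can be sketched in a few lines of Python. The coefficients come from the lecture's example model, but the three \((x_i, y_i)\) pairs are made up for illustration and are not real state data.

```python
import numpy as np

b0, b1 = 64.78, -0.62   # intercept and slope from the lecture's model

# Hypothetical (x_i, y_i) pairs, made up for illustration; not real state data.
hs_grad = np.array([80.0, 85.0, 90.0])   # x_i: % HS graduates
poverty = np.array([17.0, 11.0, 9.5])    # y_i: observed % in poverty

predicted = b0 + b1 * hs_grad            # y-hat_i = b0 + b1 * x_i
residuals = poverty - predicted          # e_i = y_i - y-hat_i

for x_i, y_i, e_i in zip(hs_grad, poverty, residuals):
    side = "above" if e_i > 0 else "below"
    print(f"x = {x_i}: observed {y_i}, residual {e_i:+.2f} ({side} the line)")
```

A positive residual means the observed value lies above the fitted line (as for DC), and a negative residual means it lies below (as for RI).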
The error, denoted \(e\) in the general form of the linear model below, can refer to the deviation of an observed value from the true relationship in the population (often unobserved). \[ y = \beta_0 + \beta_1 x + e \]
The residual, which estimates the model's error, refers to the deviation of the observed data (samples) from the estimated/predicted value.
Data = Fit + Residual
The residual is the difference between the observed value (\(y_i\)) and the estimated/predicted value (\(\hat{y}_i\)). \[ e_i = y_i - \hat{y}_i \]
The relationship between the two numerical variables shown on the right is a moderately strong, negative, linear relationship.
Correlation (notation: \(r\)) describes the strength of the linear association between two numerical variables.
It can have values between -1 (perfect negative) and +1 (perfect positive).
A value of 0 indicates no linear association.
Correlation has no units.
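These properties can be checked numerically. The lecture does not give a formula for \(r\), so the sketch below assumes the standard Pearson formula (the sum of products of z-scores divided by \(n - 1\)); the data values and the helper name `correlation` are made up for illustration.

```python
import numpy as np

# Hypothetical paired measurements, made up for illustration.
x = np.array([78.0, 82.0, 85.0, 88.0, 91.0])
y = np.array([18.0, 15.5, 12.0, 11.0, 8.5])

def correlation(x, y):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (len(x) - 1)

r = correlation(x, y)
r_rescaled = correlation(x / 100, y * 10)   # change the units of both variables

print(round(r, 3), round(r_rescaled, 3))    # the two values agree
```

Rescaling either variable leaves \(r\) unchanged, which is what "no units" means in practice. The result also matches NumPy's built-in `np.corrcoef(x, y)[0, 1]`.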
Example:
Which of the following is the best guess for the correlation between % in poverty and % HS grad?
(a) \(r = 0.6\)
(b) \(r = -0.75\)
(c) \(r = -0.1\)
(d) \(r = 0.02\)
(e) \(r = -1.5\)
Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or -1?
Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a lower value in the other.
Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, because the relationship is not linear, the correlation is relatively weak.
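A small numerical illustration of this point, using a made-up example rather than the data behind the plots: when \(y = x^2\) on a grid that is symmetric around zero, \(y\) is completely determined by \(x\), yet the correlation is essentially zero.

```python
import numpy as np

x = np.linspace(-3, 3, 61)   # symmetric grid around 0
y = x ** 2                   # y is completely determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")        # very close to 0
```

Correlation only measures the strength of a linear association, so a value near zero does not rule out a strong non-linear relationship.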
In this lecture, we talked about the following:
Simple linear regression, where the model we fit to the data is \[ y = \beta_0 + \beta_1 x + e \] where \(y\) is the outcome, \(x\) is the predictor, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(e\) is the model's error.
Population Parameters: \(\beta_0\) and \(\beta_1\)
Sample statistics (point estimates for the parameters): \(b_0\) and \(b_1\)
Estimated/Predicted outcome: \(\hat{y} = b_0 + b_1 x\)
Residuals, which are the differences between the observed values (\(y_i\)) and the estimated/predicted values (\(\hat{y}_i\)). \[ e_i = y_i - \hat{y}_i \]
Correlation, a metric describing the strength of the linear association between two numerical variables, which takes values between -1 (perfect negative) and +1 (perfect positive).
Discuss your answers to the following exercise with your group. The exercise is adapted, with slight modifications, from OpenIntro: Introduction to Modern Statistics, Section 7.5.
The Coast Starlight, regression. The Coast Starlight Amtrak train runs from Seattle to Los Angeles. The scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes). The correlation between travel time and distance is 0.636. The plot also shows the best linear fit (red line) of the data where the slope is estimated to be 0.7259 and the intercept is 50.8842.
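As a starting point for the discussion, the fitted line can be written as a small function. This sketch assumes that distance (miles) is the predictor and travel time (minutes) is the outcome, which the exercise text does not state outright; the helper name `predict_travel_time` and the 100-mile input are hypothetical.

```python
# Estimates reported in the exercise.
b0 = 50.8842   # intercept (minutes, assuming travel time is the outcome)
b1 = 0.7259    # slope (minutes per mile, under the same assumption)

def predict_travel_time(distance_miles):
    """Predicted travel time in minutes, assuming distance is the predictor."""
    return b0 + b1 * distance_miles

example = predict_travel_time(100)   # 100 miles is a hypothetical distance
print(round(example, 2))             # 123.47
```

Under that assumption, each additional mile between stops is associated with about 0.73 more minutes of predicted travel time.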