Alex John Quijano
11/12/2021
Inference for one proportion
Hypothesis testing
Confidence intervals
Today, we will discuss the following:
The problem shown below was taken and slightly modified from your textbook OpenIntro: Introduction to Modern Statistics Section 17.5.
Consider the research study described below.
A Kaiser Family Foundation poll for a random sample of US adults in 2019 found that 79% of Democrats, 55% of Independents, and 24% of Republicans supported a generic “National Health Plan.” There were 347 Democrats, 298 Republicans, and 617 Independents surveyed. (Kaiser Family Foundation 2019)
Claim: There is a significant difference between the proportion of Democrats who support the NHP and the proportion of Independents who support the NHP. (Hypothesis Testing)
Question: What is the feasible range of values for the true difference in proportion between the Democrats and Independents? (Confidence Interval)
Null Hypothesis There is NO significant difference between the proportion of all Democrats who support the NHP and the proportion of all Independents who support the NHP.
\[H_0: p_D - p_I = 0\]
Alternative Hypothesis 1 There is a significant difference between the proportion of all Democrats who support the NHP and the proportion of all Independents who support the NHP. \[H_{A1}: p_D - p_I \ne 0\]
Alternative Hypothesis 2 A significantly smaller proportion of Independents than of Democrats support the NHP. \[H_{A2}: p_D - p_I > 0\]
Goal: To compute the p-value.
Step 1: Check if the conditions are satisfied.
Step 2: Compute the pooled proportion \[ \begin{aligned} \hat{p}_{pool} & = \frac{\hat{p}_D n_{D} + \hat{p}_I n_{I}}{n_D + n_I} \\ & = \frac{0.79 (347) + 0.55 (617)}{347 + 617} \\ & = 0.6364 \end{aligned} \]
Step 3: Compute the Z statistic. \[ \begin{aligned} Z & = \frac{(\hat{p}_D - \hat{p}_I) - p_0}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool}) \left( \frac{1}{n_D} + \frac{1}{n_I} \right) }} \\ & = \frac{0.24 - 0}{\sqrt{0.6364(1-0.6364) \left( \frac{1}{347} + \frac{1}{617} \right) }} \\ & = 7.4354 \end{aligned} \]
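Step 4: Compute the p-value.
Two-sided p-value (for \(H_{A1}\)): 2*(1-pnorm(7.4354,0,1)) = 1.04361e-13
One-sided p-value (for \(H_{A2}\)): 1-pnorm(7.4354,0,1) = 5.218048e-14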
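The steps above can be reproduced in R from the summary statistics alone. This is a minimal sketch, not the textbook's code: the variable names are made up for illustration, and the condition check assumes that "the conditions" means the usual success-failure rule (at least 10 expected successes and failures in each group, computed with the pooled proportion). Because the code does not round intermediate values, its output can differ from the hand computation in the last decimal place.

```r
# Poll summary: sample sizes and sample proportions
n_D <- 347; p_hat_D <- 0.79   # Democrats
n_I <- 617; p_hat_I <- 0.55   # Independents
p_0 <- 0                      # null value of p_D - p_I

# Step 2: pooled proportion under H0 (p_D = p_I)
p_pool <- (p_hat_D * n_D + p_hat_I * n_I) / (n_D + n_I)

# Step 1 (assumed form): success-failure check using the pooled proportion
expected <- c(n_D * p_pool, n_D * (1 - p_pool),
              n_I * p_pool, n_I * (1 - p_pool))
conditions_ok <- all(expected >= 10)

# Step 3: Z statistic with the pooled standard error
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_D + 1 / n_I))
z <- ((p_hat_D - p_hat_I) - p_0) / se_pool

# p-values from the standard normal distribution
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)  # for H_A1
p_one_sided <- pnorm(z, lower.tail = FALSE)           # for H_A2

print(conditions_ok)
print(c(p_pool = p_pool, z = z))
print(c(two_sided = p_two_sided, one_sided = p_one_sided))
```

Using `lower.tail = FALSE` instead of `1 - pnorm(...)` is a design choice: it avoids subtracting a number very close to 1, which matters when the p-value is as small as it is here.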
Step 1: Check if the conditions are satisfied. Since we already checked earlier, we can say that it is satisfied.
Step 2: Compute standard error. \[ \begin{aligned} SE & = \sqrt{\frac{\hat{p}_D(1-\hat{p}_D)}{n_D} + \frac{\hat{p}_I(1-\hat{p}_I)}{n_I}} \\ & = \sqrt{\frac{0.79(1-0.79)}{347} + \frac{0.55(1-0.55)}{617}} \\ & = 0.0297 \end{aligned} \]
Step 3: Compute the \(z^*\) for a 95% CI. \[z^* = 1.96\]
Step 4: Compute the CI. \[\hat{p}_D - \hat{p}_I \pm z^* SE\] \[0.24 \pm 1.96(0.0297)\] \[(0.1818,0.2982)\]
Therefore, we are 95% confident that the true difference in proportions is between 0.1818 and 0.2982. Here, the null value of \(p_0 = 0\) is outside the 95% confidence interval, which indicates significance.
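The confidence interval steps can be sketched in R as follows. The variable names are illustrative, and `qnorm` is used to obtain \(z^*\) rather than the rounded 1.96, so the endpoints may differ slightly in the fourth decimal place from a computation that uses the rounded SE and \(z^*\).

```r
# Poll summary: sample sizes and sample proportions
n_D <- 347; p_hat_D <- 0.79   # Democrats
n_I <- 617; p_hat_I <- 0.55   # Independents

# Step 2: unpooled standard error of the difference
se <- sqrt(p_hat_D * (1 - p_hat_D) / n_D +
           p_hat_I * (1 - p_hat_I) / n_I)

# Step 3: critical value for a 95% interval (about 1.96)
level <- 0.95
z_star <- qnorm(1 - (1 - level) / 2)

# Step 4: point estimate plus or minus the margin of error
point_est <- p_hat_D - p_hat_I
ci <- point_est + c(-1, 1) * z_star * se

print(round(c(SE = se, z_star = z_star), 4))
print(round(ci, 4))

# The null value 0 lies outside the interval when the lower bound is positive
print(ci[1] > 0)
```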
The problem shown below was taken and slightly modified from your textbook OpenIntro: Introduction to Modern Statistics Section 17.5.
Consider the research study described below.
Sleep deprivation, CA vs. OR.
According to a report on sleep deprivation by the Centers for Disease Control and Prevention, the proportion of California residents who reported insufficient rest or sleep during each of the preceding 30 days is 8.0%, while this proportion is 8.8% for Oregon residents. These data are based on simple random samples of 11,545 California and 4,691 Oregon residents. CDC 2008
Perform a hypothesis test for the difference between the proportions of CA and OR who are sleep deprived.
Calculate a 95% confidence interval for the difference between the proportions of CA and OR who are sleep deprived and interpret it in context of the data.
Hypothesis Testing.
Null Hypothesis. \(H_0\): There is NO significant difference between the proportions of sleep-deprived individuals in Oregon and California. \[p_O - p_C = 0\]
Alternative Hypothesis 1. \(H_{A1}\): There is a significant difference between the proportions of sleep-deprived individuals in Oregon and California. \[p_O - p_C \ne 0\]
Alternative Hypothesis 2. \(H_{A2}\): There is a higher proportion of sleep deprived individuals in Oregon than in California. \[p_O - p_C > 0\].
The null value and point-estimate: \(p_0 = 0\) and \(\hat{p}_O - \hat{p}_C = 0.088 - 0.08 = 0.008\)
Compute the pooled proportion. \[ \begin{aligned} \hat{p}_{pool} & = \frac{\hat{p}_O n_{O} + \hat{p}_C n_{C}}{n_O + n_C} \\ & = \frac{0.088(4691) + 0.08(11545)}{4691 + 11545} \\ & = 0.0823 \end{aligned} \]
Compute the \(Z\) statistic. \[ \begin{aligned} Z & = \frac{(\hat{p}_O - \hat{p}_C) - p_0}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool}) \left( \frac{1}{n_O} + \frac{1}{n_C} \right) }} \\ & = \frac{0.008 - 0}{\sqrt{0.0823(1-0.0823) \left( \frac{1}{4691} + \frac{1}{11545} \right) }} \\ & = 1.68 \end{aligned} \]
The p-value for \(Z = 1.68\) is \(0.0465\) (one-sided) and \(0.0929\) (two-sided).
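The same test can be wrapped in a small function so that it works for any pair of sample proportions. This is only a sketch: `two_prop_z` and its argument names are hypothetical, not from the textbook or from an R package. The code keeps the unrounded \(Z\), so its p-values can differ in the third or fourth decimal place from those computed with \(Z = 1.68\).

```r
# Hypothetical helper: pooled two-proportion Z test from summary statistics
two_prop_z <- function(p1, n1, p2, n2, null_value = 0) {
  p_pool <- (p1 * n1 + p2 * n2) / (n1 + n2)
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z <- ((p1 - p2) - null_value) / se
  list(
    p_pool = p_pool,
    z = z,
    p_greater = pnorm(z, lower.tail = FALSE),            # H_A2: p1 - p2 > 0
    p_two_sided = 2 * pnorm(abs(z), lower.tail = FALSE)  # H_A1: p1 - p2 != 0
  )
}

# Group 1 is Oregon, group 2 is California
res <- two_prop_z(p1 = 0.088, n1 = 4691, p2 = 0.08, n2 = 11545)
print(round(unlist(res), 4))

# Compare each p-value with two candidate significance values
for (alpha in c(0.05, 0.10)) {
  cat("alpha =", alpha,
      "| one-sided rejects H0:", res$p_greater < alpha,
      "| two-sided rejects H0:", res$p_two_sided < alpha, "\n")
}
```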
If the significance value is \(0.05\), the one-sided test (p-value is 0.0465) is significant, though only barely. Therefore, we can say that a larger proportion of Oregon residents than California residents are sleep deprived.
If the significance value is \(0.05\), the two-sided test (p-value is 0.0929) fails to reject the null hypothesis. We don’t have enough evidence to support that there is a difference in proportions.
What does this mean? The two tests disagree because both p-values are close to the significance value, so the conclusion depends on which test and which threshold we choose. However, the significance value is a subjective choice. If we set it to a still reasonable \(0.10\), which still describes a “rare” occurrence, both p-values fall below it and we can say that the data are unlikely to have occurred by chance.
95% confidence Interval.
Compute the standard error: \[ \begin{aligned} SE & = \sqrt{\frac{\hat{p}_O(1-\hat{p}_O)}{n_O} + \frac{\hat{p}_C(1-\hat{p}_C)}{n_C}} \\ & = \sqrt{\frac{0.088(1-0.088)}{4691} + \frac{0.08(1-0.08)}{11545}} \\ & = 0.0048 \end{aligned} \]
Compute the \(z^*\) for a 95% CI. \[z^* = 1.96\]
Compute the 95% confidence interval. \[\hat{p}_O - \hat{p}_C \pm z^* SE\] \[0.008 \pm 1.96*0.0048\] \[(-0.0014,0.0174)\]
Therefore, we are 95% confident that the true difference in proportions is between -0.0014 and 0.0174. Because this interval contains 0, a true difference of 0 (no difference) is plausible. This agrees with the two-sided test, where we failed to reject the null hypothesis.
As an additional note, if we compute the 90% confidence interval, \(0.008 \pm 1.645(0.0048)\), we get (0.0001,0.0159), which excludes 0, but only barely. This agrees with our decision to use a significance value of 0.10.
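To compare confidence levels side by side, the interval can be computed with the level as a parameter. This is a sketch with illustrative names (`diff_ci` is hypothetical); it uses the unrounded SE and `qnorm` for \(z^*\), so the endpoints can differ slightly in the last decimal place from the hand computation.

```r
# CDC summary: sample proportions and sample sizes
p_hat_O <- 0.088; n_O <- 4691    # Oregon
p_hat_C <- 0.08;  n_C <- 11545   # California

# Unpooled standard error of the difference
se <- sqrt(p_hat_O * (1 - p_hat_O) / n_O +
           p_hat_C * (1 - p_hat_C) / n_C)

# Hypothetical helper: interval for p_O - p_C at a given confidence level
diff_ci <- function(level) {
  z_star <- qnorm(1 - (1 - level) / 2)
  (p_hat_O - p_hat_C) + c(-1, 1) * z_star * se
}

ci_95 <- diff_ci(0.95)
ci_90 <- diff_ci(0.90)

print(round(rbind(ci_95, ci_90), 4))

# Lower confidence gives a narrower interval
print(c(width_95 = diff(ci_95), width_90 = diff(ci_90)))

# Does each interval contain the null value 0?
print(c(contains_0_95 = ci_95[1] < 0 && ci_95[2] > 0,
        contains_0_90 = ci_90[1] < 0 && ci_90[2] > 0))
```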
Note that the choice of the significance value and confidence level is subjective. What matters is how you interpret the p-value and the confidence interval in your decision making process.
Here, choosing a 90% CI over a 95% CI means that we are less confident that the interval captures the true difference in proportions, but the interval is narrower (more precise). Choosing a significance value of 0.10 means we subjectively define “unlikely to have occurred by chance” as a probability of less than 10%.
The results’ dependability is only validated at the end of the test. The possibility of having a type I or type II error result still exists. This is one of the disadvantages of the Frequentist paradigm of inference.
Today, we discussed the following:
Next, we will discuss: