Previously on Inference for Proportions…

Inference for one proportion
- Hypothesis testing
- Confidence intervals

Inference for Comparing Two Proportions

Today, we will discuss the following:

Using theoretical methods to perform hypothesis tests and build confidence intervals for comparing two proportions.

National Health Plan

The problem shown below is adapted from Section 17.5 of your textbook, OpenIntro: Introduction to Modern Statistics.

Consider the research study described below.

A Kaiser Family Foundation poll for a random sample of US adults in 2019 found that 79% of Democrats, 55% of Independents, and 24% of Republicans supported a generic “National Health Plan” (NHP). There were 347 Democrats, 298 Republicans, and 617 Independents surveyed. (Kaiser Family Foundation, 2019)

Claim: There is a significant difference between the proportion of Democrats who support the NHP and the proportion of Independents who support the NHP. (Hypothesis Testing)

Question: What is the plausible range of values for the true difference in proportions between Democrats and Independents? (Confidence Interval)

Population Parameter and Sample Statistic

Question - What are the population parameter and the sample statistic?

Answer:

Population Parameter: The difference between the proportion of all Democrats who support the NHP and the proportion of all Independents who support the NHP. Let \(p_{diff} = p_D - p_I\) be the population parameter.
Sample Statistic: The difference between the proportion of sampled Democrats who support the NHP and the proportion of sampled Independents who support the NHP. Let \(\hat{p}_{diff} = \hat{p}_D - \hat{p}_I\) be the sample statistic.

Point estimate and Null Value

Question - What are the point estimate and the null value?

Answer:

Null Value: The value of the difference in proportions assumed under the null hypothesis. Let \(p_0 = p_D - p_I = 0\) be the null value.
Point Estimate: The difference between the sample proportions of individuals who support the NHP. We are given \(\hat{p}_D = 0.79\) and \(\hat{p}_I = 0.55\). Let \(\hat{p}_{diff} = \hat{p}_D - \hat{p}_I = 0.24\) be the point estimate.

The Null and Alternative Hypothesis

Null Hypothesis: There is NO significant difference between the proportion of all Democrats who support the NHP and the proportion of all Independents who support the NHP.
\[H_0: p_D - p_I = 0\]
Alternative Hypothesis 1: There is a significant difference between the proportion of all Democrats who support the NHP and the proportion of all Independents who support the NHP. \[H_{A1}: p_D - p_I \ne 0\]
Alternative Hypothesis 2: A significantly smaller proportion of Independents than Democrats support the NHP. \[H_{A2}: p_D - p_I > 0\]

Hypothesis Testing (1/3)

Goal: To compute the p-value.
Step 1: Check if the conditions are satisfied.
- Independence: The data is randomly sampled. Thus, we can assume that the data points are independent.
- Success-Failure: There are at least 10 “successes” and 10 “failures” in each group. Let \(n_D = 347\) and \(n_I = 617\).
  - \(n_D\hat{p}_D = 347(0.79) \approx 274\) “successes” for the Democrat group
  - \(n_D(1-\hat{p}_D) = 347(1-0.79) \approx 73\) “failures” for the Democrat group
  - \(n_I\hat{p}_I = 617(0.55) \approx 339\) “successes” for the Independent group
  - \(n_I(1-\hat{p}_I) = 617(1-0.55) \approx 278\) “failures” for the Independent group
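The success-failure check above can be reproduced in a few lines of R. This is a minimal sketch: the variable names (`n_D`, `p_hat_D`, `counts`, and so on) are illustrative choices, and the counts are rebuilt from the rounded percentages reported in the poll, so they are not whole numbers.

```r
# Sample sizes and sample proportions as reported in the poll.
n_D <- 347; p_hat_D <- 0.79   # Democrats
n_I <- 617; p_hat_I <- 0.55   # Independents

# Expected number of "successes" and "failures" in each group.
counts <- c(
  D_success = n_D * p_hat_D,
  D_failure = n_D * (1 - p_hat_D),
  I_success = n_I * p_hat_I,
  I_failure = n_I * (1 - p_hat_I)
)
print(round(counts, 2))

# The condition holds when every count is at least 10.
success_failure_ok <- all(counts >= 10)
print(success_failure_ok)
```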

Hypothesis Testing (2/3)

Step 2: Compute the pooled proportion \[ \begin{aligned} \hat{p}_{pool} & = \frac{\hat{p}_D n_{D} + \hat{p}_I n_{I}}{n_D + n_I} \\ & = \frac{0.79 (347) + 0.55 (617)}{347 + 617} \\ & = 0.6364 \end{aligned} \]
Step 3: Compute the Z statistic. \[ \begin{aligned} Z & = \frac{(\hat{p}_D - \hat{p}_I) - p_0}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool}) \left( \frac{1}{n_D} + \frac{1}{n_I} \right) }} \\ & = \frac{0.24 - 0}{\sqrt{0.6364(1-0.6364) \left( \frac{1}{347} + \frac{1}{617} \right) }} \\ & = 7.4354 \end{aligned} \]
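Steps 2 and 3 can be sketched in R as follows. The variable names are illustrative, and the code keeps full precision for the pooled proportion, so the last digit of \(Z\) may differ slightly from a hand calculation that uses the rounded value 0.6364.

```r
# Inputs from the poll.
n_D <- 347; p_hat_D <- 0.79   # Democrats
n_I <- 617; p_hat_I <- 0.55   # Independents
p_0 <- 0                      # null value for p_D - p_I

# Step 2: pooled proportion (weighted average of the two sample proportions).
p_pool <- (p_hat_D * n_D + p_hat_I * n_I) / (n_D + n_I)

# Step 3: Z statistic using the pooled standard error.
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_D + 1 / n_I))
z <- ((p_hat_D - p_hat_I) - p_0) / se_pool

print(round(p_pool, 4))
print(round(z, 2))
```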

Hypothesis Testing (3/3)

Step 4: Compute the p-value. \[\text{p-value} = 1.04361 \times 10^{-13} \longrightarrow \text{two-sided}\] \[\text{p-value} = 5.218048 \times 10^{-14} \longrightarrow \text{one-sided}\]
R equivalent: 2*(1-pnorm(7.4354,0,1)) returns 1.04361e-13 (two-sided) and 1-pnorm(7.4354,0,1) returns 5.218048e-14 (one-sided).
For a significance level of 0.05, we can reject the null hypothesis because the p-value is less than 0.05. Therefore, there is a significant difference between the proportion of Democrats who support the NHP and the proportion of Independents who support the NHP. We can also say that a significantly larger proportion of Democrats than Independents support the NHP.
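As a hedged alternative to subtracting from 1, the tail probability can be requested directly with the `lower.tail = FALSE` argument of `pnorm()`, which tends to be numerically safer for p-values this small. The names `z`, `alpha`, `p_one_sided`, and `p_two_sided` are illustrative.

```r
z <- 7.4354    # Z statistic from Step 3
alpha <- 0.05  # significance level

# Upper-tail probability P(Z > z) under the standard normal distribution.
p_one_sided <- pnorm(z, mean = 0, sd = 1, lower.tail = FALSE)

# The two-sided p-value doubles the one-sided tail area.
p_two_sided <- 2 * p_one_sided

print(p_one_sided)
print(p_two_sided)

# Decision rule: reject H0 when the p-value is below alpha.
reject_two_sided <- p_two_sided < alpha
reject_one_sided <- p_one_sided < alpha
print(c(two_sided = reject_two_sided, one_sided = reject_one_sided))
```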

95% Confidence Interval (1/3)

Step 1: Check if the conditions are satisfied. We already checked them for the hypothesis test, so they are satisfied.
Step 2: Compute standard error. \[ \begin{aligned} SE & = \sqrt{\frac{\hat{p}_D(1-\hat{p}_D)}{n_D} + \frac{\hat{p}_I(1-\hat{p}_I)}{n_I}} \\ & = \sqrt{\frac{0.79(1-0.79)}{347} + \frac{0.55(1-0.55)}{617}} \\ & = 0.0297 \end{aligned} \]

95% Confidence Interval (2/3)

Step 3: Compute the \(z^*\) for a 95% CI. \[z^* = 1.96\]
Step 4: Compute the CI. \[\hat{p}_D - \hat{p}_I \pm z^* SE\] \[0.24 \pm 1.96(0.0297)\] \[(0.1818,0.2982)\]
Therefore, we are 95% confident that the true difference in proportions is between 0.1818 and 0.2982. Here, the null value of \(p_0 = 0\) is outside the 95% confidence interval, which indicates a significant difference.
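The confidence interval steps can be sketched in R as below. The variable names are illustrative, and `qnorm(0.975)` is used for \(z^*\) instead of the rounded 1.96, so the bounds may differ from a hand calculation in the fourth decimal place.

```r
# Inputs from the poll.
n_D <- 347; p_hat_D <- 0.79   # Democrats
n_I <- 617; p_hat_I <- 0.55   # Independents

# Unpooled standard error: each group keeps its own sample proportion.
se <- sqrt(p_hat_D * (1 - p_hat_D) / n_D + p_hat_I * (1 - p_hat_I) / n_I)

# Critical value for a 95% interval (2.5% in each tail).
z_star <- qnorm(0.975)

# Point estimate plus and minus the margin of error.
ci <- (p_hat_D - p_hat_I) + c(-1, 1) * z_star * se

print(round(se, 4))
print(round(ci, 4))
```

Because the whole interval lies above 0, it agrees with the decision to reject the null hypothesis.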

10.10-minute Activity (1/10)

The problem shown below is adapted from Section 17.5 of your textbook, OpenIntro: Introduction to Modern Statistics.

Consider the research study described below.

Sleep deprivation, CA vs. OR.

According to a report on sleep deprivation by the Centers for Disease Control and Prevention, the proportion of California residents who reported insufficient rest or sleep during each of the preceding 30 days is 8.0%, while this proportion is 8.8% for Oregon residents. These data are based on simple random samples of 11,545 California and 4,691 Oregon residents. CDC 2008

Perform a hypothesis test for the difference between the proportions of CA and OR residents who are sleep deprived.
Calculate a 95% confidence interval for the difference between the proportions of CA and OR residents who are sleep deprived and interpret it in the context of the data.

10.10-minute Activity (2/10)

First, check for the conditions:
- Independence: The observations are independent because both groups are simple random samples from their populations.
- Success-Failure: We have at least 10 observations for each level in each group. \[n_O\hat{p}_O = 4691(0.088) \approx 413 \longrightarrow \text{"successes" - OR sleep deprived}\] \[n_O(1-\hat{p}_O) = 4691(1-0.088) \approx 4278 \longrightarrow \text{"failures" - OR not sleep deprived}\] \[n_C\hat{p}_C = 11545(0.08) \approx 924 \longrightarrow \text{"successes" - CA sleep deprived}\] \[n_C(1-\hat{p}_C) = 11545(1-0.08) \approx 10621 \longrightarrow \text{"failures" - CA not sleep deprived}\]
Thus, the sampling distributions of \(\hat{p}_O\) and \(\hat{p}_C\) are approximately normal, and the sampling distribution of the difference \(\hat{p}_O - \hat{p}_C\) is also approximately normal.
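The same condition check can be written compactly with vectors, one entry per state. This is only a sketch: the names `n`, `p_hat`, and `counts` are illustrative, and the counts are reconstructed from the reported percentages.

```r
# Sample sizes and sample proportions, one entry per state.
n     <- c(OR = 4691, CA = 11545)
p_hat <- c(OR = 0.088, CA = 0.08)

# Rows are "successes" (sleep deprived) and "failures" (not sleep deprived).
counts <- rbind(
  success = n * p_hat,
  failure = n * (1 - p_hat)
)
print(round(counts))

# Success-failure condition: every cell must be at least 10.
success_failure_ok <- all(counts >= 10)
print(success_failure_ok)
```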

10.10-minute Activity (3/10)

Hypothesis Testing.

Null Hypothesis. \(H_0\): There is NO significant difference between the proportions of sleep-deprived individuals in Oregon and California. \[p_O - p_C = 0\]
Alternative Hypothesis 1. \(H_{A1}\): There is a significant difference between the proportions of sleep-deprived individuals in Oregon and California. \[p_O - p_C \ne 0\]
Alternative Hypothesis 2. \(H_{A2}\): There is a higher proportion of sleep-deprived individuals in Oregon than in California. \[p_O - p_C > 0\]

10.10-minute Activity (4/10)

The null value and point-estimate: \(p_0 = 0\) and \(\hat{p}_O - \hat{p}_C = 0.088 - 0.08 = 0.008\)
Compute the pooled proportion. \[ \begin{aligned} \hat{p}_{pool} & = \frac{\hat{p}_O n_{O} + \hat{p}_C n_{C}}{n_O + n_C} \\ & = \frac{0.088(4691) + 0.08(11545)}{4691 + 11545} \\ & = 0.0823 \end{aligned} \]

10.10-minute Activity (5/10)

Compute the \(Z\) statistic. \[ \begin{aligned} Z & = \frac{(\hat{p}_O - \hat{p}_C) - p_0}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool}) \left( \frac{1}{n_O} + \frac{1}{n_C} \right) }} \\ & = \frac{0.008 - 0}{\sqrt{0.0823(1-0.0823) \left( \frac{1}{4691} + \frac{1}{11545} \right) }} \\ & = 1.68 \end{aligned} \]
The p-value for \(Z = 1.68\) is \(0.0465\) (one-sided) and \(0.0929\) (two-sided).
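The pooled proportion, \(Z\) statistic, and both p-values for the activity can be sketched in R as follows. Variable names are illustrative. The code does not round \(Z\) before calling `pnorm()`, so the p-values it prints can differ in the third or fourth decimal place from values obtained with the rounded \(Z = 1.68\).

```r
# Inputs from the CDC report.
n_O <- 4691;  p_hat_O <- 0.088   # Oregon
n_C <- 11545; p_hat_C <- 0.08    # California
p_0 <- 0                         # null value for p_O - p_C

# Pooled proportion and pooled standard error.
p_pool  <- (p_hat_O * n_O + p_hat_C * n_C) / (n_O + n_C)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_O + 1 / n_C))

# Z statistic.
z <- ((p_hat_O - p_hat_C) - p_0) / se_pool

# One-sided (upper tail) and two-sided p-values.
p_one_sided <- pnorm(z, lower.tail = FALSE)
p_two_sided <- 2 * p_one_sided

print(round(c(p_pool = p_pool, z = z), 4))
print(round(c(one_sided = p_one_sided, two_sided = p_two_sided), 4))
```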

10.10-minute Activity (6/10)

If the significance level is \(0.05\), then the one-sided test (p-value of 0.0465) is significant, but only barely. Therefore, we can say that a larger proportion of Oregon residents than California residents are sleep deprived.
If the significance level is \(0.05\), then the two-sided test (p-value of 0.0929) fails to reject the null hypothesis. We don’t have enough evidence to support that there is a difference in proportions.
What does this mean? The two p-values fall on opposite sides of \(0.05\), so the evidence is borderline and we cannot draw a firm conclusion at that level. However, the significance level is a subjective choice. If we set it to a reasonable \(0.10\), which still describes a “rare” occurrence, both p-values are low enough for us to say that the data are unlikely to have occurred by chance.

10.10-minute Activity (7/10)

95% Confidence Interval.

Compute the standard error: \[ \begin{aligned} SE & = \sqrt{\frac{\hat{p}_O(1-\hat{p}_O)}{n_O} + \frac{\hat{p}_C(1-\hat{p}_C)}{n_C}} \\ & = \sqrt{\frac{0.088(1-0.088)}{4691} + \frac{0.08(1-0.08)}{11545}} \\ & = 0.0048 \end{aligned} \]
Compute the \(z^*\) for a 95% CI. \[z^* = 1.96\]

10.10-minute Activity (8/10)

Compute the 95% confidence interval. \[\hat{p}_O - \hat{p}_C \pm z^* SE\] \[0.008 \pm 1.96*0.0048\] \[(-0.0014,0.0174)\]
Therefore, we are 95% confident that the true difference in proportions is between -0.0014 and 0.0174. Because this interval contains 0, a true difference of 0 (no difference between the two states) is plausible. This coincides with the two-sided test, where we concluded that there is no significant difference.
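A minimal R sketch of the 95% interval for the activity is shown below. The variable names are illustrative, and the code keeps the standard error and \(z^*\) unrounded, so the bounds may differ from the hand calculation in the fourth decimal place.

```r
# Inputs from the CDC report.
n_O <- 4691;  p_hat_O <- 0.088   # Oregon
n_C <- 11545; p_hat_C <- 0.08    # California

# Unpooled standard error for the difference in sample proportions.
se <- sqrt(p_hat_O * (1 - p_hat_O) / n_O + p_hat_C * (1 - p_hat_C) / n_C)

# 95% confidence interval for p_O - p_C.
z_star <- qnorm(0.975)
ci_95 <- (p_hat_O - p_hat_C) + c(-1, 1) * z_star * se
print(round(ci_95, 4))

# Does the interval contain the null value 0?
contains_zero <- ci_95[1] < 0 && ci_95[2] > 0
print(contains_zero)
```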

10.10-minute Activity (9/10)

As an additional note, if we compute the 90% confidence interval with \(z^* = 1.645\), we get an interval of (0.0001,0.0159), which excludes 0, but only barely. This conclusion coincides with our decision to use a significance level of 0.10.
Note that the choice of the significance level and confidence level is subjective. What matters is how you interpret the p-value and the confidence interval in your decision-making process.
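To see how the confidence level changes the interval, the sketch below wraps the calculation in a small helper. The function name `two_prop_ci` and its arguments are hypothetical, introduced only for illustration. With unrounded inputs the 90% lower bound is positive but extremely close to 0, which underlines how borderline this result is.

```r
# Hypothetical helper: CI for a difference in two proportions (p1 - p2).
two_prop_ci <- function(p1, n1, p2, n2, conf_level = 0.95) {
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  # Critical value that leaves (1 - conf_level) / 2 in each tail.
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  (p1 - p2) + c(-1, 1) * z_star * se
}

# Oregon minus California, at two confidence levels.
ci_95 <- two_prop_ci(0.088, 4691, 0.08, 11545, conf_level = 0.95)
ci_90 <- two_prop_ci(0.088, 4691, 0.08, 11545, conf_level = 0.90)

print(round(ci_95, 4))
print(round(ci_90, 4))

# The 90% interval is narrower than the 95% interval.
print(diff(ci_90) < diff(ci_95))
```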

10.10-minute Activity (10/10)

Here, choosing a 90% CI over a 95% CI means that we are less confident about where the true difference in proportions lies, but the interval is narrower (more precise). Choosing a significance level of 0.10 means that we subjectively define “unlikely to have occurred by chance” as a probability of less than 10%.
The dependability of the result can only be judged once the test is complete, and the possibility of a Type I or Type II error still exists. This is one of the disadvantages of the Frequentist paradigm of inference.

Summary

Today, we discussed the following:

Hypothesis testing and confidence intervals for comparing two proportions

Next, we will discuss:

Inference for two-way tables

11 - Inference for Comparing Two Proportions: Hypothesis Testing & Confidence Intervals