Alex John Quijano
10/04/2021
In the previous lectures, we learned about the following:
Population and Sample
Parameter and Statistic
Bootstrapping and the Central Limit Theorem (CLT).
Image Source: Bootstrapping Statistics by Trist’n Joseph.
The Central Limit Theorem (CLT) states that, regardless of the underlying distribution, the sampling distribution of a statistic (e.g. mean or proportion) computed from independent, random observations will be normal or nearly normal, provided the sample size is large enough.
Not all sampling distributions will be normal. Many summary statistics and variables are nearly normal, but none are exactly normal. Thus the normal distribution, while not perfect for any single problem, is very useful for a variety of problems.
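A quick simulation sketch of this idea (assuming NumPy and a made-up right-skewed population, not anything from the lecture's data): even though individual observations are far from normal, the distribution of sample means is approximately normal.

```python
# A minimal sketch (hypothetical setup): draw many samples from a skewed
# distribution and look at the distribution of their means.
import numpy as np

rng = np.random.default_rng(0)

n = 50            # size of each sample
n_samples = 10_000

# Exponential data are clearly non-normal (right-skewed) ...
sample_means = rng.exponential(scale=2.0, size=(n_samples, n)).mean(axis=1)

# ... yet the sampling distribution of the mean is roughly symmetric and
# centered near the population mean (2.0), as the CLT predicts.
print(round(sample_means.mean(), 3), round(sample_means.std(), 3))
```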
Image Source: Bootstrapping Statistics by Trist’n Joseph.
Bootstrapping is a resampling method for estimating the sampling distribution of a statistic (e.g. mean, proportion). Bootstrap sampling is done by sampling with replacement from the original sample.
Bootstrapping allows us to simulate the sampling distribution of a statistic without the assumption of normality.
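As a sketch of the mechanics (assuming NumPy and a hypothetical sample of 12 successes out of 50, not the lecture's data), bootstrapping simply resamples the observed values with replacement and recomputes the statistic each time:

```python
# A minimal bootstrap sketch for a sample proportion (illustrative values).
import numpy as np

rng = np.random.default_rng(1)

sample = np.array([1] * 12 + [0] * 38)   # hypothetical sample: 12 "successes" out of 50

boot_props = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# boot_props approximates the sampling distribution of the proportion
# without assuming normality.
print(round(boot_props.mean(), 3), round(boot_props.std(), 3))
```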
In this lecture, we will learn about:
The basics of Confidence Intervals (CIs)
Bootstrapping sample proportions to understand the variability of a statistic
CIs and hypothesis testing are both inferential techniques, and they are closely connected to each other.
One consultant tried to attract patients by noting the average complication rate for liver donor surgeries in the US is about 10%, but her clients have had only 3 complications in the 62 liver donor surgeries she has facilitated.
She claims this is strong evidence that her work meaningfully contributes to reducing complications.
\(\hat{p} = \frac{3}{62} = 0.0484 \longrightarrow\) Observed Statistic
\(p_0 = 0.10 \longrightarrow\) The Null Statistic - the average complication rate in the US.
Medical Consultants Claim: Their work meaningfully contributes to reducing complications - far below the 10% US average complication rate.
Is it possible to assess the consultant’s claim (that the reduction in complications is due to her work) using the data? No.
The claim is that there is a causal connection, but the data are observational, so we must be on the lookout for confounding variables. We can’t conclude a causal connection for observational studies.
While it is not possible to assess the causal claim, it is still possible to understand the consultant’s true rate of complications by considering its variability.
Objective: Estimate the unknown population proportion by using the sample to approximate the proportion of complications for a client of the medical consultant.
The original medical consultant data is bootstrapped 10,000 times. Each simulation creates a sample from the original data, where the proportion of a complication is 3/62. The bootstrap 2.5 percentile proportion is 0 and the 97.5 percentile is 0.113. The result is: we are 95% confident that, in the population, the true probability of a complication is between 0% and 11.3%.
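For illustration, a percentile-bootstrap sketch of this computation (assuming NumPy; the endpoints will vary slightly from 0 and 0.113 because of simulation randomness):

```python
# A minimal sketch of the percentile bootstrap CI described above.
import numpy as np

rng = np.random.default_rng(2)

surgeries = np.array([1] * 3 + [0] * 59)   # 3 complications in 62 surgeries

boot_props = np.array([
    rng.choice(surgeries, size=surgeries.size, replace=True).mean()
    for _ in range(10_000)
])

# Keep the middle 95% of the bootstrap proportions.
lower, upper = np.percentile(boot_props, [2.5, 97.5])
print(f"95% bootstrap CI: ({lower:.3f}, {upper:.3f})")   # roughly (0.000, 0.113)
```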
\(\hat{p} = \frac{3}{62} = 0.0484 \longrightarrow\) Observed Statistic
\(p_0 = 0.10 \longrightarrow\) The Null Statistic - the average complication rate in the US.
Medical Consultants Claim: Their work meaningfully contributes to reducing complications - far below the 10% US average complication rate.
In hypothesis testing, we always assume that the null hypothesis is true.
\(H_0:\) Under the null assumption, \(p = 0.10\). The consultant’s work does not contribute to reducing complications. The observed statistic \(\hat{p} = \frac{3}{62}\) is just due to the natural variability under the null hypothesis and likely occurred by chance.
\(H_A:\) The alternative hypothesis is \(p < 0.10\). The consultant’s work contributes to reducing complications - far below the 10% US average complication rate. The observed statistic \(\hat{p} = \frac{3}{62}\) is unlikely to occur by chance under the null hypothesis. This is a one-sided test.
The null statistic is \(p_0 = 0.10\) and the observed statistic is \(\hat{p} = \frac{3}{62}\).
The null distribution is created from 10,000 simulated samples. The left tail (proportions as small as or smaller than the observed statistic), representing the p-value for the hypothesis test, contains 0.117 (11.7%) of the simulations.
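A minimal sketch of how such a null distribution could be simulated (assuming NumPy; the seed is arbitrary and the simulated p-value will differ slightly from the 0.117 quoted here):

```python
# Simulate the null distribution: draw 10,000 samples of 62 surgeries assuming
# the complication rate really is 0.10, then see how often the simulated
# proportion is as small as the observed 3/62.
import numpy as np

rng = np.random.default_rng(3)

p_null, n, n_sims = 0.10, 62, 10_000
null_props = rng.binomial(n, p_null, size=n_sims) / n

p_hat = 3 / 62
p_value = np.mean(null_props <= p_hat)   # one-sided (left-tail) p-value
print(f"simulated p-value: {p_value:.3f}")   # close to the 0.117 quoted above
```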
Because the estimated p-value is 0.117, which is larger than the significance level \(0.05\), we fail to reject the null hypothesis.
We don’t have enough evidence to support the alternative hypothesis, but this does not say anything definitive about the consultant’s performance.
The hypothesis test and the confidence interval agree with each other: for significance level \(\alpha\), the corresponding confidence level is \(1-\alpha\).
Remember that the 95% Confidence Interval (CI) of the population proportion of a complication is between 0 (0%) and 0.113 (11.3%). The interval contains the null statistic 0.10 (10%).
Also, we fail to reject the null hypothesis because the p-value is \(0.117\), which is larger than the significance level \(0.05\).
Previously, we only considered \(H_A: p < 0.10\). But maybe \(p > 0.10\). We need to consider a two-sided test or other possibilities.
A Confidence Interval is a range of possible values that is likely to capture an unknown parameter, given a certain degree of probability (confidence).
This interval tells us nothing about the distribution of the true parameter \(p\). The population parameter \(p\) is a fixed unknown number.
A 95% confidence interval gives us a procedure such that, had we repeated the data collection and recomputed the interval many times, 95% of those intervals would contain the true value \(p\).
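To make this interpretation concrete, here is a small simulation sketch (assuming NumPy, a made-up population with \(p = 0.30\) and \(n = 200\), and normal-approximation intervals rather than bootstrap intervals): roughly 95% of the intervals built from repeated samples capture the fixed true value.

```python
# A minimal sketch of the "repeated sampling" interpretation of a 95% CI.
import numpy as np

rng = np.random.default_rng(4)

p_true, n, n_reps = 0.30, 200, 10_000
covered = 0
for _ in range(n_reps):
    p_hat = rng.binomial(n, p_true) / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)        # normal-approximation interval
    lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
    covered += (lower <= p_true <= upper)

print(f"coverage: {covered / n_reps:.3f}")   # close to 0.95
```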
If the null hypothesized value falls inside our confidence interval, then we do not have a significant result, and the corresponding p-value would be high (larger than \(\alpha\)).
Looking only at the CI: for a significant result, the null statistic should fall outside the 95% CI. (Remember that we can choose this confidence level: 95%, 98%, 99%, etc.)
We will talk more about how we can use CIs and hypothesis testing together to make strong conclusions!
In this lecture we talked about:
Bootstrapping and how it is used to compute Confidence Intervals.
The connection between Confidence Intervals and Hypothesis Testing.
In the next lectures, we will talk about:
More details on how to compute quantiles and percentiles.
Confidence intervals of means.
The normal distribution.
Within your group, discuss the answers for the following problem.
Twitter users and news. A poll conducted in 2013 found that 52% of all US adult Twitter users get at least some news on Twitter. However, this value was based on a sample, so it may not be a perfect estimate for the population parameter of interest on its own. The study was based on a sample of 736 adults. Below is a distribution of 1000 bootstrapped sample proportions from the Pew dataset. OpenIntro: IMS Section 12.5
Using the distribution of 1000 bootstrapped proportions, approximate a 98% confidence interval for the true proportion of US adult Twitter users (in 2013) who get at least some of their news from Twitter. Interpret the interval in the context of the problem.
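One possible computational sketch of this approximation (assuming NumPy and reconstructing a 0/1 sample with 52% “yes” answers out of 736; the lecture’s figure uses its own 1,000 bootstrap proportions, so exact endpoints will differ):

```python
# A re-simulation sketch, not the Pew data or the figure's bootstrap draws.
import numpy as np

rng = np.random.default_rng(5)

n = 736
n_yes = round(0.52 * n)
users = np.array([1] * n_yes + [0] * (n - n_yes))   # reconstructed 0/1 sample

boot_props = np.array([
    rng.choice(users, size=n, replace=True).mean()
    for _ in range(1_000)
])

# A 98% percentile interval keeps the middle 98% of the bootstrap proportions.
lower, upper = np.percentile(boot_props, [1, 99])
print(f"approximate 98% CI: ({lower:.3f}, {upper:.3f})")
```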