Previously on Statistics…

Inference for proportions
Inference for two-way tables
Central Limit Theorem
Population parameter vs sample statistic

Inference on Single Mean

Today, we will discuss the following:

Using theoretical methods to compute confidence intervals for a single mean.
The Student’s t distribution

Central Limit Theorem for the sample mean

When we collect a sufficiently large sample of \(n\) independent observations from a population with mean \(\mu\) and standard deviation \(\sigma,\) the sampling distribution of \(\bar{x}\) will be nearly normal with

\[\text{Mean} = \mu \qquad \text{Standard Error }(SE) = \frac{\sigma}{\sqrt{n}}\]
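A small simulation can make this result concrete. The sketch below assumes an exponential population with rate 1 (so \(\mu = 1\) and \(\sigma = 1\)), a sample size of 50, and 10,000 repeated samples; these choices are illustrative only and are not part of the lecture data.

```r
# Illustrative simulation (assumed population: exponential with rate 1,
# which has mean mu = 1 and standard deviation sigma = 1).
set.seed(1)          # arbitrary seed, for reproducibility only
n     <- 50          # assumed sample size
reps  <- 10000       # assumed number of repeated samples
mu    <- 1
sigma <- 1

# Draw `reps` samples of size n and record each sample mean
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

# Compare the simulated sampling distribution with the CLT values
c(simulated_mean = mean(xbar), theoretical_mean = mu)
c(simulated_se = sd(xbar), theoretical_se = sigma / sqrt(n))
```

Calling hist(xbar) afterwards should show a roughly bell-shaped histogram, even though the exponential population itself is strongly right-skewed.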

Evaluating the two conditions required for modeling \(\bar{x}\)

Two conditions are required to apply the Central Limit Theorem for a sample mean \(\bar{x}:\)

Independence. The sample observations must be independent. The most common way to satisfy this condition is when the sample is a simple random sample from the population.
Normality. When a sample is small, we also require that the sample observations come from a normally distributed population. We can relax this condition more and more for larger and larger sample sizes. This condition is vague, which makes it difficult to evaluate, so next we introduce a couple of rules of thumb to make checking it easier.

General rule for performing the normality check

Note, it often takes practice to get a sense for whether or not a normal approximation is appropriate.

\(\mathbf{n < 30}:\) If the sample size \(n\) is less than 30 and there are no clear outliers in the data, then we typically assume the data come from a nearly normal distribution to satisfy the condition.
\(\mathbf{n \geq 30}:\) If the sample size \(n\) is at least 30 and there are no particularly extreme outliers, then we typically assume the sampling distribution of \(\bar{x}\) is nearly normal, even if the underlying distribution of individual observations is not.

Normality Assessment (1/2)

Consider the two plots provided that come from simple random samples from different populations.

Are the independence and normality conditions met in each case?

Histograms of samples from two different populations.

Normality Assessment (2/2)

The first sample has fewer than 30 observations, so we are watching for any clear outliers. With no clear outliers, the normality condition can be reasonably assumed to be met.
The second sample has a sample size greater than 30 and includes an outlier. This is an example of a particularly extreme outlier, so the normality condition would not be satisfied.

The t-distribution (1/2)

Comparison of a \(t\)-distribution and a normal distribution.

The \(t\)-distribution is always centered at zero and has a single parameter: degrees of freedom. The degrees of freedom describes the precise form of the bell-shaped \(t\)-distribution. In general, we’ll use a \(t\)-distribution with \(df = n - 1\) to model the sample mean when the sample size is \(n.\)

The t-distribution (2/2)

The larger the degrees of freedom, the more closely the \(t\)-distribution resembles the standard normal distribution.
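One way to see this convergence is to compare the upper 2.5% cutoffs of several \(t\)-distributions with the standard normal cutoff. The degrees of freedom listed below are arbitrary illustrative choices.

```r
# Upper 2.5% cutoffs for t-distributions with increasing degrees of freedom
# (the df values are arbitrary, chosen only for illustration)
df_values <- c(2, 5, 10, 30, 100, 1000)
t_cutoffs <- qt(0.975, df = df_values)

# The corresponding cutoff for the standard normal distribution
z_cutoff <- qnorm(0.975)

# Side-by-side comparison: the gap shrinks as df grows
data.frame(
  df              = df_values,
  t_cutoff        = round(t_cutoffs, 3),
  gap_from_normal = round(t_cutoffs - z_cutoff, 3)
)
```

The \(t\) cutoff is always larger than the normal cutoff because the \(t\)-distribution has thicker tails, and the difference becomes negligible for large degrees of freedom.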

Mercury content in Risso’s dolphins (1/3)

We will identify a confidence interval for the average mercury content in dolphin muscle using a sample of 19 Risso’s dolphins from the Taiji area in Japan.

Summary of mercury content in the muscle of 19 Risso’s dolphins from the Taiji area. Measurements are in micrograms of mercury per wet gram of muscle (\(\mu\)g/wet g).
n	Mean	SD	Min	Max
19	4.4	2.3	1.7	9.2

Question - Are the independence and normality conditions satisfied for this dataset?

The observations are a simple random sample, so it is reasonable to assume that the dolphins are independent. The summary statistics do not suggest any clear outliers: the minimum and maximum are both within 2.5 standard deviations of the mean, so all observations are too. Based on this evidence, the normality condition seems reasonable.
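The outlier check can be reproduced from the summary table alone by expressing the minimum and maximum as distances from the mean in standard deviations. This is a minimal sketch; the variable names are chosen here for illustration.

```r
# Summary statistics from the table (19 Risso's dolphins)
xbar  <- 4.4   # sample mean
s     <- 2.3   # sample standard deviation
x_min <- 1.7
x_max <- 9.2

# Distance of the extremes from the mean, in standard deviations
z_min <- (x_min - xbar) / s
z_max <- (x_max - xbar) / s
round(c(z_min = z_min, z_max = z_max), 2)
#>  z_min  z_max
#>  -1.17   2.09
```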

Mercury content in Risso’s dolphins (2/3)

One sample t-intervals

\[ \begin{aligned} \text{point estimate} \ &\pm\ t^*_{df} SE \\ \bar{x} \ &\pm\ t^*_{df} \frac{s}{\sqrt{n}} \end{aligned} \]

We plug in \(s\) and \(n\) into the formula: \(SE = \frac{s}{\sqrt{n}} = \frac{2.3}{\sqrt{19}} = 0.528.\)
The degrees of freedom is easy to calculate: \(df = n - 1 = 19 -1 = 18.\) Using statistical software, we find the cutoff where the upper tail is equal to 2.5%: \(t^*_{18} = 2.10.\) The area below -2.10 will also be equal to 2.5%.

# use qt() to find the t-cutoff (with 95% in the middle)
qt(0.025, df = 18)
#> [1] -2.1
qt(0.975, df = 18)
#> [1] 2.1

Mercury content in Risso’s dolphins (3/3)

One sample t-intervals

\[ \begin{aligned} \bar{x} \ &\pm\ t^*_{18} SE \\ 4.4 \ &\pm\ 2.10 (0.528) \\ \end{aligned} \] \[(3.29,5.51)\]

We are 95% confident the average mercury content of muscles in Risso’s dolphins is between 3.29 and 5.51 \(\mu\)g/wet gram, which is considered extremely high.
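The whole calculation can be carried out in a few lines of R from the summary statistics. This is a sketch using the values in the table; with the raw measurements, t.test() would report the same kind of interval directly.

```r
# Summary statistics for the 19 Risso's dolphins
n    <- 19
xbar <- 4.4
s    <- 2.3

se     <- s / sqrt(n)               # standard error of the mean
t_star <- qt(0.975, df = n - 1)     # cutoff with 95% in the middle
ci     <- xbar + c(-1, 1) * t_star * se

round(se, 3)
#> [1] 0.528
round(ci, 2)
#> [1] 3.29 5.51
```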

The Margin of Error for Means

\[ME = t^*_{df}SE = t^*_{df} \frac{s}{\sqrt{n}}\]

where \(t^*_{df}\) is calculated from a specified percentile on the t-distribution with df degrees of freedom.

We can work backwards:

to compute the critical \(t^*_{df}\) if given the \(ME\), \(n\), and \(s\).
to compute the \(SE\) if given the \(ME\) and \(t^*_{df}\).
to compute the \(n\) if given the \(ME\), \(t^*_{df}\), and \(s\).
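Each of these rearrangements is simple algebra on \(ME = t^*_{df}\, s/\sqrt{n}\). The sketch below borrows the dolphin values purely as an illustration: it computes the margin of error forwards, then recovers each quantity from the others.

```r
# Dolphin example values, used only to illustrate the algebra
n      <- 19
s      <- 2.3
t_star <- qt(0.975, df = n - 1)
me     <- t_star * s / sqrt(n)      # forward calculation of ME

# 1. Critical value from ME, n, and s
t_from_me <- me * sqrt(n) / s

# 2. Standard error from ME and t*
se_from_me <- me / t_star

# 3. Sample size from ME, t*, and s
n_from_me <- (t_star * s / me)^2

c(t_star = t_from_me, se = se_from_me, n = n_from_me)
```

The third rearrangement treats \(t^*_{df}\) as given. When planning a study, \(t^*_{df}\) itself depends on \(n\) through the degrees of freedom, so in that setting the result is only a starting value.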

10.10-Minute Activity (1/4)

The exercise problem shown below was taken and slightly modified from your textbook OpenIntro: Introduction to Modern Statistics Section 19.4.

Heights of adults.

Researchers studying anthropometry collected body measurements, as well as age, weight, height and gender, for 507 physically active individuals. Summary statistics for the distribution of heights (measured in centimeters), along with a histogram, are provided below. (Heinz et al. 2003)

Min	Q1	Median	Mean	Q3	Max	SD	IQR
147	164	170	171	178	198	9.4	14

Check if the conditions are satisfied.
Compute the 90% confidence interval for the average heights of adults.
Work backwards to compute the critical \(t^*_{df}\) for a margin of error of \(0.001\).


10.10-Minute Activity (2/4)

- Independence: The observations are a simple random sample, therefore it is reasonable to assume that the 507 physically active individuals are independent.
- Normality: Based on the summary statistics and the histogram, there are no clear outliers, and the sample size is large enough to assume that the sampling distribution of the mean is nearly normal.

10.10-Minute Activity (3/4)

- Compute the standard error, using the \(n = 507\) individuals in the sample. \[SE = \frac{s}{\sqrt{n}} = \frac{9.4}{\sqrt{507}} = 0.4175\]
- Compute the \(t^*_{df}\). Given a confidence level of 90%, the critical value computed using the R command qt(0.95, 506) is shown below. \[t^*_{506} = 1.6479\]
- Compute the 90% confidence interval. \[ \begin{aligned} \bar{x} & \pm t^*_{506} SE \\ 171 & \pm 1.6479 (0.4175) \\ \end{aligned} \] \[(170.312,171.688)\]
Therefore, we are 90% confident that the true mean height of adults is between \(170.312\)cm and \(171.688\)cm.

10.10-Minute Activity (4/4)

- The goal is to find the critical \(t^*_{df}\) for a margin of error \(ME = 0.001\).
- Work backwards. \[ \begin{aligned} ME & = t^*_{df}SE \\ 0.001 & = t^*_{df} (0.4175) \\ t^*_{df} & = \frac{0.001}{0.4175} \\ t^*_{df} & = 0.0024 \\ \end{aligned} \]
With \(t^*_{df} = 0.0024\) and the same standard error and degrees of freedom, the interval becomes much narrower, but at the cost of confidence: a critical value this small corresponds to a confidence level of only about 0.2%.
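The activity can be checked in R. The sketch below takes \(n = 507\) from the exercise statement together with the mean and SD from the summary table. The last step, converting the back-solved critical value into a confidence level with pt(), is an addition here to show how little confidence such a narrow interval carries.

```r
# Values from the exercise: 507 individuals, mean 171 cm, SD 9.4 cm
n    <- 507
xbar <- 171
s    <- 9.4

# 90% confidence interval for the mean height
se     <- s / sqrt(n)
t_star <- qt(0.95, df = n - 1)      # 5% in each tail
ci     <- xbar + c(-1, 1) * t_star * se
round(c(se = se, t_star = t_star), 4)
round(ci, 3)

# Work backwards: critical value needed for a margin of error of 0.001
me_target <- 0.001
t_needed  <- me_target / se

# Confidence level implied by that critical value (central area under t)
conf_level <- 2 * pt(t_needed, df = n - 1) - 1
c(t_needed = t_needed, conf_level = conf_level)
```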

Summary

Today, we discussed the following:

The Student’s t distribution
Computing confidence intervals using one-sample t-intervals

Next, we will discuss:

Hypothesis testing on a single mean using one sample t-tests
Hypothesis testing on two means using two sample t-tests

12 - Inference for One Mean Confidence Intervals