Upon completing today’s lab activity, students should be able to do the following using R and RStudio:
library(tidyverse)
library(openintro)
library(dplyr)
library(infer)
library(kableExtra)
library(janitor)
Every year, the US releases to the public a large data set containing information on births recorded in the country. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of 1,000 cases from the data set released in 2014.
The births14 data can be found in the openintro R package.
births14 <- births14 %>% na.omit()
glimpse(births14)
## Rows: 794
## Columns: 13
## $ fage <int> 34, 36, 37, 32, 32, 37, 29, 30, 29, 30, 34, 28, 32, 24,…
## $ mage <dbl> 34, 31, 36, 31, 26, 36, 24, 32, 26, 34, 27, 22, 25, 20,…
## $ mature <chr> "younger mom", "younger mom", "mature mom", "younger mo…
## $ weeks <dbl> 37, 41, 37, 36, 39, 36, 40, 39, 39, 42, 40, 40, 34, 37,…
## $ premie <chr> "full term", "full term", "full term", "premie", "full …
## $ visits <dbl> 14, 12, 10, 12, 14, 10, 13, 15, 11, 14, 16, 20, 20, 10,…
## $ gained <dbl> 28, 41, 28, 48, 45, 20, 65, 25, 22, 40, 30, 31, 25, 70,…
## $ weight <dbl> 6.96, 8.86, 7.51, 6.75, 6.69, 6.13, 6.74, 8.94, 9.12, 8…
## $ lowbirthweight <chr> "not low", "not low", "not low", "not low", "not low", …
## $ sex <chr> "male", "female", "female", "female", "female", "female…
## $ habit <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", "no…
## $ marital <chr> "married", "married", "married", "married", "married", …
## $ whitemom <chr> "white", "white", "not white", "white", "white", "white…
births14_sample_stats <- births14 %>%
  group_by(habit) %>%
  summarise(
    n = n(),
    Mean = mean(weight),
    SD = sd(weight)
  )
births14_sample_stats %>%
  kbl(linesep = "", booktabs = TRUE, caption = "Summary statistics for the `births14` dataset.",
      col.names = c("Habit", "n", "Mean", "SD"),
      align = "lccc", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"),
                latex_options = c("striped", "hold_position"), full_width = FALSE) %>%
  column_spec(1:4, width = "7em")
| Habit     | n   | Mean | SD   |
|-----------|-----|------|------|
| nonsmoker | 732 | 7.31 | 1.21 |
| smoker    | 62  | 6.55 | 1.61 |
Question: Do the data from the smoking group provide convincing evidence that the average birth weight differs from \(7.5\) lbs?
\(H_0\): The average weight of the smoking group is \(7.5\) lbs. \[\mu = 7.5\]
\(H_A\): The average weight of the smoking group is not \(7.5\) lbs. \[\mu \ne 7.5\]
The null value is \(\mu_0 = 7.5\). The point estimate is \(\bar{x} = 6.55\) and the sample standard deviation is \(s = 1.61\).
Is there a difference in weight means between the smoking group and nonsmoking group?
\(H_0\): There is no difference in means between the smoking and nonsmoking groups. \[\mu_{smoking} = \mu_{nonsmoking}\]
\(H_A\): There is a significant difference in means between the smoking and nonsmoking groups. In particular, the mean weight for the smoking group is less than that for the nonsmoking group. \[\mu_{smoking} < \mu_{nonsmoking}\]
The null value is \(\mu_0 = 0\). The point estimate is \(\bar{x}_{nonsmoking} - \bar{x}_{smoking} = 0.76\) and the sample standard deviations are \(s_{smoking} = 1.61\) and \(s_{nonsmoking} = 1.21\).
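These summary values can be checked directly from the cleaned data. A minimal sketch, assuming births14 has already been passed through na.omit() as above:

```r
# Group sample sizes, means, and standard deviations
births14 %>%
  group_by(habit) %>%
  summarise(n = n(), xbar = mean(weight), s = sd(weight))

# Observed difference in means (nonsmoker minus smoker)
mean(births14$weight[births14$habit == "nonsmoker"]) -
  mean(births14$weight[births14$habit == "smoker"])
```

The group statistics should match the summary table above.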
births14_smoke <- births14 %>% filter(habit == "smoker")
Calculating the observed statistic,
T <- births14_smoke %>%
  specify(response = weight) %>%
  hypothesize(null = "point", mu = 7.5) %>%
  calculate(stat = "t")
T
## Response: weight (numeric)
## Null Hypothesis: point
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -4.63
Alternatively, using the observe() wrapper to calculate the observed statistic,
T <- births14_smoke %>%
  observe(response = weight, null = "point", mu = 7.5, stat = "t")
T
## Response: weight (numeric)
## Null Hypothesis: point
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -4.63
Then, generating the null distribution,
null_dist <- births14_smoke %>%
  specify(response = weight) %>%
  hypothesize(null = "point", mu = 7.5) %>%
  generate(reps = 3000, type = "bootstrap") %>%
  calculate(stat = "t")
Alternatively, finding the null distribution with theoretical methods via the assume() verb,
null_dist_theory <- births14_smoke %>%
  specify(response = weight) %>%
  assume("t")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist, method = "both") +
shade_p_value(obs_stat = T, direction = "two-sided")
## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.
Computing the p-value,
null_dist %>%
  get_p_value(obs_stat = T, direction = "two-sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Alternatively, using the t_test wrapper:
births14_smoke %>%
  t_test(response = weight, mu = 7.5)
## # A tibble: 1 × 7
## statistic t_df p_value alternative estimate lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 -4.63 61 0.0000195 two.sided 6.55 6.15 6.96
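As a sanity check, the two-sided p-value reported by t_test() can be reproduced from the t distribution with base R, using the statistic and degrees of freedom shown in the output above:

```r
# Two-sided p-value for the observed t = -4.63 with df = 61 (n = 62 smokers);
# should be close to the 0.0000195 reported by t_test()
2 * pt(-4.63, df = 61)
```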
infer does not support testing on one numerical variable via the z distribution.
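If a z-based test is nonetheless wanted, it can be sketched in base R. This is an illustration, not an infer feature, and assumes births14_smoke from above:

```r
# One-sample z statistic and two-sided p-value for H0: mu = 7.5,
# treating the sample standard deviation as if it were known
x <- births14_smoke$weight
z <- (mean(x) - 7.5) / (sd(x) / sqrt(length(x)))
2 * pnorm(-abs(z))
```

With a moderately large sample the result will be close to, but not identical to, the t-based p-value.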
Finding the observed statistic,
x_bar <- births14_smoke %>%
  specify(response = weight) %>%
  calculate(stat = "mean")
Alternatively, using the observe() wrapper to calculate the observed statistic,
x_bar <- births14_smoke %>%
  observe(response = weight, stat = "mean")
Then, generating a bootstrap distribution,
boot_dist <- births14_smoke %>%
  specify(response = weight) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
Use the bootstrap distribution to find a confidence interval,
percentile_ci <- get_ci(boot_dist)
Visualizing the observed statistic alongside the distribution,
visualize(boot_dist) +
shade_confidence_interval(endpoints = percentile_ci)
Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,
standard_error_ci <- get_ci(boot_dist, type = "se", point_estimate = x_bar)
visualize(boot_dist) +
shade_confidence_interval(endpoints = standard_error_ci)
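The standard-error interval is just the point estimate plus or minus a multiplier times the bootstrap standard error. A hand-rolled sketch of the same idea, assuming x_bar and boot_dist from above:

```r
# get_ci(type = "se") is essentially point estimate +/- qnorm(0.975) * SE,
# where SE is the standard deviation of the bootstrap statistics
se <- sd(boot_dist$stat)
c(lower = x_bar$stat - qnorm(0.975) * se,
  upper = x_bar$stat + qnorm(0.975) * se)
```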
Instead of a simulation-based bootstrap distribution, we can also define a theory-based sampling distribution,
sampling_dist <- births14_smoke %>%
  specify(response = weight) %>%
  assume(distribution = "t")
Visualization and calculation of confidence intervals interface in the same way as with the simulation-based distribution,
theor_ci <- get_ci(sampling_dist, point_estimate = x_bar)

theor_ci
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 6.15 6.96
visualize(sampling_dist) +
shade_confidence_interval(endpoints = theor_ci)
Note that the t distribution is recentered and rescaled to lie on the scale of the observed data. infer does not support confidence intervals on means via the z distribution.
Finding the standardized observed statistic,
t_hat <- births14 %>%
  specify(weight ~ habit) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "t", order = c("smoker", "nonsmoker"))
t_hat
## Response: weight (numeric)
## Explanatory: habit (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -3.60
Alternatively, using the observe() wrapper to calculate the observed statistic,
t_hat <- births14 %>%
  observe(weight ~ habit,
          stat = "t", order = c("smoker", "nonsmoker"))
Then, generating the null distribution,
null_dist <- births14 %>%
  specify(weight ~ habit) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("smoker", "nonsmoker"))
Alternatively, finding the null distribution with theoretical methods via the assume() verb,
null_dist_theory <- births14 %>%
  specify(weight ~ habit) %>%
  assume("t")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = t_hat, direction = "less")
Alternatively, visualizing the observed statistic using the theory-based null distribution,
visualize(null_dist_theory) +
shade_p_value(obs_stat = t_hat, direction = "less")
Alternatively, visualizing the observed statistic using both of the null distributions,
visualize(null_dist, method = "both") +
shade_p_value(obs_stat = t_hat, direction = "less")
## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,
%>%
null_dist get_p_value(obs_stat = t_hat, direction = "less")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.001
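The same one-sided test can be cross-checked with infer's t_test() wrapper; the order argument below matches the one used for t_hat:

```r
# Welch two-sample t-test, H_A: mean(smoker) < mean(nonsmoker)
births14 %>%
  t_test(weight ~ habit,
         order = c("smoker", "nonsmoker"),
         alternative = "less")
```

The theory-based p-value should be of the same order of magnitude as the randomization-based one above.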
Note the similarities in this plot and the previous one.
Finding the standardized point estimate,
<- births14 %>%
t_hat specify(weight ~ habit) %>%
calculate(stat = "t", order = c("smoker", "nonsmoker"))
Alternatively, using the observe() wrapper to calculate the observed statistic,
t_hat <- births14 %>%
  observe(weight ~ habit,
          stat = "t", order = c("smoker", "nonsmoker"))
Then, generating a bootstrap distribution,
boot_dist <- births14 %>%
  specify(weight ~ habit) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "t", order = c("smoker", "nonsmoker"))
Use the bootstrap distribution to find a confidence interval,
percentile_ci <- get_ci(boot_dist)
Visualizing the observed statistic alongside the distribution,
visualize(boot_dist) +
shade_confidence_interval(endpoints = percentile_ci)
Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = t_hat)
visualize(boot_dist) +
shade_confidence_interval(endpoints = standard_error_ci)
See the above subsection (diff in means) for a theory-based approach. infer does not support confidence intervals on means via the z distribution.
The ucla_textbooks_f18 data can be found in the openintro R package.
glimpse(ucla_textbooks_f18)
## Rows: 201
## Columns: 20
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ term <fct> Fall, Fall, Fall, Fall, Fall, Fall, Fall, Fall, Fall,…
## $ subject <fct> "American Indian Studies", "Anthropology", "Art", "Ar…
## $ subject_abbr <fct> AM IND, ANTHRO, NA, ART&ARC, NA, ASTR, BMD RES, CHEM,…
## $ course <fct> "Introduction to American Indian Studies", "Archaeolo…
## $ course_num <fct> M10, 2, 11D, 10, M60W, 4, 5HA, 14A, 14B, M5A, 10, 1E,…
## $ course_numeric <int> 10, 2, 11, 10, 60, 4, 5, 14, 14, 5, 10, 1, 2, 9, 10, …
## $ seminar <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ ind_study <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ apprenticeship <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ internship <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ honors_contracts <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ laboratory <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ special_topic <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ textbook_isbn <chr> "9781138477858", "9780307741806", NA, "9780979757549"…
## $ bookstore_new <dbl> 47.97, 14.26, NA, 13.50, 49.26, 119.97, NA, 188.10, 2…
## $ bookstore_used <dbl> 44.97, 10.96, NA, 11.00, 43.26, 67.00, NA, 149.00, 16…
## $ amazon_new <dbl> 47.45, 13.55, NA, 12.53, 54.95, 124.80, NA, NA, NA, N…
## $ amazon_used <dbl> 51.20, 7.10, NA, NA, 24.83, 47.75, NA, NA, 90.00, NA,…
## $ notes <fct> "", "", NA, "", NA, NA, NA, "", "New version only fro…
Select the prices for new books from the bookstore and Amazon.
new_books <- ucla_textbooks_f18 %>% select(bookstore_new, amazon_new) %>% na.omit()
new_books
## # A tibble: 68 × 2
## bookstore_new amazon_new
## <dbl> <dbl>
## 1 48.0 47.4
## 2 14.3 13.6
## 3 13.5 12.5
## 4 49.3 55.0
## 5 120. 125.
## 6 17.0 11.8
## 7 12.0 10.9
## 8 26.8 38.9
## 9 9.96 8.99
## 10 40.0 35
## # … with 58 more rows
Next, we take the difference of the pairs.
new_books_diff <- new_books %>% mutate(diff = bookstore_new - amazon_new)
new_books_diff
## # A tibble: 68 × 3
## bookstore_new amazon_new diff
## <dbl> <dbl> <dbl>
## 1 48.0 47.4 0.520
## 2 14.3 13.6 0.710
## 3 13.5 12.5 0.97
## 4 49.3 55.0 -5.69
## 5 120. 125. -4.83
## 6 17.0 11.8 5.18
## 7 12.0 10.9 1.09
## 8 26.8 38.9 -12.2
## 9 9.96 8.99 0.97
## 10 40.0 35 4.97
## # … with 58 more rows
Now, we can use the differences to perform a hypothesis test, which is equivalent to a one-sample t-test.
T <- new_books_diff %>%
  specify(response = diff) %>%
  hypothesize(null = "point", mu = 0) %>%
  calculate(stat = "t")
T
## Response: diff (numeric)
## Null Hypothesis: point
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 2.20
null_dist_theory <- new_books_diff %>%
  specify(response = diff) %>%
  assume("t")
visualize(null_dist_theory) +
shade_p_value(obs_stat = T, direction = "two-sided")
null_dist_theory %>%
  get_p_value(obs_stat = T, direction = "two-sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.0312
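The same paired analysis can be cross-checked with base R's t.test(), using paired = TRUE on the raw price columns; this assumes new_books from above:

```r
# Paired t-test: bookstore price minus Amazon price, H0: mean difference = 0
t.test(new_books$bookstore_new, new_books$amazon_new, paired = TRUE)
```

The t statistic, degrees of freedom, and p-value should agree with the infer results above.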
Next, we can compute a confidence interval, which is equivalent to a paired (one-sample) t-interval.
x_bar_diff <- new_books_diff %>%
  specify(response = diff) %>%
  calculate(stat = "mean")
x_bar_diff
## Response: diff (numeric)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 3.58
Simulation Method
boot_dist <- new_books_diff %>%
  specify(response = diff) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
standard_error_ci <- get_ci(boot_dist, type = "se", point_estimate = x_bar_diff)

standard_error_ci
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.489 6.68
visualize(boot_dist) +
shade_confidence_interval(endpoints = standard_error_ci)
Theoretical Method
sampling_dist <- new_books_diff %>%
  specify(response = diff) %>%
  assume(distribution = "t")
theor_ci <- get_ci(sampling_dist, point_estimate = x_bar_diff)

theor_ci
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.334 6.83
visualize(sampling_dist) +
shade_confidence_interval(endpoints = theor_ci)
Go through the tutorial in Chapter 23.4 of the textbook.
Use the births14 data set to perform a hypothesis test and confidence interval using the weight and whitemom variables.
Print out the summary statistics of the variables weight and whitemom. Note that the weight variable is a continuous numerical variable while the whitemom variable is a nominal categorical variable with two levels.
Perform a hypothesis test and confidence interval for the not white level in the whitemom variable. Suppose that the average birth weight is \(7.5\) lbs. Do the data show convincing evidence that non-white individuals have lower average birth weights?
Perform inference (hypothesis testing and confidence interval) for the difference in birth weight means between the two levels of the whitemom variable. Is there a significant difference, and by how much?
Use the ucla_textbooks_f18 data set to perform a hypothesis test and confidence interval for comparing the new and used books.
Compute the paired differences between the UCLA bookstore and Amazon used book prices. Perform a hypothesis test and compute the 95% confidence interval. What are your null and alternative hypotheses? What are your conclusions?
Suppose that you are comparing new and used books regardless of store. Compute the paired difference between new and used books (take the average price for each book between stores and then compute the difference). Compute the 95% confidence interval.