Learning Objectives


Upon completing today’s lab activity, students should be able to do the following using R and RStudio:

  1. Perform the Chi-Squared Test for Independence using simulation and theoretical methods.


library(tidyverse)
library(openintro)
library(dplyr)
library(infer)
library(kableExtra)
library(janitor)


Lizard Habitats


The exercise problem shown below was taken and slightly modified from your textbook, OpenIntro: Introduction to Modern Statistics, Section 18.4.

Consider the following problem statement.

In order to assess whether habitat conditions are related to the sunlight choices a lizard makes for resting, Western fence lizard (Sceloporus occidentalis) were observed across three different microhabitats. Adolph1990 Asbury2007

The lizard_habitat data used in this exercise can be found in the openintro R package.

  • The null and alternative hypotheses: \[H_0: \text{Sunlight and site are independent. Sunlight choices do not vary by site.}\] \[H_A: \text{Sunlight and site are dependent. Sunlight choices vary by site.}\]

The data: the code below counts the number of lizards observed in each combination of site and sunlight.

lizard_habitat %>% 
  count(site, sunlight)
## # A tibble: 9 × 3
##   site     sunlight     n
##   <fct>    <fct>    <int>
## 1 desert   sun         16
## 2 desert   partial     32
## 3 desert   shade       71
## 4 mountain sun         56
## 5 mountain partial     36
## 6 mountain shade       15
## 7 valley   sun         42
## 8 valley   partial     40
## 9 valley   shade       24

Here is the data printed as a table.

lizard_habitat %>%
  count(site, sunlight) %>%
  pivot_wider(names_from = sunlight, values_from = n) %>%
  adorn_totals(where = c("row", "col")) %>%
  kbl(align = "lrrrr", booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), 
                latex_options = "HOLD_position",
                full_width = FALSE) %>%
  add_header_above(c(" "=1, "sunlight" = 3, " " = 1)) %>%
  column_spec(1:5, width = "5em")
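
Counts by site (rows) and sunlight (columns):

| site     | sun | partial | shade | Total |
|:---------|----:|--------:|------:|------:|
| desert   |  16 |      32 |    71 |   119 |
| mountain |  56 |      36 |    15 |   107 |
| valley   |  42 |      40 |    24 |   106 |
| Total    | 114 |     108 |   110 |   332 |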

Throughout this demonstration, we will be using the infer package.

Calculating the observed \(\chi^2\) statistic.

Calculating the observed statistic,

Chisq_hat <- lizard_habitat %>%
  specify(formula = site ~ sunlight) %>% 
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")

Chisq_hat
## Response: site (factor)
## Explanatory: sunlight (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  68.8

Alternatively, using the observe() wrapper to calculate the observed statistic,

Chisq_hat <- lizard_habitat %>%
  observe(formula = site ~ sunlight, stat = "Chisq")

Chisq_hat
## Response: site (factor)
## Explanatory: sunlight (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  68.8

Randomization/Permutation Method

Then, generating the null distribution using randomizations,

null_dist <- lizard_habitat %>%
  specify(site ~ sunlight) %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "Chisq")

Visualizing the observed statistic alongside the null distribution,

visualize(null_dist) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Theoretical Method

Before using the theoretical method for the Chi-Squared test for independence, you need to check if the conditions are met.

Conditions:

  • Independent observations
  • Large samples: at least 5 expected counts in each cell

Defining the theory-based null distribution with the assume() verb,

null_dist_theory <- lizard_habitat %>%
  specify(site ~ sunlight) %>%
  assume(distribution = "Chisq")

Visualizing the observed statistic using the theory-based null distribution,

visualize(null_dist_theory) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Alternatively, visualizing the observed statistic using both of the null distributions,

visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.

Note that the above code is passed the randomization-based null distribution (null_dist); method = "both" overlays the theoretical curve on top of it.

Calculating the p-value

Calculating the p-value from the null distribution and observed statistic,

null_dist %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Alternatively, using the chisq_test() wrapper to carry out the theory-based test,

lizard_habitat %>%
  chisq_test(formula = site ~ sunlight)
## # A tibble: 1 × 3
##   statistic chisq_df  p_value
##       <dbl>    <int>    <dbl>
## 1      68.8        4 4.12e-14

Since the p-value is close to zero, we reject the null hypothesis and conclude that there is an association between sunlight choice and the site on which the lizards prefer to live their best lives.

Ngrams and Collocations

Sorry, I thought I would have time for this, but I am just busy busy lemon squishy. If you are interested in learning about statistical natural language processing using text data, or data analysis in general, I recommend taking the Data Science class in Spring of 2022. The prerequisite for Data Science is Introduction to Probability and Statistics. If you are doing well in Introduction to Probability and Statistics, then you should be okay in Data Science.


Mini Activities


Disaggregating Asian American tobacco use, data.

Understanding cultural differences in tobacco use across different demographic groups can lead to improved health care education and treatment. A recent study disaggregated tobacco use across Asian American ethnic groups including Asian-Indian (n = 4,373), Chinese (n = 4,736), and Filipino (n = 4,912), in comparison to non-Hispanic Whites (n = 275,025). The number of current smokers in each group was reported as Asian-Indian (n = 223), Chinese (n = 279), Filipino (n = 609), and non-Hispanic Whites (n = 50,880). Rao2021

Here is the code that builds the data.

asian_smoke <- tibble(
  ethnicity = c(
    rep("Asian-Indian", 4373),
    rep("Chinese", 4736),
    rep("Filipino", 4912)
  ),
  outcome = c(
    rep("smoke", 223), rep("don't smoke", 4150),
    rep("smoke", 279), rep("don't smoke", 4457),
    rep("smoke", 609), rep("don't smoke", 4303)
  )
)

Here is the data printed as a table.

asian_smoke %>%
  count(ethnicity, outcome) %>%
  pivot_wider(names_from = outcome, values_from = n) %>%
  adorn_totals(where = c("row", "col")) %>%
  kbl(align = "lrrrr", booktabs = TRUE, format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), 
                latex_options = "HOLD_position",
                full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Smoking" = 2, " " = 1)) %>%
  column_spec(1, width = "7em") %>%
  column_spec(2:4, width = "5em")
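
Counts by ethnicity (rows) and smoking outcome (columns):

| ethnicity    | don't smoke | smoke |  Total |
|:-------------|------------:|------:|-------:|
| Asian-Indian |       4,150 |   223 |  4,373 |
| Chinese      |       4,457 |   279 |  4,736 |
| Filipino     |       4,303 |   609 |  4,912 |
| Total        |      12,910 | 1,111 | 14,021 |
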
  • In order to assess whether there is a difference in current smoking rates across three Asian American ethnic groups, the observed data is compared to the data that would be expected if there were no association between the variables.

  1. What are the null and alternative hypotheses?

  2. Carry out the randomization procedure and theoretical method for the Chi-Squared test for independence. Plot the resulting distributions.

  3. Compute the p-value using both the randomization and theoretical methods.

  4. What is the conclusion?


---
title: "7 - Inference for Two-Way Tables"
author: "Alex John Quijano"
date: "11/16/2021"
output: openintro::lab_report
---

## **Learning Objectives**

<br>

Upon completing today's lab activity, students should be able to do the following using R and RStudio:
  
  1. Perform the Chi-Squared Test for Independence using simulation and theoretical methods.
  
<br>

```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(openintro)
library(dplyr)
library(infer)
library(kableExtra)
library(janitor)
```

<br>

## **Lizard Habitats**

<br>

The exercise problem shown below was taken and slightly modified from your textbook, [OpenIntro: Introduction to Modern Statistics Section 18.4](https://openintro-ims.netlify.app/inference-tables.html#chp18-exercises){target="_blank"}.

Consider the following problem statement.

> In order to assess whether habitat conditions are related to the sunlight choices a lizard makes for resting, Western fence lizard (*Sceloporus occidentalis*) were observed across three different microhabitats. [Adolph1990](https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/1940271?casa_token=WOnmT3osc-MAAAAA:Ja981YuSFrRnYX1iaJMD7TT2t4WFEU5sm5NE-gnJTISMQTjRdA4XlYJcX45uAEGj4-hfjWBZ5rMC2a0){target="_blank"} [Asbury2007](https://scholarship.claremont.edu/cgi/viewcontent.cgi?referer=https://scholar.google.com/&httpsredir=1&article=1277&context=hmc_fac_pub){target="_blank"}

The [`lizard_habitat`](http://openintrostat.github.io/openintro/reference/lizard_habitat.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.

* The null and alternative hypotheses:
  $$H_0: \text{Sunlight and site are independent. Sunlight choices do not vary by site.}$$
  $$H_A: \text{Sunlight and site are dependent. Sunlight choices vary by site.}$$

The data: the code below counts the number of lizards observed in each combination of site and sunlight.

```{r}
lizard_habitat %>% 
  count(site, sunlight)
```

Here is the data printed as a table.

```{r echo=TRUE}
lizard_habitat %>%
  count(site, sunlight) %>%
  pivot_wider(names_from = sunlight, values_from = n) %>%
  adorn_totals(where = c("row", "col")) %>%
  kbl(align = "lrrrr", booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), 
                latex_options = "HOLD_position",
                full_width = FALSE) %>%
  add_header_above(c(" "=1, "sunlight" = 3, " " = 1)) %>%
  column_spec(1:5, width = "5em")
```

Throughout this demonstration, we will be using the `infer` package.

### Calculating the observed $\chi^2$ statistic.

Calculating the observed statistic,

```{r}
Chisq_hat <- lizard_habitat %>%
  specify(formula = site ~ sunlight) %>% 
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")

Chisq_hat
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```{r}
Chisq_hat <- lizard_habitat %>%
  observe(formula = site ~ sunlight, stat = "Chisq")

Chisq_hat
```
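
To see what `calculate(stat = "Chisq")` is summarizing, the statistic can be reproduced by hand as $\chi^2 = \sum (O - E)^2 / E$, where each expected count $E$ is (row total × column total) / grand total. The following is a minimal base-R sketch that re-enters the counts from the table above; the object names (`observed`, `expected`, `chisq_by_hand`) are illustrative and not part of the lab's own code.

```{r}
# Observed counts re-entered from the site-by-sunlight table above
observed <- matrix(
  c(16, 32, 71,
    56, 36, 15,
    42, 40, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    site = c("desert", "mountain", "valley"),
    sunlight = c("sun", "partial", "shade")
  )
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum the squared, scaled deviations over all nine cells
chisq_by_hand <- sum((observed - expected)^2 / expected)
round(chisq_by_hand, 1)
```

The rounded value should agree with the 68.8 that `infer` reports for `Chisq_hat`.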

### Randomization/Permutation Method

Then, generating the null distribution using randomizations,

```{r}
null_dist <- lizard_habitat %>%
  specify(site ~ sunlight) %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "Chisq")
```
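
Conceptually, `generate(type = "permute")` shuffles one variable relative to the other, which breaks any association while keeping both sets of marginal totals fixed. As a rough illustration (not `infer`'s actual implementation), the base-R sketch below rebuilds one row per lizard from the counts above, shuffles the sunlight labels, and recomputes the statistic each time. The names `lizards`, `chisq_stat`, `obs_stat`, and `null_stats`, as well as the seed, are assumptions made for this example.

```{r}
set.seed(2021)  # arbitrary seed so the sketch is reproducible

# One row per lizard, rebuilt from the counts in the table above
lizards <- data.frame(
  site = rep(c("desert", "mountain", "valley"), times = c(119, 107, 106)),
  sunlight = c(
    rep(c("sun", "partial", "shade"), times = c(16, 32, 71)),  # desert
    rep(c("sun", "partial", "shade"), times = c(56, 36, 15)),  # mountain
    rep(c("sun", "partial", "shade"), times = c(42, 40, 24))   # valley
  )
)

# Chi-squared statistic for a pair of categorical vectors
chisq_stat <- function(x, y) {
  observed <- table(x, y)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

obs_stat <- chisq_stat(lizards$site, lizards$sunlight)

# Shuffle the sunlight labels 1000 times, recomputing the statistic each time
null_stats <- replicate(
  1000,
  chisq_stat(lizards$site, sample(lizards$sunlight))
)

# Proportion of shuffled statistics at least as large as the observed one
mean(null_stats >= obs_stat)
```

The final line is the hand-rolled analogue of `get_p_value(direction = "greater")`.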

Visualizing the observed statistic alongside the null distribution,

```{r}
visualize(null_dist) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```

### Theoretical Method

Before using the theoretical method for the Chi-Squared test for independence, you need to check if the conditions are met.

Conditions:

* Independent observations
* Large samples: at least 5 expected counts in each cell
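
One way to check the large-sample condition is to look at the expected counts directly. The sketch below uses base R's `chisq.test()`, whose result includes an `expected` component. The names `site_by_sun` and `expected_counts` are just for illustration, and the counts are re-entered from the table above so the chunk stands alone.

```{r}
# Observed counts re-entered from the site-by-sunlight table above
site_by_sun <- matrix(
  c(16, 32, 71,
    56, 36, 15,
    42, 40, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    site = c("desert", "mountain", "valley"),
    sunlight = c("sun", "partial", "shade")
  )
)

# Expected counts under independence, as computed by chisq.test()
expected_counts <- chisq.test(site_by_sun)$expected
round(expected_counts, 1)

# TRUE when every cell meets the "at least 5" rule of thumb
all(expected_counts >= 5)
```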

Defining the theory-based null distribution with the `assume()` verb,

```{r}
null_dist_theory <- lizard_habitat %>%
  specify(site ~ sunlight) %>%
  assume(distribution = "Chisq")
```

Visualizing the observed statistic using the theory-based null distribution,

```{r}
visualize(null_dist_theory) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```

Alternatively, visualizing the observed statistic using both of the null distributions,

```{r}
visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```

Note that the above code is passed the randomization-based null distribution (`null_dist`); `method = "both"` overlays the theoretical curve on top of it.

### Calculating the p-value

Calculating the p-value from the null distribution and observed statistic,

```{r}
null_dist %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
```
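
A simulated p-value of 0 only means that none of the 1000 permuted statistics reached the observed value, so the true p-value is small but not literally zero. One common convention (an assumption here, not something `infer` applies for you) is to add one to both the count of extreme statistics and the number of replicates, which gives a small positive value instead of zero. The variable names below are illustrative.

```{r}
reps <- 1000      # matches reps in the generate() step above
n_extreme <- 0    # permuted statistics >= observed, implied by a p-value of 0

# "Add one" adjustment: treats the observed statistic as one more replicate
p_adjusted <- (n_extreme + 1) / (reps + 1)
p_adjusted
```

With this adjustment the result would be reported as roughly p < 0.001 rather than p = 0.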

Alternatively, using the `chisq_test()` wrapper to carry out the theory-based test,

```{r}
lizard_habitat %>%
  chisq_test(formula = site ~ sunlight)
```
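
The `chisq_df` value in this output follows from the table's dimensions: a table with $r$ rows and $c$ columns has $(r - 1)(c - 1)$ degrees of freedom, so a 3 × 3 table has 4. As a hedged cross-check, the theory-based p-value is the upper-tail area of that chi-squared distribution beyond the observed statistic, which base R's `pchisq()` can compute. The names `observed`, `deg_free`, and `p_theory` are illustrative.

```{r}
# Observed counts re-entered from the site-by-sunlight table above
observed <- matrix(
  c(16, 32, 71,
    56, 36, 15,
    42, 40, 24),
  nrow = 3, byrow = TRUE
)

# Base-R chi-squared test (the continuity correction only applies to 2x2 tables)
base_test <- chisq.test(observed)

# Degrees of freedom from the table dimensions
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability beyond the observed statistic
p_theory <- pchisq(unname(base_test$statistic), df = deg_free, lower.tail = FALSE)
p_theory
```

This should land on the same order of magnitude as the p-value reported by `chisq_test()`.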

Since the p-value is close to zero, we reject the null hypothesis and conclude that there is an association between sunlight choice and the site on which the lizards prefer to live their best lives.

### Ngrams and Collocations

*Sorry, I thought I would have time for this, but I am just busy busy lemon squishy. If you are interested in learning about statistical natural language processing using text data, or data analysis in general, **I recommend taking the Data Science class in Spring of 2022**. The prerequisite for Data Science is Introduction to Probability and Statistics. If you are doing well in Introduction to Probability and Statistics, then you should be okay in Data Science.*

<br>

## **Mini Activities**

<br>

**Disaggregating Asian American tobacco use, data.**

> Understanding cultural differences in tobacco use across different demographic groups can lead to improved health care education and treatment. A recent study disaggregated tobacco use across Asian American ethnic groups including Asian-Indian (n = 4,373), Chinese (n = 4,736), and Filipino (n = 4,912), in comparison to non-Hispanic Whites (n = 275,025).  The number of current smokers in each group was reported as Asian-Indian (n = 223), Chinese (n = 279), Filipino (n = 609), and non-Hispanic Whites (n = 50,880). [Rao2021](https://link.springer.com/article/10.1007%2Fs40615-021-01024-5){target="_blank"}
  
Here is the code that builds the data.

```{r}
asian_smoke <- tibble(
  ethnicity = c(
    rep("Asian-Indian", 4373),
    rep("Chinese", 4736),
    rep("Filipino", 4912)
  ),
  outcome = c(
    rep("smoke", 223), rep("don't smoke", 4150),
    rep("smoke", 279), rep("don't smoke", 4457),
    rep("smoke", 609), rep("don't smoke", 4303)
  )
)
```
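
Because `ethnicity` and `outcome` are built by separate `rep()` calls, the two columns only line up if the per-group counts add up correctly (for example, 223 + 4150 = 4373 for Asian-Indian). A quick cross-tabulation, sketched here in base R with the same vectors, can confirm this before running any test. The name `smoke_tab` is illustrative.

```{r}
# Same vectors as in the tibble above, rebuilt so this chunk stands alone
ethnicity <- c(
  rep("Asian-Indian", 4373),
  rep("Chinese", 4736),
  rep("Filipino", 4912)
)
outcome <- c(
  rep("smoke", 223), rep("don't smoke", 4150),
  rep("smoke", 279), rep("don't smoke", 4457),
  rep("smoke", 609), rep("don't smoke", 4303)
)

# Both columns must have the same length to form a valid data frame
stopifnot(length(ethnicity) == length(outcome))

# Cross-tabulate to confirm each group has the intended number of smokers
smoke_tab <- table(ethnicity, outcome)
smoke_tab
```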

Here is the data printed as a table.

```{r echo=TRUE}
asian_smoke %>%
  count(ethnicity, outcome) %>%
  pivot_wider(names_from = outcome, values_from = n) %>%
  adorn_totals(where = c("row", "col")) %>%
  kbl(align = "lrrrr", booktabs = TRUE, format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), 
                latex_options = "HOLD_position",
                full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Smoking" = 2, " " = 1)) %>%
  column_spec(1, width = "7em") %>%
  column_spec(2:4, width = "5em")
```

  * In order to assess whether there is a difference in current smoking rates across three Asian American ethnic groups, the observed data is compared to the data that would be expected if there were no association between the variables.
  
  1. What are the null and alternative hypotheses?
  
  2. Carry out the randomization procedure and theoretical method for the Chi-Squared test for independence. Plot the resulting distributions.
  
  3. Compute the p-value using both the randomization and theoretical methods.
  
  4. What is the conclusion?

<br>
