Lizard Habitats
The exercise problems shown below were taken and slightly modified from your textbook, OpenIntro: Introduction to Modern Statistics, Section 18.4.
Consider the following problem statement.
In order to assess whether habitat conditions are related to the sunlight choices a lizard makes for resting, Western fence lizards (Sceloporus occidentalis) were observed across three different microhabitats. Adolph1990 Asbury2007
The lizard_habitat data used in this exercise can be found in the openintro R package.
- The null and alternative hypotheses: \[H_0: \text{Sunlight and site are independent. Sunlight choices do not vary by site.}\] \[H_A: \text{Sunlight and site are dependent. Sunlight choices vary by site.}\]
The data. The code below counts the number of lizards observed in each combination of site and sunlight.
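library(tidyverse)  # dplyr, tidyr, and the %>% pipe
library(openintro)  # provides the lizard_habitat data
lizard_habitat %>%
  count(site, sunlight)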
## # A tibble: 9 × 3
## site sunlight n
## <fct> <fct> <int>
## 1 desert sun 16
## 2 desert partial 32
## 3 desert shade 71
## 4 mountain sun 56
## 5 mountain partial 36
## 6 mountain shade 15
## 7 valley sun 42
## 8 valley partial 40
## 9 valley shade 24
Here is the data printed as a table.
library(janitor)     # adorn_totals()
library(kableExtra)  # kbl(), kable_styling(), add_header_above(), column_spec()
lizard_habitat %>%
  count(site, sunlight) %>%
  pivot_wider(names_from = sunlight, values_from = n) %>%
  adorn_totals(where = c("row", "col")) %>%
  kbl(align = "lrrrr", booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"),
                latex_options = "HOLD_position",
                full_width = FALSE) %>%
  add_header_above(c(" " = 1, "sunlight" = 3, " " = 1)) %>%
  column_spec(1:5, width = "5em")
|          |       sunlight        |       |
| site     | sun | partial | shade | Total |
| desert   |  16 |      32 |    71 |   119 |
| mountain |  56 |      36 |    15 |   107 |
| valley   |  42 |      40 |    24 |   106 |
| Total    | 114 |     108 |   110 |   332 |
Throughout this demonstration, we will be using the infer package.
Calculating the observed \(\chi^2\) statistic.
Calculating the observed statistic,
library(infer)  # specify(), hypothesize(), generate(), calculate()
Chisq_hat <- lizard_habitat %>%
  specify(formula = site ~ sunlight) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")
Chisq_hat
## Response: site (factor)
## Explanatory: sunlight (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 68.8
Alternatively, using the observe() wrapper to calculate the observed statistic,
Chisq_hat <- lizard_habitat %>%
  observe(formula = site ~ sunlight, stat = "Chisq")
Chisq_hat
## Response: site (factor)
## Explanatory: sunlight (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 68.8
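For reference, the value above is the usual Pearson chi-squared statistic. The notation below (\(O_{ij}\) for the observed count and \(E_{ij}\) for the expected count in row \(i\), column \(j\)) is introduced here for illustration and is not taken from the exercise itself.

\[\chi^2 = \sum_{i=1}^{3}\sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}\]

As a worked cell, take desert lizards resting in the sun:

\[E_{\text{desert, sun}} = \frac{119 \times 114}{332} \approx 40.86, \qquad \frac{(16 - 40.86)^2}{40.86} \approx 15.13\]

Adding the nine cell contributions computed this way gives \(\chi^2 \approx 68.8\), which agrees with the infer output.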
Randomization/Permutation Method
Then, generating the null distribution using randomizations,
null_dist <- lizard_habitat %>%
  specify(site ~ sunlight) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
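To see roughly what the permutation step is doing, here is a minimal base R sketch. It is not how infer implements generate(); it rebuilds one row per lizard from the published counts, shuffles the sunlight labels, and recomputes the statistic each time. The helper chisq_stat() and the seed are assumptions made for illustration.

```r
# Rebuild one row per lizard from the published site-by-sunlight counts.
counts <- matrix(
  c(16, 32, 71,
    56, 36, 15,
    42, 40, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    site = c("desert", "mountain", "valley"),
    sunlight = c("sun", "partial", "shade")
  )
)
cells <- as.data.frame(as.table(counts))  # columns: site, sunlight, Freq
lizards <- cells[rep(seq_len(nrow(cells)), times = cells$Freq),
                 c("site", "sunlight")]

# Hypothetical helper: Pearson chi-squared statistic for two factors.
chisq_stat <- function(site, sunlight) {
  observed <- table(site, sunlight)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
obs_stat <- chisq_stat(lizards$site, lizards$sunlight)

# Shuffling sunlight breaks any link with site, which mimics the null.
perm_stats <- replicate(
  1000,
  chisq_stat(lizards$site, sample(lizards$sunlight))
)

# Proportion of shuffled statistics at least as large as the observed one.
p_perm <- mean(perm_stats >= obs_stat)
p_perm
```

Because the row and column totals do not change under shuffling, only the cell counts move around, which is why the same expected counts apply to every permuted table.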

Theoretical Method
Before using the theoretical method for the Chi-Squared test for independence, you need to check whether the conditions are met.
Conditions:
- Independent observations
- Large samples: at least 5 expected counts in each cell
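The large-sample condition can be checked directly from the table of counts. The sketch below uses base R only; the object names observed and expected are chosen here for illustration.

```r
# Observed counts copied from the site-by-sunlight table.
observed <- matrix(
  c(16, 32, 71,
    56, 36, 15,
    42, 40, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    site = c("desert", "mountain", "valley"),
    sunlight = c("sun", "partial", "shade")
  )
)

# Expected count under independence: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)

# Large-sample condition: every expected count should be at least 5.
all(expected >= 5)

# Cross-check against the expected counts stored by base R's chisq.test().
all.equal(unname(expected), unname(chisq.test(observed)$expected))
```

The independence condition cannot be checked from the counts alone; it depends on how the lizards were sampled.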
Finding the null distribution with theoretical methods, using the assume() verb,
null_dist_theory <- lizard_habitat %>%
  specify(site ~ sunlight) %>%
  assume(distribution = "Chisq")
Visualizing the observed statistic using the theory-based null distribution,
visualize(null_dist_theory) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Alternatively, visualizing the observed statistic using both of the null distributions,
visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.

Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
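A reported value of 0 only means that none of the 1,000 permuted statistics reached the observed one. One common convention, which is an assumption here and not something infer does for you, is to add one to both the count and the number of reps so that the reported value is never exactly zero. A minimal sketch:

```r
reps <- 1000      # number of permutations used in generate()
n_extreme <- 0    # permuted statistics at least as large as the observed one

# Smallest nonzero p-value that 1,000 permutations can resolve.
resolution <- 1 / reps

# Conservative estimate that counts the observed table as one permutation.
p_adjusted <- (n_extreme + 1) / (reps + 1)

c(resolution = resolution, p_adjusted = p_adjusted)
```

In a write-up, this is usually stated as "p < 0.001" instead of "p = 0".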
Alternatively, using the chisq_test() wrapper to carry out the theory-based test,
lizard_habitat %>%
  chisq_test(formula = site ~ sunlight)
## # A tibble: 1 × 3
## statistic chisq_df p_value
## <dbl> <int> <dbl>
## 1 68.8 4 4.12e-14
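The degrees of freedom and the theory-based p-value in this output can be reproduced by hand. The sketch below uses base R's chisq.test() and pchisq() on the table of counts instead of infer, so the object names are illustrative only.

```r
# Observed counts copied from the site-by-sunlight table.
observed <- matrix(
  c(16, 32, 71,
    56, 36, 15,
    42, 40, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    site = c("desert", "mountain", "valley"),
    sunlight = c("sun", "partial", "shade")
  )
)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Base R's version of the test, used here only to get the statistic.
test <- chisq.test(observed)
stat <- unname(test$statistic)

# Upper-tail area of the chi-squared distribution beyond the statistic.
p_theory <- pchisq(stat, df = df, lower.tail = FALSE)

c(statistic = stat, df = df, p_value = p_theory)
```

The upper tail is used because only large values of the statistic count as evidence against independence, which matches direction = "greater" in the infer calls.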
Since the p-value is close to zero, we reject the null hypothesis and conclude that the data provide strong evidence of an association between sunlight choice and the site on which the lizards prefer to live their best lives.
Ngrams and Collocations
Sorry, I thought I would have time for this, but I am just busy busy lemon squishy. If you are interested in learning about statistical natural language processing using text data, or data analysis in general, I recommend taking the Data Science class in Spring of 2022. The prerequisite for Data Science is Introduction to Probability and Statistics. If you are doing well in Introduction to Probability and Statistics, then you should be okay in Data Science.
Mini Activities
Disaggregating Asian American tobacco use, data.
Understanding cultural differences in tobacco use across different demographic groups can lead to improved health care education and treatment. A recent study disaggregated tobacco use across Asian American ethnic groups including Asian-Indian (n = 4,373), Chinese (n = 4,736), and Filipino (n = 4,912), in comparison to non-Hispanic Whites (n = 275,025). The number of current smokers in each group was reported as Asian-Indian (n = 223), Chinese (n = 279), Filipino (n = 609), and non-Hispanic Whites (n = 50,880). Rao2021
Here is the code that builds the data for the three Asian American groups.
asian_smoke <- tibble(
  # One row per respondent, grouped by ethnicity
  ethnicity = c(
    rep("Asian-Indian", 4373),
    rep("Chinese", 4736),
    rep("Filipino", 4912)
  ),
  # Outcomes listed in the same group order as ethnicity
  outcome = c(
    rep("smoke", 223), rep("don't smoke", 4150),
    rep("smoke", 279), rep("don't smoke", 4457),
    rep("smoke", 609), rep("don't smoke", 4303)
  )
)
Here is the data printed as a table.
asian_smoke %>%
  count(ethnicity, outcome) %>%
  pivot_wider(names_from = outcome, values_from = n) %>%
  adorn_totals(where = c("row", "col")) %>%
  kbl(align = "lrrrr", booktabs = TRUE, format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"),
                latex_options = "HOLD_position",
                full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Smoking" = 2, " " = 1)) %>%
  column_spec(1, width = "7em") %>%
  column_spec(2:4, width = "5em")
|              |       Smoking       |        |
| ethnicity    | don’t smoke | smoke | Total  |
| Asian-Indian |       4,150 |   223 |  4,373 |
| Chinese      |       4,457 |   279 |  4,736 |
| Filipino     |       4,303 |   609 |  4,912 |
| Total        |      12,910 | 1,111 | 14,021 |
- In order to assess whether there is a difference in current smoking rates across the three Asian American ethnic groups, the observed data are compared to the data that would be expected if there were no association between the variables.
What are the null and alternative hypotheses?
Carry out the randomization procedure and the theoretical method for the Chi-Squared test for independence. Plot the resulting distributions.
Compute the p-value using both the randomization and the theoretical method.
What is the conclusion?
