Learning Objectives


Upon completing today’s lab activity, students should be able to do the following using R and RStudio:

  1. Use data to formulate a sampling distribution of an estimate.

  2. Perform the basics of randomization for hypothesis testing.


library(tidyverse)
library(openintro)
library(infer)
library(ggplot2)
library(gghighlight)


Case Study - Gender Discrimination

Preview the gender_discrimination data, which is part of the openintro package and is available once that package is loaded.

glimpse(gender_discrimination)
## Rows: 48
## Columns: 2
## $ gender   <fct> male, male, male, male, male, male, male, male, male, male, m…
## $ decision <fct> promoted, promoted, promoted, promoted, promoted, promoted, p…

The dataset has 48 rows and 2 columns. The variables are gender and decision, and both are nominal categorical variables. Next, we summarize the data.

Data Summarization

gender_discrimination_prop <- gender_discrimination %>%
  count(gender, decision) %>%
  group_by(gender) %>%
  mutate(prop = n / sum(n))
gender_discrimination_prop
## # A tibble: 4 × 4
## # Groups:   gender [2]
##   gender decision         n  prop
##   <fct>  <fct>        <int> <dbl>
## 1 male   promoted        21 0.875
## 2 male   not promoted     3 0.125
## 3 female promoted        14 0.583
## 4 female not promoted    10 0.417
p_male <- gender_discrimination_prop %>%
  filter(gender == "male", decision == "promoted") %>%
  pull(prop)

p_female <- gender_discrimination_prop %>%
  filter(gender == "female", decision == "promoted") %>%
  pull(prop)

p_diff <- p_male - p_female # difference in proportion
paste("Proportion of promoted males: ",p_male)
## [1] "Proportion of promoted males:  0.875"
paste("Proportion of promoted females: ",p_female)
## [1] "Proportion of promoted females:  0.583333333333333"
paste("Proportion difference: ",p_diff)
## [1] "Proportion difference:  0.291666666666667"

Once the data are summarized, the gap in promotions shows up in the rates: 58.3% (0.583) of the women were promoted, whereas 87.5% (0.875) of the men were promoted. The important statistical question to ask after looking at the data is as follows: is it plausible to observe such a difference in proportions in a scenario where men and women are equally likely to be promoted?

You have seen all the pieces of the summarization code before, but let’s review them. First, we count the number of observations for each combination of gender (male, female) and decision (promoted, not promoted) using the count() function. Next, we group these counts by gender, so that mutate() divides each count by the total for that gender. The new variable prop is therefore the proportion of observations within each gender that did or did not receive a promotion.

The results show that 0.875 of the males were promoted compared with 0.583 of the females, a difference in proportions of about 0.292.
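As a cross-check on the denominators, the same row proportions can be sketched in base R. This is a minimal sketch that re-enters the four counts from the summary table by hand rather than reading gender_discrimination; the object names counts and row_props are illustrative and do not appear elsewhere in this lab.

```r
# Counts copied by hand from the summary table (rows: gender, columns: decision)
counts <- matrix(
  c(21,  3,
    14, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender   = c("male", "female"),
    decision = c("promoted", "not promoted")
  )
)

# margin = 1 divides each cell by its row total, so the denominator is the
# number of people of that gender (24 in each row)
row_props <- prop.table(counts, margin = 1)
row_props

# Difference in promotion rates (male - female)
diff_check <- row_props["male", "promoted"] - row_props["female", "promoted"]
diff_check
```

Each row of row_props sums to 1, which is a quick way to confirm which total was used as the denominator.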

Permutation

To help you understand the code used to create the randomization distribution, this section will walk you through the steps of the infer framework. In particular, you’ll see how differences in the generated replicates affect the calculated statistics.

For simplicity, we’ll keep our permutation to just 5 replicates – in reality we would want this value to be much larger.

In the code chunk below,

  • we start with our data frame: gender_discrimination,

  • then we specify our model where decision is the response variable and gender is the explanatory (grouping) variable, and we note that we’re calling "promoted" a success,

  • then we set our null hypothesis as "independence" (no gender discrimination), and

  • finally we permute 5 times under the specification we outlined so far.

gender_discrimination %>%
  specify(decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5, type = "permute")
## Response: decision (factor)
## Explanatory: gender (factor)
## Null Hypothesis: independence
## # A tibble: 240 × 3
## # Groups:   replicate [5]
##    decision     gender replicate
##    <fct>        <fct>      <int>
##  1 promoted     male           1
##  2 promoted     male           1
##  3 promoted     male           1
##  4 promoted     male           1
##  5 promoted     male           1
##  6 not promoted male           1
##  7 promoted     male           1
##  8 promoted     male           1
##  9 promoted     male           1
## 10 promoted     male           1
## # … with 230 more rows

The resulting data frame has 240 rows: 48 observations per replicate (just like in the original gender_discrimination data) * 5 replicates.
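To see what a single permutation does, here is a base R sketch of one shuffle. It assumes the 48 observations can be rebuilt from the counts in the summary table (21 of 24 males and 14 of 24 females promoted). The helper prop_diff, the object names, and the seed are illustrative, and this is not infer's internal code.

```r
# Rebuild the 48 observations from the summary counts (an assumption of this sketch)
gender   <- rep(c("male", "female"), each = 24)
decision <- c(rep(c("promoted", "not promoted"), times = c(21, 3)),   # males
              rep(c("promoted", "not promoted"), times = c(14, 10)))  # females

# Difference in promotion rates (male - female) for a given set of decisions
prop_diff <- function(decision, gender) {
  mean(decision[gender == "male"] == "promoted") -
    mean(decision[gender == "female"] == "promoted")
}

observed <- prop_diff(decision, gender)

set.seed(1)                   # arbitrary seed, only for reproducibility
shuffled <- sample(decision)  # reassign the 48 decisions to the files at random

observed                      # difference in the real data
prop_diff(shuffled, gender)   # one difference from a world with no gender effect
table(gender, shuffled)       # still 24 per gender and 35 promotions overall
```

Shuffling keeps the number of promotions and the number of files per gender fixed; only the pairing between the two changes, which is what the independence null describes.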

Distribution of statistics

Next, we generate multiple shuffles and compute the difference in proportions for each one. Then, we plot the distribution of these differences. Here we use the difference for simplicity, but you could use the ratio of proportions as well.

set.seed(35) # set seed for replicability
shuffles <- gender_discrimination %>%
  specify(decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props", order = c("male", "female"))
shuffles %>%
  ggplot(aes(x = stat)) +
  geom_dotplot(binwidth = 0.015) +
  gghighlight(stat >= 0.292) +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank()
  ) +
  labs(
    x = "Differences in promotion rates (male - female) across 100 shuffles",
    y = NULL
  )

This plot shows the distribution of the differences in proportions, where \(\frac{2}{100}\) of the shuffles (shown as black dots) have a difference in proportions larger than our observed statistic.

We can modify the above code to increase the number of shuffles to 1000 and use a histogram to show the distribution of the differences in proportions.

set.seed(35) # set seed for replicability
shuffles_more <- gender_discrimination %>%
  specify(decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("male", "female"))
shuffles_more %>% ggplot(aes(x = stat)) +
  geom_histogram(bins=20) +
  gghighlight(stat >= 0.292) +
  labs(
    x = "Differences in promotion rates (male - female) across 1000 shuffles",
    y = "Frequency"
  )

In the histogram, the black bars indicate shuffles with values greater than our observed statistic.
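The shaded share of the distribution can also be computed as a number. The following is a minimal base R sketch, not the infer pipeline used above: it rebuilds the 48 observations from the summary counts and repeats the shuffle 1000 times. The names diff_in_props, null_diffs, and p_value are illustrative, and because the random draws differ from infer's, the result will not match the plots exactly.

```r
# Rebuild the data from the summary counts (TRUE = promoted)
gender   <- rep(c("male", "female"), each = 24)
promoted <- c(rep(c(TRUE, FALSE), times = c(21, 3)),   # males
              rep(c(TRUE, FALSE), times = c(14, 10)))  # females

# Difference in promotion rates (male - female)
diff_in_props <- function(promoted, gender) {
  mean(promoted[gender == "male"]) - mean(promoted[gender == "female"])
}

obs_diff <- diff_in_props(promoted, gender)

set.seed(35)
# 1000 differences simulated under the independence null
null_diffs <- replicate(1000, diff_in_props(sample(promoted), gender))

# Share of shuffles at least as large as the observed difference;
# the small tolerance guards against floating-point ties
p_value <- mean(null_diffs >= obs_diff - 1e-9)
p_value
```

Comparing against the unrounded observed difference, rather than a rounded cutoff, ensures that shuffles exactly equal to the observed value are counted as well.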

Null and Alternative Hypothesis

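Let \(p_M\) and \(p_F\) be the true probabilities of promotion for male and female candidates. The sample proportions \(\widehat{p}_M\) and \(\widehat{p}_F\) of promoted males and females computed earlier are estimates of these quantities.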

  • \(H_0:\) Null hypothesis. The variables gender and decision are independent. They have no relationship, and the observed difference between the proportion of males and females who were promoted, 0.2916667, was due to the natural variability inherent in the population. \[p_M - p_F = 0\]
  • \(H_A:\) Alternative hypothesis. The variables gender and decision are not independent. The difference in promotion rates of 0.2916667 was not due to natural variability, and equally qualified female personnel are less likely to be promoted than male personnel. \[p_M - p_F > 0\]

In the 100 shuffles example, we determined that, under the null hypothesis, there was only a \(\approx\) 2% probability of obtaining a sample in which the promotion rate for male candidates exceeds the rate for female candidates by \(\geq\) 29.2 percentage points.
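As a rough cross-check on the simulated 2%, the tail probability can also be computed without simulation. This sketch goes beyond the lab's randomization approach: it assumes the totals are held fixed (24 male files, 24 female files, 35 promotions), in which case the number of promotions landing on male files follows a hypergeometric distribution. The variable names are illustrative.

```r
# Fixed totals taken from the summary table
n_promoted        <- 35  # promotions in total (21 + 14)
n_not_promoted    <- 13  # non-promotions in total (3 + 10)
n_male            <- 24  # files labelled "male"
obs_male_promoted <- 21  # promotions observed among male files

# A difference of at least 7/24 (about 0.292) happens exactly when
# 21 or more of the 35 promotions fall on male files.
# phyper(q, ..., lower.tail = FALSE) returns P(X > q), so use q = 20.
exact_p <- phyper(obs_male_promoted - 1, n_promoted, n_not_promoted, n_male,
                  lower.tail = FALSE)
exact_p  # about 0.0245
```

The value is close to the simulated proportion, which suggests that the shuffles are approximating the right quantity.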

After a more formal analysis (skipped here; we will cover the details later), we can conclude that the data provide strong evidence of sex discrimination against female candidates by the male supervisors. In this case, we reject the null hypothesis in favor of the alternative.

When we simulated a world with no sex discrimination, we showed that the expected difference is 0 and that the observed statistic is a “rare” occurrence: there is only a \(\approx\) 2% probability of obtaining a sample in which the promotion rate for male candidates is at least 29.2 percentage points higher than the rate for female candidates.


Lab Exercises


Note: You must include your code and results to answer the exercise problems.


I. Gender Discrimination

As the first step of any analysis, you should look at and summarize the data. Categorical variables are often summarized using proportions, and it is always important to understand the denominator of the proportion.

The discrimination study data are available as gender_discrimination using the openintro package.

  1. Using the count() function, tabulate the variables gender and decision. Group the data by gender. Calculate the proportion of those who were and were not promoted in each gender and call this variable prop_row. Print out your results.

  2. Using the count() function, tabulate the variables gender and decision. Group the data by decision. Calculate the proportion of males and females within each decision and call this variable prop_col. Print out your results.

  3. From your calculations in the previous two points, what are the numerator and denominator for each proportion?

II. Opportunity Cost

One-hundred and fifty students were recruited for the study, and each was given the following statement:

“Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99. What would you do in this situation? Please circle one of the options below.”

Half of the 150 students were randomized into a control group and were given the following two options:

  1. Buy this entertaining video.

  2. Not buy this entertaining video.

The remaining 75 students were placed in the treatment group, and they saw a slightly modified second option:

  1. Buy this entertaining video.

  2. Not buy this entertaining video. Keep the $14.99 for other purchases.

Would the extra statement reminding students of an obvious fact impact the purchasing decision?

In this exercise, use the opportunity_cost data set, which is available through the openintro package. Answer the following questions.

  1. How many rows and columns does this data set have? What are the variables? If you have categorical variables, state the levels. What type of study is this particular example?

  2. Using the count() function, tabulate the variables group and decision. Group the data by group. Calculate the proportion of those who did and did not buy the video in each group and call this variable prop_row. Print out your results.

  3. Suppose a success in this study is a student who chooses not to buy the video. Construct a point estimate for this difference as \(\widehat{p}_T - \widehat{p}_C\) (T for treatment and C for control). State the null and alternative hypotheses.

  4. Perform a randomization test (also known as a permutation test). Use at least 1000 shuffles and plot the distribution of the difference in proportions using a histogram. What can you conclude based on your results? Can we make a causal statement in this study?


