Upon completing today’s lab activity, students should be able to do the following using R and RStudio:
Perform different types of random sampling (simple random, stratified, and clustered) on a data frame.
Perform sampling with or without replacement.
Perform randomization and probability simulations.
library(openintro)
library(tidyverse)
Throughout this section, we are using the county
dataset. Note that this data set is already part of the base R installation.
<- county %>% na.omit()
county glimpse(county)
## Rows: 2,560
## Columns: 15
## $ name <chr> "Autauga County", "Baldwin County", "Barbour County"…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama…
## $ pop2000 <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 36583, 23…
## $ pop2010 <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 34215, 25…
## $ pop2017 <int> 55504, 212628, 25270, 22668, 58013, 10309, 33713, 25…
## $ pop_change <dbl> 1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -1.20, -0.60, …
## $ poverty <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 18.8, 16.1, 19.4…
## $ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 71.4, 77.5, 75.1…
## $ multi_unit <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 8.7, 4.3, 4.4, 3.9, …
## $ unemployment_rate <dbl> 3.86, 3.99, 5.90, 4.39, 4.02, 4.93, 4.08, 4.05, 4.05…
## $ metro <fct> yes, yes, no, yes, yes, no, no, no, yes, no, no, no,…
## $ median_edu <fct> some_college, some_college, hs_diploma, hs_diploma, …
## $ per_capita_income <dbl> 27841.70, 27779.85, 17891.73, 20572.05, 21367.39, 15…
## $ median_hh_income <int> 55317, 52562, 33368, 43404, 47412, 29655, 37342, 400…
## $ smoking_ban <fct> none, none, partial, none, none, none, none, none, n…
Example: Suppose we want to randomly sample 100 observations (rows) from the data frame.
<- 100 # number of samples
n <- sample.int(dim(county)[1],n, replace = FALSE) #pick random row indices from the data frame
rand_index # the replace=FALSE means that we sampling without replacement
<- county %>% slice(rand_index) # select rows using the random indices
simple_random_sample glimpse(simple_random_sample)
## Rows: 100
## Columns: 15
## $ name <chr> "Natchitoches Parish", "Douglas County", "Yazoo Coun…
## $ state <fct> Louisiana, Colorado, Mississippi, Kansas, Kentucky, …
## $ pop2000 <dbl> 39080, 175766, 28149, 4789, 6748, 38543, 170200, 203…
## $ pop2010 <dbl> 39566, 285465, 28065, 4437, 7852, 38520, 172188, 220…
## $ pop2017 <int> 39021, 335299, 27057, 4207, 7523, 37711, 173693, 229…
## $ pop_change <dbl> -0.23, 9.59, -3.35, -3.13, -0.38, -2.12, 0.87, 3.39,…
## $ poverty <dbl> 31.9, 3.6, 36.5, 14.5, 34.4, 17.1, 18.5, 20.5, 29.0,…
## $ homeownership <dbl> 61.4, 82.5, 63.7, 80.9, 79.9, 71.5, 75.7, 61.2, 38.6…
## $ multi_unit <dbl> 16.9, 14.9, 12.9, 5.0, 6.1, 12.1, 14.3, 19.7, 44.2, …
## $ unemployment_rate <dbl> 5.97, 2.39, 6.02, 3.30, 10.19, 7.82, 5.41, 5.74, 4.4…
## $ metro <fct> no, yes, yes, no, no, no, yes, yes, yes, no, no, no,…
## $ median_edu <fct> some_college, bachelors, hs_diploma, some_college, h…
## $ per_capita_income <dbl> 17817.87, 48499.64, 17032.10, 27138.86, 16364.02, 23…
## $ median_hh_income <int> 29001, 111154, 28330, 47121, 29043, 44030, 46077, 46…
## $ smoking_ban <fct> partial, partial, none, none, none, none, none, none…
Example 1: Suppose we want to sample 50% of observations of each state in the data frame.
<- county %>%
strat_sample_frac group_by(state) %>%# group by state (states as strata)
mutate(num_rows=n()) %>% # create new column that counts the number of observations in each strata
sample_frac(0.50, weight=num_rows) %>% # randomly sample 50% of observations in each strata
ungroup
Now, let’s check if the code did it right.
<- dim(county %>% filter(state == "California"))[1] # number of total observations
nto <- dim(strat_sample_frac %>% filter(state == "California"))[1] # number of sample observations
nso
# this value should be close to 0.50
/nto nso
## [1] 0.5
Example 2: Suppose we want to sample exactly 100 observations in each level of the metro
variable.
<- county %>%
stratified_sample group_by(metro) %>%# group by metro (metro - yes or no - as strata)
sample_n(100) %>% # randomly sample 100 observations in each strata
ungroup
stratified_sample
## # A tibble: 200 × 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Cheyenne County Colo… 2231 1836 1845 -1.28 11.8 80.5
## 2 Dunn County Nort… 3600 3536 4289 3.75 10.6 84.9
## 3 Plumas County Cali… 20824 20007 18742 -0.83 13.3 65.6
## 4 Morrill County Nebr… 5440 5042 4836 -1.83 9.4 67.7
## 5 Patrick County Virg… 19407 18490 17665 -3.17 20.2 81.3
## 6 Amador County Cali… 35100 38091 38626 5.43 10.6 77.3
## 7 Red River County Texas 14314 12860 12229 -2.14 16.5 71.2
## 8 Ochiltree County Texas 9006 10223 10073 -5.59 13 69.5
## 9 Monroe County Kent… 11756 10963 10659 -0.4 23.5 76.1
## 10 Greenwood County Sout… 66271 69661 70355 0.75 24.2 69.8
## # … with 190 more rows, and 7 more variables: multi_unit <dbl>,
## # unemployment_rate <dbl>, metro <fct>, median_edu <fct>,
## # per_capita_income <dbl>, median_hh_income <int>, smoking_ban <fct>
Example: Suppose we consider each state as a cluster, and we randomly sample 10 clusters.
<- 10 # number of clusters to randmonly sample
n_clusters <- levels(county$state) # get list of states
states <- sample(states,n_clusters) # sample the states
states_sample
%>% filter(state==states_sample) # get the rows with corresponding states that are sampled. county
## # A tibble: 40 × 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Fairbanks Nor… Alas… 82840 97581 99703 -1.18 7.7 59.8
## 2 Northwest Arc… Alas… 7208 7523 7684 -0.48 25.3 53.7
## 3 Maricopa Coun… Ariz… 3072149 3817117 4307033 7.51 15.7 66.3
## 4 Allen County Kans… 14385 13371 12519 -4.1 14.2 78
## 5 Clark County Kans… 2390 2215 2004 -8.03 8.9 73.6
## 6 Douglas County Kans… 99962 110826 120793 5.45 18.7 53.5
## 7 Gray County Kans… 5904 6006 5958 -0.4 8.5 75.4
## 8 Kearny County Kans… 4531 3977 3960 1.02 13.2 74.1
## 9 McPherson Cou… Kans… 29554 29180 28708 -2.26 7.6 76.2
## 10 Norton County Kans… 5953 5671 5441 -3.08 11.6 68
## # … with 30 more rows, and 7 more variables: multi_unit <dbl>,
## # unemployment_rate <dbl>, metro <fct>, median_edu <fct>,
## # per_capita_income <dbl>, median_hh_income <int>, smoking_ban <fct>
Suppose we have a deck of cards and we put this as a list of strings. First, let’s shuffle the deck.
<- c("AH","2H","3H","4H","5H","6H","7H","8H","9H","10H","JH","QH","KH",
deck_ordered "AD","2D","3D","4D","5D","6D","7D","8D","9D","10D","JD","QD","KD",
"AC","2C","3C","4C","5C","6C","7C","8C","9C","10C","JC","QC","KC",
"AS","2S","3S","4S","5S","6S","7S","8S","9S","10S","JS","QS","KS")
<- sample(deck_ordered)
deck_shuffled
print("ordered deck")
## [1] "ordered deck"
print(deck_ordered)
## [1] "AH" "2H" "3H" "4H" "5H" "6H" "7H" "8H" "9H" "10H" "JH" "QH"
## [13] "KH" "AD" "2D" "3D" "4D" "5D" "6D" "7D" "8D" "9D" "10D" "JD"
## [25] "QD" "KD" "AC" "2C" "3C" "4C" "5C" "6C" "7C" "8C" "9C" "10C"
## [37] "JC" "QC" "KC" "AS" "2S" "3S" "4S" "5S" "6S" "7S" "8S" "9S"
## [49] "10S" "JS" "QS" "KS"
print("shuffled")
## [1] "shuffled"
print(deck_shuffled)
## [1] "3D" "2D" "6D" "2C" "8H" "10S" "5D" "8C" "7H" "5C" "3C" "QD"
## [13] "9S" "AC" "10H" "5H" "KD" "QH" "JS" "7C" "9D" "AS" "4C" "2H"
## [25] "QS" "7D" "6H" "AD" "AH" "KH" "9C" "10C" "JH" "7S" "3S" "6S"
## [37] "4S" "QC" "KS" "6C" "4D" "JD" "2S" "8D" "4H" "8S" "KC" "10D"
## [49] "3H" "JC" "5S" "9H"
Next, we want to sample five cards (five draws) without replacement, which means when we sample each card, we don’t put it back into the deck.
<- sample(deck_ordered,5,replace=FALSE)
five_sample_cards five_sample_cards
## [1] "KC" "8S" "QC" "QH" "5H"
Notice that we used the ordered deck to sample. That is because the sample
function randomly samples from that list. If we use the shuffled deck, we pick the first 5 items from the list.
<- deck_shuffled[1:5]
five_sample_cards five_sample_cards
## [1] "3D" "2D" "6D" "2C" "8H"
Key Point: When sampleing without replacement, you can only sample until the number of cards. For example, since there are 52 cards then you can only sample 52 times.
Consider the deck of cards from the previous section. Now, we will sample from that deck 1000 times with replacement and count how many times does ace of spades show up. We will call each sample or draw to be a trial.
<- 1000 # sample cards 1000 times with replacement
n_trials <- sample(deck_ordered,n_trials,replace=TRUE)
card_samples <- length(which(card_samples == "AS"))
AS_counts <- AS_counts/n_trials
AS_sample_prop <- 1/52
AS_true_prob
cat("observed AS = ", AS_counts, "\n")
## observed AS = 14
cat("sample AS proportion = ", AS_sample_prop, "\n")
## sample AS proportion = 0.014
cat("true probability of AS = ", AS_true_prob, "\n")
## true probability of AS = 0.01923077
For 1000 trials, we observed that there 24 AS cards, which has proportion 0.014. This is close to the true probability of AS which is 0.0192308.
Say we sample the cards 1000000 times.
<- 1000000 # sample cards 1000000 times with replacement
n_trials <- sample(deck_ordered,n_trials,replace=TRUE)
card_samples <- length(which(card_samples == "AS"))
AS_counts <- AS_counts/n_trials
AS_sample_prop <- 1/52
AS_true_prob
cat("observed AS = ", AS_counts, "\n")
## observed AS = 19357
cat("sample AS proportion = ", AS_sample_prop, "\n")
## sample AS proportion = 0.019357
cat("true probability of AS = ", AS_true_prob, "\n")
## true probability of AS = 0.01923077
Notice that we have the As proportion to be close to the true probability of AS with the number of trials to be significantly larger.
Suppose that we simulate flipping two coins 20 times - one fair and one unfair - and we count the number of heads. We repeat the process 1000 times and record the number of heads for each trial.
Note: It is easier to run the following code using an R script.
# fair coin
<- c("H","T")
fair_coin <- c(1/2,1-1/2) # (H,T) true probability of a fair coin
fair_coin_p
# unfair coint
<- c("H","T")
unfair_coin <- c(1/3,1-1/3) # (H,T) true probability of an unfair coin
unfair_coin_p
# number of flips
<- 20 # edit here
n_flips
# number of trials
<- 10000 # edit here
n_trials
# fair coin trials and counting number of head in each trial
<- c()
n_heads_fair for(i in 1:n_trials) {
<- sample(fair_coin,n_flips,fair_coin_p,replace=TRUE)
trial_1 <- c(n_heads_fair,length(trial_1[trial_1 == "H"]))
n_heads_fair
}
# unfair coin trials and counting number of head in each trial
<- c()
n_heads_unfair for(i in 1:n_trials) {
<- sample(unfair_coin,n_flips,unfair_coin_p,replace=TRUE)
trial_2 <- c(n_heads_unfair,length(trial_2[trial_2 == "H"]))
n_heads_unfair
}
# make data frame for plotting
<- c("fair","unfair")
label <- rep(label,each=n_trials)
coin <- c(n_heads_fair,n_heads_unfair)
head_count <- data.frame(coin=coin,head_count=head_count)
df
# plot distribution of number of heads
library(ggplot2)
<- ggplot(data=df, aes(x=head_count)) + geom_histogram(binwidth=0.50, color="black", position="identity", aes(fill=coin)) +
histogram xlab("Heads") + ylab("Frequency") + ggtitle(paste("trials = ",n_trials)) +
geom_vline(aes(xintercept=10),
color="blue", linetype="dashed", size=1)
print(histogram)
The histogram shows two distributions of head counts for each coin. The blue dashed line is the head count of 10. For a fair coin and 20 flips, we expect the mean of the distribution (red) to be close to 10 because the true probability of observing a head is 1/2.
For the unfair coin, we established that the true probability of observing heads is 1/3. So the mean of the distribution (green) is not close to 10 but close to around 6.66.
Let’s go back to the county
data set. Let’s get multiple random samples - sampling without replacement - from the unemployment_rate
variable. Here, each trial will have a number of samples and a number of trials. Each trial will have a set number of samples from the data. For each sample we compute the mean. Then, we plot the distribution of the means.
Note: It is easier to run the following code using an R script.
# sample size per trial
<- 100 # try smaller numbers first and look at the resulting histogram
sample_size
# number of trials
<- 10000 # try smaller numbers first and look at the resulting histogram
n_trials
# sampling loop
<- c()
means for(i in 1:n_trials) {
<- sample.int(dim(county)[1],sample_size)
rand_index <- county %>% na.omit %>% select(unemployment_rate) %>% slice(rand_index) %>% sapply(mean)
samp <- c(means,samp[[1]])
means
}
<- data.frame(mean=means)
df_means
# plot distribution of means
library(ggplot2)
<- ggplot(data=df_means, aes(x=means)) + geom_histogram(bins=30) +
histogram xlab("unemployment Means") + ylab("Frequency") + ggtitle(paste("sample size = ",sample_size," trials = ",n_trials))
print(histogram)
We see in the histogram that - for large enough sample size - the sampling distribution of the means give a bell shaped curve, which is also called the normal distribution. This phenomenon is called the central limit theorem.
The Central Limit Theorem says that regardless of the underlying distribution, the sampling distribution of the mean of any independent, random variable will be normal or near normal. We get a lovely bell shaped curve if the sample size is large enough. We will discuss this theory in more detail in the next two lectures.
iris
data set.Use the iris
data set to perform the following sampling procedures.
Sample 10 observations using simple random sampling.
Using the species as strata, sampling 30% of observation from each species.
Using the species as clusters, sample one species.
Using the Petal.Length
variable, perform a sampling procedure where you sample 10 observations and compute its mean. Repeat this procedure for 1000 trials while recording the means. Plot the distribution of the means. What is the shape of the distribution?
Repeat the procedure from problem 4 but with 100 observations per sample and 2000 trials. Plot the distribution of the means. Is the distribution much more refined from the previous plot?
Download the R sript sampling-from-urns.R
and use the source
command to initially run the R script. Then, answer the following questions.
Provide your initial output of the R script. Describe what the output shows you.
Provide output from your code (the figures comparing both probabilities) for at least these two different scenarios:
numTrials = 10^2
numTrials = 10^4