Alex John Quijano
09/24/2021
Last session, we discussed:
Discrete random variables
The Bernoulli and the Binomial probability mass functions (pmf)
A light introduction to the law of large numbers
In this lecture, we will learn about:
More on continuous random variables
A light introduction to the central limit theorem
An example of hypothesis testing
A Random Variable (R.V.) is a function that associates a numerical value with each potential outcome of an experiment.
Random Variables are denoted by upper case letters (\(Y\))
Individual outcomes of an R.V. are denoted by lower case letters (\(y\))
Sample space: \[\Omega = \{\{H,H,H\},\{H,H,T\},\{H,T,T\},\{T,T,T\},\\ \{T,T,H\},\{T,H,H\},\{H,T,H\},\{T,H,T\}\}\]
Let the random variable \(X\) be the number of heads in this experiment, \[X = \{x_1, x_2, x_3, x_4\} = \{0, 1, 2, 3\}\]
Example probabilities (assuming fair coins): \[P(X = 0) = \frac{1}{8}, \qquad P(X = 2) = \frac{3}{8}\] (A brute-force check is sketched below.)
Here, \(\Omega\) is a discrete sample space and \(X\) is a discrete R.V.
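A minimal sketch of that brute-force check, assuming three fair coins (the labels and loop below are just for illustration):

```python
from itertools import product
from collections import Counter

# Enumerate the sample space of three fair coin tosses: 2^3 = 8 equally likely outcomes.
omega = list(product("HT", repeat=3))

# The random variable X maps each outcome to its number of heads.
counts = Counter(outcome.count("H") for outcome in omega)

# P(X = x) = (number of outcomes with x heads) / (total number of outcomes)
for x in sorted(counts):
    print(f"P(X = {x}) = {counts[x]}/{len(omega)} = {counts[x] / len(omega):.3f}")
```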
Example (discrete R.V.s):
Number of eggs a hen lays.
Number of wins in a game.
Number of people waiting in line at a restaurant.
Number of people at a hospital’s ICUs in a given day.
Number of times a car broke down in a given week.
Number of people who passed an exam.
Number of cats visiting your front porch in a given day. (does the cat have to be the same cat?)
Number of spiders showing up in your house in a given day. (how to know if you saw the same spider?)
Example (continuous R.V.s):
Wait times at a restaurant or a bus stop.
Length of time in a queue. (e.g. the amount of time waiting in line)
Heights of people in a given population.
Quantity of sugar in iced drinks at Starbucks.
Weight of people in the age range 20-30 years old.
Percentage scores of exams.
The length of hairs on a cat.
Length of time for a spider to build a web.
Here is a histogram of the unemployment rates of US counties.
In this example, we are viewing the continuous numerical variable using a histogram. Recall that a histogram divides the entire range of values into intervals and counts how many observations fall into each interval.
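A small sketch of what a histogram does under the hood. The values in `rates` are simulated stand-ins, not the actual county unemployment data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: simulated "unemployment rates" (the real county data are not reproduced here).
rates = rng.gamma(shape=8.0, scale=0.7, size=3000)

# A histogram divides the range of values into intervals (bins)
# and counts how many observations fall into each one.
counts, bin_edges = np.histogram(rates, bins=10)
for left, right, c in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:5.2f}, {right:5.2f})  {c} counties")
```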
Let \(X\) be a continuous R.V. of the unemployment rates of counties.
We can define a theoretical probability density function (pdf) that maps the unemployment rate value to a probability.
FYI: This particular PDF example is the Gamma PDF.
Using the PDF, we can compute probabilities of unemployment rates.
The height of the density at exactly 5.75 is \[f(5.75) = 0.162\] Note that for a continuous R.V., the probability of any single exact value is zero, \(P(X = 5.75) = 0\); only intervals of values have nonzero probability.
Using the PDF, we can compute probabilities of unemployment rates.
The probability that the unemployment rate is at least 5.75 is \[P(X \ge 5.75) = 0.211\]
Note: We are taking the “area under the curve” over the interval \([5.75, \infty)\) to compute this probability.
Using the PDF, we can compute probabilities of unemployment rates.
The probability that the unemployment rate is at most 5.75 is \[P(X \le 5.75) = 0.789\]
Note: We are taking the “area under the curve” over the interval \([0, 5.75]\) to compute this probability.
Using the PDF, we can compute probabilities of unemployment rates.
The probability that the unemployment rate is between 2.75 and 5.75 is \[P(2.75 \le X \le 5.75) = 0.693\]
Note: We are taking the “area under the curve” over the interval \([2.75, 5.75]\) to compute this probability.
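A sketch of how such probabilities can be computed from a gamma PDF. The shape and scale parameters below are illustrative guesses, not the actual fit behind the figures, so the printed numbers will differ somewhat from 0.211, 0.789, and 0.693:

```python
from scipy import stats

# Illustrative gamma parameters (assumed, not the actual fit to the county data).
shape, scale = 8.0, 0.7
X = stats.gamma(a=shape, scale=scale)

# Height of the density at exactly 5.75 (NOT a probability for a continuous R.V.).
print("f(5.75)              =", X.pdf(5.75))

# "Area under the curve" probabilities:
print("P(X >= 5.75)         =", X.sf(5.75))    # survival function = 1 - CDF
print("P(X <= 5.75)         =", X.cdf(5.75))   # cumulative distribution function
print("P(2.75 <= X <= 5.75) =", X.cdf(5.75) - X.cdf(2.75))
```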
So, here we are, back to the unemployment rate histogram.
The mean of the unemployment rates is shown as a dashed line.
Image Source: Bootstrapping Statistics.
Suppose that we take random samples from the unemployment rate data.
Let \(\{X_1, X_2, \cdots, X_n\}\) be a random sample of size \(n\).
We are interested in the sample mean \[\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}\]
Image Source: Bootstrapping Statistics by Trist’n Joseph.
Suppose that we have \(m\) independent trials. That is, we sample (with replacement) from the original sample and compute the mean of each of those resamples (simulating the sampling process).
Note that, ordinarily, we sample from the population to make inferences about the population; here we only have the one original sample, so we resample from it instead.
The method we are about to use is called bootstrapping: a resampling method for estimating the sampling distribution of the mean.
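A minimal sketch of the bootstrap procedure just described. Here `original_sample` stands in for the one sample we actually observed (the values are simulated for illustration) and `m` is the number of resamples:

```python
import numpy as np

rng = np.random.default_rng(42)

# The one sample we actually observed (stand-in values for illustration).
original_sample = rng.gamma(shape=8.0, scale=0.7, size=200)
n = len(original_sample)

# Bootstrap: resample n values WITH replacement from the original sample,
# m independent times, and record the mean of each resample.
m = 10_000
boot_means = np.array([
    rng.choice(original_sample, size=n, replace=True).mean()
    for _ in range(m)
])

# The histogram of boot_means approximates the sampling distribution of the mean.
print("original sample mean:", original_sample.mean())
print("bootstrap mean / SE: ", boot_means.mean(), boot_means.std(ddof=1))
```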
Original Sample
Resamples - Distribution of Means
We see in the histogram that, for a large enough sample size (the law of large numbers), the sampling distribution of the means takes the shape of a bell curve, which is also called the normal distribution.
This phenomenon is called the central limit theorem.
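Stated a little more formally (this is the standard textbook form, not a result specific to the unemployment data): if \(X_1, X_2, \cdots, X_n\) are independent draws from a population with mean \(\mu\) and finite standard deviation \(\sigma\), then for large \(n\) \[\bar{X}_n \approx N\!\left(\mu, \frac{\sigma^2}{n}\right)\] that is, the sampling distribution of the mean is approximately normal, regardless of the shape of the population distribution.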
Resamples - Distribution of Means
Suppose that you want to know the difference in mean scores for two exams A and B.
Construct hypotheses to evaluate whether the observed difference in the sample means is likely to have happened due to chance if the null hypothesis is true.
The null hypothesis is that there is no difference.
The randomization distribution allows us to identify whether a difference of 3.1 points is more extreme than one would expect from the natural variability of the scores.
Also, by bootstrapping we can compute a confidence interval (CI) for the difference in means. A CI gives a range of plausible values for the population parameter; for example, a 95% CI is constructed so that, across repeated samples, about 95% of such intervals would capture the true parameter value.
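A sketch of both ideas with made-up exam scores. The arrays `scores_A` and `scores_B` are hypothetical, not the class data, so the observed difference will not be exactly 3.1 points:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical exam scores (NOT real class data).
scores_A = np.array([78, 85, 91, 72, 88, 95, 81, 76, 90, 84], dtype=float)
scores_B = np.array([74, 80, 86, 70, 83, 92, 77, 73, 85, 79], dtype=float)
observed_diff = scores_A.mean() - scores_B.mean()

# Randomization (permutation) test: under the null hypothesis of "no difference",
# the A/B labels are arbitrary, so shuffle them and recompute the difference.
pooled = np.concatenate([scores_A, scores_B])
n_A = len(scores_A)

def shuffled_diff():
    p = rng.permutation(pooled)
    return p[:n_A].mean() - p[n_A:].mean()

null_diffs = np.array([shuffled_diff() for _ in range(10_000)])
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))

# Bootstrap CI: resample each group with replacement and recompute the difference.
boot_diffs = np.array([
    rng.choice(scores_A, size=n_A, replace=True).mean()
    - rng.choice(scores_B, size=len(scores_B), replace=True).mean()
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print(f"observed difference: {observed_diff:.2f}")
print(f"two-sided p-value:   {p_value:.3f}")
print(f"95% bootstrap CI:    ({ci_low:.2f}, {ci_high:.2f})")
```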
In this lecture, we talked about the following:
More details on continuous random variables
A light introduction to the central limit theorem
In the next lecture, we will talk about:
An introduction to hypothesis testing
Formulating/Identifying the null and alternative hypotheses
The idea of a randomization test.