Alex John Quijano
09/24/2021
Last session, we discussed:
Discrete random variables
The Bernoulli and the Binomial probability mass functions (pmf)
A light introduction to the law of large numbers
In this lecture, we will learn about:
More on continuous random variables
A light introduction to the central limit theorem
An example of hypothesis testing
A Random Variable (R.V.) is a function that associates a numerical value with each potential outcome of an experiment.
Random Variables are denoted by upper case letters (\(Y\))
Individual outcomes of an R.V. are denoted by lower case letters (\(y\))
Sample space: \[\Omega = \{\{H,H,H\},\{H,H,T\},\{H,T,T\},\{T,T,T\},\\ \{T,T,H\},\{T,H,H\},\{H,T,H\},\{T,H,T\}\}\]
Let the random variable \(X\) be the number of heads in this experiment, \[X = \{x_1, x_2, x_3, x_4\} = \{0, 1, 2, 3\}\]
Example probabilities (assuming fair coins): \[P(X = 0) = \frac{1}{8}, \qquad P(X = 2) = \frac{3}{8}\] (A brute-force check is sketched below.)
Here, \(\Omega\) is a discrete sample space and \(X\) is a discrete R.V.
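A minimal sketch of that brute-force check, assuming three fair coins (the labels and loop below are just for illustration):

```python
from itertools import product
from collections import Counter

# Enumerate the sample space of three fair coin tosses: 2^3 = 8 equally likely outcomes.
omega = list(product("HT", repeat=3))

# The random variable X maps each outcome to its number of heads.
counts = Counter(outcome.count("H") for outcome in omega)

# P(X = x) = (number of outcomes with x heads) / (total number of outcomes)
for x in sorted(counts):
    print(f"P(X = {x}) = {counts[x]}/{len(omega)} = {counts[x] / len(omega):.3f}")
```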
Example (discrete R.V.s):
Number of eggs a hen lays.
Number of wins in a game.
Number of people waiting in line at a restaurant.
Number of people at a hospital’s ICUs in a given day.
Number of times a car broke down in a given week.
Number of people who passed an exam.
Number of cats visiting your front porch in a given day. (does the cat have to be the same cat?)
Number of spiders showing up in your house in a given day. (how to know if you saw the same spider?)
Example (continuous R.V.s):
Wait times at a restaurant or a bus stop.
Length of time in a queue. (e.g. the amount of time waiting in line)
Heights of people in a given population.
Quantity of sugar in iced drinks at Starbucks.
Weight of people in the age range 20-30 years old.
Percentage scores of exams.
The length of hairs on a cat.
Length of time for a spider to build a web.
Here is a histogram of the unemployment rates of US counties.
In this example, we are viewing the continuous numerical variable using a histogram. Recall that a histogram divides the entire range of values into intervals and counts how many observations fall into each interval.
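A small sketch of what a histogram does under the hood. The values in `rates` are simulated stand-ins, not the actual county unemployment data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: simulated "unemployment rates" (the real county data are not reproduced here).
rates = rng.gamma(shape=8.0, scale=0.7, size=3000)

# A histogram divides the range of values into intervals (bins)
# and counts how many observations fall into each one.
counts, bin_edges = np.histogram(rates, bins=10)
for left, right, c in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:5.2f}, {right:5.2f})  {c} counties")
```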
Let \(X\) be a continuous R.V. of the unemployment rates of counties.
We can define a theoretical probability density function (pdf) that maps the unemployment rate value to a probability.
FYI: This particular PDF example is the Gamma PDF.
Using the PDF, we can compute probabilities of unemployment rates.
The height of the density at exactly 5.75 is \[f(5.75) = 0.162\] Note that for a continuous R.V., the probability of any single exact value is zero, \(P(X = 5.75) = 0\); only intervals of values have nonzero probability.
Using the PDF, we can compute probabilities of unemployment rates.
The probability that the unemployment rate is at least 5.75 is \[P(X \ge 5.75) = 0.211\]
Note: We are taking the “area under the curve” over the interval \([5.75, \infty)\) to compute this probability.
Using the PDF, we can compute probabilities of unemployment rates.
The probability that the unemployment rate is at most 5.75 is \[P(X \le 5.75) = 0.789\]
Note: We are taking the “area under the curve” over the interval \([0, 5.75]\) to compute this probability.
Using the PDF, we can compute probabilities of unemployment rates.
The probability that the unemployment rate is between 2.75 and 5.75 is \[P(2.75 \le X \le 5.75) = 0.693\]
Note: We are taking the “area under the curve” over the interval \([2.75, 5.75]\) to compute this probability.
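A sketch of how such probabilities can be computed from a gamma PDF. The shape and scale parameters below are illustrative guesses, not the actual fit behind the figures, so the printed numbers will differ somewhat from 0.211, 0.789, and 0.693:

```python
from scipy import stats

# Illustrative gamma parameters (assumed, not the actual fit to the county data).
shape, scale = 8.0, 0.7
X = stats.gamma(a=shape, scale=scale)

# Height of the density at exactly 5.75 (NOT a probability for a continuous R.V.).
print("f(5.75)              =", X.pdf(5.75))

# "Area under the curve" probabilities:
print("P(X >= 5.75)         =", X.sf(5.75))    # survival function = 1 - CDF
print("P(X <= 5.75)         =", X.cdf(5.75))   # cumulative distribution function
print("P(2.75 <= X <= 5.75) =", X.cdf(5.75) - X.cdf(2.75))
```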
So, here we are, back to the unemployment rate histogram.
The mean of the unemployment rates is shown as a dashed line.
Image Source: Bootstrapping Statistics.
Suppose that we take random samples from the unemployment rate data.
Let \(\{X_1, X_2, \cdots, X_n\}\) be a random sample of size \(n\).
We are interested in the sample mean \[\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}\]
Image Source: Bootstrapping Statistics by Trist’n Joseph.
Suppose that we have \(m\) independent trials. That is, we sample (with replacement) from the original sample and compute the mean of each of those resamples (simulating the sampling process).
Note that, ordinarily, we sample from the population to make inferences about the population; here we only have the one original sample, so we resample from it instead.
The method we are about to use is called bootstrapping: a resampling method for estimating the sampling distribution of the mean.
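A minimal sketch of the bootstrap procedure just described. Here `original_sample` stands in for the one sample we actually observed (the values are simulated for illustration) and `m` is the number of resamples:

```python
import numpy as np

rng = np.random.default_rng(42)

# The one sample we actually observed (stand-in values for illustration).
original_sample = rng.gamma(shape=8.0, scale=0.7, size=200)
n = len(original_sample)

# Bootstrap: resample n values WITH replacement from the original sample,
# m independent times, and record the mean of each resample.
m = 10_000
boot_means = np.array([
    rng.choice(original_sample, size=n, replace=True).mean()
    for _ in range(m)
])

# The histogram of boot_means approximates the sampling distribution of the mean.
print("original sample mean:", original_sample.mean())
print("bootstrap mean / SE: ", boot_means.mean(), boot_means.std(ddof=1))
```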
Original Sample
Resamples - Distribution of Means
We see in the histogram that, for a large enough sample size (the law of large numbers), the sampling distribution of the means takes the shape of a bell curve, which is also called the normal distribution.
This phenomenon is called the central limit theorem.
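Stated a little more formally (this is the standard textbook form, not a result specific to the unemployment data): if \(X_1, X_2, \cdots, X_n\) are independent draws from a population with mean \(\mu\) and finite standard deviation \(\sigma\), then for large \(n\) \[\bar{X}_n \approx N\!\left(\mu, \frac{\sigma^2}{n}\right)\] that is, the sampling distribution of the mean is approximately normal, regardless of the shape of the population distribution.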
Resamples - Distribution of Means
Suppose that you want to know the difference in mean scores for two exams A and B.
Construct hypotheses to evaluate whether the observed difference in the sample means is likely to have happened due to chance if the null hypothesis is true.
The null hypothesis is that there is no difference.
The randomization distribution allows us to identify whether a difference of 3.1 points is more extreme than one would expect from the natural variability of the scores.
Also, by bootstrapping we can compute a confidence interval (CI) for the difference in means. A CI gives a range of plausible values for the population parameter; for example, a 95% CI is constructed so that, across repeated samples, about 95% of such intervals would capture the true parameter value.
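A sketch of both ideas with made-up exam scores. The arrays `scores_A` and `scores_B` are hypothetical, not the class data, so the observed difference will not be exactly 3.1 points:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical exam scores (NOT real class data).
scores_A = np.array([78, 85, 91, 72, 88, 95, 81, 76, 90, 84], dtype=float)
scores_B = np.array([74, 80, 86, 70, 83, 92, 77, 73, 85, 79], dtype=float)
observed_diff = scores_A.mean() - scores_B.mean()

# Randomization (permutation) test: under the null hypothesis of "no difference",
# the A/B labels are arbitrary, so shuffle them and recompute the difference.
pooled = np.concatenate([scores_A, scores_B])
n_A = len(scores_A)

def shuffled_diff():
    p = rng.permutation(pooled)
    return p[:n_A].mean() - p[n_A:].mean()

null_diffs = np.array([shuffled_diff() for _ in range(10_000)])
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))

# Bootstrap CI: resample each group with replacement and recompute the difference.
boot_diffs = np.array([
    rng.choice(scores_A, size=n_A, replace=True).mean()
    - rng.choice(scores_B, size=len(scores_B), replace=True).mean()
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print(f"observed difference: {observed_diff:.2f}")
print(f"two-sided p-value:   {p_value:.3f}")
print(f"95% bootstrap CI:    ({ci_low:.2f}, {ci_high:.2f})")
```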
In this lecture, we talked about the following:
More details on continuous random variables
A light introduction to the central limit theorem
In the next lecture, we will talk about:
An introduction to hypothesis testing
Formulating/Identifying the null and alternative hypotheses
The idea of a randomization test.