Previously on Visualizing Categorical Data…

Using contingency tables to compute proportions and percentages.
The three types of bar plots (stacked, dodged, and standardized) and their advantages and disadvantages.
The advantages and disadvantages of mosaic plots, pie charts, and waffle charts.

Previously on Data…

Source: [Fig. 1.1 of OpenIntro: Introduction to Modern Statistics](https://openintro-ims.netlify.app/data-hello.html#variable-types)

Visualizing Numerical Data

Exploring numerical data using sample statistics and visualizations.

Mean, Variance, and Standard Deviation
Quartiles, Median, and the InterQuartile Range (IQR)
Dot and Scatter Plots
Histograms
Boxplots
Comparing numerical data across categories/levels.

Two Variable Scatter Plots

A two-variable scatterplot provides a case-by-case view of data for two numerical variables.

Example 1:

A scatterplot of loan amount versus total income for the loan50 dataset.

Example 2:

A scatterplot of the median household income against the poverty rate for the county dataset. Data are from 2017. A statistical model has also been fit to the data and is shown as a dashed line.

One Variable Scatter Plots

A dot plot is a one-variable scatterplot; an example using the interest rate of 50 loans from the loan50 dataset is shown below. Note that the loan50 dataset is a subset of 50 rows sampled from the loans_full_schema dataset.

Key points:

It shows the exact value of each observation.
It is useful for small datasets.

A dot plot of interest rate for the loan50 dataset. The rates have been rounded to the nearest percent in this plot, and the distribution’s mean is shown as a red triangle.
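As a rough sketch of how such a plot is built, the snippet below rounds each value to the nearest percent and stacks one mark per observation. The `rates` list is hypothetical and is not the loan50 data; it only illustrates the rounding and stacking steps.

```python
from collections import Counter

# Hypothetical interest rates in percent (NOT the actual loan50 values).
rates = [5.3, 5.4, 6.1, 7.9, 8.2, 9.9, 10.4, 10.9]

# Round each rate to the nearest percent and count how many
# observations land on each rounded value (the height of each stack).
# Note: Python's round() sends exact .5 ties to the nearest even integer.
stacks = Counter(round(r) for r in rates)

# Text version of a stacked dot plot: one row per percent, one mark per loan.
for value in range(min(stacks), max(stacks) + 1):
    print(f"{value:>3}% | " + "*" * stacks[value])
```

Every observation stays visible as its own mark, which is why the dot plot suits small datasets.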

Histograms

A histogram shows the binned counts of the data plotted as bars. Note that the histogram resembles a more heavily binned version of the stacked dot plot.

Key points:

Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Very good at visualizing data with many observations.
Histograms are especially convenient for understanding the shape of the data distribution.

Example:

Counts for the binned interest rate data.
Interest rate	Count
(5% - 7.5%]	11
(7.5% - 10%]	15
(10% - 12.5%]	8
(12.5% - 15%]	4
(15% - 17.5%]	5
(17.5% - 20%]	4
(20% - 22.5%]	1
(22.5% - 25%]	1
(25% - 27.5%]	1

A histogram of interest rate. This distribution is strongly skewed to the right.
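The binning behind a table like the one above can be sketched as follows. The bins are right-closed, as in the `(5% - 7.5%]` notation, so a value equal to an upper edge falls in the lower bin. The function name `bin_counts` and the sample values are assumptions made for illustration; they are not the loan50 data.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count values in right-closed bins (start, start + width], and so on.

    Values at or below `start`, or above the last edge, are not counted.
    Values extremely close to a bin edge may be affected by floating-point
    rounding; this sketch does not guard against that.
    """
    counts = [0] * n_bins
    for v in values:
        index = math.ceil((v - start) / width) - 1
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical interest rates in percent (NOT the actual loan50 values).
rates = [5.5, 7.5, 7.6, 10.0, 12.4, 26.0]
counts = bin_counts(rates, start=5.0, width=2.5, n_bins=9)

for i, count in enumerate(counts):
    low = 5.0 + i * 2.5
    print(f"({low}% - {low + 2.5}%]\t{count}")
```

Plotting each count as a bar over its bin gives the histogram.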

Modality of Distributions

The modality of a distribution is the number of prominent peaks, also known as modes.

unimodal - one mode (or the most frequent value in the data)
bimodal - two modes
multimodal - multiple modes.

Examples:

Counting only prominent peaks, the distributions are (left to right) unimodal, bimodal, and multimodal. Note that the left plot is unimodal because we are counting prominent peaks, not just any peak.

Shapes of Distributions

There are three main types of distribution shapes:

Right skewed - the distribution trails off to the right and has a longer right tail. It is also called positively skewed.
Symmetric - the distribution trails off roughly equally in both directions.
Left skewed - the reverse of right skewed: the distribution has a longer, thinner tail to the left. It is also called negatively skewed.

Examples:

Source: [Skewness Wiki](https://en.wikipedia.org/wiki/Skewness#/media/File:Relationship_between_mean_and_median_under_different_skewness.png)
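A small numerical illustration of the relationship between skew and the mean and median (both defined in the next section). The three samples are hypothetical and were chosen to make the pattern obvious. The ordering shown is a common rule of thumb, not a guarantee for every dataset.

```python
from statistics import mean, median

# Hypothetical samples, one per shape (illustrative values only).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
left_skewed = [-18, -9, -5, -4, -3, -3, -3, -2, -2, -1]

for name, data in [("right skewed", right_skewed),
                   ("symmetric", symmetric),
                   ("left skewed", left_skewed)]:
    # A long tail pulls the mean toward it, while the median barely moves.
    print(f"{name:>12}: mean = {mean(data)}, median = {median(data)}")
```

In these samples the mean sits above the median for the right-skewed data, equals it for the symmetric data, and sits below it for the left-skewed data.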

Sample Mean and Median

The sample mean can be calculated as the sum of the observed values divided by the number of observations:

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\] where $n$ is the number of observations and $x_1, x_2, \ldots, x_n$ are the observed values.

The median is the number in the middle of ordered numerical observations. If the data are ordered from smallest to largest, the median is the observation right in the middle. If there are an even number of observations, there will be two values in the middle, and the median is taken as their average.
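The two definitions above can be checked with a few lines of code. This is a minimal sketch: the helper `sample_median` and the two small samples are illustrative, and Python's built-in `statistics.median` follows the same rule.

```python
from statistics import median

def sample_median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Illustrative samples: one with an odd and one with an even number of observations.
odd_sample = [1, 3, 3, 6, 7, 8, 9]
even_sample = [1, 2, 3, 4, 5, 6, 8, 9]

# Sample mean: sum of the observed values divided by the number of observations.
x_bar = sum(odd_sample) / len(odd_sample)

print("mean of odd sample:", x_bar)
print("median of odd sample:", sample_median(odd_sample))
print("median of even sample:", sample_median(even_sample))
```

For the even-sized sample, the two middle values are 4 and 5, so the median is their average.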

Median Example:

Source: [Median Wiki](https://en.wikipedia.org/wiki/Median#/media/File:Finding_the_median.png)

Quartiles and the Boxplot

A box plot summarizes a dataset using five statistics while also identifying unusual observations. These statistics are the “minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”. The $\alpha$th percentile is a number with $\alpha$% of the observations below it and $(100-\alpha)$% of the observations above it.

Median - (Q2/50th Percentile) - the middle value.
First quartile - (Q1/25th Percentile) - the middle value between the smallest value (not the “minimum”) and the median of the dataset.
Third quartile - (Q3/75th Percentile) - the middle value between the median and the largest value (not the “maximum”) of the dataset.
Interquartile range (IQR) - the distance from the 25th to the 75th percentile, computed as Q3 - Q1.
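One way to turn these definitions into a computation is the "median of each half" rule sketched below. This is only one of several quartile conventions; software packages often interpolate instead and can return slightly different values. The function name `quartiles` and the sample data are assumptions for illustration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    For an odd number of observations, the middle value is left out of
    both halves. Requires at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(ordered), median(upper_half)

# Illustrative sample (hypothetical values).
data = [1, 3, 3, 6, 7, 8, 9]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print("Q1 =", q1, " median =", q2, " Q3 =", q3, " IQR =", iqr)
```

Here the lower half is 1, 3, 3 and the upper half is 7, 8, 9, so Q1 and Q3 are the middle values of those halves.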

Key Points:

Boxplots give a sense of how spread out the data are.
They show the range of the data.
Unlike a histogram, they cannot show the number of modes.
They can show skewness and potential outliers.

Source: [Towards Data Science: Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)

Note: You might notice the “1.5*IQR” terms in the “minimum” and “maximum” values, which mark the boundaries beyond which observations are flagged as outliers. In short, this is a method for detecting potential outliers in the data. We will revisit outliers in more detail in the near future.
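The 1.5*IQR rule can be sketched in a few lines, assuming the usual fences Q1 - 1.5*IQR and Q3 + 1.5*IQR. The data are hypothetical. The quartiles come from Python's `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions, so other software may place the fences slightly differently.

```python
from statistics import quantiles

# Hypothetical interest rates in percent (NOT the actual loan50 values).
data = [5, 7, 8, 9, 10, 11, 12, 26]

# Quartiles under the "inclusive" interpolation convention (an assumption).
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: observations outside them are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The whiskers end at the most extreme observations inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("fences:", lower_fence, upper_fence)
print("potential outliers:", outliers)
print("whiskers end at:", whisker_low, whisker_high)
```

A flagged point is only a candidate for closer inspection; the rule does not say the observation is wrong.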

Example:

Plot A shows a dot plot and Plot B shows a box plot of the distribution of interest rates from the loan50 dataset.

Outliers

An outlier is an observation that appears extreme relative to the rest of the data. Examining data for outliers serves many useful purposes, including

identifying strong skew in the distribution,
identifying possible data collection or data entry errors, and
providing insight into interesting properties of the data.

Keep in mind, however, that some datasets have a naturally long skew and outlying points do not represent any sort of problem in the dataset.

Sample Statistics

Measures of “center”: mean, median, and mode

Measures of spread: quartiles, IQR, variance, and standard deviation

Variance and standard deviation.

The sample standard deviation is the square root of the following quantity: the sum of the squared distances of each value from the mean, divided by the number of observations minus one.

\[s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\]

The sample variance is roughly the average squared distance from the mean (the sum of squared distances is divided by $n-1$ rather than $n$), and it is the square of the sample standard deviation:

\[s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]

The standard deviation is useful when considering how far the data are distributed from the mean; it represents the typical deviation of observations from the mean.

The variance can be zero or positive, but it can never be negative.
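A step-by-step computation of the sample variance and standard deviation, following the formula for $s$ above. The data are hypothetical; the results are compared against Python's `statistics.variance` and `statistics.stdev`, which also divide by $n-1$.

```python
import math
from statistics import stdev, variance

# Illustrative sample (hypothetical values).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Step 1: sample mean.
x_bar = sum(data) / n

# Step 2: squared distance of each observation from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Step 3: divide the sum by n - 1 to get the sample variance.
s2 = sum(squared_deviations) / (n - 1)

# Step 4: the standard deviation is the square root of the variance.
s = math.sqrt(s2)

print("mean =", x_bar)
print("variance =", s2, "(library:", variance(data), ")")
print("standard deviation =", s, "(library:", stdev(data), ")")
```

Because every squared deviation is at least zero, the variance cannot be negative; it is zero only when all observations are equal.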

Standard Statistics Notation

Meaning	Notation
sample size	$n$
sample mean	$\bar{x}$
population mean	$\mu$
sample standard deviation	$s$
population standard deviation	$\sigma$
sample variance	$s^2$
population variance	$\sigma^2$
InterQuartile Range	$IQR$
Quartiles	$Q_i$

Robustness of the Sample Statistics

A comparison of how the median, IQR, mean, and standard deviation change as the value of an extreme observation from the original interest rate data changes.
	Robust		Not robust
Scenario	Median	IQR	Mean	SD
Original data	9.93	5.75	11.6	5.05
Move 26.3% to 15%	9.93	5.75	11.3	4.61
Move 26.3% to 35%	9.93	5.75	11.7	5.68
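The same experiment can be reproduced on a small hypothetical sample (not the loan50 data): move only the largest observation and recompute the four statistics. The quartiles use the default convention of Python's `statistics.quantiles`, which is an assumption; other conventions give different IQR values but show the same pattern.

```python
from statistics import mean, median, quantiles, stdev

def summary(data):
    """Median, IQR, mean, and SD of a sample."""
    q1, _, q3 = quantiles(data, n=4)
    return {"median": median(data), "IQR": q3 - q1,
            "mean": mean(data), "SD": stdev(data)}

# Hypothetical values; only the largest observation changes between scenarios.
base = [5, 7, 8, 9, 10, 11, 12]
scenarios = {
    "original": base + [26],
    "move 26 to 15": base + [15],
    "move 26 to 35": base + [35],
}

results = {name: summary(data) for name, data in scenarios.items()}
for name, stats in results.items():
    print(f"{name:>14}: " + ", ".join(f"{k} = {v:.2f}" for k, v in stats.items()))
```

The median and IQR depend only on the middle of the ordered data, so they stay fixed, while the mean and SD use every value and shift with the extreme observation.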

Visualizing Numerical Data Across Categories/Levels

Example 1:

Histograms (Plot A) and side-by-side box plots (Plot B) for median household income, where counties are split by whether there was a population gain or not.
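Side-by-side box plots amount to computing the same summary separately for each level of a categorical variable. A minimal sketch with hypothetical records follows; the income values, in thousands of dollars, are made up and are not the county data.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (population change, median household income in $1000s) records.
records = [
    ("gain", 52), ("gain", 61), ("gain", 58),
    ("no gain", 44), ("no gain", 49), ("no gain", 47),
]

# Split the numerical values by the level of the categorical variable.
groups = defaultdict(list)
for level, income in records:
    groups[level].append(income)

# Summarize each group separately, as each box in a side-by-side plot does.
group_medians = {level: median(values) for level, values in groups.items()}
for level, values in groups.items():
    print(f"{level:>8}: n = {len(values)}, min = {min(values)}, "
          f"median = {group_medians[level]}, max = {max(values)}")
```

Comparing the per-group medians and ranges is the numerical counterpart of comparing the boxes visually.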

Example 2:

Summary

In this lecture, we talked about the following:

Exploring numerical data by using sample statistics and visualizations.
Sample mean, median, and mode as measures of “center”.
Sample variance, standard deviation, quartiles, and the IQR as measures of spread.
Anatomy of a distribution using boxplots.
Shapes and modality of histograms
One variable scatter plots, also called dot plots.
Two variable scatter plots

Today’s Activity

Provide answers for the following exercise problem, which is modified from OpenIntro: Introduction to Modern Statistics, Section 5.10.

Dot plots and Boxplots. For each part, sketch dot plots and box plots. Identify the parts of the plots (e.g. mean, median, mode, quartiles, etc.). Have every member of your group try a different part of the problem and share and discuss your individual results with your group.

A: 3, 5, 5, 5, 8, 11, 11, 11, 13; B: 3, 5, 5, 5, 8, 11, 11, 11, 20
A: -20, 0, 0, 0, 15, 25, 30, 30; B: -40, 0, 0, 0, 15, 25, 30, 30
A: 0, 2, 4, 6, 8, 10; B: 20, 22, 24, 26, 28, 30
A: 100, 200, 300, 400, 500; B: 0, 50, 300, 550, 600

2 - Exploring Numerical Data
