2 - Exploring Numerical Data

Alex John Quijano

09/10/2021

Previously on Visualizing Categorical Data…


Previously on Data…


Source: [Fig:1.1 of OpenIntro: Introduction to Modern Statistics](https://openintro-ims.netlify.app/data-hello.html#variable-types)

Visualizing Numerical Data


Exploring numerical data using sample statistics and visualizations.

  1. Mean, Variance, and Standard Deviation

  2. Quartiles, Median, and the InterQuartile Range (IQR)

  3. Dot and Scatter Plots

  4. Histograms

  5. Boxplots

  6. Comparing Numerical data across categories/levels.

Two Variable Scatter Plots

A two-variable scatterplot provides a case-by-case view of data for two numerical variables.

Example 1:

A scatterplot of loan amount versus total income for the `loan50` dataset.

Example 2:

A scatterplot of the median household income against the poverty rate for the `county` dataset. Data are from 2017. A statistical model has also been fit to the data and is shown as a dashed line.

One Variable Scatter Plots

A dot plot is a one-variable scatterplot; an example using the interest rate of 50 loans from the loan50 dataset is shown below. Note that the loan50 dataset is a subset of 50 rows sampled from the loans_full_schema dataset.

Key points:

A dot plot of interest rate for the `loan50` dataset. The rates have been rounded to the nearest percent in this plot, and the distribution's mean is shown as a red triangle.
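The rounding-and-stacking idea behind a dot plot can be sketched in a few lines of Python. This is only an illustration: the `rates` values are hypothetical (they are not the loan50 data), and `dot_plot_rows` is a helper name made up for this example.

```python
from collections import Counter

# Hypothetical interest rates in percent; these are NOT the loan50 values.
rates = [5.3, 6.1, 7.4, 7.9, 9.4, 9.9, 10.4, 10.9, 12.6, 17.1, 26.3]


def dot_plot_rows(values):
    """Round each value to the nearest whole number and stack one mark per observation."""
    counts = Counter(round(v) for v in values)
    return [(value, "*" * counts[value]) for value in sorted(counts)]


rows = dot_plot_rows(rates)
for value, dots in rows:
    print(f"{value:>3}% {dots}")

# The mean is computed from the unrounded values.
mean_rate = sum(rates) / len(rates)
print(f"mean = {mean_rate:.1f}%")
```

Each printed row plays the role of one position on the horizontal axis, with one mark per observation that rounds to that value.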

Histograms


A histogram shows the binned counts of the data plotted as bars. Note that the histogram resembles a more heavily binned version of the stacked dot plot.

Key points:


Example:

Counts for the binned interest rate data.

| Interest rate | Count |
|---|---|
| (5% - 7.5%] | 11 |
| (7.5% - 10%] | 15 |
| (10% - 12.5%] | 8 |
| (12.5% - 15%] | 4 |
| (15% - 17.5%] | 5 |
| (17.5% - 20%] | 4 |
| (20% - 22.5%] | 1 |
| (22.5% - 25%] | 1 |
| (25% - 27.5%] | 1 |

A histogram of interest rate. This distribution is strongly skewed to the right.
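As a rough sketch of how binned counts like these could be produced, the Python snippet below assigns each value to a right-closed bin of the form (lower, upper] and prints a text histogram. The `values` list is hypothetical (not the loan50 data), and `bin_counts` is a helper name invented for this illustration.

```python
import math
from collections import Counter

# Hypothetical interest rates in percent; these are NOT the loan50 values.
values = [5.3, 6.1, 7.5, 7.9, 9.4, 9.9, 10.0, 10.9, 12.6, 17.1, 26.3]


def bin_counts(data, start, width):
    """Count values in right-closed bins (start + k*width, start + (k+1)*width]."""
    counts = Counter()
    for v in data:
        k = math.ceil((v - start) / width) - 1  # index of the bin that contains v
        lower = start + k * width
        counts[(lower, lower + width)] += 1
    return dict(sorted(counts.items()))


counts = bin_counts(values, start=5.0, width=2.5)

# Bins with no observations are simply not listed in this sketch.
for (lower, upper), n in counts.items():
    print(f"({lower:g}% - {upper:g}%] {'#' * n}")
```

Because the bins are closed on the right, a value of exactly 7.5 is counted in (5% - 7.5%], not in (7.5% - 10%].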

Modality of Distributions

The modality of a distribution is the number of prominent peaks, also known as modes.

Examples:

Counting only prominent peaks, the distributions are (left to right) unimodal, bimodal, and multimodal. Note that the left plot is unimodal because we are counting prominent peaks, not just any peak.

Shapes of Distributions

There are three main types of distribution shapes: left-skewed (negative skew), symmetric, and right-skewed (positive skew).

Examples:

Source: [Skewness Wiki](https://en.wikipedia.org/wiki/Skewness#/media/File:Relationship_between_mean_and_median_under_different_skewness.png)
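The relationship between skew and the position of the mean relative to the median can be checked numerically. The three samples below are hypothetical and chosen only to illustrate the pattern; the pattern is a common tendency, not a rule that holds for every dataset.

```python
from statistics import mean, median

# Hypothetical samples chosen to illustrate each shape.
samples = {
    "left-skewed": [1, 6, 8, 9, 9, 10, 10],
    "symmetric": [2, 4, 5, 6, 6, 7, 8, 10],
    "right-skewed": [1, 1, 2, 2, 3, 5, 10],
}

# Difference between mean and median for each sample.
gaps = {name: mean(xs) - median(xs) for name, xs in samples.items()}

for name, gap in gaps.items():
    print(f"{name:>12}: mean - median = {gap:+.2f}")
```

In these samples the long tail pulls the mean toward it: below the median for the left-skewed sample, above it for the right-skewed sample, and equal to it for the symmetric sample.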

Sample Mean and Median

The sample mean can be calculated as the sum of the observed values divided by the number of observations:

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]

where \(n\) is the number of observations and \(x_1, x_2, \ldots, x_n\) are the observed values.

The median is the number in the middle of ordered numerical observations. If the data are ordered from smallest to largest, the median is the observation right in the middle. If there are an even number of observations, there will be two values in the middle, and the median is taken as their average.
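The two definitions above translate directly into code. The sketch below uses small hypothetical datasets, one with an odd and one with an even number of observations; `sample_mean` and `sample_median` are illustrative helper names, and Python's built-in `statistics` module is used only as a cross-check.

```python
import statistics


def sample_mean(xs):
    """Sum of the observed values divided by the number of observations."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data, deliberately given out of order.
odd_data = [7, 1, 9, 3, 6, 3, 8]      # n = 7, one middle value
even_data = [9, 1, 4, 2, 8, 3, 6, 5]  # n = 8, two middle values

print(sample_mean(odd_data), sample_median(odd_data))
print(sample_mean(even_data), sample_median(even_data))
```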

Median Example:

Source: [Median Wiki](https://en.wikipedia.org/wiki/Median#/media/File:Finding_the_median.png)

Quartiles and the Boxplot

A box plot summarizes a dataset using five statistics while also identifying unusual observations. These statistics are the “minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”. The \(\alpha\)th percentile is a number with \(\alpha\)% of the observations below it and \((100-\alpha)\)% of the observations above it; Q1, the median, and Q3 are the 25th, 50th, and 75th percentiles.
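As a minimal sketch, the quartiles and the IQR can be computed with Python's `statistics.quantiles`. The data are hypothetical, and the `"inclusive"` method is one assumption among several conventions: different software packages interpolate quartiles slightly differently, so results on small samples may not match other tools exactly.

```python
from statistics import quantiles

# Hypothetical data with 9 observations.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12]

# n=4 splits the data into four parts and returns the three cut points Q1, median, Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # the IQR is the width of the box in a box plot

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```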

Key Points:



Source: [Towards Data Science: Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)

Note: You might notice the “1.5*IQR” terms in the “minimum” and “maximum” values, which separate the outliers from the rest of the data. In short, this is a method for detecting potential outliers. We will revisit outliers in more detail in the near future.
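The 1.5*IQR rule can be sketched as follows, assuming the common convention that the fences sit at Q1 - 1.5*IQR and Q3 + 1.5*IQR. The data are hypothetical, and the `"inclusive"` quartile method is an assumption; other quartile conventions can move the fences slightly.

```python
from statistics import quantiles

# Hypothetical data with one unusually large observation.
data = [2, 4, 4, 5, 7, 8, 9, 11, 30]

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: observations beyond these are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# The whiskers extend to the most extreme observations that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"whiskers: {whisker_low} to {whisker_high}, potential outliers: {outliers}")
```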

Example:

Plot A shows a dot plot and Plot B shows a box plot of the distribution of interest rates from the `loan50` dataset.

Outliers


An outlier is an observation that appears extreme relative to the rest of the data. Examining data for outliers serves many useful purposes, including

  1. identifying strong skew in the distribution,

  2. identifying possible data collection or data entry errors, and

  3. providing insight into interesting properties of the data.

Keep in mind, however, that some datasets have a naturally long skew and outlying points do not represent any sort of problem in the dataset.

Sample Statistics

Measures of “center”: mean, median, and mode

Measures of spread: quartiles, IQR, variance, and standard deviation


Variance and standard deviation.

The sample standard deviation is

\[s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\]

and the sample variance is its square:

\[s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]

The standard deviation is useful when considering how far the data are spread from the mean; it represents the typical deviation of observations from the mean.

The variance can be zero or positive, but it can never be negative.
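A minimal sketch of the sample variance and standard deviation with the \(n-1\) denominator, checked against Python's `statistics` module. The data are hypothetical, and `sample_variance` is an illustrative helper name.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)  # sample variance
s = math.sqrt(s2)           # sample standard deviation

print(f"s^2 = {s2:.3f}, s = {s:.3f}")
```

Because every term in the sum is a square, the variance cannot be negative; it is zero only when all observations are equal.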

Standard Statistics Notation



| Meaning | Notation |
|---|---|
| sample size | \(n\) |
| sample mean | \(\bar{x}\) |
| population mean | \(\mu\) |
| sample standard deviation | \(s\) |
| population standard deviation | \(\sigma\) |
| sample variance | \(s^2\) |
| population variance | \(\sigma^2\) |
| InterQuartile Range | \(IQR\) |
| Quartiles | \(Q_i\) |

Robustness of the Sample Statistics



A comparison of how the median, IQR, mean, and standard deviation change as the value of an extreme observation from the original interest data changes.

| Scenario | Median (robust) | IQR (robust) | Mean (not robust) | SD (not robust) |
|---|---|---|---|---|
| Original data | 9.93 | 5.75 | 11.6 | 5.05 |
| Move 26.3% to 15% | 9.93 | 5.75 | 11.3 | 4.61 |
| Move 26.3% to 35% | 9.93 | 5.75 | 11.7 | 5.68 |
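The same kind of comparison can be reproduced on a small scale. The sketch below uses hypothetical data (not the loan50 interest rates) and moves the largest observation, mirroring the scenarios in the table; `summarize` is an illustrative helper, and the `"inclusive"` quartile method is an assumption.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return the four summary statistics compared in the table."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "median": median(values),
        "IQR": q3 - q1,
        "mean": mean(values),
        "SD": stdev(values),
    }


# Hypothetical data whose largest value (26) plays the role of the extreme observation.
original = [5, 6, 7, 8, 9, 10, 11, 12, 26]
scenarios = {
    "Original data": original,
    "Move 26 to 15": original[:-1] + [15],
    "Move 26 to 35": original[:-1] + [35],
}

results = {name: summarize(values) for name, values in scenarios.items()}
for name, stats in results.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

In this sketch the median and IQR are identical in all three scenarios, while the mean and standard deviation shift with the extreme value.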

Visualizing Numerical Data Across Categories/Levels

Example 1:

Histograms (Plot A) and side-by-side box plots (Plot B) for median household income, where counties are split by whether there was a population gain or not.
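Comparing a numerical variable across levels of a categorical variable amounts to splitting the observations by level and summarizing each group separately. The sketch below assumes hypothetical records of (population change, median household income in thousands of dollars); they are not values from the county dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (group, income) records; NOT taken from the county dataset.
records = [
    ("gain", 48), ("no gain", 38), ("gain", 55), ("no gain", 42),
    ("gain", 61), ("no gain", 45), ("gain", 70), ("no gain", 51),
    ("no gain", 66),
]

# Split the numerical values by the level of the categorical variable.
groups = defaultdict(list)
for label, income in records:
    groups[label].append(income)

# Summarize each group separately, as a side-by-side box plot does visually.
summary = {
    label: {"n": len(values), "min": min(values), "median": median(values), "max": max(values)}
    for label, values in groups.items()
}

for label, stats in summary.items():
    print(label, stats)
```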

Example 2:

Summary


In this lecture, we talked about the following:

Today’s Activity


Provide answers for the following exercise problems. The exercise is adapted from OpenIntro: Introduction to Modern Statistics, Section 5.10.

Dot plots and Boxplots. For each part, sketch dot plots and box plots. Identify the parts of the plots (e.g. mean, median, mode, quartiles, etc.). Have every member of your group try a different part of the problem and share and discuss your individual results with your group.

  1. A: 3, 5, 5, 5, 8, 11, 11, 11, 13; B: 3, 5, 5, 5, 8, 11, 11, 11, 20

  2. A: -20, 0, 0, 0, 15, 25, 30, 30; B: -40, 0, 0, 0, 15, 25, 30, 30

  3. A: 0, 2, 4, 6, 8, 10; B: 20, 22, 24, 26, 28, 30

  4. A: 100, 200, 300, 400, 500; B: 0, 50, 300, 550, 600