Alex John Quijano
09/10/2021
Using contingency tables to compute proportions and percentages.
The three types of bar plots (stacked, dodged, and standardized) and their advantages and disadvantages.
The advantages and disadvantages of mosaic plots, pie charts, and waffle charts.
Exploring numerical data using sample statistics and visualizations.
Mean, Variance, and Standard Deviation
Quartiles, Median, and the InterQuartile Range (IQR)
Dot and Scatter Plots
Histograms
Boxplots
Comparing Numerical data across categories/levels.
A two-variable scatterplot provides a case-by-case view of data for two numerical variables.
Example 1:
A scatterplot of loan amount versus total income for the loan50 dataset.
Example 2:
A scatterplot of the median household income against the poverty rate for the county dataset. Data are from 2017. A statistical model has also been fit to the data and is shown as a dashed line.
A dot plot is a one-variable scatterplot; an example using the interest rate of 50 loans from the loan50 dataset is shown below. Note that the loan50 dataset is a subset of 50 sampled rows from the loans_full_schema dataset.
Key points:
It shows the exact value for each observation.
Useful for small datasets.
A dot plot of interest rate for the loan50 dataset. The rates have been rounded to the nearest percent in this plot, and the distribution’s mean is shown as a red triangle.
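A stacked dot plot can be approximated in plain text by rounding each value and stacking one marker per observation. The sketch below uses a handful of made-up interest rates (not the loan50 values), and the function name `stacked_dot_counts` is hypothetical.

```python
from collections import Counter

def stacked_dot_counts(values):
    """Round each value to the nearest integer and count observations per value."""
    return Counter(round(v) for v in values)

# Made-up interest rates in percent (illustrative only, not the loan50 data).
rates = [5.3, 6.1, 6.4, 7.2, 7.4, 9.9, 10.2, 10.4, 17.1, 26.3]

counts = stacked_dot_counts(rates)
for value in range(min(counts), max(counts) + 1):
    # One "o" per observation; rows with no dots show gaps in the data.
    print(f"{value:>3}% | " + "o" * counts[value])
```

Because every observation gets its own marker, this kind of display stays readable only for small datasets.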
A histogram shows the binned counts of the data plotted as bars. Note that the histogram resembles a more heavily binned version of the stacked dot plot.
Key points:
Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Very good at visualizing data with many observations.
Histograms are especially convenient for understanding the shape of the data distribution.
Example:
Interest rate | Count |
---|---|
(5% - 7.5%] | 11 |
(7.5% - 10%] | 15 |
(10% - 12.5%] | 8 |
(12.5% - 15%] | 4 |
(15% - 17.5%] | 5 |
(17.5% - 20%] | 4 |
(20% - 22.5%] | 1 |
(22.5% - 25%] | 1 |
(25% - 27.5%] | 1 |
A histogram of interest rate. This distribution is strongly skewed to the right.
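The counts in a table like the one above come from assigning each observation to a bin that is open on the left and closed on the right, such as (5%, 7.5%]. A minimal sketch of that binning step, assuming equal-width bins and made-up interest rates (the helper `bin_counts` is hypothetical, not part of any library):

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count values in right-closed bins (start, start + width], (start + width, start + 2*width], ..."""
    counts = [0] * n_bins
    for v in values:
        # ceil(...) - 1 places a value that sits exactly on a boundary in the lower bin.
        index = math.ceil((v - start) / width) - 1
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Made-up interest rates in percent (illustrative only, not the loan50 data).
rates = [5.3, 6.1, 7.5, 7.6, 9.9, 10.0, 10.4, 12.5, 17.1, 26.3]
counts = bin_counts(rates, start=5.0, width=2.5, n_bins=9)

for i, count in enumerate(counts):
    low = 5.0 + 2.5 * i
    print(f"({low:>4}% - {low + 2.5:>4}%] | " + "#" * count)
```

Values that fall outside all bins are silently skipped in this sketch; a real analysis would choose the bin range to cover the data.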
The modality of a distribution is the number of prominent peaks, also known as modes.
unimodal - one mode (or the most frequent value in the data)
bimodal - two modes
multimodal - multiple modes.
Examples:
Counting only prominent peaks, the distributions are (left to right) unimodal, bimodal, and multimodal. Note that the left plot is unimodal because we are counting prominent peaks, not just any peak.
There are three main types of distribution shapes:
right skewed - the distribution trails off to the right and has a longer right tail. It is also called positively skewed.
symmetric - the distribution trails off roughly equally in both directions.
left skewed - the reverse characteristic: the distribution has a longer, thinner tail to the left. It is also called negatively skewed.
Examples:
Source: Skewness Wiki
The sample mean can be calculated as the sum of the observed values divided by the number of observations:
\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\] where \(n\) is the number of observations and \(x_1, x_2, \ldots, x_n\) are the observed values.
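As a quick check of the formula, the sketch below computes the sample mean by hand and compares it with `statistics.mean` from the Python standard library. The five values are made up for illustration.

```python
from statistics import mean

def sample_mean(values):
    """Sum of the observed values divided by the number of observations."""
    return sum(values) / len(values)

sample = [2, 4, 4, 5, 10]   # made-up observations, n = 5

x_bar = sample_mean(sample)  # (2 + 4 + 4 + 5 + 10) / 5
print(x_bar)                 # 5.0
print(mean(sample))          # the standard library gives the same value
```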
The median is the number in the middle of ordered numerical observations. If the data are ordered from smallest to largest, the median is the observation right in the middle. If there are an even number of observations, there will be two values in the middle, and the median is taken as their average.
Median Example:
Source: Median Wiki
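The two cases in the definition (odd and even number of observations) can be sketched as follows. The function name `sample_median` and the example values are only illustrative.

```python
def sample_median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values

print(sample_median([7, 1, 3]))      # ordered: 1, 3, 7    -> 3
print(sample_median([9, 1, 7, 3]))   # ordered: 1, 3, 7, 9 -> (3 + 7) / 2 = 5.0
```

Note that the data must be sorted first; the median is about position in the ordered list, not position in the raw data.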
A box plot summarizes a dataset using five statistics while also identifying unusual observations. These statistics are the “minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”. An \(\alpha\)th percentile is a number with \(\alpha\)% of the observations below it and \(100-\alpha\)% of the observations above it.
Median - (Q2/50th Percentile) - the middle value.
First quartile - (Q1/25th Percentile) - the middle number between the smallest number (not the “minimum”) and the median of the dataset.
Third quartile - (Q3/75th Percentile) - the middle value between the median and the highest value (not the “maximum”) of the dataset.
Interquartile range (IQR) - the distance from the 25th to the 75th percentile, computed as Q3 - Q1.
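One way to turn these definitions into a computation is the "median of each half" approach sketched below. It assumes that, when the number of observations is odd, the median itself is left out of both halves; this is only one of several conventions, so statistical software may report slightly different quartiles for the same data. The function name `quartiles` and the data are made up.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Assumption: for odd n the middle observation is excluded from both halves.
    Needs at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # observations below the median position
    upper = ordered[n // 2 + n % 2 :]    # observations above the median position
    return median(lower), median(ordered), median(upper)

data = [1, 3, 5, 7, 9, 11, 13]           # made-up observations
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)                   # 3 7 11 8
```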
Key Points:
Boxplots give a sense of how spread out the data are.
They show the range of the data.
Unlike a histogram, a boxplot cannot show the number of modes.
They can show skewness and potential outliers.
Note: You might notice the “1.5*IQR” terms in the “minimum” and “maximum” values, which separate the outliers from the rest of the data. In short, this is a method to detect potential outliers in the data. We will revisit the outlier topic in the near future for more details.
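A minimal sketch of the 1.5*IQR rule is given below. It assumes the common convention that the whiskers reach to the most extreme observations that still lie within 1.5*IQR of the box, and it uses the median-of-halves quartiles (other quartile conventions give slightly different fences). The function name `boxplot_summary` and the data are hypothetical.

```python
from statistics import median

def boxplot_summary(values, k=1.5):
    """Five-number-style summary plus potential outliers under the k*IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[n // 2 + n % 2 :])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr           # the "minimum" cutoff
    upper_fence = q3 + k * iqr           # the "maximum" cutoff
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "lower_fence": lower_fence,
        "whisker_low": inside[0],        # smallest observation inside the fences
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "whisker_high": inside[-1],      # largest observation inside the fences
        "upper_fence": upper_fence,
        "outliers": [x for x in ordered if x < lower_fence or x > upper_fence],
    }

summary = boxplot_summary([1, 3, 5, 7, 9, 11, 30])   # made-up observations
print(summary["upper_fence"], summary["outliers"])
```

Here Q1 = 3 and Q3 = 11, so the IQR is 8 and the fences sit at 3 - 12 = -9 and 11 + 12 = 23; the observation 30 lies beyond the upper fence and is flagged as a potential outlier.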
Example:
Plot A shows a dot plot and Plot B shows a box plot of the distribution of interest rates from the loan50 dataset.
An outlier is an observation that appears extreme relative to the rest of the data. Examining data for outliers serves many useful purposes, including
identifying strong skew in the distribution,
identifying possible data collection or data entry errors, and
providing insight into interesting properties of the data.
Keep in mind, however, that some datasets have a naturally long skew and outlying points do not represent any sort of problem in the dataset.
Measures of “center”: mean, median, and mode
Measures of spread: quartiles, IQR, variance, and standard deviation
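Variance and standard deviation.
The sample variance \(s^2\) is roughly the average squared deviation from the mean, and the sample standard deviation \(s\) is its square root:
\[s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]
\[s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\]
The standard deviation is useful when considering how far the data are distributed from the mean and it represents the typical deviation of observations from the mean.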
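To see the formula in action, the sketch below computes the sample variance and standard deviation step by step for five made-up values and compares the result with `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

sample = [2, 4, 4, 5, 10]        # made-up observations with mean 5

# Squared deviations: 9, 1, 1, 0, 25 -> sum 36 -> 36 / (5 - 1) = 9
s2 = sample_variance(sample)
s = math.sqrt(s2)
print(s2, s)                     # 9.0 3.0
print(variance(sample), stdev(sample))
```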
Meaning | Notation |
---|---|
sample size | \(\hspace{10px} n\) |
sample mean | \(\hspace{10px} \bar{x}\) |
population mean | \(\hspace{10px} \mu\) |
sample standard deviation | \(\hspace{10px} s\) |
population standard deviation | \(\hspace{10px} \sigma\) |
sample variance | \(\hspace{10px} s^2\) |
population variance | \(\hspace{10px} \sigma^2\) |
InterQuartile Range | \(\hspace{10px} IQR\) |
Quartiles | \(\hspace{10px} Q_i\) |
Scenario | Median (robust) | IQR (robust) | Mean (not robust) | SD (not robust) |
---|---|---|---|---|
Original data | 9.93 | 5.75 | 11.6 | 5.05 |
Move 26.3% to 15% | 9.93 | 5.75 | 11.3 | 4.61 |
Move 26.3% to 35% | 9.93 | 5.75 | 11.7 | 5.68 |
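The same experiment can be repeated on a small made-up dataset: move only the largest observation and recompute the four statistics. This is a sketch under stated assumptions (median-of-halves quartiles, hypothetical helper names `iqr` and `summarize`), not the computation behind the table above.

```python
from statistics import mean, median, stdev

def iqr(values):
    """Q3 - Q1 using the median-of-halves convention (middle value excluded for odd n)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[n // 2 + n % 2 :])
    return q3 - q1

def summarize(values):
    return {"median": median(values), "iqr": iqr(values),
            "mean": mean(values), "sd": stdev(values)}

original = [5, 6, 7, 8, 9, 10, 11, 12, 26]   # made-up data with one large value
moved_down = original[:-1] + [15]            # move the largest value closer in
moved_up = original[:-1] + [35]              # move the largest value farther out

for label, data in [("original", original), ("26 -> 15", moved_down), ("26 -> 35", moved_up)]:
    stats = summarize(data)
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

The median and IQR are identical in all three scenarios, while the mean and standard deviation follow the extreme value up and down.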
Example 1:
Histograms (Plot A) and side-by-side box plots (Plot B) for median household income, where counties are split by whether there was a population gain or not.
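Numerically, comparing a variable across categories amounts to splitting the observations by group and summarizing each group separately. A minimal sketch with made-up (group, value) records follows; the values are illustrative incomes in thousands of dollars and are not taken from the county dataset.

```python
from collections import defaultdict
from statistics import median

# Made-up records: (population change category, median household income in $1000s).
records = [
    ("gain", 52.0), ("gain", 61.5), ("gain", 48.3),
    ("no gain", 41.2), ("no gain", 45.0), ("no gain", 39.8),
]

def median_by_group(rows):
    """Collect the values for each category, then take the median within each one."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {label: median(values) for label, values in groups.items()}

group_medians = median_by_group(records)
print(group_medians)
```

Side-by-side box plots display this kind of per-group summary graphically, one box per category on a shared axis.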
Example 2:
In this lecture, we talked about the following:
Exploring numerical data by using sample statistics and visualizations.
Sample mean, median, and mode as measures of “center”.
Sample variance, standard deviation, quartiles, and the IQR as measures of spread.
Anatomy of a distribution using boxplots.
Shapes and modality of histograms
One variable scatter plots, also called dot plots.
Two variable scatter plots
Provide answers for the following exercise problems. This exercise problem is modified from OpenIntro: Introduction to Modern Statistics Section 5.10
Dot plots and Boxplots. For each part, sketch dot plots and box plots. Identify the parts of the plots (e.g. mean, median, mode, quartiles, etc.). Have every member of your group try a different part of the problem and share and discuss your individual results with your group.
A: 3, 5, 5, 5, 8, 11, 11, 11, 13; B: 3, 5, 5, 5, 8, 11, 11, 11, 20
A: -20, 0, 0, 0, 15, 25, 30, 30; B: -40, 0, 0, 0, 15, 25, 30, 30
A: 0, 2, 4, 6, 8, 10; B: 20, 22, 24, 26, 28, 30
A: 100, 200, 300, 400, 500; B: 0, 50, 300, 550, 600