Alex John Quijano
09/08/2021
Exploring categorical data using summary statistics and visualizations.
Contingency Tables
Bar Plots
Mosaic Plots
Pie Charts
Waffle Charts
A contingency table summarizes data for two categorical variables.
Here, we use the loans_full_schema data set for the tables and visualizations in the following slides.
Example:
Counts. Rows: application_type. Columns: homeownership.
application_type | rent | mortgage | own | Total |
---|---|---|---|---|
joint | 362 | 950 | 183 | 1495 |
individual | 3496 | 3839 | 1170 | 8505 |
Total | 3858 | 4789 | 1353 | 10000 |
What percentage of loans were filed as individual applications? \(\frac{8505}{10000} = 0.8505 \rightarrow 85.05\%\)
What percentage of loans were filed jointly by applicants who own a home? \(\frac{183}{10000} = 0.0183 \rightarrow 1.83\%\)
What percentage of loans were filed as individual applications by applicants who rent? \(\frac{3496}{10000} = 0.3496 \rightarrow 34.96\%\)
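The three percentages above all divide by the grand total. A minimal Python sketch of that computation follows; the dictionary layout and the function names (`grand_total`, `overall_proportions`, `row_totals`) are illustrative choices, not part of the original slides.

```python
# Counts copied from the contingency table
# (rows: application_type, columns: homeownership).
counts = {
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}


def grand_total(table):
    """Sum of every cell in the table."""
    return sum(sum(row.values()) for row in table.values())


def row_totals(table):
    """Total count for each row level."""
    return {r: sum(row.values()) for r, row in table.items()}


def overall_proportions(table):
    """Divide each cell by the grand total."""
    n = grand_total(table)
    return {r: {c: v / n for c, v in row.items()} for r, row in table.items()}


n = grand_total(counts)
props = overall_proportions(counts)
individual_share = row_totals(counts)["individual"] / n

print(f"individual: {individual_share:.2%}")
print(f"joint and own: {props['joint']['own']:.2%}")
print(f"individual and rent: {props['individual']['rent']:.2%}")
```

Every cell proportion here shares the same denominator (10000), so the six cell proportions add up to 1.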
Example:
Row proportions (each row sums to 1). Rows: application_type. Columns: homeownership.
application_type | rent | mortgage | own | Total |
---|---|---|---|---|
joint | 0.242 | 0.635 | 0.122 | 1 |
individual | 0.411 | 0.451 | 0.138 | 1 |
What percentage of individual loans are held by applicants with a mortgage? \(\frac{3839}{8505} = 0.451 \rightarrow 45.1\%\)
Given that they filed jointly, what percentage of filers rent? \(\frac{362}{1495} = 0.242 \rightarrow 24.2\%\)
Among individual filers, what percentage have a mortgage? \(\frac{3839}{8505} = 0.451 \rightarrow 45.1\%\)
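Row proportions condition on the application type: each cell is divided by its row total rather than the grand total. The sketch below shows one way to compute them in plain Python; `row_proportions` is a hypothetical helper name introduced for illustration.

```python
# Counts copied from the contingency table
# (rows: application_type, columns: homeownership).
counts = {
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}


def row_proportions(table):
    """Divide each cell by its row total, so every row sums to 1."""
    result = {}
    for level, row in table.items():
        total = sum(row.values())
        result[level] = {c: v / total for c, v in row.items()}
    return result


row_props = row_proportions(counts)
for level, row in row_props.items():
    print(level, {c: round(p, 3) for c, p in row.items()})
```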
Example:
Column proportions (each column sums to 1). Rows: application_type. Columns: homeownership.
application_type | rent | mortgage | own |
---|---|---|---|
joint | 0.094 | 0.198 | 0.135 |
individual | 0.906 | 0.802 | 0.865 |
Total | 1.000 | 1.000 | 1.000 |
Given that the applicant rents, what percentage of loans were filed jointly? \(\frac{362}{3858} = 0.094 \rightarrow 9.4\%\)
Among renters, what percentage of loans were filed as individual applications? \(\frac{3496}{3858} = 0.906 \rightarrow 90.6\%\)
Given that the applicant owns a home, what percentage of loans were filed as individual applications? \(\frac{1170}{1353} = 0.865 \rightarrow 86.5\%\)
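Column proportions condition on homeownership instead: each cell is divided by its column total. A minimal sketch, assuming the same dictionary layout as an illustrative convention; `column_proportions` is a hypothetical helper name.

```python
# Counts copied from the contingency table
# (rows: application_type, columns: homeownership).
counts = {
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}


def column_proportions(table):
    """Divide each cell by its column total, so every column sums to 1."""
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        level: {c: v / col_totals[c] for c, v in row.items()}
        for level, row in table.items()
    }


col_props = column_proportions(counts)
for level, row in col_props.items():
    print(level, {c: round(p, 3) for c, p in row.items()})
```

Comparing this output with the row proportions shows that the two conditionings answer different questions: 0.242 is the share of renters among joint filers, while 0.094 is the share of joint filers among renters.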
A bar plot shows categorical data with rectangular bars with heights or lengths corresponding to the values that they represent.
Useful for visualizing the relationship between two categorical variables.
Works well when each variable has a low number of levels.
It might be difficult to visualize if the categories are severely imbalanced.
Types of bar plots:
Stacked: It can clearly display the most common level.
Dodged: It can clearly display the most common level within each group.
Standardized: It can be helpful in understanding the fraction of levels relative to other levels. We can observe associations between the variables.
Example:
Three bar plots (stacked, dodged, and standardized) displaying homeownership and application type variables.
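The three bar types differ only in the numbers they draw. The sketch below computes those numbers without plotting anything, assuming homeownership is on the horizontal axis and application type is the fill variable; that assignment and the helper names (`stacked_segments`, `dodged_heights`) are assumptions for illustration, not taken from the slides.

```python
# Counts copied from the contingency table
# (rows: application_type, columns: homeownership).
counts = {
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}


def dodged_heights(table, column):
    """Dodged bars: one bar per fill level, each as tall as its count."""
    return {level: row[column] for level, row in table.items()}


def stacked_segments(table, column, standardize=False):
    """Stacked bars: (level, bottom, height) for each segment of one bar.

    With standardize=True the heights are divided by the bar total,
    so every bar reaches exactly 1 (a standardized bar plot).
    """
    heights = dodged_heights(table, column)
    if standardize:
        total = sum(heights.values())
        heights = {level: h / total for level, h in heights.items()}
    segments, bottom = [], 0
    for level, h in heights.items():
        segments.append((level, bottom, h))
        bottom += h
    return segments


for column in ("rent", "mortgage", "own"):
    print(column, "dodged:", dodged_heights(counts, column))
    print(column, "stacked:", stacked_segments(counts, column))
    print(column, "standardized:", stacked_segments(counts, column, standardize=True))
```

The stacked and dodged versions use raw counts, so bar sizes reflect group sizes; the standardized version discards group size and keeps only the within-group fractions.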
A mosaic plot is a visualization technique suitable for contingency tables that resembles a standardized stacked bar plot with the benefit that we still see the relative group sizes of the primary variable as well.
Very good at comparing proportions between levels.
Can be very good at looking at associations between two categorical variables.
Example:
Two mosaic plots: one for homeownership alone and the other displaying the relationship between homeownership and application type.
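One way to read a mosaic plot is as a set of tiles whose widths encode the group sizes of the primary variable and whose heights encode the proportions within each group. The sketch below computes those tile dimensions, assuming homeownership is the primary variable; `mosaic_tiles` is a hypothetical helper, and a plotting library may lay tiles out differently (for example, with gaps between them).

```python
# Counts copied from the contingency table
# (rows: application_type, columns: homeownership).
counts = {
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}


def mosaic_tiles(table):
    """Return {(column, row): (width, height)} for every tile.

    width  = share of all cases in that column (relative group size)
    height = share of that column's cases in that row
    """
    columns = list(next(iter(table.values())))
    n = sum(sum(row.values()) for row in table.values())
    tiles = {}
    for c in columns:
        col_total = sum(row[c] for row in table.values())
        for level, row in table.items():
            tiles[(c, level)] = (col_total / n, row[c] / col_total)
    return tiles


tiles = mosaic_tiles(counts)
for (column, level), (width, height) in tiles.items():
    print(f"{column:9s} {level:11s} width={width:.4f} height={height:.3f}")
```

Each tile's area (width times height) equals that cell's share of all loans, which is why a mosaic plot shows within-group proportions and relative group sizes at the same time.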
A pie chart represents the same information as a bar plot.
Useful for giving a high-level overview to show how a set of cases breaks down.
Difficult to decipher certain details in a pie chart.
Works well when the goal is to visualize a categorical variable with very few levels.
Difficult to read when they are used to visualize a categorical variable with many levels.
Example 1:
A pie chart and bar plot of homeownership.
Example 2:
A pie chart and bar plot of loan grades.
A waffle chart can be used to communicate the proportion of the data that falls into each level of a categorical variable.
Works best when the number of levels represented is low.
Compared with pie charts, it is easier to compare proportions that are not simple fractions.
Example:
Plot A: Waffle chart of homeownership, with levels rent, mortgage, and own. Plot B: Waffle chart of loan status, with levels current, fully paid, in grace period, and late.
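A waffle chart has to turn proportions into a whole number of squares. The sketch below assigns 100 squares to the homeownership levels using largest-remainder rounding and prints a text grid; the rounding rule, the 10 by 10 grid, and the helper names are assumptions for illustration, since the slides do not say how the squares were allocated.

```python
import math

# Column totals from the contingency table (homeownership counts).
homeownership = {"rent": 3858, "mortgage": 4789, "own": 1353}


def waffle_squares(level_counts, n_squares=100):
    """Allocate n_squares across levels with largest-remainder rounding."""
    total = sum(level_counts.values())
    exact = {k: v * n_squares / total for k, v in level_counts.items()}
    squares = {k: math.floor(x) for k, x in exact.items()}
    leftover = n_squares - sum(squares.values())
    # Hand the remaining squares to the levels with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - squares[k], reverse=True)
    for k in by_remainder[:leftover]:
        squares[k] += 1
    return squares


def render(squares, per_row=10):
    """Draw the waffle as text, one capital initial per square."""
    cells = "".join(level[0].upper() * k for level, k in squares.items())
    return "\n".join(cells[i:i + per_row] for i in range(0, len(cells), per_row))


squares = waffle_squares(homeownership)
print(squares)
print(render(squares))
```

With these counts the exact shares are 38.58, 47.89, and 13.53 squares, so the two leftover squares go to mortgage and rent.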
In this lecture, we talked about the following:
Exploring categorical data by using contingency tables and visualizations.
Using contingency tables to compute proportions and percentages.
The three types of bar plots (stacked, dodged, and standardized) and their advantages and disadvantages.
The advantages and disadvantages of mosaic plots, pie charts, and waffle charts.
Discuss your answers with your group for the following exercise problem, which can be found in OpenIntro: Introduction to Modern Statistics, Section 4.8.
Black Lives Matter. A Washington Post-Schar School poll conducted in the United States in June 2020, among a random national sample of 1,006 adults, asked respondents whether they support or oppose protests following George Floyd’s killing that have taken place in cities across the US.
The survey also collected information on the age of the respondents. [Washington Post. 2020. “Washington Post-Schar School national poll, data collected June 2-7, 2020.”] The results are summarized in the stacked bar plot on the left.
Based on the stacked bar plot, do views on the protests and age appear to be associated? Explain your reasoning.
Conjecture other possible variables that might explain the potential association between these two variables.