2 - Exploring Categorical Data

Alex John Quijano

09/08/2021

Previously on Data…


Source: [Figure 1.1 of OpenIntro: Introduction to Modern Statistics](https://openintro-ims.netlify.app/data-hello.html#variable-types)

Visualizing Categorical Data


Exploring categorical data using summary statistics and visualizations.

  1. Contingency Tables

  2. Bar Plots

  3. Mosaic Plots

  4. Pie Charts

  5. Waffle Charts

Contingency Tables - Counts

A contingency table summarizes data for two categorical variables.

The tables and visualizations in the following slides use the loans_full_schema data set.

Example:

A contingency table for application type (rows) and homeownership (columns).

| application_type | rent | mortgage | own | Total |
|---|---|---|---|---|
| joint | 362 | 950 | 183 | 1495 |
| individual | 3496 | 3839 | 1170 | 8505 |
| Total | 3858 | 4789 | 1353 | 10000 |
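The Total row and column are simply sums over the cells. A minimal Python sketch, assuming the cell counts from the table are typed in by hand as a nested dictionary (the variable names are illustrative and are not part of the loans_full_schema data set):

```python
# Cell counts copied from the contingency table:
# application_type -> homeownership -> number of loans.
counts = {
    "joint":      {"rent": 362,  "mortgage": 950,  "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}
levels = ["rent", "mortgage", "own"]

# Row totals: sum across homeownership levels for each application type.
row_totals = {app: sum(row.values()) for app, row in counts.items()}

# Column totals: sum across application types for each homeownership level.
col_totals = {lvl: sum(row[lvl] for row in counts.values()) for lvl in levels}

# Grand total: every loan is counted exactly once.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The row totals and the column totals must both add up to the same grand total, which is a quick consistency check on any contingency table.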


Contingency Tables - Row Proportions

Example:

A contingency table with row proportions for the application type (rows) and homeownership (columns) variables.

| application_type | rent | mortgage | own | Total |
|---|---|---|---|---|
| joint | 0.242 | 0.635 | 0.122 | 1 |
| individual | 0.411 | 0.451 | 0.138 | 1 |
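Row proportions divide each count by its row total, so they answer questions such as "of the joint applications, what fraction come from renters?". A small sketch, assuming the counts from the counts table are entered by hand (names are illustrative):

```python
# Cell counts: application_type -> homeownership -> number of loans.
counts = {
    "joint":      {"rent": 362,  "mortgage": 950,  "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}

# Divide every cell by the total of the row it belongs to.
row_props = {
    app: {lvl: round(n / sum(row.values()), 3) for lvl, n in row.items()}
    for app, row in counts.items()
}

for app, props in row_props.items():
    print(app, props)
```

Because of rounding to three decimals, the displayed proportions in a row may add up to 0.999 or 1.001 rather than exactly 1.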


Contingency Tables - Column Proportions

Example:

A contingency table with column proportions for the application type (rows) and homeownership (columns) variables.

| application_type | rent | mortgage | own |
|---|---|---|---|
| joint | 0.094 | 0.198 | 0.135 |
| individual | 0.906 | 0.802 | 0.865 |
| Total | 1.000 | 1.000 | 1.000 |
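Column proportions divide each count by its column total instead, so they answer the reverse question: "of the renters, what fraction applied jointly?". A small sketch under the same assumption that the counts are entered by hand (names are illustrative):

```python
# Cell counts: application_type -> homeownership -> number of loans.
counts = {
    "joint":      {"rent": 362,  "mortgage": 950,  "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}
levels = ["rent", "mortgage", "own"]

# Total number of loans in each homeownership column.
col_totals = {lvl: sum(row[lvl] for row in counts.values()) for lvl in levels}

# Divide every cell by the total of the column it belongs to.
col_props = {
    app: {lvl: round(row[lvl] / col_totals[lvl], 3) for lvl in levels}
    for app, row in counts.items()
}

for app, props in col_props.items():
    print(app, props)
```

Row and column proportions generally differ: 24.2% of joint applications come from renters, but only 9.4% of renters applied jointly.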


Bar Plots


A bar plot displays categorical data as rectangular bars whose heights or lengths correspond to the values they represent.


Types of bar plots: stacked, dodged, and standardized.

Bar Plots

Example:

Three bar plots (stacked, dodged, and standardized) displaying homeownership and application type variables.
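The three variants draw the same counts and differ only in how those counts become bar heights. A rough sketch of the numbers behind each variant, assuming homeownership is on the horizontal axis and application type is the fill variable (the slide text does not state which is which), with counts entered by hand:

```python
# Cell counts: application_type -> homeownership -> number of loans.
counts = {
    "joint":      {"rent": 362,  "mortgage": 950,  "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}
levels = ["rent", "mortgage", "own"]

stacked = {}       # one bar per level; segments stack up to the level total
dodged = {}        # one bar per (level, application type) pair, side by side
standardized = {}  # segments rescaled so that every bar has height 1

for lvl in levels:
    segments = {app: counts[app][lvl] for app in counts}
    total = sum(segments.values())
    stacked[lvl] = total
    dodged[lvl] = segments
    standardized[lvl] = {app: n / total for app, n in segments.items()}

print(stacked)
print(dodged)
print(standardized)
```

The standardized heights are the column proportions of the contingency table, which makes that variant useful for comparing group compositions but hides how large each group is.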


Mosaic Plots


A mosaic plot is a visualization technique suitable for contingency tables. It resembles a standardized stacked bar plot, with the added benefit that the relative group sizes of the primary variable remain visible.
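One way to see this is to compute the tile geometry by hand. The sketch below assumes homeownership is the primary variable (column widths) and application type splits each column (tile heights), on a unit square; the counts are entered by hand and the names are illustrative:

```python
# Cell counts: application_type -> homeownership -> number of loans.
counts = {
    "joint":      {"rent": 362,  "mortgage": 950,  "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}
levels = ["rent", "mortgage", "own"]
grand_total = sum(sum(row.values()) for row in counts.values())

tiles = []  # (level, application type, x, y, width, height) on a unit square
x = 0.0
for lvl in levels:
    col_total = sum(counts[app][lvl] for app in counts)
    width = col_total / grand_total          # relative size of this group
    y = 0.0
    for app in counts:
        height = counts[app][lvl] / col_total  # column proportion
        tiles.append((lvl, app, x, y, width, height))
        y += height
    x += width

for tile in tiles:
    print(tile)
```

Each tile's area (width times height) equals its cell count divided by the grand total, so the whole table fills the unit square exactly.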

Mosaic Plots

Example:

Two mosaic plots: one for homeownership alone and the other displaying the relationship between homeownership and application type.


Pie Charts


A pie chart represents the same information as a bar plot.
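Both charts start from the same counts: a bar plot maps each count to a bar height, while a pie chart maps it to a slice angle. A short sketch of that conversion, assuming the homeownership totals from the contingency table are entered by hand:

```python
# Homeownership totals from the contingency table.
homeownership = {"rent": 3858, "mortgage": 4789, "own": 1353}
total = sum(homeownership.values())

# Bar plot: the bar height is the count itself.
bar_heights = dict(homeownership)

# Pie chart: each slice gets its share of the full 360 degrees.
slice_angles = {lvl: 360 * n / total for lvl, n in homeownership.items()}

for lvl in homeownership:
    print(f"{lvl}: height {bar_heights[lvl]}, angle {slice_angles[lvl]:.1f} degrees")
```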

Pie Charts

Example 1:

A pie chart and bar plot of homeownership.


Pie Charts

Example 2:

A pie chart and bar plot of loan grades.


Waffle Charts


A waffle chart can be used to communicate the proportion of the data that falls into each level of a categorical variable.
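As a rough illustration, the sketch below builds a text-only waffle chart of homeownership on a 10 by 10 grid. The grid size, the letter symbols, and the largest-remainder rounding rule are all assumptions made for this example; the slides do not specify how squares are assigned.

```python
# Homeownership totals from the contingency table.
homeownership = {"rent": 3858, "mortgage": 4789, "own": 1353}
symbols = {"rent": "R", "mortgage": "M", "own": "O"}  # hypothetical legend
n_squares = 100
total = sum(homeownership.values())

# Exact (fractional) number of squares each level deserves.
exact = {lvl: n_squares * n / total for lvl, n in homeownership.items()}

# Round down first, then hand the leftover squares to the levels
# with the largest fractional parts so the grid is filled exactly.
squares = {lvl: int(v) for lvl, v in exact.items()}
leftover = n_squares - sum(squares.values())
by_remainder = sorted(exact, key=lambda k: exact[k] - squares[k], reverse=True)
for lvl in by_remainder[:leftover]:
    squares[lvl] += 1

# Lay the squares out row by row, 10 per row.
cells = "".join(symbols[lvl] * squares[lvl] for lvl in homeownership)
rows = [cells[i:i + 10] for i in range(0, n_squares, 10)]
print("\n".join(rows))
print(squares)
```

Each square stands for 1% of the loans, so counting squares of one symbol gives that level's approximate percentage.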

Waffle Charts

Example:

Plot A: Waffle chart of homeownership, with levels rent, mortgage, and own. Plot B: Waffle chart of loan status, with levels current, fully paid, in grace period, and late.

Summary


In this lecture, we talked about the following: contingency tables, bar plots, mosaic plots, pie charts, and waffle charts.

Today’s Activity

Discuss your answers to the following exercise with your group. The exercise can be found in OpenIntro: Introduction to Modern Statistics, Section 4.8.


Black Lives Matter. A Washington Post-Schar School poll conducted in the United States in June 2020, among a random national sample of 1,006 adults, asked respondents whether they support or oppose protests following George Floyd’s killing that have taken place in cities across the US.

The survey also collected information on the age of the respondents. [Washington Post. 2020. “Washington Post-Schar School national poll, data collected June 2-7, 2020.”] The results are summarized in the stacked bar plot on the left.

  1. Based on the stacked bar plot, do views on the protests and age appear to be associated? Explain your reasoning.

  2. Conjecture other possible variables that might explain the potential association between these two variables.