This document contains information on the potential datasets you can use for your Math 141 project. Due to the size of the class, you must pick from this list of datasets.

Disclaimer: We have not cleaned the datasets, so they likely contain errors. Make sure to spend time inspecting the data, especially your variables of interest.

Mass Mobilization

Pulled from the Mass Mobilization Data Project Dataverse:

“The Mass Mobilization (MM) data are an effort to understand citizen movements against governments, what citizens want when they demonstrate against governments, and how governments respond to citizens. The project codes protests against governments - the data cover 162 countries between 1990 and March 2017. For each protest event, the project records protester demands, government responses, protest location, and protester identities. The Principle Investigators for the Mass Mobilization project are David H. Clark (Binghamton University) and Patrick M. Regan (University of Notre Dame). The Mass Mobilization project is sponsored by the Political Instability Task Force (PITF). The PITF is funded by the Central Intelligence Agency. The views expressed herein are the Principal Investigators’ alone and do not represent the views of the US Government.”

Codebook

Here is the codebook for the Mass Mobilization data.

We have wrangled both the “Protestor Demands” and the “State Responses” so that there is a column (with categories “Yes”, “No”, NA) for each demand or response. Note: Some protests had more than one demand and/or more than one state response.
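
To illustrate the wide layout described above, here is a minimal sketch on a toy data frame. The column names (`protest_id`, `demand_labor`, `demand_land`, `demand_police`) are hypothetical stand-ins, not the names in the actual file, so check the codebook before adapting it.

```r
library(tidyverse)

# Toy data mimicking the wide layout; all column names are hypothetical.
toy_mm <- tibble(
  protest_id    = 1:4,
  demand_labor  = c("Yes", "No", "Yes", NA),
  demand_land   = c("No", "No", "Yes", NA),
  demand_police = c("Yes", "No", "No", NA)
)

# Count the "Yes" entries across the demand columns for each protest.
# na.rm = TRUE means a protest with only NA values gets a count of 0,
# which may or may not be what you want for your analysis.
toy_counts <- toy_mm %>%
  mutate(
    n_demands = rowSums(across(starts_with("demand_"), ~ .x == "Yes"),
                        na.rm = TRUE)
  )

# How many protests have 0, 1, 2, ... recorded demands?
toy_counts %>% count(n_demands)
```

Because a protest can have several demands, the per-demand "Yes" counts can add up to more than the number of protests.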

Accessing and Loading the Data

You can load the data using the code in the following R chunk:

library(tidyverse)
mm <- read_csv("https://reed-statistics.github.io/math141s21/projects/mass_mobilization/mass_mobilization.csv")
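
Since the data have not been cleaned, one possible first step is to tally missing values per column. The sketch below defines a hypothetical helper, `summarize_missing()`, and runs it on a small toy data frame whose columns are made up for illustration.

```r
library(tidyverse)

# Hypothetical helper: number and proportion of missing values per column,
# sorted so the columns with the most missing values come first.
summarize_missing <- function(df) {
  df %>%
    summarize(across(everything(), ~ sum(is.na(.x)))) %>%
    pivot_longer(everything(),
                 names_to = "variable", values_to = "n_missing") %>%
    mutate(prop_missing = n_missing / nrow(df)) %>%
    arrange(desc(n_missing))
}

# Toy data frame (made-up columns) to show what the helper returns.
toy <- tibble(
  year    = c(1990, 1991, NA, 2017),
  country = c("A", "B", "C", NA),
  size    = c(NA, NA, "50", "100+")
)

missing_summary <- summarize_missing(toy)
missing_summary
```

After loading the real data, the same idea would apply as `summarize_missing(mm)`, and `glimpse(mm)` gives a quick look at the column types.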

Terms of Use

Our Community Norms as well as good scientific practices expect that proper credit is given via citation.

CC0 - “Public Domain Dedication”

Citation

Clark, David, and Patrick Regan. “Mass Mobilization Protest Data.” Harvard Dataverse 2 (2016). https://doi.org/10.7910/DVN/HTTWYL.


COVID-19 Behaviors

Pulled from the Imperial College London YouGov Covid 19 Behaviour Tracker Data Hub:

"These data come from the Imperial College London YouGov Covid 19 Behaviour Tracker Data Hub.

YouGov has partnered with the Institute of Global Health Innovation (IGHI) at Imperial College London to gather global insights on people’s behaviours in response to COVID-19. The research will cover 29 countries, interviewing around 21,000 people each week.

It is designed to provide behavioural analysis on how different populations are responding to the pandemic, helping public health bodies in their efforts to limit the impact of the disease. Anonymised respondent level data will be available for all public health and academic institutions globally.

The questions in the survey, led by IGHI, cover data on testing, symptoms, self-isolating in response to symptoms and the ability and willingness to self-isolate if needed. It also looks at behaviours, including going outdoors, working outside the home, contact with others, hand washing and the extent of compliance with 20 common preventative measures.

The datafiles contain responses from nationally representative surveys of the general public about symptoms, testing, self-isolation, social distancing and behaviour.

Contextual data includes: gender, age, region (within country), number of people in the household, children in household, health conditions, working status and the date of the survey response. A weighting variable is also provided, typically based on age, gender and region. For obvious reasons, people with severe symptoms, people who are / have been hospitalised and some other hard to reach groups will be under-represented in the sample.

We have completed a privacy assessment and have taken steps to safeguard the anonymity of the respondents by ensuring that the survey responses and contextual data, when looked at in isolation or as a combined data set, cannot be used to re-identify the respondents. A key part of this has been the exclusion of all data may lead to a greater risk of identification. For example, in the data set ‘age’ is represented by a numeric value rather than a full date of birth, and ‘regions’ are represented areas large enough to protect privacy, but which are still statistically valuable, such as Scotland or West Midlands."

The data we have includes respondents from the US, China, Spain, and Brazil.

Codebook

Here is the codebook for the COVID-19 Behaviors data.

Note: The codebook contains more variables than we have in our dataset because several questions were not asked in all countries.

Accessing and Loading the Data

There are two data files, depending on whether you want to focus on the US only or on multiple countries:

  • us_covid.csv: 15,031 US respondents over 60 variables
  • covid.csv: 51,400 respondents from US, Spain, China, and Brazil over 60 variables

You can load the data using the code in the following R chunk:

library(tidyverse)
covid <- read_csv("https://reed-statistics.github.io/math141s21/projects/covid-19-behaviors/covid.csv")
us_covid <- read_csv("https://reed-statistics.github.io/math141s21/projects/covid-19-behaviors/us_covid.csv")
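
The data description mentions a weighting variable. As a rough sketch of how weights can change an estimate, the toy example below compares unweighted and weighted proportions by country. The column names (`country`, `wore_mask`, `weight`) are hypothetical, so look up the actual names in the codebook before adapting this.

```r
library(tidyverse)

# Toy data; column names and values are made up for illustration.
toy_covid <- tibble(
  country   = c("US", "US", "US", "Spain", "Spain"),
  wore_mask = c("Yes", "No", "Yes", "Yes", "No"),
  weight    = c(1.5, 0.5, 1.0, 2.0, 2.0)
)

# Proportion answering "Yes" in each country, with and without weights.
mask_by_country <- toy_covid %>%
  group_by(country) %>%
  summarize(
    n               = n(),
    prop_unweighted = mean(wore_mask == "Yes"),
    prop_weighted   = weighted.mean(wore_mask == "Yes", w = weight)
  )

mask_by_country
```

Whether to use the weights depends on your research question. This sketch only shows that the two proportions can differ.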

Additional information about the data collecting organizations

Institute of Global Health Innovation Imperial College London – Big Data Analytics Unit

imperial.ac.uk/centre-for-health-policy/our-work/e-health-and-analytics/big-data-and-analytical-unit-bdau/

Imperial College London – ICL-YouGov Survey Development

ICL-YouGov Global Survey development of measures is led by Sarah P. Jones of Imperial College London’s Institute of Global Health Innovation (ORCID). Survey questions have been contributed and adapted from collaborative sources; please see coviddatahub.com for a list of contributors.

Terms of Use

This data repository is copyright 2020 YouGov Plc, all rights reserved, and is provided to the public strictly for educational and academic research purposes.

Citation

Jones, Sarah P., Imperial College London Big Data Analytical Unit and YouGov Plc. 2020, Imperial College London YouGov Covid Data Hub, v1.0, YouGov Plc, April 2020

Global Party Survey

Pulled from the Global Party Survey:

“The Global Party Survey, 2019 (GPS) is an international expert survey directed by Pippa Norris (Harvard University). Drawing on 1,861 party and election experts, the Global Party Survey, 2019 estimates key ideological values, issue positions, and populist rhetoric for 1,043 parties in 163 countries. The research project is designed to replicate the tried and tested methods of expert surveys, while simultaneously innovating and broadening the research agenda in several important ways.

  • By expanding the geographic scope of coverage, including parties and countries in all inhabited continents, it allows users to move beyond the traditional focus on Europe.
  • By incorporating continuous scaled measures of populist rhetoric, as well as ideological values, analysts can compare the degree to which all parties commonly adopt this discourse, not simply confining analysis to those designated a priori in binary categories as ‘populist’ parties.
  • By including party codes used in many other related cross-national studies, the dataset facilitates easy merger for multilevel analysis, such as by comparing party positions with their institutional characteristics or with the attitudes of their voters.
  • At the same time, however, sufficient continuity is preserved with prior research measuring party positions to facilitate comparison with these established datasets.

Several robustness and validity tests increase confidence in the external validity of the new study. More: www.GlobalPartySurvey.org @PippaN15 (2020-2-10)”

Codebook

Here is the codebook for the Global Party Survey data.

Accessing and Loading the Data

library(tidyverse)
df_parties <- read_csv("https://reed-statistics.github.io/math141s21/projects/global-party-survey/global-party-survey.csv")

Terms of Use

Our Community Norms as well as good scientific practices expect that proper credit is given via citation.

CC0 - “Public Domain Dedication”

Citation

Norris, P. (2020). Global Party Survey, 2019. Harvard Dataverse. https://doi.org/10.7910/DVN/WMGTNS

Greenhouse Gas Emissions Data

Pulled from the EPA Greenhouse Gas Reporting Program:

“The GHGRP requires reporting of greenhouse gas (GHG) data and other relevant information from large GHG emission sources, fuel and industrial gas suppliers, and CO2 injection sites in the United States. Approximately 8,000 facilities are required to report their emissions annually, and the reported data are made available to the public in October of each year.”

Pulled from the EPA Greenhouse Gas Reporting Program Background:

“As directed by Congress, EPA’s Greenhouse Gas Reporting Program (GHGRP) collects annual greenhouse gas information from the top emitting sectors of the U.S. economy… The GHGRP is the only dataset containing facility-level greenhouse gas (GHG) emissions data from large industrial sources across the United States. With seven consecutive years of reporting for most sectors, GHGRP data are providing important new information on industrial emissions—showing variation in emissions across facilities within an industry, variation in industrial emissions across geographic areas, and changes in emissions over time at the sector and facility level. EPA is using this facility-level data to improve estimates of national greenhouse gas emissions in the U.S. Greenhouse Gas Inventory.”

Codebook

The file “ghgp_data_by_year.xlsx”, which is part of the 2018 Data Summary Spreadsheets, contains a tab called “FAQs about this Data”, with definitions for many of the variables contained in the dataset.

The data we provide is the “Direct Emitters” tab from “ghgp_data_by_year.xlsx”. We have simplified some of the variables and variable names. The variable “Latest Reported Industry Type (sectors)” has been renamed “Industry.Sector” and summarized so that there is a column (with values 0 or 1) for each sector. Note: Some facilities had more than one assigned sector.

Accessing and Loading the Data

library(tidyverse)
df_epa <- read_csv("https://reed-statistics.github.io/math141s21/projects/epa-emissions/ghgrp_data.csv")
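
Because each sector is stored as its own 0/1 column and a facility can belong to more than one sector, sector totals can exceed the number of facilities. The sketch below shows this on a toy data frame. The column names (`facility`, `sector_power_plants`, `sector_chemicals`, `sector_waste`) are hypothetical and will differ from those in the actual file.

```r
library(tidyverse)

# Toy data mimicking the 0/1 sector layout; column names are hypothetical.
toy_epa <- tibble(
  facility            = c("F1", "F2", "F3"),
  sector_power_plants = c(1, 0, 1),
  sector_chemicals    = c(0, 1, 1),
  sector_waste        = c(0, 0, 1)
)

# Number of facilities flagged in each sector.
sector_totals <- toy_epa %>%
  summarize(across(starts_with("sector_"), sum)) %>%
  pivot_longer(everything(),
               names_to = "sector", values_to = "n_facilities")

# Facilities assigned to more than one sector.
multi_sector <- toy_epa %>%
  mutate(n_sectors = rowSums(across(starts_with("sector_")))) %>%
  filter(n_sectors > 1)

sector_totals
multi_sector
```

In this toy example the sector totals sum to 5 even though there are only 3 facilities, so treating the sector columns as mutually exclusive categories would double count.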

Citation

U.S. EPA (2019). 2018 Data Summary Spreadsheets. Available at: https://www.epa.gov/ghgreporting/ghg-reporting-program-data-sets