class: center, middle

# Data Collection

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 4 | Fall 2020] </span>

---

## Announcements

* Invited to a Math/Stats/CS Grad School Panel
    + Tuesday, Sept 22nd, 4:30 - 5:45pm
    + More details in the #outside-stats channel in the Slack Workspace

* At the end of class, we will go through the "MoreWranglingData.Rmd" handout.  You have three options:
    + Listen and take notes as I go through the handout
    + Print out the PDF (posted to Slack #in-class) and take notes as I go through the handout
    + Run the code with me (grab handout from `/home/courses/math141f20/Handouts`)

---

## Reminders

* Lab 3 due before your lab session this week.
    + Practice visualizing data with `ggplot2` and wrangling data with `dplyr`.

* Project Assignment 1 is due on Friday October 2nd (end of day) on Gradescope.

* Come to office hours this week, especially if you haven't stopped by twice yet this semester.

---

## Week 4 Topics

* Finish up a couple more **Data Wrangling** examples

* Data collection

* Modeling

**This week is light on new R material.  Make sure to use that time to get caught up on the R work so far.**

---

# Goals for Today

* Look at a few more data wrangling examples.

* Data Collection/Acquisition
    + Have out your data dictionaries that you created with your lab mates.

* Data modeling

---

class: center, middle, inverse

# Now let's go through the More Wrangling Data handout!

---

## Data Collection

Key questions to ask of the data:

* Where did the data come from?

* When were the data collected?

* Why were the data collected?

* How were the data collected?

* Who are the data supposed to represent?
    + Who is present?  Who is absent?
    + What evidence is there that the data are representative?

**Much of our discussion/practice this week will be around the last two bullets (How and Who).**

For Project Assignment 2, we will write "data biographies" about our project dataset that attempt to answer all of these questions.

---

## Who are the data supposed to represent?

**Population**: Group we want to better understand

**Sample**: Subset of the population that we have data on

&rarr; Is the sample representative of the population?

* EX: How many children (including yourself) are in your immediate family?

* We are a sample of households in the US.  Is our average number of children a representative estimate for households in the US?
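
One reason a class survey might not be representative can be sketched with a small simulation. The population below is entirely hypothetical (the household probabilities are made up, not census figures), and the sketch assumes households are reached through their children, so larger families are more likely to appear.

```r
set.seed(141)

# Hypothetical population: 10,000 households, each with 0 to 4 children.
# The probabilities are made up for illustration.
n_children <- sample(0:4, size = 10000, replace = TRUE,
                     prob = c(0.40, 0.20, 0.25, 0.10, 0.05))

# Population average: children per household
pop_mean <- mean(n_children)

# A class survey reaches households through their children, so a household
# with k children is k times as likely to show up (childless ones never do).
class_sample <- sample(n_children, size = 50, replace = TRUE, prob = n_children)
class_mean <- mean(class_sample)

c(population = pop_mean, class = class_mean)
```

Under these assumptions the class average tends to overestimate the population average, because every sampled household has at least one child.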

---

## Bias

**Sampling bias**:  When the sampled cases are **systematically** different from the non-sampled cases.

&rarr; Use random sampling (a random mechanism for selecting cases from the population) to minimize sampling bias.

**Nonresponse bias**: When the respondents are **systematically** different from the non-respondents.

&rarr; Use multiple modes and multiple attempts for reaching sampled cases.

&rarr; Explore key demographic variables to see how respondents and non-respondents vary.
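
A minimal sketch of that last check, using simulated data. The variable names and the assumption that older people respond more often are hypothetical and only there to make the comparison visible.

```r
set.seed(141)

# Hypothetical sampled cases: age is known for everyone from the sampling frame.
n <- 2000
age <- round(runif(n, min = 18, max = 90))

# Assumption for illustration: older people are more likely to respond.
p_respond <- 0.2 + 0.6 * (age - 18) / 72
responded <- rbinom(n, size = 1, prob = p_respond) == 1

# Compare a key demographic variable across respondents and non-respondents.
mean_age <- tapply(age, responded, mean)
mean_age
```

A large gap between the two group means would be a warning sign that the respondents differ systematically from the non-respondents.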

---

## How were the data collected?

The answer to the question greatly impacts the conclusions we can draw about the population.

&rarr; EX: If we don't use random sampling, the sample might not be representative of the population and therefore we might only be able to draw conclusions about the sample itself.

### Types of random sampling

* Simple random sampling

* Stratified random sampling

* Cluster sampling
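
The three designs above can be sketched in base R. The population here (students, dorms, class years) is hypothetical and the sample sizes are arbitrary.

```r
set.seed(141)

# Hypothetical population: 1,000 students across 20 dorms and 4 class years.
pop <- data.frame(
  id   = 1:1000,
  dorm = rep(1:20, each = 50),
  year = sample(c("first", "second", "third", "fourth"), 1000, replace = TRUE)
)

# Simple random sample: every set of 100 students is equally likely.
srs <- pop[sample(nrow(pop), size = 100), ]

# Stratified random sample: a simple random sample of 25 within each year.
by_year <- split(pop, pop$year)
strat <- do.call(rbind, lapply(by_year, function(d) d[sample(nrow(d), size = 25), ]))

# Cluster sample: randomly pick 2 dorms and keep every student in them.
chosen_dorms <- sample(unique(pop$dorm), size = 2)
clust <- pop[pop$dorm %in% chosen_dorms, ]
```

All three samples contain 100 students, but only the stratified sample guarantees equal representation of each year, and only the cluster sample confines data collection to two dorms.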

Why aren't all samples generated using simple random sampling?

---

## National Health and Nutrition Examination Survey (NHANES)

Why are these data collected?

&rarr; To assess the health of people in the US.

How are these data collected?

&rarr; Stage 1: US is stratified by geography and distribution of minority populations.  Counties are randomly selected within each stratum.

&rarr; Stage 2: From the sampled counties, city blocks are randomly selected. (City blocks are clusters.)

&rarr; Stage 3: From sampled city blocks, households are randomly selected. (Households are clusters.)

&rarr; Stage 4: From sampled households, people are randomly selected.  For the sampled households, a mobile health vehicle goes to the house and medical professionals take the necessary measurements.
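
A rough sketch of the first three stages on a made-up sampling frame. The counts of strata, counties, blocks, and households are hypothetical and far smaller than anything NHANES uses, and Stage 4 (selecting people) is left out to keep the example short.

```r
set.seed(141)

# Hypothetical frame: 4 strata x 10 counties x 20 blocks x 15 households.
frame <- expand.grid(household = 1:15, block = 1:20, county = 1:10, stratum = 1:4)

# Stage 1: within each stratum, randomly select 2 counties.
stage1 <- do.call(rbind, lapply(split(frame, frame$stratum), function(s) {
  s[s$county %in% sample(unique(s$county), 2), ]
}))

# Stage 2: within each sampled county, randomly select 3 blocks.
county_id <- interaction(stage1$stratum, stage1$county, drop = TRUE)
stage2 <- do.call(rbind, lapply(split(stage1, county_id), function(cty) {
  cty[cty$block %in% sample(unique(cty$block), 3), ]
}))

# Stage 3: within each sampled block, randomly select 5 households.
block_id <- interaction(stage2$stratum, stage2$county, stage2$block, drop = TRUE)
stage3 <- do.call(rbind, lapply(split(stage2, block_id), function(b) {
  b[sample(nrow(b), 5), ]
}))

nrow(stage3)  # 4 strata * 2 counties * 3 blocks * 5 households
```

The sampled households end up concentrated in a handful of counties and blocks, which is what makes sending a vehicle to each one practical.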

**Why not use simple random sampling?**

---

## How were the data collected?

The answer to the question greatly impacts the conclusions we can draw about the population.

### Other key random mechanism:

**Random assignment**: Cases are randomly assigned to categories of the **explanatory variable**

* **Response variable**: Variable I want to better understand

* **Explanatory variables**: Variables I think might explain the response variable

&rarr; If the data were collected using random assignment, then I can determine if the explanatory variable **causes** changes in the response variable.
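
A minimal sketch of random assignment in base R. The study, the group labels, and the group sizes are hypothetical.

```r
set.seed(141)

# Hypothetical study: 40 participants, half assigned to each group.
participants <- data.frame(id = 1:40)

# Random assignment: shuffle the group labels so chance alone decides
# which participant lands in which category of the explanatory variable.
participants$group <- sample(rep(c("treatment", "control"), each = 20))

table(participants$group)
```

Because chance alone decides the groups, other characteristics of the participants should be balanced across the groups on average.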

---

## Causal Inference

Often want to conclude that an explanatory variable causes changes in a response variable but you did not randomly assign the explanatory variable.

**Confounding variable**: A third variable that is associated with both the explanatory variable and the response variable, so when those two vary, so does the confounder.

&rarr; Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response.
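
A small simulation can make this concrete. Everything below is hypothetical: by construction, the confounder drives both variables and the explanatory variable has no effect on the response. The `lm()` calls are only one way to look at this.

```r
set.seed(141)
n <- 5000

# Hypothetical confounder that drives both variables.
confounder  <- rnorm(n)
explanatory <- confounder + rnorm(n)  # does NOT depend on the response
response    <- confounder + rnorm(n)  # does NOT depend on the explanatory variable

# The two variables are clearly associated...
r_xy <- cor(explanatory, response)

# ...but the slope on the explanatory variable shrinks toward 0
# once the confounder is included in the model.
slope_alone    <- coef(lm(response ~ explanatory))[["explanatory"]]
slope_adjusted <- coef(lm(response ~ explanatory + confounder))[["explanatory"]]

c(correlation = r_xy, slope_alone = slope_alone, slope_adjusted = slope_adjusted)
```

In real data the confounder is often unmeasured, so this kind of adjustment is not always available.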

---

## Causal Inference

* **Spurious relationship**: Two variables are associated but not causally related
    + In the age of big data, lots of good examples [out there](https://tylervigen.com/spurious-correlations).
    
--

> "Correlation does not imply causation."

>  "Correlation does not imply not causation."

* **Causal inference**: Methods for finding causal relationships even when the data were collected without random assignment

---

## Types of Studies

* **Observational Study:** Collect data in a way that doesn't interfere with how the data arise

* **Experiment:** Interested in causal relationships so utilize random assignment.  Other key features include:
    + Blinding
    + Control group
    + Placebo

---

## Thoughts on Data Collection

* Two key forms of **randomness** in data collection:
    + Random Sampling
    + Random Assignment

* Most studies have one or neither of these forms of randomness. 
    + But still want to draw conclusions about the population.