class: center, middle

# Data Collection

<img src="img/DAW.png" width="500px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 4 | Fall 2020]
</span>

---

## Announcements

* Invited to a Math/Stats/CS Grad School Panel
    + Tuesday, Sept 22nd, 4:30 - 5:45pm
    + More details in the #outside-stats channel in the Slack Workspace

--

* At the end of class, will go through the "MoreWranglingData.Rmd" handout. Have three options:
    + Listen and take notes as I go through the handout
    + Print out PDF and take notes as I go through the handout (posted to Slack #in-class)
    + Run the code with me (grab handout from `/home/courses/math141f20/Handouts`)

---

## Reminders

* Lab 3 due before your lab session this week.
    + Practice visualizing data with `ggplot2` and wrangling data with `dplyr`.

--

* Project Assignment 1 is due on Friday, October 2nd (end of day) on Gradescope.

--

* Come to office hours this week, especially if you haven't stopped by twice yet this semester.

---

## Week 4 Topics

* Finish up a couple more **Data Wrangling** examples

* Data collection

* Modeling

**This week is light on new R material. Make sure to use that time to get caught up on the R work so far.**

---

# Goals for Today

* Look at a few more data wrangling examples.

* Data Collection/Acquisition
    + Have out your data dictionaries that you created with your lab mates.

* Data modeling

---

class: center, middle, inverse

# Now let's go through the More Wrangling Data handout!

---

## Data Collection

Key questions to ask of the data:

* Where did the data come from?
* When were the data collected?
* Why were the data collected?
* How were the data collected?
* Who are the data supposed to represent?
    + Who is present? Who is absent?
    + What evidence is there that the data are representative?
--

**Much of our discussion/practice this week will be around the last two bullets (How and Who).**

For Project Assignment 2, we will write "data biographies" about our project dataset that attempt to answer all of these questions.

---

## Who are the data supposed to represent?

**Population**: Group we want to better understand

--

**Sample**: Subset of the population that we have data on

--

→ Is the sample representative of the population?

--

* EX: How many children (including yourself) are in your immediate family?

--

* We are a sample of households in the US. Is our average number of children a representative estimate for households in the US?

---

## Bias

**Sampling bias**: When the sampled cases are **systematically** different from the non-sampled cases.

--

→ Use random sampling (a random mechanism for selecting cases from the population) to minimize sampling bias.

--

**Nonresponse bias**: The responses are **systematically** different from the non-responses.

--

→ Use multiple modes and multiple attempts for reaching sampled cases.

--

→ Explore key demographic variables to see how respondents and non-respondents vary.

---

## How were the data collected?

The answer to the question greatly impacts the conclusions we can draw about the population.

→ EX: If we don't use random sampling, the sample might not be representative of the population and therefore we might only be able to draw conclusions about the sample itself.

--

### Types of random sampling

* Simple random sampling
* Stratified random sampling
* Cluster sampling

Why aren't all samples generated using simple random sampling?

---

## National Health and Nutrition Examination Survey (NHANES)

Why are these data collected?

--

→ To assess the health of people in the US.

--

How are these data collected?

--

→ Stage 1: US is stratified by geography and distribution of minority populations. Counties are randomly selected within each stratum.
--

→ Stage 2: From the sampled counties, city blocks are randomly selected. (City blocks are clusters.)

--

→ Stage 3: From sampled city blocks, households are randomly selected. (Households are clusters.)

--

→ Stage 4: From sampled households, people are randomly selected. For the sampled households, a mobile health vehicle goes to the house and medical professionals take the necessary measurements.

--

**Why not use simple random sampling?**

---

## How were the data collected?

The answer to the question greatly impacts the conclusions we can draw about the population.

### Other key random mechanism:

**Random assignment**: Cases are randomly assigned to categories of the **explanatory variable**

* **Response variable**: Variable I want to better understand

* **Explanatory variables**: Variables I think might explain the response variable

--

→ If the data were collected using random assignment, then I can determine if the explanatory variable **causes** changes in the response variable.

---

## Causal Inference

Often want to conclude that an explanatory variable causes changes in a response variable but you did not randomly assign the explanatory variable.

--

**Confounding variable**: When the explanatory variable and response variable vary, so does the confounder.

→ Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response.

<img src="img/confound.png" width="70%" style="display: block; margin: auto;" />

---

## Causal Inference

Often want to conclude that an explanatory variable causes changes in a response variable but you did not randomly assign the explanatory variable.

**Confounding variable**: When the explanatory variable and response variable vary, so does the confounder.

→ Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response.
<img src="img/confound2.png" width="70%" style="display: block; margin: auto;" />

---

## Causal Inference

* **Spurious relationship**: Two variables are associated but not causally related
    + In the age of big data, lots of good examples [out there](https://tylervigen.com/spurious-correlations).

--

> "Correlation does not imply causation."

--

> "Correlation does not imply not causation."

--

* **Causal inference**: Methods for finding causal relationships even when the data were collected without random assignment

---

## Types of Studies

* **Observational Study:** Collect data in a way that doesn't interfere

--

* **Experiment:** Interested in causal relationships so utilize random assignment. Other key features include:
    + Blinding
    + Control group
    + Placebo

---

## Thoughts on Data Collection

* Two key forms of **randomness** in data collection:
    + Random Sampling
    + Random Assignment

--

* Most studies have one or neither of these forms of randomness.
    + But still want to draw conclusions about the population.
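---

## Sketch: Random Sampling and Random Assignment in R

The two forms of randomness above can be sketched with `dplyr`. This is a minimal illustration, not how any real survey is run: the `population` data frame, its column names (`household`, `region`, `block`, `children`), the sample sizes, and the seed are all made up for this example.

```r
library(dplyr)

set.seed(141)  # arbitrary seed, fixed only so the sketch is reproducible

# Hypothetical population: 1,000 households, 4 regions, 40 city blocks
population <- data.frame(
  household = 1:1000,
  region    = rep(c("Northeast", "Midwest", "South", "West"), each = 250),
  block     = rep(1:40, each = 25),
  children  = rpois(1000, lambda = 2)  # simulated counts, not real data
)

# Simple random sampling: every household is equally likely to be selected
srs <- population %>%
  slice_sample(n = 40)

# Stratified random sampling: draw a random sample within each region (stratum)
stratified <- population %>%
  group_by(region) %>%
  slice_sample(n = 10) %>%
  ungroup()

# Cluster sampling: randomly select whole city blocks,
# then keep every household in the selected blocks
sampled_blocks <- sample(unique(population$block), size = 2)
cluster <- population %>%
  filter(block %in% sampled_blocks)

# Random assignment: shuffle the labels so chance alone decides
# which sampled households land in each group
assigned <- srs %>%
  mutate(group = sample(rep(c("treatment", "control"), each = 20)))
```

Running `count(stratified, region)` should show 10 households per region, while `count(srs, region)` will usually be uneven. Guaranteeing coverage of each stratum is one reason surveys do not always rely on simple random sampling.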