1 - Sources of Data

Alex John Quijano

09/03/2021

Previously on Data…


Source: [Fig. 1.1 of OpenIntro: Introduction to Modern Statistics](https://openintro-ims.netlify.app/data-hello.html#variable-types)


Previously on Association vs Independence…



Associated variables are related somehow due to some underlying phenomenon.

Two variables are independent when they are not associated.


Two variables CANNOT be both associated and independent.


Note that association DOES NOT imply causation.

Study Design


In order to do statistical analysis using data, we need to understand the following:

Research Questions

Consider the following research questions:

  1. What is the average cyanide content of unprocessed cassava?

  2. Does this drug reduce the number of deaths of hospitalized COVID-19 patients?

  3. Is the vaccine safe and effective against COVID-19?

  4. Do these new vitamin supplements improve people’s health?

An explanatory variable is the likely cause that explains the response variable. A response variable is the expected outcome, and it responds to the explanatory variable.

Examples:

  1. The drug is the explanatory variable while the number of deaths is the response.

  2. The supplements are the explanatory variable while people’s health is the response.

Anecdotal Evidence

Examples:

  1. My neighbor ate unprocessed cassava and he was just fine.

  2. The news says that two people took this drug while in the hospital and they recovered, so it must have worked.

  3. A social media post says that a friend died after getting the vaccine, so it must be dangerous.

  4. A close friend took these vitamins for 30 years, and he says he feels great and has not had the flu in a year.

Note: We need to be careful about accepting data like this too quickly. These anecdotes may be true and may even be verifiable, but they may not be a good representation of the entire population of interest.

What can Statistics do to help us avoid making hasty generalizations?

Populations and Samples

Sampling Methods - Simple random sampling

Sampling Methods - Stratified sampling

Simple random sampling (top) and stratified sampling (bottom)


Sampling Methods - Clustered sampling
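
The three sampling methods named in these slides can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the population of 100 students, the `year` field used as the stratum, the `dorm` field used as the cluster, and the function names. The cluster version keeps every unit in the chosen clusters, which is one common definition of cluster sampling.

```python
import random
from collections import defaultdict

# Hypothetical population: 100 students, each tagged with a class year
# (used as the stratum) and a dorm (used as the cluster). All made up.
rng = random.Random(42)
years = ["first-year", "sophomore", "junior", "senior"]
population = [
    {"id": i, "year": years[i % 4], "dorm": f"dorm-{i % 10}"}
    for i in range(100)
]

def simple_random_sample(units, n, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)

def stratified_sample(units, key, n_per_stratum, rng):
    """Split units into strata, then draw a simple random sample within each."""
    strata = defaultdict(list)
    for u in units:
        strata[u[key]].append(u)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(units, key, n_clusters, rng):
    """Randomly pick whole clusters, then keep every unit in those clusters."""
    clusters = defaultdict(list)
    for u in units:
        clusters[u[key]].append(u)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for c in chosen for u in clusters[c]]

srs = simple_random_sample(population, 20, rng)
strat = stratified_sample(population, "year", 5, rng)
clust = cluster_sample(population, "dorm", 2, rng)

print(len(srs), len(strat), len(clust))
```

Each call returns 20 students, but the guarantees differ: the stratified sample always contains exactly 5 students from each class year, while the cluster sample contains all students from just 2 dorms.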

More Statistical Terms


We use specific terms to distinguish a number calculated from a sample of data (a statistic) from a number calculated, or considered for calculation, on the entire population (a parameter).

The terms statistic and parameter are useful for communicating claims and models.
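
As a small sketch of the distinction, the Python snippet below computes a mean twice: once on a whole population (a parameter) and once on a sample drawn from it (a statistic). The population is simulated, and its size, mean, and spread are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 10,000 measurements (made-up values).
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Parameter: calculated on the entire population.
mu = statistics.mean(population)

# Statistic: calculated on one sample drawn from that population.
sample = rng.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

In practice we rarely have access to the whole population, so the parameter is usually unknown and the statistic serves as our estimate of it.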

Sampling Downfalls


  1. Cherry-picked sampling: A pick-and-choose method in which the researcher decides which samples to take based on some interest.

  2. Voluntary surveys: Samples are taken on a voluntary basis. This may introduce non-response bias, which can skew the results.

  3. Convenience sample: A sample in which units that are easily accessible are more likely to be included. It is often difficult to discern what population this type of sample represents, because it may leave out those who cannot be easily sampled.
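
To see how a convenience sample can mislead, here is a hedged Python sketch. The scenario is entirely hypothetical: a population in which people who live near campus (and are easy to survey) have short commutes, while everyone else has long commutes. The field names and all numbers are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: 300 people live near campus (short commutes),
# 700 live farther away (long commutes). All numbers are made up.
population = (
    [{"near_campus": True, "commute_min": rng.uniform(5, 15)} for _ in range(300)]
    + [{"near_campus": False, "commute_min": rng.uniform(30, 60)} for _ in range(700)]
)
true_mean = statistics.mean(p["commute_min"] for p in population)

# Convenience sample: survey only people who are easy to reach.
easy_to_reach = [p for p in population if p["near_campus"]]
convenience = rng.sample(easy_to_reach, 50)
convenience_mean = statistics.mean(p["commute_min"] for p in convenience)

# Simple random sample of the same size from the whole population.
srs = rng.sample(population, 50)
srs_mean = statistics.mean(p["commute_min"] for p in srs)

print(f"population mean:         {true_mean:.1f} min")
print(f"convenience sample mean: {convenience_mean:.1f} min")
print(f"simple random sample:    {srs_mean:.1f} min")
```

Because the convenience sample can only ever include short commutes, its mean falls below the population mean no matter how many people are surveyed. A larger convenience sample does not fix the problem.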

Experimental Study


An experimental study is a type of study where we randomly assign a treatment to a group so that we can draw a causal relationship between the explanatory and response variables. We divide the people or things under study into groups, apply a treatment to one group, and give the other group (the control) either no treatment or a placebo.
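
Random assignment can be sketched in a few lines of Python. This is a minimal illustration under assumed conditions: 20 hypothetical participant IDs split evenly into two groups. Real experiments often use more elaborate designs.

```python
import random

rng = random.Random(7)

# Hypothetical participant IDs, made up for illustration.
participants = [f"participant-{i}" for i in range(1, 21)]

# Shuffle a copy so that chance alone decides who lands in which group.
shuffled = participants[:]
rng.shuffle(shuffled)

half = len(shuffled) // 2
treatment = shuffled[:half]  # receives the treatment
control = shuffled[half:]    # receives no treatment or a placebo

print("treatment:", treatment)
print("control:  ", control)
```

Because the shuffle ignores every characteristic of the participants, the two groups should be similar on average in everything except the treatment, which is what lets us attribute a difference in the response to the treatment itself.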

Key points:

Observational Study


An observational study is a type of study where we measure or survey the people or things in a sample without controlling or manipulating any of the variables.

Key points:

Summary

In this lecture, we talked about the following:

Today’s Activity

For each example below, identify the type of study and comment on the sampling method used and the types of variables involved. Also note whether any information is missing or whether the sampling method might be problematic.

  1. A study took a random sample of students and asked them about their bedtime schedules. The data showed that people who sleep for at least 8 hours before the exam day were more likely to get good grades than those who sleep for less than 8 hours.

  2. A study randomly assigned people to one of the two groups. Group 1 was asked to follow a strict study schedule for a fixed period of time whereas Group 2 was asked to study in the same way as they used to earlier. The researchers looked at which group scored better in the exams.

  3. A study took a random sample of people and examined their smoking habits. Each person was classified as either a light, moderate or heavy smoker. The researcher looked at the stress level of each group.

These problems are taken from Towards Data Science Blog - “Observational vs Experimental Study”.