1 - Types of Data

Alex John Quijano

09/01/2021

What is “Data”?

Data are pieces of information from our world. They are used as a reference to validate a mathematical model or as input for statistical analysis.

We call data items variables: any quantities or features that can be measured.

General Process of Statistical Analysis/Modeling/Inference using Data

Types of Variables


Source: [Fig. 1.1 of OpenIntro: Introduction to Modern Statistics](https://openintro-ims.netlify.app/data-hello.html#variable-types)

Numerical Variables

Shown here are numerical variables of the county data set, which is available in the usdata package.

library(usdata)
library(tidyverse)
glimpse(county[1:3142,c("pop2000","pop2010","pop2017","pop_change",
                        "poverty","unemployment_rate",
                        "per_capita_income","median_hh_income")])
## Rows: 3,142
## Columns: 8
## $ pop2000           <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 11…
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ pop2017           <int> 55504, 212628, 25270, 22668, 58013, 10309, 19825, 11…
## $ pop_change        <dbl> 1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -2.69, -1.51, …
## $ poverty           <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8…
## $ unemployment_rate <dbl> 3.86, 3.99, 5.90, 4.39, 4.02, 4.93, 5.49, 4.93, 4.08…
## $ per_capita_income <dbl> 27841.70, 27779.85, 17891.73, 20572.05, 21367.39, 15…
## $ median_hh_income  <int> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 436…

Which ones are discrete and which ones are continuous?
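One rough way to start on this question in R is to check whether every value of a numeric variable is a whole number. The sketch below uses two made-up vectors (not columns of county) and a hypothetical helper named is_whole_valued. It is only a heuristic: a continuous quantity that was rounded to whole units would also pass the check, so the final call still depends on what the variable measures.

```r
# Hypothetical helper: TRUE when every non-missing value is a whole number.
is_whole_valued <- function(x) {
  x <- x[!is.na(x)]
  all(x == round(x))
}

# Made-up example vectors, for illustration only.
household_count <- c(3, 10, 7, NA, 12)   # counts: whole numbers only
rate_percent    <- c(3.86, 3.99, 5.90)   # rates: can fall anywhere in a range

is_whole_valued(household_count)  # TRUE
is_whole_valued(rate_percent)     # FALSE
```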

Categorical Variables

Shown here are categorical variables of the county data set, which is available in the usdata package.

library(usdata)
library(tidyverse)
glimpse(county[1:3142,c("name","state","metro","median_edu","smoking_ban")])
## Rows: 3,142
## Columns: 5
## $ name        <chr> "Autauga County", "Baldwin County", "Barbour County", "Bib…
## $ state       <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, Alab…
## $ metro       <fct> yes, yes, no, yes, yes, no, no, yes, no, no, yes, no, no, …
## $ median_edu  <fct> some_college, some_college, hs_diploma, hs_diploma, hs_dip…
## $ smoking_ban <fct> none, none, partial, none, none, none, NA, NA, none, none,…

The possible (unique) values of a categorical variable are called levels. For example, the levels of metro are yes and no.
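As a small illustration, the levels of a factor can be listed with base R functions. The vector below is made up and only stands in for a yes/no column such as metro.

```r
# Made-up yes/no vector stored as a factor (a stand-in for a column like metro).
metro_example <- factor(c("yes", "yes", "no", "yes", "no"))

levels(metro_example)   # the unique categories, sorted: "no" "yes"
nlevels(metro_example)  # number of levels: 2
table(metro_example)    # how many observations fall in each level
```

With the usdata package loaded, the same call on the real column would be levels(county$metro).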

Which ones are ordinal and which ones are nominal?
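In R, the distinction can be recorded in the factor itself. The sketch below uses made-up ratings and colors (not county columns) to show one way of storing an ordinal variable as an ordered factor, while a nominal variable stays a plain factor.

```r
# Made-up ratings with a natural order: low < medium < high.
rating <- factor(c("medium", "low", "high", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

is.ordered(rating)  # TRUE: the levels carry an order
rating < "high"     # comparisons respect that order

# Made-up colors with no natural order.
color <- factor(c("red", "blue", "red"))
is.ordered(color)   # FALSE: the levels are just labels
```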

Negative Relationship Between the Variables

ggplot(data = na.omit(county), aes(x = multi_unit, y = homeownership)) + geom_point()



Because the scatterplot shows a downward trend, we say that the variables multi_unit and homeownership are negatively associated.

Positive Relationship Between the Variables

ggplot(data = na.omit(county), aes(x = median_hh_income, y = pop_change)) + geom_point()



Because the scatterplot shows an upward trend, we say that the variables median_hh_income and pop_change are positively associated.
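The direction of a trend can also be summarized with a number. One common choice, not used in these slides and shown here only as a sketch, is the correlation coefficient from cor(): its sign matches the direction of a roughly linear trend. The data below are simulated, not taken from county.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated (made-up) data, not the county data set.
x      <- runif(200, min = 0, max = 100)
y_down <- 80 - 0.5 * x + rnorm(200, sd = 5)  # tends to fall as x rises
y_up   <- 10 + 0.5 * x + rnorm(200, sd = 5)  # tends to rise as x rises

cor(x, y_down)  # negative value: negative association
cor(x, y_up)    # positive value: positive association
```

For the county data, a comparable call would be cor(county$multi_unit, county$homeownership, use = "complete.obs"), which skips rows with missing values. Keep in mind that the correlation coefficient only captures linear trends, so it complements the scatterplot and does not replace it.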

Associated or Independent Variables


Associated variables are related somehow due to some underlying phenomenon.

Two variables are independent when they are not associated.


Two variables CANNOT be both associated and independent.


Note that association DOES NOT imply causation.


There might be confounding variables that explain the trend, the association might be just a coincidence, or something else entirely might be responsible.

That is why we need more data exploration and more perspectives, together with statistical inference, to arrive at a better conclusion.
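A small simulation can make the role of a confounding variable concrete. In the sketch below, all three variables are made up: z drives both x and y, and x never influences y. The two are still strongly associated. Removing the part explained by z (here with lm(), a tool not covered in these slides and used only for illustration) makes the association largely disappear.

```r
set.seed(42)  # make the simulated data reproducible
n <- 1000

z <- rnorm(n)                # made-up confounding variable
x <- z + rnorm(n, sd = 0.5)  # depends on z only
y <- z + rnorm(n, sd = 0.5)  # depends on z only, never on x

cor(x, y)  # strongly positive, although x does not cause y

# Remove the part of x and y explained by z, then compare what is left.
x_resid <- resid(lm(x ~ z))
y_resid <- resid(lm(y ~ z))
cor(x_resid, y_resid)  # close to zero
```

In real data the confounder is usually not known in advance, which is one reason further exploration and inference are needed.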

Summary

In this lecture, we talked about the following:

Today’s Activity 1/2

Consider the data set named passwords, which is downloaded from the tidytuesdayR package repository.

Do weak passwords take a shorter time to crack?

library(tidytuesdayR)
passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')
# printing only the first 2 items for each variable
glimpse(passwords[1:2,])
## Rows: 2
## Columns: 9
## $ rank              <dbl> 1, 2
## $ password          <chr> "password", "123456"
## $ category          <chr> "password-related", "simple-alphanumeric"
## $ value             <dbl> 6.91, 18.52
## $ time_unit         <chr> "years", "minutes"
## $ offline_crack_sec <dbl> 2.17e+00, 1.11e-05
## $ rank_alt          <dbl> 1, 2
## $ strength          <dbl> 8, 4
## $ font_size         <dbl> 11, 8

Today’s Activity 2/2

Do weak passwords take a shorter time to crack?

ggplot(data = na.omit(passwords), aes(x = strength, y = offline_crack_sec)) + geom_point()