Alex John Quijano
09/01/2021
Data are pieces of information from our world. They are used as a reference to validate a mathematical model or as input for statistical analysis.
We call data items variables: any quantities or features that can be measured.
General Process of Statistical Analysis/Modeling/Inference using Data
Shown here are numerical variables of the county data set, which is available in the usdata package.
library(dplyr)   # provides glimpse()
library(usdata)  # provides the county data set
glimpse(county[1:3142,c("pop2000","pop2010","pop2017","pop_change",
"poverty","unemployment_rate",
"per_capita_income","median_hh_income")])
## Rows: 3,142
## Columns: 8
## $ pop2000 <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 11…
## $ pop2010 <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ pop2017 <int> 55504, 212628, 25270, 22668, 58013, 10309, 19825, 11…
## $ pop_change <dbl> 1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -2.69, -1.51, …
## $ poverty <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8…
## $ unemployment_rate <dbl> 3.86, 3.99, 5.90, 4.39, 4.02, 4.93, 5.49, 4.93, 4.08…
## $ per_capita_income <dbl> 27841.70, 27779.85, 17891.73, 20572.05, 21367.39, 15…
## $ median_hh_income <int> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 436…
Which ones are discrete and which ones are continuous?
Discrete: pop2000, pop2010, pop2017, median_hh_income
Continuous: pop_change, poverty, unemployment_rate, per_capita_income
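One rough way to approach this question in R is to check whether every value in a column is a whole number. The sketch below uses only the first five values printed in the glimpse() output above. The helper is_whole_valued is a hypothetical name introduced here, not a function from usdata or dplyr. Whole-valued data hints at a discrete count, but this is only a heuristic: a continuous quantity can also be recorded after rounding.

```r
# First five values of five columns, copied from the glimpse() output above.
county_head <- data.frame(
  pop2000           = c(43671, 140415, 29038, 20826, 51024),
  median_hh_income  = c(55317, 52562, 33368, 43404, 47412),
  pop_change        = c(1.48, 9.19, -6.22, 0.73, 0.68),
  poverty           = c(13.7, 11.8, 27.2, 15.2, 15.6),
  unemployment_rate = c(3.86, 3.99, 5.90, 4.39, 4.02)
)

# Hypothetical helper: TRUE when every non-missing value is a whole number.
is_whole_valued <- function(x) all(x == round(x), na.rm = TRUE)

# Apply the check to each column; the result is a named logical vector.
whole_valued <- sapply(county_head, is_whole_valued)
print(whole_valued)
```

Columns flagged TRUE are candidates for discrete variables; the final call still depends on what the variable measures.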
Shown here are categorical variables of the county data set, which is available in the usdata package.
## Rows: 3,142
## Columns: 5
## $ name <chr> "Autauga County", "Baldwin County", "Barbour County", "Bib…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, Alab…
## $ metro <fct> yes, yes, no, yes, yes, no, no, yes, no, no, yes, no, no, …
## $ median_edu <fct> some_college, some_college, hs_diploma, hs_diploma, hs_dip…
## $ smoking_ban <fct> none, none, partial, none, none, none, NA, NA, none, none,…
The possible (unique) values of a categorical variable are called levels. For example, the metro levels are yes or no.
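As a small illustration of levels, the sketch below rebuilds the first eight metro values from the output above as a factor and lists its levels with base R. It uses only those eight values, so the counts describe this tiny sample and not the full county data set.

```r
# First eight values of metro, copied from the glimpse() output above.
metro <- factor(c("yes", "yes", "no", "yes", "yes", "no", "no", "yes"))

# levels() returns the distinct categories, sorted alphabetically by default.
metro_levels <- levels(metro)
print(metro_levels)

# table() counts how many observations fall in each level.
metro_counts <- table(metro)
print(metro_counts)
```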
Which ones are ordinal and which ones are nominal?
Ordinal: median_edu
Nominal: name, state, metro, smoking_ban
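In R, the ordinal versus nominal distinction can be expressed with ordered and unordered factors. The sketch below is illustrative only: the vector edu_order is an assumed ranking of education levels, since the full set of median_edu levels is not shown in the output above.

```r
# First four values of median_edu, copied from the glimpse() output above.
edu_values <- c("some_college", "some_college", "hs_diploma", "hs_diploma")

# Assumed ordering of education levels, lowest to highest. The full set of
# levels is not shown in the output above, so treat this vector as illustrative.
edu_order <- c("below_hs", "hs_diploma", "some_college", "bachelors")

median_edu <- factor(edu_values, levels = edu_order, ordered = TRUE)

# Comparisons are meaningful for an ordered factor.
print(median_edu[1] > median_edu[3])  # some_college compared with hs_diploma
print(max(median_edu))

# A nominal variable such as metro stays an unordered factor.
metro <- factor(c("yes", "yes", "no", "yes"))
print(is.ordered(metro))
```

Ordering comparisons such as > are only defined for the ordered factor, which is what makes median_edu ordinal and metro nominal.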
Because the scatterplot shows a downward trend, we can describe the variables multi_unit and homeownership as negatively associated.
Because the scatterplot shows an upward trend, we can describe the variables median_hh_income and pop_change as positively associated.
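The direction of a trend can also be summarized with a number. The sketch below computes the Pearson correlation with base R's cor() for two pairs of variables. These are not the pairs from the scatterplots, whose values are not printed in these notes; instead it uses the first seven values of columns that appear in the earlier glimpse() output. Seven counties are far too few to draw conclusions, so this only illustrates how the sign matches the direction of association.

```r
# First seven values of each column, copied from the earlier glimpse() output.
poverty          <- c(13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4)
median_hh_income <- c(55317, 52562, 33368, 43404, 47412, 29655, 36326)
pop2000          <- c(43671, 140415, 29038, 20826, 51024, 11714, 21399)
pop2010          <- c(54571, 182265, 27457, 22915, 57322, 10914, 20947)

# A negative correlation corresponds to a downward trend in a scatterplot.
r_negative <- cor(poverty, median_hh_income)

# A positive correlation corresponds to an upward trend.
r_positive <- cor(pop2000, pop2010)

print(c(poverty_vs_income = r_negative, pop2000_vs_pop2010 = r_positive))
```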
Associated variables are related in some way, often through an underlying phenomenon.
Two variables are independent when they are not associated.
An observed association does not by itself explain the trend: confounding variables might account for it, or it might be just a coincidence.
That is why we need more data exploration, more perspectives, and statistical inference to arrive at a better conclusion.
In this lecture, we talked about the following:
About Data and where it fits in statistical analysis/modeling/inference.
Examples of Numerical and Categorical Variables.
Examples of Discrete and Continuous Numerical Variables.
Examples of Ordinal and Nominal Categorical Variables.
Association vs Independence
Consider this data set named passwords, which is downloaded from the tidytuesdayR package repository.
Identify which variables are numerical and which are categorical. Explain why.
Identify whether each numerical variable is discrete or continuous, and whether each categorical variable is ordinal or nominal. Explain why.
library(tidytuesdayR)
passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')
## Rows: 2
## Columns: 9
## $ rank <dbl> 1, 2
## $ password <chr> "password", "123456"
## $ category <chr> "password-related", "simple-alphanumeric"
## $ value <dbl> 6.91, 18.52
## $ time_unit <chr> "years", "minutes"
## $ offline_crack_sec <dbl> 2.17e+00, 1.11e-05
## $ rank_alt <dbl> 1, 2
## $ strength <dbl> 8, 4
## $ font_size <dbl> 11, 8
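As a possible starting point for the two questions above, the sketch below lists the storage type of each column. It rebuilds the two rows shown in the output by hand (the name passwords_head is made up for this example) so that it runs without downloading the file. Storage type is only a hint: a column stored as a number can still act as a label or a ranking, so the explanation asked for in the exercise is still needed.

```r
# The two rows shown in the output above, rebuilt by hand so that the
# example runs without downloading the file.
passwords_head <- data.frame(
  rank = c(1, 2),
  password = c("password", "123456"),
  category = c("password-related", "simple-alphanumeric"),
  value = c(6.91, 18.52),
  time_unit = c("years", "minutes"),
  offline_crack_sec = c(2.17e+00, 1.11e-05),
  rank_alt = c(1, 2),
  strength = c(8, 4),
  font_size = c(11, 8),
  stringsAsFactors = FALSE
)

# Storage type of each column: "numeric" or "character".
column_types <- sapply(passwords_head, class)
print(column_types)

# Group the column names by storage type.
numeric_columns <- names(column_types)[column_types == "numeric"]
character_columns <- names(column_types)[column_types == "character"]
print(numeric_columns)
print(character_columns)
```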
Is there any association between strength and offline_crack_sec? Explain why.
Does it suggest that weaker passwords take a shorter time to crack? Identify any patterns in this scatter plot that suggest this might not be the case.
Are there any variables we have not considered that might explain why weaker passwords take a shorter time to crack?