Numerical Variables

Shown here are numerical variables of the county data set, which is available in the usdata package.

library(usdata)
library(tidyverse)

glimpse(county[1:3142,c("pop2000","pop2010","pop2017","pop_change",
                        "poverty","unemployment_rate",
                        "per_capita_income","median_hh_income")])

## Rows: 3,142
## Columns: 8
## $ pop2000           <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 11…
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ pop2017           <int> 55504, 212628, 25270, 22668, 58013, 10309, 19825, 11…
## $ pop_change        <dbl> 1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -2.69, -1.51, …
## $ poverty           <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8…
## $ unemployment_rate <dbl> 3.86, 3.99, 5.90, 4.39, 4.02, 4.93, 5.49, 4.93, 4.08…
## $ per_capita_income <dbl> 27841.70, 27779.85, 17891.73, 20572.05, 21367.39, 15…
## $ median_hh_income  <int> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 436…

Which ones are discrete and which ones are continuous?

Discrete - pop2000, pop2010, pop2017, median_hh_income
Continuous - pop_change, poverty, unemployment_rate, per_capita_income
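
One rough way to check this classification is to test whether a column holds only whole numbers. The sketch below is illustrative only: toy is a hypothetical data frame built from the first three values shown in the glimpse() output above, and is_whole() is a hypothetical helper, not a function from usdata or the tidyverse. A whole-number column is not automatically discrete, because a continuous quantity can be recorded after rounding, so treat the result as a hint rather than a rule.

```r
# Hypothetical toy data: first three counties, copied from the glimpse() output
toy <- data.frame(
  pop2017           = c(55504L, 212628L, 25270L),
  poverty           = c(13.7, 11.8, 27.2),
  unemployment_rate = c(3.86, 3.99, 5.90)
)

# Hypothetical helper: TRUE when every non-missing value is a whole number
is_whole <- function(x) all(x == round(x), na.rm = TRUE)

# Apply the check to every column; returns a named logical vector
whole_cols <- sapply(toy, is_whole)
print(whole_cols)
```

Here pop2017 is flagged as whole-numbered, while poverty and unemployment_rate are not, which agrees with the discrete and continuous labels above.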

Categorical Variables

Shown here are categorical variables of the county data set, which is available in the usdata package.

library(usdata)
library(tidyverse)

glimpse(county[1:3142,c("name","state","metro","median_edu","smoking_ban")])

## Rows: 3,142
## Columns: 5
## $ name        <chr> "Autauga County", "Baldwin County", "Barbour County", "Bib…
## $ state       <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, Alab…
## $ metro       <fct> yes, yes, no, yes, yes, no, no, yes, no, no, yes, no, no, …
## $ median_edu  <fct> some_college, some_college, hs_diploma, hs_diploma, hs_dip…
## $ smoking_ban <fct> none, none, partial, none, none, none, NA, NA, none, none,…

The possible (unique) values of a categorical variable are called levels. For example, the levels of metro are yes and no.

Which ones are ordinal and which ones are nominal?

Ordinal - median_edu
Nominal - name, state, metro, smoking_ban
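
In R, the difference between nominal and ordinal variables can be expressed with unordered and ordered factors. The following is a minimal sketch using the first few values from the glimpse() output above. The level order in edu_order (a hypothetical name) is an assumption about how the education levels should be ranked; check levels(county$median_edu) for the levels actually stored in the data.

```r
# Nominal: a plain factor, its levels have no meaningful order
metro <- factor(c("yes", "yes", "no", "yes"))
print(levels(metro))   # levels are sorted alphabetically by default

# Ordinal: an ordered factor with an explicit ranking of levels
# (assumed order, from lowest to highest education)
edu_order <- c("below_hs", "hs_diploma", "some_college", "bachelors")
median_edu <- factor(
  c("some_college", "some_college", "hs_diploma", "hs_diploma"),
  levels  = edu_order,
  ordered = TRUE
)

print(is.ordered(metro))        # FALSE: nominal
print(is.ordered(median_edu))   # TRUE: ordinal

# Comparisons are only meaningful for ordered factors
print(median_edu[1] > median_edu[3])   # some_college ranks above hs_diploma
```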

Negative Relationship Between the Variables

ggplot(data = na.omit(county), aes(x = multi_unit, y = homeownership)) + geom_point()

Because the scatterplot shows a downward trend, the variables multi_unit and homeownership are negatively associated: counties with higher values of multi_unit tend to have lower values of homeownership.

Positive Relationship Between the Variables

ggplot(data = na.omit(county), aes(x = median_hh_income, y = pop_change)) + geom_point()

Because the scatterplot shows an upward trend, the variables median_hh_income and pop_change are positively associated: counties with higher values of median_hh_income tend to have higher values of pop_change.
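
The direction of a trend can also be summarized numerically by the sign of the correlation coefficient. The sketch below uses invented values (x, y_down, and y_up are hypothetical and are not taken from the county data) purely to show how the sign matches the direction of the trend. With the real data, a call such as cor(county$multi_unit, county$homeownership, use = "complete.obs") could play the same role.

```r
# Invented illustrative values, NOT from the county data set
x      <- c(5, 10, 20, 30, 40)
y_down <- c(80, 75, 70, 62, 55)   # decreases as x increases
y_up   <- c(-2, -1, 0.5, 1, 3)    # increases as x increases

# Pearson correlation: the sign gives the direction of the linear trend
r_down <- cor(x, y_down)
r_up   <- cor(x, y_up)

print(sign(r_down))   # -1: negative association
print(sign(r_up))     #  1: positive association
```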

Associated or Independent Variables

Two variables are associated when they are related in some way, often because of an underlying phenomenon.

Two variables are independent when they are not associated.

Two variables CANNOT be both associated and independent.

Note that association DOES NOT imply causation.

A confounding variable might explain the trend, or the association might simply be a coincidence.

That is why we need further data exploration, additional perspectives, and statistical inference to arrive at a better conclusion.
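
A small simulation can illustrate how a confounding variable produces an association without any causal link. This is a sketch under stated assumptions: z, x, and y are simulated, hypothetical variables, where z drives both x and y, and x has no effect on y. Removing the part of each variable explained by z (via the residuals of a linear fit) is one simple way to see the association fade, not the only one.

```r
set.seed(1)   # fixed seed so the simulation is reproducible
n <- 5000

# Hypothetical confounder z influences both x and y;
# x and y do not influence each other
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)

# x and y look clearly positively associated
r_xy <- cor(x, y)

# After removing the part explained by z, the association nearly vanishes
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

print(c(raw = r_xy, adjusted_for_z = r_partial))
```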

Summary

In this lecture, we talked about the following:

About Data and where it fits in statistical analysis/modeling/inference.
Examples of Numerical and Categorical Variables.
Examples of Discrete and Continuous Numerical Variables.
Examples of Ordinal and Nominal Categorical Variables.
Association vs Independence

Today’s Activity 1/2

Consider the data set named passwords, which is downloaded from the tidytuesdayR package repository.

Do weak passwords take less time to crack?

Source: Table 3.3 of OpenIntro: Introduction to Modern Statistics (https://openintro-ims.netlify.app/data-applications.html#tab:passwords-var-def)

Identify whether each variable is numerical or categorical. Explain why.
For each variable, identify whether it is discrete or continuous (if numerical) or ordinal or nominal (if categorical). Explain why.

library(tidytuesdayR)
passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')

# printing only the first 2 items for each variable
glimpse(passwords[1:2,])

## Rows: 2
## Columns: 9
## $ rank              <dbl> 1, 2
## $ password          <chr> "password", "123456"
## $ category          <chr> "password-related", "simple-alphanumeric"
## $ value             <dbl> 6.91, 18.52
## $ time_unit         <chr> "years", "minutes"
## $ offline_crack_sec <dbl> 2.17e+00, 1.11e-05
## $ rank_alt          <dbl> 1, 2
## $ strength          <dbl> 8, 4
## $ font_size         <dbl> 11, 8

Today’s Activity 2/2