Learning Objectives


Upon completing today’s lab activity, students should be able to do the following using R and RStudio:

  1. Navigate through the RStudio interface and perform basic computations in the R console.

  2. Install and load R packages.

  3. Load data sets from packages and csv files and perform basic data subsetting.

  4. Perform basic plotting of histograms and scatterplots.


The RStudio Interface


R is a programming language for statistical computing and data visualization. It is free and widely used by statisticians and data scientists for data analysis, for developing software that implements statistical methods, and for applying statistics in research.

RStudio is an open-source Integrated Development Environment (IDE) for R. It is a desktop application designed to make working with R as easy as possible. Below is a screenshot of what RStudio looks like.

RStudio is composed of three main panels.

  • Console (left or lower-left), where you type your R commands after the prompt symbol >.

  • Environment (upper-right), which lists all of the variables you have declared and keeps a history of the commands you have previously entered.

  • Files, Plots, Packages, Help, Viewer (lower-right), where you can browse files, view visualizations, install and manage packages, and access help for R functions.


Using R as a Calculator


Basic Math Computations

1+41
## [1] 42
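
The console handles the other standard arithmetic operators in the same way. The lines below are a minimal sketch; the numbers are arbitrary and the variable names are only illustrative.

```r
# Standard arithmetic operators; the numbers are arbitrary illustrations
total      <- 7 + 5      # addition
difference <- 7 - 5      # subtraction
product    <- 7 * 5      # multiplication
quotient   <- 7 / 2      # division
power      <- 2^5        # exponentiation
remainder  <- 7 %% 2     # remainder after division
root       <- sqrt(49)   # square root

# Collect the results into one vector and display them
print(c(total, difference, product, quotient, power, remainder, root))
```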


Variables

x <- 1
y <- 2
z <- (x+2)+(y+1)
print(z)
## [1] 6


Vectors

vector_1 <- c(1,2,3,4,5)
print(vector_1)
## [1] 1 2 3 4 5
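
Once a vector exists, arithmetic applies to every element at once, and brackets pick out elements by position (counting from 1) or by a condition. The following is a small sketch with made-up values; `scores` and the other names are hypothetical.

```r
# A made-up numeric vector for illustration
scores <- c(10, 20, 30, 40, 50)

doubled   <- scores * 2          # arithmetic is applied element by element
first_two <- scores[1:2]         # positions start at 1, not 0
large     <- scores[scores > 25] # keep only elements that satisfy a condition
n         <- length(scores)      # number of elements
avg       <- mean(scores)        # average of the elements

print(doubled)
print(large)
```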


R Packages


Installing Packages

install.packages("tidyverse")
install.packages("openintro")
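
A package only needs to be installed once per computer, so it can be convenient to check what is missing before installing. The sketch below is one possible way to do that; `find_missing` is a hypothetical helper written for this example and is not part of R or of the lab materials.

```r
# Hypothetical helper: return the names of packages that are not installed
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be reported as missing
print(find_missing(c("stats", "utils")))

# For this lab the check could look like the lines below. They are left
# commented out so that running this sketch does not start a download.
# to_install <- find_missing(c("tidyverse", "openintro"))
# if (length(to_install) > 0) install.packages(to_install)
```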


Loading Packages

library(tidyverse)
library(openintro)


Loading Datasets


In this section, we use loan data from Lending Club. The data frame loans_full_schema is included in the openintro package, so it is available as soon as that package is loaded.


Datasets from Existing Packages

glimpse(loans_full_schema)
## Rows: 10,000
## Columns: 55
## $ emp_title                        <chr> "global config engineer ", "warehouse…
## $ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
## $ state                            <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
## $ homeownership                    <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
## $ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
## $ verified_income                  <fct> Verified, Not Verified, Source Verifi…
## $ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
## $ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
## $ verification_income_joint        <fct> , , , , Verified, , Not Verified, , ,…
## $ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
## $ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
## $ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
## $ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
## $ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
## $ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
## $ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
## $ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
## $ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
## $ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
## $ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
## $ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
## $ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
## $ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
## $ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
## $ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
## $ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
## $ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
## $ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_purpose                     <fct> moving, debt_consolidation, other, de…
## $ application_type                 <fct> individual, individual, individual, i…
## $ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
## $ term                             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
## $ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
## $ installment                      <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
## $ grade                            <fct> C, C, D, A, C, A, C, B, C, A, C, B, C…
## $ sub_grade                        <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
## $ issue_month                      <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
## $ loan_status                      <fct> Current, Current, Current, Current, C…
## $ initial_listing_status           <fct> whole, whole, fractional, whole, whol…
## $ disbursement_method              <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
## $ balance                          <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
## $ paid_total                       <dbl> 1999.330, 499.120, 281.800, 3312.890,…
## $ paid_principal                   <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
## $ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
## $ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…


Datasets from a csv File

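First, download the loans_full_schema csv file and save it in a folder named data-sets inside your working directory, so that its location matches the path used below.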

loans_full_schema <- read.csv(file='data-sets/loans_full_schema.csv',header=TRUE)
glimpse(loans_full_schema)
## Rows: 10,000
## Columns: 55
## $ emp_title                        <chr> "global config engineer ", "warehouse…
## $ emp_length                       <int> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
## $ state                            <chr> "NJ", "HI", "WI", "PA", "CA", "KY", "…
## $ homeownership                    <chr> "MORTGAGE", "RENT", "RENT", "RENT", "…
## $ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
## $ verified_income                  <chr> "Verified", "Not Verified", "Source V…
## $ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
## $ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
## $ verification_income_joint        <chr> "", "", "", "", "Verified", "", "Not …
## $ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
## $ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
## $ earliest_credit_line             <int> 2001, 1996, 2006, 2007, 2008, 1990, 2…
## $ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
## $ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
## $ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
## $ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
## $ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
## $ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
## $ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
## $ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
## $ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
## $ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
## $ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
## $ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
## $ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
## $ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
## $ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
## $ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_purpose                     <chr> "moving", "debt_consolidation", "othe…
## $ application_type                 <chr> "individual", "individual", "individu…
## $ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
## $ term                             <int> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
## $ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
## $ installment                      <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
## $ grade                            <chr> "C", "C", "D", "A", "C", "A", "C", "B…
## $ sub_grade                        <chr> "C3", "C1", "D1", "A3", "C3", "A3", "…
## $ issue_month                      <chr> "Mar-2018", "Feb-2018", "Feb-2018", "…
## $ loan_status                      <chr> "Current", "Current", "Current", "Cur…
## $ initial_listing_status           <chr> "whole", "whole", "fractional", "whol…
## $ disbursement_method              <chr> "Cash", "Cash", "Cash", "Cash", "Cash…
## $ balance                          <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
## $ paid_total                       <dbl> 1999.330, 499.120, 281.800, 3312.890,…
## $ paid_principal                   <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
## $ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
## $ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
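
Compared with the package version, several columns that were factors (fct) are now character (chr). In R 4.0.0 and later, read.csv leaves text columns as character vectors unless stringsAsFactors = TRUE is given. The sketch below illustrates this with a small made-up file written to a temporary location; `loans_toy` and `csv_path` are hypothetical names.

```r
# Write a tiny data frame to a temporary csv file, then read it back
loans_toy <- data.frame(
  state       = c("NJ", "HI", "WI"),
  loan_amount = c(28000L, 5000L, 2000L)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(loans_toy, csv_path, row.names = FALSE)

# Default behaviour versus explicitly requesting factors
as_text    <- read.csv(csv_path, header = TRUE)
as_factors <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

class(as_text$state)      # character in R 4.0.0 and later
class(as_factors$state)   # factor
levels(as_factors$state)  # the distinct categories, sorted alphabetically
```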


Subsetting of Data


Accessing Rows and Columns


  • Use brackets with row and column indices to access the first 3 rows and the first 5 columns of the data frame.
loans_full_schema[1:3,1:5]
##                 emp_title emp_length state homeownership annual_income
## 1 global config engineer           3    NJ      MORTGAGE         90000
## 2  warehouse office clerk         10    HI          RENT         40000
## 3                assembly          3    WI          RENT         40000


  • Use brackets with column labels to access specific columns for the first 3 rows of the data frame.
loans_full_schema[1:3,c("loan_amount","loan_purpose","loan_status")]
##   loan_amount       loan_purpose loan_status
## 1       28000             moving     Current
## 2        5000 debt_consolidation     Current
## 3        2000              other     Current


Accessing Entire Columns


  • Use the $ operator to access a single column of the data by its label.
glimpse(loans_full_schema$annual_income)
##  num [1:10000] 90000 40000 40000 30000 35000 34000 35000 110000 65000 30000 ...


  • Use brackets with an empty row index to access entire columns by their labels.
glimpse(loans_full_schema[,c("loan_amount","loan_purpose","loan_status")])
## Rows: 10,000
## Columns: 3
## $ loan_amount  <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 2000…
## $ loan_purpose <chr> "moving", "debt_consolidation", "other", "debt_consolidat…
## $ loan_status  <chr> "Current", "Current", "Current", "Current", "Current", "C…
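
The same bracket notation also accepts a condition in the row position, which keeps only the rows where the condition is true. The sketch below goes slightly beyond the examples above and uses a stand-in data frame, `loans_toy` (a hypothetical name), built from the first three rows shown earlier.

```r
# Stand-in data frame: first three rows of three columns from the output above
loans_toy <- data.frame(
  loan_amount  = c(28000, 5000, 2000),
  loan_purpose = c("moving", "debt_consolidation", "other"),
  loan_status  = c("Current", "Current", "Current")
)

by_index    <- loans_toy[1:2, 1:2]                            # rows 1-2, columns 1-2
by_name     <- loans_toy[, c("loan_amount", "loan_status")]   # all rows, two columns
one_column  <- loans_toy$loan_amount                          # a single column as a vector
large_loans <- loans_toy[loans_toy$loan_amount > 4000, ]      # rows meeting a condition

print(large_loans)
```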


Basic Descriptive Statistics


  • Use the $ operator to summarize one column.
summary(loans_full_schema$annual_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   45000   65000   79222   95000 2300000


  • Use brackets to summarize multiple columns.
summary(loans_full_schema[,c("loan_amount","loan_purpose","loan_status")])
##   loan_amount    loan_purpose       loan_status       
##  Min.   : 1000   Length:10000       Length:10000      
##  1st Qu.: 8000   Class :character   Class :character  
##  Median :14500   Mode  :character   Mode  :character  
##  Mean   :16362                                        
##  3rd Qu.:24000                                        
##  Max.   :40000


  • Compute the mean and standard deviation of one column.
mean(loans_full_schema$loan_amount)
## [1] 16361.92
sd(loans_full_schema$loan_amount)
## [1] 10301.96
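
Some columns, such as emp_length, contain missing values (NA). By default, mean and sd return NA when any value is missing; the argument na.rm = TRUE drops the missing values first. A minimal sketch, using the first six emp_length values shown in the glimpse output:

```r
# First six values of emp_length from the glimpse output, including one NA
emp_length <- c(3, 10, 3, 1, 10, NA)

mean(emp_length)                 # NA, because one value is missing
mean(emp_length, na.rm = TRUE)   # mean of the five observed values
sd(emp_length, na.rm = TRUE)     # standard deviation of the observed values
sum(is.na(emp_length))           # how many values are missing
```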


Basic Data Visualization


Histograms

ggplot(data = loans_full_schema, aes(x = total_credit_limit)) + geom_histogram(bins=60)

You can also use the base R function hist. Run ?hist for details and usage.
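
As a rough sketch of the hist alternative, the lines below plot only the first five total_credit_limit values shown in the glimpse output, so that the example runs without loading the full data. The title, axis label, and breaks value are arbitrary choices.

```r
# First five values of total_credit_limit from the glimpse output
credit_limit <- c(70795, 28800, 24193, 25400, 69839)

# breaks is a suggestion for the number of bins; R may adjust it
h <- hist(credit_limit,
          breaks = 5,
          main   = "Total credit limit (first five loans)",
          xlab   = "Total credit limit")

# hist also returns the bin boundaries and counts it used
print(h$breaks)
print(h$counts)
```

With the full data frame, the first argument would be loans_full_schema$total_credit_limit.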


Scatterplots

ggplot(data = loans_full_schema, aes(x = annual_income, y = total_credit_limit)) + geom_point()

You can also use the base R function plot. Run ?plot for details and usage.
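
As a rough sketch of the plot alternative, the lines below use only the first five annual_income and total_credit_limit values shown in the glimpse output, so that the example runs without loading the full data. The title and axis labels are arbitrary choices.

```r
# First five values of each column from the glimpse output
annual_income      <- c(90000, 40000, 40000, 30000, 35000)
total_credit_limit <- c(70795, 28800, 24193, 25400, 69839)

# The first argument goes on the horizontal axis, the second on the vertical axis
plot(annual_income, total_credit_limit,
     main = "Credit limit versus income (first five loans)",
     xlab = "Annual income",
     ylab = "Total credit limit")
```

With the full data frame, the two arguments would be loans_full_schema$annual_income and loans_full_schema$total_credit_limit.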


Lab Exercises


I. Iris Flowers

  1. Load the iris dataset. Note that this dataset is in the datasets package which is already included in the base R installation.

  2. How many rows and columns does this data set have?

  3. Produce 3 scatterplots and describe them. What pattern(s) do you observe? Do the relationships look linear or nonlinear? Are any points clustered together?

  4. Produce 3 histograms and describe them. What shape(s) do they show? Does each have one peak or multiple peaks?


II. Economic Regression

  1. Load the longley dataset. Note that this dataset is in the datasets package which is already included in the base R installation.

  2. How many rows and columns does this data set have?

  3. Produce 3 scatterplots and describe them. What pattern(s) do you observe? Do the relationships look linear or nonlinear? Are any points clustered together?

  4. Produce 3 histograms and describe them. What shape(s) do they show? Does each have one peak or multiple peaks?


III. United States Counties

  1. Load the county dataset from the csv file.

  2. How many rows and columns does this data set have?

  3. Produce 2 scatterplots and describe them. What pattern(s) do you observe? Do the relationships look linear or nonlinear? Are any points clustered together?

  4. Produce 2 histograms and describe them. What shape(s) do they show? Does each have one peak or multiple peaks?


