Upon completing today’s lab activity, students should be able to do the following using R and RStudio:
Navigate through the RStudio interface and perform basic computations in the R console.
Install and load R packages.
Load data sets from packages and CSV files and perform basic data subsetting.
Perform basic plotting of histograms and scatterplots.
R is a programming language for statistical computing and data visualization. It is free and widely used by statisticians and data scientists for data analysis, for developing software that implements statistical methods, and for statistical applications in research.
RStudio is an open-source Integrated Development Environment (IDE) for R. It is a desktop application designed to make working with R as easy as possible. Below is a screenshot of what RStudio looks like.
RStudio is composed of three main panels.
Console (left or lower-left), where you type your R commands after the prompt symbol >.
Environment (upper-right), which lists all of the variables you have defined. The History tab in the same panel records the commands you have previously entered.
Files, Plots, Packages, Help, Viewer (lower-right), where you can browse files, view visualizations, install and manage packages, and access help for R functions.
1+41
## [1] 42
x <- 1
y <- 2
z = (x+2)+(y+1)
print(z)
## [1] 6
vector_1 <- c(1,2,3,4,5)
print(vector_1)
## [1] 1 2 3 4 5
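Arithmetic on a vector is applied to each element in turn, and square brackets pick out elements by position. The following is a minimal sketch of both ideas. The names doubled and total are illustrative and not part of the lab.

```r
# Recreate the vector from the example above so this block runs on its own
vector_1 <- c(1, 2, 3, 4, 5)

# Arithmetic is applied to every element
doubled <- vector_1 * 2
print(doubled)            # 2 4 6 8 10

# sum() and length() reduce a vector to a single value
total <- sum(vector_1)
print(total)              # 15
print(length(vector_1))   # 5

# Square brackets select elements by position (R counts from 1)
print(vector_1[2:3])      # 2 3
```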
install.packages("tidyverse")
install.packages("openintro")
library(tidyverse)
library(openintro)
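A package only needs to be installed once per computer, while library() must be called in every new R session. One possible way to skip the installation when a package is already present is sketched below. The variable names and the interactive() guard are assumptions made for illustration, not part of the lab.

```r
# Packages used in this lab
pkgs <- c("tidyverse", "openintro")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]
print(to_install)

# Install only what is missing, and only in an interactive session
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

After this runs, library(tidyverse) and library(openintro) still need to be called to make the packages available.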
In this section, we use loan data from Lending Club. The loans_full_schema data frame is included in the openintro package.
glimpse(loans_full_schema)
## Rows: 10,000
## Columns: 55
## $ emp_title <chr> "global config engineer ", "warehouse…
## $ emp_length <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34…
## $ verified_income <fct> Verified, Not Verified, Source Verifi…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
## $ annual_income_joint <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
## $ verification_income_joint <fct> , , , , Verified, , Not Verified, , ,…
## $ debt_to_income_joint <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
## $ delinq_2y <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ months_since_last_delinq <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
## $ earliest_credit_line <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
## $ inquiries_last_12m <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
## $ total_credit_lines <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
## $ open_credit_lines <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ total_credit_limit <int> 70795, 28800, 24193, 25400, 69839, 42…
## $ total_credit_utilized <int> 38767, 4321, 16000, 4997, 52722, 3898…
## $ num_collections_last_12m <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ months_since_90d_late <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
## $ current_accounts_delinq <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ current_installment_accounts <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
## $ accounts_opened_24m <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
## $ num_satisfactory_accounts <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ num_accounts_120d_past_due <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
## $ num_accounts_30d_past_due <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
## $ total_debit_limit <int> 11100, 16500, 4300, 19400, 32700, 272…
## $ num_total_cc_accounts <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
## $ num_open_cc_accounts <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
## $ num_cc_carrying_balance <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
## $ num_mort_accounts <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
## $ account_never_delinq_percent <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
## $ tax_liens <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_purpose <fct> moving, debt_consolidation, other, de…
## $ application_type <fct> individual, individual, individual, i…
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000…
## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
## $ installment <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
## $ grade <fct> C, C, D, A, C, A, C, B, C, A, C, B, C…
## $ sub_grade <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
## $ issue_month <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
## $ loan_status <fct> Current, Current, Current, Current, C…
## $ initial_listing_status <fct> whole, whole, fractional, whole, whol…
## $ disbursement_method <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
## $ balance <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
## $ paid_total <dbl> 1999.330, 499.120, 281.800, 3312.890,…
## $ paid_principal <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
## $ paid_interest <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
## $ paid_late_fees <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
First, download the loans_full_schema CSV file and save it in a folder named data-sets inside your working directory.
loans_full_schema <- read.csv(file='data-sets/loans_full_schema.csv',header=TRUE)
glimpse(loans_full_schema)
## Rows: 10,000
## Columns: 55
## $ emp_title <chr> "global config engineer ", "warehouse…
## $ emp_length <int> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
## $ state <chr> "NJ", "HI", "WI", "PA", "CA", "KY", "…
## $ homeownership <chr> "MORTGAGE", "RENT", "RENT", "RENT", "…
## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34…
## $ verified_income <chr> "Verified", "Not Verified", "Source V…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
## $ annual_income_joint <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
## $ verification_income_joint <chr> "", "", "", "", "Verified", "", "Not …
## $ debt_to_income_joint <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
## $ delinq_2y <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ months_since_last_delinq <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
## $ earliest_credit_line <int> 2001, 1996, 2006, 2007, 2008, 1990, 2…
## $ inquiries_last_12m <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
## $ total_credit_lines <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
## $ open_credit_lines <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ total_credit_limit <int> 70795, 28800, 24193, 25400, 69839, 42…
## $ total_credit_utilized <int> 38767, 4321, 16000, 4997, 52722, 3898…
## $ num_collections_last_12m <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ months_since_90d_late <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
## $ current_accounts_delinq <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ current_installment_accounts <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
## $ accounts_opened_24m <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
## $ num_satisfactory_accounts <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ num_accounts_120d_past_due <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
## $ num_accounts_30d_past_due <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
## $ total_debit_limit <int> 11100, 16500, 4300, 19400, 32700, 272…
## $ num_total_cc_accounts <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
## $ num_open_cc_accounts <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
## $ num_cc_carrying_balance <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
## $ num_mort_accounts <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
## $ account_never_delinq_percent <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
## $ tax_liens <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_purpose <chr> "moving", "debt_consolidation", "othe…
## $ application_type <chr> "individual", "individual", "individu…
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000…
## $ term <int> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
## $ installment <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
## $ grade <chr> "C", "C", "D", "A", "C", "A", "C", "B…
## $ sub_grade <chr> "C3", "C1", "D1", "A3", "C3", "A3", "…
## $ issue_month <chr> "Mar-2018", "Feb-2018", "Feb-2018", "…
## $ loan_status <chr> "Current", "Current", "Current", "Cur…
## $ initial_listing_status <chr> "whole", "whole", "fractional", "whol…
## $ disbursement_method <chr> "Cash", "Cash", "Cash", "Cash", "Cash…
## $ balance <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
## $ paid_total <dbl> 1999.330, 499.120, 281.800, 3312.890,…
## $ paid_principal <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
## $ paid_interest <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
## $ paid_late_fees <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
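Comparing the two glimpse outputs, columns such as state appear as fct (factor) in the package version and as chr (character) in the CSV version. This is because read.csv keeps text columns as character by default in R 4.0.0 and later. The sketch below shows the difference on a tiny temporary file. The file is a stand-in that reuses three state and loan_amount values from the output above, and the object names are illustrative.

```r
# Write a tiny CSV to a temporary file (a stand-in, not the real loans file)
tmp <- tempfile(fileext = ".csv")
writeLines(c("state,loan_amount", "NJ,28000", "HI,5000", "WI,2000"), tmp)

# Default in R 4.0.0 and later: text columns are read as character
as_text <- read.csv(tmp, header = TRUE)
print(class(as_text$state))

# stringsAsFactors = TRUE converts text columns to factors
as_factor <- read.csv(tmp, header = TRUE, stringsAsFactors = TRUE)
print(class(as_factor$state))
print(levels(as_factor$state))
```

Factors are convenient for categorical variables with a fixed set of values, whereas character columns suit free text such as emp_title.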
loans_full_schema[1:3,1:5]
## emp_title emp_length state homeownership annual_income
## 1 global config engineer 3 NJ MORTGAGE 90000
## 2 warehouse office clerk 10 HI RENT 40000
## 3 assembly 3 WI RENT 40000
loans_full_schema[1:3,c("loan_amount","loan_purpose","loan_status")]
## loan_amount loan_purpose loan_status
## 1 28000 moving Current
## 2 5000 debt_consolidation Current
## 3 2000 other Current
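Square brackets take a row selection before the comma and a column selection after it, and the row selection can also be a logical condition. The sketch below is a minimal illustration on a three-row stand-in data frame built from the values shown above. The names loans_small and large_loans and the 5000 threshold are illustrative.

```r
# A three-row stand-in built from the values shown in the output above
loans_small <- data.frame(
  loan_amount  = c(28000, 5000, 2000),
  loan_purpose = c("moving", "debt_consolidation", "other"),
  loan_status  = c("Current", "Current", "Current")
)

# Rows by position, columns by name
print(loans_small[1:2, c("loan_amount", "loan_purpose")])

# Rows by condition: keep loans of at least 5000
large_loans <- loans_small[loans_small$loan_amount >= 5000, ]
print(large_loans)

# Leaving the row position empty keeps every row
print(loans_small[, "loan_amount"])
```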
Use the $ operator to access a specific column of the data.
glimpse(loans_full_schema$annual_income)
## num [1:10000] 90000 40000 40000 30000 35000 34000 35000 110000 65000 30000 ...
glimpse(loans_full_schema[,c("loan_amount","loan_purpose","loan_status")])
## Rows: 10,000
## Columns: 3
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 2000…
## $ loan_purpose <chr> "moving", "debt_consolidation", "other", "debt_consolidat…
## $ loan_status <chr> "Current", "Current", "Current", "Current", "Current", "C…
Use the $ operator to summarize one column.
summary(loans_full_schema$annual_income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 45000 65000 79222 95000 2300000
summary(loans_full_schema[,c("loan_amount","loan_purpose","loan_status")])
## loan_amount loan_purpose loan_status
## Min. : 1000 Length:10000 Length:10000
## 1st Qu.: 8000 Class :character Class :character
## Median :14500 Mode :character Mode :character
## Mean :16362
## 3rd Qu.:24000
## Max. :40000
mean(loans_full_schema$loan_amount)
## [1] 16361.92
sd(loans_full_schema$loan_amount)
## [1] 10301.96
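The loan_amount column has no missing values, but several other columns in the glimpse output, such as emp_length, contain NA. By default mean() and sd() return NA when any value is missing, and the na.rm = TRUE argument tells them to drop missing values first. The sketch below uses the first six emp_length values shown earlier.

```r
# First six emp_length values from the glimpse output above
emp_length <- c(3, 10, 3, 1, 10, NA)

# Without na.rm, a single missing value makes the result NA
print(mean(emp_length))

# With na.rm = TRUE, the missing value is dropped before computing
print(mean(emp_length, na.rm = TRUE))   # 5.4
print(sd(emp_length, na.rm = TRUE))

# Count how many values are missing
print(sum(is.na(emp_length)))           # 1
```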
ggplot(data = loans_full_schema, aes(x = total_credit_limit)) + geom_histogram(bins=60)
You can also use the base R function hist. Run ?hist for details and usage.
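The following is a minimal sketch of a histogram drawn with base R. It uses the built-in faithful data set as a stand-in so that it runs without the loans data. The title, axis label, and number of breaks are illustrative choices.

```r
# faithful ships with base R (datasets package); used here as stand-in data
h <- hist(
  faithful$eruptions,
  breaks = 20,   # a suggested number of bins; R may adjust it
  main   = "Old Faithful eruption durations",
  xlab   = "Eruption time (minutes)"
)

# hist() also returns the bin edges and counts it used
print(h$breaks)
print(h$counts)
```

As with the bins argument of geom_histogram, changing breaks can change how many peaks appear, so it is worth trying a few values.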
ggplot(data = loans_full_schema, aes(x = annual_income, y = total_credit_limit)) + geom_point()
You can also use the base R function plot. Run ?plot for details and usage.
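The following is a minimal sketch of a scatterplot drawn with base R. It again uses the built-in faithful data set as a stand-in so that it runs without the loans data. The title and axis labels are illustrative choices.

```r
# faithful ships with base R (datasets package); used here as stand-in data
plot(
  x    = faithful$eruptions,
  y    = faithful$waiting,
  main = "Old Faithful eruptions",
  xlab = "Eruption time (minutes)",
  ylab = "Waiting time to next eruption (minutes)"
)
```

In this stand-in data the points fall into two visible groups, which is the kind of clustering the exercises below ask you to look for.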
Load the iris dataset. Note that this dataset is in the datasets package, which is already included in the base R installation.
How many rows and columns does this data set have?
Produce 3 scatterplots and describe them. What pattern(s) do you observe in each? Does the relationship look linear or nonlinear? Are any points clustered together?
Produce 3 histograms and describe them. What shape does each distribution show? Does it have one peak or multiple peaks?
Load the longley dataset. Note that this dataset is in the datasets package, which is already included in the base R installation.
How many rows and columns does this data set have?
Produce 3 scatterplots and describe them. What pattern(s) do you observe in each? Does the relationship look linear or nonlinear? Are any points clustered together?
Produce 3 histograms and describe them. What shape does each distribution show? Does it have one peak or multiple peaks?
Load the county dataset from the CSV file.
How many rows and columns does this data set have?
Produce 2 scatterplots and describe them. What pattern(s) do you observe in each? Does the relationship look linear or nonlinear? Are any points clustered together?
Produce 2 histograms and describe them. What shape does each distribution show? Does it have one peak or multiple peaks?