Learning Objectives


Upon completing today’s lab activity, students should be able to do the following using R and RStudio:

  1. Produce basic R Markdown HTML lab report documents and use inline R code within R Markdown.

  2. Produce basic R scripts and run them using the console or within R Markdown.

  3. Produce intermediate plots within R Markdown.

  4. Create custom data frames, create and call functions, use conditional statements, and use for-loops and while-loops.


Basic R Scripting


In RStudio, go to New File -> R Script. Copy and paste the code below into your R script and save it as “example-r-script.R”.

## This is an example R script

# print the text "Hello R!"
print("Hello R!")

# declaring variables, perform math operations, and print the results
x <- 2
y <- 2
z <- x^2 + y^2 # the operator "^" indicates exponents

# print strings and variables while concatenating them together
cat("x^2 + y^2 =", z, "where x =", x, "and y =", y)

You can also use paste to concatenate strings and print them using print.

Text that follows a # character is a comment; R ignores it when running the script.

Now, there are three ways to run the script.

  1. You can run the script line by line by clicking a specific line in the script and then clicking the “Run” button located in the upper right corner of the RStudio script panel.

  2. You can run the entire script by clicking the “Source” button located in the upper right corner of the RStudio script panel.

  3. You can type the source command into the console.

source("1-files/example-r-script.R")
## [1] "Hello R!"
## x^2 + y^2 = 8 where x = 2 and y = 2

Note that you must give the correct path to the R script. In this case, the R script “example-r-script.R” is in the “1-files” directory (folder).


R Markdown


To make reproducible reports, R Markdown provides a quick and easy typesetting method. R Markdown can embed executable R code snippets, including code that produces plots. You can write your report alongside R code snippets using the knitr syntax, then convert the document into common formats such as PDF or HTML. Visit rmarkdown.rstudio.com for more details.


For HTML File

In RStudio, go to New File -> R Markdown. Then, choose “From Template” and then choose “Lab Report” from the list of templates. This template will only exist if you have installed the openintro package.


For PDF File

In RStudio, go to New File -> R Markdown. Then, choose “Document” and then choose “PDF” from the list of output formats. PDF output requires a LaTeX distribution on your computer (for example, TinyTeX, which can be installed through the tinytex R package).

Please put the code snippet - shown below - at the beginning of the Rmd assignment template file if it does not exist.

---
title: '**Lab 1 - MATH 141**'
header-includes: |
  \usepackage{fancyhdr}
  \pagestyle{fancy}
  \fancyhead[CO,C]{Homework 1 - MATH 141}
  \fancyfoot[CO,C]{}
  \fancyfoot[C]{\thepage}
  \usepackage{float}
output:
  bookdown::pdf_document2:
    fig_caption: yes
    toc: no
    number_sections: no
urlcolor: red
---


Data Frames


A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

Characteristics of a data frame:

  • The column names should be non-empty.

  • The row names should be unique.

  • The data stored in a data frame can be of numeric, factor, or character type.

  • Each column should contain the same number of data items.


Creating Data Frames

library(tidyverse)
# Create the data frame.
data_example <- data.frame(
   my_ranking = c(1:5),
   superheroes = c("Spiderman","Shang-Chi","Scarlet Witch","Doctor Strange","Black Panther"),
   subjective_power_scale = c(10000,1000,1000,900,900)
)
# Print the data frame.         
glimpse(data_example)
## Rows: 5
## Columns: 3
## $ my_ranking             <int> 1, 2, 3, 4, 5
## $ superheroes            <chr> "Spiderman", "Shang-Chi", "Scarlet Witch", "Doc…
## $ subjective_power_scale <dbl> 10000, 1000, 1000, 900, 900

The function for creating a data frame in R is data.frame. Each argument defines one column, and the items in each vector fill the rows. Notice that the name given to each argument becomes the column name. The rows automatically receive integer indices. This type of data structure is the same as the one you saw in the previous lab.


Adding Columns

# Add the "subjective_pair" column.
data_example$subjective_pair <- c("Black Panther","Doctor Strange","Shang-Chi","Doctor Strange","Scarlet Witch")
# Add the "enemy" column.
data_example$enemy <- c("Green Goblin","Wenwu","Agatha Harkness","Dormammu","Erik Killmonger")
v <- data_example
glimpse(v)
## Rows: 5
## Columns: 5
## $ my_ranking             <int> 1, 2, 3, 4, 5
## $ superheroes            <chr> "Spiderman", "Shang-Chi", "Scarlet Witch", "Doc…
## $ subjective_power_scale <dbl> 10000, 1000, 1000, 900, 900
## $ subjective_pair        <chr> "Black Panther", "Doctor Strange", "Shang-Chi",…
## $ enemy                  <chr> "Green Goblin", "Wenwu", "Agatha Harkness", "Do…


Adding Rows

# Create the second data frame
data_new <- data.frame(
   my_ranking = c(6:8), 
   superheroes = c("Naruto","Saitama","Black Widow"),
   subjective_power_scale = c(700,700,800), 
   subjective_pair = c("Spiderman","Doctor Strange","Shang-Chi"),
   enemy = c("Sasuke","Unknown","General Dreykov")
)

# Bind the two data frames.
data_example_final <- rbind(data_example,data_new)
glimpse(data_example_final)
## Rows: 8
## Columns: 5
## $ my_ranking             <int> 1, 2, 3, 4, 5, 6, 7, 8
## $ superheroes            <chr> "Spiderman", "Shang-Chi", "Scarlet Witch", "Doc…
## $ subjective_power_scale <dbl> 10000, 1000, 1000, 900, 900, 700, 700, 800
## $ subjective_pair        <chr> "Black Panther", "Doctor Strange", "Shang-Chi",…
## $ enemy                  <chr> "Green Goblin", "Wenwu", "Agatha Harkness", "Do…

The function for adding rows to an existing data frame in R is rbind; its inputs must be data frames with the same columns.


Conditional Statements


A conditional statement defines one or more conditions for the program to evaluate, together with the statements to run if a condition is true and, optionally, other statements to run if it is false.


if Statement

x <- 5
if (x > 0) {
  print("positive number")
}
## [1] "positive number"


if-else statement

x <- -5
if (x > 0) {
  print("positive number")
} else {
  print("negative number")
}
## [1] "negative number"


else-if statement

x <- 0
if (x > 0) {
  print("positive number")
} else if (x == 0) {
  print("zero")
} else {
  print("negative number")
}
## [1] "zero"


Conditional Subsetting

The code snippets below use the iris data set, which is included in the base R installation.


  • Option A
# subsetting by category
setosa_sub <- iris[iris$Species == "setosa", ]
glimpse(setosa_sub)
## Rows: 50
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting numerical values
sepal_length_sub <- iris[iris$Sepal.Length >= 4, ]
glimpse(sepal_length_sub)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting numerical values within a range
sepal_length_sub_range <- iris[iris$Sepal.Length >= 5.10 & iris$Sepal.Length <= 6.40, ]
glimpse(sepal_length_sub_range)
## Rows: 83
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 5.4, 5.4, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 5.…
## $ Sepal.Width  <dbl> 3.5, 3.9, 3.7, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.7, 1.5, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.4, 0.2, 0.2, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.4, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…


  • Option B
# subsetting by category
setosa_sub<- subset(iris, Species == "setosa")
glimpse(setosa_sub)
## Rows: 50
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting numerical values
sepal_length_sub <- subset(iris, Sepal.Length >= 4)
glimpse(sepal_length_sub)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting numerical values within a range
sepal_length_sub_range <- subset(iris, Sepal.Length >= 5.10 & Sepal.Length <= 6.40)
glimpse(sepal_length_sub_range)
## Rows: 83
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 5.4, 5.4, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 5.…
## $ Sepal.Width  <dbl> 3.5, 3.9, 3.7, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.7, 1.5, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.4, 0.2, 0.2, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.4, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…


Loops


You may find yourself needing to run a block of code multiple times. Statements are normally executed in order: the first statement runs first, then the second, and so on. A loop lets you repeat a block of statements without writing it out again.


for-loops

A for-loop iterates over the elements of a vector or list, one at a time, until it reaches the last element.


  • Iterating through a vector of integers
for(i in 1:10) { # Head of for-loop
 
  x1 <- i^2      # Code block where each integer is squared
  print(x1)      # Print results
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100


  • Iterating through a vector of strings
letters_vector = c('A','B','C','D','E')
for(i in letters_vector) { # loop over character vector
  cat("My answer is", i, "for sure. \n")  # concatenate strings while printing in every new line
}
## My answer is A for sure. 
## My answer is B for sure. 
## My answer is C for sure. 
## My answer is D for sure. 
## My answer is E for sure.

You can also use paste to concatenate strings and print them using print.


  • Storing iterated results in a vector by appending.
x <- numeric() # Create an empty numeric vector

for(i in 1:10) {  # Head of for-loop
  x <- c(x, i^2)  # Append the square of each integer
}
print(x)
##  [1]   1   4   9  16  25  36  49  64  81 100


  • Iterating through a vector and stopping early once a condition is met.
x <- numeric() # Create an empty numeric vector

for(i in 1:100) {  # Head of for-loop
  x <- c(x, i^2)  # Append the square of each integer
  
  # conditional statement: if i^2 > 2000, the loop will stop
  if (i^2 > 2000) {
    break
  }
}
print(x)
##  [1]    1    4    9   16   25   36   49   64   81  100  121  144  169  196  225
## [16]  256  289  324  361  400  441  484  529  576  625  676  729  784  841  900
## [31]  961 1024 1089 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849 1936 2025


while-loops

A while-loop is a method of iterating while a condition is true. It checks the condition before each pass through the loop body and repeats the body as long as the condition holds.


  • Iterating until a condition is met or not met.
i <- 1
while (i < 6) { # iterate while i is less than 6
  print(i)
  i <- i + 1 # add 1 in each iteration; without this line the loop never ends
  print(i)
}
## [1] 1
## [1] 2
## [1] 2
## [1] 3
## [1] 3
## [1] 4
## [1] 4
## [1] 5
## [1] 5
## [1] 6

The for-loop and while-loop can do similar jobs, but a for-loop runs once per element of a sequence, whereas a while-loop keeps running until its condition becomes false.


Creating and Calling Functions


A function is a collection of statements that work together to accomplish a specified goal. R comes with a vast variety of built-in functions, and users can also construct their own.

A function in R is an object. When you call it, the R interpreter passes control to the function along with any arguments it needs.

The function then carries out its task and returns control to the interpreter, together with its result.

R function syntax:

function_1 <- function(...) {
   # put operations here
}


Function with one variable.

# Create a function to print squares of numbers in sequence.
function_1 <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b) # only prints the output
   }
}

# Call the function function_1, supplying 6 as an argument.
function_1(6)
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36


Function with two variables.

# Create a function to print the sum of squares of two numbers.
function_2 <- function(a,b) {
   c <- a^2 + b^2
   print(c) # only prints the output
}

# Call the function, supplying 6 for both arguments.
function_2(6,6)
## [1] 72


Function using Return

# Create a function to return the sum of squares of two numbers.
function_2 <- function(a,b) {
   c <- a^2 + b^2
   return(c) # returns value c
}

# Call the function, supplying 6 for both arguments.
function_2(6,6)
## [1] 72


Intermediate Figures


Scatter Plots with Groups

The figures below are generated using the iris data set, which is part of the base R installation.


  • Using Base R.
plot(iris$Sepal.Length, iris$Sepal.Width, # x and y data
     pch=21, # dot design
     bg=c("red","green3","blue")[unclass(iris$Species)], # color for each group
     main="Edgar Anderson's Iris Data",  # plot title
     xlab = "Sepal Length", # x label
     ylab = "Sepal Width") # y label
legend('topright', levels(iris$Species), pt.bg=c("red","green3","blue"), pch=21) # pt.bg fills the legend symbols like the points


  • Using ggplot2.
library(ggplot2)
scatter <- ggplot(data=iris, aes(x = Sepal.Length, y = Sepal.Width)) # using the iris data
scatter + geom_point(aes(color=Species, shape=Species), size = 1.5) + # scatter plot
  xlab("Sepal Length") +  ylab("Sepal Width") + # x and y label
  ggtitle("Edgar Anderson's Iris Data")


Histograms with Groups

The figures below are generated using the iris data set, which is part of the base R installation.


  • Using ggplot2.
library(ggplot2)
histogram <- ggplot(data=iris, aes(x=Sepal.Length))
histogram + geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) + 
  xlab("Sepal Length") +  ylab("Frequency") + ggtitle("Histogram of Sepal Length")


Boxplots with Groups

The figures below are generated using the iris data set, which is part of the base R installation.


  • Using ggplot2.
library(ggplot2)
boxplot <- ggplot(data=iris, aes(x=Sepal.Length))
boxplot + geom_boxplot(color="black", aes(fill=Species)) + 
  xlab("Sepal Length") + ggtitle("Boxplot of Sepal Length") + 
  theme(axis.ticks.y = element_blank(), # it removes the y axis ticks and texts
        axis.text.y = element_blank())


Line Plots with Groups


The plots below are generated using synthetic data.

  • Using ggplot2.
# create synthetic data
groups <- c("A","B")
x_vals_seq <- seq(0, 1, length.out = 9) # generate a sequence of equally spaced numbers
x_values <- rep(x_vals_seq,2) # replicate vector twice
y_values <- c(1, 2, 2, 4, 5, 4, 4, 3, 1, 2, 4, 4, 8, 10, 8, 8, 6, 2)
df2 <- data.frame(letter=rep(groups, each=9), # replicate A and B 9 times each
                  x=x_values, # independent variable (already replicated for both groups)
                  y=y_values) # dependent variable
glimpse(df2)
## Rows: 18
## Columns: 3
## $ letter <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"…
## $ x      <dbl> 0.000, 0.125, 0.250, 0.375, 0.500, 0.625, 0.750, 0.875, 1.000, …
## $ y      <dbl> 1, 2, 2, 4, 5, 4, 4, 3, 1, 2, 4, 4, 8, 10, 8, 8, 6, 2
library(ggplot2)
p<-ggplot(df2, aes(x=x, y=y, group=letter)) +
  geom_line(aes(color=letter))+
  geom_point(aes(color=letter))
p


Lab Exercises


I. Linear Function

Consider the linear function written below with parameters \(a\) and \(b\).

\[y(x;a,b) = ax + b\] where \(x\) is the independent variable and \(y\) is the dependent variable. Here, the slope of the line is \(a\) and the intercept is \(b\).

  1. Write an R function which outputs the dependent variable \(y\) and takes in the independent variable \(x\), and the parameters \(a\) and \(b\).

  2. Using 3 different values of \(a\) and \(b\) (A:\(a=1,b=1\), B:\(a=1.25,b=1.2\), and C:\(a=2,b=1.5\)), create a dataframe with columns “line”, “x”, and “y”. Use the seq command to create a vector of \(x\) values from 0 to 1 with length 10. Use the rep command to generate \(x\) values for the three groups. Use your R function to generate \(y\) values using your \(x\) values as inputs. The “line” column should contain the line groups A, B, and C. Use glimpse to show your dataframe.

  3. Use ggplot to plot the lines on the same figure with proper line group labels.


II. United States Counties

Consider the county data set, which can be found in the usdata R package. Also, this dataset can be accessed as a csv file, county.

  1. Use ggplot to plot histograms of the “pop_change” variable with categories from the “metro” variable. Make sure to label the x and y axes properly. What can you say about the change in population depending on whether or not the county has access to a metro?

  2. Use ggplot to plot histograms of “per_capita_income” variable with categories from the “median_edu” variable. Make sure to label the x and y axis properly. Describe the distributions between the levels of the “median_edu” variable. Is there a strong association between “median_edu” and “per_capita_income”? Explain why.

  3. Use ggplot to plot a scatterplot of the “unemployment_rate” versus “poverty” variables with categories from the “median_edu” variable. Make sure to label the x and y axes properly. Is there an association between “unemployment_rate” and “poverty”? Are there any differences across the “median_edu” levels?

  4. Create a subset of the data where we only take rows with the states Washington, Oregon, and California.

  5. Using the subset you just created, create boxplots of “pop_change” variable with categories from the “state” variable. Make sure to label the x and y axis properly. Based on the medians shown on the boxplots, which state has the lowest and highest population change?

  6. Using the subset you just created, create boxplots of “per_capita_income” variable with categories from the “median_edu” variable. Make sure to label the x and y axis properly. Based on the medians shown on the boxplots, which state has the lowest and highest per capita income?


---
title: "1 - R scripting and R Markdown"
author: "Alex John Quijano"
date: "09/07/2021"
output: openintro::lab_report
---

## **Learning Objectives**

<br>

Upon completing today's lab activity, students should be able to do the following using R and RStudio:

  1. Produce basic R Markdown HTML lab report documents and use inline R code within R Markdown.
  
  2. Produce basic R scripts and run them using the console or within R Markdown.
  
  3. Produce intermediate plots within R Markdown.
  
  4. Create custom data frames, create and call functions, use conditional statements, and use for-loops and while-loops.
  
<br>

## **Basic R Scripting**

<br>

In RStudio, go to New File -> R Script. Copy and paste the code below into your R script and save it as "example-r-script.R".

```
## This is an example R script

# print the text "Hello R!"
print("Hello R!")

# declaring variables, perform math operations, and print the results
x <- 2
y <- 2
z <- x^2 + y^2 # the operator "^" indicates exponents

# print strings and variables while concatenating them together
cat("x^2 + y^2 =", z, "where x =", x, "and y =", y)
```

You can also use `paste` to concatenate strings and print them using `print`.
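
As a minimal sketch of that alternative, the example below builds the same message with `paste` and then prints it. The variable name `msg` is chosen here only for illustration.

```r
x <- 2
y <- 2
z <- x^2 + y^2

# paste() joins its arguments into one string, separated by a space by default
msg <- paste("x^2 + y^2 =", z, "where x =", x, "and y =", y)

# print() shows the string with quotes and an index, unlike cat()
print(msg)
```

Storing the message in a variable first is convenient when the same text is needed again later, for example as a plot title.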

Text that follows a `#` character is a comment; R ignores it when running the script.

Now, there are three ways to run the script.

  1. You can run the script line by line by clicking a specific line in the script and then clicking the "Run" button located in the upper right corner of the RStudio script panel.
  
  2. You can run the entire script by clicking the "Source" button located in the upper right corner of the RStudio script panel.
  
  3. You can type the `source` command into the console.
  
```{r sourcing-script, message=FALSE}
source("1-files/example-r-script.R")
```

Note that you must give the correct path to the R script. In this case, the R script "example-r-script.R" is in the "1-files" directory (folder).
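
One way to guard against a wrong path is to check that the file exists before sourcing it. The sketch below assumes the same "1-files" folder and file name used above; `script_path` is a name introduced here for illustration.

```r
# Build the path from its parts; file.path() joins them with "/"
script_path <- file.path("1-files", "example-r-script.R")

if (file.exists(script_path)) {
  source(script_path)  # run the script only if it was found
} else {
  # report where R is currently looking, which often reveals the problem
  message("Cannot find ", script_path, "; the working directory is ", getwd())
}
```

If the message appears, the working directory is probably not the folder that contains "1-files".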

<br>

## **R Markdown**

<br>

To make reproducible reports, R Markdown provides a quick and easy typesetting method. R Markdown can embed executable R code snippets, including code that produces plots. You can write your report alongside R code snippets using the knitr syntax, then convert the document into common formats such as PDF or HTML. Visit [rmarkdown.rstudio.com](https://rmarkdown.rstudio.com){target="_blank"} for more details.

<br>

### For HTML File

In RStudio, go to New File -> R Markdown. Then, choose “From Template” and then choose "Lab Report" from the list of templates. This template will only exist if you have installed the `openintro` package.

<br>

### For PDF File

In RStudio, go to New File -> R Markdown. Then, choose “Document” and then choose "PDF" from the list of output formats. PDF output requires a LaTeX distribution on your computer (for example, TinyTeX, which can be installed through the `tinytex` R package).

Please put the code snippet - shown below - at the beginning of the Rmd assignment template file if it does not exist.

```
---
title: '**Lab 1 - MATH 141**'
header-includes: |
  \usepackage{fancyhdr}
  \pagestyle{fancy}
  \fancyhead[CO,C]{Homework 1 - MATH 141}
  \fancyfoot[CO,C]{}
  \fancyfoot[C]{\thepage}
  \usepackage{float}
output:
  bookdown::pdf_document2:
    fig_caption: yes
    toc: no
    number_sections: no
urlcolor: red
---
```

<br>

## **Data Frames**

<br>

A **data frame** is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

Characteristics of a data frame:

  * The column names should be non-empty.
  
  * The row names should be unique.
  
  * The data stored in a data frame can be of numeric, factor, or character type.

  * Each column should contain the same number of data items.
  
<br>

### Creating Data Frames

```{r creating-data-frames, message=FALSE}
library(tidyverse)
# Create the data frame.
data_example <- data.frame(
   my_ranking = c(1:5),
   superheroes = c("Spiderman","Shang-Chi","Scarlet Witch","Doctor Strange","Black Panther"),
   subjective_power_scale = c(10000,1000,1000,900,900)
)
# Print the data frame.			
glimpse(data_example)
```

The function for creating a data frame in R is `data.frame`. Each argument defines one column, and the items in each vector fill the rows. Notice that the name given to each argument becomes the column name. The rows automatically receive integer indices. This type of data structure is the same as the one you saw in the previous lab.
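
To make the link between arguments, columns, and rows concrete, here is a small sketch using base R only. The data frame `heroes` is a shortened, hypothetical copy of the example above.

```r
heroes <- data.frame(
  my_ranking = 1:3,
  superheroes = c("Spiderman", "Shang-Chi", "Scarlet Witch"),
  subjective_power_scale = c(10000, 1000, 1000)
)

names(heroes)             # column names come from the argument names
nrow(heroes)              # number of rows
heroes$superheroes        # one column, returned as a vector
heroes[2, ]               # the second row, all columns
heroes[2, "superheroes"]  # a single cell: row 2 of the superheroes column
```

The `$` operator selects a column by name, while square brackets select by row and column position or name.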

<br>

### Adding Columns

```{r adding-column-data-frame, message=FALSE}
# Add the "subjective_pair" column.
data_example$subjective_pair <- c("Black Panther","Doctor Strange","Shang-Chi","Doctor Strange","Scarlet Witch")
# Add the "enemy" column.
data_example$enemy <- c("Green Goblin","Wenwu","Agatha Harkness","Dormammu","Erik Killmonger")
v <- data_example
glimpse(v)
```

<br>

### Adding Rows

```{r adding-row-data-fram, message=FALSE}
# Create the second data frame
data_new <- data.frame(
   my_ranking = c(6:8), 
   superheroes = c("Naruto","Saitama","Black Widow"),
   subjective_power_scale = c(700,700,800), 
   subjective_pair = c("Spiderman","Doctor Strange","Shang-Chi"),
   enemy = c("Sasuke","Unknown","General Dreykov")
)

# Bind the two data frames.
data_example_final <- rbind(data_example,data_new)
glimpse(data_example_final)
```

The function for adding rows to an existing data frame in R is `rbind`; its inputs must be data frames with the same columns.
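
A quick check of the column names before binding can catch mismatches early. The sketch below uses two small hypothetical data frames, `df_a` and `df_b`, whose columns are the same but listed in a different order.

```r
df_a <- data.frame(id = 1:2, name = c("a", "b"))
df_b <- data.frame(name = c("c", "d"), id = 3:4)  # same columns, different order

# rbind() matches data frame columns by name, so only the set of names must agree
same_columns <- setequal(names(df_a), names(df_b))

if (same_columns) {
  combined <- rbind(df_a, df_b)
  print(combined)
} else {
  message("The data frames do not share the same columns.")
}
```

The combined data frame keeps the column order of the first input.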

<br>

## **Conditional Statements**

<br>

A conditional statement defines one or more conditions for the program to evaluate, together with the statements to run if a condition is true and, optionally, other statements to run if it is false.

<br>

### if Statement

```{r if-statement, message=FALSE}
x <- 5
if (x > 0) {
  print("positive number")
}
```

<br>

### if-else statement

```{r if-else-statement, message=FALSE}
x <- -5
if (x > 0) {
  print("positive number")
} else {
  print("negative number")
}
```

<br>

### else-if statement

```{r else-if-statement, message=FALSE}
x <- 0
if (x > 0) {
  print("positive number")
} else if (x == 0) {
  print("zero")
} else {
  print("negative number")
}
```
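
To see all three branches run, the same else-if chain can be tested on several values. This is only a sketch; the names `test_values`, `labels`, and `label` are introduced here for illustration.

```r
test_values <- c(5, 0, -5)
labels <- character()  # empty character vector to collect the results

for (x in test_values) {
  if (x > 0) {
    label <- "positive number"
  } else if (x == 0) {
    label <- "zero"
  } else {
    label <- "negative number"
  }
  labels <- c(labels, label)  # append the label for this value
}

print(labels)
```

Each value falls into exactly one branch, because R stops checking conditions as soon as one of them is true.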

<br>

### Conditional Subsetting

The code snippets below use the `iris` data set, which is included in the base R installation.

<br>

  * Option A

```{r subsetting-data-frame-using-conditions-1, message=FALSE}
# subsetting by category
setosa_sub <- iris[iris$Species == "setosa", ]
glimpse(setosa_sub)

# subsetting numerical values
sepal_length_sub <- iris[iris$Sepal.Length >= 4, ]
glimpse(sepal_length_sub)

# subsetting numerical values within a range
sepal_length_sub_range <- iris[iris$Sepal.Length >= 5.10 & iris$Sepal.Length <= 6.40, ]
glimpse(sepal_length_sub_range)
```

<br>

  * Option B
  
```{r subsetting-data-frame-using-conditions-2, message=FALSE}
# subsetting by category
setosa_sub<- subset(iris, Species == "setosa")
glimpse(setosa_sub)

# subsetting numerical values
sepal_length_sub <- subset(iris, Sepal.Length >= 4)
glimpse(sepal_length_sub)

# subsetting numerical values within a range
sepal_length_sub_range <- subset(iris, Sepal.Length >= 5.10 & Sepal.Length <= 6.40)
glimpse(sepal_length_sub_range)
```
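
Options A and B are expected to select the same rows here. A short sketch for checking this on the range example is shown below; `opt_a` and `opt_b` are names chosen for illustration.

```r
# Option A: logical indexing with square brackets
opt_a <- iris[iris$Sepal.Length >= 5.10 & iris$Sepal.Length <= 6.40, ]

# Option B: the subset() function
opt_b <- subset(iris, Sepal.Length >= 5.10 & Sepal.Length <= 6.40)

nrow(opt_a) == nrow(opt_b)                          # same number of rows?
all.equal(opt_a$Sepal.Length, opt_b$Sepal.Length)   # same values?
```

One difference to keep in mind: when the condition evaluates to `NA` for some rows, square-bracket indexing returns rows filled with `NA`, while `subset` drops them. The `iris` data set has no missing values, so the two options agree here.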

<br>

## **Loops**

<br>

You may find yourself needing to run a block of code multiple times. Statements are normally executed in order: the first statement runs first, then the second, and so on. A loop lets you repeat a block of statements without writing it out again.

<br>

### for-loops

A **for-loop** iterates over the elements of a vector or list, one at a time, until it reaches the last element.

<br>

  * Iterating through a vector of integers

```{r for-loop-1, message=FALSE}
for(i in 1:10) { # Head of for-loop
 
  x1 <- i^2      # Code block where each integer is squared
  print(x1)      # Print results
}
```

<br>

  * Iterating through a vector of strings
  
```{r for-loop-2, message=FALSE}
letters_vector = c('A','B','C','D','E')
for(i in letters_vector) { # loop over character vector
  cat("My answer is", i, "for sure. \n")  # concatenate strings while printing in every new line
}
```

You can also use `paste` to concatenate strings and print them using `print`.

<br>

  * Storing iterated results in a vector by appending.
  
```{r for-loop-3, message=FALSE}
x <- numeric() # Create an empty numeric vector

for(i in 1:10) {  # Head of for-loop
  x <- c(x, i^2)  # Append the square of each integer
}
print(x)
```
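
Appending works well for short loops. When the final length is known in advance, a common alternative is to create the vector at full size first and fill it by position. The sketch below shows that approach, plus a vectorized version without a loop; `n`, `squares`, and `squares_vec` are names introduced here.

```r
n <- 10
squares <- numeric(n)  # a vector of n zeros, created once

for (i in seq_len(n)) {
  squares[i] <- i^2    # fill position i instead of appending
}
print(squares)

# The same result without a loop, since ^ works on whole vectors
squares_vec <- (1:n)^2
print(squares_vec)
```

Filling a preallocated vector avoids copying the vector on every iteration, which matters more as the number of iterations grows.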

  * Iterating through a vector and stopping early once a condition is met.
  
```{r for-loop-4, message=FALSE}
x <- numeric() # Create an empty numeric vector

for(i in 1:100) {  # Head of for-loop
  x <- c(x, i^2)  # Append the square of each integer
  
  # conditional statement: if i^2 > 2000, the loop will stop
  if (i^2 > 2000) {
    break
  }
}
print(x)
```

<br>

### while-loops

A **while-loop** is a method of iterating while a condition is true. It checks the condition before each pass through the loop body and repeats the body as long as the condition holds.

<br>

  * Iterating until a condition is met or not met.

```{r while-loop-1, message=FALSE}
i <- 1
while (i < 6) { # iterate while i is less than 6
  print(i)
  i <- i + 1 # add 1 in each iteration; without this line the loop never ends
  print(i)
}
```

The for-loop and while-loop can do similar jobs, but a for-loop runs once per element of a sequence, whereas a while-loop keeps running until its condition becomes false.
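
As a side-by-side sketch, the two loops below print the numbers 1 to 5. The counter name `j` is chosen here only to keep the two loops separate.

```r
# for-loop: the sequence 1:5 decides how many iterations run
for (i in 1:5) {
  print(i)
}

# while-loop: we manage the counter and the stopping condition ourselves
j <- 1
while (j <= 5) {
  print(j)
  j <- j + 1  # update the counter so the condition eventually becomes FALSE
}
```

A for-loop is usually the simpler choice when the number of iterations is known beforehand; a while-loop fits cases where the stopping point depends on something computed inside the loop.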

<br>

## **Creating and Calling Functions**

<br>

A function is a collection of statements that work together to accomplish a specified goal. R comes with a vast variety of built-in functions, and users can also construct their own.

A function in R is an object. When you call it, the R interpreter passes control to the function along with any arguments it needs.

The function then carries out its task and returns control to the interpreter, together with its result.

*R function syntax:*

```
function_1 <- function(...) {
   # put operations here
}
```

<br>

### Function with one variable.

```{r functions-1, message=FALSE}
# Create a function to print squares of numbers in sequence.
function_1 <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b) # only prints the output
   }
}

# Call the function function_1, supplying 6 as an argument.
function_1(6)
```

<br>

### Function with two variables.
  
```{r functions-2, message=FALSE}
# Create a function to print the sum of squares of two numbers.
function_2 <- function(a,b) {
   c <- a^2 + b^2
   print(c) # only prints the output
}

# Call the function, supplying 6 for both arguments.
function_2(6,6)
```
  
<br>

### Function using Return

```{r functions-3, message=FALSE}
# Create a function to return the sum of squares of two numbers.
function_2 <- function(a,b) {
   c <- a^2 + b^2
   return(c) # returns value c
}

# Call the function, supplying 6 for both arguments.
function_2(6,6)
```
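
A returned value can be stored in a variable and used in later calculations. The sketch below illustrates this with a hypothetical function `sum_of_squares`, which mirrors `function_2` above.

```r
# Return the sum of squares of two numbers
sum_of_squares <- function(a, b) {
  return(a^2 + b^2)
}

total <- sum_of_squares(3, 4)  # store the returned value
hypotenuse <- sqrt(total)      # reuse it in another calculation
print(hypotenuse)
```

Writing functions that return their result, rather than only printing it, makes them easier to combine with other code.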
  
<br>

## **Intermediate Figures**

<br>

### Scatter Plots with Groups

The figures below are generated using the `iris` data set, which is part of the base R installation.

<br>

  * Using Base R.

```{r figures-int-1, message = FALSE, fig.width = 5, fig.pos="H"}
plot(iris$Sepal.Length, iris$Sepal.Width, # x and y data
     pch=21, # dot design
     bg=c("red","green3","blue")[unclass(iris$Species)], # color for each group
     main="Edgar Anderson's Iris Data",  # plot title
     xlab = "Sepal Length", # x label
     ylab = "Sepal Width") # y label
legend('topright', levels(iris$Species), pt.bg=c("red","green3","blue"), pch=21) # pt.bg fills the legend symbols like the points
```

<br>

  * Using `ggplot2`.
  
```{r figures-int-2, message = FALSE, fig.width = 6, fig.pos="H"}
library(ggplot2)
scatter <- ggplot(data=iris, aes(x = Sepal.Length, y = Sepal.Width)) # using the iris data
scatter + geom_point(aes(color=Species, shape=Species), size = 1.5) + # scatter plot
  xlab("Sepal Length") +  ylab("Sepal Width") + # x and y label
  ggtitle("Edgar Anderson's Iris Data")
```

<br>

### Histograms with Groups

The figures below are generated using the `iris` data set, which is part of the base R installation.

<br>

  * Using `ggplot2`.

```{r figures-int-4, message = FALSE, fig.width = 6, fig.pos="H"}
library(ggplot2)
histogram <- ggplot(data=iris, aes(x=Sepal.Length))
histogram + geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) + 
  xlab("Sepal Length") +  ylab("Frequency") + ggtitle("Histogram of Sepal Length")
```  

<br>

### Boxplots with Groups

The figures below are generated using the `iris` data set, which is part of the base R installation.

<br>

  * Using `ggplot2`.

```{r figures-int-5, message = FALSE, fig.width = 6, fig.pos="H"}
library(ggplot2)
boxplot <- ggplot(data=iris, aes(x=Sepal.Length))
boxplot + geom_boxplot(color="black", aes(fill=Species)) + 
  xlab("Sepal Length") + ggtitle("Boxplot of Sepal Length") + 
  theme(axis.ticks.y = element_blank(), # it removes the y axis ticks and texts
        axis.text.y = element_blank())
```  
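
A common variation places the grouping variable on the x axis and the measurement on the y axis, which gives one vertical box per group. The chunk below is a sketch of that layout using the same `iris` data; the object name `boxplot_by_group` is hypothetical.

```{r figures-int-5-sketch, message = FALSE, fig.width = 6, fig.pos="H"}
library(ggplot2)
# Sketch: grouping variable on the x axis, measurement on the y axis.
boxplot_by_group <- ggplot(data = iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(color = "black", aes(fill = Species)) + # one box per species
  xlab("Species") + ylab("Sepal Length") + # x and y labels
  ggtitle("Boxplot of Sepal Length by Species")
boxplot_by_group
```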

<br>

### Line Plots with Groups

<br>

The plots below are generated using synthetic data.

  * Using `ggplot2`.

```{r synthetic-data, message = FALSE}
# create synthetic data
groups <- c("A","B")
x_vals_seq <- seq(0, 1, length.out = 9) # generate a sequence of equally spaced numbers
x_values <- rep(x_vals_seq, 2) # replicate the sequence once per group (18 values)
y_values <- c(1, 2, 2, 4, 5, 4, 4, 3, 1, 2, 4, 4, 8, 10, 8, 8, 6, 2)
df2 <- data.frame(letter=rep(groups, each=9), # replicate A and B 9 times each
                  x=x_values, # independent variable (already replicated above)
                  y=y_values) # dependent variable
glimpse(df2)
```
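
When building grouped data like this, it helps to check how `rep` behaves with `times` versus `each`, and to confirm that all columns have the same length. The chunk below is a small sketch using hypothetical names (`demo_groups`, `demo_x`, `demo_df`) and a shorter sequence than the one above.

```{r rep-sketch, message = FALSE}
demo_groups <- c("A", "B")

# times repeats the whole vector; each repeats every element in turn.
labels_by_times <- rep(demo_groups, times = 3) # "A" "B" "A" "B" "A" "B"
labels_by_each <- rep(demo_groups, each = 3)   # "A" "A" "A" "B" "B" "B"

# Three x values, repeated once per group so that each group gets the same x grid.
demo_x <- rep(seq(0, 1, length.out = 3), times = 2) # 0.0 0.5 1.0 0.0 0.5 1.0

# Both columns have length 6, so the data frame has 6 rows.
demo_df <- data.frame(letter = labels_by_each, x = demo_x)
nrow(demo_df) # 6
```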

```{r figures-int-6, message = FALSE, fig.width = 6, fig.pos="H"}
library(ggplot2)
p<-ggplot(df2, aes(x=x, y=y, group=letter)) +
  geom_line(aes(color=letter))+
  geom_point(aes(color=letter))
p
```

<br>

## **Lab Exercises**

<br>

### I. Linear Function

Consider the linear function written below with parameters $a$ and $b$.
  
  $$y(x;a,b) = ax + b$$
  where $x$ is the independent variable and $y$ is the dependent variable. Here, the slope of the line is $a$ and the intercept is $b$. 
  
  1. Write an R function that takes the independent variable $x$ and the parameters $a$ and $b$ as inputs and returns the dependent variable $y$.
  
  2. Using three different pairs of values for $a$ and $b$ (A: $a=1,b=1$, B: $a=1.25,b=1.2$, and C: $a=2,b=1.5$), create a data frame with columns "line", "x", and "y". Use the `seq` command to create a vector of 10 equally spaced $x$ values from 0 to 1. Use the `rep` command to generate $x$ values for the three groups. Use your R function to generate $y$ values using your $x$ values as inputs. The "line" column should contain the line groups A, B, and C. Use `glimpse` to show your data frame.
  
  3. Use `ggplot` to plot the lines on the same figure with proper line group labels.

<br>

### II. United States Counties

Consider the `county` data set, which can be found in the `usdata` R package. The data set is also available as a CSV file, [county](data-sets/county.csv){target="_blank"}.

  1. Use `ggplot` to plot histograms of the "pop\_change" variable with categories from the "metro" variable. Make sure to label the x and y axes properly. How does the change in population differ between counties that have a metro area and counties that do not?
  
  2. Use `ggplot` to plot histograms of the "per\_capita\_income" variable with categories from the "median\_edu" variable. Make sure to label the x and y axes properly. Describe how the distributions differ across the levels of the "median\_edu" variable. Is there a strong association between "median\_edu" and "per\_capita\_income"? Explain why.
  
  3. Use `ggplot` to plot a scatterplot of the "unemployment\_rate" versus "poverty\_rate" variables with categories from the "median\_edu" variable. Make sure to label the x and y axes properly. Is there an association between "unemployment\_rate" and "poverty\_rate"? Are there any differences across the "median\_edu" levels?
  
  4. Create a subset of the data that contains only the rows for the states Washington, Oregon, and California.
  
  5. Using the subset you just created, create boxplots of the "pop\_change" variable with categories from the "state" variable. Make sure to label the x and y axes properly. Based on the medians shown on the boxplots, which state has the lowest population change and which has the highest?
  
  6. Using the subset you just created, create boxplots of the "per\_capita\_income" variable with categories from the "median\_edu" variable. Make sure to label the x and y axes properly. Based on the medians shown on the boxplots, which "median\_edu" level has the lowest per capita income and which has the highest?
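
For the subsetting step in item 4, one possible approach is base R's `subset` combined with the `%in%` operator. The chunk below is only a sketch on the `iris` data rather than the `county` data; the names `keep_species` and `iris_subset` are hypothetical, and the same pattern would need to be adapted to the "state" variable.

```{r subset-sketch, message = FALSE}
# Hypothetical choice of groups to keep, for illustration only.
keep_species <- c("setosa", "virginica")

# Keep only the rows whose Species is one of the chosen levels.
iris_subset <- subset(iris, Species %in% keep_species)

# Drop the now-unused factor level so it does not appear in plot legends.
iris_subset$Species <- droplevels(iris_subset$Species)

nrow(iris_subset)           # 100
levels(iris_subset$Species) # "setosa" "virginica"
```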

<br>
