Upon completing today’s lab activity, students should be able to do the following using R and RStudio:
Produce basic R Markdown HTML lab report documents and use inline R code within R Markdown.
Produce basic R scripts and run them using the console or within R Markdown.
Produce intermediate plots within R Markdown.
Create custom data frames, create and call functions, use conditional statements, and use for-loops and while-loops.
In RStudio, go to New File -> R Script. Copy and paste the code below into your R script and save it as “example-r-script.R”.
## This is an example R script
# print the text "Hello R!"
print("Hello R!")
# declare variables, perform math operations, and print the results
x <- 2
y <- 2
z <- x^2 + y^2 # the operator "^" indicates exponents
# print strings and variables while concatenating them together
cat("x^2 + y^2 =", z, "where x =", x, "and y =", y)
You can also use paste to concatenate strings and print them using print.
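As a small sketch of the paste alternative, the lines below rebuild the same message with paste and pass it to print. The variable name msg is only illustrative and is not part of the example script.

```r
x <- 2
y <- 2
z <- x^2 + y^2

# paste() joins its arguments into one string, separated by spaces by default
msg <- paste("x^2 + y^2 =", z, "where x =", x, "and y =", y)

# print() shows the string with an index and quotes, unlike cat()
print(msg)
```

Note that print displays the result as a quoted string with a leading [1], while cat writes the bare text.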
Text that follows a # character is a comment, and R will ignore it.
Now, there are three ways to run the script.
You can run the script line by line by clicking a specific line in the script and then clicking the “Run” button located in the upper right corner of the RStudio panel.
You can run the entire script by clicking the “Source” button located in the upper right corner of the RStudio panel.
You can type the source command into the console.
source("1-files/example-r-script.R")
## [1] "Hello R!"
## x^2 + y^2 = 8 where x = 2 and y = 2
Note that you must give the correct path to the R script, relative to your working directory. In this case, the R script “example-r-script.R” is in the “1-files” directory or folder.
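One way to guard against a wrong path is to check that the file exists before sourcing it. This is only a sketch: the path is the one used in the example above, script_path is an illustrative name, and you would adjust the folder to match your own project.

```r
# Path from the example above; change it to match your own folders
script_path <- file.path("1-files", "example-r-script.R")

if (file.exists(script_path)) {
  source(script_path)   # run the script if it was found
} else {
  # report where R was looking, which usually reveals the problem
  cat("Cannot find", script_path, "from the working directory", getwd(), "\n")
}
```

If the file is not found, compare the printed working directory with the location of your script.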
R Markdown provides a quick and easy typesetting method for making reproducible reports. R Markdown can embed executable R code snippets and can run code that produces plots. You can write your report along with R code snippets using the knitr syntax. You can then convert your document into several common formats such as PDF or HTML. Visit rmarkdown.rstudio.com for more details.
In RStudio, go to New File -> R Markdown. Then, choose “From Template” and then choose “Lab Report” from the list of templates. This template will only exist if you have installed the openintro package.
In RStudio, go to New File -> R Markdown. Then, choose “Document” and then choose “PDF” from the list of output formats. This option will work if you have the latexpdf package installed.
Please put the header shown below at the beginning of the Rmd assignment template file if it is not already there.
---
title: '**Lab 1 - MATH 141**'
header-includes: |
  \usepackage{fancyhdr}
  \pagestyle{fancy}
  \fancyhead[CO,C]{Homework 1 - MATH 141}
  \fancyfoot[CO,C]{}
  \fancyfoot[C]{\thepage}
  \usepackage{float}
output:
  bookdown::pdf_document2:
    fig_caption: yes
    toc: no
    number_sections: no
urlcolor: red
---
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
Characteristics of a data frame:
The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain the same number of data items.
library(tidyverse)
# Create the data frame.
data_example <- data.frame(
  my_ranking = c(1:5),
  superheroes = c("Spiderman","Shang-Chi","Scarlet Witch","Doctor Strange","Black Panther"),
  subjective_power_scale = c(10000,1000,1000,900,900)
)
# Print the data frame.
glimpse(data_example)
## Rows: 5
## Columns: 3
## $ my_ranking <int> 1, 2, 3, 4, 5
## $ superheroes <chr> "Spiderman", "Shang-Chi", "Scarlet Witch", "Doc…
## $ subjective_power_scale <dbl> 10000, 1000, 1000, 900, 900
The function for creating a data frame in R is data.frame. Each argument defines one column, and the items in each vector fill the rows. Notice that the argument name given to each vector becomes the column name. The rows are automatically given integer indices. This type of data structure is the same as the one you saw in the previous lab.
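To illustrate how column names and row indices can be inspected, here is a minimal sketch with a small data frame. The object df_demo and its columns are made up for this illustration only.

```r
# A small hypothetical data frame with two columns and three rows
df_demo <- data.frame(
  id = 1:3,
  name = c("a", "b", "c")
)

names(df_demo)      # column names come from the argument names
nrow(df_demo)       # number of rows
rownames(df_demo)   # automatic integer indices, stored as text
df_demo$name        # one column, extracted as a vector
df_demo[2, ]        # the second row, selected by its index
```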
# Add the "subjective_pair" column.
data_example$subjective_pair <- c("Black Panther","Doctor Strange","Shang-Chi","Doctor Strange","Scarlet Witch")
# Add the "enemy" column.
data_example$enemy <- c("Green Goblin","Wenwu","Agatha Harkness","Dormammu","Erik Killmonger")
v <- data_example
glimpse(v)
## Rows: 5
## Columns: 5
## $ my_ranking <int> 1, 2, 3, 4, 5
## $ superheroes <chr> "Spiderman", "Shang-Chi", "Scarlet Witch", "Doc…
## $ subjective_power_scale <dbl> 10000, 1000, 1000, 900, 900
## $ subjective_pair <chr> "Black Panther", "Doctor Strange", "Shang-Chi",…
## $ enemy <chr> "Green Goblin", "Wenwu", "Agatha Harkness", "Do…
# Create the second data frame
data_new <- data.frame(
  my_ranking = c(6:8),
  superheroes = c("Naruto","Saitama","Black Widow"),
  subjective_power_scale = c(700,700,800),
  subjective_pair = c("Spiderman","Doctor Strange","Shang-Chi"),
  enemy = c("Sasuke","Unknown","General Dreykov")
)
# Bind the two data frames.
data_example_final <- rbind(data_example,data_new)
glimpse(data_example_final)
## Rows: 8
## Columns: 5
## $ my_ranking <int> 1, 2, 3, 4, 5, 6, 7, 8
## $ superheroes <chr> "Spiderman", "Shang-Chi", "Scarlet Witch", "Doc…
## $ subjective_power_scale <dbl> 10000, 1000, 1000, 900, 900, 700, 700, 800
## $ subjective_pair <chr> "Black Panther", "Doctor Strange", "Shang-Chi",…
## $ enemy <chr> "Green Goblin", "Wenwu", "Agatha Harkness", "Do…
The function for adding rows to an existing data frame in R is rbind, where the inputs must be data frames with the same columns.
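The requirement that the columns match can be seen in a short sketch. The data frames df_a, df_b, and df_c are hypothetical and exist only for this illustration; tryCatch is used here so that the expected error is captured as text instead of stopping the script.

```r
df_a <- data.frame(id = 1:2, score = c(10, 20))
df_b <- data.frame(id = 3, score = 30)
df_c <- data.frame(id = 4, points = 40)   # column name differs from df_a

# Works: both data frames have the columns "id" and "score"
combined <- rbind(df_a, df_b)
nrow(combined)

# Fails: the column names do not match, so rbind() raises an error.
# tryCatch() converts that error into its message.
result <- tryCatch(
  rbind(df_a, df_c),
  error = function(e) conditionMessage(e)
)
print(result)
```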
Conditional statements in programming define one or more conditions to be evaluated or tested by the program, along with a statement or statements to be executed if the condition is true and, optionally, other statements to be executed if the condition is false.
x <- 5
if (x > 0) {
  print("positive number")
}
## [1] "positive number"
x <- -5
if (x > 0) {
  print("positive number")
} else {
  print("negative number")
}
## [1] "negative number"
x <- 0
if (x > 0) {
  print("positive number")
} else if (x == 0) {
  print("zero")
} else {
  print("negative number")
}
## [1] "zero"
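The three examples above can be combined into one sketch that applies the same if / else if / else chain to several values. The names labels and label are illustrative. Note that at the top level of a script, else should appear on the same line as the closing brace of the preceding block, as written here.

```r
labels <- character()   # collects one label per tested value

for (x in c(5, -5, 0)) {
  if (x > 0) {
    label <- "positive number"
  } else if (x == 0) {
    label <- "zero"
  } else {
    label <- "negative number"
  }
  cat(x, ":", label, "\n")      # show the value next to its label
  labels <- c(labels, label)    # keep the label for later use
}
```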
The code snippets below use the iris data set, which is part of the base R installation.
# subsetting by category
setosa_sub <- iris[iris$Species == "setosa", ]
glimpse(setosa_sub)
## Rows: 50
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting numerical values
sepal_length_sub <- iris[iris$Sepal.Length >= 4, ]
glimpse(sepal_length_sub)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting numerical values within a range
sepal_length_sub_range <- iris[iris$Sepal.Length >= 5.10 & iris$Sepal.Length <= 6.40, ]
glimpse(sepal_length_sub_range)
## Rows: 83
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 5.4, 5.4, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 5.…
## $ Sepal.Width <dbl> 3.5, 3.9, 3.7, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.7, 1.5, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.4, 0.2, 0.2, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.4, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting by category with subset()
setosa_sub <- subset(iris, Species == "setosa")
glimpse(setosa_sub)
## Rows: 50
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting numerical values with subset()
sepal_length_sub <- subset(iris, Sepal.Length >= 4)
glimpse(sepal_length_sub)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# subsetting numerical values within a range with subset()
sepal_length_sub_range <- subset(iris, Sepal.Length >= 5.10 & Sepal.Length <= 6.40)
glimpse(sepal_length_sub_range)
## Rows: 83
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 5.4, 5.4, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 5.…
## $ Sepal.Width <dbl> 3.5, 3.9, 3.7, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.7, 1.5, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.4, 0.2, 0.2, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.4, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
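The bracket form and the subset form shown above are two ways of writing the same selection. As a quick check, this sketch builds the range subset both ways and compares them; the names by_bracket and by_subset are illustrative.

```r
# Same condition written with logical indexing and with subset()
by_bracket <- iris[iris$Sepal.Length >= 5.10 & iris$Sepal.Length <= 6.40, ]
by_subset  <- subset(iris, Sepal.Length >= 5.10 & Sepal.Length <= 6.40)

nrow(by_bracket)                   # number of rows kept by the bracket form
nrow(by_subset)                    # number of rows kept by subset()
all.equal(by_bracket, by_subset)   # compares the two results
```

Inside subset the column names can be written directly, without the iris$ prefix, which often makes longer conditions easier to read.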
You may find yourself in a position where you need to run a block of code multiple times. Statements are typically executed in order. A function’s first statement is executed first, then the second, and so on. A loop lets you repeat a block of statements without writing it out each time.
A for-loop iterates over the elements of a vector or list, one at a time, until it reaches the last element.
for(i in 1:10) { # Head of for-loop
  x1 <- i^2 # Code block where each integer is squared
  print(x1) # Print results
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
letters_vector <- c('A','B','C','D','E')
for(i in letters_vector) { # loop over character vector
  cat("My answer is", i, "for sure. \n") # concatenate strings while printing each on a new line
}
## My answer is A for sure.
## My answer is B for sure.
## My answer is C for sure.
## My answer is D for sure.
## My answer is E for sure.
You can also use paste to concatenate strings and print them using print.
x <- numeric() # Create an empty numeric vector
for(i in 1:10) { # Head of for-loop
  x <- c(x, i^2) # Code block where each integer is squared and appended to x
}
print(x)
## [1] 1 4 9 16 25 36 49 64 81 100
x <- numeric() # Create an empty numeric vector
for(i in 1:100) { # Head of for-loop
  x <- c(x, i^2) # Code block where each integer is squared and appended to x
  # conditional statement: if i^2 > 2000, the loop will stop
  if (i^2 > 2000) {
    break
  }
}
print(x)
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225
## [16] 256 289 324 361 400 441 484 529 576 625 676 729 784 841 900
## [31] 961 1024 1089 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849 1936 2025
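The results of these loops can be checked against R's vectorised arithmetic, which squares a whole vector in one step. This is a sketch for comparison only, not a replacement for the loop exercises; the names x_loop, x_vec, and first_over are illustrative.

```r
# Squares of 1 to 10 built with a loop
x_loop <- numeric()
for (i in 1:10) {
  x_loop <- c(x_loop, i^2)
}

# The same squares computed on the whole vector at once
x_vec <- (1:10)^2
identical(x_loop, x_vec)

# Position of the first square greater than 2000, which is where
# the loop with break stops
first_over <- which((1:100)^2 > 2000)[1]
first_over
```

The last value is 45, since 44^2 = 1936 and 45^2 = 2025, which matches the final entry printed by the loop with break.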
A while-loop is a method of iterating while a condition is true. It checks the condition before performing the loop body, and it repeats a statement or a series of statements for as long as the condition remains true.
i <- 1
while (i < 6) { # iterate while i is less than 6
  print(i)
  i <- i + 1 # add 1 in each iteration. if you miss this part you will end up in an infinite loop
  print(i)
}
## [1] 1
## [1] 2
## [1] 2
## [1] 3
## [1] 3
## [1] 4
## [1] 4
## [1] 5
## [1] 5
## [1] 6
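A for-loop and a while-loop can do similar work, but a for-loop runs once for each element of a sequence, whereas a while-loop keeps running until its condition becomes false, so the loop body must eventually change that condition.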
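To make the comparison concrete, this sketch adds the integers 1 to 5 with each kind of loop. The names total_for and total_while are illustrative.

```r
# for-loop: the sequence 1:5 controls how many times the body runs
total_for <- 0
for (i in 1:5) {
  total_for <- total_for + i
}

# while-loop: the counter must be created and updated by hand
total_while <- 0
i <- 1
while (i <= 5) {
  total_while <- total_while + i
  i <- i + 1   # without this update the condition would never become false
}

print(total_for)
print(total_while)
```

Both totals are 15. The while-loop version needs two extra lines to manage the counter, which is why a for-loop is usually the simpler choice when the number of iterations is known in advance.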
A function is a collection of statements that work together to accomplish a specified goal. R comes with a vast variety of built-in functions, and users can also construct their own.
A function in R is an object, so the R interpreter can pass control to it along with any arguments the function needs to complete its operations.
The function then completes its task and returns control, along with any result, to the interpreter.
R function syntax:
function_1 <- function(...) {
# put operations here
}
# Create a function to print squares of numbers in sequence.
function_1 <- function(a) {
  for(i in 1:a) {
    b <- i^2
    print(b) # only prints the output
  }
}
# Call the function function_1 supplying 6 as an argument.
function_1(6)
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
# Create a function to print the sum of squares of two numbers.
function_2 <- function(a,b) {
  c <- a^2 + b^2
  print(c) # only prints the output
}
# Call the function supplying 6 and 6 as arguments.
function_2(6,6)
## [1] 72
# Create a function to return the sum of squares of two numbers.
function_2 <- function(a,b) {
  c <- a^2 + b^2
  return(c) # returns value c
}
# Call the function supplying 6 and 6 as arguments.
function_2(6,6)
## [1] 72
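The practical difference between printing and returning shows up when you try to store the result. The sketch below uses two hypothetical functions, squares_printed and sum_of_squares, modelled on the examples above.

```r
# Prints each square; the for-loop is the last expression, so the
# function gives back NULL
squares_printed <- function(a) {
  for (i in 1:a) {
    print(i^2)
  }
}

# Returns the sum of squares so the caller can keep using it
sum_of_squares <- function(a, b) {
  return(a^2 + b^2)
}

printed <- squares_printed(3)    # values appear on screen
is.null(printed)                 # nothing useful was stored

returned <- sum_of_squares(6, 6) # nothing appears on screen
returned / 2                     # the stored value can be reused
```

Here is.null(printed) is TRUE, while returned holds 72 and can be used in further calculations.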
The figures below are generated using the iris data set, which is part of the base R installation.
plot(iris$Sepal.Length, iris$Sepal.Width, # x and y data
pch=21, # dot design
bg=c("red","green3","blue")[unclass(iris$Species)], # color for each group
main="Edgar Anderson's Iris Data", # plot title
xlab = "Sepal Length", # x label
ylab = "Sepal Width") # y label
legend('topright',levels(iris$Species),col=c("red","green3","blue"),pch=c(21,21,21))
Using ggplot2:
library(ggplot2)
scatter <- ggplot(data=iris, aes(x = Sepal.Length, y = Sepal.Width)) # using the iris data
scatter + geom_point(aes(color=Species, shape=Species), size = 1.5) + # scatter plot
  xlab("Sepal Length") + ylab("Sepal Width") + # x and y label
  ggtitle("Edgar Anderson's Iris Data")
The figure below is generated using the iris data set, which is part of the base R installation.
Using ggplot2:
library(ggplot2)
histogram <- ggplot(data=iris, aes(x=Sepal.Length))
histogram + geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) +
  xlab("Sepal Length") + ylab("Frequency") + ggtitle("Histogram of Sepal Length")
The figure below is generated using the iris data set, which is part of the base R installation.
Using ggplot2:
library(ggplot2)
boxplot <- ggplot(data=iris, aes(x=Sepal.Length))
boxplot + geom_boxplot(color="black", aes(fill=Species)) +
  xlab("Sepal Length") + ggtitle("Boxplot of Sepal Length") +
  theme(axis.ticks.y = element_blank(), # removes the y axis ticks and text
        axis.text.y = element_blank())
The plots below are generated using synthetic data.
Using ggplot2:
# create synthetic data
groups <- c("A","B")
x_vals_seq <- seq(0, 1, length.out = 9) # generate a sequence of equally spaced numbers
x_values <- rep(x_vals_seq,2) # replicate vector twice
y_values <- c(1, 2, 2, 4, 5, 4, 4, 3, 1, 2, 4, 4, 8, 10, 8, 8, 6, 2)
df2 <- data.frame(letter=rep(groups, each=9), # replicate A and B 9 times each
                  x=rep(x_values,2), # independent variable (the shorter columns are recycled to 36 rows, so each point appears twice)
                  y=y_values) # dependent variable
glimpse(df2)
## Rows: 36
## Columns: 3
## $ letter <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"…
## $ x <dbl> 0.000, 0.125, 0.250, 0.375, 0.500, 0.625, 0.750, 0.875, 1.000, …
## $ y <dbl> 1, 2, 2, 4, 5, 4, 4, 3, 1, 2, 4, 4, 8, 10, 8, 8, 6, 2, 1, 2, 2,…
library(ggplot2)
p <- ggplot(df2, aes(x=x, y=y, group=letter)) +
  geom_line(aes(color=letter)) +
  geom_point(aes(color=letter))
p
Consider the linear function written below with parameters \(a\) and \(b\).
\[y(x;a,b) = ax + b\] where \(x\) is the independent variable and \(y\) is the dependent variable. Here, the slope of the line is \(a\) and intercept is \(b\).
Write an R function which outputs the dependent variable \(y\) and takes in the independent variable \(x\), and the parameters \(a\) and \(b\).
Using 3 different values of \(a\) and \(b\) (A: \(a=1,b=1\), B: \(a=1.25,b=1.2\), and C: \(a=2,b=1.5\)), create a data frame with columns “line”, “x”, and “y”. Use the seq command to create a vector of \(x\) values from 0 to 1 with length 10. Use the rep command to generate \(x\) values for the three groups. Use your R function to generate \(y\) values using your \(x\) values as inputs. The “line” column should contain the line groups A, B, and C. Use glimpse to show your data frame.
Use ggplot to plot the lines on the same figure with proper line group labels.
Consider the county data set, which can be found in the usdata R package. This data set can also be accessed as a csv file, county.
Use ggplot to plot histograms of the “pop_change” variable with categories from the “metro” variable. Make sure to label the x and y axes properly. What can you say about the change in population depending on whether the county has access to a metro or not?
Use ggplot to plot histograms of the “per_capita_income” variable with categories from the “median_edu” variable. Make sure to label the x and y axes properly. Describe the distributions across the levels of the “median_edu” variable. Is there a strong association between “median_edu” and “per_capita_income”? Explain why.
Use ggplot to plot a scatterplot of the “unemployment_rate” versus “poverty_rate” variables with categories from the “median_edu” variable. Make sure to label the x and y axes properly. Is there an association between “unemployment_rate” and “poverty_rate”? Are there any differences across the “median_edu” levels?
Create a subset of the data where we only take rows with the states Washington, Oregon, and California.
Using the subset you just created, create boxplots of “pop_change” variable with categories from the “state” variable. Make sure to label the x and y axis properly. Based on the medians shown on the boxplots, which state has the lowest and highest population change?
Using the subset you just created, create boxplots of “per_capita_income” variable with categories from the “median_edu” variable. Make sure to label the x and y axis properly. Based on the medians shown on the boxplots, which state has the lowest and highest per capita income?