dplyr
Upon completing today’s lab activity, students should be able to do the following using R and RStudio:
Use the dplyr package to efficiently subset data frames.
Quantify the strength of a linear relationship by computing the correlation coefficient of given data.
Fit a linear regression model to given data.
Interpret and visualize regression models.
Quantify errors through the computation and visualization of residuals.
dplyr
Data is often messy and difficult to organize or manage. When working with a data set, you typically want the following:
Constrained options that help you manage your data and figure out the things you need/want.
Easy and intuitive R commands to filter and subset your data.
Efficient and fast subsetting when dealing with large-scale data.
The dplyr package makes the above tasks easier. Note that this package is part of the tidyverse collection of packages.
Throughout this section, we will use the starwars data set, which is available through the dplyr package.
First, install the dplyr package if you have not done so already.
install.packages("dplyr")
Now, we can begin.
library(dplyr)
starwars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth … 202 136 none white yellow 41.9 male mascu…
## 5 Leia O… 150 49 brown light brown 19 fema… femin…
## 6 Owen L… 178 120 brown, grey light blue 52 male mascu…
## 7 Beru W… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs … 183 84 black light brown 24 male mascu…
## 10 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Notice that the printed data set above is a data frame with 87 rows and 14 columns. In tidyverse terminology, this is called a “tibble”, a modern reimagining of the data frame. Similar to using the glimpse function, a tibble automatically prints a summary shown as a table like the one above. You can learn more about tibbles at https://tibble.tidyverse.org; in particular, you can convert data frames to tibbles with as_tibble().
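The examples that follow chain dplyr verbs with the pipe operator %>%, which dplyr re-exports from the magrittr package. The pipe passes the result on its left as the first argument of the function on its right, so x %>% f(y) is equivalent to f(x, y). A minimal sketch (assuming dplyr is installed):

```r
library(dplyr)

# x %>% f(y) is equivalent to f(x, y):
# both calls below filter the starwars tibble the same way
piped  <- starwars %>% filter(mass > 1000)
nested <- filter(starwars, mass > 1000)
identical(piped, nested) # TRUE
```

Either form works; the piped form is preferred in dplyr code because a long chain of verbs reads top to bottom instead of inside out.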
In dplyr, the functions used to subset and transform data are called verbs. Below is a list of verbs and their descriptions.
filter() chooses rows based on column values.
slice() chooses rows based on location.
arrange() changes the order of the rows.
select() changes whether or not a column is included.
rename() changes the name of columns.
mutate() changes the values of columns and creates new columns.
relocate() changes the order of the columns.
summarise() collapses a group into a single row.
filter()
sub_fil <- starwars %>% filter(skin_color == "brown", eye_color == "brown")
sub_fil
## # A tibble: 2 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Wicket S… 88 20 brown brown brown 8 male mascu…
## 2 Eeth Koth 171 NA black brown brown NA male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
# Equivalent base R code
sub_fil_0 <- starwars[starwars$skin_color == "brown" & starwars$eye_color == "brown", ]
sub_fil_1 <- subset(starwars, skin_color == "brown" & eye_color == "brown")
arrange()
sub_arr_asc <- starwars %>% arrange(height, mass) # ascending order
sub_arr_asc
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu…
## 2 Ratts T… 79 15 none grey, blue unknown NA male mascu…
## 3 Wicket … 88 20 brown brown brown 8 male mascu…
## 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
## 5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 6 R4-P17 96 NA none silver, r… red, blue NA none femin…
## 7 R5-D4 97 32 <NA> white, red red NA none mascu…
## 8 Sebulba 112 40 none grey, red orange NA male mascu…
## 9 Gasgano 122 NA none white, bl… black NA male mascu…
## 10 Watto 137 NA black blue, grey yellow NA male mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
# Use desc() to order a column in descending order
sub_arr_dsc <- starwars %>% arrange(desc(height)) # descending order
sub_arr_dsc
slice()
# choose rows with indices 5 to 10
starwars %>% slice(5:10)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Or… 150 49 brown light brown 19 fema… femin…
## 2 Owen La… 178 120 brown, grey light blue 52 male mascu…
## 3 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 4 R5-D4 97 32 <NA> white, red red NA none mascu…
## 5 Biggs D… 183 84 black light brown 24 male mascu…
## 6 Obi-Wan… 182 77 auburn, wh… fair blue-gray 57 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
# choose the first 3 rows
starwars %>% slice_head(n = 3)
# choose the last 3 rows
starwars %>% slice_tail(n = 3)
# randomly sample 5 rows
starwars %>% slice_sample(n = 5)
# randomly sample 10% of the rows
starwars %>% slice_sample(prop = 0.1)
select()
starwars %>% select(hair_color, skin_color, eye_color) # select columns by name
## # A tibble: 87 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # … with 77 more rows
# Select all columns between hair_color and eye_color (inclusive)
starwars %>% select(hair_color:eye_color)
# Select all columns except those from hair_color to eye_color (inclusive)
starwars %>% select(!(hair_color:eye_color))
# Select all columns ending with color
starwars %>% select(ends_with("color"))
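The rename() verb from the list above has no example of its own, so here is a small sketch. It changes a column’s name without moving or modifying the data; the new name goes on the left and the old name on the right:

```r
library(dplyr)

# rename hair_color to hair; all other columns are unchanged
renamed <- starwars %>% rename(hair = hair_color)
"hair" %in% names(renamed) # TRUE
```
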
mutate()
# add a new column using the height column divided by 100
starwars %>%
  mutate(height_m = height / 100) %>%
  select(height_m, height, everything()) # to see the columns
## # A tibble: 87 × 15
## height_m height name mass hair_color skin_color eye_color birth_year sex
## <dbl> <int> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1.72 172 Luke … 77 blond fair blue 19 male
## 2 1.67 167 C-3PO 75 <NA> gold yellow 112 none
## 3 0.96 96 R2-D2 32 <NA> white, bl… red 33 none
## 4 2.02 202 Darth… 136 none white yellow 41.9 male
## 5 1.5 150 Leia … 49 brown light brown 19 fema…
## 6 1.78 178 Owen … 120 brown, gr… light blue 52 male
## 7 1.65 165 Beru … 75 brown light blue 47 fema…
## 8 0.97 97 R5-D4 32 <NA> white, red red NA none
## 9 1.83 183 Biggs… 84 black light brown 24 male
## 10 1.82 182 Obi-W… 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>%
  mutate(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  ) %>%
  select(BMI, everything())
# transmute() keeps only the new variables/columns
starwars %>%
  transmute(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  )
relocate()
starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 × 14
## name sex gender homeworld height mass hair_color skin_color eye_color
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Sk… male mascu… Tatooine 172 77 blond fair blue
## 2 C-3PO none mascu… Tatooine 167 75 <NA> gold yellow
## 3 R2-D2 none mascu… Naboo 96 32 <NA> white, bl… red
## 4 Darth V… male mascu… Tatooine 202 136 none white yellow
## 5 Leia Or… female femin… Alderaan 150 49 brown light brown
## 6 Owen La… male mascu… Tatooine 178 120 brown, gr… light blue
## 7 Beru Wh… female femin… Tatooine 165 75 brown light blue
## 8 R5-D4 none mascu… Tatooine 97 32 <NA> white, red red
## 9 Biggs D… male mascu… Tatooine 183 84 black light brown
## 10 Obi-Wan… male mascu… Stewjon 182 77 auburn, w… fair blue-gray
## # … with 77 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
summarise()
starwars %>% summarise(height = mean(height, na.rm = TRUE))
starwars %>%
  group_by(sex) %>%
  select(height, mass) %>%
  summarise(
    height = mean(height, na.rm = TRUE),
    mass = mean(mass, na.rm = TRUE)
  )
## Adding missing grouping variables: `sex`
## # A tibble: 5 × 3
## sex height mass
## <chr> <dbl> <dbl>
## 1 female 169. 54.7
## 2 hermaphroditic 175 1358
## 3 male 179. 81.0
## 4 none 131. 69.8
## 5 <NA> 181. 48
If you want to explore more of the dplyr package, you can visit the package website at https://dplyr.tidyverse.org.
Throughout this section we will use the iris data set, which is part of the base R installation.
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
First, let’s look at scatter plots of selected variables.
pairs(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris,
      main = "Edgar Anderson's Iris Data",
      pch = 21,
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)],
      oma = c(3, 3, 3, 15))
par(xpd = TRUE)
legend("bottomright", fill = c("#1b9e77", "#d95f02", "#7570b3"), legend = levels(iris$Species))
We can look at all of the correlations for all relevant variables and save it as a data frame. Note that when we compute the correlations, we ignore the species groupings for now and use all data points.
sub_4 <- iris %>% select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
cor_4 <- data.frame(cor(sub_4))
cor_4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
The formula for computing correlation is given by
\[r = \frac{\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} \]
where the terms \(\bar{x}\) and \(\bar{y}\) are the means of the \(x\) and \(y\) variables. The means are computed as
\[ \bar{x} = \frac{\sum_{i=1}^n{x_i}}{n} \]
and
\[ \bar{y} = \frac{\sum_{i=1}^n{y_i}}{n} \]
where \(n\) is the number of observations.
Notice that the results from the previous code snippet show that \(r_{Petal.Length,Petal.Width} = 0.9628\) is equal to \(r_{Petal.Width,Petal.Length} = 0.9628\). This means that the correlation equation is symmetric if you switch the \(x\) and \(y\) variables. R computes correlation using this formula behind the scenes.
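As a check, the formula above can be applied by hand in base R and compared against the built-in cor() function, here for Petal.Width and Petal.Length:

```r
x <- iris$Petal.Width
y <- iris$Petal.Length

# numerator: sum of products of deviations from the means
num <- sum((x - mean(x)) * (y - mean(y)))
# denominator: square root of the product of the sums of squared deviations
den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual <- num / den
all.equal(r_manual, cor(x, y)) # TRUE
```

The symmetry is also easy to confirm directly: cor(x, y) and cor(y, x) return the same value.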
Note that the correlation coefficient is very sensitive to outliers.
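To see this sensitivity, consider a small simulated example (the numbers here are illustrative and not part of the iris data): a single extreme point can turn a strong positive correlation into a negative one.

```r
set.seed(1)                  # for reproducibility
x <- 1:20
y <- x + rnorm(20, sd = 1)   # strong linear relationship
cor(x, y)                    # close to 1

# add one extreme outlier far from the rest of the points
x_out <- c(x, 100)
y_out <- c(y, -50)
cor(x_out, y_out)            # now negative
```

This is why plotting the data before trusting a correlation coefficient is always a good idea.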
First, let’s take a look at the scatter plot of Petal.Length vs Petal.Width of the iris data set without coloring by species.
library(ggplot2)
scatter <- ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length))
scatter + geom_point(size = 1.5) + # scatter plot
  xlab("Petal Width") + ylab("Petal Length") + # x and y labels
  ggtitle("Edgar Anderson's Iris Data")
Now, we want to create a linear model that predicts Petal.Length using Petal.Width as the explanatory variable. The response variable is Petal.Length. We can write the linear model as follows.
\[\hat{y} = b_0 + b_1 x\]
where \(\hat{y}\) is the predicted response variable (Petal.Length) and \(x\) is the explanatory variable (Petal.Width as predictor). In the underlying model \(y = b_0 + b_1 x + e\), the term \(e\) is the error, the part of \(y\) not captured by the line. Below is a code snippet using the lm function to compute the best-fit linear model given the data, that is, to find the sample statistics \(b_0\) and \(b_1\) that best fit the data.
lm_mod <- lm(Petal.Length ~ Petal.Width, data = iris)
summary(lm_mod) # details of the model
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33542 -0.30347 -0.02955 0.25776 1.39453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08356 0.07297 14.85 <2e-16 ***
## Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
lm_mod <- lm(Petal.Length ~ Petal.Width, data = iris)
print(lm_mod) # fewer details of the model
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
##
## Coefficients:
## (Intercept) Petal.Width
## 1.084 2.230
We can write the linear model using the estimated values of \(b_0\) and \(b_1\).
\[\widehat{Petal.Length} = 1.084 + 2.230 \times Petal.Width \]
Now, we can make predictions/estimates using the model.
new.df <- data.frame(Petal.Width = c(0.50, 1, 2)) # using three example widths
predictions_0 <- new.df %>% mutate(Petal.Length.hat = predict(lm_mod, new.df))
print(predictions_0)
## Petal.Width Petal.Length.hat
## 1 0.5 2.198528
## 2 1.0 3.313499
## 3 2.0 5.543439
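These predictions are just the model equation evaluated at each width; computing \(b_0 + b_1 x\) by hand gives the same number as predict():

```r
lm_mod <- lm(Petal.Length ~ Petal.Width, data = iris)
b0 <- coef(lm_mod)[[1]] # intercept
b1 <- coef(lm_mod)[[2]] # slope

# manual prediction for Petal.Width = 0.5
y_hat_manual  <- b0 + b1 * 0.5
y_hat_predict <- predict(lm_mod, data.frame(Petal.Width = 0.5))[[1]]
all.equal(y_hat_manual, y_hat_predict) # TRUE
```
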
new.df <- data.frame(Petal.Width = iris$Petal.Width) # using the actual widths
predictions_1 <- new.df %>% mutate(Petal.Length.hat = predict(lm_mod, new.df))
glimpse(predictions_1)
## Rows: 150
## Columns: 2
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2…
## $ Petal.Length.hat <dbl> 1.529546, 1.529546, 1.529546, 1.529546, 1.529546, 1.9…
Now, we can add a line showing the fitted linear model to the scatterplot.
library(ggplot2)
scatter <- ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length))
scatter + geom_point(size = 1.5) + # scatter plot
  geom_abline(aes(slope = coef(lm_mod)[[2]], intercept = coef(lm_mod)[[1]],
                  colour = "Best Fit Linear Model"), size = 1) +
  labs(colour = "") +
  xlab("Petal Width") + ylab("Petal Length") + # x and y labels
  ggtitle("Edgar Anderson's Iris Data")
The residual for the \(i^{th}\) observation is computed as follows.
\[ e_i = y_i - \hat{y_i} \]
where \(y_i\) is the actual observation and \(\hat{y_i}\) is the estimated/predicted outcome using the linear model.
R can automatically compute the residuals using the resid function.
lm_res <- resid(lm_mod)
lm_res <- data.frame(resid = lm_res)
lm_res <- lm_res %>% mutate(Petal.Width = iris$Petal.Width)
Now, we can make plots of the residuals.
library(ggplot2)
histogram <- ggplot(data = lm_res, aes(x = resid))
histogram + geom_histogram(binwidth = 0.2, color = "black") +
  xlab("Residuals") + ylab("Frequency") + ggtitle("Histogram of Residuals")
library(ggplot2)
scatter <- ggplot(data = lm_res, aes(x = Petal.Width, y = resid))
scatter + geom_point(size = 1.5) + # scatter plot
  xlab("Petal Width") + ylab("Residual") + # x and y labels
  geom_hline(yintercept = 0, linetype = "dashed", col = "red") +
  ggtitle("Petal Width vs Residuals")
The histogram shows a nearly normal residual distribution while the residual scatter plot shows the variation of the residuals vs the explanatory variable.
Next, we can compute the residual sum of squares (rss) or the sum of squared errors (sse).
\[ rss = \sum_{i=1}^n{(y_i - \hat{y}_i)^2} \]
# residual sum of squares (rss) or the sum of squared errors (sse)
sse <- lm_res %>% transmute(resid^2)
sse <- sum(sse)
sse
The sse measures the left-over variability in the \(y\) values after accounting for \(x\): it tells you how well your model fits the actual data. The residual sum of squares helps you decide whether a statistical model is a good fit for your data. Normally you would compare sse values among competing linear models; the lower the sse, the better.
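As a sketch of such a comparison, we can fit a second model that predicts Petal.Length from Sepal.Width instead and compare the two sse values; given the correlations computed earlier, the Petal.Width model should have the smaller (better) sse:

```r
m1 <- lm(Petal.Length ~ Petal.Width, data = iris)
m2 <- lm(Petal.Length ~ Sepal.Width, data = iris)

# sse for each model: sum of squared residuals
sse_1 <- sum(resid(m1)^2)
sse_2 <- sum(resid(m2)^2)
sse_1 < sse_2 # TRUE: Petal.Width is the better single predictor here
```
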
Next, we can compute the total sum of squares (sst).
\[ sst = \sum_{i=1}^n{(y_i - \bar{y})^2} \]
sst <- iris %>% transmute((Petal.Length - mean(Petal.Length))^2)
sst <- sum(sst)
sst
Finally, using the sse and sst, we can compute the coefficient of determination.
\[ R^2 = \frac{sst - sse}{sst} = 1 - \frac{sse}{sst} \]
The \(R^2\) value tells you the amount (or percent) of variation in the response explained by your model fit to the data. It is the ratio of the variability explained by the line to the total variability. It tells you how well the model fits: the higher the \(R^2\), the better. This value falls between 0 and 1.
R_squared <- 1 - (sse/sst)
R_squared
## [1] 0.9271098
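Two useful checks on this value: it matches the Multiple R-squared reported by summary(), and for simple linear regression it equals the square of the correlation coefficient between the two variables.

```r
lm_mod <- lm(Petal.Length ~ Petal.Width, data = iris)

R2_summary <- summary(lm_mod)$r.squared
r <- cor(iris$Petal.Width, iris$Petal.Length)
all.equal(R2_summary, r^2) # TRUE for simple linear regression
```
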
Load the iris dataset. Note that this dataset is in the datasets package, which is already included in the base R installation. Use dplyr to make subsets (Species: setosa, versicolor, and virginica) of the data using the variables Sepal.Length and Petal.Length. You should have three subsets, one for each species.
For each species, find the best-fit linear model to predict Petal.Length using Sepal.Length as the predictor. You should have three linear models, one for each species. Write the linear model equations with their corresponding sample statistics (intercept and slope). For example, write the equations for \(\hat{y}_{setosa}\), \(\hat{y}_{versicolor}\), and \(\hat{y}_{virginica}\). Interpret the slopes of the models for each species and compare them with each other.
Using your fitted models, produce a scatterplot which contains the data points colored according to Species. Add a line for each species’ linear model on the same scatterplot. Make sure to add legends and labels correctly.
For each subset you made in problem 1, compute the correlations. Describe the correlations in this context. Does each correlation reflect what is on the scatterplot? Compare the correlations across species.
Load the duke_forest data, which is available through the openintro package. Below is a table describing the variables of the duke_forest data set. For each variable, identify whether it is numerical (discrete or continuous) or categorical (nominal or ordinal).

Variable | Description |
---|---|
price | Sale price, in USD |
bed | Number of bedrooms |
bath | Number of bathrooms |
area | Area of home, in square feet |
year_built | Year the home was built |
cooling | Cooling system: central or other |
lot | Area of the entire property, in acres |
Produce paired scatterplots using the pairs function in R. You must use the numerical variables.
Compute the correlations of the bed, bath, area, year_built, and lot variables with the price variable. Which explanatory variable has the highest correlation? Does it reflect a consistent pattern in the scatter plot?
Let the response variable (the outcome) be the price variable. Create a linear model with the predictor that has the best coefficient of determination with the price variable.
Produce a scatterplot and residual plot for the linear model with the best coefficient of determination from problem 4. Interpret the slope and explain the results in this context. Based on the residuals, identify any outliers and explain why we might need to resort to a more complicated model.