dplyrUpon completing today’s lab activity, students should be able to do the following using R and RStudio:
Use the dplyr package to efficiently subset data frames.
Quantifying the strength of a linear relationship by computing the correlation coefficient of a given data.
Fitting a linear regression model into a given data.
Interpreting and visualizing regression models.
Quantifying errors through the computation and visualization of residuals.
dplyrData is often messy and difficult to organize or manage. Typically you may need to do the following on your data set:
Constraining your options to help manage your data and figure out the things you need/want.
Easy and intuitive R commands to filter and subset your data.
Efficient and fast subsetting when dealing with large-scale data.
The dplyr package makes the above tasks easier. Note that this package is part of the tidyverse package.
Throughout this section, we will use the starwars data set, which is available through the dplyr package.
First, install the dplyr package if you have not done it already.
install.packages("dplyr")
Now, we can begin.
library(dplyr)
starwars## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth … 202 136 none white yellow 41.9 male mascu…
## 5 Leia O… 150 49 brown light brown 19 fema… femin…
## 6 Owen L… 178 120 brown, grey light blue 52 male mascu…
## 7 Beru W… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs … 183 84 black light brown 24 male mascu…
## 10 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Notice that the printed data set above is a data frame with 87 rows and 14 columns. In tidyverse terminologies, this is called a “tibble”, a modern reimagining of the data frame. Similar to using the glimpse, function a tibble data frame automatically print a summary shown as a table like above. You can learn more about tibbles at https://tibble.tidyverse.org; in particular you can convert data frames to tibbles with as_tibble().
In dplyr, the functions used to subset data are called verbs. Below is a list of verbs and there descriptions.
filter() chooses rows based on column values.slice() chooses rows based on location.arrange() changes the order of the rows.select() changes whether or not a column is included.rename() changes the name of columns.mutate() changes the values of columns and creates new columns.relocate() changes the order of the columns.summarise() collapses a group into a single row.filter()sub_fil <- starwars %>% filter(skin_color == "brown", eye_color == "brown")
sub_fil## # A tibble: 2 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Wicket S… 88 20 brown brown brown 8 male mascu…
## 2 Eeth Koth 171 NA black brown brown NA male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
#Equivalent R base codes
sub_fil_0 <- starwars[starwars$skin_color == "brown" & starwars$eye_color == "brown", ]
sub_fil_1 <- subset(starwars, skin_color == "brown" & eye_color == "brown")arrange()sub_arr_asc <- starwars %>% arrange(height, mass) # ascending order
sub_arr_asc## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu…
## 2 Ratts T… 79 15 none grey, blue unknown NA male mascu…
## 3 Wicket … 88 20 brown brown brown 8 male mascu…
## 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
## 5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 6 R4-P17 96 NA none silver, r… red, blue NA none femin…
## 7 R5-D4 97 32 <NA> white, red red NA none mascu…
## 8 Sebulba 112 40 none grey, red orange NA male mascu…
## 9 Gasgano 122 NA none white, bl… black NA male mascu…
## 10 Watto 137 NA black blue, grey yellow NA male mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
# Use desc() to order a column in descending order
sub_arr_dsc <- starwars %>% arrange(desc(height)) # descending order
sub_arr_dscslice()# choose rows with indices 5 to 10
starwars %>% slice(5:10) ## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Or… 150 49 brown light brown 19 fema… femin…
## 2 Owen La… 178 120 brown, grey light blue 52 male mascu…
## 3 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 4 R5-D4 97 32 <NA> white, red red NA none mascu…
## 5 Biggs D… 183 84 black light brown 24 male mascu…
## 6 Obi-Wan… 182 77 auburn, wh… fair blue-gray 57 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
# choose the first 3 rows
starwars %>% slice_head(n = 3)# choose the last 3 rows
starwars %>% slice_tail(n = 3)# randonmly sample 5 rows
starwars %>% slice_sample(n = 5)# randonmly sample 5 rows
starwars %>% slice_sample(prop = 0.1)select()starwars %>% select(hair_color, skin_color, eye_color) # select columns by name## # A tibble: 87 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # … with 77 more rows
# Select all columns between hair_color and eye_color (inclusive)
starwars %>% select(hair_color:eye_color)# Select all columns except those from hair_color to eye_color (inclusive)
starwars %>% select(!(hair_color:eye_color))# Select all columns ending with color
starwars %>% select(ends_with("color"))mutate()# add a new column using the hight column devided by 100
starwars %>%
mutate(height_m = height / 100) %>%
select(height_m, height, everything()) # to see the columns## # A tibble: 87 × 15
## height_m height name mass hair_color skin_color eye_color birth_year sex
## <dbl> <int> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1.72 172 Luke … 77 blond fair blue 19 male
## 2 1.67 167 C-3PO 75 <NA> gold yellow 112 none
## 3 0.96 96 R2-D2 32 <NA> white, bl… red 33 none
## 4 2.02 202 Darth… 136 none white yellow 41.9 male
## 5 1.5 150 Leia … 49 brown light brown 19 fema…
## 6 1.78 178 Owen … 120 brown, gr… light blue 52 male
## 7 1.65 165 Beru … 75 brown light blue 47 fema…
## 8 0.97 97 R5-D4 32 <NA> white, red red NA none
## 9 1.83 183 Biggs… 84 black light brown 24 male
## 10 1.82 182 Obi-W… 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>%
mutate(
height_m = height / 100,
BMI = mass / (height_m^2)
) %>%
select(BMI, everything())# for keeping new variables/columns
starwars %>%
transmute(
height_m = height / 100,
BMI = mass / (height_m^2)
)relocate()starwars %>% relocate(sex:homeworld, .before = height)## # A tibble: 87 × 14
## name sex gender homeworld height mass hair_color skin_color eye_color
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Sk… male mascu… Tatooine 172 77 blond fair blue
## 2 C-3PO none mascu… Tatooine 167 75 <NA> gold yellow
## 3 R2-D2 none mascu… Naboo 96 32 <NA> white, bl… red
## 4 Darth V… male mascu… Tatooine 202 136 none white yellow
## 5 Leia Or… female femin… Alderaan 150 49 brown light brown
## 6 Owen La… male mascu… Tatooine 178 120 brown, gr… light blue
## 7 Beru Wh… female femin… Tatooine 165 75 brown light blue
## 8 R5-D4 none mascu… Tatooine 97 32 <NA> white, red red
## 9 Biggs D… male mascu… Tatooine 183 84 black light brown
## 10 Obi-Wan… male mascu… Stewjon 182 77 auburn, w… fair blue-gray
## # … with 77 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
summarise()starwars %>% summarise(height = mean(height, na.rm = TRUE))starwars %>%
group_by(sex) %>%
select(height, mass) %>%
summarise(
height = mean(height, na.rm = TRUE),
mass = mean(mass, na.rm = TRUE)
)## Adding missing grouping variables: `sex`
## # A tibble: 5 × 3
## sex height mass
## <chr> <dbl> <dbl>
## 1 female 169. 54.7
## 2 hermaphroditic 175 1358
## 3 male 179. 81.0
## 4 none 131. 69.8
## 5 <NA> 181. 48
If you want to explore more on using the dplyr package, you can visit the following resources.
Throughout this section we are using the iris data set, which is part of the base R installation.
glimpse(iris)## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
First, let’s look at scatter plots of selected variables.
pairs(data = iris, ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
main = "Edgar Anderson's Iris Data",
pch = 21,
bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)],
oma=c(3,3,3,15))
par(xpd = TRUE)
legend("bottomright", fill = unique(iris$Species), legend = c( levels(iris$Species)))We can look at all of the correlations for all relevant variables and save it as a data frame. Note that when we compute the correlations, we ignore the species groupings for now and use all data points.
sub_4 <- iris %>% select(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)
cor_4 <- data.frame(cor(sub_4))
cor_4## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
The formula for computing correlation is given by
\[r = \frac{\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} \]
where the terms \(\bar{x}\) and \(\bar{y}\) are the mean of the \(x\) and \(y\) variables. The mean is computed as
\[ \bar{x} = \frac{\sum_{i=1}^n{x_i}}{n} \]
and
\[ \bar{y} = \frac{\sum_{i=1}^n{y_i}}{n} \]
where \(n\) is the number of observations.
Notice that the results from the previous code snippet show that the the \(r_{Petal.Length,Petal.Width} = 0.9628\) is equal to \(r_{Petal.Width,Petal.Length} = 0.9628\). This means that the correlation equation is symmetric if you switch the \(x\) and \(y\) variables. R computes correlation using this formula behind the scenes.
Note that the correlation coefficient is very sensitive to outliers.
First, let’s take a look at the scatter plot of the Petal.Width vs Petal.Length of the iris data set without coloring by species.
library(ggplot2)
scatter <- ggplot(data=iris, aes(x = Petal.Width, y = Petal.Length))
scatter + geom_point(size = 1.5) + # scatter plot
xlab("Petal Width") + ylab("Petal Length") + # x and y label
ggtitle("Edgar Anderson's Iris Data")Now, we want create a linear model that predicts Petal.Length using Petal.Width as the explanatory variable. The response variable is the Petal.Length. We can write the linear model as follows.
\[\hat{y} = b_0 + b_1 x\]
where \(\hat{y}\) is the response variable (predicted Petal.Length) and \(x\) is the explanatory variable (Petal.Width as predictor). The term \(e\) is the error. Below is a code snippet using the lm function to compute the best fit linear model given the data. That is finding the sample statistics \(b_0\) and \(b_1\) that best fits the data.
lm_mod <- lm(Petal.Length ~ Petal.Width, data = iris)
summary(lm_mod) # details of the model##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33542 -0.30347 -0.02955 0.25776 1.39453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08356 0.07297 14.85 <2e-16 ***
## Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
lm_mod <- lm(Petal.Length ~ Petal.Width, data = iris)
print(lm_mod) # less details of the model##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
##
## Coefficients:
## (Intercept) Petal.Width
## 1.084 2.230
We can write the linear model using the estimated values of \(b_0\) and \(b_1\).
\[\widehat{Petal.Length} = 1.083558 + 2.230 * Petal.Width \]
Now, we can make predictions/estimates using the model.
new.df <- data.frame(Petal.Width=c(0.50, 1, 2)) # using three example widths
predictions_0 <- new.df %>% mutate(Petal.Length.hat = predict(lm_mod, new.df))
print(predictions_0)## Petal.Width Petal.Length.hat
## 1 0.5 2.198528
## 2 1.0 3.313499
## 3 2.0 5.543439
new.df <- data.frame(Petal.Width=iris$Petal.Width) # using the actual widths
predictions_1 <- new.df %>% mutate(Petal.Length.hat = predict(lm_mod, new.df))
glimpse(predictions_1)## Rows: 150
## Columns: 2
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2…
## $ Petal.Length.hat <dbl> 1.529546, 1.529546, 1.529546, 1.529546, 1.529546, 1.9…
Now, we can make a line to show the linear model into the scatterplot.
library(ggplot2)
scatter <- ggplot(data=iris, aes(x = Petal.Width, y = Petal.Length))
scatter + geom_point(size = 1.5) + # scatter plot
geom_abline(aes(slope = coef(lm_mod)[[2]], intercept = coef(lm_mod)[[1]], colour='Best Fit Linear Model'), size=1) +
labs(colour="") +
xlab("Petal Width") + ylab("Petal Length") + # x and y label
ggtitle("Edgar Anderson's Iris Data")The residual for the \(i^{th}\) observation are computed as follows.
\[ e_i = y_i - \hat{y_i} \]
where \(y_i\) is the actual observation and \(\hat{y_i}\) is the estimated/predicted outcome using the linear model.
R can automatically compute the residuals using the resid function.
lm_res <- resid(lm_mod)
lm_res <- data.frame(resid = lm_res)
lm_res <- lm_res %>% mutate(Petal.Width = iris$Petal.Width)Now, we can make plots of the residuals.
library(ggplot2)
histogram <- ggplot(data=lm_res, aes(x=resid))
histogram + geom_histogram(binwidth=0.2, color="black") +
xlab("Residuals") + ylab("Frequency") + ggtitle("Histogram of Residuals")library(ggplot2)
scatter <- ggplot(data=lm_res, aes(x = Petal.Width, y =resid))
scatter + geom_point(size = 1.5) + # scatter plot
xlab("Petal Width") + ylab("Residual") + # x and y label +
geom_hline(yintercept=0, linetype='dashed', col = 'red') +
ggtitle("Petal Width vs Residuals")The histogram shows a nearly normal residual distribution while the residual scatter plot shows the variation of the residuals vs the explanatory variable.
Next, we can compute the residual sum of squares (rss) or the sum of squared errors (sse).
\[ rss = \sum_{i=1}^n{(y_i - \hat{y}_i)^2} \]
# residual sum of squares (rss) or the sum of squared errors (sse)
sse <- lm_res %>% transmute(resid^2)
sse <- sum(sse)The sse gives you Left-over variability in the \(y\) values if we know \(x\). It tells you how well does your model fit the actual data. The residual sum of squares is used to help you decide if a statistical model is a good fit for your data. Normally you would compare sse values among linear models. The lower the sse, the better.
Next, we can compute the total sum of squares (tss).
\[ sst = \sum_{i=1}^n{(y_i - \bar{y})^2} \]
sst <- iris %>% transmute((Petal.Length - mean(Petal.Length))^2)
sst <- sum(sst)Finally, using the sse and sst, we can compute the coefficient of determination.
\[ R^2 = \frac{sst - sse}{sst} = 1 - \frac{sse}{sst} \]
The \(R^2\) value tells you the amount (or percent) of variation explained of you model fit into the data. It is a ratio of a measure of variability around the line divided by a measure of total variability. It tells you how well the model fits. the higher the \(R^2\), the better. This value falls between 0 and 1.
R_squared <- 1 - (sse/sst)
R_squared## [1] 0.9271098
Load the iris dataset. Note that this dataset is in the datasets package which is already included in the base R installation. Use dplyr to make subsets (Species: setosa, versicolor, and virginica) of data using variables Sepal.Length and Petal.Length. You should have three subsets. One for each species.
For each species find the best fit linear model to predict Petal.Length using Sepal.Length as the predictor. You should have three Linear models. One for each Species. Include writing the linear model equations with their corresponding sample statistic (intercept and slope). For example, write the equations for \(\hat{y}_{setosa}\), \(\hat{y}_{versicolor}\), and \(\hat{y}_{virginica}\). Interpret the slopes of the models for each species and compare them with each other.
Using your fitted models. Produce a scatterplot which contains the data points colored according to Species. Add a line for each linear model for each species on the same scatterplot. Make sure to put legends and labels correctly.
For each subset you made in problem 1. Compute the correlations. Describe the correlations in this context. Does each correlation reflect what’s on the scatterplot and compare them with each species?
duke_forest data, which is available using the openintro package. Below is a table of the description of the variables of the duke_forest data set. For each variable, indentify whether they are numerical (discrete or continuous) or categorical (nominal or ordinal).| Variable | Description |
|---|---|
| price | Sale price, in USD |
| bed | Number of bedrooms |
| bath | Number of bathrooms |
| area | Area of home, in square feet |
| year_built | Year the home was built |
| cooling | Cooling system: central or other |
| lot | Area of the entire property, in acres |
Produce a paired scatterplots using the pairs function in R. You must use the numerical variables.
Compute the correlations of the bed, bath, area, year_built, and lot variables with the price variable. Which explanatory variable has the highest correlation? Does it reflect a consistent pattern in the scatter plot?
Let the response variable (or the outcome) be the price variable. Create a linear model with a predictor chosen with the best coefficient of determination with the price variable.
Produce a scatterplot and residual plot for the linear model with the best coefficient of determination in problem 4. Interpret the slopes and explain the results according to this context. Based on the residuals, identify any outliers and explain why we need to resort to a more complicated model.