If you are unable to attend class, you can still earn participation credit by completing the following activity:
Watch the day’s lecture video (which will usually be posted by around 3pm PST).
Write a short response to the video that includes:
A 1 - 2 paragraph summary of the main ideas and topics discussed.
A 1 - 2 paragraph discussion of 1 real-world example of the theory, method or application in the lecture that is pertinent to your life, was in the news, or that you’ve found interesting. For example, the lecture may have discussed the decomposition of an image using the grammar of graphics, and you could find one image from a newsroom and discuss the geometric shapes, aesthetic attributes, and data variables that appear in this image.
One question you have about the content covered in the lecture video.
Send your response to Nate on Slack (either as a message or attached image / .pdf file) before the start of the next class day.
Lecture Video (Reed Kerberos log-in required)
Course structure
Note that the listed reading assignments should be completed prior to class
Here is a link to the Tampa Bay Times article on race and the Stand Your Ground defense from which today’s data was taken.
Here is an article by Jeff Witmer, who did some analysis on the TBT data.
Here is a short video (~4 min) further discussing the statistical phenomenon of Simpson’s Paradox that we observed in class today.
R and RStudio
Structure of Data
Note that the listed reading assignments should be completed prior to class
Read Sections 1.1 - 1.4 in ModernDive
Complete Learning Checks LC 1.4, LC 1.5, LC 1.6 and submit to Gradescope.
Note that the listed assignments should be completed prior to class
Here is the complete Netflix dataset that was discussed in class today: Netflix.csv
This data set was obtained from the TidyTuesday project here.
TidyTuesday is a weekly data investigation project organized by a group of data scientists, whose purpose is to practice and collaborate on data summarizing and arranging tasks.
Note that the listed reading assignments should be completed prior to class
Sections to Read Sections 2.1 and 2.2 in ModernDive
Reading Questions (Submit answers on Gradescope)
An in-depth discussion of the graphical reasoning surrounding the Challenger disaster can be found in Edward Tufte’s booklet Visual and Statistical Thinking: Displays of Evidence for Making Decisions (available for check-out in the Reed library)
In 1986, President Reagan appointed a presidential commission tasked with investigating the Space Shuttle Challenger disaster. Their findings were given in the Rogers Commission Report, and summarized in live congressional testimony.
Note: Class will be held remotely on Zoom on Monday, 1-31. The Zoom link is available in the #announcements-wells channel of our Slack workspace.
ggplot2: Scatterplots, Linegraphs and more
Note that the listed reading assignments should be completed prior to class
Sections to Read Sections 2.3 - 2.5 in ModernDive
Reading Questions (Submit answers on Gradescope)
ggplot2 continued: Histograms, Boxplots, Barcharts, and customization
Note that the listed reading assignments should be completed prior to class
Sections to Read Sections 2.6 - 2.9 in ModernDive
Reading Questions (Submit answers on Gradescope)
Portland Biketown is a bike-sharing system owned by the Portland Bureau of Transportation, managed by Lyft, and sponsored by Nike. The program allows users to rent bikes at any station throughout the city, ride, and then deposit the bike at any station.
The Biketown program logs data on each ride, including start/end location and time, distance traveled, fare type (subscriber vs casual), and more. The data is uploaded to the program’s website and then made publicly available here.
Here is the biketown.csv file, a sample of 9999 observations that I’ve been using in class Monday and Wednesday.
ggplot2
Labs are due by 11:59pm the day before your next lab meeting
Note that the listed reading assignments should be completed prior to class
Sections to Read Sections 4.1, 4.3, 5.2 - 5.5 in Introduction to Modern Statistics. Note: this is not the ModernDive textbook.
Reading Questions (Submit answers on Gradescope)
What are two different values that can be used to measure the center of a quantitative data set? What are two different values that can be used to measure the spread of a quantitative data set?
True or false? You can compute the mean of a categorical variable. Explain.
Data Wrangling: The Verbs (Filter, Select, Mutate, Arrange, Summarize) and the Pipe
Here is an .Rmd file of solutions to the day’s wrangling activity. And here is the output of the code.
Note that the listed reading assignments should be completed prior to class
Sections to Read Sections 3.1 - 3.6 in ModernDive
Reading Questions (Submit answers on Gradescope)
What is one “problem” the pipe operator solves when coding?
Answer LC3.2, LC3.6 from the text
Data Wrangling: More practice, advanced wrangling
Sections to Read Sections 3.8, 3.9, 4.1, 4.2 in ModernDive
Reading Questions (Submit answers on Gradescope)
Answer LC3.18, LC4.1 from the text
In addition to the reading questions, please complete this anonymous survey for use in Wednesday’s class. You do not need to submit the survey results to Gradescope.
Data wrangling with dplyr
More Data Wrangling .Rmd template
Modified my_starwars.csv
Sections to Read Sections 2.1, 2.2 and 2.3 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
The website Rotten Tomatoes shows a proportion of audience respondents who were satisfied with a film. If a particular film has an audience score of 50%, do you think this means that 50% of all audience members are dissatisfied with the film? Why or why not?
For each of the following two research questions, what is the implied population and what represents an individual observation?
Have daily high temperature readings increased in Portland, OR over the past 20 years?
Does the Moderna COVID-19 vaccine reduce the death rate in patients with severe cases of COVID-19?
Note that the listed reading assignments should be completed prior to class
Sections to Read Review Sections 2.1 - 2.3 in Introduction to Modern Statistics (this was also the reading for last Friday). Then spend some time exploring the Gapminder Bubbles Charts.
Reading Questions (Submit answers on Gradescope)
On the Gapminder Bubbles Charts, change some of the variables on the x- and y-axes. What were some variables that seemed strongly correlated? Do you expect any of these variables to have a causal relationship? Explain.
Describe one interesting trend or observation you discovered using the Gapminder bubble charts.
Sections to Read Sections 7.1 - 7.2 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
Suppose the residual of an observation is negative, based on a certain linear model. Does this mean the model overestimated or underestimated the true value of the outcome?
A linear model to predict the cost of Reed tuition (in thousands of dollars) might be \[ \textrm{Cost} = 43 + 1.8 \cdot \textrm{Year} \] What do the values of the slope and intercept represent in the context of the model?
Sections to Read Chapter 5 Intro and sections 5.1 and 5.3 in ModernDive
Reading Questions (Submit answers on Gradescope)
LC 5.1 (you don’t need to include your actual data or visualization, just your response to the question)
What is the largest difference between the treatment of Linear Regression in ModernDive Section 5.1 and its treatment in Intro to Modern Statistics Section 7.1? That is, what are you able to do after reading ModernDive that you weren’t able to do with just Intro to Modern Statistics?
Note that the listed reading assignments should be completed prior to class
Sections to Read Section 5.2 in ModernDive
Reading Questions (Submit answers on Gradescope)
Note that the listed reading assignments should be completed prior to class
Sections to Read Section 6.1 in ModernDive
Reading Questions (Submit answers on Gradescope)
What is one essential difference between the interaction model and the parallel slopes model for multiple linear regression?
What is one conclusion we could draw from either the interaction or parallel slopes model for UT Austin evaluation scores in Section 6.1, that we could not draw from the simple linear model for UT Austin evaluation scores as a function of age (as in Section 5.1)?
Multilinear Regression: Geometry of the Model and Multiple Quantitative Explanatory Variables
Note that the listed reading assignments should be completed prior to class
Sections to Read Section 6.2 in ModernDive
Reading Questions (Submit answers on Gradescope)
LC6.2 (You don’t need to include your data visualizations in your submission, just your answer to the question)
Random Sampling
Note that the listed reading assignments should be completed prior to class
Sections to Read 7.1 and 7.2 in ModernDive
Reading Questions (Submit answers on Gradescope)
The Sampling Distribution
Note that the listed reading assignments should be completed prior to class
Sections to Read 7.3, 7.5 in ModernDive
Reading Questions (Submit answers on Gradescope)
Note that the listed reading assignments should be completed prior to class
Sections to Read None
Reading Questions (Submit answers on Gradescope)
Note that the listed reading assignments should be completed prior to class
Sections to Read 8.1 and 8.2 in ModernDive (This reading is optional, and will be revisited in Wednesday’s class)
Reading Questions (Submit answers on Gradescope)
Note that the listed reading assignments should be completed prior to class
Sections to Read 8.3 in ModernDive (Review Sections 8.1 and 8.2 if you did not do so for Wednesday’s class)
Reading Questions (Submit answers on Gradescope)
Note that the listed reading assignments should be completed prior to class
Sections to Read 8.4 and 8.5 in ModernDive
Reading Questions (Submit answers on Gradescope)
What is one advantage offered by the infer package method for bootstrap confidence intervals compared to the “original workflow” discussed at the start of Section 8.4?
Assuming that you have to use the same sample either way, which confidence interval has a higher certainty of containing the population parameter: a wider interval or a narrow interval? Explain.
Note that the listed reading assignments should be completed prior to class
Sections to Read 11.1 - 11.3 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
In your own words, briefly explain what the null distribution for a test statistic represents.
Suppose Nate has a coin that he flips repeatedly, recording the results each time. What type of evidence from a sequence of heads / tails would convince you that the coin is not fair?
Note that the listed reading assignments should be completed prior to class
Sections to Read 9.1, 9.2 and 9.3 in ModernDive
Reading Questions (Submit answers on Gradescope)
In the study on gender and promotion rate in Section 9.1, what are two possible explanations for the observed difference in promotion rate in the sample?
LC 9.3
infer
Exploring Permutation Tests
Note that the listed reading assignments should be completed prior to class
Sections to Read 9.4 - 9.6 in ModernDive
Reading Questions (Submit answers on Gradescope)
3/21 - 3/25
Axioms of Probability
Conditional Probability
Note that the listed reading assignments should be completed prior to class
Sections to Read Sections 3.1 and 3.2 in THIS EXCERPT from OpenIntro Statistics. Note: This is neither the ModernDive textbook nor the Intro to Modern Statistics textbook.
Reading Questions (Submit answers on Gradescope)
“A fair coin is flipped 10 times and lands heads each time.”
Suppose a fair coin is flipped 10 times. What is the probability that all 10 flips are heads? What is the probability that either all 10 flips are heads or all 10 flips are tails?
Suppose a fair coin is flipped twice. What is the conditional probability that the second flip is heads given that neither flip is tails?
Note that the listed reading assignments should be completed prior to class
Sections to Read Sections 3.4 - 3.5 in THIS EXCERPT from OpenIntro Statistics. Note: This is neither the ModernDive textbook nor the Intro to Modern Statistics textbook.
Reading Questions (Submit answers on Gradescope)
In your own words, describe the difference between a quantitative variable and a random variable.
Give an example of a random process you think could be well-represented by a discrete random variable. Give an example of a different random process you think could be well-represented by a continuous random variable.
Suppose we model the length of a randomly selected earthworm as a continuous variable with mean 14 inches. What is the probability that the length of a randomly selected earthworm is exactly 14 inches? Explain.
Note that the listed reading assignments should be completed prior to class
Sections to Read Sections 13.1 - 13.3 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
The Quincunx, bean machine, or “Galton Board” was invented by 19th-century English scientist Sir Francis Galton to demonstrate fundamental principles in probability and statistics. In its basic form, the Quincunx consists of an upright triangular board with evenly spaced pegs lying above evenly spaced bins. Balls are dropped one-by-one from a central chute at the top of the board and bounce either left or right as they hit the pegs. Eventually, they are collected in the bins at the bottom of the board.
Spend some time playing around with the Galton Board here. (After you adjust sliders, be sure to hit the “restart” button as well.)
Viewing the stacks of balls at the bottom of the board as a histogram, what named distribution is the histogram similar to?
What effect does increasing the Size slider have on the shape of the histogram? What effect does increasing the Left/Right slider have on the shape?
What effect does increasing the Speed slider have on the shape?
During which time interval will the shape of the histogram change more? (a) between the 1st and the 100th balls, or (b) between the 901st and 1000th balls? Explain.
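The Galton Board mechanism described above can also be mimicked with a short simulation, if you would like to experiment beyond the applet. The following is a minimal Python sketch, not the applet's actual implementation: the function names, the `p_right` parameter (the chance a ball bounces right at each peg), and the text histogram are all assumptions made for illustration.

```python
import random


def simulate_galton_board(n_balls, n_rows, p_right=0.5, seed=None):
    """Drop n_balls through n_rows of pegs and return the ball count per bin.

    Hypothetical helper: each peg independently sends the ball right with
    probability p_right, so a ball's bin index is its number of right bounces.
    """
    rng = random.Random(seed)
    bins = [0] * (n_rows + 1)  # n_rows of pegs lead to n_rows + 1 bins
    for _ in range(n_balls):
        right_bounces = sum(1 for _ in range(n_rows) if rng.random() < p_right)
        bins[right_bounces] += 1
    return bins


def print_histogram(bins, width=40):
    """Print the bins as a sideways text histogram, scaled to the tallest bin."""
    tallest = max(max(bins), 1)  # avoid dividing by zero when no balls were dropped
    for index, count in enumerate(bins):
        bar = "#" * round(width * count / tallest)
        print(f"bin {index:2d} | {bar} {count}")


if __name__ == "__main__":
    print_histogram(simulate_galton_board(n_balls=1000, n_rows=10, seed=1))
```

Changing `n_rows`, `p_right`, and `n_balls` plays roughly the same role as adjusting the sliders and letting more balls fall, so the sketch can be used to check your intuition against what you see in the applet.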
Note that the listed reading assignments should be completed prior to class
Sections to Read Review Sections 13.1 - 13.3 in Introduction to Modern Statistics from Friday
Reading Questions (Submit answers on Gradescope)
Note that the listed reading assignments should be completed prior to class
Sections to Read None
Reading Questions (Submit answers on Gradescope)
Note that the listed reading assignments should be completed prior to class
Sections to Read 16.1 - 16.2 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
Suppose you are interested in determining whether a majority of Americans disapprove of the US president. In a simple random sample of 100 Americans, you find that 60% disapprove of the current president, which gives a p-value of 0.02, and leads you to reject the null hypothesis. Explain what this means in everyday language in the context of this problem.
Suppose we want to construct a confidence interval for a population proportion \(p\). Based on the Central Limit Theorem, the standard error of the sample proportion \(\hat p\) is \[ SE(\hat p) = \sqrt{\frac{p(1-p)}{n}} \] Explain why we cannot directly apply this formula to create the confidence interval for \(p\).
Note that the listed reading assignments should be completed prior to class
Sections to Read 17.3 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
In order to perform hypothesis testing or create confidence intervals based on a difference in sample proportions \(\hat p_1 - \hat p_2\), we need to check 2 conditions. What are those conditions?
Suppose you perform two 2-sided hypothesis tests for a difference in proportion. In the first test, you obtain a test statistic of \(z = -2.05\) and in the second test, you obtain a test statistic of \(z = 0.04\). Which test gives better evidence to reject the null hypothesis? Explain.
Note that the listed reading assignments should be completed prior to class
Sections to Read 19.2 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
Describe at least 1 similarity and 1 difference between a t distribution and the standard Normal distribution.
Suppose you are interested in investigating the typical course load for Reed students. You obtain a random sample of 25 Reed students and record the number of credits each is currently taking as the variable credits. If you want to perform inference using the credits variable, is the parameter of interest a mean or a proportion? Explain how you know.
Note that the listed reading assignments should be completed prior to class
Sections to Read 20.3, 20.4, 21.3 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
A study wishes to determine whether automatic and manual transmission cars have the same fuel efficiency. The researchers randomly select 10 automatic cars and 10 manual cars, and measure the number of gallons of gas consumed by each after a 100-mile trip. Write null and alternative hypotheses for this research question, both in words and in symbols.
Consider the following two experiments. Which has a paired design and which corresponds to two independent samples? Explain how you know.
Does marijuana assist in injury recovery? A randomized experiment assigns subjects with sprained ankles into two groups: 10 receive a THC brownie every evening for 14 days, while another 10 receive an ordinary brownie every evening for the same period. The number of days until symptoms disappear is recorded for each subject.
A campus organization wants to determine whether listening to rock music before bed has an effect on length of sleep. They recruit 20 students and have them track the number of hours they sleep each night over a 14 day period. After these two weeks, the organization then instructs each of the 20 students to listen to rock music for 1 hour each night before going to sleep, and track the number of hours they sleep each night over another 14 day period. The organization records the average number of hours each student sleeps with and without rock music.
Note that the listed reading assignments should be completed prior to class
Sections to Read 24.1 - 24.6 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
Consider two quantitative variables measured on a population of students: length of index finger and height. If these two variables are independent, and we repeatedly draw samples of 25 students from the population, computing the regression line for each, what do you anticipate will be the average slope for the regression line?
The scatterplot, residual plot, and histogram of residuals for variables \(Y\) and \(X\) are shown below. Discuss any concerns you might have about whether the data satisfies the conditions for making inference about linear regression, based on these plots.
Note that the listed reading assignments should be completed prior to class
Sections to Read 25.1, 25.2, 8.3, 8.4 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
Suppose we are interested in predicting college GPA based on high school GPA and SAT scores. We construct two linear models. The first model is of the form \[ \textrm{college GPA} = \beta_0 + \beta_1 \textrm{HS GPA} \] and the second model is of the form \[ \textrm{college GPA} = \beta_0 + \beta_1 \textrm{HS GPA} + \beta_2 \textrm{SAT} \] Suppose we wish to perform a hypothesis test for the slope on \(\textrm{HS GPA}\) in the two models. State the null hypothesis in each case, and explain the fundamental way in which these two hypotheses differ.
Based on the discussion in Sections 8.3 and 8.4, what is one reason we may decide to use a parsimonious model over the full model?
Note that the listed reading assignments should be completed prior to class
Sections to Read Review 25.1, 25.2, 8.3, 8.4 in Introduction to Modern Statistics (This was Wednesday’s reading)
Reading Questions (Submit answers on Gradescope)
Note that the listed reading assignments should be completed prior to class
Sections to Read 22.1 - 22.3 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
Suppose you are interested in knowing whether a certain date in April is associated with a higher-than-average number of births. To answer this question, you look at the average number of births for each of the 30 days, based on data from 100 hospitals, and find that on April 23rd, there is a statistically significant difference at the 5% level in the number of births compared to the overall average. Explain why it would be incorrect to conclude that this gives good evidence that, in general, there are more births on average on April 23rd. (Think about how many different tests you are performing at the 5% level.)
Consider the 3 sets of boxplots shown below. Which set gives the strongest evidence of a difference in means? Explain. Solid red dots in each box represent the means for each group.
Note that the listed reading assignments should be completed prior to class
Sections to Read 18.1 and 18.2 in Introduction to Modern Statistics
Reading Questions (Submit answers on Gradescope)
Describe 1 similarity and 1 difference between the Chi-Squared Test for Independence and the Hypothesis Test for Difference in 2 Proportions.
Consider 2 sections of Math 141. In total, 20% of students are first years, 30% of students are sophomores, 40% of students are juniors, and 10% of students are seniors. If section and year in school are independent, and there are 20 students in the 10am section of Math 141, what are the expected numbers of first years, sophomores, juniors, and seniors in the 10am section?
Review
Course Summary
Note that the listed reading assignments should be completed prior to class
Sections to Read None
Reading Questions (Submit answers on Gradescope)