class: center, middle

# Modeling

<img src="img/DAW.png" width="500px"/>

<span style="color: #91204D;">
.large[Kelly McConville | Math 141 | Week 4 | Fall 2020]
</span>

---

## Announcements/Reminders

* Project Assignment 1 is due on Friday October 2nd (end of day) on Gradescope.

--

* After your lab this week, you should get started on Lab 4.
    + No coding, just narrative.

--

* Quizzes: We have decided not to include quizzes as a form of assessment for the course.

---

## Week 4 Topics

* Finish up a couple more Data Wrangling examples
* Data collection
* **Modeling**

---

# Goals for Today

* Introduce statistical modeling
* Begin with the simple linear regression model
* Measuring correlation

---

## Statistical Models

### Two Main Motivations

--

You can often tell the modeling motivation from the research question. We will look at studies that ask the following questions:

--

> "Can I use remotely sensed data to predict forest types in Alaska?"

--

→ Motivation: Predict new observations

--

> "Do left-handed people live shorter lives than right-handed people?"

--

→ Motivation: Describe relationships

--

We will focus mainly on descriptive modeling in this course. If you want to learn more about predictive modeling, take Math 243: Statistical Learning!

---

## Form of the Model

<br><br><br>

--

$$
y = f(x) + \epsilon
$$

<br><br><br>

--

**Goal:**

→ Determine a reasonable form for `\(f()\)`. (Ex: Line, curve, ...)

--

→ Estimate `\(f()\)` with `\(\hat{f}()\)` using the data.

--

→ Generate predicted values: `\(\hat{y} = \hat{f}(x)\)`.

---

### Simple Linear Regression Model

Consider this model when:

--

* Response variable `\((y)\)`: quantitative

--

* Explanatory variable `\((x)\)`: quantitative
    + Have only ONE explanatory variable.

--

* AND, `\(f()\)` can be approximated by a line.

---

### Example: Is there a linear relationship between tree diameter and tree height for the trees at the Woodstock Community Center?
<img src="img/woodstock_cc.png" width="70%" style="display: block; margin: auto;" />

---

### Example: Trees at the Woodstock CC

```r
library(tidyverse)  # ggplot2 for plotting, dplyr for summarize()
library(pdxTrees)
woodstock_cc <- get_pdxTrees_parks(park = "Woodstock Community Center")
ggplot(data = woodstock_cc,
       mapping = aes(x = DBH, y = Tree_Height)) +
  geom_point()
```

<img src="wk04_fri_files/figure-html/unnamed-chunk-2-1.png" width="360" />

--

Linear trend? Direction of trend?

---

### Example: Trees at the Woodstock CC

```r
library(tidyverse)  # ggplot2 for plotting, dplyr for summarize()
library(pdxTrees)
woodstock_cc <- get_pdxTrees_parks(park = "Woodstock Community Center")
ggplot(data = woodstock_cc,
       mapping = aes(x = DBH, y = Tree_Height)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)  # overlay the fitted line, no error band
```

<img src="wk04_fri_files/figure-html/unnamed-chunk-3-1.png" width="360" />

Linear trend? Direction of trend?

---

### Example: Trees at the Woodstock CC

```r
library(tidyverse)  # ggplot2 for plotting, dplyr for summarize()
library(pdxTrees)
woodstock_cc <- get_pdxTrees_parks(park = "Woodstock Community Center")
ggplot(data = woodstock_cc,
       mapping = aes(x = DBH, y = Tree_Height)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)  # overlay the fitted line, no error band
```

<img src="wk04_fri_files/figure-html/unnamed-chunk-4-1.png" width="360" />

**A simple linear regression model would be suitable for these data.**

* But first, let's describe more plots!

---

class: center, middle, inverse

# But before that: It's time for Trend Stretches!
---

<img src="wk04_fri_files/figure-html/unnamed-chunk-5-1.png" width="576" style="display: block; margin: auto;" />

--

**Need a summary statistic that quantifies the strength and direction of the linear trend!**

---

## (Sample) Correlation Coefficient

* Measures the strength and direction of the linear relationship between two quantitative variables

--

* Symbol: `\(r\)`

--

* Always between -1 and 1

--

* Sign indicates the direction of the relationship

--

* Magnitude indicates the strength of the linear relationship

--

```r
woodstock_cc %>%
  summarize(cor_ht_dbh = cor(Tree_Height, DBH))
```

```
## # A tibble: 1 x 1
##   cor_ht_dbh
##        <dbl>
## 1      0.828
```

---

<img src="wk04_fri_files/figure-html/unnamed-chunk-7-1.png" width="540" style="display: block; margin: auto;" />

Any guesses on the correlations for A, B, C, or D?

--

```r
dat %>%
  summarize(A = cor(x, y1), B = cor(x, y2),
            C = cor(x, y3), D = cor(x, y4))
```

```
## # A tibble: 1 x 4
##       A      B      C      D
##   <dbl>  <dbl>  <dbl>  <dbl>
## 1 0.695 -0.217 -0.815 -0.113
```

---

## New Example

```r
# Correlation coefficients
dat2 %>%
  group_by(dataset) %>%
  summarize(cor = cor(x, y))
```

```
## # A tibble: 13 x 2
##    dataset        cor
##    <chr>        <dbl>
##  1 away       -0.0641
##  2 bullseye   -0.0686
##  3 circle     -0.0683
##  4 dino       -0.0645
##  5 dots       -0.0603
##  6 h_lines    -0.0617
##  7 high_lines -0.0685
##  8 slant_down -0.0690
##  9 slant_up   -0.0686
## 10 star       -0.0630
## 11 v_lines    -0.0694
## 12 wide_lines -0.0666
## 13 x_shape    -0.0656
```

--

* Can we conclude that `\(x\)` and `\(y\)` have the same relationship across these different datasets because the correlations are nearly the same?

---

### Always graph the data when exploring relationships!

<img src="wk04_fri_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" />

---

### Simple Linear Regression

Let's return to the Example: Trees at the Woodstock CC

<img src="wk04_fri_files/figure-html/unnamed-chunk-12-1.png" width="360" />

* A line is a reasonable model form.
--

* Where should the line be?
    + Slope? Intercept?

---

### Simple Linear Regression

Let's return to the Example: Trees at the Woodstock CC

<img src="wk04_fri_files/figure-html/unnamed-chunk-13-1.png" width="360" />

* A line is a reasonable model form.

* Where should the line be?
    + Slope? Intercept?

---

### Simple Linear Regression

Let's return to the Example: Trees at the Woodstock CC

<img src="wk04_fri_files/figure-html/unnamed-chunk-14-1.png" width="360" />

* A line is a reasonable model form.

* Where should the line be?
    + Slope? Intercept?

---

### Simple Linear Regression

Let's return to the Example: Trees at the Woodstock CC

<img src="wk04_fri_files/figure-html/unnamed-chunk-15-1.png" width="360" />

* A line is a reasonable model form.

* Where should the line be?
    + Slope? Intercept?
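
---

### Sketch: From Correlation to a Line

One common answer to "Where should the line be?" is the least squares line, the same fit that `stat_smooth(method = "lm")` draws. The sketch below is illustrative only: it uses simulated data (not the Woodstock trees), and the names `dbh`, `height`, `r_by_hand`, `slope`, and `intercept` are made up for this example. It computes `\(r\)` from z-scores, builds a slope and intercept from `\(r\)`, and compares both with `cor()` and `lm()`.

```r
# Simulated stand-in data (NOT the Woodstock trees); all values are made up
set.seed(141)
n <- 50
dbh <- runif(n, min = 5, max = 40)                      # hypothetical diameters
height <- 20 + 1.5 * dbh + rnorm(n, mean = 0, sd = 8)   # hypothetical heights

# Correlation from its definition: sum of z-score products over (n - 1)
z_x <- (dbh - mean(dbh)) / sd(dbh)
z_y <- (height - mean(height)) / sd(height)
r_by_hand <- sum(z_x * z_y) / (n - 1)

# Least squares slope and intercept written in terms of r
slope <- r_by_hand * sd(height) / sd(dbh)
intercept <- mean(height) - slope * mean(dbh)

# Compare with R's built-in functions
fit <- lm(height ~ dbh)
all.equal(r_by_hand, cor(dbh, height))
all.equal(unname(coef(fit)), c(intercept, slope))
```

Both comparisons should agree up to rounding error. The sign of the slope always matches the sign of `\(r\)`, because the standard deviations are positive.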