class: center, middle

## The Structure of Data

<img src="img/DAW.png" width="500px"/>

<span style="color: #91204D;"> .large[Kelly McConville | Math 141 | Week 1 | Fall 2020] </span>

---

## Announcements

* Complete the [Initial Participation Assignment](https://docs.google.com/document/d/1cAlH0bT6OckIXkxJ4DVAhXeRD_ejJONQwP99pjntMGg/edit?usp=sharing) in your lab's channel of the class Slack Workspace
    + Due before your first lab session.

* We will often do worksheets in class. They won't be turned in; they are there to help us engage with the main ideas of the day.

* We will learn about **Gradescope** for submitting assignments and **RStudio** in Lab this week!
    - For the readings, you don't need to run the R code (yet).

---

## Week 1 Topics

* Statistical Thinking
* **Structure of data**
* (Lab) Working in RStudio using RMarkdown documents
* Data Viz

---

## Goals for Today

* Recap: Statistical Thinking
* Structure of data

---

## Statistical Thinking Worksheet Recap

* Overall, which group was convicted at a higher rate?

--

* When the victim was white, which group was convicted at a higher rate?

--

* When the victim was a minority, which group was convicted at a higher rate?
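
Each of the three questions above comes down to comparing proportions. As a minimal sketch (written in Python purely for illustration; the `counts` layout and the `conviction_rate` helper are assumptions, not course code), the counts from the conviction table on the Simpson's Paradox slide can be turned into conviction rates:

```python
# Counts copied from the conviction table on the Simpson's Paradox slide.
# Key: (victim race, defendant race) -> (convicted, acquitted)
counts = {
    ("minority", "minority"): (19, 45),
    ("minority", "white"): (5, 19),
    ("white", "minority"): (10, 15),
    ("white", "white"): (40, 67),
}


def conviction_rate(defendant, victim=None):
    """Percent of defendants of one race who were convicted.

    If `victim` is None, cases are pooled over both victim races.
    (Hypothetical helper for illustration only.)
    """
    convicted = total = 0
    for (v, d), (conv, acq) in counts.items():
        if d == defendant and victim in (None, v):
            convicted += conv
            total += conv + acq
    return round(100 * convicted / total, 1)


# One comparison per recap question: overall, white victim, minority victim.
for victim in (None, "white", "minority"):
    label = "overall" if victim is None else f"{victim} victim"
    rates = {d: conviction_rate(d, victim) for d in ("minority", "white")}
    print(label, rates)
```

Run as written, this gives 32.6% (minority defendants) versus 34.4% (white defendants) overall, but 40.0% versus 37.4% when the victim was white and 29.7% versus 20.8% when the victim was a minority. The comparison flips once the victim's race is taken into account.
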
--

<span style="color: #91204D;"> HOW IS THIS POSSIBLE?</span>

---

## Simpson's Paradox

<span style="color: #91204D;"> HOW IS THIS POSSIBLE?</span>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> Minority Defendant Convicted </th>
   <th style="text-align:right;"> Minority Defendant Acquitted </th>
   <th style="text-align:right;"> White Defendant Convicted </th>
   <th style="text-align:right;"> White Defendant Acquitted </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> Minority Victim </td>
   <td style="text-align:right;width: 2cm; "> 19 </td>
   <td style="text-align:right;width: 2cm; "> 45 </td>
   <td style="text-align:right;width: 2cm; "> 5 </td>
   <td style="text-align:right;width: 2cm; "> 19 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> White Victim </td>
   <td style="text-align:right;width: 2cm; "> 10 </td>
   <td style="text-align:right;width: 2cm; "> 15 </td>
   <td style="text-align:right;width: 2cm; "> 40 </td>
   <td style="text-align:right;width: 2cm; "> 67 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Total </td>
   <td style="text-align:right;width: 2cm; "> 29 </td>
   <td style="text-align:right;width: 2cm; "> 60 </td>
   <td style="text-align:right;width: 2cm; "> 45 </td>
   <td style="text-align:right;width: 2cm; "> 86 </td>
  </tr>
 </tbody>
</table>

**Key factors:**

--

For which race of victim is the conviction rate higher?

--

→ Across all defendants, the conviction rate is 37.9% for white victims and 27.3% for minority victims.

--

When the defendant is white, what tends to be the race of the victim?

--

→ White defendants tend to have white victims. Minority defendants tend to have minority victims.

---

class: middle, inverse, center

## What is "Statistical Thinking"?

---

## Statistical Thinking

* Importance of the appropriate measures/metrics.
    + Proportional reasoning

--

* Utilizing multivariate thinking

--

* Understanding that context is queen.

--

* And so much more!

---

class: middle, inverse, center

## What is data?
---

* The dictionary definition:

> "data: factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation" -- Merriam-Webster

--

* Wikipedia:

> "Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable."

--

* Our textbook definition:

> "Data comes to us in a variety of formats, from pictures to text to numbers." -- ModernDive

--

* Data Feminism:

> "... by the time that information becomes data, it's already been classified in some way. Data after all, is information made *tractable*." -- D'Ignazio and Klein

---

### Data Frames

<table class="table table-responsive table-bordered table-striped" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> UserID </th>
   <th style="text-align:right;"> Tree_Height </th>
   <th style="text-align:left;"> Common_Name </th>
   <th style="text-align:left;"> Park </th>
   <th style="text-align:right;"> DBH </th>
   <th style="text-align:left;"> Species_Factoid </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 105 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 37.4 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 94 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 32.5 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail.
   </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 23 </td>
   <td style="text-align:left;"> Lavalle Hawthorn </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 9.7 </td>
   <td style="text-align:left;"> Like most hawthorns, the tree has stout thorns up to 2" long. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:right;"> 28 </td>
   <td style="text-align:left;"> Northern Red Oak </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 10.3 </td>
   <td style="text-align:left;"> Acorns take two years to mature and are an important food source for wildlife. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:right;"> 102 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 33.2 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:right;"> 95 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 32.1 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail.
   </td>
  </tr>
 </tbody>
</table>

Data in spreadsheet-like format where:

--

* Rows = Observations/cases

--

* Columns = Variables

---

### Data Frames

<table class="table table-responsive table-bordered table-striped" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> UserID </th>
   <th style="text-align:right;"> Tree_Height </th>
   <th style="text-align:left;"> Common_Name </th>
   <th style="text-align:left;"> Park </th>
   <th style="text-align:right;"> DBH </th>
   <th style="text-align:left;"> Species_Factoid </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 105 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 37.4 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 94 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 32.5 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 23 </td>
   <td style="text-align:left;"> Lavalle Hawthorn </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 9.7 </td>
   <td style="text-align:left;"> Like most hawthorns, the tree has stout thorns up to 2" long. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:right;"> 28 </td>
   <td style="text-align:left;"> Northern Red Oak </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 10.3 </td>
   <td style="text-align:left;"> Acorns take two years to mature and are an important food source for wildlife.
   </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:right;"> 102 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 33.2 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:right;"> 95 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 32.1 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
 </tbody>
</table>

Rows = Observations/cases

**What are the cases? What does each row represent?**

---

### Data Frames

<table class="table table-responsive table-bordered table-striped" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> UserID </th>
   <th style="text-align:right;"> Tree_Height </th>
   <th style="text-align:left;"> Common_Name </th>
   <th style="text-align:left;"> Park </th>
   <th style="text-align:right;"> DBH </th>
   <th style="text-align:left;"> Species_Factoid </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 105 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 37.4 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 94 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 32.5 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail.
   </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 23 </td>
   <td style="text-align:left;"> Lavalle Hawthorn </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 9.7 </td>
   <td style="text-align:left;"> Like most hawthorns, the tree has stout thorns up to 2" long. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:right;"> 28 </td>
   <td style="text-align:left;"> Northern Red Oak </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 10.3 </td>
   <td style="text-align:left;"> Acorns take two years to mature and are an important food source for wildlife. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:right;"> 102 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 33.2 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:right;"> 95 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 32.1 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail.
   </td>
  </tr>
 </tbody>
</table>

Columns = Variables

**Variables**: Describe characteristics of the observations

--

* **Quantitative**: Numerical in nature

--

* **Categorical**: Values are categories

--

* **Identification**: Uniquely identify each case

---

### Data Frames

<table class="table table-responsive table-bordered table-striped" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> UserID </th>
   <th style="text-align:right;"> Tree_Height </th>
   <th style="text-align:left;"> Common_Name </th>
   <th style="text-align:left;"> Park </th>
   <th style="text-align:right;"> DBH </th>
   <th style="text-align:left;"> Species_Factoid </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 105 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 37.4 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 94 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 32.5 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 23 </td>
   <td style="text-align:left;"> Lavalle Hawthorn </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 9.7 </td>
   <td style="text-align:left;"> Like most hawthorns, the tree has stout thorns up to 2" long.
   </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:right;"> 28 </td>
   <td style="text-align:left;"> Northern Red Oak </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 10.3 </td>
   <td style="text-align:left;"> Acorns take two years to mature and are an important food source for wildlife. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:right;"> 102 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 33.2 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:right;"> 95 </td>
   <td style="text-align:left;"> Douglas-Fir </td>
   <td style="text-align:left;"> Gammans Park </td>
   <td style="text-align:right;"> 32.1 </td>
   <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td>
  </tr>
 </tbody>
</table>

**It is important to understand what each variable represents and its units of measurement.**

--

Example questions:

* For categorical variables, what are the categories? Do those categories adequately capture what the variable is meant to represent?

--

* For quantitative variables, what values are possible? Were the data rounded or binned? Are those values actually encoding categories?

---

## Lab This Week

* Introduction to R/RStudio
* Introduction to Gradescope
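
---

### Data Frames in Code: A Sketch

Returning to the tree table from the Data Frames slides: the idea that rows are cases and columns are variables can be sketched in a few lines. This is a rough illustration in plain Python, not the R tools used in lab. The `variable_types` labels are one reading of the table (an assumption, especially treating `UserID` as an identification variable), `summarize` is a hypothetical helper, and `Species_Factoid` is left out for brevity.

```python
# The six rows shown on the Data Frames slides (Species_Factoid omitted).
columns = ["UserID", "Tree_Height", "Common_Name", "Park", "DBH"]
rows = [
    (1, 105, "Douglas-Fir", "Gammans Park", 37.4),
    (2, 94, "Douglas-Fir", "Gammans Park", 32.5),
    (3, 23, "Lavalle Hawthorn", "Gammans Park", 9.7),
    (4, 28, "Northern Red Oak", "Gammans Park", 10.3),
    (5, 102, "Douglas-Fir", "Gammans Park", 33.2),
    (6, 95, "Douglas-Fir", "Gammans Park", 32.1),
]

# Each case (row) becomes a dict keyed by variable (column) name.
trees = [dict(zip(columns, row)) for row in rows]

# Assumed variable types, assigned by reading the table. The slides do not
# state units for Tree_Height or DBH, so none are recorded here.
variable_types = {
    "UserID": "identification",
    "Tree_Height": "quantitative",
    "Common_Name": "categorical",
    "Park": "categorical",
    "DBH": "quantitative",
}


def summarize(column):
    """Answer a basic question about one variable, depending on its type."""
    values = [case[column] for case in trees]
    kind = variable_types[column]
    if kind == "quantitative":
        # What values are possible? Start with the observed range.
        return {"min": min(values), "max": max(values)}
    if kind == "categorical":
        # What are the categories?
        return {"categories": sorted(set(values))}
    # Identification: does each value pick out exactly one case?
    return {"unique": len(set(values)) == len(values)}


print(len(trees), "cases,", len(columns), "variables")
for name in columns:
    print(name, variable_types[name], summarize(name))
```

A summary like this only begins to answer the example questions. Whether the heights were rounded, or what units `DBH` is in, still has to come from the data's documentation.
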