class: center, middle, inverse

# Data Science

<img src="img/hero_wall_pink.png" width="800px"/>

## Kelly McConville

.large[Math 241 | Week 1 | Spring 2021]

---

# Announcements

* Make sure [to sign in](https://docs.google.com/document/d/1QMXSF9TxsXj3j8M42mwTatGnB_-GayYvGmHWnnXd-sg/edit?usp=sharing) and note whether you are on the waitlist.
* Complete the [Initial Participation Assignment](https://docs.google.com/document/d/1jGeyMz1TH_axOZOhHr-04XKrt1AW3PHyFJXCLKs_ffU/edit?usp=sharing).
* Complete the Week 1 Readings (by end of the week):
  * What is Data Science?
    + Reading: MDSR Ch 1.1
  * Data Journalism
    + Reading: CJGD Executive Summary and Introduction
  * Data Visualization Principles
    + Reading: MDSR Ch 2.1 - 2.4

---

# Goals for Today

* Discuss *Data Science*.

--

* Understand the goals of this course.

--

* Go through course structure stuff.

--

* Receive Lab 1: Hand-drawn data viz.

---

class: center, middle
background-image: url(img/jessica-knowlden--AvRbJOVUKA-unsplash.jpg)
background-size: cover

---

## Engagement

.pull-left[
<img src="img/jessica-knowlden--AvRbJOVUKA-unsplash.jpg" width="400px"/>
]

.pull-right[
* Being actively present is key.

{{content}}
]

--

* During class, remove distractions.
  + Close email, social media, news, etc.
  + Hide your phone (unless using Slack to post/answer questions).

{{content}}

--

* Turn on your video.
  + This helps me a ton with pacing!

{{content}}

--

* What you get out of Math 241 will depend on what you put into it.

{{content}}

---

## What is data science?

* How is **data science** not the same thing as **statistics**?

--

* Wikipedia says:

> "Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to **extract knowledge** and insights from structured and unstructured **data**."

--

* But isn't statistics the field of extracting knowledge from data?
---

### Data Science: The Original Venn Diagram Definition

<div class="figure" style="text-align: center">
<img src="img/Data_Science_VD1.png" alt="Drew Conway (2012)" width="60%" />
<p class="caption">Drew Conway (2012)</p>
</div>

---

### Data Science: And the Chorus of Follow-up Venn Diagrams

<div class="figure" style="text-align: center">
<img src="img/Data_Science_VD2.png" alt="Steven Geringer (2014)" width="80%" />
<p class="caption">Steven Geringer (2014)</p>
</div>

---

### Data Science: And the Chorus of Follow-up Venn Diagrams

<div class="figure" style="text-align: center">
<img src="img/Data_Science_VD3.png" alt="Stephan Kolassa (2015)" width="60%" />
<p class="caption">Stephan Kolassa (2015)</p>
</div>

---

### Data Science: And the Chorus of Follow-up Venn Diagrams

<div class="figure" style="text-align: center">
<img src="img/Data_Science_VD4.png" alt="Gartner (2016)" width="90%" />
<p class="caption">Gartner (2016)</p>
</div>

---

## Expert Opinions (on Twitter)

> "Statistician: person using data in [scientifically] rigorous ways. Data Scientist: person writings blog posts, giving talks … about data." — Dirk Eddelbuettel (@eddelbuettel)

--

> "@eddelbuettel my cynical definition: a data scientist is a statistician who is useful ;)" — Hadley Wickham

--

> "Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician." — Josh Wills

---

### Defining Data Science through the Data Analysis Workflow

<img src="img/DAW.png" width="70%" style="display: block; margin: auto;" />

* Is the data analysis workflow different for data scientists than it is for statisticians?

---

## Traditional Example

Biologist wants to understand the difference in plant growth for three possible treatments.

--

* **Question Formulation**:
  + Statistician helps define the problem in terms of random variables.
    + Y = growth after 30 days
    + X = three possible treatments

--

* **Data Acquisition**:
  + Statistician helps create an appropriate experimental design (using random assignment) so that causation can be inferred.
  + Biologist conducts the experiment on 90 plants and records results with pencil and paper.

---

## Traditional Example

* **Data Wrangling**:
  + Statistician tells biologist to type results into a spreadsheet and save it as a csv file.
  + Statistician loads the data into a statistical program for analysis.

--

* **Exploration and Visualization**:
  + Statistician computes mean growth by group.
  + Statistician creates boxplots of growth by group.

---

## Traditional Example

* **Modeling and Inference**:
  + Statistician constructs a one-way ANOVA model (and checks conditions!).
  + Statistician runs an F test to look for evidence that the means differ.
  + Statistician constructs pairwise confidence intervals for difference in means by treatment.

--

* **Communicating Findings**:
  + Biologist and statistician write a peer-reviewed journal article about their findings.
  + It takes two years to get the results published.

---

## Traditional Example

* This was a *simple* analysis example that you should mostly be able to do after Math 141.

--

* Statisticians have done a lot of work to handle **more difficult** examples.
  * New **experimental designs** for when you have several treatment combinations to consider and not much data.

--

  * Models to use when ANOVA assumptions (such as normality) aren't met.

--

  * More complex **models** if you want to include additional variables.

--

  * Robust **inference** techniques to handle the multiple comparison problem.

---

## Traditional Example

* But, statisticians may struggle if this problem is modified in the following ways...
  + Data are to be culled from a gardening app.

--

  + Biologist wants to incorporate additional, messy data sources (such as weather and dates).
--

  + End goal is an interactive visualization or blog post, geared toward a wide audience.

---

## Back to the Data Analysis Workflow

<img src="img/DAW.png" width="40%" style="display: block; margin: auto;" />

Statisticians have made great strides in...

--

+ (Some aspects of) Data Acquisition

--

+ Modeling and Inference

---

## Back to the Data Analysis Workflow

<img src="img/DAW.png" width="40%" style="display: block; margin: auto;" />

Statisticians and statistics education, generally speaking, have fallen behind on...

--

* (Some aspects of) Data Acquisition (e.g., web scraping)

--

* Data Wrangling

--

* Data Viz

--

* Broader communication

---

<img src="img/DAW.png" width="40%" style="display: block; margin: auto;" />

**Ways data science seems to differ**:

--

* Push to handle messy and different kinds of data.
  + Including big data.

--

* Larger focus on using data visualizations to tell a story.
  + Interactive visualizations.

--

* Increased interest in predictive modeling.

--

* Faster outputting of findings.

---

## Let's Discuss Data Science Hype

--

* Data Science is not going to solve all of the world's problems.
  + And, many problems are still best answered using the traditional statistical tools.

--

* Data Science `\(\neq\)` Big Data.
  + It is about answering questions with data (big, medium, or small).

--

* You can't do data science without statistics.
  + Don't forget all you know about random variables, sampling, quantifying uncertainty, generalizability, bias, modeling...
  + Remember, Math 141 is a pre-req for this course!

---

### Math 241: What will we be doing?

**Goals**:

- **Goal 1**: Develop the ability to effectively write about data for a non-technical audience.

--

→ Action: Mini-projects include a blog post component.
--

- **Goal 2**: Gain data acumen for a variety of data types:

--

→ Action: Explore

- Spatial data via `leaflet`, `sf`, and `ggspatial`
- Text/strings via `stringr` and `tidytext`
- Factors via `forcats`
- Dates via `lubridate`
- Relational databases with `SQL`
- Data on the web via `rvest` and `httr`

---

### Math 241: What will we be doing?

* **Goal 3**: Develop coding habits that align with best practices in the field.

--

→ Action: Use a style guide, practice creating reproducible examples, and learn to refactor our code.

--

* **Goal 4**: Understand the current ethical issues around data and be able to reason through these issues using the ASA Ethical Guidelines.

--

→ Action: Read the guidelines and practice on case studies.

--

* **Goal 5**: Create data visualizations, sometimes with interactivity, of multivariate data.

--

→ Action: Create visualizations with

- Our own hands!
- `ggplot2`
- `gganimate`
- `plotly`
- `shiny`

---

### Math 241: What will we be doing?

* **Goal 6:** Learn to disseminate data.

--

→ Action: In Mini-Project 1, we will each create an R data package!

--

* **Goal 7:** Acquire a reproducible and collaborative workflow for data analyses.

--

→ Action: Use

- R Markdown files for our labs and blog posts.
- GitHub for storing and sharing our projects.

---

## Course Structure

* Live lectures on T and TH on Zoom.

--

* Course communication via [Slack](https://reed-statistics.github.io/math241s21/reed-math241s21.slack.com).
  + Check daily M-F.

--

* Submitting work: [Gradescope.com](gradescope.com)

--

* Course work and materials: [rstudio.reed.edu](rstudio.reed.edu).
  + Some will also be shared via a [Google Drive Folder](https://drive.google.com/drive/folders/10SuL0hnAhLZ5uN8EMw2pJGeEtJzB-Wg9?usp=sharing).
--

* Syllabus, schedule, class slides, office hours: [Course website](https://reed-statistics.github.io/math241s21/syllabus.html)

---

## Math 241: Data Science

**Components:**

* Weekly readings:
  + Links found in the [Schedule on the course website](https://reed-statistics.github.io/math241s21/schedule.html).
  + I recommend finishing the week's readings early in the week.
  + Repeated readings will come from:
    + [Modern Data Science with R by Baumer, Horton, and Kaplan](https://mdsr-book.github.io/mdsr2e/)
    + [The Curious Journalist’s Guide to Data by Stray](https://www.cjr.org/tow_center_reports/the_curious_journalists_guide_to_data.php)
    + [R for Data Science by Grolemund and Wickham](https://r4ds.had.co.nz/)

--

* Participation:
  + In class
  + During the first four weeks of the semester, you must:
    + Attend office hours at least twice (mine or the course assistant's).
    + Contribute at least 2 content-related posts to Slack.

--

* What Slack posts count?
  + Asking a question about course content
  + Answering someone else's question
  + Posting a useful resource and why you found it helpful
  + Creating an example that illustrates a recent concept

---

## Math 241: Data Science

**Components:**

* (Weekly-ish) Homework
  + Composed of R Markdown lab assignments.
  + Due Thursdays at 8:30am on Gradescope.

--

* Projects
  - Practice the team aspect of data science.
  - Emphasis on creating sharable, high-quality work for a broad audience.
  - Check out [Project Due Dates on the course website](https://reed-statistics.github.io/math241s21/projects.html)
    - Mini Project 1: Creating an R Data Package
    - Mini Project 2: Data Viz and Wrangling
    - Final Project

---

## Math 241: Data Science

The level of "group work" varies among the different forms of assessment.

* Projects:
  + Fully group work.
  + Will report individual contributions to ensure work was fairly distributed.

--

* Homework:
  + Can work with others but what you turn in **must be in your own words**.
  + Sharing all/most of the code for an assignment is copying.

---

## Learning to Be a Data Scientist

* Many of the assignments will provide opportunities to stretch yourself and learn code not covered in class.

--

* **Goal:** Develop our abilities to effectively search for and evaluate potential solutions.

--

* **Potential Erroneous Side Effect:** Concluding that you are not good at data science because you can't solve the problem "on your own" or because you find the answers on StackOverflow confusing/unhelpful.

--

* I have high expectations but know that all of you (regardless of your stats or computing background) have the ability to meet them.

---

## Learning to Be a Data Scientist

* **Goal:** Develop our abilities to effectively search for and evaluate potential solutions.

* I encourage the following strategy:

→ Try the problem.

--

→ If and when you get stuck, look to the internet for solutions.

--

→ If you find some promising answers, try them out.

--

→ If still stuck, post your question on Slack and/or come to office hours. Don't just stay stuck.

--

* Try to find the right balance between independent learning and supported learning.
  + And, get help **before** the frustration sets in!

---

## Code of Conduct

We expect all members of Math 241 to make participation a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We expect everyone to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community of learners.

Examples of unacceptable behavior include: using sexualized language or imagery, making insulting or derogatory comments, harassing someone publicly or privately, or other unprofessional conduct.
Instead, you can contribute to a positive learning environment by demonstrating empathy and kindness, being respectful of differing viewpoints and experiences, and giving and gracefully accepting constructive feedback.

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/version/2/0/code_of_conduct.html), version 2.0.

---

## Hand-Drawn Data Viz

* Two key aspects of data visualization:
  + Mapping the data to visual objects.
  + Figuring out how to tell the computer to do that mapping.

--

* Hand-drawn data visualizations allow us to focus on the first part, with full control over the creative process!

---

## Hand-Drawn Data Viz Examples

* [Dear Data](http://www.dear-data.com/theproject)

> "Each week, and for a year, we collected and measured a particular type of data about our lives, used this data to make a drawing on a postcard-sized sheet of paper, and then dropped the postcard in an English “postbox” (Stefanie) or an American “mailbox” (Giorgia)!"

---

## Dear Data Examples

<img src="img/dearDataTime.png" width="100%" style="display: block; margin: auto;" />

---

## Dear Data Examples

<img src="img/dearDataComplaints.png" width="100%" style="display: block; margin: auto;" />

---

## Mapping Manhattan

* Becky Cooper handed out hand-drawn maps of Manhattan to strangers and asked them to ["map their Manhattan."](https://www.goodreads.com/book/show/15842664-mapping-manhattan?from_search=true)

<div class="figure" style="text-align: center">
<img src="img/mapmanhattan.png" alt="Map drawn by New Yorker staff writer Patricia Marx" width="100%" />
<p class="caption">Map drawn by New Yorker staff writer Patricia Marx</p>
</div>

---

## Lab Assignment 1

.pull-left[
* Create your own Dear Data "postcards".
  + Let's grab Lab 1 from the shared folder on the RStudio Server.
]

.pull-right[
<img src="img/supplies.png" width="60%" style="display: block; margin: auto;" />
]