Overview

One of the most important functions of the working statistician is to investigate and answer significant research questions by analyzing real-world data, using a variety of elementary and advanced modeling techniques, and to distill the results into reports that are accessible to the non-statistician.

You will work in small groups to research a topic of interest to you, and then summarize your results in a short video presentation to the class as well as in a technical report submitted to me.

Project Goals

  • Investigate a real-world data set by performing exploratory data analysis and visualization.

  • Formulate a research question and hypothesis.

  • Create a data biography by exploring its context and source.

  • Perform appropriate statistical inference to answer the research question.

  • Craft a clear, engaging narrative answering your research question in a technical report and short pre-recorded presentation.

Project Timeline

Name | Description | Due Date
Group Formation | Submit the name of 1 other person you’d like to be in a group with. | 5pm PST Friday, February 19th (Week 4)
Assignment 1 | Data exploration via wrangling, summarizing, and visualization. | 5pm PST Friday, March 12th (Week 7)
Assignment 2 | Data biography to contextualize the data. | 5pm Friday, March 26th (Week 9)
Assignment 3 | Statistical analysis to answer the research question. | 5pm PDT Friday, April 23rd (Week 12)
Technical Report | 2-3 page technical report outlining the results of Assignments 1-3. | 11:59pm PDT Sunday, May 9th (Last Day of Reading Week)
Presentation | 5-10 minute pre-recorded presentation outlining project results. | 11:59pm PDT Sunday, May 9th (Last Day of Reading Week)

Components

Group Formation

By 5pm on Friday, February 19th, please send me via email or Slack message the name of 1 other person you’d like to be in a group with. If you don’t submit anything, we will randomly pair you with another student in your lecture class. We will then combine pairs to form project groups of 4 students (with a few groups of 3 or 5 students).


Assignment 1

Goals

  • Determine your research question(s), along with the dataset and variables.
  • Practice inspecting data.
  • Practice visualizing and summarizing data.

Notes

  • The project assignments will be fairly open-ended and much less prescribed than your lab assignments, mimicking a more real-world situation where you are tasked with extracting knowledge from data.
  • Think carefully when selecting your research questions since you will explore these same questions for the whole project.
  • Make sure everyone in the group is interested in the selected research questions.
  • Make sure to read the provided background information about the data.

Tasks

  1. As a group, determine
    • Which dataset you want to investigate for your project.
    • Two research questions you want to explore.
      • Each question should relate to 1-3 of the variables in the dataset.
      • The questions should all have the same general theme but may involve different variables.
      • The questions can (and likely will) relate to subsets of the data. For example, maybe you want to focus on how COVID-related behaviors differ between residents of two US states, or you want to focus on protests in a given year and region of the world.
  2. For each research question, start investigating the question by:
    • Producing useful summaries of the variables and their relationships.
    • Graphing each variable and the relationships between variables.
    • Completing any useful data wrangling.
  3. In an Rmd file, write a two-page summary that:
    • States your research questions and some initial answers/findings related to the questions.
    • Introduces the data and addresses what/who the data represent (for your variables of interest).
    • Presents two or three summary statistics and two data visualizations that start to answer one or both of your research questions.
    • Includes your R code.
  4. Turn in the .pdf of your summary on Gradescope by 5pm PST on Friday, March 12th.
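
As a rough illustration of Tasks 2 and 3, here is a minimal sketch in R of a grouped summary and one visualization. It assumes the dplyr and ggplot2 packages are installed, and it uses R's built-in mtcars data purely as a stand-in for your project dataset. The object names (cars, mpg_summary, mpg_plot) and the question it explores (does fuel efficiency differ by transmission type?) are hypothetical choices made for the example, not part of the assignment.

```r
# Illustrative only: mtcars stands in for your project dataset.
library(dplyr)
library(ggplot2)

# Data wrangling: recode the 0/1 transmission indicator into readable labels.
cars <- mtcars %>%
  mutate(transmission = if_else(am == 1, "manual", "automatic"))

# Summary statistics: group size, mean, and standard deviation of mpg
# for each transmission type.
mpg_summary <- cars %>%
  group_by(transmission) %>%
  summarize(
    n = n(),
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg)
  )
mpg_summary

# Visualization: side-by-side boxplots of mpg by transmission type,
# with axis labels that give the reader context.
mpg_plot <- ggplot(cars, aes(x = transmission, y = mpg)) +
  geom_boxplot() +
  labs(
    x = "Transmission type",
    y = "Fuel efficiency (miles per gallon)",
    title = "Fuel efficiency by transmission type"
  )
mpg_plot
```

In your own summary, the narrative should explain what the table and the graph suggest about your research question rather than leaving the output to speak for itself.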

Crafting Research Questions

Usually you should start with a research question and then search for data to help you address the question. For feasibility reasons, we are asking you to work backwards. Here are some tips for generating your research questions:

  • Read over the background information about the dataset that interests you and your group and start considering what relationships you might want to explore.
  • Pick out a few specific variables and (re)frame your question around exploring the relationship between those variables.
  • Make sure your question is focused enough that it can be answered with the data at hand.
  • Here are some generic examples to get your group started:
    • EX: Does country A have a higher rate of X than country B?
    • EX: Is X positively related to Y? (In other words, as X increases, does Y tend to increase?)
    • EX: Is there evidence that trend X is becoming more popular over time?
    • EX: Is there a relationship between X and Y?
    • EX: How well do the following factors, X and Y, predict the variable Z?
    • EX: Are there differences in X by Y?

Rubric

You will be assessed on the following:

  • The informativeness of your summary with respect to one or both of your research questions
  • The appropriateness of the chosen graphs and summary statistics
  • The degree to which each graph makes appropriate use of geoms and their aesthetics, scale, and context
  • The degree to which the graphs are clear and engaging
  • The degree to which the graphs, summary statistics, and narrative support each other
  • The degree to which the text and code are well organized and well-written
  • The originality of work

Assignment 2

Goals

  • Create a data biography by answering the following key questions about the data:
    • Where did the data come from?
    • When were the data collected?
    • Why were the data collected?
    • How were the data collected?
    • Who are the data supposed to represent?
      • Who is present? Who is absent?
      • What evidence is there that the data are representative? What evidence is there that the data are not representative?
  • Better understand the context of our data to reduce the assumptions and biases we are placing on the data.

Notes

  • The project assignments will be fairly open-ended and much less prescribed than your lab assignments, mimicking a more real-world situation where you are tasked with extracting knowledge from data.
  • We encourage you to do some sleuthing here to answer these questions! Don’t just rely on the provided data dictionaries.
  • You should cite your sources at the end of your data biography, using your preferred citation style. Include enough information that a reader can track down each source.

Assignment

  1. Write a 2-3 page data biography that attempts to answer the questions provided in the Goals.
    • Your write-up should be presented as a narrative, using complete sentences and paragraphs.
  2. Turn in the pdf of your biography on Gradescope by 5pm on Friday, March 26th.

Rubric

You will be assessed on the following:

  • The informativeness of your data biography with respect to each of the key questions provided in the Goals section
  • The degree to which the text is supported by references and the appropriateness of the selected references
  • The degree to which the text is well organized and well-written

Further Reading

The following pair of articles discusses the importance of data biographies and outlines the process of creating a good one:

Heather Krause, "Data Biographies: How to Get to Know Your Data," on DATASSIST

Catherine D’Ignazio, "Putting data back into context," on DataJournalism.com


Assignment 3

Goals

  • Conduct statistical inference on your research questions.

Notes

  • The project assignments will be fairly open-ended and much less prescribed than your lab assignments, mimicking a more real-world situation where you are tasked with extracting knowledge from data.

Assignment

  1. Conduct a hypothesis test for one of your research questions.
    • For the hypothesis test,
      • Explicitly state the hypotheses in both words and symbols.
      • Include the method used, the test statistic, and the p-value.
      • Determine an appropriate significance level based on the consequences of Type I and Type II errors.
      • Check assumptions. (If violated, still finish the test but be cautious in your conclusion.)
      • Interpret the p-value in the context of the problem.
      • Discuss conclusions about the conjecture.
      • Describe whether the observed effect has practical significance, based on your understanding of the data context.
  2. Construct a confidence interval for one of your research questions. (It can be the same question explored in 1. or a different question.)
    • For the confidence interval,
      • Include the method used, confidence level, and interval values.
      • Describe why you chose the confidence level you did, based on the relationship between confidence level and margin of error, as well as the specific data context.
      • Check assumptions. (If violated, still construct the confidence interval but be cautious in your conclusion.)
      • Discuss conclusions about the conjecture.
  3. Write a 1-2 page summary of your findings that includes all the pieces specified in 1. and 2. Include appropriate visualizations for the confidence intervals and hypothesis tests.

  4. Turn in the pdf of your summary on Gradescope by 5pm on Friday, April 23rd.
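
To make items 1 and 2 concrete, here is a minimal sketch in R of one possible inference procedure: a Welch two-sample t-test together with the matching confidence interval for a difference in means. It uses only base R and the built-in mtcars data as a stand-in for your project dataset. The research question, the significance level of 0.05, and the object names (alpha, test_result) are assumptions made for the example; your own question may call for a different method entirely.

```r
# Illustrative only: mtcars stands in for your project dataset.
# Hypothetical question: does mean fuel efficiency (mpg) differ between
# cars with automatic (am = 0) and manual (am = 1) transmissions?
#   H0: mu_automatic - mu_manual = 0
#   HA: mu_automatic - mu_manual != 0

alpha <- 0.05  # assumed significance level; justify your own choice

# Assumption check: look at the group sizes. With small groups, the
# normality of mpg within each group matters more, so also inspect
# histograms or normal quantile plots of each group.
table(mtcars$am)

# Welch two-sample t-test (two-sided, does not assume equal variances).
# The same call also returns a confidence interval for the difference
# in means at the requested confidence level.
test_result <- t.test(mpg ~ am, data = mtcars, conf.level = 1 - alpha)

test_result$statistic  # test statistic
test_result$p.value    # p-value, to be compared with alpha
test_result$conf.int   # interval for mu_automatic - mu_manual
```

The write-up would then interpret the p-value and the interval in context, and comment on whether a difference of that size matters in practice.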

Rubric

You will be assessed on the following:

  • For the hypothesis test,
    • Selecting an appropriate parameter of interest
    • Including the correct method, correct test statistic, and correct p-value.
    • Checking assumptions.
    • Correctly interpreting the p-value in the context of the problem.
    • Accurately discussing conclusions about the conjecture.
  • For the confidence interval:
    • Including the correct method and interval values.
    • Including confidence level.
    • Checking assumptions.
    • Accurately discussing conclusions about the conjecture.
  • The degree to which the text is well organized and well-written

Technical Report

Goals

  • Craft a clear, engaging, accurate story about one of your research questions.

Notes

  • The project assignments will be fairly open-ended and much less prescribed than your lab assignments, mimicking a more real-world situation where you are tasked with extracting knowledge from data.

Assignment

  1. Create a 2-3 page technical report that addresses the following:
    • Your research question
    • Your data source
    • Exploratory graphs and summary statistics and what they tell you about your research question
    • An inference procedure (and any assumptions) and interpretation of results
    • Conclusions about your research question
  2. The technical reports are due on Gradescope by 11:59pm PDT on Sunday, May 9th.

Notes

  • In this project assignment, you likely won’t need to conduct any additional analysis. Instead, you will be summarizing content from the previous project assignments.
    • However, it is okay if you do conduct additional analysis.
  • Reports will be graded both for content and for how well the material is discussed.
  • You should address your topic and statistical content at a level that is appropriate for a Math 141 audience.
  • If both of your research questions are related to each other, you are welcome (but not required) to address both in the report if space allows.

Technical Report Details

  • Simply combining your work in Assignments 1, 2, and 3 will produce a document that is much longer than 2 - 3 pages.
    • Instead, think carefully about what the most important details are for your analysis, and curate your previous assignments to highlight and support these.
  • You do not need to include the code used to perform your analysis in the .pdf document itself. You should, however, include summary statistics, visualizations, and the results of any inference where appropriate.
    • To have code run when you knit but not display, replace the chunk header {r} with {r echo = FALSE}.
    • If you also don’t want the output of the code to display, use {r echo = FALSE, include = FALSE}. (Setting include = FALSE on its own already hides both the code and its output.)
  • You can control the size of included graphics by adding fig.width = ... and fig.height = ... to your chunk options, as in {r fig.width = ..., fig.height = ...}, where ... is replaced with the desired width/height of the graphic in inches.
  • Your report can have a title page listing the project title, the project authors, and date. This page does not count towards the page limit.
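
The chunk options described above can also be set once for the whole document instead of chunk by chunk. The sketch below assumes the knitr package (which R Markdown uses to knit) and would go inside a setup chunk near the top of your .Rmd file. The figure dimensions shown are placeholder values, not requirements.

```r
# Place inside a setup chunk near the top of the .Rmd file.
# These become the defaults for every later chunk; options written in an
# individual chunk header still override them for that chunk.
knitr::opts_chunk$set(
  echo = FALSE,     # run the code but do not show it in the knitted report
  fig.width = 6,    # default figure width in inches (placeholder value)
  fig.height = 4    # default figure height in inches (placeholder value)
)
```

Setting defaults this way keeps the chunk headers short and makes it harder to accidentally leave code visible in one chunk.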

Rubric

You will be assessed on the following:

  • Length: Technical reports that are not 2-3 single-spaced pages long will be penalized.
  • Content: Demonstrates a full and accurate understanding of the material presented.
    • The report addresses each item listed in (1).
  • Style: The degree to which the text is well organized and well-written.
  • Sources: At least 2 appropriate references are included. The references should be on a separate page from the report and are not included in the page count.

Presentation

Goals

  • Craft a clear, engaging, accurate story about one of your research questions.

Notes

  • The project assignments will be fairly open-ended and much less prescribed than your lab assignments, mimicking a more real-world situation where you are tasked with extracting knowledge from data.

Assignment

  1. Create a 5-10 minute video presentation that addresses the following:
    • Your research question
    • Your data source
    • Exploratory graphs and summary statistics and what they tell you about your research question
    • An inference procedure (and any assumptions) and interpretation of results
    • Conclusions about your research question
  2. The presentations are due at 11:59pm PDT on Sunday, May 9th. One group member should drop the video into this Ensemble dropbox folder (Reed Kerberos credentials required to upload).

Notes

  • In this project assignment, you likely won’t need to conduct any additional analysis. Instead, you will be summarizing content from the previous project assignments.
    • However, it is okay if you do conduct additional analysis.
  • Videos will be graded both for content and for how well the material is delivered.
  • You should address your topic and statistical content at a level that is appropriate for a Math 141 audience.
  • If both of your research questions are related to each other, you are welcome (but not required) to address both in the video if time permits.

Video Creation

  • We suggest you first create a set of slides (e.g., Google Slides, Beamer, Powerpoint, Keynote…).
  • Then you create a video by recording a presentation of the slides and the corresponding audio.
  • For recording, one option would be to use Zoom. Here is a useful CUS website.
  • When recording, only one person should be in charge of sharing their screen.
  • Practice the presentation several times before recording to ensure a polished final product.
  • For short videos like this, it is easier to do another take than it is to edit the video afterwards.

Rubric

You will be assessed on the following:

  • Participation: All group members must be speakers in the video.
  • Length: Videos that are not between 5 and 10 minutes long will be penalized.
  • Content: Demonstrates a full and accurate understanding of the material presented.
    • The video addresses each item listed in (1).
  • Delivery: How well the content is presented, including whether the presentation is polished and clear.
    • This does NOT mean you must create a video with a lot of technological bells and whistles!
    • This does mean that you should write a script beforehand and practice, practice, practice so that it doesn’t sound like you are reading a script.
    • This does mean that you should structure your presentation so that it is easy to follow the main points and how they connect.
  • Sources: At least 2 appropriate references are included.