Main Objectives

The Data Science project is one of the main assignments required for this course. There are four main objectives for this project, which are in line with the learning objectives of the MATH 241 course:

  1. To learn skills in data acumen, wrangling, and visualization.
  2. To learn effective and concise communication of data.
  3. To learn and show teamwork, collegiality, and the basics of a peer review process.
  4. To learn new and existing methods and to hone independent learning.

At the end of the course, your project will be evaluated according to the above objectives and the learning outcomes outlined in the Syllabus.


Expectations

There will be four deliverables for the Data Science project.

  1. Project proposal (pdf) - Due March 18
  2. Progress report (rmd, html, and all relevant files) - Due April 15
  3. Final report (rmd, html, and all relevant files) - Due April 29
  4. Peer review assignment (pdf) - Due May 9

The final report will be posted on or before May 9. Note that you can choose not to have your work posted online, or choose to have your name excluded from the posted project.



Peer Review Assignment

Posted: April 28, 2022

Due: May 9, 2022

Peer review assignments submitted after May 9, 2022 will not be accepted.

Each student will be assigned a random number corresponding to a report number on the Final Project Reports page. No student will be assigned to review their own report. This way, each project will have at least one peer review response. The posted reports will initially be anonymized and will be updated once the reviews are done. The peer review responses will be anonymized before they are sent to the authors. Random number assignments will be sent out after all reports are submitted on May 4.

  • Peer Review Assignment - [Rmd] [pdf]



Final Report

Posted: April 20, 2022

Due: April 29, 2022 to May 4, 2022

Final reports submitted after May 4, 2022 will not be accepted.

Submission Guidelines: Your team must submit a zipped file containing the rmd, html, and all relevant files of your final report using this Google form. We suggest you assign a dedicated person to submit your team’s finished final report.

Side Note: The upload file size limit is only 100MB. If your data sets are too large, do not include them in your zipped file; instead, provide detailed documentation in your report on how you obtained the data sets.
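
As a rough way to check the size limit before zipping, the sketch below lists the largest files in a project folder. The folder name final_report is a hypothetical placeholder, and the zipped file will usually be smaller than the total reported here, so treat the number as an upper bound rather than the exact upload size.

```r
# Hypothetical folder name; replace with your own project folder.
project_dir <- "final_report"

# Collect every file in the folder, including sub-folders.
files <- list.files(project_dir, recursive = TRUE, full.names = TRUE)

# File sizes converted from bytes to megabytes.
sizes_mb <- file.size(files) / 1024^2

# Show the ten largest files so oversized data sets are easy to spot.
size_table <- data.frame(file = files, size_mb = round(sizes_mb, 1))
print(head(size_table[order(-sizes_mb), ], 10))

# Total uncompressed size of the folder in megabytes.
total_mb <- sum(sizes_mb)
print(total_mb)
```

If the total is well above 100, the largest data files are the first candidates to leave out and document in the report instead.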

Tasks:

  • Make sure that your final report is publication-ready. Publication-ready means the following:

    1. Your rmd file can be html-knitted by the instructor. In order for it to be knitted successfully, you need to include all relevant files in your zipped file submission. All logical, syntax, and knitting errors must be fixed. Contact the instructor if you need help fixing any kind of error.

    2. Your figures (static and interactive), animations, and tables are clear. All legends, labels, and titles must be non-overlapping. Do not use the data frame variable names as labels if they are not comprehensible. Figure and table captions are encouraged but not required.

    3. Your reference list and citations are in APA format. Use the apa.csl citation style file, similar to what we have done in the Modules.

    4. Your document sections must be formatted correctly. Use R Markdown syntax ## for the main section titles and ### for sub-section titles.

    5. Your R code blocks must be well-commented. All of the R code presented in your final report must be well-commented, meaning that comments should not duplicate the code but should make the code organized and/or clear. Please set message=FALSE and warning=FALSE on code blocks that show unnecessary output.

    6. Your YAML preamble code is formatted correctly. Make your YAML preamble code similar to the example below. Note that references.bib and apa.csl must exist alongside your rmd file. Make sure you include your final project title and the names of your group members.

    ---
    title: 'Titanic The Movie'
    author: 'Leonardo Da Vinci, Titanic, and Iceburg'
    output: 
      html_document:
        code_folding: hide
    urlcolor: red
    bibliography: references.bib
    csl: apa.csl
    link-citations: yes
    ---
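
For items 2 and 5 above, one possible approach is sketched below. It is an illustration only: it uses the built-in mtcars data set rather than any project data, and the variable names mpg_to_kml and cars_metric are made up for the example. The first call assumes the knitr package is installed, and it sets message=FALSE and warning=FALSE for all later code blocks, which is an alternative to setting them on each block individually.

```r
# Sketch of a setup block; place it near the top of the rmd file.
# Applies message=FALSE and warning=FALSE to every later code block.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)

# mtcars ships with R and is used here only for illustration.
# Convert miles per US gallon to km per litre so the axis uses metric units.
mpg_to_kml <- 0.4251
cars_metric <- transform(mtcars, kml = mpg * mpg_to_kml)

# Label the axes in plain language with units instead of raw column names.
plot(
  cars_metric$wt, cars_metric$kml,
  xlab = "Car weight (1000 lbs)",
  ylab = "Fuel economy (km per litre)",
  main = "Heavier cars travel fewer kilometres per litre"
)
```

The comments explain why each step is done (unit conversion, readable labels) rather than restating what the code does line by line.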

Contents: The following main sections should be included in your final report. Note that you already wrote most of this content in your progress report. You just need to organize your work and format it correctly.

  • Introduction. This section should include your problem statement, background, and data set descriptions. The objectives of your project should also be stated here.

  • Methods and Results. This section should include paragraphs explaining how you wrangled and organized your data, and your data explorations. Include observations of your results and the mathematical/statistical methods used in your analysis. The majority of your R code blocks and visualizations should be in this section. Create sub-sections if needed. Sub-sections can include the data explorations and the methods and initial findings sections you wrote in your progress report.

  • Conclusions. This section should include concluding remarks on your data analysis or data explorations. You must re-state your objectives and provide a paragraph summarizing the main points of your project. Include remarks on any future work that you think is worth pursuing.



Progress Report

Posted: March 31, 2022

Due: April 15, 2022

Submission Guidelines: Your team must submit a zipped file containing the rmd, html, and all relevant files of your progress report using this Google form. We suggest you assign a dedicated person to submit your team’s finished progress report.

Side Note: The upload file size limit is only 100MB. If your data sets are too large, do not include them in your zipped file; instead, provide detailed documentation in your report on how you obtained the data sets.

Tasks:

  • Create an html-knitted document using an R Markdown file. This html document serves as an initial draft of your project that is to be posted online by May 9. There is no page limit on this document because it is in html form, but you must be mindful to write your report as concisely as possible.

  • Make sure that your rmd file can be html-knitted. For the instructor to test your project for reproducibility, you must submit the rmd and all relevant files. If it knits successfully on your end, it will most likely knit on the instructor’s end as well. If the instructor encounters minor knitting errors caused by differences between machines, the instructor has tools to remedy them, and you don’t have to worry about that. The instructor will first re-knit the rmd file and check it for syntax and logical errors. Any errors found will be noted and included in your next feedback.

Side Note: Consult with the instructor before submission if there are major problems with knitting. If you think the instructor needs to do anything special for the rmd to knit successfully (e.g. handle special UTF-8 characters or install special packages), please let them know.
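
One way to check reproducibility before submitting is sketched below, ideally run in a freshly restarted R session. The package list and the file name progress_report.Rmd are hypothetical placeholders; replace them with the packages your report loads and your actual file name. The sketch assumes the rmarkdown package is available for the knitting step.

```r
# Hypothetical package list; replace with the packages your report loads.
required_packages <- c("knitr", "rmarkdown")

# TRUE for each package that is installed on this machine.
installed <- vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)

# Packages that would need installing before the report can knit.
missing_packages <- required_packages[!installed]
if (length(missing_packages) > 0) {
  message("Install before knitting: ", paste(missing_packages, collapse = ", "))
}

# Hypothetical file name; replace with your own rmd file.
rmd_file <- "progress_report.Rmd"

# Knit to html only when the file exists and nothing is missing.
if (file.exists(rmd_file) && length(missing_packages) == 0) {
  rmarkdown::render(rmd_file, output_format = "html_document")
}

# Record the R version and loaded packages to share with the instructor.
print(sessionInfo())
```

Sharing the package list and the sessionInfo() output with the instructor is one simple way to flag special requirements ahead of time.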

Contents: The following sections should be included in your progress report.

  • Title. An initial title that describes your project. This can be tentative and may change if you have new information or discoveries about your data and analysis.

  • Authors. A list of names of your team.

  • Problem Statement and Background. You already wrote an initial draft of this section from your project proposal. Apply changes if necessary.

  • Data Descriptions. Part of this section was already written in your project proposal. In this section, you need to add detailed descriptions (e.g. the variable types, the meaning of each categorical level, or the meaning/units of a numerical value) of the variables that were explored.

  • Data Explorations. When you perform data explorations, not all discoveries will be relevant to your project objectives. This section should only include the discoveries relevant to your project goals. You may add a short subsection at the end of the document that shows parts of the data that you find interesting but that are not directly related to your objectives. Note that this section can be written together with the “Methods and Initial Findings” section.

  • Methods and Initial Findings. This section should include a description of the mathematical and statistical methods that were used in your analysis, or the algorithms used in your data wrangling, explorations, and simulations. Include paragraphs on your initial findings and interpretations of the computed metrics (e.g. p-values, confidence intervals, estimated rates, etc.). Also include evaluation metrics if you have used a model in your analysis (e.g. receiver operating characteristic, sum of squared residuals, etc.). Note that this section can be written together with the “Data Explorations” section.

  • Applying the Instructor’s Feedback. Note that this is not an actual section in your report. Apply the feedback that was given to you by the instructor if the feedback merits it. Note that the instructor’s feedback can be in the form of questions or statements with suggestions on how to improve your approaches. The changes you apply to your project proposal material can appear in any part of your report.
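
To illustrate the kind of computed metrics mentioned under “Methods and Initial Findings”, here is a minimal sketch using the built-in mtcars data set and a simple linear model. The model and variable choices are assumptions made for the example only; your own analysis may call for entirely different methods and metrics.

```r
# mtcars ships with R and is used here only for illustration.
# Fit a simple linear model of fuel economy on car weight.
fit <- lm(mpg ~ wt, data = mtcars)

# 95% confidence intervals for the intercept and the slope.
ci <- confint(fit, level = 0.95)
print(ci)

# p-value for the slope, taken from the coefficient table.
p_value <- summary(fit)$coefficients["wt", "Pr(>|t|)"]
print(p_value)

# Sum of squared residuals as a basic evaluation metric for the fit.
ssr <- sum(residuals(fit)^2)
print(ssr)
```

In the report, each number would be followed by a sentence interpreting it in the context of the project objectives.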



Project Proposal

Posted: March 7, 2022

Due: March 18, 2022

Submission Guidelines: Your team must submit a single pdf-knitted R Markdown document using this Google form. We suggest you assign a dedicated person to submit your team’s finished project proposal.

Tasks:

  • Establish a team (1-3 members). Working in a team is essential in any scientific work because it can bring different viewpoints and expertise to your project. If you want to work by yourself, you can still get valuable insights from other people by consulting them.

  • Create a 2+ page Project Proposal. For ideas on what types of projects have been done before in MATH 241, you can view these blog posts of Data Science projects from last year’s MATH 241 students; that course was taught by Kelly McConville. Scientific writing is similar to the writing you have done in your humanities and social science classes: paragraphs, figures, and citations within the document should be properly formatted. Your R code is also expected to be well-commented. See the Quick Notes page on the course website for R and R Markdown cheat sheets and other course tips.

Contents: The following sections should be included in your proposal.

  • Team Members. A list of the members of your team and their expected duties. State whether members have similar backgrounds; you may also describe how tasks will be divided equitably among members.

  • Problem Statement and Background. A description of the problem you want to solve. If possible, try to draft questions you want to explore; however, this may be difficult until you have conducted exploratory data analysis. Give some context for the topic you’re working on: why it’s fascinating, who’s interested, what’s known about it, some references, and so on.

  • Data Source(s). Provide a description of the data you will be using. Your data should be substantial and complex enough for you to perform advanced data wrangling and visualizations. Consult with the instructor if you are unsure about a data set you want to use, or use any of the recommended project data sets on the Data page of the course website. Explain how you intend to collect the data, or how you obtained it if you already have it. Provide an overview of the data cleaning/joining that you anticipate performing. Ideally, you should have all of the data in hand before beginning your project.

  • Scope of Analysis. Define the key objectives of your project. These can be in the form of testable hypotheses with well-defined metrics, complex visualization plans (interactive or static), plans for making a new R package, or plans for data-driven mathematical/statistical modeling. These can be tentative, and you do not have to adhere to them throughout your project.

  • Tools. Describe the tools (R packages, etc.) you intend to use for the duration of the project. The project will go through numerous stages, as you might expect from the Data Science lifecycle. Exploratory data analysis, modeling, and possibly fixing syntax and logical errors will be among them. This section can be tentative, and we will provide comments on your analytical approach as part of the project evaluation. We require you to mainly use R and R Markdown, but you may use Python or other programming languages as complementary tools alongside R. If you do, make sure to give detailed documentation of your code.