10: Mask and Covid

library(dplyr)

Background

It is widely believed that wearing a mask reduces the chance of being infected by the virus and, for those who are infected, the chance of developing serious symptoms. Fischer et al. (2021) found that states without a mask mandate had higher covid rates than states with one. A natural-experiment study by Lyu & Wehby (2020) estimated that mask mandates averted more than 200,000 covid cases across sixteen states. Both studies agree that masks are essential for preventing transmission of the virus. In addition, Bulfone et al. (2020) provide evidence for the importance of wearing a mask indoors. However, none of these studies examines differences between types of masks. Higher-standard masks, such as N95 masks, are believed to provide better protection, but studies of their large-scale use are rare.

Data

Data Source

The Delphi Research Group at Carnegie Mellon University provides the COVIDcast API, which aggregates a variety of covid-related data (Farrow et al., 2015). It includes raw covid case and death counts reported by Johns Hopkins University and by the US Department of Health and Human Services. The API also provides a 7-day-averaged version of the daily covid case count, which I will use in this project. The averaged data should correlate better than the raw data because averaging dampens the peaks and valleys in the series. If the correlation is still not strong enough, I will take the difference between consecutive days to see whether the change in covid rate gives a better result.
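To illustrate what 7-day averaging and differencing do to a series, here is a minimal sketch on made-up daily counts. The numbers and the variable names `daily_cases`, `avg_7d`, and `daily_change` are assumptions for illustration only. The COVIDcast API already supplies the averaged signal, so this is not the code used to produce the project data.

```r
# Toy daily case counts for one county (made-up numbers, not COVIDcast data)
daily_cases <- c(10, 12, 9, 30, 11, 13, 12, 14, 40, 12, 15, 13, 16, 14)

# Trailing 7-day mean: each value averages the current day and the 6 days before it.
# stats::filter is named explicitly because dplyr::filter masks it once dplyr is loaded.
avg_7d <- as.numeric(stats::filter(daily_cases, rep(1 / 7, 7), sides = 1))

# Day-over-day change of the smoothed series
daily_change <- diff(avg_7d)

print(round(avg_7d, 2))
print(round(daily_change, 2))
```

The first six entries of `avg_7d` are `NA` because a full 7-day window is not yet available, and the two spikes in the toy data (30 and 40) are spread over a week instead of appearing as single-day peaks.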

The mask-wearing data come from a survey that the Delphi group runs on Facebook. Like the covid data, they are available in a 7-day-averaged version. Both data sets are saved in the same format, so the data wrangling written for the covid data works unchanged for the mask data.

Method and Result

Data Wrangling

The data are downloaded from the COVIDcast API, and there are three data sets used in this project.

load("covid_raw.rda")
load("mask_raw.rda")
load("county_census.rda")

The covid_raw and mask_raw data sets hold the raw values of each variable. They are organised in the same way, and most of their columns are not used in this project. The only three columns used are geo_value, time_value, and value: geo_value is the FIPS code of the county where the data were collected, time_value is the date of collection, and value is the measurement itself. In the covid data set, value is the daily case count; in the mask data set, it is a daily proportion. The data sets contain records from February 2021 to March 2022.

The county_census data set contains the 2019 population estimate of each county. The only two columns used are FIPS and POPESTIMATE2019. FIPS is the FIPS code of each county, and POPESTIMATE2019 is the estimated population of that county in 2019.

The data downloaded from the API are daily values, and we also want a data set of the weekly change in each variable.

# rewrite value of covid_raw as change in value

covid_change <- covid_raw
# list of counties in the data set
covid_county <- covid_change$geo_value %>% 
  unique()
# list of days in the data set
covid_date <- covid_change$time_value %>% 
  unique()

## check if all of the counties have full values
full_county_covid <- c()

for (county in covid_county) {
  df <- covid_raw %>%
    filter(geo_value == county)
  if (nrow(df) == length(covid_date)) {
    full_county_covid <- append(full_county_covid, county)
  }
}

length(full_county_covid) == length(covid_county)
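The same completeness check can be written without a loop. The sketch below runs on a small made-up stand-in for covid_raw. The data frame `toy_raw` and the names `n_days` and `records_per_county` are hypothetical. It counts records per county once and keeps the counties that have one record for every date.

```r
library(dplyr)

# Toy stand-in for covid_raw (made-up values): county "01003" is missing one day
toy_raw <- data.frame(
  geo_value  = c("01001", "01001", "01001", "01003", "01003"),
  time_value = as.Date(c("2021-02-01", "2021-02-02", "2021-02-03",
                         "2021-02-01", "2021-02-03")),
  value      = c(5, 7, 6, 2, 4),
  stringsAsFactors = FALSE
)

# Number of distinct dates in the whole data set
n_days <- n_distinct(toy_raw$time_value)

# One row per county with its number of records
records_per_county <- count(toy_raw, geo_value, name = "n_records")

# Counties that have a record for every date
full_county <- records_per_county %>%
  filter(n_records == n_days) %>%
  pull(geo_value)

print(full_county)
```

Like the loop, this compares row counts only, so it assumes that no county has duplicate rows for the same date.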

# lag for weekly changes
lag <- 7

# calculate the weekly changes
# (assumes each county's rows are sorted by time_value)
for (county in full_county_covid) {
  value <- covid_raw %>%
    filter(geo_value == county)
  # difference between each day's value and the value `lag` days earlier
  diff <- 
    (value$value[(lag+1):nrow(value)] - value$value[1:(nrow(value)-lag)])
  # pad the front with NAs so the vector matches the county's row count
  diff_sub <- c(rep(NA, lag), diff)
  # find the index of all rows that match the county
  index <- covid_change[,"geo_value"] == county
  # rewrite value
  covid_change[index, "value"] <- diff_sub
}

# keep only the counties with a record for every day
covid_change <- covid_change %>% 
  subset(geo_value %in% full_county_covid)

The code for mask is similar.

# rewrite value of mask_raw as change in value
mask_change <- mask_raw
# list of counties in the data set
mask_county <- mask_change$geo_value %>% 
  unique()
# list of days in the data set
mask_date <- mask_change$time_value %>% 
  unique()

## check if all of the counties have full values
full_county_mask <- c()

for (county in mask_county) {
  df <- mask_raw %>% 
    filter(geo_value == county)
  if (nrow(df) == length(mask_date)) {
    full_county_mask <- append(full_county_mask, county)
  }
}

length(full_county_mask) == length(mask_county)

# lag for weekly changes
lag <- 7

# calculate the weekly changes
# (assumes each county's rows are sorted by time_value)
for (county in full_county_mask) {
  value <- mask_raw %>%
    filter(geo_value == county)
  # difference between each day's mask proportion and the one `lag` days earlier
  diff <- 
    (value$value[(lag+1):nrow(value)] - value$value[1:(nrow(value)-lag)])
  # pad the front with NAs so the vector matches the county's row count
  diff_sub <- c(rep(NA, lag), diff)
  # find the index of all rows that match the county
  index <- mask_change[,"geo_value"] == county
  # rewrite value
  mask_change[index, "value"] <- diff_sub
}

mask_change <- mask_change %>% 
  subset(geo_value %in% full_county_mask)

## It takes a really long time to run the above code sections
## especially for the covid_change part
## Thus for the sake of time, I only run the code once and 
## save the results for future

# save(covid_change, file = "covid_change.rda")
# save(mask_change, file = "mask_change.rda")

load("covid_change.rda")
load("mask_change.rda")
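Since the county-by-county loops are slow, a grouped computation may be worth considering. The following is a minimal sketch on made-up data, not the code that produced the saved files. `toy_raw`, `toy_change`, and `lag_days` are hypothetical names. It computes the same 7-day difference with `dplyr::lag()` inside each county group.

```r
library(dplyr)

# Toy stand-in for covid_raw: two counties, ten consecutive days each (made-up values)
toy_raw <- data.frame(
  geo_value  = rep(c("01001", "01003"), each = 10),
  time_value = rep(seq(as.Date("2021-02-01"), by = "day", length.out = 10), 2),
  value      = c(1:10, seq(100, 10, by = -10)),
  stringsAsFactors = FALSE
)

lag_days <- 7

toy_change <- toy_raw %>%
  group_by(geo_value) %>%
  # sort by date within each county so that lag() looks back in time
  arrange(time_value, .by_group = TRUE) %>%
  # value minus the value lag_days rows earlier; the first lag_days rows become NA
  mutate(value = value - dplyr::lag(value, lag_days)) %>%
  ungroup()

print(toy_change)
```

As with the loops, the difference is taken by row position, so the sketch assumes every county has one row per consecutive day.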

The values in the covid_change data set are changes in daily case counts, while the mask data are proportions. To keep everything as rates, we divide the covid changes by each county's population.

# change covid number to rate

# list of counties in the data set
covid_county <- covid_change$geo_value %>% 
  unique()

for (county in covid_county) {
  # get the index of the county in the county_census data set
  index_pop <- match(county, county_census$FIPS)
  # take the population estimate
  pop <- county_census$POPESTIMATE2019[index_pop]
  # calculate rate
  case <- covid_change %>%
    subset(geo_value == county)
  rate <- case$value/pop
  # rewrite values
  index <- covid_change[,"geo_value"] == county
  covid_change[index, "value"] <- rate
}
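The count-to-rate conversion can also be expressed as a join. The sketch below uses small made-up tables, and `toy_change`, `toy_census`, and `toy_rate` are hypothetical names. It assumes that FIPS and geo_value are stored as the same type, here character strings with leading zeros.

```r
library(dplyr)

# Toy stand-ins for covid_change and county_census (made-up values)
toy_change <- data.frame(
  geo_value = c("01001", "01001", "01003"),
  value     = c(10, -5, 20),
  stringsAsFactors = FALSE
)
toy_census <- data.frame(
  FIPS            = c("01001", "01003"),
  POPESTIMATE2019 = c(50000, 200000),
  stringsAsFactors = FALSE
)

toy_rate <- toy_change %>%
  # attach each county's population to its rows
  left_join(toy_census, by = c("geo_value" = "FIPS")) %>%
  # change in case count divided by population gives a per-person rate
  mutate(value = value / POPESTIMATE2019) %>%
  select(-POPESTIMATE2019)

print(toy_rate)
```

Counties without a match in the census table would end up with `NA` rates, which makes such mismatches easy to spot.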

Finding the Best Lag

We expect that more people wearing masks would reduce the covid rate, but in practice a high covid rate also makes more people willing to wear masks. To reduce the influence of the latter scenario, we shift the data so that we correlate mask wearing from some days earlier with covid today. After some trials, we found that a 49-day lag gives the best correlation.

## To take the correlation we merge the data sets on time_value.
## Merged directly, the mask data of a day would sit in the same row
## as the covid data of that same day.
## Moving the covid dates 49 days earlier pairs the covid value of day D
## with the mask value of day D - 49.
covid_change$time_value <- covid_change$time_value - 49

## merge the data sets
merged <- covid_change %>% 
  rename(covid_change = value) %>% 
  merge(mask_change, by = c("time_value", "geo_value")) %>% 
  rename(mask_change = value) %>% 
  select(time_value, geo_value, covid_change, mask_change) %>% 
  merge(mask_raw, by = c("time_value", "geo_value")) %>%
  rename(mask_raw = value) %>% 
  select(time_value, geo_value, covid_change, mask_change, mask_raw)
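One way to search for a lag is to compute the correlation for a range of candidate lags and compare them. The sketch below does this on simulated series with a built-in 3-day lag. It is not the search that led to the 49-day lag, and `lagged_cor`, `lags`, and `best_lag` are hypothetical names. It also assumes that "best" means the most negative correlation.

```r
set.seed(1)

# Simulated series (not project data): covid on day t is minus mask on day t - 3, plus noise
n <- 200
true_lag <- 3
mask <- rnorm(n)
covid <- c(rnorm(true_lag), -mask[1:(n - true_lag)]) + rnorm(n, sd = 0.1)

# Correlation between mask on day t and covid on day t + k
lagged_cor <- function(mask, covid, k) {
  n <- length(mask)
  cor(mask[1:(n - k)], covid[(1 + k):n], use = "complete.obs")
}

# Scan candidate lags and keep the one with the most negative correlation
lags <- 0:10
cors <- sapply(lags, function(k) lagged_cor(mask, covid, k))
best_lag <- lags[which.min(cors)]

print(data.frame(lag = lags, correlation = round(cors, 3)))
print(best_lag)
```

Plotting `cors` against `lags` would show how sharply the chosen lag stands out from its neighbours. For the county panel, the same scan would repeat the date shift and merge for each candidate lag.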

Result

cor(merged$mask_raw, merged$covid_change, use = "complete.obs")

[1] -0.01827771

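The correlation between the raw mask-wearing rate and the change in covid rate 49 days later is about -0.018, a very weak negative association. A correlation coefficient is unit-free and measures the strength of a linear relationship, not the size of an effect, so it should not be read as one person wearing a mask reducing the increase in covid rate by \(1.82\%\).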
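To judge whether a correlation this small is distinguishable from zero, a formal test such as `cor.test()` could be run. The sketch below applies it to simulated data with a weak built-in relationship, so `mask_rate`, `covid_chg`, and the coefficient -0.02 are assumptions and not project values. It also squares the reported coefficient to show how much variance it accounts for.

```r
set.seed(42)

# Simulated data with a weak negative relationship (made-up, not project data)
n <- 5000
mask_rate <- runif(n, 0.3, 0.95)
covid_chg <- -0.02 * mask_rate + rnorm(n)

# Pearson correlation test: estimate, 95% confidence interval, and p-value
test <- cor.test(mask_rate, covid_chg)
print(test$estimate)
print(test$conf.int)
print(test$p.value)

# Share of variance explained by the correlation reported in this report
reported_r <- -0.01827771
reported_r_squared <- reported_r^2
print(reported_r_squared)
```

Squaring the reported coefficient gives roughly 0.0003, so the lagged mask rate accounts for about 0.03% of the variance in the change in covid rate. County-by-day observations are also correlated with one another, so a p-value from `cor.test()` on the merged data would likely overstate the evidence.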

Conclusion

The analysis finds that a higher mask-wearing rate is associated with a slightly lower subsequent increase in covid rate, although a correlation alone cannot prove that masks cause the difference. The coefficient is much smaller than people would expect. This might be because the mask data come from a Facebook survey, so they might not be representative of the whole US population. The sample is large, but no formal significance test was run here, so whether the result is statistically significant remains to be checked.

Class Peer Reviews

Reviewer 1
1. State the authors’ objectives and the general questions that the authors are considering. Do the results and figures support the conclusions made in the report?
The author’s objective is to find how much mask wearing can decrease the COVID infection rate, specifically by finding the correlation between the two sets of values. The results from data wrangling show that a 49-day lag between the raw rates of mask wearing and the change in COVID rates has a correlation of -0.0182, which the author interprets as meaning that one person wearing a mask would reduce the increase in COVID rate by 1.82%. This result supports the conclusion, which states that wearing masks is associated with a lower COVID rate but not by a coefficient as large as people would expect.
2. Discuss the foundations of data visualizations relative to the figures presented in the report.
The report does not use any data visualization. However, a visualization might help enhance the project, such as a line graph or scatter plot that shows the correlation against the lag. This would allow the reader to compare the correlations and see why the 49-day lag is the best.
3. State 3 things that are strong about their report and 2 things that can be improved.
The report includes multiple credible references in the background and data source sections, which indicates a relatively thorough preparation and understanding of the project, especially the literature review. It also incorporates large detailed portions of data wrangling with code writing and calculation. The third strong aspect is the clear conclusion drawn from the results.

One thing that can be improved is that the objective of the project could be stated more clearly in the introduction. The second thing is that more explanation could be added in between the code blocks to help the reader fully understand and follow the wrangling methods and the author’s reasoning.
Reviewer 2
1. State the authors’ objectives and the general questions that the authors are considering. Do the results and figures support the conclusions made in the report?
The objective of the report is to analyze the relationship between mask-wearing rate and number of new covid cases. The author first found the best “lag” in considering covid rate, since mask-wearing affects covid rates in the future, not on the day they wear (or don’t wear) the mask. The author continues in finding the correlation between the mask rate and the new covid cases.

The author found that the best lag was 49 days, but no evidence was shown and no explanation given. The author also found a small -0.018 correlation between mask wearing and new covid cases. I’m not sure if this suggests there is a connection, and a hypothesis test might be necessary to show there even is a relation between the two variables.
2. Discuss the foundations of data visualizations relative to the figures presented in the report.
There was no data visualization done in this report, so there are no foundations of data visualizations to discuss.
3. State 3 things that are strong about their report and 2 things that can be improved.
I like the fact that the author identified the issue of lag in covid cases, and accounted for it. The author presented sources for their data very clearly, which gives the results more weight. The author also made their objective for the report very clear, and the report is well structured.

I would have liked if the author showed how they found a 49 day lag is the most significant. The report could also be improved by showing there is a relationship between mask wearing and covid cases.

References

Bulfone, T. C., Malekinejad, M., Rutherford, G. W., & Razani, N. (2020). Outdoor Transmission of SARS-CoV-2 and Other Respiratory Viruses: A Systematic Review. The Journal of Infectious Diseases, 223(4), 550–561. https://doi.org/10.1093/infdis/jiaa742

Farrow, D. C., Brooks, L. C., Rumack, A., Tibshirani, R. J., & Rosenfeld, R. (2015). Delphi Epidata API. https://github.com/cmu-delphi/delphi-epidata

Fischer, C. B., Adrien, N., et al. (2021). Mask adherence and rate of COVID-19 across the United States. PLOS ONE, 16(4), 1–10. https://doi.org/10.1371/journal.pone.0249891

Lyu, W., & Wehby, G. L. (2020). Community use of face masks and COVID-19: Evidence from a natural experiment of state mandates in the US. Health Affairs, 39(8), 1419–1425. https://doi.org/10.1377/hlthaff.2020.00818
