Mask and Covid

Peter Chen (Data Science at Reed College) https://reed-statistics.github.io/math241-spring2022/
May 5, 2022

Background

It is widely believed that wearing a mask reduces the chance of being infected by the virus and, for those who are infected, the chance of developing serious symptoms. Fischer et al. (2021) found that states without a mask mandate had higher covid rates than states with one. A more focused natural experiment by Lyu & Wehby (2020) estimated that mask mandates in sixteen states averted more than 200,000 covid cases. Both studies support the view that masks are important for preventing transmission of the virus. In addition, Bulfone et al. (2020) provide evidence for the importance of wearing a mask indoors. However, none of these studies examines differences between types of masks. Higher-standard masks, such as N95 masks, are believed to provide better protection, but studies of their large-scale use are rare.

Data

Data Source

The Delphi Research Group at Carnegie Mellon University provides the COVIDcast API, which aggregates a variety of covid-related data (Farrow et al., 2015). It includes raw case and death counts reported by Johns Hopkins University and by the US Department of Health and Human Services. The API also provides a 7-day-averaged version of the daily covid case count, which I will use in this project. The averaged data should give a clearer correlation than the raw data because averaging smooths out the day-to-day peaks and valleys. If the correlation is not strong enough, I will take the difference between consecutive days to see whether the change in covid rate gives a better result.
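As a rough illustration of why averaging helps, the sketch below smooths a made-up series of daily counts with a trailing 7-day mean. The numbers and the helper name `smooth_7day` are illustrative assumptions, and the API's own averaging may differ in its details.

```r
# Toy daily case counts with a weekly reporting pattern (made-up numbers)
daily_cases <- c(10, 12, 0, 0, 25, 14, 13, 11, 12, 0, 0, 27, 15, 12)

# Hypothetical helper: trailing 7-day mean, i.e. the average of the
# current day and the 6 days before it
smooth_7day <- function(x) {
  # stats::filter is named explicitly because dplyr masks filter()
  as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 1))
}

smoothed <- smooth_7day(daily_cases)

# The first 6 entries are NA because a full week is not yet available
print(round(smoothed, 2))

# Compare the spread of the raw and the smoothed series
print(sd(daily_cases))
print(sd(smoothed, na.rm = TRUE))
```

In this toy series the zero-count days and the catch-up spikes largely cancel within each 7-day window, which is the effect the averaged data relies on.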

The mask-wearing data come from a survey that the Delphi group runs through Facebook. Like the covid data, they are available in a 7-day-averaged version. Both data sets are stored in the same format, so the data wrangling for the covid data applies unchanged to the mask data.
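For readers who want to retrieve similar data, here is a minimal sketch that uses the `covidcast` R package. The helper name `fetch_county_signal`, the exact start and end days, and the signal names in the commented calls are assumptions: the text does not state which signals were downloaded, so check them against the API documentation before use.

```r
# Hypothetical helper: download one county-level signal for a date range.
# It needs the `covidcast` package and network access, and nothing is
# downloaded until the function is called.
fetch_county_signal <- function(data_source, signal,
                                start_day = "2021-02-01",
                                end_day = "2022-03-31") {
  if (!requireNamespace("covidcast", quietly = TRUE)) {
    stop("Install the 'covidcast' package to download data.")
  }
  covidcast::covidcast_signal(
    data_source = data_source,
    signal = signal,
    start_day = start_day,
    end_day = end_day,
    geo_type = "county"
  )
}

# Assumed signal names (not confirmed by this project):
# covid_raw <- fetch_county_signal("jhu-csse", "confirmed_7dav_incidence_num")
# mask_raw  <- fetch_county_signal("fb-survey", "smoothed_wearing_mask_7d")
```

The result of `covidcast_signal()` includes the `geo_value`, `time_value`, and `value` columns.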

Method and Result

Data Wrangling

The covid and mask data are downloaded from the COVIDcast API. Together with a table of county populations, three data sets are used in this project.

# dplyr provides %>%, filter(), rename(), and select() used below
library(dplyr)

load("covid_raw.rda")
load("mask_raw.rda")
load("county_census.rda")

The covid_raw and mask_raw data sets hold the raw values of each variable. They are organised in the same way, and most of their columns are not used in this project. Only three columns are used: geo_value, time_value, and value. geo_value is the FIPS code of the county where the data were collected, time_value is the date of collection, and value is the measurement itself. In the covid data set, value is the daily case count; in the mask data set, it is a daily proportion. The data sets contain records from February 2021 to March 2022.

The county_census data set contains the 2019 population estimate of each county. Only two columns are used: FIPS, the county's FIPS code, and POPESTIMATE2019, its estimated population in 2019.

The raw data downloaded from the API are daily values, and we also want a data set of the weekly change in each variable, that is, each day's value minus the value seven days earlier.

# rewrite value of covid_raw as change in value

covid_change <- covid_raw
# list of counties in the data set
covid_county <- covid_change$geo_value %>% 
  unique()
# list of days in the data set
covid_date <- covid_change$time_value %>% 
  unique()

## check whether every county has a value for every day
full_county_covid <- c()

for (county in covid_county) {
  df <- covid_raw %>%
    filter(geo_value == county)
  if (nrow(df) == length(covid_date)) {
    full_county_covid <- append(full_county_covid, county)
  }
}

# TRUE if no county has missing days
length(full_county_covid) == length(covid_county)

# lag (in days) for weekly changes
lag_days <- 7

# calculate the weekly changes
# (assumes the rows of each county are already sorted by time_value)
for (county in full_county_covid) {
  value <- covid_raw %>%
    filter(geo_value == county)
  # difference between each day's count and the count lag_days earlier
  weekly_diff <- 
    (value$value[(lag_days+1):nrow(value)] - value$value[1:(nrow(value)-lag_days)])
  # pad the front with NAs so the vector has one entry per day
  diff_sub <- c(rep(NA, lag_days), weekly_diff)
  # find the index of all rows that match the county
  index <- covid_change[,"geo_value"] == county
  # rewrite value
  covid_change[index, "value"] <- diff_sub
}

# keep only the counties with complete records
covid_change <- covid_change %>% 
  subset(geo_value %in% full_county_covid)

The code for mask is similar.

# rewrite value of mask_raw as change in value
mask_change <- mask_raw
# list of counties in the data set
mask_county <- mask_change$geo_value %>% 
  unique()
# list of days in the data set
mask_date <- mask_change$time_value %>% 
  unique()

## check whether every county has a value for every day
full_county_mask <- c()

for (county in mask_county) {
  df <- mask_raw %>% 
    filter(geo_value == county)
  if (nrow(df) == length(mask_date)) {
    full_county_mask <- append(full_county_mask, county)
  }
}

# TRUE if no county has missing days
length(full_county_mask) == length(mask_county)

# lag (in days) for weekly changes
lag_days <- 7

# (assumes the rows of each county are already sorted by time_value)
for (county in full_county_mask) {
  value <- mask_raw %>%
    filter(geo_value == county)
  # difference between each day's mask rate and the rate lag_days earlier
  weekly_diff <- 
    (value$value[(lag_days+1):nrow(value)] - value$value[1:(nrow(value)-lag_days)])
  # pad the front with NAs so the vector has one entry per day
  diff_sub <- c(rep(NA, lag_days), weekly_diff)
  # find the index of all rows that match the county
  index <- mask_change[,"geo_value"] == county
  # rewrite value
  mask_change[index, "value"] <- diff_sub
}

# keep only the counties with complete records
mask_change <- mask_change %>% 
  subset(geo_value %in% full_county_mask)

## The code sections above take a really long time to run,
## especially the covid_change part.
## To save time, I ran the code once and saved the results
## for later use.

# save(covid_change, file = "covid_change.rda")
# save(mask_change, file = "mask_change.rda")

load("covid_change.rda")
load("mask_change.rda")
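Since the county-by-county loops are slow, a vectorized alternative may be worth considering. The sketch below is one possible approach in base R, not the code used for the results here. The toy data frame and the helper name `weekly_change` are illustrative assumptions, and the helper assumes one row per county per day with no gaps.

```r
# Toy stand-in for covid_raw / mask_raw: two counties, ten days each
# (made-up values)
toy_raw <- data.frame(
  geo_value  = rep(c("01001", "01003"), each = 10),
  time_value = rep(seq(as.Date("2021-02-01"), by = "day", length.out = 10),
                   times = 2),
  value      = c(1:10, seq(20, 2, by = -2))
)

# Hypothetical helper: change over lag_days within each county,
# with NA for the first lag_days rows of every county
weekly_change <- function(df, lag_days = 7) {
  # sort so that each county's rows are in date order
  df <- df[order(df$geo_value, df$time_value), ]
  df$value <- ave(df$value, df$geo_value, FUN = function(v) {
    n <- length(v)
    if (n <= lag_days) return(rep(NA_real_, n))
    c(rep(NA_real_, lag_days), v[(lag_days + 1):n] - v[1:(n - lag_days)])
  })
  df
}

toy_change <- weekly_change(toy_raw)
print(toy_change)
```

`ave()` applies the differencing function to each county's values in a single call, which avoids filtering the full data frame once per county.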

The values in the covid_change data set are changes in daily case counts, whereas the mask data are rates. To keep everything in rates, we divide each county's values by its 2019 population estimate.

# change covid number to rate

# list of counties in the data set
covid_county <- covid_change$geo_value %>% 
  unique()

for (county in covid_county) {
  # get the index of the county in the county_census data set
  index_pop <- match(county, county_census$FIPS)
  # take the population estimate
  pop <- county_census$POPESTIMATE2019[index_pop]
  # calculate rate
  case <- covid_change %>%
    subset(geo_value == county)
  rate <- case$value/pop
  # rewrite values
  index <- covid_change[,"geo_value"] == county
  covid_change[index, "value"] <- rate
}
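The same conversion can also be written without a loop. The following sketch uses made-up stand-ins for covid_change and county_census, and it assumes that FIPS codes are stored in the same type and format as geo_value (for example, character strings with leading zeros), which the text does not confirm.

```r
# Toy stand-ins (made-up values) for covid_change and county_census
toy_change <- data.frame(
  geo_value = c("01001", "01001", "01003"),
  value     = c(5, -2, 10)
)
toy_census <- data.frame(
  FIPS            = c("01003", "01001"),
  POPESTIMATE2019 = c(200000, 50000)
)

# Look up the population of each row's county with one vectorized match()
pop <- toy_census$POPESTIMATE2019[match(toy_change$geo_value, toy_census$FIPS)]

# Counties missing from the census table would get NA here
toy_change$value <- toy_change$value / pop
print(toy_change)
```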

Finding the Best Lag

We expect that more people wearing masks would reduce the covid rate, but in practice, when the covid rate is high, more people are willing to wear masks. To reduce the influence of the latter scenario, we shift the data so that we correlate the mask rate from some days earlier with the covid rate of today. After some trial and error, we find that a 49-day lag gives the best correlation.

## to take the correlation we will merge the data sets together 
## if we merged them directly, the mask data of a given day would be 
##  in the same row as the covid data of the same day
## thus we move the covid dates 49 days earlier, so that the covid data 
##  of day t + 49 lines up with the mask data of day t
covid_change$time_value <- covid_change$time_value - 49

## merge the data sets
merged <- covid_change %>% 
  rename(covid_change = value) %>% 
  merge(mask_change, by = c("time_value", "geo_value")) %>% 
  rename(mask_change = value) %>% 
  select(time_value, geo_value, covid_change, mask_change) %>% 
  merge(mask_raw, by = c("time_value", "geo_value")) %>%
  rename(mask_raw = value) %>% 
  select(time_value, geo_value, covid_change, mask_change, mask_raw)
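The trial-and-error search for a lag can be sketched as a loop over candidate lags. The example below uses synthetic data for a single county, built so that the covid series responds to the mask series 5 days earlier. The helper name `lagged_cor`, the candidate range, and the reading of "best" as "most negative" are assumptions rather than the procedure actually used.

```r
set.seed(1)

# Synthetic series (not project data): covid change depends on the
# mask rate from true_lag days earlier, plus noise
n_days    <- 200
true_lag  <- 5
dates     <- seq(as.Date("2021-02-01"), by = "day", length.out = n_days)
mask_rate <- rnorm(n_days)
covid_chg <- c(rep(NA, true_lag), -mask_rate[1:(n_days - true_lag)]) +
  rnorm(n_days, sd = 0.5)

toy_mask  <- data.frame(geo_value = "01001", time_value = dates,
                        value = mask_rate)
toy_covid <- data.frame(geo_value = "01001", time_value = dates,
                        value = covid_chg)

# Hypothetical helper: correlation between mask on day t and
# covid on day t + lag_days
lagged_cor <- function(covid_df, mask_df, lag_days) {
  # move covid dates earlier so day t + lag_days lines up with day t
  covid_df$time_value <- covid_df$time_value - lag_days
  both <- merge(covid_df, mask_df, by = c("time_value", "geo_value"),
                suffixes = c("_covid", "_mask"))
  cor(both$value_mask, both$value_covid, use = "complete.obs")
}

candidate_lags <- 0:14
cors <- sapply(candidate_lags,
               function(k) lagged_cor(toy_covid, toy_mask, k))

# pick the lag with the most negative correlation
best_lag <- candidate_lags[which.min(cors)]
print(data.frame(lag = candidate_lags, cor = round(cors, 3)))
print(best_lag)
```

Choosing the lag by scanning many candidates tends to favor whichever lag looks strongest by chance, so a lag found this way is best confirmed on data that was not used in the search.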

Result

cor(merged$mask_raw, merged$covid_change, use = "complete.obs")
[1] -0.01827771

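With a 49-day lag, the correlation between the raw mask-wearing rate and the change in covid rate is about -0.018. This is a very weak negative linear association: higher mask-wearing rates tend to be followed by slightly smaller increases in covid rate 49 days later. A correlation coefficient is not a slope, so it cannot be read as a percentage reduction per person. Squaring it gives \(r^2 \approx 0.0003\), meaning that the mask rate accounts for roughly \(0.03\%\) of the variance in the change in covid rate.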

Conclusion

The results suggest that wearing masks is associated with a lower covid rate, although a correlation alone cannot prove a causal effect. The correlation is much smaller than people might expect. One possible reason is that the mask data come from a Facebook survey, so they may not be representative of the whole US population. The sample is large, so even a correlation this small may be statistically significant, but no significance test was run here, and statistical significance would not make the association strong.
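To illustrate the difference between significance and strength, the sketch below runs `cor.test()` on synthetic data with a very weak negative relationship and many rows. The data are made up and do not reproduce the project's numbers. In the real data, observations from the same county on nearby days are not independent, so a p-value computed this way would likely be too optimistic.

```r
set.seed(42)

# Synthetic data (not the project's data): a very weak negative
# relationship observed over many rows
n         <- 200000
mask_rate <- runif(n)
covid_chg <- -0.02 * mask_rate + rnorm(n, sd = 0.3)

test <- cor.test(mask_rate, covid_chg)

r <- unname(test$estimate)
print(r)              # correlation coefficient
print(r^2)            # share of variance explained
print(test$p.value)   # p-value for the null hypothesis of zero correlation
print(test$conf.int)  # 95% confidence interval for the correlation
```

With this many rows, the test can reject a zero correlation even though the mask rate explains only a tiny share of the variance.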

Class Peer Reviews

References

Bulfone, T. C., Malekinejad, M., Rutherford, G. W., & Razani, N. (2020). Outdoor Transmission of SARS-CoV-2 and Other Respiratory Viruses: A Systematic Review. The Journal of Infectious Diseases, 223(4), 550–561. https://doi.org/10.1093/infdis/jiaa742
Farrow, D. C., Brooks, L. C., et al. (2015). Delphi epidata API. https://github.com/cmu-delphi/delphi-epidata
Fischer, C. B., Adrien, N., et al. (2021). Mask adherence and rate of COVID-19 across the United States. PLOS ONE, 16(4), 1–10. https://doi.org/10.1371/journal.pone.0249891
Lyu, W., & Wehby, G. L. (2020). Community use of face masks and COVID-19: Evidence from a natural experiment of state mandates in the US. Health Affairs, 39(8), 1419–1425. https://doi.org/10.1377/hlthaff.2020.00818

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".