It is widely believed that wearing a mask reduces the chance of being infected by the virus and, for those who are infected, the chance of developing serious symptoms. Fischer (2021) shows that states without a mask mandate had a higher covid rate than states with one. A more focused study by Lyu & Wehby (2020) estimates that mask mandates averted more than 200,000 covid cases across sixteen states. Both studies agree that masks are essential for preventing transmission of the virus. In addition, Bulfone et al. (2020) provide evidence for the importance of wearing a mask indoors. However, none of these studies focuses on the differences between types of mask. Higher-standard masks, such as N95 masks, are believed to provide better protection, but studies of their large-scale use are rare.
The Delphi Research Group at Carnegie Mellon University provides the COVIDcast API, which aggregates a variety of covid-related data (Farrow, 2015). It includes raw daily case and death counts reported by Johns Hopkins University and by the US Department of Health and Human Services. The API also provides a 7-day-averaged version of the daily covid case counts, which I will use in this project. The averaged data should give a better correlation than the raw data because averaging reduces the size of the peaks and valleys, but if the correlation is not strong enough, I will take the difference between consecutive days to see whether the change in covid rate gives a better result.
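To make the smoothing step concrete, here is a minimal sketch of a 7-day average and its day-to-day difference. It assumes a trailing window (the current day plus the six before it), and the case counts are made-up numbers, not values from the API.

```r
# Toy daily case counts for one county (made-up numbers).
daily_cases <- c(10, 12, 9, 30, 11, 8, 13, 40, 12, 10)

# Trailing 7-day mean: each entry averages the current day and the 6 before it.
# The first 6 entries are NA because a full week is not yet available.
avg7 <- as.numeric(stats::filter(daily_cases, rep(1 / 7, 7), sides = 1))

# Day-to-day change of the smoothed series.
avg7_diff <- diff(avg7)

print(round(avg7, 2))
print(round(avg7_diff, 2))
```

The spikes of 30 and 40 in the raw counts are spread over a week in `avg7`, which is the reduction in peaks and valleys described above.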
The mask-wearing data come from a survey that the Delphi group runs on Facebook. Like the covid data, they have a 7-day-averaged version. Both data sets are stored in the same format, so the data wrangling written for the covid data works unchanged for the mask data.
The data are downloaded from the COVIDcast API, and there are three data sets used in this project.
The first two, covid_raw and mask_raw, hold the raw values for each variable. They are organised in the same way, and most of their columns are not used in this project. The only three columns used are geo_value, time_value, and value. geo_value is the FIPS code of the county where the data were collected, time_value is the date on which they were collected, and value is the measurement itself. In the covid data set value is the daily case count, and in the mask data set it is the daily proportion. The data sets contain records from February 2021 to March 2022.
The county_census data set contains the 2019 population estimate for each county. The only two columns used are FIPS and POPESTIMATE2019. FIPS is the FIPS code of each county, and POPESTIMATE2019 is the estimated population of that county in 2019.
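Because geo_value and FIPS are the keys that link the data sets, it can be worth checking that they are stored in the same form before matching on them. The sketch below is only an illustration: the toy tables, the numeric FIPS column, and the helper `normalise_fips()` are all assumptions, not part of the original data.

```r
# Toy stand-ins for the real tables (all values are made up).
covid_raw_toy <- data.frame(
  geo_value  = c("01001", "01003", "06037"),
  time_value = as.Date("2021-02-01"),
  value      = c(5, 12, 300),
  stringsAsFactors = FALSE
)
county_census_toy <- data.frame(
  FIPS            = c(1001, 1003, 6037),  # numeric, so the leading zero is lost
  POPESTIMATE2019 = c(50000, 200000, 10000000)
)

# Hypothetical helper: normalise keys to 5-character, zero-padded strings.
normalise_fips <- function(x) sprintf("%05d", as.integer(x))

covid_raw_toy$geo_value <- normalise_fips(covid_raw_toy$geo_value)
county_census_toy$FIPS  <- normalise_fips(county_census_toy$FIPS)

# Counties in the case data that have no population estimate.
unmatched <- setdiff(covid_raw_toy$geo_value, county_census_toy$FIPS)
print(unmatched)
```

An empty `unmatched` vector means every county in the case data can be paired with a population estimate.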
The data downloaded from the API are daily values, and we also want a data set of the weekly change in each variable.
library(dplyr)  # provides %>% and filter()

# start from a copy of covid_raw; value will be overwritten with the weekly change
covid_change <- covid_raw
# list of counties in the data set
covid_county <- covid_change$geo_value %>%
  unique()
# list of days in the data set
covid_date <- covid_change$time_value %>%
  unique()
## check whether every county has a value for every day
full_county_covid <- c()
for (county in covid_county) {
  df <- covid_raw %>%
    filter(geo_value == county)
  if (nrow(df) == length(covid_date)) {
    full_county_covid <- append(full_county_covid, county)
  }
}
length(full_county_covid) == length(covid_county)
# lag (in days) for weekly changes
lag <- 7
# calculate the weekly changes
# (only complete counties are looped over, since the rest are dropped below)
for (county in full_county_covid) {
  value <- covid_raw %>%
    filter(geo_value == county)
  # difference between each day and the day `lag` days earlier
  # (assumes the rows of each county are in date order)
  diff <-
    (value$value[(lag+1):nrow(value)] - value$value[1:(nrow(value)-lag)])
  # pad the front with NAs so the vector matches the county's row count
  diff_sub <- c(rep(NA, lag), diff)
  # find the index of all rows that match the county
  index <- covid_change[,"geo_value"] == county
  # rewrite value
  covid_change[index, "value"] <- diff_sub
}
# keep only the counties with a value for every day
covid_change <- covid_change %>%
  subset(geo_value %in% full_county_covid)
The code for mask is similar.
# start from a copy of mask_raw; value will be overwritten with the weekly change
mask_change <- mask_raw
# list of counties in the data set
mask_county <- mask_change$geo_value %>%
  unique()
# list of days in the data set
mask_date <- mask_change$time_value %>%
  unique()
## check whether every county has a value for every day
full_county_mask <- c()
for (county in mask_county) {
  df <- mask_raw %>%
    filter(geo_value == county)
  if (nrow(df) == length(mask_date)) {
    full_county_mask <- append(full_county_mask, county)
  }
}
length(full_county_mask) == length(mask_county)
# lag (in days) for weekly changes
lag <- 7
# calculate the weekly changes
for (county in full_county_mask) {
  value <- mask_raw %>%
    filter(geo_value == county)
  # difference between each day's mask rate and the rate `lag` days earlier
  # (assumes the rows of each county are in date order)
  diff <-
    (value$value[(lag+1):nrow(value)] - value$value[1:(nrow(value)-lag)])
  # pad the front with NAs so the vector matches the county's row count
  diff_sub <- c(rep(NA, lag), diff)
  # find the index of all rows that match the county
  index <- mask_change[,"geo_value"] == county
  # rewrite value
  mask_change[index, "value"] <- diff_sub
}
# keep only the counties with a value for every day
mask_change <- mask_change %>%
  subset(geo_value %in% full_county_mask)
## The code sections above take a really long time to run,
## especially the covid_change part.
## For the sake of time, I ran the code once and
## saved the results for future use.
# save(covid_change, file = "covid_change.rda")
# save(mask_change, file = "mask_change.rda")
load("covid_change.rda")
load("mask_change.rda")
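Much of the running time likely comes from filtering the whole table once per county. As a possible alternative, the weekly change can be computed for all counties in one grouped pass. The sketch below uses base R's `ave()`; the function name `weekly_change()` and the toy data are assumptions for illustration, and, like the loops above, it assumes one row per county per day with no gaps.

```r
# Hypothetical helper: per-county difference between each row and the row
# `lag` positions earlier, computed in one grouped pass.
weekly_change <- function(df, lag = 7) {
  # Sort so that each county's rows are in date order.
  df <- df[order(df$geo_value, df$time_value), ]
  df$value <- ave(df$value, df$geo_value, FUN = function(v) {
    # Shift the series down by `lag` rows, padding the front with NAs.
    lagged <- c(rep(NA_real_, lag), v)[seq_along(v)]
    v - lagged
  })
  df
}

# Toy data: two counties, ten days each (made-up values).
toy <- data.frame(
  geo_value  = rep(c("01001", "01003"), each = 10),
  time_value = rep(as.Date("2021-02-01") + 0:9, times = 2),
  value      = c(1:10, seq(100, 190, by = 10))
)

toy_change <- weekly_change(toy)
print(toy_change)
```

In the toy output the first seven rows of each county are NA, matching the padding used in the loops.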
The values in the covid_change data set are changes in daily case counts, whereas the mask values are proportions. To keep everything in rates, we divide each county's values by its population.
# change covid number to rate
# list of counties in the data set
covid_county <- covid_change$geo_value %>%
  unique()
for (county in covid_county) {
  # get the index of the county in the county_census data set
  index_pop <- match(county, county_census$FIPS)
  # take the population estimate
  pop <- county_census$POPESTIMATE2019[index_pop]
  # calculate rate
  case <- covid_change %>%
    subset(geo_value == county)
  rate <- case$value/pop
  # rewrite values
  index <- covid_change[,"geo_value"] == county
  covid_change[index, "value"] <- rate
}
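The same conversion can also be written without a loop by looking up every row's population at once. This is a sketch under the assumption that geo_value and FIPS are stored in the same form; `to_rate()` and the toy tables are hypothetical names and values.

```r
# Hypothetical helper: divide each row's value by its county's population.
to_rate <- function(df, census) {
  # match() returns, for every row, the position of its county in the census table.
  pop <- census$POPESTIMATE2019[match(df$geo_value, census$FIPS)]
  df$value <- df$value / pop
  df
}

# Toy data (made-up values).
toy_counts <- data.frame(
  geo_value = c("01001", "01001", "01003"),
  value     = c(5, -2, 20),
  stringsAsFactors = FALSE
)
toy_census <- data.frame(
  FIPS            = c("01001", "01003"),
  POPESTIMATE2019 = c(50000, 200000),
  stringsAsFactors = FALSE
)

toy_rate <- to_rate(toy_counts, toy_census)
print(toy_rate)
```

Counties missing from the census table get an NA rate, which makes unmatched keys easy to spot.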
We expect that more people wearing masks would reduce the covid rate, but in practice a high covid rate also makes more people willing to wear masks. To remove the influence of the latter scenario, we shift the data so that we correlate mask wearing from some days ago with the covid rate of today. After some trial and error, we found that a 49-day lag gives the best correlation.
## to take the correlation we will merge the data sets together
## if we merge them directly, the mask data of some day will be
## in the same row as the covid data of the same day
## thus we shift the time value of the covid data back by 49 days,
## so that each mask value lines up with the covid value from 49 days later
## (time_value must be of class Date for the subtraction to work)
covid_change$time_value <- covid_change$time_value - 49
## merge the data sets
merged <- covid_change %>%
  rename(covid_change = value) %>%
  merge(mask_change, by = c("time_value", "geo_value")) %>%
  rename(mask_change = value) %>%
  select(time_value, geo_value, covid_change, mask_change) %>%
  merge(mask_raw, by = c("time_value", "geo_value")) %>%
  rename(mask_raw = value) %>%
  select(time_value, geo_value, covid_change, mask_change, mask_raw)
cor(merged$mask_raw, merged$covid_change, use = "complete.obs")
[1] -0.01827771
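With a 49-day lag, the correlation between the raw mask-wearing rate and the change in covid rate is -0.0182, a weak negative association. We read this as one person wearing a mask reducing the increase in covid rate by \(1.82\%\), although strictly speaking a correlation coefficient measures the strength of a linear association and not a percentage effect.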
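One way to document the choice of lag is to compute the correlation for a range of candidate lags and compare them. The sketch below does this on synthetic data in which the true lag is 14 days by construction. The function `lag_correlation()`, the candidate lags, and the reading of "best" as "most negative" are all assumptions, not the procedure used in this report.

```r
# Hypothetical helper: correlation between mask values and covid values
# observed k days later, for each k in `lags`.
lag_correlation <- function(covid, mask, lags) {
  sapply(lags, function(k) {
    shifted <- covid
    # Relabel covid dates k days earlier so day t pairs with mask on day t - k.
    shifted$time_value <- shifted$time_value - k
    both <- merge(shifted, mask, by = c("time_value", "geo_value"),
                  suffixes = c("_covid", "_mask"))
    cor(both$value_mask, both$value_covid, use = "complete.obs")
  })
}

# Synthetic single-county data: covid is minus the mask rate 14 days earlier, plus noise.
set.seed(1)
n_days    <- 200
true_lag  <- 14
days      <- as.Date("2021-02-01") + 0:(n_days - 1)
mask_rate <- runif(n_days, 0.5, 0.9)
covid_val <- c(rep(NA, true_lag), -mask_rate[1:(n_days - true_lag)]) +
  rnorm(n_days, sd = 0.01)

mask_toy  <- data.frame(time_value = days, geo_value = "01001", value = mask_rate)
covid_toy <- data.frame(time_value = days, geo_value = "01001", value = covid_val)

lags <- seq(0, 56, by = 7)
cors <- lag_correlation(covid_toy, mask_toy, lags)
best_lag <- lags[which.min(cors)]  # most negative correlation
print(data.frame(lag = lags, correlation = round(cors, 3)))
```

Plotting `cors` against `lags`, for example with `plot(lags, cors, type = "b")`, would show how the candidates compare. Because the lag is chosen to give the strongest correlation, the selected value tends to overstate the association, so it is best treated as exploratory.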
The result suggests that wearing masks is associated with a lower covid rate. The coefficient is not as large as one might expect. This might be because the mask data come from a Facebook survey, so they might not be representative of the whole US population. However, the sample is large, so even a correlation this small may be statistically significant, although no formal test is reported here.
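Whether a small correlation is distinguishable from zero can be checked with a formal test. The following sketch applies `cor.test()` from base R's stats package to synthetic data; the sample size, effect size, and noise level are made up and do not reproduce the report's result.

```r
# Synthetic data with a weak negative relationship (made-up parameters).
set.seed(42)
n <- 5000
mask_raw_toy     <- runif(n, 0.5, 0.95)
covid_change_toy <- -0.0001 * mask_raw_toy + rnorm(n, sd = 0.001)

# Pearson correlation test of the null hypothesis that the correlation is zero.
test <- cor.test(mask_raw_toy, covid_change_toy)

print(test$estimate)  # sample correlation
print(test$p.value)   # p-value under the null hypothesis
print(test$conf.int)  # 95% confidence interval for the correlation
```

Applied to the real data, the equivalent call would be `cor.test(merged$mask_raw, merged$covid_change)`. The test treats observations as independent, which county-by-day records are not, so the resulting p-value is likely to be optimistic.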
Reviewer 1
The author’s objective is to find how much mask wearing can decrease the COVID infection rate, specifically by finding the correlation between the two sets of values. The results from data wrangling show that a 49-day lag between the raw rates of mask wearing and the change in COVID rates has a correlation of -0.0182, which is interpreted to mean that one person wearing a mask would reduce the increase in COVID rate by 1.82%. This result supports the conclusion, which states that wearing masks is associated with a lower COVID rate but not by a coefficient as large as people would expect.
The report does not use any data visualization. However, a visualization might help enhance the project, such as a line graph or scatter plot that shows the correlation against the number of days of lag. This would allow the reader to compare the correlations and see why the 49-day lag is the best.
The report includes multiple credible references in the background and data source sections, which indicates a relatively thorough preparation and understanding of the project, especially the literature review. It also incorporates large detailed portions of data wrangling with code writing and calculation. The third strong aspect is the clear conclusion drawn from the results.
One thing that can be improved is that the objective of the project could be stated more clearly in the introduction. The second thing is that more explanation could be added in between the code blocks to help the reader fully understand and follow the wrangling methods and the author’s reasoning.
Reviewer 2
The objective of the report is to analyze the relationship between mask-wearing rate and number of new covid cases. The author first found the best “lag” in considering covid rate, since mask-wearing affects covid rates in the future, not on the day they wear (or don’t wear) the mask. The author continues in finding the correlation between the mask rate and the new covid cases.
The author found that the best lag was 49 days, but no evidence was shown and no explanation was given. The author also found a small -0.018 correlation between mask wearing and new covid cases. I’m not sure if this suggests there is a connection, and a hypothesis test might be necessary to show there even is a relation between the two variables.
There was no data visualization done in this report, so there are no foundations of data visualization to discuss.
I like the fact that the author identified the issue of lag in covid cases, and accounted for it. The author presented sources for their data very clearly, which gives the results more weight. The author also made their objective for the report very clear, and the report is well structured.
I would have liked it if the author had shown how they found that a 49-day lag is the most significant. The report could also be improved by showing that there is a relationship between mask wearing and covid cases.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".