Investigating New York Times News Bias

Declan Cruz (Data Science at Reed College, https://reed-statistics.github.io/math241-spring2022/), Josie Bicknell (Data Science at Reed College, https://reed-statistics.github.io/math241-spring2022/)
May 4, 2022

Introduction

Western media coverage of the 2022 Ukrainian-Russian conflict has called attention to news coverage bias. Media networks from the United States’ Washington Post (Ellison & Andrews, 2022) to Qatar’s Al Jazeera (Staff, 2022) have published articles that report on the double standards of Western coverage of the Ukrainian-Russian conflict compared to coverage of the 2003 Iraq War. Many of these articles specifically respond to a quotation from CBS News senior foreign correspondent Charlie D’Agata, who said that Ukraine “isn’t a place, with all due respect, like Iraq or Afghanistan, that has seen conflict raging for decades. This is a relatively civilized, relatively European – I have to choose those words carefully, too – city, one where you wouldn’t expect that or hope that it’s going to happen.” (Lambert, 2022) While there is a breadth of coverage and discussion investigating this bias, it consists largely of informal blog posts and infographics.

Inspired by this discussion, we explore articles from the New York Times to visualize how Western media may have covered these conflicts differently. Because much of the discussion of this potential coverage bias touches on the different portrayals of Ukraine and Iraq, we also investigate sentiment and language use in the headlines, abstracts, and lead paragraphs of articles about each conflict. This investigation falls into our more general exploration of the media coverage of different world regions over time. Our visualizations attempt to give a clear representation of coverage and to show how language use and sentiment may reveal explicit or implicit bias.

Methods

Our Data:

Our data was acquired from the New York Times Developer portal. The New York Times maintains an API of all past articles, which includes each article’s abstract, byline, document type, keywords, headline, lead paragraph, news desk, publication date, section name, snippet, subsection name, and word count. We created New York Times developer accounts, which gave us API keys so that we could request particular sets of articles. We compiled all of our data using mkearney’s data scraping tool (Kearney, 2017). First, we acquired all the articles published in the first month after the US invasion of Iraq, from March 20th, 2003 to April 20th, 2003. Second, we did the same for the first month of Russia’s invasion of Ukraine, from February 24th, 2022 to March 24th, 2022. Third, we created a random sampling tool, which gave us five random sets of years, months, and days between 2010 and 2020. For each of these randomly selected days, we collected the 100 articles published from that day onward. This gave us three datasets: one from the start of the Iraq war, one from the start of the Ukraine-Russia war, and one randomly sampled.

Each dataset had the variables discussed above: abstract, byline, document type, keywords, headline, lead paragraph, news desk, publication date, section name, snippet, subsection name, and word count. All of these should be self-explanatory. From these datasets, we created a number of subsets. From the randomly sampled articles, we selected those whose subsection name was Africa, the Americas, Asia Pacific, Australia, Europe, or the Middle East. Each of these subsections refers to the region the article is about. For both the Iraq and Ukraine data, we concatenated the headline, abstract, and lead paragraph into a larger, more descriptive body of text. We then tokenized this text into both unigrams and bigrams and filtered out tidytext’s stop words. Next, we split the results into a dataset containing only the articles that mentioned Ukraine or Iraq and a dataset containing all the articles. With the Ukraine-only and Iraq-only data, we performed an emotional analysis using lexicon_nrc to assign emotions to words (Mohammad, n.d.). Using the data with all articles, we calculated the proportions of articles published about Iraq and Ukraine during their respective time periods. This left us with five key datasets: the randomly sampled one covering the various regions, two emotional analysis datasets, and two frequency datasets.
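
The random day selection described above could be sketched roughly as follows. This is a minimal illustration in base R, not the sampling tool we actually used: the seed, the variable names (all_days, sampled_days, sampled_parts), and the choice to treat 2010 and 2020 as inclusive end years are all assumptions, and it draws whole calendar days rather than sampling years, months, and days separately.

```r
# Hypothetical sketch of a random day sampler; not the original tool.
set.seed(241)  # arbitrary seed so the draw is reproducible

# Every calendar day from 2010 through 2020 (inclusive end years are an assumption)
all_days <- seq(as.Date("2010-01-01"), as.Date("2020-12-31"), by = "day")

# Draw five distinct days without replacement and put them in order
sampled_days <- sort(sample(all_days, size = 5))

# Split each sampled day into its year, month, and day components
sampled_parts <- data.frame(
  year  = as.integer(format(sampled_days, "%Y")),
  month = as.integer(format(sampled_days, "%m")),
  day   = as.integer(format(sampled_days, "%d"))
)
sampled_parts
```

Sampling from a full sequence of dates avoids impossible combinations such as February 30th, which can occur when years, months, and days are drawn independently.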

Sample 1 of Data Wrangling: Tokenization

A critical component of our analysis was the tokenization of the New York Times articles that mentioned Iraq and Ukraine. To tokenize our data, we created two functions: unigram_tokenizer and bigram_tokenizer. The arguments of the functions are the date, in dttm format, and the dataset. These functions take the initial Iraq and Ukraine datasets obtained from the New York Times Developer portal and convert the combined texts variable (containing the abstract, lead paragraph, and headline data) into either single-word (unigram) or two-word (bigram) tokens. Each function then removes words characterized as stop words (e.g., a, as, in, of). We combined these functions with a for loop that runs over the vector of dates from the first thirty-one days of each conflict and ultimately reports the segmented words, the number of occurrences of each word, and the date on which that word was published.
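
A function of this kind might look roughly like the sketch below. It is an illustration written for this description rather than the original unigram_tokenizer: the toy data, the numeric date column, and the use of lapply in place of a for loop are assumptions, while unnest_tokens and stop_words come from the tidytext package.

```r
library(dplyr)
library(tidytext)

# Hypothetical sketch: count the non-stop-word unigrams for one day of coverage.
# `data` is assumed to have a `texts` column and a `date` column.
unigram_tokenizer <- function(day, data) {
  data %>%
    filter(date == day) %>%
    unnest_tokens(word, texts) %>%          # lowercase and split into single words
    anti_join(stop_words, by = "word") %>%  # drop tidytext's stop words
    count(word, name = "n") %>%             # occurrences of each word on that day
    mutate(date = day)
}

# Toy data standing in for the combined headline, abstract, and lead paragraph
toy <- tibble(
  date  = c(1, 1, 2),
  texts = c("Troops cross the border",
            "The border is closed",
            "Talks resume in the capital")
)

# Apply the tokenizer to every day and stack the results
unigram_counts <- bind_rows(
  lapply(sort(unique(toy$date)), unigram_tokenizer, data = toy)
)
unigram_counts
```

A bigram version could follow the same pattern with unnest_tokens(bigram, texts, token = "ngrams", n = 2), filtering out any pair in which either word is a stop word.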

library(readr)  # provides read_csv()

# load master data sets: these contain Ukraine and Iraq data
emotions <- read_csv("emotions.csv")
drop_emotions <- read_csv("drop_emotions.csv")
unigrams <- read_csv("unigrams.csv")
bigrams <-read_csv("bigrams.csv")
occurrences4 <- read_csv("occurrences4.csv")
frequency <- read_csv("frequency.csv")

#load Iraq data sets: 
iraq <- read.csv("iraq.csv") 
new_iraq <-read.csv("new_iraq.csv")
iraq_counts <-read.csv("iraq_counts.csv")
iraq_data <-read_csv("iraq.csv")
iraq_texts<-read_csv("iraq_texts.csv")
iraq_texts_dates<-read_csv("iraq_texts_dates.csv")

#load Ukraine data sets :
refinedukraine <- read.csv("refinedukraine.csv")
ukraine_counts <- read.csv("ukraine_counts.csv")
ukraine_data <- read_csv("ukraine_data.csv")
ukraine_texts<-read_csv("ukraine_texts.csv")
ukraine_texts_dates<-read_csv("ukraine_texts_dates.csv")

# Emotions Data
iraq_4_emotions<- read_csv("iraq_4_emotions.csv")
iraq_emotions <- read_csv("iraq_emotions.csv")
ukraine_4_emotions<-read_csv("ukraine_4_emotions.csv")
ukraine_emotions<- read_csv("ukraine_emotions.csv")
swd_list <- read_csv("swd_list.csv")
emotions_4_with_intensity <- read_csv("emotions_4_with_intensity.csv")

#load World data sets:
refinednytworld <- read.csv("refinednytworld.csv")
nytworld <- read.csv("nytworld.csv")

Sample 2 of Data Wrangling: Compiling Iraq Information

To perform our analysis of emotions in New York Times coverage of Iraq versus Ukraine, we compiled the text data from the headline, the abstract, and the lead paragraph of each article into a new variable called texts. After enlarging the text available for each article in this way, we kept only the articles that mentioned Iraq or Ukraine respectively. In the interest of creating a time series, we also wanted our new dataset to contain the date of each publication reported as the day number of the conflict. To do this, we converted the dttm format of the publication date to the numbers 1 through 31 using the case_when function. The resulting data set contained three new variables in addition to those in the original dataset: the combined texts, the mention of the respective country, and the day.

# Combining key text columns from the NYT API data, separated by spaces
# so that words at the column boundaries are not fused together
texts2 <- paste(ukraine_data$lead_paragraph, ukraine_data$abstract, ukraine_data$headline)
ukraine_texts <- ukraine_data %>%
  mutate(texts = texts2) %>%
  # flag whether each article mentions Ukraine (case-insensitive)
  mutate(mentions = case_when(grepl("ukraine", texts, ignore.case = TRUE)
                                ~ "news that mentioned ukraine",
                              # any text that does not mention Ukraine
                              !grepl("ukraine", texts, ignore.case = TRUE)
                                ~ "news that did not mention ukraine")) %>%
  # keep only the articles that mention Ukraine
  filter(mentions == "news that mentioned ukraine")

# Renaming the dates so they can work with the Iraq data
ukraine_texts_dates <- ukraine_texts %>% 
    mutate(date = as_date(pub_date)) %>% 
   mutate(date = case_when(date == "2022-02-18"~ 1, 
                         date == "2022-02-19"~ 2, 
                         date == "2022-02-20"~ 3, 
                         date == "2022-02-21"~ 4, 
                         date == "2022-02-22"~ 5,
                         date == "2022-02-23"~ 6,
                         date == "2022-02-24"~ 7,
                         date == "2022-02-25"~ 8,
                         date == "2022-02-26"~ 9,
                         date == "2022-02-27"~ 10,
                         date == "2022-02-28"~ 11,
                         date ==  "2022-03-01"~ 12,
                         date ==  "2022-03-02"~ 13,
                         date ==  "2022-03-03"~ 14,
                         date ==  "2022-03-04"~ 15,
                         date ==  "2022-03-05"~ 16,
                         date ==  "2022-03-06"~ 17,
                         date ==  "2022-03-07"~ 18,
                         date ==  "2022-03-08"~ 19,
                         date ==  "2022-03-09"~ 20,
                         date ==  "2022-03-10"~ 21,
                         date ==  "2022-03-11"~ 22,
                         date ==  "2022-03-12"~ 23,
                         date ==  "2022-03-13"~ 24,
                         date ==  "2022-03-14"~ 25,
                         date ==  "2022-03-15"~ 26,
                         date ==  "2022-03-16"~ 27,
                         date ==  "2022-03-17"~ 28,
                         date ==  "2022-03-18"~ 29,
                         date ==  "2022-03-19"~ 30,
                         date ==  "2022-03-20"~ 31,
                         )) %>% 
    arrange(date)
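
As an alternative to listing every date by hand, the day number could be computed with date arithmetic. The sketch below is a suggestion rather than the code we ran; conflict_day is a hypothetical helper, and the start date of 2022-02-18 is taken from the first case in the mapping above.

```r
# Hypothetical helper: day of conflict, counting the start date as day 1
conflict_day <- function(pub_date, start_date) {
  as.integer(as.Date(pub_date) - as.Date(start_date)) + 1L
}

# The first, sixth, and last days of the Ukraine window used above
conflict_day(c("2022-02-18", "2022-02-23", "2022-03-20"), "2022-02-18")
```

Unlike the case_when mapping, which returns NA for dates outside the listed window, this helper returns values below 1 or above 31 for such dates, so they would need to be filtered out separately.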

Sample 3 of Data Wrangling: Acquiring Emotion Data Over Time

We then took this data and tokenized it into unigrams, meaning we divided the texts variable into individual words. We also removed all stop words. After the texts variable was tokenized, we used a left join to merge our data set with the NRC lexicon’s emotions data, which contains the four emotions of fear, anger, sadness, and joy, and dropped the words that matched no emotion. We then counted the number of words that fell into each emotional category to compute the frequency of each emotion in the New York Times coverage of the first 31 days of each conflict.

# Tokenizing
iraq_emotions <- iraq_texts_dates %>%
  group_by(texts, date) %>%
  # tokenize by unigrams (words)
  unnest_ngrams(word, texts, n = 1) %>%
  group_by(date, word) %>%
  # count the words
  summarise(count = n(), .groups = "drop") %>%
  # filter out the stopwords; swd_list is a data frame, so compare against
  # its first column (assumed here to be the column holding the stop words)
  filter(!word %in% swd_list[[1]])

# Finding the emotions by date
iraq_4_emotions <- iraq_emotions %>%
  left_join(emotions_4_with_intensity, by = c("word" = "term")) %>%
  # drop words that matched no emotion in the lexicon
  drop_na() %>%
  group_by(AffectDimension, date) %>%
  summarise(count = sum(count), .groups = "drop")

# Finding the frequency of each emotion within each day
iraq_4_emotions <- iraq_4_emotions %>%
  group_by(date) %>%
  mutate(sum_count = count / sum(count)) %>%
  ungroup() %>%
  select(date, sum_count, AffectDimension)

Results

Word Clouds

# Selecting the correct variables to apply to the cloud, the word and counts
iraq_counts_cloud <- iraq_counts %>%
  select(word, n)
# Creating wordcloud
wordcloud2(data = iraq_counts_cloud, size = 1.6, color = 'random-light', backgroundColor = "black")

# Selecting the correct variables to apply to the cloud, the word and counts
ukraine_counts_cloud <- ukraine_counts %>%
  select(word, n)
# Creating wordcloud
wordcloud2(data = ukraine_counts_cloud, size = 1.6, color = 'random-light', backgroundColor = "black")

First, to broadly examine the New York Times coverage of the two conflicts, we created word clouds that visualize the frequency of words associated with Iraq and Ukraine (Lang, 2022). Using the unigrams we compiled and the count for each word, we created interactive word clouds that depict the frequency of a word by its size; hovering over a word shows its specific count. The Iraq word cloud is very war-centric. There are many words like “P.O.W.”, “dead”, “captured”, or “control”, and fewer words such as “Bush”, “America”, or “Brittons”, which do not seem to be explicitly war-related. Our Ukraine word cloud, on the other hand, is drastically less explicit. It contains some words that seem related to war, such as “horrors”, “destruction”, and “invasion”, but a much greater number of toned-down words such as “briefing”, “plan”, “oil”, or “crisis”. This difference in tone could indicate a broader underlying bias. Moving forward in the project, we decided to examine the sentiments in the coverage of the two conflicts more thoroughly to ascertain whether the diction used truly is all that different.

Article Publication Frequency and Counts

# Plotting frequencies (proportion of all articles published each day)
p <- occurrences4 %>%
  ggplot(aes(x = date, y = frequency, color = case)) +
  geom_line() +
  scale_color_OkabeIto() +
  labs(title = "Frequency of New York Times Articles Published",
       y = "Frequency",
       x = "Day of Conflict",
       color = "Country") +
  theme_ipsum()

# Turn it interactive with ggplotly
p <- ggplotly(p, dynamicTicks = TRUE) %>%
  rangeslider() %>%
  layout(hovermode = "x")
p

# Plotting counts (number of articles published each day)
p2 <- frequency %>%
  ggplot(aes(x = date, y = n, color = case)) +
  geom_line() +
  scale_color_OkabeIto() +
  labs(title = "Number of New York Times Articles Published",
       y = "Number of Articles",
       x = "Day of Conflict",
       color = "Country") +
  theme_ipsum()

# Turn it interactive with ggplotly
p2 <- ggplotly(p2, dynamicTicks = TRUE) %>%
  rangeslider() %>%
  layout(hovermode = "x")
p2

To further establish a baseline comparison between the New York Times coverage of the Iraq and Ukraine conflicts, we analyzed the frequency and count of articles that contained the keywords “Iraq” and “Ukraine” in the first thirty-one days of each conflict. Frequency refers to the proportion of articles that mentioned Iraq or Ukraine out of all the articles the New York Times published during the respective time period. Our visualization of article frequency showed that the New York Times published a higher overall frequency of articles that mentioned Iraq than articles that mentioned Ukraine. In the first fifteen days of conflict, the frequencies of articles mentioning Iraq and Ukraine are nearly equivalent; however, as the conflicts near the thirty-first day, these trends diverge. On day thirty, articles that mention Iraq have a frequency of 0.70, while articles that mention Ukraine have a frequency of 0.06. These results track with the word clouds from earlier, which indicated a potentially less emotional or visceral reaction to the Ukraine war than to the Iraq war. It then makes sense that the New York Times would also publish proportionally fewer articles about the Ukraine invasion. However, it is surprising that the two series diverge so much towards the end. Either this is indicative of a trend in which the New York Times is slowly publishing fewer articles about the Ukraine invasion, or it means that our sample of data is too small. It could also simply be that, in the 19 years between the conflicts, the New York Times came to publish drastically more articles overall without increasing its coverage of particular kinds of events to match. Expanding our dataset to the more recent days of the conflict may be a good way to gauge whether this disparity in reporting is a broader trend.
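
As a toy illustration of the frequency measure defined above, the following sketch computes, for each day, the share of articles whose text mentions a keyword. The data and the names toy_articles, n_mentions, and daily_frequency are made up for this example and do not reflect our actual figures.

```r
library(dplyr)

# Made-up articles for two days of a conflict
toy_articles <- tibble(
  day   = c(1, 1, 1, 1, 2, 2, 2),
  texts = c("Ukraine border crossed", "Markets fall", "Talks on Ukraine",
            "Local election results", "Ukraine aid package", "Sports roundup",
            "Weather outlook")
)

daily_frequency <- toy_articles %>%
  # flag articles that mention the keyword (case-insensitive)
  mutate(mentions = grepl("ukraine", texts, ignore.case = TRUE)) %>%
  group_by(day) %>%
  summarise(total      = n(),                 # all articles published that day
            n_mentions = sum(mentions),       # articles mentioning the keyword
            frequency  = n_mentions / total,  # proportion of that day's articles
            .groups = "drop")
daily_frequency
```

Here day 1 has 2 mentions out of 4 articles and day 2 has 1 out of 3, so the count falls by one article while the frequency falls from 0.5 to about 0.33. This is why the count and frequency plots can tell different stories when the total number of articles published per day changes.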

Sentiment Density

# Plotting Sentiments
 ggplot(data= drop_emotions , aes(x= n, group= AffectDimension, fill= AffectDimension)) +
  geom_density(adjust= 7, alpha=.4) +
  facet_wrap(~case)+
  theme_calc()+
  theme(panel.background = element_blank())+
  labs(fill = "Emotions", 
       title = "Sentiment Density in First 31 Days of Conflict ",
       y = "Density of Sentiment", 
       x = "Number of Articles")

The density plots depict the number of articles on the x-axis and the density of particular emotions on the y-axis. The emotions, which are anger, fear, joy, and sadness, are distinguished by color. The density plots visualize the shape of each emotion’s distribution, which allows the viewer to see emotional trends more clearly than a standard frequency histogram would. Iraq’s density plots show that, as the number of articles increases, there seems to be more fear in the coverage of the Iraq conflict. In comparison, coverage of the Ukraine conflict has a more consistent combination of all emotions. These plots served as our preliminary investigation of the distribution of emotions in New York Times coverage of the Iraq and Ukraine conflicts. The results compelled us to chart these emotional distributions over the first 31 days of conflict rather than simply the number and frequency of article publications.

Emotion Frequency Over the First Month of Conflict

# Plotting Iraq emotions over time
ggplot(data = iraq_4_emotions, aes(x = date, y = sum_count, color = AffectDimension)) +
  geom_line(size = 0.7) +
  scale_color_OkabeIto() +
  theme_bw() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
  labs(x = "Day of Conflict",
       y = "Frequency of Word Counts",
       color = "Affect Dimension",
       title = "Emotions of NY Times Iraq Coverage over Time (2003)")

# Plotting Ukraine emotions over time
ggplot(data = ukraine_4_emotions, aes(x = date, y = sum_count, color = AffectDimension)) +
  geom_line(size = 0.7) +
  scale_color_OkabeIto() +
  theme_bw() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
  labs(x = "Day of Conflict",
       y = "Frequency of Word Counts",
       color = "Affect Dimension",
       title = "Emotions of NY Times Ukraine Coverage over Time (2022)")

Coverage of both the Ukraine invasion and the Iraq war is primarily fear-based. This makes sense given the gravity of both situations: a nuclear-armed country waging war on a smaller state for reasons that are tenuous at best is a fearful situation for many. There was, however, a larger proportion of fear in the case of Iraq than in the case of Ukraine. This may be because the Iraq war was instigated by the U.S. and the New York Times is a U.S.-based news source, so its coverage would be more personal. The rates of anger, joy, and sadness seem to be comparable between the two. These observations offer interesting insights into the behavior of New York Times coverage bias. While there appear to be significant differences in fear that highlight potential coverage bias, the comparable rates of anger, joy, and sadness may demonstrate that bias in New York Times coverage operates under some metrics but not others. One might assume that the U.S.-based news source would have as much reporting on the potential sadness of war if their country was involved. However, even with events such as the atrocities committed in Bucha, Ukraine, the amount of sad, angry, and joyous coverage seems to be similar between the two events. Some columnists have commented that Ukraine has become a pseudo proxy war, which may be one reason the reporting seems to be quite similar, at least in our data (Kaplan, 2022).

Number of Articles Published in Various Regions

# Barplot of the Number of Articles Published for Each Region
refinednytworld %>%
  filter(subsection_name %in% c("Africa", "Americas", "Asia Pacific",
                                "Australia", "Europe", "Middle East")) %>%
  ggplot(aes(x = subsection_name, fill = subsection_name)) +
  geom_bar() +
  scale_fill_OkabeIto() +
  labs(
    title = "Occurrences of Regions from Sampled NYT Data",
    y = "Number of Occurrences",
    x = "Region",
    fill = ""
  ) +
  theme_calc() +
  theme(legend.position = "none")

We hoped to use the comparison of the Iraq and Ukraine conflicts as a means to examine media bias in smaller-scale events. To examine bias on a larger scale, we wanted to look at how coverage differs across regions of the world. Because the New York Times API divides its articles into regions of the world (Africa, the Americas, Asia Pacific, Australia, Europe, and the Middle East), we had hoped that these could be used to gauge cross-regional sentiment. Using 500 articles, sampled from five randomly selected days between 2010 and 2020, we set out to use the NRC lexicon to compare the emotions involved (Robinson & Silge, 2021; Mohammad, n.d.). Unfortunately, the New York Times data had drastically fewer articles in these sections than we had hoped. The small sample of articles restricted our ability to conduct a comprehensive emotional analysis of New York Times regional coverage, since uncovering emotional trends in regional coverage would require an extensive sample of randomly selected publication dates. Instead, we plotted the counts of articles covering each region over the randomly selected days.

In the future, we hope to sample more articles over a wider randomized period of days. While establishing publication counts is an important first step when analyzing the distribution of media coverage, there are several next steps we wish to take to better analyze the emotional distribution. First, we would find the top words associated with each region and, based on these top words, find each region’s positivity score. From here, we would make an interactive map that colors each region by its positivity score and, when the viewer hovers over a region, shows the words most associated with it. This final visualization would allow viewers to see the emotional analysis alongside the spatial relationships between the regions.

Conclusion

Both the U.S. invasion of Iraq in 2003 and Russia’s invasion of Ukraine in 2022 had massive consequences which reverberated around the globe. News coverage shapes public opinion and public understanding. Thus, the coverage of these events is central to informing and educating the public. News coverage bias is a massive issue because it means people are being presented with incomplete or simply biased information that may greatly affect their opinions. In the case of these two conflicts, our data from the New York Times indicates that there may be some coverage bias.

Our word clouds indicated that there are some differences in the top words associated with Ukraine and Iraq. The Iraq word cloud involved significantly more brutal war-based terms than the Ukraine word cloud did. This could be for a multitude of reasons, such as the New York Times more readily highlighting the visceral nature of war when the country it is based in is involved. Thus, the word clouds did suggest there was some bias.

Our data on the publication count and frequency of articles in both conflicts indicated that Ukraine had a similar number of articles published but a drastically lower proportion of articles. This means that there was a similar base amount of coverage for both Iraq and Ukraine but, because the New York Times began to publish many more articles per day between 2003 and 2022, the relative proportion of coverage for Ukraine is lower.

Our density plots are somewhat hard to interpret. However, they do show that coverage of the Iraq conflict was initially much more emotional than coverage of Ukraine, and that the two even out as more articles are published. The time series plots we made indicate that coverage of both conflicts involved similar levels of anger, joy, and sadness, but that fear was much higher in the coverage of Iraq. This makes sense because of the U.S.’s direct involvement in the conflict. However, it could also indicate underlying bias. Finally, in our regional publication plot, we see a strong skew towards reporting about Europe over all other regions.

From all of our analysis, we found that there are some differences between the New York Times’ reporting of the Ukraine and Iraq conflicts, such as article frequency, tone, and emotions present. However, because of our small time scale, small selection of events, and single news source, this information is not generalizable. Further, because the New York Times is a U.S. news organization, it is logical that it would have a different approach to reporting a U.S.-instigated conflict. Regardless, as evidenced by our emotion time series, frequency plots, word clouds, density plot, and barplot, the New York Times does present bias in its coverage of certain regions and events.

Class Peer Reviews

References

Ellison, S., & Andrews, T. M. (2022). “They seem so like us”: In depicting Ukraine’s plight, some in media use offensive comparisons. The Washington Post. https://www.washingtonpost.com/media/2022/02/27/media-ukraine-offensive-comparisons/
Kaplan, F. (2022). Everyone is starting to admit something frightening about Ukraine. Slate. https://slate.com/news-and-politics/2022/04/ukraine-nato-russia-proxy-war.html
Kearney, M. W. (2017). nytimes: New York Times APIs. https://github.com/mkearney/nytimes
Lambert, H. (2022). CBS reporter calls Ukraine “relatively civilized” as opposed to Iraq and Afghanistan, outrage ensues (video). The Wrap. https://www.thewrap.com/cbs-charlie-dagata-backlash-ukraine-civilized/
Lang, D. (2022). wordcloud2: Create word cloud by htmlWidget. https://github.com/lchiffon/wordcloud2
Mohammad, S. M. (n.d.). NRC lexicon. https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Robinson, D., & Silge, J. (2021). tidytext: Text mining using dplyr, ggplot2, and other tidy tools. https://github.com/juliasilge/tidytext
Staff, A. J. (2022). “Double standards”: Western coverage of Ukraine war criticised. Al Jazeera. https://www.aljazeera.com/news/2022/2/27/western-media-coverage-ukraine-russia-invasion-criticism

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".