7: Foreign Views in Lenta.ru News Articles


library(tidytext)
library(ggplot2)
library(ggraph)
library(ggthemes)
library(tidyverse)
library(knitr)
library(tidymodels)
library(ggmap)
library(maps)
library(gridExtra)
library(sf)
library(dplyr)
library(stringi)
library(kableExtra)
library(plotly)

Introduction

For this project, I study a dataset from Kaggle that contains every news article published by the news source Lenta.ru (News Dataset from Lenta.ru, 2020). Lenta.ru is the most popular Russian-language news source, which makes it an interesting corpus for analyzing mentions of foreign countries in the text of each article. I hope that this data, once extracted and visualized, will offer insight into how the rest of the world is viewed from a Russian perspective.

Methods and Results

In this section, I explain the data wrangling process and the production of the network visualizations for the Lenta.ru dataset (News Dataset from Lenta.ru, 2020).

CSV Files

To load the dataset, I took a random sample to conserve memory, as the full dataset is 1.9 GB, and saved the sample for later use. The seed and the code for the original sampling are provided below, commented out. I then loaded the sampled dataset from the CSV file news2.csv. I recommend downloading the full dataset if you wish to increase the sample size. I also used a list of Russian stopwords to filter out common Russian prepositions and conjunctions (Russian Stopwords Set, n.d.). To create my spatial dataset later in the process, I used a dataset that contains every country in the world with its latitude and longitude (Countries.csv, 2012). I later join this with another dataset that gives each country’s name in Russian (Gabos, 2022).


#making a vector with the topics I will analyze
topics <- c("Россия", "Мир", "Экономика", "Интернет и СМИ")

#the commented-out lines below document how the sample was originally drawn
#set seed for my dataset.
#set.seed(1002)
#loading the Lenta.ru dataset using the entire dataset
  #headlines <- read.csv("news.csv") %>%
  
  #sampling 10% of the dataset (sample_frac(0.1) keeps a fraction of 0.1)
   #sample_frac(0.1) %>%
  
  #selecting these specific columns
    #select(text,topic,date) %>%
  
  #only taking in the topics I plan to analyze
    #filter(topic %in% topics) %>%

  # group by topic for sampling later
    #group_by(topic) %>%
    #ungroup()

#loading the randomly sampled dataset for memory conservation
headlines <- read.csv("news2.csv")

# List of Russian Stopwords
banned_words <- read.csv("stopwords.csv")

#creating a vector in order to filter text data through
banned <- banned_words$word

#list of countries in the Russian language
countries_set <- read.csv("countries.csv")

#creating a vector to filter text data through for countries
countries <- countries_set$name %>% 
  tolower()

#loading a dataset that has the geographic coordinates for each country
locations2 <- read.csv("countriesloc.csv", 
                      col.names = c("abbv", 
                                    "english", 
                                    "Latitude", 
                                    "Longitude" ))

#adding a lowercase copy of the abbreviation; because the new column is not
#given a name, it is literally called `tolower(abbv)`, and that name is used
#as a join key later on
locations <- locations2 %>%
  mutate(tolower(abbv))
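The sampling step described above can be sketched on toy data. This is a minimal base R illustration, not the code used for the project: the data frame `articles`, the helper `draw_sample()`, and the topic "Спорт" are hypothetical and exist only to show how fixing the seed makes the draw repeatable.

```r
# Toy stand-in for the full Lenta.ru table (made-up rows, for illustration only)
articles <- data.frame(
  id    = 1:1000,
  topic = rep(c("Россия", "Мир", "Экономика", "Спорт"), times = 250),
  stringsAsFactors = FALSE
)

keep_topics <- c("Россия", "Мир", "Экономика")

# Hypothetical helper: draw a fixed fraction of rows with a fixed seed
draw_sample <- function(df, frac, seed) {
  set.seed(seed)                                   # fixes the random draw
  rows <- sample(nrow(df), size = round(frac * nrow(df)))
  df[rows, , drop = FALSE]
}

# Sample 10% of the rows, then keep only the topics of interest
sampled <- draw_sample(articles, frac = 0.1, seed = 1002)
sampled <- sampled[sampled$topic %in% keep_topics, ]

# Drawing again with the same seed returns exactly the same rows
again      <- draw_sample(articles, frac = 0.1, seed = 1002)
again_kept <- again[again$topic %in% keep_topics, ]
identical(sampled$id, again_kept$id)
```

Sampling before filtering, as here, means the final number of rows varies with how many sampled articles fall into the chosen topics; filtering first would give a sample of a predictable size.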

Initial Data Exploration

For my initial exploration of this dataset, I analyzed the number and topic of articles by the year and month in which they were published. I first wrangled the date into separate year, month, and day variables so that I could map aesthetics onto year and month independently. I did this with the mutate function, taking substrings of the original date.


headlines_sub_year <- headlines %>%
  #selecting only topic and date to reduce size 
    select(topic, date) %>%
  #creating a substring of the year by dropping the last 6 characters,
  #which leaves the first 4 characters of a 10-character date
    mutate(year = str_sub(date, 1, nchar(date)-6)) %>%
  
  #creating a substring of the month by selecting only the month characters
    mutate(month = str_sub(date, 6, nchar(date)-3)) %>%
  
  #creating a substring of the day 
    mutate(day = str_sub(date, 9, nchar(date)))
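As an alternative to fixed character positions, the same three variables can be derived by parsing the date first. The sketch below uses base R only and assumes the dates are stored as "YYYY/MM/DD", which is what the substring positions above imply; the example dates are made up.

```r
# Example dates in the assumed "YYYY/MM/DD" layout (made up for illustration)
dates <- c("2008/09/16", "2014/03/01", "2019/12/31")

# Parse once, then format the pieces that are needed
parsed <- as.Date(dates, format = "%Y/%m/%d")

parts <- data.frame(
  date  = dates,
  year  = format(parsed, "%Y"),
  month = format(parsed, "%m"),
  day   = format(parsed, "%d"),
  stringsAsFactors = FALSE
)
parts

# A value that does not match the layout becomes NA instead of a
# silently wrong substring
bad <- as.Date("2008-9", format = "%Y/%m/%d")
is.na(bad)
```

Parsing costs a little more than slicing strings, but malformed dates surface as missing values rather than ending up in the wrong year or month.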

I then used this new data frame to visualize the relationship between publication date and the number of articles published.


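#bar chart of articles per year, colored by month, one panel per topic
year <- ggplot(headlines_sub_year, aes(x = year, fill = month)) + 
  geom_bar() + 
  #note: this scale targets the colour aesthetic, so it does not change the
  #fill colours of the bars
  scale_color_colorblind() + 
  labs(title = "Yearly Amount of Publications by Month") +
  facet_wrap(~topic) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#bar chart of articles per month, colored by year
monthly <- ggplot(headlines_sub_year, aes(x = month, fill = year)) + 
  geom_bar() + 
  labs(title = "Monthly Amount of Publications by Year")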

year


monthly

This visualization suggests that the number of articles Lenta.ru published remained mostly steady across topic and year. I do note a spike in the ‘Economic’ topic in 2008. I am unsure what this could mean for Russian views of global developments in that year. That degree of analysis falls outside the scope of this project, but it would be a worthwhile topic for further research with this dataset, set against other developments of the same timeframe. It is also evident that the most common topic is ‘Russia’, that is, domestic affairs.

Wrangling Text Data

In this section, I show the process I used to wrangle the text variable. Because this variable holds the entire text of each article, I use only the randomly sampled dataset from before to conserve memory. I recommend downloading the full dataset (News Dataset from Lenta.ru, 2020) if you wish to include more articles in the text wrangling. I unnested all of the words in the text variable into the word variable of the new headlines_sub dataset and then counted the occurrences of each word within each topic. This prepares the dataset to be filtered by country, so that I end up with the number of mentions of each country. During this wrangling, I filtered the dataset through a list of Russian stopwords (Russian Stopwords Set, n.d.) to remove prepositions, conjunctions, and filler words. This is the same process as removing English stopwords, but specific to the Russian language and its grammar. A glimpse of the result is provided below.


headlines_sub <- headlines %>% 
  
  # Filtering by topic in the four topics I chose
  filter(topic %in% topics) %>%
  group_by(topic) %>% 
  
  # Splitting the article text into one lowercase word per row
  unnest_tokens(word, text) %>%
  
  #Here I am mapping the inflected form "россии" of the word Russia onto the 
  #base form "россия" so that both are counted together
  mutate(word = case_when(word == "россии" ~ "россия", 
                         word != "россии" ~ word)) %>%
  
  #counting each word within each topic
  count(word) %>%
  ungroup() %>%

  #removing stop words, such as and, the, but, etc.
  filter(!word %in% banned) %>%
  arrange(desc(n)) %>%
  na.omit()

#Creating a table using the kableExtra package
headlines_sub %>% 

  #limiting the table to only 20 words, not 200,000
  head(20) %>%
  
  #adding a caption to the table
  kbl(caption = "Glimpse at the full unnested dataset with counts 
      for each word") %>%
  
  #adding a striped pattern to the table
  kable_styling(bootstrap_options = "striped") %>%
  
  #setting the dimensions of the table 
  scroll_box(width = "50%", height = "400px")

Glimpse at the full unnested dataset with counts for each word
topic	word	n
Россия	россия	5653
Мир	сообщает	4807
Россия	сообщает	4668
Мир	сша	4505
Россия	новости	3492
Россия	риа	3485
Экономика	процента	3365
Россия	словам	3319
Экономика	россия	3264
Экономика	долларов	3257
Россия	заявил	2991
Экономика	компании	2845
Мир	заявил	2599
Россия	рф	2573
Россия	данным	2533
Россия	суд	2067
Экономика	сообщает	2045
Россия	москвы	2028
Мир	словам	2002
Мир	страны	1919
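The tokenize, normalize, filter, and count steps can be illustrated without tidytext. The following is a simplified base R sketch on made-up sentences, not the pipeline used above: it splits on single spaces only, whereas unnest_tokens() also lowercases the text and strips punctuation. The helper `count_words()` and the three-word stopword list are hypothetical.

```r
# Made-up, already lowercase example texts
texts <- data.frame(
  topic = c("Мир", "Мир", "Россия"),
  text  = c("президент сша заявил о россии",
            "в сша и в китае",
            "в россии и в москве"),
  stringsAsFactors = FALSE
)

# Tiny illustrative stopword list (the project uses a much longer one)
stop_words_ru <- c("в", "и", "о")

# Hypothetical helper: one row per word, then counts per topic and word
count_words <- function(df, stopwords) {
  tokens <- strsplit(df$text, " ", fixed = TRUE)
  long <- data.frame(
    topic = rep(df$topic, lengths(tokens)),
    word  = unlist(tokens),
    stringsAsFactors = FALSE
  )
  # drop empty strings and stopwords
  long <- long[nzchar(long$word) & !long$word %in% stopwords, ]
  # fold the inflected form into the base form, as in the pipeline above
  long$word[long$word == "россии"] <- "россия"
  # count occurrences per topic and word, most frequent first
  out <- aggregate(n ~ topic + word, data = cbind(long, n = 1L), FUN = sum)
  out[order(-out$n), ]
}

word_counts <- count_words(texts, stop_words_ru)
word_counts
```

In this toy example "сша" is counted twice in the "Мир" topic because the count is per token, not per article, which is the same choice made for the real dataset.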

In this step, I take the previous dataset, headlines_sub, and filter it through the countries list from above. This leaves only the country names found in the text variable of the original headlines dataset, their topic, and the number of times each country appeared. I did not cap the number of tokens of a country that can come from a single article, because I find the total number of mentions more illuminating than the number of articles in which a country is mentioned. This also retains the original weight of an article that mentions “США”, the United States, 20 times in the overall picture of the United States in this dataset.


#creating a dataset that holds the English names of each country
english_names <-  left_join(countries_set, locations, by  = 
                        c("alpha2" = "tolower(abbv)")) %>%
  
  #selecting only the Russian and English country names
  select(name,english) %>%
  
  #making lowercase to match the datasets I will join together
  mutate(name = tolower(name))

#Creating a dataset with Russian and English Names and count (n)
headlines_countries <- headlines_sub %>% 
  
  #filtering the unnested `text` variable through the country list
  filter(word %in% countries) %>%
  
  #Reordering by highest to lowest mention (variable n)
  mutate(word = fct_reorder(word,n)) %>%
  
  #grouping by topic
  group_by(topic) %>%
  
  #joining the English country names to this Russian dataset
  full_join(english_names, by = c("word" = "name")) %>%
  
  #removing rows with NA, i.e. rows without a match on both sides of the join
  na.omit()

#creating a table using the kableExtra package
headlines_countries %>%
  
#creating a caption for this table
kbl(caption = "Table of each country's number of mentions. n = number 
    of tokens from text wrangling; word = name of the country; 
    topic = one of: Мир, meaning 'world'; Россия, meaning 'Russia', 
    i.e. domestic affairs; Экономика, meaning 'economy'; 
    and Интернет и СМИ, meaning 'the internet and mass media'.", 
    position = "center") %>%
  
  #creating the styling of the table
  kable_styling(bootstrap_options = "striped", position = "center") %>%
  
  #setting the dimensions of the table
  scroll_box(width = "70%", height = "400px")

Table of each country’s number of mentions. n = number of tokens from text wrangling; word = name of the country; topic = one of: Мир, meaning ‘world’; Россия, meaning ‘Russia’, i.e. domestic affairs; Экономика, meaning ‘economy’; and Интернет и СМИ, meaning ‘the internet and mass media’.
topic	word	n	english
Россия	россия	5653	Russian Federation
Мир	сша	4505	United States
Экономика	россия	3264	Russian Federation
Экономика	сша	1286	United States
Россия	сша	1283	United States
Мир	россия	985	Russian Federation
Интернет и СМИ	россия	612	Russian Federation
Мир	израиль	540	Israel
Интернет и СМИ	сша	416	United States
Мир	кндр	375	North Korea
Мир	ирак	348	Iraq
Мир	иран	284	Iran, Islamic Republic of
Экономика	украина	174	Ukraine
Мир	китай	125	China
Мир	великобритания	108	United Kingdom
Экономика	китай	105	China
Мир	франция	94	France
Мир	сомали	83	Somalia
Мир	пакистан	73	Pakistan
Россия	грузия	71	Georgia
Мир	сирия	63	Syrian Arab Republic
Мир	украина	62	Ukraine
Мир	германия	59	Germany
Мир	чили	56	Chile
Мир	афганистан	55	Afghanistan
Россия	китай	55	China
Мир	япония	52	Japan
Мир	ливан	48	Lebanon
Мир	египет	47	Egypt
Мир	юар	47	South Africa
Экономика	иран	45	Iran, Islamic Republic of
Россия	украина	43	Ukraine
Мир	гаити	41	Haiti
Россия	кндр	41	North Korea
Экономика	белоруссия	41	Belarus
Экономика	япония	35	Japan
Мир	индия	34	India
Мир	испания	33	Spain
Россия	великобритания	33	United Kingdom
Мир	перу	32	Peru
Мир	турция	32	Turkey
Экономика	франция	32	France
Мир	зимбабве	31	Zimbabwe
Мир	италия	31	Italy
Россия	израиль	30	Israel
Россия	ирак	30	Iraq
Экономика	германия	29	Germany
Мир	никарагуа	28	Nicaragua
Мир	оаэ	28	United Arab Emirates
Экономика	великобритания	28	United Kingdom
Россия	афганистан	27	Afghanistan
Мир	колумбия	26	Colombia
Мир	ливия	26	Libyan Arab Jamahiriya
Мир	австралия	24	Australia
Мир	бангладеш	24	Bangladesh
Мир	канада	24	Canada
Мир	сербия	24	Serbia
Россия	сомали	24	Somalia
Россия	франция	24	France
Экономика	венесуэла	24	Venezuela
Экономика	польша	24	Poland
Мир	марокко	23	Morocco
Россия	германия	23	Germany
Экономика	зимбабве	23	Zimbabwe
Экономика	казахстан	23	Kazakhstan
Мир	фиджи	21	Fiji
Россия	иран	21	Iran, Islamic Republic of
Россия	япония	21	Japan
Мир	грузия	20	Georgia
Россия	марокко	20	Morocco
Экономика	ирак	20	Iraq
Мир	кувейт	19	Kuwait
Мир	польша	19	Poland
Россия	пакистан	19	Pakistan
Экономика	азербайджан	19	Azerbaijan
Экономика	грузия	19	Georgia
Мир	венесуэла	18	Venezuela
Мир	куба	17	Cuba
Россия	азербайджан	17	Azerbaijan
Мир	бельгия	16	Belgium
Россия	египет	16	Egypt
Россия	казахстан	16	Kazakhstan
Экономика	индия	16	India
Мир	кипр	15	Cyprus
Экономика	латвия	15	Latvia
Экономика	туркмения	15	Turkmenistan
Россия	перу	14	Peru
Мир	белоруссия	13	Belarus
Мир	индонезия	13	Indonesia
Мир	филиппины	13	Philippines
Россия	польша	13	Poland
Экономика	оаэ	13	United Arab Emirates
Экономика	турция	13	Turkey
Экономика	узбекистан	13	Uzbekistan
Россия	оаэ	12	United Arab Emirates
Россия	таиланд	12	Thailand
Экономика	италия	12	Italy
Мир	болгария	11	Bulgaria
Мир	дания	11	Denmark
Мир	литва	11	Lithuania
Мир	румыния	11	Romania
Мир	судан	11	Sudan
Россия	кипр	11	Cyprus
Россия	турция	11	Turkey
Экономика	венгрия	11	Hungary
Экономика	канада	11	Canada
Экономика	литва	11	Lithuania
Экономика	мексика	11	Mexico
Экономика	молдавия	11	Moldova, Republic of
Экономика	норвегия	11	Norway
Экономика	сингапур	11	Singapore
Экономика	юар	11	South Africa
Мир	австрия	10	Austria
Мир	нидерланды	10	Netherlands
Мир	словакия	10	Slovakia
Россия	таджикистан	10	Tajikistan
Россия	юар	10	South Africa
Экономика	австрия	10	Austria
Экономика	катар	10	Qatar
Интернет и СМИ	израиль	9	Israel
Интернет и СМИ	украина	9	Ukraine
Мир	молдавия	9	Moldova, Republic of
Мир	словения	9	Slovenia
Россия	белоруссия	9	Belarus
Россия	зимбабве	9	Zimbabwe
Россия	италия	9	Italy
Россия	узбекистан	9	Uzbekistan
Экономика	алжир	9	Algeria
Экономика	швейцария	9	Switzerland
Экономика	швеция	9	Sweden
Интернет и СМИ	великобритания	8	United Kingdom
Интернет и СМИ	индия	8	India
Мир	алжир	8	Algeria
Мир	вьетнам	8	Vietnam
Мир	греция	8	Greece
Мир	иордания	8	Jordan
Мир	катар	8	Qatar
Мир	узбекистан	8	Uzbekistan
Мир	чехия	8	Czech Republic
Мир	швейцария	8	Switzerland
Мир	эквадор	8	Ecuador
Россия	литва	8	Lithuania
Экономика	нигерия	8	Nigeria
Экономика	нидерланды	8	Netherlands
Интернет и СМИ	ирак	7	Iraq
Интернет и СМИ	китай	7	China
Мир	азербайджан	7	Azerbaijan
Мир	албания	7	Albania
Мир	ирландия	7	Ireland
Мир	казахстан	7	Kazakhstan
Мир	нигерия	7	Nigeria
Мир	таджикистан	7	Tajikistan
Мир	черногория	7	Montenegro
Мир	швеция	7	Sweden
Россия	испания	7	Spain
Экономика	австралия	7	Australia
Экономика	болгария	7	Bulgaria
Экономика	бразилия	7	Brazil
Экономика	греция	7	Greece
Экономика	израиль	7	Israel
Экономика	индонезия	7	Indonesia
Экономика	оман	7	Oman
Экономика	тонга	7	Tonga
Мир	армения	6	Armenia
Мир	бахрейн	6	Bahrain
Мир	венгрия	6	Hungary
Мир	науру	6	Nauru
Мир	нигер	6	Niger
Мир	норвегия	6	Norway
Мир	таиланд	6	Thailand
Мир	финляндия	6	Finland
Мир	эфиопия	6	Ethiopia
Россия	бельгия	6	Belgium
Россия	киргизия	6	Kyrgyzstan
Россия	норвегия	6	Norway
Россия	судан	6	Sudan
Экономика	испания	6	Spain
Экономика	кндр	6	North Korea
Экономика	лаос	6	Lao People’s Democratic Republic
Экономика	пакистан	6	Pakistan
Интернет и СМИ	германия	5	Germany
Интернет и СМИ	египет	5	Egypt
Интернет и СМИ	иран	5	Iran, Islamic Republic of
Интернет и СМИ	испания	5	Spain
Интернет и СМИ	кндр	5	North Korea
Мир	бразилия	5	Brazil
Мир	люксембург	5	Luxembourg
Мир	мали	5	Mali
Мир	эстония	5	Estonia
Россия	армения	5	Armenia
Россия	дания	5	Denmark
Россия	индия	5	India
Россия	индонезия	5	Indonesia
Россия	канада	5	Canada
Россия	колумбия	5	Colombia
Россия	швейцария	5	Switzerland
Экономика	албания	5	Albania
Экономика	армения	5	Armenia
Экономика	бангладеш	5	Bangladesh
Экономика	дания	5	Denmark
Экономика	египет	5	Egypt
Экономика	науру	5	Nauru
Экономика	словения	5	Slovenia
Экономика	таджикистан	5	Tajikistan
Экономика	финляндия	5	Finland
Экономика	эстония	5	Estonia
Интернет и СМИ	азербайджан	4	Azerbaijan
Интернет и СМИ	грузия	4	Georgia
Интернет и СМИ	франция	4	France
Интернет и СМИ	швейцария	4	Switzerland
Мир	мексика	4	Mexico
Мир	хорватия	4	Croatia
Мир	чад	4	Chad
Россия	бангладеш	4	Bangladesh
Россия	бразилия	4	Brazil
Россия	венесуэла	4	Venezuela
Россия	греция	4	Greece
Россия	латвия	4	Latvia
Россия	ливия	4	Libyan Arab Jamahiriya
Россия	финляндия	4	Finland
Россия	чехия	4	Czech Republic
Экономика	бельгия	4	Belgium
Экономика	боливия	4	Bolivia
Экономика	ирландия	4	Ireland
Экономика	кипр	4	Cyprus
Экономика	колумбия	4	Colombia
Экономика	ливия	4	Libyan Arab Jamahiriya
Экономика	малайзия	4	Malaysia
Экономика	нигер	4	Niger
Экономика	перу	4	Peru
Экономика	португалия	4	Portugal
Экономика	филиппины	4	Philippines
Экономика	чехия	4	Czech Republic
Интернет и СМИ	белоруссия	3	Belarus
Интернет и СМИ	бразилия	3	Brazil
Интернет и СМИ	вьетнам	3	Vietnam
Интернет и СМИ	канада	3	Canada
Интернет и СМИ	колумбия	3	Colombia
Интернет и СМИ	пакистан	3	Pakistan
Интернет и СМИ	сингапур	3	Singapore
Интернет и СМИ	япония	3	Japan
Мир	бурунди	3	Burundi
Мир	гвинея	3	Guinea
Мир	гондурас	3	Honduras
Мир	йемен	3	Yemen
Мир	киргизия	3	Kyrgyzstan
Мир	латвия	3	Latvia
Мир	малави	3	Malawi
Мир	малайзия	3	Malaysia
Мир	монако	3	Monaco
Мир	монголия	3	Mongolia
Мир	португалия	3	Portugal
Мир	сингапур	3	Singapore
Мир	тунис	3	Tunisia
Мир	туркмения	3	Turkmenistan
Мир	цар	3	Central African Republic
Мир	эритрея	3	Eritrea
Россия	австрия	3	Austria
Россия	алжир	3	Algeria
Россия	болгария	3	Bulgaria
Россия	гаити	3	Haiti
Россия	ирландия	3	Ireland
Россия	нидерланды	3	Netherlands
Россия	никарагуа	3	Nicaragua
Россия	парагвай	3	Paraguay
Россия	сербия	3	Serbia
Россия	сирия	3	Syrian Arab Republic
Россия	словакия	3	Slovakia
Россия	филиппины	3	Philippines
Россия	чили	3	Chile
Россия	эстония	3	Estonia
Экономика	бруней	3	Brunei Darussalam
Экономика	вьетнам	3	Vietnam
Экономика	исландия	3	Iceland
Экономика	киргизия	3	Kyrgyzstan
Экономика	люксембург	3	Luxembourg
Экономика	монако	3	Monaco
Экономика	монголия	3	Mongolia
Экономика	таиланд	3	Thailand
Интернет и СМИ	афганистан	2	Afghanistan
Интернет и СМИ	казахстан	2	Kazakhstan
Интернет и СМИ	норвегия	2	Norway
Интернет и СМИ	оаэ	2	United Arab Emirates
Интернет и СМИ	узбекистан	2	Uzbekistan
Интернет и СМИ	швеция	2	Sweden
Интернет и СМИ	юар	2	South Africa
Мир	ботсвана	2	Botswana
Мир	бруней	2	Brunei Darussalam
Мир	вануату	2	Vanuatu
Мир	доминика	2	Dominica
Мир	исландия	2	Iceland
Мир	кения	2	Kenya
Мир	лаос	2	Lao People’s Democratic Republic
Мир	мальта	2	Malta
Мир	мьянма	2	Myanmar
Мир	непал	2	Nepal
Мир	оман	2	Oman
Мир	панама	2	Panama
Мир	сальвадор	2	El Salvador
Россия	аргентина	2	Argentina
Россия	бурунди	2	Burundi
Россия	венгрия	2	Hungary
Россия	вьетнам	2	Vietnam
Россия	гана	2	Ghana
Россия	доминика	2	Dominica
Россия	йемен	2	Yemen
Россия	иордания	2	Jordan
Россия	ливан	2	Lebanon
Россия	люксембург	2	Luxembourg
Россия	малайзия	2	Malaysia
Россия	мали	2	Mali
Россия	мексика	2	Mexico
Россия	молдавия	2	Moldova, Republic of
Россия	нигер	2	Niger
Россия	нигерия	2	Nigeria
Россия	румыния	2	Romania
Россия	хорватия	2	Croatia
Россия	чад	2	Chad
Экономика	ангола	2	Angola
Экономика	бурунди	2	Burundi
Экономика	гайана	2	Guyana
Экономика	гондурас	2	Honduras
Экономика	куба	2	Cuba
Экономика	кувейт	2	Kuwait
Экономика	мьянма	2	Myanmar
Экономика	непал	2	Nepal
Экономика	никарагуа	2	Nicaragua
Экономика	румыния	2	Romania
Экономика	хорватия	2	Croatia
Экономика	чад	2	Chad
Экономика	чили	2	Chile
Экономика	эквадор	2	Ecuador
Интернет и СМИ	австралия	1	Australia
Интернет и СМИ	белиз	1	Belize
Интернет и СМИ	болгария	1	Bulgaria
Интернет и СМИ	боливия	1	Bolivia
Интернет и СМИ	иордания	1	Jordan
Интернет и СМИ	италия	1	Italy
Интернет и СМИ	катар	1	Qatar
Интернет и СМИ	куба	1	Cuba
Интернет и СМИ	ливан	1	Lebanon
Интернет и СМИ	ливия	1	Libyan Arab Jamahiriya
Интернет и СМИ	марокко	1	Morocco
Интернет и СМИ	мексика	1	Mexico
Интернет и СМИ	молдавия	1	Moldova, Republic of
Интернет и СМИ	непал	1	Nepal
Интернет и СМИ	нидерланды	1	Netherlands
Интернет и СМИ	парагвай	1	Paraguay
Интернет и СМИ	перу	1	Peru
Интернет и СМИ	польша	1	Poland
Интернет и СМИ	таджикистан	1	Tajikistan
Интернет и СМИ	таиланд	1	Thailand
Интернет и СМИ	филиппины	1	Philippines
Интернет и СМИ	чад	1	Chad
Мир	аргентина	1	Argentina
Мир	барбадос	1	Barbados
Мир	гана	1	Ghana
Мир	гренада	1	Grenada
Мир	джибути	1	Djibouti
Мир	либерия	1	Liberia
Мир	лихтенштейн	1	Liechtenstein
Мир	мавритания	1	Mauritania
Мир	мадагаскар	1	Madagascar
Мир	мальдивы	1	Maldives
Мир	микронезия	1	Micronesia, Federated States of
Мир	танзания	1	Tanzania, United Republic of
Мир	тонга	1	Tonga
Мир	тувалу	1	Tuvalu
Мир	уругвай	1	Uruguay
Россия	австралия	1	Australia
Россия	гамбия	1	Gambia
Россия	исландия	1	Iceland
Россия	катар	1	Qatar
Россия	кения	1	Kenya
Россия	маврикий	1	Mauritius
Россия	мавритания	1	Mauritania
Россия	мадагаскар	1	Madagascar
Россия	монако	1	Monaco
Россия	монголия	1	Mongolia
Россия	непал	1	Nepal
Россия	португалия	1	Portugal
Россия	сенегал	1	Senegal
Россия	сингапур	1	Singapore
Россия	туркмения	1	Turkmenistan
Россия	уругвай	1	Uruguay
Россия	швеция	1	Sweden
Россия	эквадор	1	Ecuador
Экономика	андорра	1	Andorra
Экономика	аргентина	1	Argentina
Экономика	афганистан	1	Afghanistan
Экономика	бутан	1	Bhutan
Экономика	вануату	1	Vanuatu
Экономика	гаити	1	Haiti
Экономика	гамбия	1	Gambia
Экономика	гватемала	1	Guatemala
Экономика	гвинея	1	Guinea
Экономика	гренада	1	Grenada
Экономика	доминика	1	Dominica
Экономика	иордания	1	Jordan
Экономика	камбоджа	1	Cambodia
Экономика	камерун	1	Cameroon
Экономика	кения	1	Kenya
Экономика	лесото	1	Lesotho
Экономика	либерия	1	Liberia
Экономика	ливан	1	Lebanon
Экономика	лихтенштейн	1	Liechtenstein
Экономика	мальдивы	1	Maldives
Экономика	мальта	1	Malta
Экономика	руанда	1	Rwanda
Экономика	сальвадор	1	El Salvador
Экономика	сенегал	1	Senegal
Экономика	сербия	1	Serbia
Экономика	сирия	1	Syrian Arab Republic
Экономика	словакия	1	Slovakia
Экономика	тунис	1	Tunisia
Экономика	эфиопия	1	Ethiopia
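The join in the code above is a full join followed by na.omit(), which in effect keeps only the rows that matched on both sides. The base R sketch below shows that equivalence on three rows; the counts are borrowed from the tables above, while the lookup table `names_lookup` is a hypothetical stand-in for the real list of country names.

```r
# Word counts: two countries and one ordinary word
counts <- data.frame(
  word = c("сша", "китай", "суд"),
  n    = c(4505L, 125L, 2067L),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of Russian to English country names
names_lookup <- data.frame(
  name    = c("сша", "китай", "франция"),
  english = c("United States", "China", "France"),
  stringsAsFactors = FALSE
)

# Full join, then drop rows that have NA on either side
full       <- merge(counts, names_lookup, by.x = "word", by.y = "name", all = TRUE)
full_clean <- na.omit(full)

# Inner join: keeps matched rows directly
inner <- merge(counts, names_lookup, by.x = "word", by.y = "name")

# Both approaches keep the same two countries
setequal(full_clean$word, inner$word)
```

The ordinary word "суд" drops out because it has no English country name, and "франция" drops out because it has no count. One difference to keep in mind is that na.omit() would also remove a matched row if any other column happened to be missing.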

For these graphs, I show how each country’s mentions are distributed across article topics. I kept only country and topic combinations with more than 50 mentions (n > 50) so that the visualization is not saturated with countries that are mentioned only a handful of times. This filter focuses on the countries mentioned most often in the Lenta.ru articles that I analyzed.


#creating a plot of the top mentioned countries
country_plot2 <- ggplot(headlines_countries %>%
                          filter(n > 50), aes(x = english, y = n)) + 
  
  #specifying the type of visualization; rows for the same country in
  #different topics are stacked into one bar
  geom_col() +
  
  #placing the country names on the Y-axis by flipping x and y axes
  coord_flip() + 
  
  #adding labels and captions to the graph
  labs(title = "Countries with the Most Mentions in Lenta.ru",
       x = "Country",
       y = "Number of instances in Articles",
       caption = "Raw count of countries' mentions in all articles")


#creating a plot that shows the distribution of topic by country
country_plot3 <- ggplot(headlines_countries %>%
                          filter(n > 50), 
                        aes(x = english, y = n, fill = topic)) + 
  
  #setting the position to 'fill' to visualize each topic's share of a 
  #country's mentions
  geom_col(position = "fill") +
  
  #flipping x and y axes
  coord_flip() + 
  
  #Placing the legend at the bottom of the graph
  theme(legend.position = "bottom") +
  
  #adding labels and captions; with position = "fill" the axis shows shares
  labs(title = "Countries with the Most Mentions in Lenta.ru by Topic",
       x = "Country",
       y = "Share of mentions",
       fill = "Topic",
       caption = "Distribution of countries with a count higher than 50 (n > 50), by topic of article")

#printing out both graphs
country_plot2


country_plot3
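What position = "fill" displays can be reproduced by hand: each bar segment is a topic's count divided by the country's total. The sketch below is a base R illustration on a few rows taken from the table above, with topic names translated into English for readability; it covers only a subset of the rows that pass the n > 50 filter.

```r
# A few rows from the mentions table (subset, topics translated)
mentions <- data.frame(
  english = c("United States", "United States", "United States",
              "China", "China"),
  topic   = c("World", "Economy", "Russia", "World", "Economy"),
  n       = c(4505, 1286, 1283, 125, 105),
  stringsAsFactors = FALSE
)

# Total mentions per country, repeated on every row of that country
country_total <- ave(mentions$n, mentions$english, FUN = sum)

# Share of each topic within its country: this is the bar segment length
mentions$share <- mentions$n / country_total
mentions

# Within each country the shares add up to 1
tapply(mentions$share, mentions$english, sum)
```

Because every bar is rescaled to length 1, this view compares the mix of topics across countries and hides how many mentions each country has in total, which is why it is shown alongside the raw counts.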

Map Making

In this section of my method, I detail how I created my map graphics and joined my spatial data with my country data. To begin, I joined the locations and country datasets from the CSV section above to create a dataset that has each country’s name in both Russian and English, along with its latitude and longitude. These variables are called Latitude and Longitude; I rename them to lat and lon when I build the nodes. I also cleaned up the data and removed the duplicate column of English country names. I then joined this with the dataset of country names and their counts (n). This produced the dataset spatial, which holds the count per country (variable n), the Russian name of the country (variable name), the English name of the country (variable english), and the latitude and longitude of each country (Latitude and Longitude respectively). Later in this section, I clean up the dataset further and remove unnecessary columns.


#making a temporary dataset to help join my location data to my count total
spatial2 <- left_join(countries_set, locations, by  = 
                        c("alpha2" = "tolower(abbv)")) %>%
  
  #this is used to match the country names in the 'headlines_countries' set
  mutate(tolower(name)) 

#here I am making the final dataset with country names, count, lat, and long
spatial <- left_join(headlines_countries, spatial2, by = 
                       c("word" = "tolower(name)")) %>%
  
  #I am removing this duplicate column
  select(!`english.y`) %>%
  
  #I am renaming this for simplicity 
  rename("english" = "english.x") %>%
  
  #I am reordering my count
  mutate(name = fct_reorder(name,n)) %>%
  
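  #I am assigning ID values to each country to use to create the nodes 
  #and edges; the id is the integer code of the country's factor level, and 
  #unique() only fits here if each country appears once per topic group
  mutate(id = as.numeric(unique(name))) 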

#making a variable that holds the id value of Russia
#this will be used to create the edges and nodes coming from Russia
russia_id <- spatial[1,]$id
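A node id only needs to be the same number every time the same country appears. The sketch below shows one simple way to get that in base R with match(); it is an alternative illustration on made-up rows, not the factor-based approach used above, and the names `spatial_toy` and `russia_id_toy` are hypothetical.

```r
# Made-up rows: the same country can appear under several topics
spatial_toy <- data.frame(
  topic = c("Мир", "Мир", "Экономика", "Экономика"),
  name  = c("Россия", "США", "Россия", "Китай"),
  stringsAsFactors = FALSE
)

# Each distinct name gets the position of its first appearance as its id
spatial_toy$id <- match(spatial_toy$name, unique(spatial_toy$name))
spatial_toy

# Look the hub's id up by name instead of relying on row order
russia_id_toy <- spatial_toy$id[spatial_toy$name == "Россия"][1]
russia_id_toy
```

Looking the id up by name avoids depending on Russia being the first row of the dataset.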

In this code segment, I separate the data into topic-specific datasets. I did this because, later in the process, the edges would not allow for multiple topic variables. I therefore made 4 separate datasets and 4 maps, one for each topic. I also kept only countries that were mentioned more than 10 times within a topic (n > 10). This reduces clutter and focuses on the countries mentioned most often in Lenta.ru articles.


#These datasets are to filter by topic and have a dataset for each
world <- spatial %>% 
  filter(topic == "Мир") %>%
  filter(n > 10 ) 

russia <- spatial %>%
  filter(topic == "Россия")%>%
  filter(n > 10 )

economy <- spatial %>%
  filter(topic == "Экономика")%>%
  filter(n > 10 )

internet <- spatial %>%
  filter(topic == "Интернет и СМИ") %>%
  filter(n > 10)
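The four filters above follow one pattern, so they can also be expressed as a single split. The base R sketch below is a hypothetical alternative on a handful of rows (counts taken from the table above, English names only); `split_by_topic()` is not part of the original code.

```r
# A few rows from the mentions table
toy <- data.frame(
  topic   = c("Мир", "Мир", "Мир", "Мир", "Экономика", "Экономика"),
  english = c("United States", "Fiji", "Cuba", "Greece", "China", "Oman"),
  n       = c(4505, 21, 17, 8, 105, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply the threshold once, then split into one
# data frame per topic
split_by_topic <- function(df, min_n) {
  kept <- df[df$n > min_n, ]
  split(kept, kept$topic)
}

by_topic <- split_by_topic(toy, min_n = 10)
names(by_topic)
by_topic[["Мир"]]
```

The result is a named list, so each topic's data frame can be pulled out by its Russian name, and the threshold is defined in one place.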

In this segment, I make a theme that can be applied to all 4 maps. This way I do not have to repeat the code for each map, and editing the theme here changes all of the maps at once. I also define the country layer, which draws the shapes of all countries on the map. I go into further depth on the different layers of my map visualizations in the next section.


#Here I am creating a custom theme for the following maps to all share
#The lines that are commented out are options I chose not to use but that can
#be turned back on

#saving it as a whole theme
maptheme <- #theme(panel.grid = element_blank()) +
  
  #removes the axis text 
  theme(axis.text = element_blank()) +
  
  #removes the axis ticks on the map
  theme(axis.ticks = element_blank()) +
  
  #removes the titles from the axis
  theme(axis.title = element_blank()) +
  
  #Change the font/ size of text in Map
  #theme(text = element_text(size = 12, family = "Times")) +
  
  #sets the position of the legend
  theme(legend.position = "bottom") +
  
  #Controls the background of the map
  theme(panel.background = element_rect(fill = "deepskyblue1")) +
  
  theme(plot.margin = unit(c(0, 0, 0.5, 0), 'cm'))

#This is the theme for the color, and shapes of the countries overlay
country_shapes <- geom_polygon(aes(x = long, y = lat, group = group),
                               data = map_data('world'),
                               
                               #color of countries, and their borders
                               fill = "seagreen2", color = "black",
                               
                               #size/ thickness of the borders
                               size = 0.05)

#Sets the framing of the map to these specific coords
#I am not using this 
mapcoords <- coord_fixed(xlim = c(-150, 180), ylim = c(-55, 80))

Visualizations

For this segment of my analysis, I used the methods in Konrad (2018) to create my visualizations of the network overlays. The process involved creating layers and overlaying them on top of each other. The first step was creating the nodes and edges data. This is shown in the nodes_world dataset, for which I took the dataset filtered on the topic variable and selected the necessary columns. The nodes were simple to create. For the edges, I used the id that I assigned to each country to make sure that Russia is always the from end of the edge. I then set the to variable to the id of each country. Then I joined these together to have the latitude and longitude of both the from country, which is always Russia, and the to country. Finally, I filtered Russia out of this dataset, because for Russia the to and from variables were the same, which created problems for the geom_curve() overlay.

The process to make each map is the same; only the topic-specific dataset changes. I used the nodes as the base dataset, then added the country_shapes layer from the previous section, which draws the green shapes of the countries of the world. The next layer was the curves of the edges. For this I used geom_curve() and set the start and end of each curve to the latitude and longitude of the starting node and the ending node. The next layer was the points themselves, drawn with geom_point(). I set the size of the points from the number of mentions, n, and I had to run n through log10 to keep the sizes proportionate to each other. Originally, without doing this, the United States and Russian Federation points were so large they covered the entire map. I then used facet_wrap() to create the header that displays the topic name in Russian. This was only for aesthetic purposes, as there is only one topic in each dataset. Finally, I added the map theme from the previous section so that all maps share the same base theme. I changed the colors for each topic to tell the topics apart. The process is the same for each map.


#Creating the nodes from the specific topic dataset
nodes_world <- world %>%
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)

#making every edge start from Russia's id and end at the country's own id
edges_world <-  nodes_world %>%
  mutate(from = russia_id) %>%
  mutate(to = id) 

#creating the edges by joining the nodes and edges datasets; after both joins,
#lon.y/lat.y hold the 'from' (Russia) coordinates and lon/lat the 'to' ones
edges_for_world <- edges_world %>%
  inner_join(nodes_world %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_world %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  filter(name != "Россия") 

#beginning the ggplot pipeline for the maps
world_plot <-ggplot(nodes_world) + 
  
  #overlay of country shapes from segment above
  country_shapes +
  
  #adding the edges 
  geom_curve(aes(x = `lon.y`, 
                 y = `lat.y`, 
                 xend = lon, 
                 yend = lat),
             data = edges_for_world,
             curvature = 0.1,
             alpha = 0.5, 
             color = "darkorange1") +
  
  #hiding the size legend (guide = "none" replaces the deprecated guide = FALSE)
  scale_size_continuous(guide = "none") +
  
  #adding the country points to the map
  geom_point(aes(x = lon, y = lat),                         
             shape = 21, 
             size = log10(world$n), 
             fill = 'darkorange2',
             color = 'black', stroke = 0.5)  +
  
  #title
  labs(title = "Mentions of Foreign countries in 'World' topic") + 
  
  #adding a facet wrap only for aesthetic reasons 
  facet_wrap(~topic) +
  
  #adding the theme from above
  maptheme

#plotting the graph
world_plot
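The edge table behind these maps has a simple shape: one row per foreign country, with the hub's coordinates as the start and the country's coordinates as the end. The base R sketch below builds such a table directly; it is a simplified illustration rather than the join-based code above, the helper `make_star_edges()` is hypothetical, and the coordinates are rough values chosen for the example, not taken from the dataset.

```r
# Toy nodes with rough, illustrative coordinates
nodes <- data.frame(
  id   = 1:3,
  name = c("Russia", "United States", "China"),
  lon  = c(100, -98, 105),
  lat  = c(60, 38, 35),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one edge from the hub to every other node
make_star_edges <- function(nodes, hub_id) {
  hub     <- nodes[nodes$id == hub_id, ]
  targets <- nodes[nodes$id != hub_id, ]   # the hub gets no edge to itself
  data.frame(
    from = hub_id,
    to   = targets$id,
    x    = hub$lon,      # start of the curve: the hub
    y    = hub$lat,
    xend = targets$lon,  # end of the curve: the mentioned country
    yend = targets$lat
  )
}

edges <- make_star_edges(nodes, hub_id = 1)
edges
```

The columns x, y, xend, and yend correspond to the four aesthetics that geom_curve() expects, so a table of this shape can be passed to it as the data argument.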


#Creating the nodes from the specific topic dataset
nodes_economy <- economy %>%
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)

#trying to make the nodes all start from Russia's id
edges_economy <-  nodes_economy %>%
  mutate(from = russia_id) %>%
  mutate(to = id) 

#creating the edges from the nodes and edges datasets by joining them
edges_for_economy <- edges_economy %>%
  inner_join(nodes_economy %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_economy %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  filter(name != "Россия") %>%
  unique()

#beginning of ggplot pipeline
economy_plot <- ggplot(nodes_economy) +
  
  #Loading country layer with theme settings from above
  country_shapes +
  
  #overlay of the edges 
  geom_curve(aes(x = `lon.y`, 
                 y = `lat.y`, 
                 xend = lon, 
                 yend = lat),
             data = edges_for_economy,
             curvature = 0.2,
             alpha = 0.5, 
             color = "violetred1") +
  
  #suppressing the size legend (guide = FALSE is deprecated in recent ggplot2)
  scale_size_continuous(guide = "none") +
  
  #overlay of the points
  geom_point(aes(x = lon, y = lat),                         
             shape = 21, 
             size = log10(economy$n), 
             fill = 'red',
             color = 'black', stroke = 0.5)  +
  
  labs(title = "Mentions of Foreign countries in 'Economy' topic") + 
  
  #only for aesthetic reasons 
  facet_wrap(~topic) +
  
  #theme from above
  maptheme

economy_plot

#Creating the nodes from the specific topic dataset
nodes_russia <- russia %>%
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)

#trying to make the nodes all start from Russia's id
edges_russia <-  nodes_russia %>%
  mutate(from = russia_id) %>%
  mutate(to = id) 

#making the edges from the nodes and edges datasets 
edges_for_russia <- edges_russia %>%
  inner_join(nodes_russia %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_russia %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  filter(name != "Россия") %>%
  unique()

#beginning of ggplot pipeline
domestic_plot <- ggplot(nodes_russia) +
  
  #Loading country layer with theme settings from above
  country_shapes +
  
  #adding the edges layer, colored to align with the topic
  geom_curve(aes(x = `lon.y`, 
                 y = `lat.y`, 
                 xend = lon, 
                 yend = lat),
             data = edges_for_russia,
             curvature = 0.2,
             alpha = 0.5, 
             color = "hotpink1") +
  
  #suppressing the size legend (guide = FALSE is deprecated in recent ggplot2)
  scale_size_continuous(guide = "none") +
  
  #adding the point layer
  geom_point(aes(x = lon, y = lat),                         
             shape = 21, 
             size = log10(russia$n), 
             fill = 'hotpink',
             color = 'black', stroke = 0.5)  +
  
  #title of graph
  labs(title = "Mentions of Foreign countries in 'Domestic' topic") + 
  
  #only for aesthetic reasons
  facet_wrap(~topic) +
  
  #adding the theme from above
  maptheme

domestic_plot

#Creating the nodes from the specific topic dataset
nodes_internet <- internet %>%
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)

#trying to make the nodes all start from Russia's id
edges_internet <-  nodes_internet %>%
  mutate(from = russia_id) %>%
  mutate(to = id) 

#creating the edges by joining together the edges and nodes datasets
edges_for_internet <- edges_internet %>%
  inner_join(nodes_internet %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_internet %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  filter(name != "Россия") %>%
  unique()

#beginning of ggplot pipeline
internet_plot <- ggplot(nodes_internet) +
  
  #Loading country layer with theme settings from above
  country_shapes +
  
  #adding edges layer
  geom_curve(aes(x = `lon.y`, 
                 y = `lat.y`, 
                 xend = lon, 
                 yend = lat),
             data = edges_for_internet,
             curvature = 0.2,
             alpha = 0.5, 
             color = "darkorchid1") +
  
  #suppressing the size legend (guide = FALSE is deprecated in recent ggplot2)
  scale_size_continuous(guide = "none") +
  
  #adding point layer
  geom_point(aes(x = lon, y = lat),                         
             shape = 21, 
             size = log10(internet$n), 
             fill = 'purple',
             color = 'black', stroke = 0.5)  +
  
  labs(title = "Mentions of Foreign countries in 'Internet' topic") +
  
  #only for aesthetic reasons 
  facet_wrap(~topic) + 
  
  #adding theme from above
  maptheme

internet_plot
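Since the four blocks above differ only in the topic dataset and the colors, the edge-building step could be wrapped in a helper. The following is a sketch, not the code used for the maps above. `build_edges()`, `toy_nodes`, and the column names `lon_from`, `lat_from`, `lon_to`, and `lat_to` are hypothetical, and the coordinates are approximate values for illustration.

```r
library(dplyr)

# Build origin-to-country edges for one topic's node table.
# `nodes` needs the columns id, name, lon and lat;
# `origin_id` is the id of the node every edge starts from (Russia in this report).
build_edges <- function(nodes, origin_id) {
  # one row of coordinates per node id
  coords <- nodes %>%
    select(id, lon, lat) %>%
    distinct()

  nodes %>%
    # drop the origin itself so it does not get an edge to itself
    filter(id != origin_id) %>%
    mutate(from = origin_id, to = id) %>%
    select(from, to, name) %>%
    # attach the coordinates of the start node
    inner_join(rename(coords, lon_from = lon, lat_from = lat),
               by = c("from" = "id")) %>%
    # attach the coordinates of the end node
    inner_join(rename(coords, lon_to = lon, lat_to = lat),
               by = c("to" = "id"))
}

# Hypothetical nodes with approximate coordinates
toy_nodes <- data.frame(
  id = 1:3,
  name = c("Россия", "США", "Германия"),
  lon = c(105.3, -95.7, 10.5),
  lat = c(61.5, 37.1, 51.2)
)

toy_edges <- build_edges(toy_nodes, origin_id = 1L)
print(toy_edges)
```

The result could then be passed to geom_curve() with aes(x = lon_from, y = lat_from, xend = lon_to, yend = lat_to). Naming the coordinate columns explicitly avoids relying on the `.x`/`.y` suffixes that the joins add automatically.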

Conclusion

The results of my visualizations and data wrangling show that the United States is by far the most-mentioned foreign country in this Lenta.ru dataset. The only country mentioned more often than the United States is Russia itself, which makes sense for a Russian news source. This lends itself to the idea that Russia is most concerned with the United States in regard to certain topics. The domestic topic is the only topic where Russia is mentioned more than the United States, which also makes sense: in the Russian view, the United States is a foreign affair, not a domestic one.

For the internet topic, in Russian “Интернет и СМИ” (“Internet and Mass Media”), the only country mentioned more than 10 times is the United States. I find this an interesting insight into the relationship between Russian and American internet culture. Clearly, there is not enough here to analyze the cultures specifically, but the analysis does show how often the United States appears in Russian coverage of this topic. Seeing as internet culture has grown tremendously in the past two decades, this could be an excellent point of future research into the relationship between Russian and American cultures.

For further research on this topic, I foresee that exploring other Russian news agencies would be prudent and fruitful. It would be interesting to see whether the weight of each country’s mentions holds consistent across other Russian news agencies. It would also be interesting to investigate other countries and the interaction between news articles in multiple countries. This, I predict, could illuminate how countries respond to and interact with one another, and the results could be visualized with the same method.

Class Peer Reviews

Reviewer 1
1. State the authors’ objectives and the general questions that the authors are considering. Do the results and figures support the conclusions made in the report?
The author’s objective is to provide insight into the Russian vision of the world from a Russian perspective by analyzing mentions of foreign countries in Russia’s most popular news source, Lenta.ru. The conclusion of the report frames insight into the Russian vision of the world specifically through the lens of the number of mentions of foreign countries in the categories of World, Economy, Domestic, and Internet. Their conclusion particularly notes that the main foreign country mentioned was the US and demonstrates that in the Internet category, the US was the only foreign country mentioned over 10 times, which they note could be an interesting area of further research. The results and figures support the author’s conclusions well. I appreciate the demonstration of workflow in finding the results as shown by the bar plots reporting publication frequency and mentions of foreign countries. I also find the maps to be a very effective way of visualizing foreign country mentions. The one thing that I think might be helpful in this area would be to provide some sort of figure to show how the size of the node corresponds to the number of mentions, as it would have been interesting to see the differences in mentions between the different categories.
2. Discuss the foundations of data visualizations relative to the figures presented in the report.
The foundations of data visualization are Verifications (checking for errors and the source of data), Dimensions (the variables shown, variable types, numbers of dimensions), Aesthetics (the scale of the plots, coloring, and size of elements), and Interpretations and Intentions (intended meaning, target audience, and purpose). The author of this report does a great job working with each of these foundations. In terms of verification, it seems they have done their research well to select the most popular source of Russian media. For dimensions, the author uses text data, tokenizes the headlines to select country names, and looks at the topic of the articles and the number of mentions of each country. Overall, the bar plots contain 3 to 4 dimensions, with increased dimensions coming from the use of facet_wrap and color to add categories of articles, months, and years. In terms of aesthetics, the author primarily uses color and size to add additional information to plots. The interpretations and intentions of the author are clearly stated in the introduction and conclusion. The majority of the plots are presented very objectively, and areas where the author did additional wrangling to change aspects of how the data is presented (as with the size of the nodes in the maps) are transparently stated.
3. State 3 things that are strong about their report and 2 things that can be improved.
Strong: 1. The written component of this report is quite strong. The author makes it very clear what they did in each part of the report such that I believe it would be very possible to reproduce their methods. 2. The data wrangling of this project is well done. I really appreciated the incorporation of the scrolling table that allows the reader to see all the hard work going on behind the scenes of the visualization. 3. I appreciate the flow of the visualizations. Starting with the bar plots to generally visualize the data before providing specific insights into Russian media through the maps was something I thought was very effective.

To be improved: 1. I mentioned this above, but I think the maps could have really benefitted from a legend to allow the reader to easily see how the countries mentioned change between the categories as well as how the number of mentions changes. 2. In the figure that shows the “Countries with the mention in Lenta.Ru by Topic,” it would have been nice to have a subtitle for each topic giving the translation into English. This would be beneficial as it would be easier to see how this bar plot relates to the maps later on in the report.

References

Countries.csv. (2012). Dataset Publishing Language. https://developers.google.com/public-data/docs/canonical/countries_csv

Gabos, S. (2022). World countries. https://stefangabos.github.io/world_countries/

Konrad, M. (2018). Three ways of visualizing a graph on a map. https://datascience.blog.wzb.eu/2018/05/31/three-ways-of-visualizing-a-graph-on-a-map/

News dataset from Lenta.ru. (2020). DMITRYYUTKIN. https://www.kaggle.com/datasets/yutkin/corpus-of-russian-news-articles-from-lenta

Russian stopwords set. (n.d.). https://countwordsfree.com/stopwords/russian
