Foreign Views in Lenta.ru News Articles

Miles Sanford (Data Science at Reed College) https://reed-statistics.github.io/math241-spring2022/
May 4, 2022

Introduction

For this project I study a dataset from Kaggle that contains every news article published by the news source Lenta.ru (News Dataset from Lenta.ru, 2020). Lenta.ru is the most popular Russian-language news source, which makes the mentions of foreign countries in the text of its articles interesting to analyze. I hope that this data, once extracted and visualized, will offer insight into how the world is viewed from a Russian perspective.

Methods and Results

In this section, I explain the data wrangling and the production of the network visualizations of the Lenta.ru dataset (News Dataset from Lenta.ru, 2020).

CSV Files

The full dataset is 1.9 GB, so to conserve memory I drew a random sample from it, saved the sample, and used that saved file for the rest of the analysis. The seed and the code for the original sampling are provided below, commented out. The sampled dataset is loaded from the CSV file news2.csv. I recommend downloading the full dataset if you wish to increase the sample size. I also found a list of Russian stopwords to filter out common Russian prepositions and conjunctions (Russian Stopwords Set, n.d.). To create my spatial dataset later in the process, I used a dataset that contains the countries of the world with their latitude and longitude (Countries.csv, 2012). I later join this dataset with another that has each country’s name in Russian (Gabos, 2022).

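#packages used throughout this analysis
#(assumed here, since the original setup chunk is not shown)
library(tidyverse)
library(tidytext)
library(kableExtra)
library(ggthemes)

#making a vector with the topics I will analyze
topics <- c("Россия", "Мир", "Экономика", "Интернет и СМИ")
#set seed for my dataset.
#set.seed(1002)
#loading the Lenta.ru dataset using the entire dataset
  #headlines <- read.csv("news.csv") %>%
  
  #sampling a fraction of the dataset (sample_frac(0.1) keeps 10% of the rows)
   #sample_frac(0.1) %>%
  
  #selecting these specific columns
    #select(text,topic,date) %>%
  
  #only taking in the topics I plan to analyze
    #filter(topic %in% topics) %>%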

  # group by topic for sampling later
    #group_by(topic) %>%
    #ungroup()
#loading the randomly sampled dataset for memory conservation
headlines <- read.csv("news2.csv")
# List of Russian Stopwords
banned_words <- read.csv("stopwords.csv")

#creating a vector in order to filter text data through
banned <- banned_words$word

#list of countries in Russian language
countries_set <- read.csv("countries.csv")

#creating a vector to filter text data through for countries
countries <- countries_set$name %>% 
  tolower()
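#loading a dataset that has all of the geographic coordinates for each country
locations2 <- read.csv("countriesloc.csv", 
                      col.names = c("abbv", 
                                    "english", 
                                    "Latitude", 
                                    "Longitude" ))
#adding a lowercase copy of the country code; the new column is literally
#named `tolower(abbv)` and is used as a join key later on
locations <- locations2 %>%
  mutate(tolower(abbv)) 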
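
The commented-out sampling step could be run once and its result written to disk, which is presumably how a file like news2.csv comes about. The following is a minimal sketch under that assumption. The helper name sample_articles, the toy data frame, and the output file name are illustrative and not part of the original analysis. It uses slice_sample(), the current dplyr replacement for sample_frac(), and unlike the commented-out code it filters by topic before sampling.

```r
library(dplyr)

# Hypothetical helper: draw a reproducible random fraction of the articles
# that belong to the chosen topics.
sample_articles <- function(df, topics, frac = 0.1, seed = 1002) {
  set.seed(seed)  # same seed, same sample on every run
  df %>%
    filter(topic %in% topics) %>%
    select(text, topic, date) %>%
    slice_sample(prop = frac)
}

# Toy stand-in for the full news.csv (assumed columns: text, topic, date)
toy <- data.frame(
  text  = paste("article", 1:100),
  topic = rep(c("Мир", "Спорт"), each = 50),
  date  = "2010/01/01"
)

sampled <- sample_articles(toy, topics = "Мир", frac = 0.1)
nrow(sampled)

# Saving the sample once means later sessions never load the full file
out_file <- file.path(tempdir(), "news_sample.csv")
write.csv(sampled, out_file, row.names = FALSE)
```

Because the seed is set inside the helper, calling it twice on the same input returns the same rows.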

Initial Data Exploration

For my initial exploration of this dataset, I analyzed the number and topic of articles by the year and month in which they were published. I first restructured the date so that I could map aesthetics onto year and month separately. I did this using the mutate function, creating substrings of the original date.

headlines_sub_year <- headlines %>%
  #selecting only topic and date to reduce size 
    select(topic, date) %>%
  #creating a substring of the year by selecting the first 4 characters
  #(the offsets assume 10-character dates such as "2010/01/31")
    mutate(year = str_sub(date, 1, nchar(date)-6)) %>%
  
  #creating a substring of the month by selecting only the month characters
    mutate(month = str_sub(date, 6, nchar(date)-3)) %>%
  
  #creating a substring of the day 
    mutate(day = str_sub(date, 9, nchar(date)))
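
To make the substring positions concrete, here is a small sketch on toy dates. It assumes the dates are 10-character strings in a YYYY/MM/DD layout, in which case fixed start and end positions give the same result as the nchar() offsets. The toy_dates data frame is illustrative only.

```r
library(dplyr)
library(stringr)

# Toy dates in the assumed "YYYY/MM/DD" layout
toy_dates <- data.frame(date = c("2008/09/15", "2014/03/01"))

toy_dates <- toy_dates %>%
  mutate(
    year  = str_sub(date, 1, 4),   # characters 1 to 4
    month = str_sub(date, 6, 7),   # characters 6 to 7
    day   = str_sub(date, 9, 10)   # characters 9 to 10
  )

toy_dates
```

The new columns are character strings, which is convenient here because ggplot2 then treats year and month as discrete categories.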

I then used this new data frame to visualize the relationship between publication date and the number of articles published.

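year <- ggplot(headlines_sub_year, aes(x = year, fill = month)) + 
  geom_bar() + 
  #note: scale_color_colorblind() (ggthemes) only affects the color aesthetic;
  #the fill mapped above keeps the default palette
  scale_color_colorblind() + 
  labs(title = "Yearly Number of Publications by Month") +
  facet_wrap(~topic) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))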



monthly <- ggplot(headlines_sub_year, aes(x = month, fill = year)) + 
  geom_bar() + 
  labs(title = "Monthly Number of Publications by Year")

year
monthly

This visualization suggests that the number of articles Lenta.ru published remained mostly steady across topic and year. I do note a spike in the ‘Economy’ topic (Экономика) in 2008. I am unsure what this means for Russian views of global developments in that year. That degree of analysis falls outside the scope of this project, but it would be a worthwhile topic for further research with this dataset, set against other developments of the same period. It is also evident that the most common topic is ‘Russia’ (Россия), which covers domestic affairs.

Wrangling Text Data

In this section, I show the process I used to wrangle the data from the text variable. Since this variable holds the entire text of each article, I use only the randomly sampled dataset from before to conserve memory. I recommend downloading the full dataset (News Dataset from Lenta.ru, 2020) if you wish to include more articles in the text wrangling. I unnested all of the words in the text variable into the word variable of the new headlines_sub dataset and counted the occurrences of each word within each topic. This prepares the dataset to be filtered by country name, which yields the number of mentions of each country. As part of the wrangling, I filtered the dataset through a list of Russian stopwords (Russian Stopwords Set, n.d.) to remove prepositions, conjunctions, and filler words. The process is the same as removing English stopwords, but uses a list specific to the Russian language. A glimpse of the result is shown in the table below.

headlines_sub<- headlines %>% 
  
  # Filtering by topic in the four topics I chose
  filter(topic %in% topics) %>%
  group_by(topic) %>% 
  unnest_tokens(word, text) %>%
 
  
  #Here I am mapping the inflected form "россии" to the base form "россия"
  #so that both are counted together
  mutate(word = case_when(word == "россии" ~ "россия", 
                         word != "россии" ~ word)) %>%
  count(word) %>%
  ungroup() %>%

  #removing stop words, such as and, the, but, etc.
  filter(!word %in% banned) %>%
  arrange(desc(n)) %>%
  na.omit()

#Creating a table using the kableExtra package
headlines_sub %>% 

  #limiting the table to only 20 words, not 200,000
  head(20) %>%
  
  #adding a caption to the table
  kbl(caption = "Glimpse at the full unnested dataset with counts 
      for each word") %>%
  
  #adding a striped pattern to the table
  kable_styling(bootstrap_options = "striped") %>%
  
  #making the dimensions of the table 
  scroll_box(width = "50%", height = "400px")
Table: Glimpse at the full unnested dataset with counts for each word
topic word n
Россия россия 5653
Мир сообщает 4807
Россия сообщает 4668
Мир сша 4505
Россия новости 3492
Россия риа 3485
Экономика процента 3365
Россия словам 3319
Экономика россия 3264
Экономика долларов 3257
Россия заявил 2991
Экономика компании 2845
Мир заявил 2599
Россия рф 2573
Россия данным 2533
Россия суд 2067
Экономика сообщает 2045
Россия москвы 2028
Мир словам 2002
Мир страны 1919
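
The pipeline above can be traced on a toy example. This is a minimal sketch: the two sentences and the three-word stopword list are made up for illustration and are not taken from the dataset. It filters stopwords before counting rather than after, which yields the same counts.

```r
library(dplyr)
library(tidytext)

# Toy articles (hypothetical text, not taken from the dataset)
toy <- data.frame(
  topic = c("Мир", "Мир"),
  text  = c("Президент США заявил о России", "В США и в России")
)

# Tiny illustrative stopword list
toy_stop <- c("о", "в", "и")

toy_counts <- toy %>%
  unnest_tokens(word, text) %>%            # one row per lower-cased word
  mutate(word = if_else(word == "россии", "россия", word)) %>%
  filter(!word %in% toy_stop) %>%          # drop stopwords
  count(topic, word, sort = TRUE)          # occurrences per topic and word

toy_counts
```

Note that unnest_tokens() lower-cases the text by default, which is why the country list is also converted to lower case before matching.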

In this step, I take the previous dataset, headlines_sub, and filter it through the list of country names from above. This leaves only the country names found in the text variable of the original headlines dataset, their topic, and the number of times each country appeared. I did not cap the number of times a country can be counted within a single article, because I find the total number of mentions more illuminating than the number of articles in which a country is mentioned. An article that mentions “США” (the United States) 20 times therefore keeps its full weight in the overall count for the United States.
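
The difference between counting mentions and counting articles can be seen in a small sketch. The article column is a hypothetical identifier added for illustration, since the wrangled data above does not carry one.

```r
library(dplyr)

# Toy tokens: one row per word occurrence, with a hypothetical article id
toy_tokens <- data.frame(
  article = c(1, 1, 1, 2, 3),
  word    = c("сша", "сша", "сша", "сша", "китай")
)

comparison <- toy_tokens %>%
  group_by(word) %>%
  summarise(
    mentions = n(),                  # every occurrence counts (approach used here)
    articles = n_distinct(article)   # each article counts at most once
  )

comparison
```

In this toy data the United States has 4 mentions spread over 2 articles, so the two measures diverge as soon as one article repeats a country name.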

#creating a dataset that holds the English names of each country
english_names <-  left_join(countries_set, locations, by  = 
                        c("alpha2" = "tolower(abbv)")) %>%
  
  #selecting only the Russian and English country names
  select(name,english) %>%
  
  #making lowercase to match the datasets I will join together
  mutate(name = tolower(name))

#Creating a dataset with Russian and English Names and count (n)
headlines_countries <- headlines_sub %>% 
  
  #filtering the unnested `text` variable through the country list
  filter(word %in% countries) %>%
  
  #Reordering by highest to lowest mention (variable n)
  mutate(word = fct_reorder(word,n)) %>%
  
  #grouping by topic
  group_by(topic) %>%
  
  #joining the English country names to this Russian dataset
  full_join(english_names, by = c("word" = "name")) %>%
  
  #removing NA
  na.omit()

#creating a table using the kableExtra package
headlines_countries %>%
  
#creating a caption for this table
kbl(caption = "Table of each country's number of mentions, n = number 
    of tokens from text wrangling, word = name of the country, 
    topic = one of: Мир, meaning 'world'; Россия, meaning 
    Russia, i.e. domestic affairs; Экономика, meaning economy; 
    and Интернет и СМИ, meaning the internet and mass media.", 
    position = "center") %>%
  
  #creating the styling of the table
  kable_styling(bootstrap_options = "striped", position = "center") %>%
  
  #making the dimensions of the table
  scroll_box(width = "70%", height = "400px")
Table: Each country’s number of mentions, n = number of tokens from text wrangling, word = name of the country, topic = one of: Мир, meaning ‘world’; Россия, meaning Russia, i.e. domestic affairs; Экономика, meaning economy; and Интернет и СМИ, meaning the internet and mass media.
topic word n english
Россия россия 5653 Russian Federation
Мир сша 4505 United States
Экономика россия 3264 Russian Federation
Экономика сша 1286 United States
Россия сша 1283 United States
Мир россия 985 Russian Federation
Интернет и СМИ россия 612 Russian Federation
Мир израиль 540 Israel
Интернет и СМИ сша 416 United States
Мир кндр 375 North Korea
Мир ирак 348 Iraq
Мир иран 284 Iran, Islamic Republic of
Экономика украина 174 Ukraine
Мир китай 125 China
Мир великобритания 108 United Kingdom
Экономика китай 105 China
Мир франция 94 France
Мир сомали 83 Somalia
Мир пакистан 73 Pakistan
Россия грузия 71 Georgia
Мир сирия 63 Syrian Arab Republic
Мир украина 62 Ukraine
Мир германия 59 Germany
Мир чили 56 Chile
Мир афганистан 55 Afghanistan
Россия китай 55 China
Мир япония 52 Japan
Мир ливан 48 Lebanon
Мир египет 47 Egypt
Мир юар 47 South Africa
Экономика иран 45 Iran, Islamic Republic of
Россия украина 43 Ukraine
Мир гаити 41 Haiti
Россия кндр 41 North Korea
Экономика белоруссия 41 Belarus
Экономика япония 35 Japan
Мир индия 34 India
Мир испания 33 Spain
Россия великобритания 33 United Kingdom
Мир перу 32 Peru
Мир турция 32 Turkey
Экономика франция 32 France
Мир зимбабве 31 Zimbabwe
Мир италия 31 Italy
Россия израиль 30 Israel
Россия ирак 30 Iraq
Экономика германия 29 Germany
Мир никарагуа 28 Nicaragua
Мир оаэ 28 United Arab Emirates
Экономика великобритания 28 United Kingdom
Россия афганистан 27 Afghanistan
Мир колумбия 26 Colombia
Мир ливия 26 Libyan Arab Jamahiriya
Мир австралия 24 Australia
Мир бангладеш 24 Bangladesh
Мир канада 24 Canada
Мир сербия 24 Serbia
Россия сомали 24 Somalia
Россия франция 24 France
Экономика венесуэла 24 Venezuela
Экономика польша 24 Poland
Мир марокко 23 Morocco
Россия германия 23 Germany
Экономика зимбабве 23 Zimbabwe
Экономика казахстан 23 Kazakhstan
Мир фиджи 21 Fiji
Россия иран 21 Iran, Islamic Republic of
Россия япония 21 Japan
Мир грузия 20 Georgia
Россия марокко 20 Morocco
Экономика ирак 20 Iraq
Мир кувейт 19 Kuwait
Мир польша 19 Poland
Россия пакистан 19 Pakistan
Экономика азербайджан 19 Azerbaijan
Экономика грузия 19 Georgia
Мир венесуэла 18 Venezuela
Мир куба 17 Cuba
Россия азербайджан 17 Azerbaijan
Мир бельгия 16 Belgium
Россия египет 16 Egypt
Россия казахстан 16 Kazakhstan
Экономика индия 16 India
Мир кипр 15 Cyprus
Экономика латвия 15 Latvia
Экономика туркмения 15 Turkmenistan
Россия перу 14 Peru
Мир белоруссия 13 Belarus
Мир индонезия 13 Indonesia
Мир филиппины 13 Philippines
Россия польша 13 Poland
Экономика оаэ 13 United Arab Emirates
Экономика турция 13 Turkey
Экономика узбекистан 13 Uzbekistan
Россия оаэ 12 United Arab Emirates
Россия таиланд 12 Thailand
Экономика италия 12 Italy
Мир болгария 11 Bulgaria
Мир дания 11 Denmark
Мир литва 11 Lithuania
Мир румыния 11 Romania
Мир судан 11 Sudan
Россия кипр 11 Cyprus
Россия турция 11 Turkey
Экономика венгрия 11 Hungary
Экономика канада 11 Canada
Экономика литва 11 Lithuania
Экономика мексика 11 Mexico
Экономика молдавия 11 Moldova, Republic of
Экономика норвегия 11 Norway
Экономика сингапур 11 Singapore
Экономика юар 11 South Africa
Мир австрия 10 Austria
Мир нидерланды 10 Netherlands
Мир словакия 10 Slovakia
Россия таджикистан 10 Tajikistan
Россия юар 10 South Africa
Экономика австрия 10 Austria
Экономика катар 10 Qatar
Интернет и СМИ израиль 9 Israel
Интернет и СМИ украина 9 Ukraine
Мир молдавия 9 Moldova, Republic of
Мир словения 9 Slovenia
Россия белоруссия 9 Belarus
Россия зимбабве 9 Zimbabwe
Россия италия 9 Italy
Россия узбекистан 9 Uzbekistan
Экономика алжир 9 Algeria
Экономика швейцария 9 Switzerland
Экономика швеция 9 Sweden
Интернет и СМИ великобритания 8 United Kingdom
Интернет и СМИ индия 8 India
Мир алжир 8 Algeria
Мир вьетнам 8 Vietnam
Мир греция 8 Greece
Мир иордания 8 Jordan
Мир катар 8 Qatar
Мир узбекистан 8 Uzbekistan
Мир чехия 8 Czech Republic
Мир швейцария 8 Switzerland
Мир эквадор 8 Ecuador
Россия литва 8 Lithuania
Экономика нигерия 8 Nigeria
Экономика нидерланды 8 Netherlands
Интернет и СМИ ирак 7 Iraq
Интернет и СМИ китай 7 China
Мир азербайджан 7 Azerbaijan
Мир албания 7 Albania
Мир ирландия 7 Ireland
Мир казахстан 7 Kazakhstan
Мир нигерия 7 Nigeria
Мир таджикистан 7 Tajikistan
Мир черногория 7 Montenegro
Мир швеция 7 Sweden
Россия испания 7 Spain
Экономика австралия 7 Australia
Экономика болгария 7 Bulgaria
Экономика бразилия 7 Brazil
Экономика греция 7 Greece
Экономика израиль 7 Israel
Экономика индонезия 7 Indonesia
Экономика оман 7 Oman
Экономика тонга 7 Tonga
Мир армения 6 Armenia
Мир бахрейн 6 Bahrain
Мир венгрия 6 Hungary
Мир науру 6 Nauru
Мир нигер 6 Niger
Мир норвегия 6 Norway
Мир таиланд 6 Thailand
Мир финляндия 6 Finland
Мир эфиопия 6 Ethiopia
Россия бельгия 6 Belgium
Россия киргизия 6 Kyrgyzstan
Россия норвегия 6 Norway
Россия судан 6 Sudan
Экономика испания 6 Spain
Экономика кндр 6 North Korea
Экономика лаос 6 Lao People’s Democratic Republic
Экономика пакистан 6 Pakistan
Интернет и СМИ германия 5 Germany
Интернет и СМИ египет 5 Egypt
Интернет и СМИ иран 5 Iran, Islamic Republic of
Интернет и СМИ испания 5 Spain
Интернет и СМИ кндр 5 North Korea
Мир бразилия 5 Brazil
Мир люксембург 5 Luxembourg
Мир мали 5 Mali
Мир эстония 5 Estonia
Россия армения 5 Armenia
Россия дания 5 Denmark
Россия индия 5 India
Россия индонезия 5 Indonesia
Россия канада 5 Canada
Россия колумбия 5 Colombia
Россия швейцария 5 Switzerland
Экономика албания 5 Albania
Экономика армения 5 Armenia
Экономика бангладеш 5 Bangladesh
Экономика дания 5 Denmark
Экономика египет 5 Egypt
Экономика науру 5 Nauru
Экономика словения 5 Slovenia
Экономика таджикистан 5 Tajikistan
Экономика финляндия 5 Finland
Экономика эстония 5 Estonia
Интернет и СМИ азербайджан 4 Azerbaijan
Интернет и СМИ грузия 4 Georgia
Интернет и СМИ франция 4 France
Интернет и СМИ швейцария 4 Switzerland
Мир мексика 4 Mexico
Мир хорватия 4 Croatia
Мир чад 4 Chad
Россия бангладеш 4 Bangladesh
Россия бразилия 4 Brazil
Россия венесуэла 4 Venezuela
Россия греция 4 Greece
Россия латвия 4 Latvia
Россия ливия 4 Libyan Arab Jamahiriya
Россия финляндия 4 Finland
Россия чехия 4 Czech Republic
Экономика бельгия 4 Belgium
Экономика боливия 4 Bolivia
Экономика ирландия 4 Ireland
Экономика кипр 4 Cyprus
Экономика колумбия 4 Colombia
Экономика ливия 4 Libyan Arab Jamahiriya
Экономика малайзия 4 Malaysia
Экономика нигер 4 Niger
Экономика перу 4 Peru
Экономика португалия 4 Portugal
Экономика филиппины 4 Philippines
Экономика чехия 4 Czech Republic
Интернет и СМИ белоруссия 3 Belarus
Интернет и СМИ бразилия 3 Brazil
Интернет и СМИ вьетнам 3 Vietnam
Интернет и СМИ канада 3 Canada
Интернет и СМИ колумбия 3 Colombia
Интернет и СМИ пакистан 3 Pakistan
Интернет и СМИ сингапур 3 Singapore
Интернет и СМИ япония 3 Japan
Мир бурунди 3 Burundi
Мир гвинея 3 Guinea
Мир гондурас 3 Honduras
Мир йемен 3 Yemen
Мир киргизия 3 Kyrgyzstan
Мир латвия 3 Latvia
Мир малави 3 Malawi
Мир малайзия 3 Malaysia
Мир монако 3 Monaco
Мир монголия 3 Mongolia
Мир португалия 3 Portugal
Мир сингапур 3 Singapore
Мир тунис 3 Tunisia
Мир туркмения 3 Turkmenistan
Мир цар 3 Central African Republic
Мир эритрея 3 Eritrea
Россия австрия 3 Austria
Россия алжир 3 Algeria
Россия болгария 3 Bulgaria
Россия гаити 3 Haiti
Россия ирландия 3 Ireland
Россия нидерланды 3 Netherlands
Россия никарагуа 3 Nicaragua
Россия парагвай 3 Paraguay
Россия сербия 3 Serbia
Россия сирия 3 Syrian Arab Republic
Россия словакия 3 Slovakia
Россия филиппины 3 Philippines
Россия чили 3 Chile
Россия эстония 3 Estonia
Экономика бруней 3 Brunei Darussalam
Экономика вьетнам 3 Vietnam
Экономика исландия 3 Iceland
Экономика киргизия 3 Kyrgyzstan
Экономика люксембург 3 Luxembourg
Экономика монако 3 Monaco
Экономика монголия 3 Mongolia
Экономика таиланд 3 Thailand
Интернет и СМИ афганистан 2 Afghanistan
Интернет и СМИ казахстан 2 Kazakhstan
Интернет и СМИ норвегия 2 Norway
Интернет и СМИ оаэ 2 United Arab Emirates
Интернет и СМИ узбекистан 2 Uzbekistan
Интернет и СМИ швеция 2 Sweden
Интернет и СМИ юар 2 South Africa
Мир ботсвана 2 Botswana
Мир бруней 2 Brunei Darussalam
Мир вануату 2 Vanuatu
Мир доминика 2 Dominica
Мир исландия 2 Iceland
Мир кения 2 Kenya
Мир лаос 2 Lao People’s Democratic Republic
Мир мальта 2 Malta
Мир мьянма 2 Myanmar
Мир непал 2 Nepal
Мир оман 2 Oman
Мир панама 2 Panama
Мир сальвадор 2 El Salvador
Россия аргентина 2 Argentina
Россия бурунди 2 Burundi
Россия венгрия 2 Hungary
Россия вьетнам 2 Vietnam
Россия гана 2 Ghana
Россия доминика 2 Dominica
Россия йемен 2 Yemen
Россия иордания 2 Jordan
Россия ливан 2 Lebanon
Россия люксембург 2 Luxembourg
Россия малайзия 2 Malaysia
Россия мали 2 Mali
Россия мексика 2 Mexico
Россия молдавия 2 Moldova, Republic of
Россия нигер 2 Niger
Россия нигерия 2 Nigeria
Россия румыния 2 Romania
Россия хорватия 2 Croatia
Россия чад 2 Chad
Экономика ангола 2 Angola
Экономика бурунди 2 Burundi
Экономика гайана 2 Guyana
Экономика гондурас 2 Honduras
Экономика куба 2 Cuba
Экономика кувейт 2 Kuwait
Экономика мьянма 2 Myanmar
Экономика непал 2 Nepal
Экономика никарагуа 2 Nicaragua
Экономика румыния 2 Romania
Экономика хорватия 2 Croatia
Экономика чад 2 Chad
Экономика чили 2 Chile
Экономика эквадор 2 Ecuador
Интернет и СМИ австралия 1 Australia
Интернет и СМИ белиз 1 Belize
Интернет и СМИ болгария 1 Bulgaria
Интернет и СМИ боливия 1 Bolivia
Интернет и СМИ иордания 1 Jordan
Интернет и СМИ италия 1 Italy
Интернет и СМИ катар 1 Qatar
Интернет и СМИ куба 1 Cuba
Интернет и СМИ ливан 1 Lebanon
Интернет и СМИ ливия 1 Libyan Arab Jamahiriya
Интернет и СМИ марокко 1 Morocco
Интернет и СМИ мексика 1 Mexico
Интернет и СМИ молдавия 1 Moldova, Republic of
Интернет и СМИ непал 1 Nepal
Интернет и СМИ нидерланды 1 Netherlands
Интернет и СМИ парагвай 1 Paraguay
Интернет и СМИ перу 1 Peru
Интернет и СМИ польша 1 Poland
Интернет и СМИ таджикистан 1 Tajikistan
Интернет и СМИ таиланд 1 Thailand
Интернет и СМИ филиппины 1 Philippines
Интернет и СМИ чад 1 Chad
Мир аргентина 1 Argentina
Мир барбадос 1 Barbados
Мир гана 1 Ghana
Мир гренада 1 Grenada
Мир джибути 1 Djibouti
Мир либерия 1 Liberia
Мир лихтенштейн 1 Liechtenstein
Мир мавритания 1 Mauritania
Мир мадагаскар 1 Madagascar
Мир мальдивы 1 Maldives
Мир микронезия 1 Micronesia, Federated States of
Мир танзания 1 Tanzania, United Republic of
Мир тонга 1 Tonga
Мир тувалу 1 Tuvalu
Мир уругвай 1 Uruguay
Россия австралия 1 Australia
Россия гамбия 1 Gambia
Россия исландия 1 Iceland
Россия катар 1 Qatar
Россия кения 1 Kenya
Россия маврикий 1 Mauritius
Россия мавритания 1 Mauritania
Россия мадагаскар 1 Madagascar
Россия монако 1 Monaco
Россия монголия 1 Mongolia
Россия непал 1 Nepal
Россия португалия 1 Portugal
Россия сенегал 1 Senegal
Россия сингапур 1 Singapore
Россия туркмения 1 Turkmenistan
Россия уругвай 1 Uruguay
Россия швеция 1 Sweden
Россия эквадор 1 Ecuador
Экономика андорра 1 Andorra
Экономика аргентина 1 Argentina
Экономика афганистан 1 Afghanistan
Экономика бутан 1 Bhutan
Экономика вануату 1 Vanuatu
Экономика гаити 1 Haiti
Экономика гамбия 1 Gambia
Экономика гватемала 1 Guatemala
Экономика гвинея 1 Guinea
Экономика гренада 1 Grenada
Экономика доминика 1 Dominica
Экономика иордания 1 Jordan
Экономика камбоджа 1 Cambodia
Экономика камерун 1 Cameroon
Экономика кения 1 Kenya
Экономика лесото 1 Lesotho
Экономика либерия 1 Liberia
Экономика ливан 1 Lebanon
Экономика лихтенштейн 1 Liechtenstein
Экономика мальдивы 1 Maldives
Экономика мальта 1 Malta
Экономика руанда 1 Rwanda
Экономика сальвадор 1 El Salvador
Экономика сенегал 1 Senegal
Экономика сербия 1 Serbia
Экономика сирия 1 Syrian Arab Republic
Экономика словакия 1 Slovakia
Экономика тунис 1 Tunisia
Экономика эфиопия 1 Ethiopia

For these graphs, I show how each country’s mentions are distributed across article topics. I kept only topic and country pairs with more than 50 mentions (n > 50) so that the visualization is not cluttered with countries that are mentioned only a handful of times. This filter homes in on the countries mentioned most often in the Lenta.ru articles that I analyzed.

#creating a plot of the top mentioned countries
country_plot2 <- ggplot(headlines_countries %>%
                          filter(n > 50), aes(x = english, y= n)) + 
  
  #specifying the type of visualization
  geom_col() +
  
  #placing the country names on the Y-axis by flipping x and y axes
  coord_flip() + 
  
  #adding labels and captions to the graph
  #(no fill label here, since this plot does not map fill)
  labs(title = "Countries with highest mention in Lenta.Ru",
       x = "Country",
       y = "Number of instances in Articles",
       caption = "Raw count of Countries' mentions in all articles")
#creating a plot that shows the distribution of topic by country
country_plot3 <- ggplot(headlines_countries %>%
                          filter(n > 50), 
                        aes(x = english, y= n, fill = topic)) + 
  
  #setting the position to 'fill' to visualize the share of each country's
  #mentions that falls in each topic
  geom_col(position = "fill") +
  
  #flipping x and y axes
  coord_flip() + 
  
  #Placing the legend at the bottom of the graph
  theme(legend.position = "bottom") +
  
  #adding labels and captions
  #(with position = "fill" the bars show shares, not raw counts)
  labs(title = "Countries with highest mention in Lenta.Ru by Topic",
       x = "Country",
       y = "Share of instances in Articles",
       fill = "Topic",
       caption = "Distribution of countries with a count higher than 50, 
       (n > 50), by topic of article")

#printing out both graphs
country_plot2
country_plot3
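
With position = "fill", each bar is rescaled so that its segments sum to 1. The sketch below reproduces that calculation by hand for China, using the three rows from the table above that pass the n > 50 filter. The data frame names are illustrative.

```r
library(dplyr)

# Counts for China taken from the table above (rows with n > 50 only)
china <- data.frame(
  english = "China",
  topic   = c("Мир", "Экономика", "Россия"),
  n       = c(125, 105, 55)
)

china_share <- china %>%
  group_by(english) %>%
  mutate(share = n / sum(n)) %>%   # the proportion that position = "fill" draws
  ungroup()

china_share
```

The shares are computed only over the rows that survive the filter, so a topic with 50 or fewer mentions for a country does not appear in that country's bar at all.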

Map Making

In this section, I detail how I created my map graphics and joined my spatial data with my country data. To begin, I joined the locations and country datasets from the CSV section above to create a dataset that has each country’s name in Russian and English along with its latitude and longitude. These variables are called Latitude and Longitude; they are renamed lat and lon when the nodes are built. I then joined this dataset with the country counts. This produced the dataset spatial, which holds the count per country and topic (variable n), the Russian name of the country (variable name), the English name of the country (variable english), and the latitude and longitude of each country. Along the way I removed the duplicate column of English country names. Later in this section, I clean up the dataset further and remove unnecessary columns.

#making a temporary dataset to help join my location data to my count total
spatial2 <- left_join(countries_set, locations, by  = 
                        c("alpha2" = "tolower(abbv)")) %>%
  
  #this is used to match the country names in the 'headlines_countries' set
  mutate(tolower(name)) 

#here I am making the final dataset with country names, count, lat, and long
spatial <- left_join(headlines_countries, spatial2, by = 
                       c("word" = "tolower(name)")) %>%
  
  #I am removing this duplicate column
  select(!`english.y`) %>%
  
  #I am renaming this for simplicity 
  rename("english" = "english.x") %>%
  
  #I am reordering my count
  mutate(name = fct_reorder(name,n)) %>%
  
  #I am assigning ID values to each country to use to create the nodes 
  #and edges; the id is the position of the country in the factor levels
  mutate(id = as.numeric(name)) 

#making a variable that holds the id value of Russia
#this will be used to create the edges and nodes coming from Russia
#(relies on Russia being the first row, i.e. the highest count)
russia_id <- spatial[1,]$id
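
The ids come from the factor created by fct_reorder(). As a minimal sketch of how that works, the example below uses three counts from the ‘Мир’ rows of the table above. The variable names are illustrative.

```r
library(forcats)

# Toy country names and counts (counts taken from the 'Мир' rows above)
name <- c("сша", "израиль", "кндр")
n    <- c(4505, 540, 375)

# fct_reorder() orders the levels by n, smallest first
name_f <- fct_reorder(name, n)
levels(name_f)
# "кндр" "израиль" "сша"

# as.numeric() on a factor returns the position of each value in levels()
id <- as.numeric(name_f)
id
# 3 2 1
```

Because the id depends only on the factor levels, the same country receives the same id in every topic.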

In this code segment, I separate the data into topic-specific datasets, because further in the process the edges would not allow for multiple values of the topic variable. I therefore made 4 separate datasets and 4 maps, one per topic. I also kept only countries mentioned more than 10 times within each topic, to reduce clutter and home in on the countries mentioned most often in Lenta.ru articles.

#These datasets are to filter by topic and have a dataset for each
world <- spatial %>% 
  filter(topic == "Мир") %>%
  filter(n > 10 ) 

russia <- spatial %>%
  filter(topic == "Россия")%>%
  filter(n > 10 )

economy <- spatial %>%
  filter(topic == "Экономика")%>%
  filter(n > 10 )

internet <- spatial %>%
  filter(topic == "Интернет и СМИ") %>%
  filter(n > 10)
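
The four filters follow one pattern, so the same result could be obtained in a single step. The following is a sketch of that alternative, not the approach used above. The toy_spatial data frame stands in for spatial, with counts taken from the country table. For the real spatial dataset, which is grouped by topic, an ungroup() before split() may be needed.

```r
library(dplyr)

# Toy stand-in for `spatial` (assumed columns: topic, english, n)
toy_spatial <- data.frame(
  topic   = c("Мир", "Мир", "Экономика", "Экономика"),
  english = c("United States", "Fiji", "Ukraine", "Tonga"),
  n       = c(4505, 21, 174, 7)
)

# Apply the threshold once, then split into one data frame per topic
by_topic <- toy_spatial %>%
  filter(n > 10) %>%
  split(.$topic)

names(by_topic)
by_topic[["Мир"]]
```

Each element of by_topic can then be used wherever world, russia, economy, or internet is used.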

In this segment, I make a theme that can be applied to all 4 maps, so that I do not have to repeat the code for each one and can edit the theme in one place to affect all the maps at once. I also define the layer that draws the shapes of all the countries on the map. I go into further depth on the different layers of my map visualizations in the next section.

#Here I am creating a custom theme for the following maps to all share
#The lines that are commented out are options I chose not to use

#saving it as a whole theme
maptheme <- #theme(panel.grid = element_blank()) +
  
  #removes the axis text 
  theme(axis.text = element_blank()) +
  
  #removes the axis ticks on the map
  theme(axis.ticks = element_blank()) +
  
  #removes the titles from the axis
  theme(axis.title = element_blank()) +
  
  #Change the font/ size of text in Map
  #theme(text = element_text(size = 12, family = "Times")) +
  
  #sets the position of the legend
  theme(legend.position = "bottom") +
  
  #Controls the background of the map
  theme(panel.background = element_rect(fill = "deepskyblue1")) +
  
  theme(plot.margin = unit(c(0, 0, 0.5, 0), 'cm'))

#This is the theme for the color, and shapes of the countries overlay
country_shapes <- geom_polygon(aes(x = long, y = lat, group = group),
                               data = map_data('world'),
                               
                               #color of countries, and their borders
                               fill = "seagreen2", color = "black",
                               
                               #size/ thickness of the borders
                               size = 0.05)

#Sets the framing of the map to these specific coords
#I am not using this 
mapcoords <- coord_fixed(xlim = c(-150, 180), ylim = c(-55, 80))

Visualizations

For this segment of my analysis, I used the methods in Konrad (2018) to create my network overlays. The visualizations are built by overlaying layers on top of each other. The first step was creating the node and edge data. The nodes, shown in the nodes_world dataset, were simple to create: I took the dataset filtered on the topic variable and selected the necessary columns. For the edges, I used the id that I assigned to each country to make sure that Russia is always the from end of the edge, and I set the to variable to the id of each country. I then joined these together to get the latitude and longitude of both the from country, which is always Russia, and the to country. Finally, I filtered Russia out of this dataset, because for Russia the to and from variables were the same, which created problems for geom_curve().

The process to make each map is the same; only the topic-specific dataset changes. I used the nodes as the base dataset, then added the country_shapes layer from the previous section, which draws the green shapes of the countries of the world. The next layer was the curves of the edges. For this I used geom_curve() and assigned the start and end of each curve to the longitude and latitude of the beginning and ending nodes. The next layer was the points themselves, drawn with geom_point(). I sized the points by the number of mentions, n, passed through log10 to keep the sizes proportionate to each other. Originally, without this step, the United States and Russian Federation points were so large they covered the entire map. I then used facet_wrap() to create the header that displays the topic name in Russian. This was only for aesthetic purposes, as there is only one topic in each dataset. Finally, I added the map theme from the previous section so that all maps share the same base theme. I changed the colors for each map to better distinguish the topics.
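
The effect of the log10 step can be checked with three counts from the ‘Мир’ rows of the country table. This is only an illustration of the scaling; the vector names are made up.

```r
# Mention counts taken from the country table above ('Мир' topic)
n <- c(usa = 4505, israel = 540, denmark = 11)

# Raw counts: the largest point would be about 410 times the smallest
n / min(n)

# log10 compresses the range to roughly 1 to 3.7
round(log10(n), 2)
```

On the raw scale the United States point would dwarf every other point, while on the log10 scale the largest size is only about three and a half times the smallest.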

#Creating the nodes from the specific topic dataset
nodes_world <- world %>%
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)

#trying to make the nodes all start from Russia's id
edges_world <-  nodes_world %>%
  mutate(from = russia_id) %>%
  mutate(to = id) 

#creating the edges by joining the nodes and edges datasets
edges_for_world <- edges_world %>%
  inner_join(nodes_world %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_world %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  filter(!name =="Россия") 

#beginning the ggplot pipeline for the maps
world_plot <-ggplot(nodes_world) + 
  
  #overlay of country shapes from segment above
  country_shapes +
  
  #adding the edges 
  geom_curve(aes(x = `lon.y`, 
                 y = `lat.y`, 
                 xend = lon, 
                 yend = lat),
             data = edges_for_world,
             curvature = 0.1,
             alpha = 0.5, 
             color = "darkorange1") +
  
  scale_size_continuous(guide = "none") +
  
  #adding the country points to the map
  geom_point(aes(x = lon, y = lat),                         
             shape = 21, 
             size = log10(world$n), 
             fill = 'darkorange2',
             color = 'black', stroke = 0.5)  +
  
  #title
  labs(title = "Mentions of Foreign countries in 'World' topic") + 
  
  #adding a facet wrap only for aesthetic reasons 
  facet_wrap(~topic) +
  
  #adding the theme from above
  maptheme

#plotting the graph
world_plot
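
Since the edge construction is repeated for every topic, it could be wrapped in a small function. The sketch below is one possible way to do that and is not the code used for the maps here. The helper name make_edges, its output column names, and the toy coordinates are all hypothetical.

```r
library(dplyr)

# Hypothetical helper: build Russia-to-country edges from a nodes table.
# Assumes `nodes` has columns id, lon, lat and contains a row for Russia.
make_edges <- function(nodes, russia_id) {
  origin <- nodes %>%
    filter(id == russia_id) %>%
    slice(1)                          # Russia's coordinates

  nodes %>%
    filter(id != russia_id) %>%       # no edge from Russia to itself
    transmute(from = russia_id, to = id,
              x = origin$lon, y = origin$lat,   # start of the curve
              xend = lon, yend = lat)           # end of the curve
}

# Toy nodes with approximate, illustrative coordinates
toy_nodes <- data.frame(
  id  = c(1, 2, 3),
  lon = c(100, -97, 35),
  lat = c(60, 38, 31)
)

edges <- make_edges(toy_nodes, russia_id = 1)
edges
```

The result could be passed to geom_curve() with aes(x = x, y = y, xend = xend, yend = yend), which avoids the suffixed lon.y and lat.y columns produced by the two joins.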

#Creating the nodes from the specific topic dataset
nodes_economy <- economy %>%
  
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)

#making every edge start from Russia's id
edges_economy <-  nodes_economy %>%
  mutate(from = russia_id) %>%
  mutate(to = id) 

#creating the edges from the nodes and edges datasets by joining them
edges_for_economy <- edges_economy %>%
  inner_join(nodes_economy %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_economy %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  filter(!name =="Россия") %>%
  unique()

#beginning of ggplot pipeline
economy_plot <-ggplot(nodes_economy) + 
  
  #Loading country layer with theme settings from above
  country_shapes +
  
  #overlay of the edges 
  geom_curve(aes(x = `lon.y`, 
                 y = `lat.y`, 
                 xend = lon, 
                 yend = lat),
             data = edges_for_economy,
             curvature = 0.2,
             alpha = 0.5, 
             color = "violetred1") +
  
  scale_size_continuous(guide = "none") +
  
  #overlay of the points
  geom_point(aes(x = lon, y = lat),                         
             shape = 21, 
             size = log10(economy$n), 
             fill = 'red',
             color = 'black', stroke = 0.5)  +
  
  labs(title = "Mentions of Foreign countries in 'Economy' topic") + 
  
  #only for aesthetic reasons 
  facet_wrap(~topic) +
  
  #theme from above
  maptheme

economy_plot
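
The edge tables rely on how dplyr renames clashing columns during a join, which is why geom_curve() starts each curve at `lon.y` and `lat.y`. The sketch below reproduces the two joins on a toy node table to show where each coordinate column ends up. The ids and coordinates are made-up values for illustration only, and `toy_nodes` and `toy_edges` are hypothetical names that do not appear in the analysis.

```r
library(dplyr)

# Toy node table: ids and coordinates are illustrative assumptions
toy_nodes <- tibble(
  id   = c(1, 2, 3),
  name = c("Россия", "США", "Китай"),
  lon  = c(105.3, -95.7, 104.2),
  lat  = c(61.5, 37.1, 35.9)
)
russia_id <- 1

toy_edges <- toy_nodes %>%
  mutate(from = russia_id, to = id) %>%
  # first join: lon/lat clash with the existing columns, so dplyr renames
  # the left-hand copies to lon.x/lat.x and the joined copies to lon.y/lat.y
  inner_join(toy_nodes %>% select(id, lon, lat), by = c("from" = "id")) %>%
  # second join: no plain lon/lat columns remain, so the joined columns
  # keep their names and hold the destination coordinates
  inner_join(toy_nodes %>% select(id, lon, lat), by = c("to" = "id")) %>%
  filter(name != "Россия")

print(names(toy_edges))
print(toy_edges %>% select(name, lon.y, lat.y, lon, lat))
```

Every remaining row has Russia's coordinates in `lon.y` and `lat.y` and the destination country's coordinates in `lon` and `lat`, which matches the `x`/`y` and `xend`/`yend` assignments in the plotting code.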

#Creating the nodes from the specific topic dataset
nodes_russia <- russia %>%
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)

#making every edge start from Russia's id
edges_russia <-  nodes_russia %>%
  mutate(from = russia_id) %>%
  mutate(to = id) 

#making the edges from the nodes and edges datasets 
edges_for_russia <- edges_russia %>%
  inner_join(nodes_russia %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_russia %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  filter(!name =="Россия") %>%
  unique()

#beginning of ggplot pipeline 
domestic_plot <-ggplot(nodes_russia) + 
  
  #Loading country layer with theme settings from above
  country_shapes +
  
  #adding the edges layer, colored to match the topic
  geom_curve(aes(x = `lon.y`, 
                 y = `lat.y`, 
                 xend = lon, 
                 yend = lat),
             data = edges_for_russia,
             curvature = 0.2,
             alpha = 0.5, 
             color = "hotpink1") +
  
  scale_size_continuous(guide = "none") +
  
  #adding the point layer
  geom_point(aes(x = lon, y = lat),                         
             shape = 21, 
             size = log10(russia$n), 
             fill = 'hotpink',
             color = 'black', stroke = 0.5)  +
  
  #title of graph
  labs(title = "Mentions of Foreign countries in 'Domestic' topic") + 
  
  #only for aesthetic reasons 
  facet_wrap(~topic) +
  
  #adding the theme from above
  maptheme

domestic_plot

#Creating the nodes from the specific topic dataset
nodes_internet <- internet %>%
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)

#making every edge start from Russia's id
edges_internet <-  nodes_internet %>%
  mutate(from = russia_id) %>%
  mutate(to = id) 

#creating the edges by joining together the edges and nodes datasets
edges_for_internet <- edges_internet %>%
  inner_join(nodes_internet %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_internet %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  filter(!name =="Россия") %>%
  unique()

#beginning of ggplot pipeline
internet_plot <-ggplot(nodes_internet) + 
  
  #Loading country layer with theme settings from above
  country_shapes +
  
  #adding edges layer
  geom_curve(aes(x = `lon.y`, 
                 y = `lat.y`, 
                 xend = lon, 
                 yend = lat),
             data = edges_for_internet,
             curvature = 0.2,
             alpha = 0.5, 
             color = "darkorchid1") +
  
  scale_size_continuous(guide = "none") +
  
  #adding point layer
  geom_point(aes(x = lon, y = lat),                         
             shape = 21, 
             size = log10(internet$n), 
             fill = 'purple',
             color = 'black', stroke = 0.5)  +
  
  labs(title = "Mentions of Foreign countries in 'Internet' topic") + 
  
  #only for aesthetic reasons 
  facet_wrap(~topic) + 
  
  #adding theme from above
  maptheme

internet_plot
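
Since the four maps differ only in the dataset, the colors, and the title, the repeated steps could be wrapped in one function. The sketch below is one possible way to do this, not the code used to produce the maps above. The function name `plot_topic_map`, its arguments, and the toy data are all hypothetical. It also differs from the original code in two ways: it looks up Russia's coordinates directly instead of joining twice, and it maps `log10(n)` inside aes() with scale_size_identity() so that each size stays tied to its own row.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: builds one topic map from a node table with the
# columns id, name, n, topic, lat, lon. `base_layers` stands in for layers
# defined elsewhere (for example a list holding country_shapes and maptheme).
plot_topic_map <- function(nodes, origin_id, edge_color, point_fill, title,
                           base_layers = NULL) {
  # coordinates of the origin node (Russia in this analysis)
  origin <- nodes %>% filter(id == origin_id)

  # one edge per remaining country, starting at the origin
  edges <- nodes %>%
    filter(id != origin_id) %>%
    mutate(lon_from = origin$lon[1], lat_from = origin$lat[1])

  ggplot(nodes) +
    base_layers +
    geom_curve(aes(x = lon_from, y = lat_from, xend = lon, yend = lat),
               data = edges, curvature = 0.2, alpha = 0.5,
               color = edge_color) +
    geom_point(aes(x = lon, y = lat, size = log10(n)),
               shape = 21, fill = point_fill, color = "black", stroke = 0.5) +
    # use the log10 values directly as point sizes, without a legend
    scale_size_identity() +
    labs(title = title) +
    facet_wrap(~topic)
}

# Toy data with made-up counts and coordinates, for illustration only
toy_topic <- tibble(
  id    = c(1, 2, 3),
  name  = c("Россия", "США", "Китай"),
  n     = c(5000, 1200, 300),
  topic = "Мир",
  lat   = c(61.5, 37.1, 35.9),
  lon   = c(105.3, -95.7, 104.2)
)

toy_plot <- plot_topic_map(toy_topic, origin_id = 1,
                           edge_color = "darkorange1",
                           point_fill = "darkorange2",
                           title = "Toy example of a topic map")
```

Under these assumptions, each of the four maps would become a single call with a different dataset, color pair, and title, which keeps the maps consistent if one layer needs to change later.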

Conclusion

The results of my visualizations and data wrangling show that the United States is, by far, the most frequently mentioned foreign country in this Lenta.ru dataset. The only country mentioned more often than the United States is Russia itself, which makes sense for a Russian news source. This lends itself to the idea that Russia is most concerned with the United States in regard to certain topics. The domestic topic is the only topic where Russia is mentioned more than the United States, which also makes sense: in the Russian view, the United States is a foreign affair, not a domestic one.

For the internet topic, in Russian “Интернет и СМИ”, it is interesting that the only country mentioned more than 10 times is the United States. I see this as an insight into the relationship between Russian and American internet culture. Clearly, there is not enough here to analyze the cultures specifically, but this analysis does show how often the United States comes up in this topic. Seeing as internet culture has grown tremendously in the past two decades, this could be an excellent point of future research into the relationship between Russian and American cultures.

For further research on this topic, I expect that exploring other Russian news agencies would be prudent and fruitful. It would be interesting to see whether the weight of each country's mentions holds consistent across other Russian news agencies. It would also be interesting to investigate news sources in other countries and how articles in multiple countries interact. This, I predict, could illuminate how countries respond to and interact with one another. The results could then be visualized with the same method.

Class Peer Reviews

References

Countries.csv. (2012). Dataset Publishing Language. https://developers.google.com/public-data/docs/canonical/countries_csv
Gabos, S. (2022). World countries. https://stefangabos.github.io/world_countries/
Konrad, M. (2018). Three ways of visualizing a graph on a map. https://datascience.blog.wzb.eu/2018/05/31/three-ways-of-visualizing-a-graph-on-a-map/
News dataset from Lenta.ru. (2020). Dmitry Yutkin. https://www.kaggle.com/datasets/yutkin/corpus-of-russian-news-articles-from-lenta
Russian stopwords set. (n.d.). https://countwordsfree.com/stopwords/russian

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".