For this project, I plan to study a dataset from Kaggle that contains
every news article from the news source Lenta.ru (News Dataset from
Lenta.ru, 2020). Lenta.ru is the most popular Russian-language news
source, which is why I find it interesting to analyze the mentions of
foreign countries in the text of each article. I hope that this data,
once extracted and visualized, will offer insight into how the world is
viewed from a Russian perspective.
In this section, I explain the data wrangling process and the production of the network visualizations for the Lenta.ru dataset (News Dataset from Lenta.ru, 2020).
To load the dataset, I first drew a random sample in order to conserve
memory, as the full dataset is 1.9 GB. I then saved this sample and used
it for the rest of the analysis. The seed and code for the original
sampling are provided below, commented out. I then loaded the sampled
dataset as a CSV file, news2.csv. I recommend downloading the full
dataset if you wish to increase the sample size. I also found a list of
Russian stopwords to filter out common Russian prepositions and
conjunctions (Russian Stopwords Set, n.d.).
To create my spatial dataset later in the process, I used a dataset that
contains every country in the world with its latitude and longitude
(Countries.csv, 2012). I then join this dataset with another dataset
that has each country’s name in Russian (Gabos, 2022).
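#loading the packages that the code in this analysis relies on
library(tidyverse)  #dplyr, stringr, forcats, ggplot2
library(tidytext)   #unnest_tokens()
library(kableExtra) #kbl(), kable_styling(), scroll_box()
library(ggthemes)   #scale_color_colorblind()
#making a vector with the topics I will analyze
topics <- c("Россия", "Мир", "Экономика", "Интернет и СМИ")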
#set seed for my dataset.
#set.seed(1002)
#loading the Lenta.ru dataset using the entire dataset
#headlines <- read.csv("news.csv") %>%
#sampling 10% of the dataset (sample_frac(0.1) keeps a fraction of 0.1)
#sample_frac(0.1) %>%
#selecting these specific columns
#select(text,topic,date) %>%
#only taking in the topics I plan to analyze
#filter(topic %in% topics) %>%
# group by topic for sampling later
#group_by(topic) %>%
#ungroup()
#loading the randomly sampled dataset for memory conservation
headlines <- read.csv("news2.csv")
# List of Russian Stopwords
banned_words <- read.csv("stopwords.csv")
#creating a vector in order to filter text data through
banned <- banned_words$word
#list of countries in the Russian language
countries_set <- read.csv("countries.csv")
#creating a vector to filter text data through for countries
countries <- countries_set$name %>%
  tolower()
#loading a dataset that has all of the geographic coordinates for each country
locations2 <- read.csv("countriesloc.csv",
                       col.names = c("abbv",
                                     "english",
                                     "Latitude",
                                     "Longitude"))
#adding a lowercase copy of the abbreviation; because the expression is
#unnamed, the new column is literally named `tolower(abbv)`, which the
#joins further down use as their key
locations <- locations2 %>%
  mutate(tolower(abbv))
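One detail of the loading code may be worth spelling out: calling mutate() with an unnamed expression creates a column named after the expression itself, which is why the later joins refer to a key called `tolower(abbv)`. The following is a minimal sketch on made-up rows; the data frame `toy_locations` and the column name `alpha2_lower` are hypothetical and are not part of the original code.

```r
library(dplyr)

# Hypothetical stand-in for the locations data; values are illustrative only
toy_locations <- data.frame(
  abbv = c("RU", "US"),
  english = c("Russian Federation", "United States"),
  stringsAsFactors = FALSE
)

# Unnamed expression: dplyr names the new column after the expression text
unnamed <- toy_locations %>% mutate(tolower(abbv))

# Named expression: the new column gets an explicit name (an assumed name here)
named <- toy_locations %>% mutate(alpha2_lower = tolower(abbv))

names(unnamed)
names(named)
```

Either form works as a join key; the named form is simply easier to reference later, at the cost of having to update every join that uses the old name.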
For my initial exploration of this dataset, I analyzed the number and
topic of articles by the year and month in which they were published. I
first did some data wrangling to produce a better-structured date, so
that I could map aesthetics onto year and month separately. I did this
using the mutate function and creating substrings of the original date.
headlines_sub_year <- headlines %>%
  #selecting only topic and date to reduce size
  select(topic, date) %>%
  #creating a substring of the year by selecting the first 4 characters
  mutate(year = str_sub(date, 0, nchar(date)-6)) %>%
  #creating a substring of the month by selecting only the month characters
  mutate(month = str_sub(date, 6, nchar(date)-3)) %>%
  #creating a substring of the day
  mutate(day = str_sub(date, 9, nchar(date)))
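As an alternative to positional substrings, the same three parts can be pulled out by parsing the string as a date first. This is only a sketch under an assumption: it supposes the dates are stored as "YYYY/MM/DD", which is what the character positions above suggest but which I have not confirmed against the raw file. The vector `toy_dates` is made up for illustration.

```r
# Hypothetical example dates, assuming a "YYYY/MM/DD" layout
toy_dates <- c("2008/09/15", "2014/03/01")

# Parse the strings into Date objects using the assumed format
parsed <- as.Date(toy_dates, format = "%Y/%m/%d")

# Extract year, month, and day as zero-padded character strings
date_parts <- data.frame(
  year  = format(parsed, "%Y"),
  month = format(parsed, "%m"),
  day   = format(parsed, "%d"),
  stringsAsFactors = FALSE
)
date_parts
```

Parsing has the side benefit that a malformed date becomes NA rather than a silently wrong substring.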
I then used this new data frame to visualize how the number of articles published relates to the date.
year <- ggplot(headlines_sub_year, aes(x = year, fill = month)) +
  geom_bar() +
  #note: this scale applies to the color aesthetic, so it does not change
  #the fill colors mapped above
  scale_color_colorblind() +
  labs(title = "Yearly Amount of Publications by Month") +
  facet_wrap(~topic) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
monthly <- ggplot(headlines_sub_year, aes(x = month, fill = year)) +
  geom_bar() +
  labs(title = "Monthly Amount of Publications by Year")
#printing out both graphs
year
monthly
This visualization suggests that the number of articles Lenta.ru published remained mostly steady across topic and year. I do note a spike in the ‘Economy’ topic in 2008. I am unsure what this could mean with regard to Russian views of global developments in that year. This degree of analysis falls outside the scope of this project, but it would be a worthwhile angle for further research with this dataset, alongside other developments of the same timeframe. It is also evident that the most common topic is ‘Russia’, or domestic affairs.
For this section, I show the process that I used to wrangle the data
from the text variable. Since this variable holds the entire text of
each article, I use only the randomly sampled dataset from before to
conserve space. I recommend downloading the full dataset (News Dataset
from Lenta.ru, 2020) if you wish to include more articles in the text
wrangling. I unnested all of the words in the text variable into the
word variable of the new headlines_sub dataset and then counted the
occurrences of each word. This sets up the dataset to be filtered by
country, so that I have a dataset of the mentions of each country. In
this process, I also filtered the dataset through a list of Russian
stopwords (Russian Stopwords Set, n.d.) to remove prepositions,
conjunctions, and filler words. This is the same process as removing
English stopwords, but specific to the Russian language and its grammar.
A glimpse of the wrangled data follows the code.
headlines_sub <- headlines %>%
  # Filtering by topic in the four topics I chose
  filter(topic %in% topics) %>%
  group_by(topic) %>%
  unnest_tokens(word, text) %>%
  #Here I am changing the grammatical form "россии" of the word Russia to
  #the base form "россия" so that both are counted together
  mutate(word = case_when(word == "россии" ~ "россия",
                          word != "россии" ~ word)) %>%
  count(word) %>%
  ungroup() %>%
  #removing stop words, such as and, the, but, etc.
  filter(!word %in% banned) %>%
  arrange(desc(n)) %>%
  na.omit()
#Creating a table using the kableExtra package
headlines_sub %>%
  #limiting the table to only 20 words, not 200,000
  head(20) %>%
  #adding a caption to the table
  kbl(caption = "Glimpse at the full unnested dataset with counts for each word") %>%
  #adding a striped pattern to the table
  kable_styling(bootstrap_options = "striped") %>%
  #making the dimensions of the table
  scroll_box(width = "50%", height = "400px")
topic | word | n |
---|---|---|
Россия | россия | 5653 |
Мир | сообщает | 4807 |
Россия | сообщает | 4668 |
Мир | сша | 4505 |
Россия | новости | 3492 |
Россия | риа | 3485 |
Экономика | процента | 3365 |
Россия | словам | 3319 |
Экономика | россия | 3264 |
Экономика | долларов | 3257 |
Россия | заявил | 2991 |
Экономика | компании | 2845 |
Мир | заявил | 2599 |
Россия | рф | 2573 |
Россия | данным | 2533 |
Россия | суд | 2067 |
Экономика | сообщает | 2045 |
Россия | москвы | 2028 |
Мир | словам | 2002 |
Мир | страны | 1919 |
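The wrangling above merges a single inflected form, "россии", into "россия". If more forms needed to be merged, one possible generalization is a lookup vector that maps each inflected form to a base form. The sketch below is an illustration only: the entries in `form_lookup` and the helper `normalize_word()` are hypothetical and were not used in this analysis.

```r
# Hypothetical lookup: inflected form -> base form (not an exhaustive list)
form_lookup <- c("россии" = "россия", "россию" = "россия", "китая" = "китай")

normalize_word <- function(words, lookup) {
  # Replace words found in the lookup; leave all other words unchanged
  hit <- words %in% names(lookup)
  words[hit] <- unname(lookup[words[hit]])
  words
}

toy_words <- c("россии", "сша", "китая", "россия")
normalize_word(toy_words, form_lookup)
```

A full treatment of Russian inflection would need a proper lemmatizer; a hand-written lookup only covers the forms someone thought to list.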
In this step, I take the previous dataset, headlines_sub, and filter it
through the countries vector from above. This leaves only the countries
found in the text variable of the original headlines dataset, their
topic, and the number of times each country appeared in articles. I did
not cap the number of tokens of a country that can come from a single
article, because I find the total number of times countries are
mentioned illuminating, not only the number of articles they are
mentioned in. This also retains the original weight of an article that
mentioned “США”, the United States, 20 times in the overall picture of
the United States in this dataset.
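To make the difference between the two counting choices concrete, here is a small sketch on made-up tokens. The `article_id` column is an assumption added for illustration; the counts in this analysis are computed per topic and do not keep an article identifier.

```r
# Hypothetical tokens: one row per country token found in an article
toy_tokens <- data.frame(
  article_id = c(1, 1, 1, 2, 3, 3),
  word = c("сша", "сша", "сша", "сша", "китай", "сша"),
  stringsAsFactors = FALSE
)

# Total mentions: every token counts (the approach used in this analysis)
total_mentions <- table(toy_tokens$word)

# Article counts: each country counts at most once per article
article_counts <- table(unique(toy_tokens)$word)

total_mentions[["сша"]]
article_counts[["сша"]]
```

In this toy data the first article contributes three of the five mentions of "сша", which is exactly the per-article weight that the total-mention count preserves.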
#creating a dataset that holds the English names of each country
english_names <- left_join(countries_set, locations, by =
                             c("alpha2" = "tolower(abbv)")) %>%
  #selecting only the Russian and English country names
  select(name, english) %>%
  #making lowercase to match the datasets I will join together
  mutate(name = tolower(name))
#Creating a dataset with Russian and English names and count (n)
headlines_countries <- headlines_sub %>%
  #filtering the unnested `text` variable through the country list
  filter(word %in% countries) %>%
  #Reordering by highest to lowest mention (variable n)
  mutate(word = fct_reorder(word, n)) %>%
  #grouping by topic
  group_by(topic) %>%
  #joining the English country names to this Russian dataset
  full_join(english_names, by = c("word" = "name")) %>%
  #removing NA
  na.omit()
#creating a table using the kableExtra package
headlines_countries %>%
  #creating a caption for this table
  kbl(caption = "Table of each country's number of mentions. n = number
of tokens from text wrangling; word = name of the country;
topic = one of: Мир, meaning 'world'; Россия, meaning
'Russia', that is, domestic affairs; Экономика, meaning 'economy';
and Интернет и СМИ, meaning 'the internet and mass media'.",
      position = "center") %>%
  #creating the styling of the table
  kable_styling(bootstrap_options = "striped", position = "center") %>%
  #making the dimensions of the table
  scroll_box(width = "70%", height = "400px")
topic | word | n | english |
---|---|---|---|
Россия | россия | 5653 | Russian Federation |
Мир | сша | 4505 | United States |
Экономика | россия | 3264 | Russian Federation |
Экономика | сша | 1286 | United States |
Россия | сша | 1283 | United States |
Мир | россия | 985 | Russian Federation |
Интернет и СМИ | россия | 612 | Russian Federation |
Мир | израиль | 540 | Israel |
Интернет и СМИ | сша | 416 | United States |
Мир | кндр | 375 | North Korea |
Мир | ирак | 348 | Iraq |
Мир | иран | 284 | Iran, Islamic Republic of |
Экономика | украина | 174 | Ukraine |
Мир | китай | 125 | China |
Мир | великобритания | 108 | United Kingdom |
Экономика | китай | 105 | China |
Мир | франция | 94 | France |
Мир | сомали | 83 | Somalia |
Мир | пакистан | 73 | Pakistan |
Россия | грузия | 71 | Georgia |
Мир | сирия | 63 | Syrian Arab Republic |
Мир | украина | 62 | Ukraine |
Мир | германия | 59 | Germany |
Мир | чили | 56 | Chile |
Мир | афганистан | 55 | Afghanistan |
Россия | китай | 55 | China |
Мир | япония | 52 | Japan |
Мир | ливан | 48 | Lebanon |
Мир | египет | 47 | Egypt |
Мир | юар | 47 | South Africa |
Экономика | иран | 45 | Iran, Islamic Republic of |
Россия | украина | 43 | Ukraine |
Мир | гаити | 41 | Haiti |
Россия | кндр | 41 | North Korea |
Экономика | белоруссия | 41 | Belarus |
Экономика | япония | 35 | Japan |
Мир | индия | 34 | India |
Мир | испания | 33 | Spain |
Россия | великобритания | 33 | United Kingdom |
Мир | перу | 32 | Peru |
Мир | турция | 32 | Turkey |
Экономика | франция | 32 | France |
Мир | зимбабве | 31 | Zimbabwe |
Мир | италия | 31 | Italy |
Россия | израиль | 30 | Israel |
Россия | ирак | 30 | Iraq |
Экономика | германия | 29 | Germany |
Мир | никарагуа | 28 | Nicaragua |
Мир | оаэ | 28 | United Arab Emirates |
Экономика | великобритания | 28 | United Kingdom |
Россия | афганистан | 27 | Afghanistan |
Мир | колумбия | 26 | Colombia |
Мир | ливия | 26 | Libyan Arab Jamahiriya |
Мир | австралия | 24 | Australia |
Мир | бангладеш | 24 | Bangladesh |
Мир | канада | 24 | Canada |
Мир | сербия | 24 | Serbia |
Россия | сомали | 24 | Somalia |
Россия | франция | 24 | France |
Экономика | венесуэла | 24 | Venezuela |
Экономика | польша | 24 | Poland |
Мир | марокко | 23 | Morocco |
Россия | германия | 23 | Germany |
Экономика | зимбабве | 23 | Zimbabwe |
Экономика | казахстан | 23 | Kazakhstan |
Мир | фиджи | 21 | Fiji |
Россия | иран | 21 | Iran, Islamic Republic of |
Россия | япония | 21 | Japan |
Мир | грузия | 20 | Georgia |
Россия | марокко | 20 | Morocco |
Экономика | ирак | 20 | Iraq |
Мир | кувейт | 19 | Kuwait |
Мир | польша | 19 | Poland |
Россия | пакистан | 19 | Pakistan |
Экономика | азербайджан | 19 | Azerbaijan |
Экономика | грузия | 19 | Georgia |
Мир | венесуэла | 18 | Venezuela |
Мир | куба | 17 | Cuba |
Россия | азербайджан | 17 | Azerbaijan |
Мир | бельгия | 16 | Belgium |
Россия | египет | 16 | Egypt |
Россия | казахстан | 16 | Kazakhstan |
Экономика | индия | 16 | India |
Мир | кипр | 15 | Cyprus |
Экономика | латвия | 15 | Latvia |
Экономика | туркмения | 15 | Turkmenistan |
Россия | перу | 14 | Peru |
Мир | белоруссия | 13 | Belarus |
Мир | индонезия | 13 | Indonesia |
Мир | филиппины | 13 | Philippines |
Россия | польша | 13 | Poland |
Экономика | оаэ | 13 | United Arab Emirates |
Экономика | турция | 13 | Turkey |
Экономика | узбекистан | 13 | Uzbekistan |
Россия | оаэ | 12 | United Arab Emirates |
Россия | таиланд | 12 | Thailand |
Экономика | италия | 12 | Italy |
Мир | болгария | 11 | Bulgaria |
Мир | дания | 11 | Denmark |
Мир | литва | 11 | Lithuania |
Мир | румыния | 11 | Romania |
Мир | судан | 11 | Sudan |
Россия | кипр | 11 | Cyprus |
Россия | турция | 11 | Turkey |
Экономика | венгрия | 11 | Hungary |
Экономика | канада | 11 | Canada |
Экономика | литва | 11 | Lithuania |
Экономика | мексика | 11 | Mexico |
Экономика | молдавия | 11 | Moldova, Republic of |
Экономика | норвегия | 11 | Norway |
Экономика | сингапур | 11 | Singapore |
Экономика | юар | 11 | South Africa |
Мир | австрия | 10 | Austria |
Мир | нидерланды | 10 | Netherlands |
Мир | словакия | 10 | Slovakia |
Россия | таджикистан | 10 | Tajikistan |
Россия | юар | 10 | South Africa |
Экономика | австрия | 10 | Austria |
Экономика | катар | 10 | Qatar |
Интернет и СМИ | израиль | 9 | Israel |
Интернет и СМИ | украина | 9 | Ukraine |
Мир | молдавия | 9 | Moldova, Republic of |
Мир | словения | 9 | Slovenia |
Россия | белоруссия | 9 | Belarus |
Россия | зимбабве | 9 | Zimbabwe |
Россия | италия | 9 | Italy |
Россия | узбекистан | 9 | Uzbekistan |
Экономика | алжир | 9 | Algeria |
Экономика | швейцария | 9 | Switzerland |
Экономика | швеция | 9 | Sweden |
Интернет и СМИ | великобритания | 8 | United Kingdom |
Интернет и СМИ | индия | 8 | India |
Мир | алжир | 8 | Algeria |
Мир | вьетнам | 8 | Vietnam |
Мир | греция | 8 | Greece |
Мир | иордания | 8 | Jordan |
Мир | катар | 8 | Qatar |
Мир | узбекистан | 8 | Uzbekistan |
Мир | чехия | 8 | Czech Republic |
Мир | швейцария | 8 | Switzerland |
Мир | эквадор | 8 | Ecuador |
Россия | литва | 8 | Lithuania |
Экономика | нигерия | 8 | Nigeria |
Экономика | нидерланды | 8 | Netherlands |
Интернет и СМИ | ирак | 7 | Iraq |
Интернет и СМИ | китай | 7 | China |
Мир | азербайджан | 7 | Azerbaijan |
Мир | албания | 7 | Albania |
Мир | ирландия | 7 | Ireland |
Мир | казахстан | 7 | Kazakhstan |
Мир | нигерия | 7 | Nigeria |
Мир | таджикистан | 7 | Tajikistan |
Мир | черногория | 7 | Montenegro |
Мир | швеция | 7 | Sweden |
Россия | испания | 7 | Spain |
Экономика | австралия | 7 | Australia |
Экономика | болгария | 7 | Bulgaria |
Экономика | бразилия | 7 | Brazil |
Экономика | греция | 7 | Greece |
Экономика | израиль | 7 | Israel |
Экономика | индонезия | 7 | Indonesia |
Экономика | оман | 7 | Oman |
Экономика | тонга | 7 | Tonga |
Мир | армения | 6 | Armenia |
Мир | бахрейн | 6 | Bahrain |
Мир | венгрия | 6 | Hungary |
Мир | науру | 6 | Nauru |
Мир | нигер | 6 | Niger |
Мир | норвегия | 6 | Norway |
Мир | таиланд | 6 | Thailand |
Мир | финляндия | 6 | Finland |
Мир | эфиопия | 6 | Ethiopia |
Россия | бельгия | 6 | Belgium |
Россия | киргизия | 6 | Kyrgyzstan |
Россия | норвегия | 6 | Norway |
Россия | судан | 6 | Sudan |
Экономика | испания | 6 | Spain |
Экономика | кндр | 6 | North Korea |
Экономика | лаос | 6 | Lao People’s Democratic Republic |
Экономика | пакистан | 6 | Pakistan |
Интернет и СМИ | германия | 5 | Germany |
Интернет и СМИ | египет | 5 | Egypt |
Интернет и СМИ | иран | 5 | Iran, Islamic Republic of |
Интернет и СМИ | испания | 5 | Spain |
Интернет и СМИ | кндр | 5 | North Korea |
Мир | бразилия | 5 | Brazil |
Мир | люксембург | 5 | Luxembourg |
Мир | мали | 5 | Mali |
Мир | эстония | 5 | Estonia |
Россия | армения | 5 | Armenia |
Россия | дания | 5 | Denmark |
Россия | индия | 5 | India |
Россия | индонезия | 5 | Indonesia |
Россия | канада | 5 | Canada |
Россия | колумбия | 5 | Colombia |
Россия | швейцария | 5 | Switzerland |
Экономика | албания | 5 | Albania |
Экономика | армения | 5 | Armenia |
Экономика | бангладеш | 5 | Bangladesh |
Экономика | дания | 5 | Denmark |
Экономика | египет | 5 | Egypt |
Экономика | науру | 5 | Nauru |
Экономика | словения | 5 | Slovenia |
Экономика | таджикистан | 5 | Tajikistan |
Экономика | финляндия | 5 | Finland |
Экономика | эстония | 5 | Estonia |
Интернет и СМИ | азербайджан | 4 | Azerbaijan |
Интернет и СМИ | грузия | 4 | Georgia |
Интернет и СМИ | франция | 4 | France |
Интернет и СМИ | швейцария | 4 | Switzerland |
Мир | мексика | 4 | Mexico |
Мир | хорватия | 4 | Croatia |
Мир | чад | 4 | Chad |
Россия | бангладеш | 4 | Bangladesh |
Россия | бразилия | 4 | Brazil |
Россия | венесуэла | 4 | Venezuela |
Россия | греция | 4 | Greece |
Россия | латвия | 4 | Latvia |
Россия | ливия | 4 | Libyan Arab Jamahiriya |
Россия | финляндия | 4 | Finland |
Россия | чехия | 4 | Czech Republic |
Экономика | бельгия | 4 | Belgium |
Экономика | боливия | 4 | Bolivia |
Экономика | ирландия | 4 | Ireland |
Экономика | кипр | 4 | Cyprus |
Экономика | колумбия | 4 | Colombia |
Экономика | ливия | 4 | Libyan Arab Jamahiriya |
Экономика | малайзия | 4 | Malaysia |
Экономика | нигер | 4 | Niger |
Экономика | перу | 4 | Peru |
Экономика | португалия | 4 | Portugal |
Экономика | филиппины | 4 | Philippines |
Экономика | чехия | 4 | Czech Republic |
Интернет и СМИ | белоруссия | 3 | Belarus |
Интернет и СМИ | бразилия | 3 | Brazil |
Интернет и СМИ | вьетнам | 3 | Vietnam |
Интернет и СМИ | канада | 3 | Canada |
Интернет и СМИ | колумбия | 3 | Colombia |
Интернет и СМИ | пакистан | 3 | Pakistan |
Интернет и СМИ | сингапур | 3 | Singapore |
Интернет и СМИ | япония | 3 | Japan |
Мир | бурунди | 3 | Burundi |
Мир | гвинея | 3 | Guinea |
Мир | гондурас | 3 | Honduras |
Мир | йемен | 3 | Yemen |
Мир | киргизия | 3 | Kyrgyzstan |
Мир | латвия | 3 | Latvia |
Мир | малави | 3 | Malawi |
Мир | малайзия | 3 | Malaysia |
Мир | монако | 3 | Monaco |
Мир | монголия | 3 | Mongolia |
Мир | португалия | 3 | Portugal |
Мир | сингапур | 3 | Singapore |
Мир | тунис | 3 | Tunisia |
Мир | туркмения | 3 | Turkmenistan |
Мир | цар | 3 | Central African Republic |
Мир | эритрея | 3 | Eritrea |
Россия | австрия | 3 | Austria |
Россия | алжир | 3 | Algeria |
Россия | болгария | 3 | Bulgaria |
Россия | гаити | 3 | Haiti |
Россия | ирландия | 3 | Ireland |
Россия | нидерланды | 3 | Netherlands |
Россия | никарагуа | 3 | Nicaragua |
Россия | парагвай | 3 | Paraguay |
Россия | сербия | 3 | Serbia |
Россия | сирия | 3 | Syrian Arab Republic |
Россия | словакия | 3 | Slovakia |
Россия | филиппины | 3 | Philippines |
Россия | чили | 3 | Chile |
Россия | эстония | 3 | Estonia |
Экономика | бруней | 3 | Brunei Darussalam |
Экономика | вьетнам | 3 | Vietnam |
Экономика | исландия | 3 | Iceland |
Экономика | киргизия | 3 | Kyrgyzstan |
Экономика | люксембург | 3 | Luxembourg |
Экономика | монако | 3 | Monaco |
Экономика | монголия | 3 | Mongolia |
Экономика | таиланд | 3 | Thailand |
Интернет и СМИ | афганистан | 2 | Afghanistan |
Интернет и СМИ | казахстан | 2 | Kazakhstan |
Интернет и СМИ | норвегия | 2 | Norway |
Интернет и СМИ | оаэ | 2 | United Arab Emirates |
Интернет и СМИ | узбекистан | 2 | Uzbekistan |
Интернет и СМИ | швеция | 2 | Sweden |
Интернет и СМИ | юар | 2 | South Africa |
Мир | ботсвана | 2 | Botswana |
Мир | бруней | 2 | Brunei Darussalam |
Мир | вануату | 2 | Vanuatu |
Мир | доминика | 2 | Dominica |
Мир | исландия | 2 | Iceland |
Мир | кения | 2 | Kenya |
Мир | лаос | 2 | Lao People’s Democratic Republic |
Мир | мальта | 2 | Malta |
Мир | мьянма | 2 | Myanmar |
Мир | непал | 2 | Nepal |
Мир | оман | 2 | Oman |
Мир | панама | 2 | Panama |
Мир | сальвадор | 2 | El Salvador |
Россия | аргентина | 2 | Argentina |
Россия | бурунди | 2 | Burundi |
Россия | венгрия | 2 | Hungary |
Россия | вьетнам | 2 | Vietnam |
Россия | гана | 2 | Ghana |
Россия | доминика | 2 | Dominica |
Россия | йемен | 2 | Yemen |
Россия | иордания | 2 | Jordan |
Россия | ливан | 2 | Lebanon |
Россия | люксембург | 2 | Luxembourg |
Россия | малайзия | 2 | Malaysia |
Россия | мали | 2 | Mali |
Россия | мексика | 2 | Mexico |
Россия | молдавия | 2 | Moldova, Republic of |
Россия | нигер | 2 | Niger |
Россия | нигерия | 2 | Nigeria |
Россия | румыния | 2 | Romania |
Россия | хорватия | 2 | Croatia |
Россия | чад | 2 | Chad |
Экономика | ангола | 2 | Angola |
Экономика | бурунди | 2 | Burundi |
Экономика | гайана | 2 | Guyana |
Экономика | гондурас | 2 | Honduras |
Экономика | куба | 2 | Cuba |
Экономика | кувейт | 2 | Kuwait |
Экономика | мьянма | 2 | Myanmar |
Экономика | непал | 2 | Nepal |
Экономика | никарагуа | 2 | Nicaragua |
Экономика | румыния | 2 | Romania |
Экономика | хорватия | 2 | Croatia |
Экономика | чад | 2 | Chad |
Экономика | чили | 2 | Chile |
Экономика | эквадор | 2 | Ecuador |
Интернет и СМИ | австралия | 1 | Australia |
Интернет и СМИ | белиз | 1 | Belize |
Интернет и СМИ | болгария | 1 | Bulgaria |
Интернет и СМИ | боливия | 1 | Bolivia |
Интернет и СМИ | иордания | 1 | Jordan |
Интернет и СМИ | италия | 1 | Italy |
Интернет и СМИ | катар | 1 | Qatar |
Интернет и СМИ | куба | 1 | Cuba |
Интернет и СМИ | ливан | 1 | Lebanon |
Интернет и СМИ | ливия | 1 | Libyan Arab Jamahiriya |
Интернет и СМИ | марокко | 1 | Morocco |
Интернет и СМИ | мексика | 1 | Mexico |
Интернет и СМИ | молдавия | 1 | Moldova, Republic of |
Интернет и СМИ | непал | 1 | Nepal |
Интернет и СМИ | нидерланды | 1 | Netherlands |
Интернет и СМИ | парагвай | 1 | Paraguay |
Интернет и СМИ | перу | 1 | Peru |
Интернет и СМИ | польша | 1 | Poland |
Интернет и СМИ | таджикистан | 1 | Tajikistan |
Интернет и СМИ | таиланд | 1 | Thailand |
Интернет и СМИ | филиппины | 1 | Philippines |
Интернет и СМИ | чад | 1 | Chad |
Мир | аргентина | 1 | Argentina |
Мир | барбадос | 1 | Barbados |
Мир | гана | 1 | Ghana |
Мир | гренада | 1 | Grenada |
Мир | джибути | 1 | Djibouti |
Мир | либерия | 1 | Liberia |
Мир | лихтенштейн | 1 | Liechtenstein |
Мир | мавритания | 1 | Mauritania |
Мир | мадагаскар | 1 | Madagascar |
Мир | мальдивы | 1 | Maldives |
Мир | микронезия | 1 | Micronesia, Federated States of |
Мир | танзания | 1 | Tanzania, United Republic of |
Мир | тонга | 1 | Tonga |
Мир | тувалу | 1 | Tuvalu |
Мир | уругвай | 1 | Uruguay |
Россия | австралия | 1 | Australia |
Россия | гамбия | 1 | Gambia |
Россия | исландия | 1 | Iceland |
Россия | катар | 1 | Qatar |
Россия | кения | 1 | Kenya |
Россия | маврикий | 1 | Mauritius |
Россия | мавритания | 1 | Mauritania |
Россия | мадагаскар | 1 | Madagascar |
Россия | монако | 1 | Monaco |
Россия | монголия | 1 | Mongolia |
Россия | непал | 1 | Nepal |
Россия | португалия | 1 | Portugal |
Россия | сенегал | 1 | Senegal |
Россия | сингапур | 1 | Singapore |
Россия | туркмения | 1 | Turkmenistan |
Россия | уругвай | 1 | Uruguay |
Россия | швеция | 1 | Sweden |
Россия | эквадор | 1 | Ecuador |
Экономика | андорра | 1 | Andorra |
Экономика | аргентина | 1 | Argentina |
Экономика | афганистан | 1 | Afghanistan |
Экономика | бутан | 1 | Bhutan |
Экономика | вануату | 1 | Vanuatu |
Экономика | гаити | 1 | Haiti |
Экономика | гамбия | 1 | Gambia |
Экономика | гватемала | 1 | Guatemala |
Экономика | гвинея | 1 | Guinea |
Экономика | гренада | 1 | Grenada |
Экономика | доминика | 1 | Dominica |
Экономика | иордания | 1 | Jordan |
Экономика | камбоджа | 1 | Cambodia |
Экономика | камерун | 1 | Cameroon |
Экономика | кения | 1 | Kenya |
Экономика | лесото | 1 | Lesotho |
Экономика | либерия | 1 | Liberia |
Экономика | ливан | 1 | Lebanon |
Экономика | лихтенштейн | 1 | Liechtenstein |
Экономика | мальдивы | 1 | Maldives |
Экономика | мальта | 1 | Malta |
Экономика | руанда | 1 | Rwanda |
Экономика | сальвадор | 1 | El Salvador |
Экономика | сенегал | 1 | Senegal |
Экономика | сербия | 1 | Serbia |
Экономика | сирия | 1 | Syrian Arab Republic |
Экономика | словакия | 1 | Slovakia |
Экономика | тунис | 1 | Tunisia |
Экономика | эфиопия | 1 | Ethiopia |
For this graph, I show the distribution of article topics for each
country, to visualize how each country was mentioned with regard to
topic. I kept only country and topic combinations with more than 50
mentions (n > 50), so as not to over-saturate the visualization with
countries that are mentioned only a handful of times. This filter
focuses the graph on the top countries mentioned in the Lenta.ru
articles that I analyzed.
#creating a plot of the top mentioned countries
country_plot2 <- ggplot(headlines_countries %>%
                          filter(n > 50), aes(x = english, y = n)) +
  #specifying the type of visualization
  geom_col() +
  #placing the country names on the Y-axis by flipping x and y axes
  coord_flip() +
  #adding labels and captions to the graph
  labs(title = "Countries with highest mention in Lenta.Ru by Topic",
       x = "Country",
       y = "Number of instances in Articles",
       fill = "Topic",
       caption = "Raw count of Countries' mentions in all articles")
#creating a plot that shows the distribution of topic by country
country_plot3 <- ggplot(headlines_countries %>%
                          filter(n > 50),
                        aes(x = english, y = n, fill = topic)) +
  #setting the position to 'fill' to visualize the share of each country's
  #mentions that falls under each topic
  geom_col(position = "fill") +
  #flipping x and y axes
  coord_flip() +
  #Placing the legend at the bottom of the graph
  theme(legend.position = "bottom") +
  #adding labels and captions; with position = "fill" the axis shows a
  #share between 0 and 1 rather than a raw count
  labs(title = "Countries with highest mention in Lenta.Ru by Topic",
       x = "Country",
       y = "Share of instances in Articles",
       fill = "Topic",
       caption = "Distribution of countries with a count higher than 50,
(n > 50), by topic of article")
#printing out both graphs
country_plot2
country_plot3
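The second plot uses position = "fill", which rescales each country's bar so that its topic segments sum to 1. The sketch below reproduces that arithmetic by hand for two countries, using counts from the table above and keeping only rows with n > 50, as the plot does. The data frame `toy_counts` is a hypothetical subset built for illustration.

```r
# Counts for two countries from the table above, restricted to n > 50
toy_counts <- data.frame(
  english = c("China", "China", "China", "Ukraine", "Ukraine"),
  topic = c("Мир", "Экономика", "Россия", "Экономика", "Мир"),
  n = c(125, 105, 55, 174, 62),
  stringsAsFactors = FALSE
)

# Share of each topic within a country: n divided by that country's total
toy_counts$share <- toy_counts$n /
  ave(toy_counts$n, toy_counts$english, FUN = sum)
toy_counts
```

Because the filter is applied before the shares are computed, a topic with 50 or fewer mentions for a country drops out of that country's bar entirely, and the remaining topics are rescaled to fill it.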
In this section of my method, I detail how I created my map graphics and
joined my spatial data with my country data. To begin, I joined the
locations and country datasets, from the CSV section above, to create a
dataset that has both the Russian and English name of each country,
along with its latitude and longitude. These variables are called
Latitude and Longitude at this stage, and are renamed lat and lon when
the nodes are built. I also cleaned up the data and removed the
duplicate column of English country names. I then joined this dataset
with the dataset of country names and their counts (n). This produced
the dataset spatial, which holds the count per country (variable n), the
Russian name of the country (variable name), the English name of the
country (variable english), and the latitude and longitude of each
country (Latitude and Longitude respectively). Later in this section, I
further clean up the dataset and remove unnecessary columns.
#making a temporary dataset to help join my location data to my count total
spatial2 <- left_join(countries_set, locations, by =
                        c("alpha2" = "tolower(abbv)")) %>%
  #this is used to match the country names in the 'headlines_countries' set
  mutate(tolower(name))
#here I am making the final dataset with country names, count, lat, and long
spatial <- left_join(headlines_countries, spatial2, by =
                       c("word" = "tolower(name)")) %>%
  #I am removing this duplicate column
  select(!`english.y`) %>%
  #I am renaming this for simplicity
  rename("english" = "english.x") %>%
  #I am reordering my count
  mutate(name = fct_reorder(name, n)) %>%
  #I am assigning ID values to each country to use to create the nodes
  #and edges; the ID is the integer code of the country's factor level
  mutate(id = as.numeric(name))
#making a variable that holds the id value of Russia
#this will be used to create the edges and nodes coming from Russia
russia_id <- spatial[1,]$id
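The ids above come from the integer codes of a factor, which is what guarantees that the same country receives the same id wherever it appears. A minimal sketch of that idea on made-up names follows; here the levels are ordered by first appearance, which is an assumption for illustration, whereas the analysis orders them with fct_reorder().

```r
# Hypothetical country names, with repeats as they occur across topics
toy_names <- c("Россия", "США", "Китай", "Россия", "США")

# Convert to a factor, then read off the integer code of each level
toy_factor <- factor(toy_names, levels = unique(toy_names))
toy_ids <- as.integer(toy_factor)
toy_ids
```

Repeated names map to repeated ids, so an id looked up in one topic's rows can safely be used as a join key in another.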
In this code segment, I separate the dataset into topic-specific
datasets. This is because, later in the process, the edges would not
allow for multiple values of the topic variable. I therefore made 4
separate datasets and 4 maps, one for each dataset. I also removed
countries that were not mentioned more than 10 times in a given topic
(n > 10). This reduces clutter and focuses on the top countries that are
mentioned in Lenta.ru articles.
#These datasets are to filter by topic and have a dataset for each
world <- spatial %>%
  filter(topic == "Мир") %>%
  filter(n > 10)
russia <- spatial %>%
  filter(topic == "Россия") %>%
  filter(n > 10)
economy <- spatial %>%
  filter(topic == "Экономика") %>%
  filter(n > 10)
internet <- spatial %>%
  filter(topic == "Интернет и СМИ") %>%
  filter(n > 10)
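The four filters above repeat the same two steps. As a sketch of an alternative, the threshold can be applied once and the result split by topic in a single call with base R's split(). The data frame `toy_spatial` is a hypothetical four-row stand-in that reuses counts from the table above; the list name `by_topic` is likewise made up.

```r
# Hypothetical stand-in for the spatial dataset, using counts from the table
toy_spatial <- data.frame(
  topic = c("Мир", "Мир", "Экономика", "Экономика"),
  english = c("United States", "Cyprus", "China", "Peru"),
  n = c(4505, 15, 105, 4),
  stringsAsFactors = FALSE
)

# Apply the mention threshold once, then split into one data frame per topic
kept <- toy_spatial[toy_spatial$n > 10, ]
by_topic <- split(kept, kept$topic)

names(by_topic)
by_topic[["Мир"]]
```

Separate named objects, as in the original code, are easier to read in a short report; a list becomes more convenient if the number of topics grows.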
In this segment, I make a theme that can be applied to all 4 maps, so that I do not have to repeat the code for each one and can edit the theme here to affect all the maps at once. I also define the country layer, which draws all of the countries on top of the map background. I go into further depth on the different layers of my map visualizations in the next section.
#Here I am creating a custom theme for the following maps to all share
#The ones commented out are ones I did not choose to use but that can be
#saving it as a whole theme
maptheme <- #theme(panel.grid = element_blank()) +
  #removes the axis text
  theme(axis.text = element_blank()) +
  #removes the axis ticks on the map
  theme(axis.ticks = element_blank()) +
  #removes the titles from the axis
  theme(axis.title = element_blank()) +
  #Change the font/ size of text in Map
  #theme(text = element_text(size = 12, family = "Times")) +
  #sets the position of the legend
  theme(legend.position = "bottom") +
  #Controls the background of the map
  theme(panel.background = element_rect(fill = "deepskyblue1")) +
  theme(plot.margin = unit(c(0, 0, 0.5, 0), 'cm'))
#This is the layer for the color and shapes of the countries overlay
country_shapes <- geom_polygon(aes(x = long, y = lat, group = group),
                               data = map_data('world'),
                               #color of countries, and their borders
                               fill = "seagreen2", color = "black",
                               #size/ thickness of the borders
                               size = 0.05)
#Sets the framing of the map to these specific coords
#I am not using this
mapcoords <- coord_fixed(xlim = c(-150, 180), ylim = c(-55, 80))
For this segment of my analysis, I used the methods in Konrad (2018) to
create my visualizations of the network overlays. The process involved
creating layers and overlaying them on top of each other. The first step
was creating the nodes and edges data. This is shown in the nodes_world
dataset, for which I took the dataset filtered on the topic variable and
selected the necessary columns. The nodes were simple to create. For the
edges, I used the id that I assigned to each country to make sure that
Russia is always the from of the edge. I then assigned the to variable
to be the id of each country other than Russia. Then I joined these
together to have the latitude and longitude of both the from country,
which is always Russia, and the to country. Finally, I filtered Russia
out of this dataset, because for that row the to and from variables were
the same, which created problems for the geom_curve() overlay.
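The edge construction described above can be sketched in a few lines of base R. This is an illustration under assumptions rather than the code used for the maps: the data frame `toy_nodes`, its rough coordinates, and the name `hub_id` are all made up, and the sketch drops the hub's self-edge before attaching coordinates instead of filtering it out afterwards.

```r
# Hypothetical nodes; coordinates are rough illustrative values
toy_nodes <- data.frame(
  id = c(1, 2, 3),
  english = c("Russian Federation", "United States", "China"),
  lat = c(60, 38, 35),
  lon = c(100, -97, 105),
  stringsAsFactors = FALSE
)
hub_id <- 1  # id of the hub country (Russia in this analysis)

# One edge from the hub to every other node, so no edge starts and ends
# at the same point
toy_edges <- data.frame(
  from = hub_id,
  to = toy_nodes$id[toy_nodes$id != hub_id]
)

# Look up the start coordinates (hub) and end coordinates (target) by id
start <- toy_nodes[match(toy_edges$from, toy_nodes$id), ]
end <- toy_nodes[match(toy_edges$to, toy_nodes$id), ]
toy_edges$x <- start$lon
toy_edges$y <- start$lat
toy_edges$xend <- end$lon
toy_edges$yend <- end$lat
toy_edges
```

The resulting x, y, xend, and yend columns correspond to the four aesthetics that geom_curve() expects for each edge.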
The process to make each map is the same; only the specific dataset for
the topic variable changes. I made the nodes the base dataset, then
overlaid the country_shapes layer from above, which draws the countries
of the world in green. The next layer was the curves of the edges. For
this I used geom_curve() and assigned the starts and ends to the lat and
long of the beginning and ending nodes. The next layer was the points
themselves, drawn with geom_point. I set the size of the points from the
number of mentions, n, and I had to run n through log10 to keep the
sizes proportionate to each other. Originally, without doing this, the
United States and Russian Federation points were so large that they
covered the entire map. I then used facet_wrap() to create the header
that displays the topic name in Russian. This was only for aesthetic
purposes, as there is only one topic in each dataset. Finally, I added
the map theme from the previous section so that every map shares the
same base theme. I changed the colors for each map to better distinguish
the topics. The process is the same for each map.
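To illustrate why the log10 transform keeps the point sizes manageable, the sketch below compares the spread of raw counts with the spread of their logarithms, using three counts from the 'World' topic in the table above. The vector name `mentions` and its element names are made up for illustration.

```r
# Three counts from the 'World' topic in the table above
mentions <- c(usa = 4505, russia = 985, cyprus = 15)

# Ratio between the largest and smallest raw count
raw_ratio <- max(mentions) / min(mentions)

# The same ratio after the log10 transform used for the point sizes
log_sizes <- log10(mentions)
log_ratio <- max(log_sizes) / min(log_sizes)

raw_ratio
log_ratio
log_sizes
```

Since every country kept for the maps has more than 10 mentions, every transformed size is greater than 1. A count of 1 would map to a size of 0, so the transform would need an offset if the threshold were ever lowered that far.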
#Creating the nodes from the specific topic dataset
nodes_world <- world %>%
  select(c("id","name","english","n","topic","Latitude","Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)
#making every edge start from Russia's id
edges_world <- nodes_world %>%
  mutate(from = russia_id) %>%
  mutate(to = id)
#creating the edges by joining the nodes and edges datasets
edges_for_world <- edges_world %>%
  inner_join(nodes_world %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  inner_join(nodes_world %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  #removing Russia itself, whose edge would start and end at the same point
  filter(!name == "Россия")
#beginning the ggplot pipeline for the maps
world_plot <- ggplot(nodes_world) +
  #overlay of country shapes from segment above
  country_shapes +
  #adding the edges
  geom_curve(aes(x = `lon.y`,
                 y = `lat.y`,
                 xend = lon,
                 yend = lat),
             data = edges_for_world,
             curvature = 0.1,
             alpha = 0.5,
             color = "darkorange1") +
  #guide = "none" replaces the deprecated guide = FALSE
  scale_size_continuous(guide = "none") +
  #adding the country points to the map
  geom_point(aes(x = lon, y = lat),
             shape = 21,
             size = log10(world$n),
             fill = 'darkorange2',
             color = 'black', stroke = 0.5) +
  #title
  labs(title = "Mentions of Foreign countries in 'World' topic") +
  #adding a facet wrap only for aesthetic reasons
  facet_wrap(~topic) +
  #adding the theme from above
  maptheme
#plotting the graph
world_plot
# Creating the nodes from the topic-specific dataset
nodes_economy <- economy %>%
  select(c("id", "name", "english", "n", "topic", "Latitude", "Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)
# Making every edge start from Russia's id
edges_economy <- nodes_economy %>%
  mutate(from = russia_id) %>%
  mutate(to = id)
# Creating the edges from the nodes and edges datasets by joining them;
# lon.y/lat.y hold the "from" (Russia) coordinates, lon/lat the "to" coordinates
edges_for_economy <- edges_economy %>%
  inner_join(nodes_economy %>% select(id, lon, lat), by = c("from" = "id")) %>%
  inner_join(nodes_economy %>% select(id, lon, lat), by = c("to" = "id")) %>%
  filter(name != "Россия") %>%
  unique()
# Beginning of the ggplot pipeline
economy_plot <- ggplot(nodes_economy) +
  # loading the country layer with theme settings from above
  country_shapes +
  # overlay of the edges
  geom_curve(aes(x = `lon.y`,
                 y = `lat.y`,
                 xend = lon,
                 yend = lat),
             data = edges_for_economy,
             curvature = 0.2,
             alpha = 0.5,
             color = "violetred1") +
  # guide = FALSE is deprecated in recent ggplot2; "none" hides the legend
  scale_size_continuous(guide = "none") +
  # overlay of the points
  geom_point(aes(x = lon, y = lat),
             shape = 21,
             size = log10(economy$n),
             fill = "red",
             color = "black", stroke = 0.5) +
  # title
  labs(title = "Mentions of Foreign countries in 'Economy' topic") +
  # only for aesthetic reasons
  facet_wrap(~topic) +
  # theme from above
  maptheme
# Plotting the graph
economy_plot
# Creating the nodes from the topic-specific dataset
nodes_russia <- russia %>%
  select(c("id", "name", "english", "n", "topic", "Latitude", "Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)
# Making every edge start from Russia's id
edges_russia <- nodes_russia %>%
  mutate(from = russia_id) %>%
  mutate(to = id)
# Making the edges from the nodes and edges datasets;
# lon.y/lat.y hold the "from" (Russia) coordinates, lon/lat the "to" coordinates
edges_for_russia <- edges_russia %>%
  inner_join(nodes_russia %>% select(id, lon, lat), by = c("from" = "id")) %>%
  inner_join(nodes_russia %>% select(id, lon, lat), by = c("to" = "id")) %>%
  filter(name != "Россия") %>%
  unique()
# Beginning of the ggplot pipeline
domestic_plot <- ggplot(nodes_russia) +
  # loading the country layer with theme settings from above
  country_shapes +
  # adding the edges layer, colored to match the topic
  geom_curve(aes(x = `lon.y`,
                 y = `lat.y`,
                 xend = lon,
                 yend = lat),
             data = edges_for_russia,
             curvature = 0.2,
             alpha = 0.5,
             color = "hotpink1") +
  # guide = FALSE is deprecated in recent ggplot2; "none" hides the legend
  scale_size_continuous(guide = "none") +
  # adding the point layer
  geom_point(aes(x = lon, y = lat),
             shape = 21,
             size = log10(russia$n),
             fill = "hotpink",
             color = "black", stroke = 0.5) +
  # title of the graph
  labs(title = "Mentions of Foreign countries in 'Domestic' topic") +
  # only for aesthetic reasons
  facet_wrap(~topic) +
  # adding the theme from above
  maptheme
# Plotting the graph
domestic_plot
# Creating the nodes from the topic-specific dataset
nodes_internet <- internet %>%
  select(c("id", "name", "english", "n", "topic", "Latitude", "Longitude")) %>%
  rename(lat = Latitude, lon = Longitude)
# Making every edge start from Russia's id
edges_internet <- nodes_internet %>%
  mutate(from = russia_id) %>%
  mutate(to = id)
# Creating the edges by joining together the edges and nodes datasets;
# lon.y/lat.y hold the "from" (Russia) coordinates, lon/lat the "to" coordinates
edges_for_internet <- edges_internet %>%
  inner_join(nodes_internet %>% select(id, lon, lat), by = c("from" = "id")) %>%
  inner_join(nodes_internet %>% select(id, lon, lat), by = c("to" = "id")) %>%
  filter(name != "Россия") %>%
  unique()
# Beginning of the ggplot pipeline
internet_plot <- ggplot(nodes_internet) +
  # loading the country layer with theme settings from above
  country_shapes +
  # adding the edges layer
  geom_curve(aes(x = `lon.y`,
                 y = `lat.y`,
                 xend = lon,
                 yend = lat),
             data = edges_for_internet,
             curvature = 0.2,
             alpha = 0.5,
             color = "darkorchid1") +
  # guide = FALSE is deprecated in recent ggplot2; "none" hides the legend
  scale_size_continuous(guide = "none") +
  # adding the point layer
  geom_point(aes(x = lon, y = lat),
             shape = 21,
             size = log10(internet$n),
             fill = "purple",
             color = "black", stroke = 0.5) +
  # title
  labs(title = "Mentions of Foreign countries in 'Internet' topic") +
  # only for aesthetic reasons
  facet_wrap(~topic) +
  # adding the theme from above
  maptheme
# Plotting the graph
internet_plot
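Since the four map pipelines differ only in their input data, colors, curvature, and title, they could be wrapped in a single function. The following is only a sketch of that idea, not the code used for the maps above. The function name plot_mentions and its arguments are hypothetical; country_shapes and maptheme are passed in because they are defined in an earlier section. It also maps the point size inside aes() and uses scale_size_identity(), a swapped-in technique that keeps the log10 sizes attached to their rows instead of passing an external vector.

```r
library(ggplot2)

# Hypothetical helper; the name and arguments are assumptions for illustration.
# `nodes` needs columns lon, lat, n, topic; `edges` needs lon.y, lat.y, lon, lat, topic.
plot_mentions <- function(nodes, edges, country_shapes, maptheme,
                          edge_color, point_fill, title, curvature = 0.2) {
  ggplot(nodes) +
    # base layer with the country polygons
    country_shapes +
    # curves run from Russia (lon.y, lat.y) to each mentioned country (lon, lat)
    geom_curve(aes(x = lon.y, y = lat.y, xend = lon, yend = lat),
               data = edges, curvature = curvature,
               alpha = 0.5, color = edge_color) +
    # size is computed per row and used as-is through the identity scale
    geom_point(aes(x = lon, y = lat, size = log10(n)),
               shape = 21, fill = point_fill,
               color = "black", stroke = 0.5) +
    scale_size_identity() +
    labs(title = title) +
    # header showing the topic name; each dataset holds a single topic
    facet_wrap(~topic) +
    maptheme
}

# Example call, assuming the objects from the sections above exist:
# world_plot <- plot_mentions(nodes_world, edges_for_world, country_shapes,
#                             maptheme, "darkorange1", "darkorange2",
#                             "Mentions of Foreign countries in 'World' topic",
#                             curvature = 0.1)
```

With a helper like this, a change to the map design would only need to be made in one place rather than in four copies of the pipeline.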
The results of my visualizations and data wrangling show that the foreign country mentioned most often in this Lenta.ru dataset is, by far, the United States. The only country mentioned more often than the United States is Russia itself, which makes sense. This lends itself to the idea that Russia is most concerned with the United States in regard to certain topics. The domestic topic is the only topic where Russia is mentioned more than the United States, and this makes sense in regard to foreign affairs: in the Russian view, the United States is a foreign affair, not a domestic one.
For the internet topic, in Russian “Интернет и СМИ” (“Internet and Mass Media”), it is interesting that the only country mentioned more than 10 times is the United States. I see this as an insight into the relationship between Russian and American internet culture. Clearly, there is not enough here to analyze the cultures specifically, but this analysis does show a great number of mentions of the United States. Seeing as internet culture has grown tremendously in the past two decades, this could be an excellent point of future research into the relationship between Russian and American cultures.
For further research on this topic, I foresee that exploring other Russian news agencies would be prudent and yield fruitful results. It would be interesting to see whether the weight of each country's mentions holds consistent across other Russian news agencies. It would also be interesting to investigate other countries and the interaction between news articles in multiple countries. This, I predict, could illuminate the responses of countries and the interactions between them. The results could then be visualized with the same method.
Reviewer 1
The author’s objective is to provide insight into the Russian vision of the world from a Russian perspective by analyzing mentions of foreign countries in Russia’s most popular news source, Lenta.ru. The conclusion of the report frames insight into the Russian vision of the world specifically through the lens of the number of mentions of foreign countries in the categories of World, Economy, Domestic, and Internet. Their conclusion particularly notes that the main foreign country mentioned was the US and demonstrates that in the Internet category, the US was the only foreign country mentioned over 10 times, which they note could be an interesting area of further research. The results and figures support the author’s conclusions well. I appreciate the demonstration of workflow in finding the results, as shown by the bar plots reporting publication frequency and mentions of foreign countries. I also find the maps to be a very effective way of visualizing foreign country mentions. The one thing that I think might be helpful in this area would be to provide some sort of figure to show how the size of the node corresponds to the number of mentions, as it would have been interesting to see the differences in mentions between the different categories.
The foundations of data visualization are Verifications (checking for errors and the source of data), Dimensions (the variables shown, variable types, numbers of dimensions), Aesthetics (the scale of the plots, coloring, and size of elements), and Interpretations and Intentions (intended meaning, target audience, and purpose). The author of this report does a great job working with each of these foundations. In terms of verification, it seems they have done their research well to select the most popular source of Russian media. For dimensions, the author uses text data, tokenizes the headlines to select country names, and looks at the topic of the articles and the number of mentions of each country. Overall, the bar plots contain 3 to 4 dimensions, with increased dimensions coming from the use of facet_wrap and color to add categories of articles, months, and years. In terms of aesthetics, the author primarily uses color and size to add additional information to plots. The interpretations and intentions of the author are clearly stated in the introduction and conclusion. The majority of the plots are presented very objectively, and areas where the author did additional wrangling to change aspects of how the data is presented (as with the size of the nodes in the maps) are transparently stated.
Strong:
1. The written component of this report is quite strong. The author makes it very clear what they did in each part of the report such that I believe it would be very possible to reproduce their methods.
2. The data wrangling of this project is well done. I really appreciated the incorporation of the scrolling table that allows the reader to see all the hard work going on behind the scenes of the visualization.
3. I appreciate the flow of the visualizations. Starting with the bar plots to generally visualize the data before providing specific insights into Russian media through the maps was something I thought was very effective.
To be improved:
1. I mentioned this above, but I think the maps could have really benefitted from a legend to allow the reader to easily see how the countries mentioned change between the categories as well as how the number of mentions changes.
2. In the figure that shows the “Countries with the mention in Lenta.Ru by Topic,” it would have been nice to have a subtitle for each topic giving the translation into English. This would be beneficial as it would be easier to see how this bar plot relates to the maps later on in the report.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".