Using the full public domain texts available through Project Gutenberg, I would like to evaluate the most frequent words used in science fiction from the mid-19th century to the present in order to visualize how topics of interest have shifted over time.
The 20th century was a time of rapid acceleration in the development of technology. From the flight of the first gas-powered airplane in 1903 (The Wright Stuff, n.d.) to the release of IBM’s first PC in 1981 (Timeline of Computer History, n.d.), new technology had the potential to change the operation of day-to-day life. Alongside these changes came shifts in the way people wrote speculative fiction (Writing the Future, n.d.). From the adventurous exploits and dazzling technology in the work of Jules Verne and H. G. Wells to the grim post-apocalyptic worlds that were widespread by the 1960s, the science fiction genre offers an interesting perspective on the contemporary issues of its day. In this project, I would like to explore this shift by using the gutenbergr package to perform textual analysis on science fiction from the late 1800s to the mid-1900s, a period of prominent technological, political, and social turmoil.
The objectives of this project are to identify and visualize thematic shifts in science fiction over the last 150 years, as well as to suggest potential reasons why these shifts occurred.
The main source of data in this project is the gutenbergr package, which provides metadata for and access to the public-domain full texts available through Project Gutenberg. The gutenberg_metadata data frame includes the variables gutenberg_id, title, author, gutenberg_author_id, language, gutenberg_bookshelf, rights, and has_text. In addition to the metadata, I am using the gutenberg_download function provided in the package to download the full texts of the works I want to analyze. The downloaded text comes back as a data frame with the gutenberg_id, title, and text columns.
Working with the Project Gutenberg data to evaluate change over time is challenging for a few reasons. Primarily, information about publication date is not available, so alternative sources must be used to estimate approximate publication dates. For this analysis, the gutenberg_authors data frame (which includes gutenberg_author_id, author, alias, birthdate, deathdate, wikipedia, and aliases) was used to sort the works into 30-year intervals based on the average of each author’s birth and death dates. This method is very inexact, and if this project were to be extended, it would be beneficial to use a more precise source (e.g., scraping the Wikipedia pages for publication dates).
The data exploration and wrangling for this project took the form of four major steps: identifying texts of interest by establishing search criteria, doing exploratory investigations of themes using the titles of identified texts, extracting word frequency data for each of the 1100+ texts of interest, and visualizing the results through wordclouds.
The first step in data analysis was establishing search criteria that kept the data set manageable, focused, and free of too many complicating factors. I first filtered the Gutenberg metadata for English-language items with the full text available, and then used the categorization provided by the gutenberg_bookshelf variable to restrict my data set to science fiction titles only.
Rows: 10,056
Columns: 8
$ gutenberg_id <int> 1, 2, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15,…
$ title <chr> "The Declaration of Independence of the …
$ author <chr> "Jefferson, Thomas", "United States", "L…
$ gutenberg_author_id <int> 1638, 1, 3, 1, 4, 3, 3, 7, 7, 7, 8, 9, 1…
$ language <chr> "en", "en", "en", "en", "en", "en", "en"…
$ gutenberg_bookshelf <chr> "United States Law/American Revolutionar…
$ rights <chr> "Public domain in the USA.", "Public dom…
$ has_text <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
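The filtering code itself is not shown here; a minimal sketch of that step (assuming the bookshelf label is "Science Fiction" and that the dplyr and stringr packages are available) could look like the following:
# sketch of the filtering described above; the exact bookshelf label is an assumption
library(gutenbergr)
library(dplyr)
library(stringr)
sciFi <- gutenberg_metadata %>%
  filter(language == "en",                                    # English-language items only
         has_text,                                            # full text available for download
         str_detect(gutenberg_bookshelf, "Science Fiction"))  # restrict to the science fiction bookshelf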
After generating my sciFi data set, I used the gutenberg_authors data frame to estimate the approximate publication date of each text of interest. My method for this was very imprecise: I averaged the author’s birth and death dates to get the halfway point of their life, and from there I assigned each work to a thirty-year interval. This could have caused a number of texts to be categorized incorrectly, as authors writing around the time one of my intervals changes could have been placed too early or too late, and any author who wrote very young or very old could be anywhere from one to two intervals away from the proper publication date.
# packages used throughout this analysis
library(gutenbergr)
library(dplyr)
library(tidyr)
library(tidytext)
library(wordcloud)
library(ggplot2)
library(plotly)

# using gutenberg's author data frame to estimate the time of publication
approxDates <- gutenberg_authors %>%
  group_by(author) %>%
  summarize(year = (birthdate + deathdate) / 2) %>%
  drop_na() %>%
  # removes authors with a BC birthdate
  filter(year > 0) %>%
  arrange(year)

# joining the sciFi data frame with approxDates, then assigning each title to a thirty-year interval
# incorporating Alex's suggestion to use case_when() rather than cut()
sciFi_wDates <- sciFi %>%
  left_join(approxDates, by = "author") %>%
  arrange(year) %>%
  mutate(interval = case_when(year >= 1850 & year < 1880 ~ 1850,
                              year >= 1880 & year < 1910 ~ 1880,
                              year >= 1910 & year < 1940 ~ 1910,
                              year >= 1940 & year < 1970 ~ 1940,
                              year >= 1970 & year < 2000 ~ 1970,
                              year >= 2000 ~ 2000))

# resulting frame's variables:
# gutenberg_id, title, author, gutenberg_author_id, language, gutenberg_bookshelf,
# rights, has_text, year, interval
After this initial wrangling, I had a data frame containing about 1100 titles that were all eligible for analysis under the criteria I had developed. From there, I began doing some exploratory text analysis.
Before looking at the change over time, I wanted to get a broad sense of some major themes explored in science fiction. To do this, I decided to use the titles of the texts I identified and create a visualization of some of the most frequent words.
# load stop words lexicon (code from class)
swd_list <- tidytext::stop_words %>%
  filter(lexicon == "SMART") %>%
  select(word)
swd_list <- swd_list$word

# making a data frame with tokenized words from all titles
initial <- sciFi_wDates %>%
  group_by(interval) %>%
  unnest_tokens(word, title) %>%
  count(word) %>%
  # filter out stop words
  filter(!word %in% swd_list) %>%
  # minimum word occurrence to be able to fit the wordcloud on the page is 3
  filter(n >= 3) %>%
  arrange(desc(n))

# making the wordcloud
wordcloud(initial$word, initial$n)
As can be seen in the wordcloud, many of the most frequent title words are associated with space, war, the future, and various other topics.
The bulk of data processing occurred in the analysis of the full texts. Because of the size and complexity of the data, I chose not to retain complete word-frequency information for individual texts. Instead, I identified the twenty most frequent words (excluding the common words in the stop-words lexicon) used in each text. Then, after joining data frames to associate the words with the time intervals of interest, I counted how often words appeared among the top-20 lists within each time interval. My evaluation was therefore based on how often a word appeared in these top-20 lists, restricted to words that showed up in three or more texts within an interval.
It should be noted that running the code to analyze all of the full texts takes a significant amount of time (20+ minutes the first time it was run), though it successfully outputs the data of interest. This processing time can be avoided by simply opening the .csv file containing the resulting data frame, which was saved after the analysis was initially run. The file is included and the code below is set up to use the .csv automatically, but the code used to produce it is also included and can be run again if desired.
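As a sketch, the caching described above (assuming the results were written to a file named top20Words.csv, as the comment below suggests, and that the readr package is available) could be handled like this:
# only run the 20+ minute loop if the cached results are not already on disk
library(readr)
if (file.exists("top20Words.csv")) {
  # reuse the saved results and skip the long-running loop below
  top20Words <- read_csv("top20Words.csv")
} else {
  # otherwise, run the for loop below and then cache the results with:
  # write_csv(top20Words, "top20Words.csv")
}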
## the text analysis code; once again, a warning that it takes 20+ minutes to run
# top20Words.csv has the results without the hassle,
# but here's the code I used:

# making a list that has the IDs of all the texts to tokenize
text_IDs <- sciFi_wDates$gutenberg_id

# making the empty data frame to be appended to in the for loop
top20Words <- data.frame()

# the for loop outputs a frame with 20 rows per text, each with the gutenberg_id,
# word, and n variables
for (i in text_IDs) {
  # downloads the full text for the current work of interest
  testText <- gutenberg_download(i, meta_fields = "title")

  # tokenizing the text of interest
  tokenedText <- testText %>%
    group_by(gutenberg_id) %>%
    unnest_tokens(word, text) %>%
    # removing stop words
    filter(!word %in% swd_list) %>%
    # counting the number of times words appear and keeping the 20 most frequent
    count(word) %>%
    arrange(desc(n)) %>%
    head(20)

  # adding the top 20 words of this text to the data frame with all top 20 words
  top20Words <- rbind(top20Words, tokenedText)
}
glimpse(top20Words)
After getting the 20 most frequently occurring words for each text, I used the subsetted metadata to assign each row to an interval.
# selecting the columns of interest from the sciFi_wDates data frame
text_info <- sciFi_wDates %>% select(gutenberg_id, interval, author)

# joining the data frames; the result has gutenberg_id, word, n, author, and interval
top20Words_wDates <- top20Words %>% left_join(text_info, by = "gutenberg_id")

# counting the number of times words occur within each interval
words_summary <- top20Words_wDates %>%
  group_by(interval, word) %>%
  summarize(count = n()) %>%
  # filters for words that appear in the top 20 words of three or more texts
  filter(count >= 3) %>%
  drop_na() %>%
  arrange(desc(count))
glimpse(words_summary)
Rows: 867
Columns: 3
Groups: interval [5]
$ interval <dbl> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 194…
$ word <chr> "back", "time", "man", "looked", "eyes", "thought",…
$ count <int> 427, 377, 335, 190, 151, 137, 135, 125, 120, 116, 1…
The final data set I ended up with, words_summary, records, for each interval, the number of texts in which a word appears among the top 20 words.
The final step of this data exploration was visualizing the most frequently occurring words for each of the thirty-year intervals using some basic word clouds. Because the amount of text evaluated was large and the associated metadata limited, I wanted to keep the word clouds as simple as possible so as not to be misleading. The colors used do not indicate any qualities of the data and are purely aesthetic. Additional data wrangling was done to generate relative frequencies, but word sizing does not reflect the magnitude of differences between word frequencies in the full data set.
## each of the following word clouds uses the same code, so only the first is commented
# 1850s wordcloud
# making a subset of the summarized word counts for the interval of interest
summary_1850 <- words_summary %>% filter(interval == 1850)

# calculating the total number of instances of words that occurred in 3+ texts within the interval
totalWords_1850 <- sum(summary_1850$count)

# adding a column with the relative frequencies of the words
summary_1850 <- summary_1850 %>%
  mutate(freq = count / totalWords_1850) %>%
  select(word, freq)

# generating the wordcloud
wordcloud(words = summary_1850$word, freq = summary_1850$freq,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(6, 2))
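The summaries for the remaining intervals (summary_1880 through summary_1970) were generated with the same steps, as noted in the comment above; as a sketch, that repetition could be wrapped in a small helper function (the name interval_summary below is hypothetical):
# hypothetical helper wrapping the per-interval steps shown above
interval_summary <- function(yr) {
  words_summary %>%
    filter(interval == yr) %>%
    # sum(count) is the interval total because the frame is grouped by interval
    mutate(freq = count / sum(count)) %>%
    # interval is kept automatically as the grouping column
    select(word, freq)
}
summary_1880 <- interval_summary(1880)
summary_1910 <- interval_summary(1910)
summary_1940 <- interval_summary(1940)
summary_1970 <- interval_summary(1970)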
wordcloud(words = summary_1880$word, freq = summary_1880$freq,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(4.8, 1.8))

wordcloud(words = summary_1910$word, freq = summary_1910$freq,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(5.5, 1.8))

wordcloud(words = summary_1940$word, freq = summary_1940$freq,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(6, 2))

wordcloud(words = summary_1970$word, freq = summary_1970$freq,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(6, 2))
Because the word clouds are a little tricky to compare, the last plot I made elaborates on their differences and similarities. The scaling and relative frequencies shown are not fully reliable, but this plot captures some of the change I’m interested in.
# frame with the words used in the wordcloud for each interval
all_summary <- summary_1850 %>%
  rbind(summary_1880) %>%
  rbind(summary_1910) %>%
  rbind(summary_1940) %>%
  rbind(summary_1970)

# using manual categorization to assign each word to a category based on the
# general context of the word
people <- c("captain", "doctor", "man", "men", "mr", "people", "professor", "sir")
verbs <- c("asked", "began", "he'd", "knew", "looked", "made", "make", "replied", "thought")
descriptors <- c("black", "good", "great", "half", "long", "suddenly")
nouns <- c("air", "alien", "back", "car", "eyes", "face",
           "feet", "hand", "hands", "head", "mind", "city", "door")
space_time <- c("earth", "planet", "ship", "space", "sun", "world", "years",
                "time", "light", "past", "day", "night")
other_nouns <- c("life", "love", "room", "surface", "thing", "things", "war", "work")

# adding a column with the categorization to the data frame
all_summary <- all_summary %>%
  mutate(category = case_when(word %in% people ~ "people",
                              word %in% verbs ~ "verbs",
                              word %in% descriptors ~ "descriptors",
                              word %in% nouns ~ "nouns",
                              word %in% space_time ~ "space_time",
                              word %in% other_nouns ~ "other_nouns",
                              TRUE ~ "other"))

# making an interactive ggplot using plotly
interactive_timePlot <- ggplot(data = all_summary, aes(x = interval, y = freq, color = word)) +
  geom_point() +
  geom_line() +
  facet_wrap(~category) +
  labs(title = "Subject Matter Over Time")

ggplotly(interactive_timePlot)
As can be seen in the plot, there are some interesting changes over the course of the five intervals that could likely be evaluated further with more precise text analysis. The most frequently appearing verbs are similar across the intervals, but the variations observed in this broad view could be interesting to examine further. The words “war” and “alien” don’t appear as top words until the 1970 interval, while words like “planet,” “time,” and “good” appear in all of them.
This project used a large data set of over 1000 full-text public-domain documents in an attempt to learn about the broad thematic patterns of science fiction over the last 150 years, hypothesizing that the results would show shifting areas of focus based on the political, social, and technological context of the time the texts were written. While the results were not as conclusive as one might have hoped, there is a foundation for more specific inquiry in the future. One potential avenue for exploring the data further is to create a more extensive system for categorizing the words in these texts. Similar to how emotion lexicons map emotional associations onto words for simpler analysis, creating a source that associates certain words with categories could help streamline the process of looking at trends over time.
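To illustrate what such a categorization source might look like, here is a rough sketch; the category_lexicon table below is entirely hypothetical and would need to be built out much further:
# hypothetical category lexicon: a two-column lookup table that can be joined
# onto word counts, in the same spirit as the sentiment lexicons in tidytext
category_lexicon <- tibble::tribble(
  ~word,     ~category,
  "rocket",  "technology",
  "engine",  "technology",
  "senator", "politics",
  "treaty",  "politics",
  "love",    "abstract",
  "hope",    "abstract"
)
# joining the lexicon onto the per-interval word counts would give category
# totals per interval without maintaining manual case_when() lists
categorized <- words_summary %>%
  inner_join(category_lexicon, by = "word") %>%
  group_by(interval, category) %>%
  summarize(total = sum(count), .groups = "drop")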
It is also crucial that a more accurate source be found for assigning publication dates to texts. The degree of error introduced by the assumptions used to estimate when something was written puts any observations on very shaky ground. Since the authors’ Wikipedia pages are included in the gutenberg_authors data frame, an effort could be made to use them as a source of more accurate time information. Given more accurate publication dates, some of the areas I would be interested in exploring further are trends in the most frequently occurring verbs, what sorts of occupations and roles are mentioned at different times, and how the frequency of abstract nouns (love, peace, hope) changes over time.
The algorithm used to generate the most frequently occurring words in the data set could easily be adapted to perform textual analysis with other parameters as well. Changing which words are filtered out, where the list of texts originates, and what is chosen as the output would allow many of these questions to be investigated more precisely, or an entirely different set of questions to be evaluated. Overall, this project establishes a good jumping-off point for textual analysis of the texts available through gutenbergr, with the potential to be adapted for other applications as well.
Reviewer 1
The author’s objectives were to explore trends in frequent words in science fiction texts between the late 19th century and the mid-to-late 20th century. The questions the author was considering include how the topics of interest in science fiction shifted over time, potentially influenced by factors such as the political and scientific context of each time period. The results and figures were in agreement with the conclusions made in the report, as the author did mention trends such as “war” and “alien” not appearing as top words until the 1970s, while less politically and technologically charged terms such as “planet,” “time,” and “good” appear across all time intervals. The author was, however, very cautious about drawing strong conclusions and mentioned that the results were not as conclusive as one might have hoped, given the limitations of the estimated publication dates.
Regarding the foundations of data visualization, the author verified that the data come from a reliable source, despite the limitation in publication dates. Since the only variable shown in the wordclouds is the word itself, it might have been interesting if the size of the words in each wordcloud reflected the relative frequencies presented in the frequency plot. For the frequency plot, it would have been slightly easier to read if the x-axis were larger so the time-interval labels did not overlap. The coloring used to distinguish words in both types of plots fulfills the purpose of showing the trends of different words over the time intervals well.
I think the author communicated the limitations of the study clearly and did the best they could with the available information, estimating the publication date of each text in a resourceful way (using the writers’ birth and death dates). The data wrangling, including finding word frequencies, was very well laid out, with the code and explanation presented together so readers could follow step by step. The different types of visualizations, the wordclouds and the frequency plot, also complement each other nicely, and the author was able to identify words that are found in all time intervals as well as words whose frequency changes across them.
As the author briefly mentioned, the scaling and relative frequencies in the final frequency plot could be made more accurate if other parts of the data, such as the publication dates, were improved in accuracy. In addition, it would have been interesting if the color of the lines in the frequency plot indicated the type of word, in a plot facet-wrapped differently so that the frequency trends of different types of words could be compared.