Shifts in the Topics of Science Fiction Literature Over Time

Sarah Ellis (Data Science at Reed College)
https://reed-statistics.github.io/math241-spring2022/
May 4, 2022

Introduction

Problem Statement

Using the full public-domain texts available through Project Gutenberg, I would like to evaluate the most frequent words used in science fiction from the mid-19th century to the present, in order to visualize how topics of interest have shifted over time.

Background

The 20th century was a time of rapid acceleration in the development of technology. From the flight of the first gas-powered airplane in 1903 (The Wright Stuff, n.d.) to the release of IBM’s first PC in 1981 (Timeline of Computer History, n.d.), new technology had the potential to change the operation of day-to-day life. Alongside these changes came shifts in the way people wrote speculative fiction (Writing the Future, n.d.). From the adventurous exploits and dazzling technology in the work of Jules Verne and H.G. Wells to the grim post-apocalyptic worlds that were widespread by the 1960s, the science fiction genre offers an interesting perspective on the contemporary issues of its day. In this project, I explore this shift by using the gutenbergr package to perform textual analysis on science fiction from the late 1800s to the mid-1900s, a period of prominent technological, political, and social turmoil.

The objectives of this project are to identify and visualize thematic shifts in science fiction over the last 150 years, as well as to suggest potential reasons why these shifts occurred.

Data Set

The main source of data in this project is the gutenbergr package, which includes metadata for, and access to, the public-domain full texts available through Project Gutenberg. The gutenberg_metadata data frame includes the variables gutenberg_id, title, author, gutenberg_author_id, language, gutenberg_bookshelf, rights, and has_text. In addition to the metadata, I am using the gutenberg_download() function provided in the package to download the full texts of the works I am interested in analyzing. The downloaded text comes back as a data frame with the gutenberg_id, title, and text variables.
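For example, a single work can be pulled with gutenberg_download(); the chunk below is a minimal illustration (ID 84 is an arbitrary choice) rather than part of the analysis itself.

# downloading one full text along with its title; the result is a data frame
# with one row per line of text (variables: gutenberg_id, text, title)
library(gutenbergr)
library(dplyr)

example_text <- gutenberg_download(84, meta_fields = "title")
glimpse(example_text)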

Working with the Project Gutenberg data to evaluate change over time is challenging for a few reasons. Primarily, information about publication date is not available, so alternative sources must be used to estimate approximate publication dates. For this analysis, the gutenberg_authors data frame (which includes gutenberg_author_id, author, alias, birthdate, deathdate, wikipedia, and aliases) was used to sort the works into 30-year time intervals based on the average of each author’s birth and death dates. This method is very inexact, and if this project were to be extended, it would be beneficial to use a more precise source (e.g., scraping the Wikipedia pages for publication dates).

Methods and Results

The data exploration and wrangling for this project took the form of four major steps: identifying texts of interest by establishing search criteria, doing exploratory investigations of themes using the titles of the identified texts, extracting word frequency data for each of the 1,100+ texts of interest, and visualizing the results through word clouds.
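All of the code below assumes the following packages are attached; this setup chunk is inferred from the functions used in the analysis rather than shown explicitly.

library(gutenbergr)   # gutenberg_metadata, gutenberg_authors, gutenberg_download()
library(dplyr)        # general data wrangling verbs
library(tidyr)        # drop_na()
library(tidytext)     # unnest_tokens(), stop_words
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal() color palettes
library(readr)        # read_csv()
library(ggplot2)      # plotting
library(plotly)       # ggplotly() for the interactive plot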


Identifying Texts

The first step in the data analysis was establishing search criteria that kept the data set manageable, focused, and free of too many complicating factors. I first filtered the Gutenberg metadata for English-language items with the full text available, and then used the categorization provided by the gutenberg_bookshelf variable to restrict my data set to science fiction titles.

# filtering data to only include English titles with full text available
metaData <- gutenberg_metadata %>% filter(has_text==TRUE & language=="en") %>%
  drop_na() %>%
  glimpse()
Rows: 10,056
Columns: 8
$ gutenberg_id        <int> 1, 2, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15,…
$ title               <chr> "The Declaration of Independence of the …
$ author              <chr> "Jefferson, Thomas", "United States", "L…
$ gutenberg_author_id <int> 1638, 1, 3, 1, 4, 3, 3, 7, 7, 7, 8, 9, 1…
$ language            <chr> "en", "en", "en", "en", "en", "en", "en"…
$ gutenberg_bookshelf <chr> "United States Law/American Revolutionar…
$ rights              <chr> "Public domain in the USA.", "Public dom…
$ has_text            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
# filtering data for a science fiction subset
sciFi <- metaData %>% filter(gutenberg_bookshelf == "Science Fiction") %>%
  group_by(author)

After generating my sciFi data set, I used the gutenberg_authors data frame to estimate the approximate publication date of the texts of interest. My method for this was very imprecise: I averaged the birth and death dates of each author to get the halfway point of their life, and from there, I assigned each work to a thirty-year interval. This could have caused a number of texts to be incorrectly categorized, as authors writing around the time when my intervals change could have been placed too early or too late, and any author who wrote very young or very old could be assigned anywhere from one to two intervals away from the proper publication date.

# using gutenberg's author data frame to estimate the time of publication
approxDates <- gutenberg_authors %>% group_by(author) %>%
  summarize(year = (birthdate + deathdate)/2) %>%
  drop_na() %>%
  #removes authors with a BC birthdate
  filter(year > 0) %>%
  arrange(year)

# joining sciFi data frame with approxDates, then assigning each title to a thirty-year interval
#incorporating Alex's suggestion to use case_when() rather than cut()
sciFi_wDates <- sciFi %>% left_join(approxDates, by = "author") %>%
  arrange(year) %>%
  mutate(interval = case_when(year >= 1850 & year <1880 ~ 1850,
                            year >= 1880 & year <1910 ~ 1880,
                            year >= 1910 & year <1940 ~ 1910,
                            year >= 1940 & year <1970 ~ 1940,
                            year >= 1970 & year <2000 ~ 1970,
                            year >= 2000 ~ 2000))
# resulting frame's variables: 
#   gutenberg_id, title, author, gutenberg_author_id, language, gutenberg_bookshelf, 
#   rights, has_text, year, interval

After this initial wrangling, I had a data frame containing about 1100 titles that were all eligible for analysis under the criteria I had developed. From there, I began doing some exploratory text analysis.
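As a quick check on how those titles are distributed, the counts per interval can be tallied directly; this is a small aside (ungroup() is needed because sciFi_wDates is still grouped by author) rather than part of the main wrangling.

# number of eligible titles assigned to each thirty-year interval
sciFi_wDates %>%
  ungroup() %>%
  count(interval)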

Exploring Theme

Before looking at the change over time, I wanted to get a broad sense of some major themes explored in science fiction. To do this, I decided to use the titles of the texts I identified and create a visualization of some of the most frequent words.

# load stop words lexicon code from class
swd_list <- tidytext::stop_words %>% 
  filter(lexicon == "SMART") %>%
  select(word)
swd_list <- swd_list$word

# making a data frame with tokenized words from all titles
initial <- sciFi_wDates %>%
  group_by(interval) %>%
  unnest_tokens(word, title) %>%
  count(word) %>%
  # filter out stop words
  filter(!word %in% swd_list) %>%
  # minimum word occurrence to be able to fit wordcloud on page is 3
  filter(n>=3) %>%
  arrange(desc(n))
# making the wordcloud 
wordcloud(initial$word, initial$n)

As can be seen in the wordcloud, there are lots of words associated with space, war, the future, and various other topics.

Finding Word Frequencies

The bulk of the data processing occurred in the analysis of the full texts. Because of the size and complexity of the data, I chose not to retain complete word-frequency information for each individual text. Instead, I identified the twenty most frequent words (excluding the common words in the stop word lexicon) used in each text. Then, after joining data frames to associate those words with the time intervals of interest, I counted how often each word appeared among the top-20 lists within each time interval. My evaluation was therefore based on how frequently a word occurred in an interval’s top-20 lists, restricted to words that appeared in three or more texts.

It should be noted that running the code to analyze all of the full texts takes a significant amount of time (20+ minutes the first time it was run), though it successfully outputs the data of interest. This processing time can be avoided by simply opening the .csv file containing the resulting data frame, which was saved after the analysis was first run. The file is included, and the code below is set up to use the .csv automatically, but the analysis code is also included and can be run again if desired.

## importing .csv file of previously completed analysis rather than re-evaluating
# includes gutenberg_id, word, and n (the number of times the word appears in that text)

library(readr)
top20Words <- read_csv("top20Words.csv")
## the text analysis code, once again warning it takes 20+ minutes to run
#     top20Words.csv has the results without the hassle
#     but here's the code I used:

# making a list that has the IDs of all the texts to tokenize
text_IDs <- sciFi_wDates$gutenberg_id

# making the (empty) data frame that each text's results get appended to
top20Words <- data.frame()

# for loop outputs frame with 20 rows per text, each with the gutenberg_id,
#   word, and n variables

for (i in text_IDs) {
  #downloads full text for current work of interest
  testText <- gutenberg_download(c(i), meta_fields = "title")
  
  # tokenizing text of interest
  tokenedText <- testText %>%
    group_by(gutenberg_id) %>%
    unnest_tokens(word, text) %>%
    #removing stop words
    filter(!word %in% swd_list) %>%
    #counting number of times words appear and outputting 20 most frequent
    count(word) %>%
    arrange(desc(n)) %>%
    head(20)
  
  #adding top 20 words of text to dataframe with all top 20 words
  top20Words <- rbind(top20Words, tokenedText)
}

glimpse(top20Words)
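The cached top20Words.csv referenced earlier could be produced with write_csv() once the loop finishes, and a simple guard could skip the loop on later runs; the file.exists() pattern below is my own suggestion rather than part of the original workflow.

# save the loop's output so later runs can skip the 20+ minute download step
write_csv(top20Words, "top20Words.csv")

# on subsequent runs, read the cache instead of re-running the loop
if (file.exists("top20Words.csv")) {
  top20Words <- read_csv("top20Words.csv")
}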

After getting the 20 most frequently occurring words for each text, I used the subsetted metadata to assign each row to an interval.

# selecting columns of interest from the sciFi_wDates data frame
text_info <- sciFi_wDates %>% select(gutenberg_id, interval, author)

# joining the data frames; the result has gutenberg_id, word, n, author, and interval
top20Words_wDates <- top20Words %>% left_join(text_info, by = "gutenberg_id")

#counting the number of times words occur within each interval
words_summary <- top20Words_wDates %>% group_by(interval, word) %>%
  summarize(count = n()) %>%
 #filters for words that appear in the top 20 words of three or more texts
  filter(count >= 3) %>%
  drop_na() %>%
  arrange(desc(count))

glimpse(words_summary)
Rows: 867
Columns: 3
Groups: interval [5]
$ interval <dbl> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 194…
$ word     <chr> "back", "time", "man", "looked", "eyes", "thought",…
$ count    <int> 427, 377, 335, 190, 151, 137, 135, 125, 120, 116, 1…

The final data set I ended up with, words_summary, records how many texts within each interval include a given word among their top 20 words.
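For a quick look at what dominates each interval before turning to the word clouds, the most common entries per interval can be pulled with slice_max(); this peek is an aside and not part of the original analysis.

# five most frequently appearing top-20 words within each interval
words_summary %>%
  group_by(interval) %>%
  slice_max(count, n = 5, with_ties = FALSE) %>%
  arrange(interval, desc(count))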

Visualization

The final step of this data exploration was to visualize the most frequently occurring words for each of the thirty-year intervals using some basic word clouds. Because the amount of text evaluated was so large and had limited metadata associated with it, I wanted to keep the word clouds as simple as possible so as not to be misleading. The colors used do not indicate any qualities of the data and are only for aesthetic purposes. Additional data wrangling was done to generate relative frequencies, but the sizing does not convey the magnitude of the differences between word frequencies in the full data set.

## each of the following word clouds uses the same code, so only the first is commented

#1850s wordcloud

#making a subset of the summarized word counts for the interval of interest
summary_1850 <- words_summary %>% filter(interval == 1850)
#calculating the sum of instances of the words that occurred in 3+ texts within interval
totalWords_1850 <- sum(summary_1850$count)
#mutating column with relative frequencies of words
summary_1850 <- summary_1850 %>% mutate(freq = count/totalWords_1850) %>% 
  select(word, freq)
#generating the wordcloud; freq supplies the word weights, and scale is c(largest, smallest) text size
wordcloud(words = summary_1850$word, freq = summary_1850$freq,
          min.freq = 0,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(6, 2))

#1880s wordcloud
summary_1880 <- words_summary %>% filter(interval == 1880)
totalWords_1880 <- sum(summary_1880$count)
summary_1880 <- summary_1880 %>% mutate(freq = count/totalWords_1880) %>%
  head(30) %>% select(word, freq)
wordcloud(words = summary_1880$word, freq = summary_1880$freq,
          min.freq = 0,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(4.8, 1.8))

# wordcloud 1910s
summary_1910 <- words_summary %>% filter(interval == 1910)
totalWords_1910 <- sum(summary_1910$count)
summary_1910 <- summary_1910 %>% mutate(freq = count/totalWords_1910) %>%
  head(30) %>% select(word, freq)
wordcloud(words = summary_1910$word, freq = summary_1910$freq,
          min.freq = 0,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(5.5, 1.8))

#1940s wordcloud
summary_1940 <- words_summary %>% filter(interval == 1940)
totalWords_1940 <- sum(summary_1940$count)
summary_1940 <- summary_1940 %>% mutate(freq = count/totalWords_1940) %>%
  head(30) %>% select(word, freq)
wordcloud(words = summary_1940$word, freq = summary_1940$freq,
          min.freq = 0,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(6, 2))

#1970s wordcloud
summary_1970 <- words_summary %>% filter(interval == 1970)
totalWords_1970 <- sum(summary_1970$count)
summary_1970 <- summary_1970 %>% mutate(freq = count/totalWords_1970) %>%
  head(30) %>% select(word, freq)
wordcloud(words = summary_1970$word, freq = summary_1970$freq,
          min.freq = 0,
          colors = brewer.pal(n = 5, name = "Dark2"),
          random.color = TRUE,
          random.order = FALSE, scale = c(6, 2))

Because the word clouds are a little tricky to compare, the last plot I made elaborates on their differences and similarities. The scaling and relative frequencies shown are not fully reliable, but this plot shows some of the change I’m interested in.

#frame with the words used in the wordcloud for each interval
all_summary <- summary_1850 %>% rbind(summary_1880) %>% 
  rbind(summary_1910) %>% rbind(summary_1940) %>% rbind(summary_1970) 

# using manual categorization to assign each word to a category based on the
#     general context of the word

people <- c("captain", "doctor", "man", "men","mr","people","professor","sir")
verbs <- c("asked", "began", "he'd", "knew", "looked", "made", "make", "replied", "thought")
descriptors <- c("black", "good", "great", "half", "long", "suddenly")
nouns <- c("air", "alien", "back", "car", "eyes", "face", 
           "feet", "hand", "hands", "head",  "mind", "city","door")
space_time <- c("earth", "planet",  "ship","space", "sun", "world", "years", 
                "time", "light","past", "day","night")
other_nouns <- c("life", "love","room","surface","thing","things","war","work")

#adding a column with the categorization to the data frame
all_summary <- all_summary %>% 
  mutate(category = case_when(word %in% people ~ "people",
                              word %in% verbs ~ "verbs",
                              word %in% descriptors ~ "descriptors",
                              word %in% nouns ~ "nouns",
                              word %in% space_time ~ "space_time",
                              word %in% other_nouns ~ "other_nouns",
                              TRUE ~ "other"))

# making an interactive ggplot using plotly
interactive_timePlot <- ggplot(data = all_summary, aes(x = interval, y = freq, color = word)) +
  geom_point() +
  geom_line() +
  facet_wrap(~category) +
  labs(title="Subject Matter Over Time")

ggplotly(interactive_timePlot)

As can be seen in the plot, there are some interesting changes that occur over the course of the five intervals, and these could likely be evaluated further with more precise text analysis. The most frequently appearing verbs are similar across the intervals, but the variations observed in this broad view could potentially be interesting to look at further. The words “war” and “alien” don’t appear as top words until the 1970s interval, while words like “planet,” “time,” and “good” appear in all of them.

Conclusions

This project used a large data set of over 1,000 full-text public domain documents in an attempt to learn about the broad thematic patterns of science fiction over the last 150 years, with the hypothesis that the results would show shifting areas of focus based on the political, social, and technological context of the time the texts were written. While the results were not as conclusive as one might have hoped, there is a foundation for more specific inquiry in the future. One potential avenue for exploring the data further is to create a more extensive system for categorizing the words in these texts. Similar to how emotional lexicons can be used to map emotional associations onto words for simpler analysis, creating a source that associates certain words with categories could help streamline the process of looking at trends over time.
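As a sketch of what that could look like, a small word-to-category lexicon could be joined to the word counts in the same way tidytext sentiment lexicons are joined; the lexicon contents below are hypothetical, and a real version would need to be far more extensive.

# hypothetical category lexicon; a real version would be much larger or drawn
# from an external source
category_lexicon <- tibble::tribble(
  ~word,     ~category,
  "planet",  "space_time",
  "ship",    "space_time",
  "war",     "conflict",
  "captain", "people",
  "doctor",  "people"
)

# joining the lexicon assigns a category to every matched word in one step,
# then the text-level top-20 entries are counted per interval and category
top20Words_wDates %>%
  inner_join(category_lexicon, by = "word") %>%
  count(interval, category, sort = TRUE)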

It is also crucial that a more accurate source be found for assigning publication dates to texts. The degree of error introduced by the assumptions used to estimate when something was written makes any observations stand on very shaky ground. Since the authors’ Wikipedia pages are included in the gutenberg_authors data frame, an effort could be made to use them as a source of more accurate time information. Given more accurate publication dates, some of the areas I would be interested in exploring further are trends in the most frequently occurring verbs, what sorts of occupations/roles are mentioned at different times, and how the frequency of abstract nouns (love, peace, hope) changes over time.
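One rough way to start on that is sketched below using the rvest and stringr packages (an assumption on my part; the year-extraction heuristic is intentionally crude and would need to target the bibliography sections of the pages).

library(rvest)
library(stringr)

# hypothetical helper: collect four-digit years mentioned on an author's
# Wikipedia page, taken from the wikipedia column of gutenberg_authors
get_page_years <- function(url) {
  page_text <- read_html(url) %>%
    html_elements("p, li") %>%
    html_text()
  years <- str_extract(page_text, "\\b(18|19|20)\\d{2}\\b")
  as.integer(na.omit(years))
}

# example (commented out to avoid a live web request):
# get_page_years(gutenberg_authors$wikipedia[1])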

The algorithm used to generate the most frequently occurring words in the data set could easily be adapted to perform textual analysis with other parameters as well. Changing what words are filtered for, where the list of texts to look at originates, and what is chosen as an output could allow many of these questions to be investigated more precisely, or a different set of questions entirely could be evaluated. Overall, this project has established a good jumping-off point for textual analysis within the gutenbergr data set with the potential to be adapted for other applications as well.
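For instance, the per-text loop could be wrapped in a function whose arguments control the stop-word list and how many words are kept per text; this is a sketch only (the function name and parameters are my own), reusing objects defined earlier in the document.

# hypothetical wrapper around the per-text frequency analysis
top_words_by_text <- function(ids, n_top = 20, stopwords = swd_list) {
  results <- data.frame()
  for (i in ids) {
    # download and tokenize one text, then keep its n_top most frequent words
    text_i <- gutenberg_download(i, meta_fields = "title")
    counts_i <- text_i %>%
      group_by(gutenberg_id) %>%
      unnest_tokens(word, text) %>%
      filter(!word %in% stopwords) %>%
      count(word) %>%
      arrange(desc(n)) %>%
      head(n_top)
    results <- rbind(results, counts_i)
  }
  results
}

# e.g., the ten most frequent words per text for the first five titles:
# top_words_by_text(head(text_IDs, 5), n_top = 10)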

References

The Wright Stuff: How Man Learned to Fly. (n.d.). Retrieved April 16, 2022, from https://www.thoughtco.com/history-of-flight-the-wright-brothers-1992681
Timeline of Computer History. (n.d.). Retrieved April 16, 2022, from https://www.computerhistory.org/timeline/computers/
Writing the Future: A Timeline of Science Fiction Literature. (n.d.). Retrieved April 16, 2022, from https://www.bbc.co.uk/teach/writing-the-future-a-timeline-of-science-fiction-literature/zjfv6v4


