class: center, middle

### Text Analysis with `tidytext`

<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 9 | Spring 2021]

---

## Announcements/Reminders

* Mini Project 2 is due today.
    + Will start grading at noon on Sunday.

* Lab 6 is posted.

---

## Regular Expression Recap

* A concise language for describing patterns in strings.
    + But not super easy to read.
    + Good to have cheatsheets and the internet for help!

* Will post the key for Tuesday's handout to the shared folder.

---

## Pattern Matching Recap

* Functions to take **action** based on our regular expression pattern matching.

* Detect [pattern]() with:
    + `str_detect()`
    + `str_subset()`
    + `str_count()`

* Extract [pattern]() with:
    + `str_extract()` and `str_extract_all()`

* Replace [pattern]() with:
    + `str_replace()` and `str_replace_all()`

* Split by [pattern]() with:
    + `str_split()`

---

## Goals for Today

Now that we know how to handle text/strings as data, let's do some text analysis with `tidytext`.

Topics:

* Tokenizing to a tidy format
* Word frequencies
* Word clouds
* Sentiment analysis

---

## Recap: Tidy Data

What makes a dataset tidy?

--

<img src="img/tidyRules.png" width="80%" />

* Each column is a single variable.
* Each row is a unique observation.
* Each value has its own cell.

---

## Is Hey Jude Tidy?


```r
library(genius)
hey_jude <- genius_lyrics(artist = "The Beatles", song = "Hey Jude")
hey_jude
```

```
## # A tibble: 53 x 3
##    track_title  line lyric
##    <chr>       <int> <chr>
##  1 Hey Jude        1 Hey Jude, don't make it bad
##  2 Hey Jude        2 Take a sad song and make it better
##  3 Hey Jude        3 Remember to let her into your heart
##  4 Hey Jude        4 Then you can start to make it better
##  5 Hey Jude        5 Hey Jude, don't be afraid
##  6 Hey Jude        6 You were made to go out and get her
##  7 Hey Jude        7 The minute you let her under your skin
##  8 Hey Jude        8 Then you begin to make it better
##  9 Hey Jude        9 And anytime you feel the pain, hey Jude, refrain
## 10 Hey Jude       10 Don't carry the world upon your shoulders
## # … with 43 more rows
```

---

## Tidy Text

* A data table with one token per row.

--

* **Token**: a meaningful unit of text
    + What is the unit for `hey_jude`?


```r
hey_jude
```

```
## # A tibble: 53 x 3
##    track_title  line lyric
##    <chr>       <int> <chr>
##  1 Hey Jude        1 Hey Jude, don't make it bad
##  2 Hey Jude        2 Take a sad song and make it better
##  3 Hey Jude        3 Remember to let her into your heart
##  4 Hey Jude        4 Then you can start to make it better
##  5 Hey Jude        5 Hey Jude, don't be afraid
##  6 Hey Jude        6 You were made to go out and get her
##  7 Hey Jude        7 The minute you let her under your skin
##  8 Hey Jude        8 Then you begin to make it better
##  9 Hey Jude        9 And anytime you feel the pain, hey Jude, refrain
## 10 Hey Jude       10 Don't carry the world upon your shoulders
## # … with 43 more rows
```

---

## Tidy Text

* A data table with one token per row.

* **Token**: a meaningful unit of text
    + What is the unit for `hey_jude`?

* Other common tokens are words, sentences, and paragraphs (see the sketch below).

* Some text analysis should be done on text data in a non-tidy format.
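As a preview of `unnest_tokens()` (introduced on the next slide), here is a hedged sketch of tokenizing at the sentence level; lyric lines rarely carry sentence punctuation, so this mostly returns whole lines:

```r
# A sketch, not from the original handout: changing `token` re-tokenizes
# the same lyrics at a different level
library(tidytext)
hey_jude %>%
  unnest_tokens(output = sentence, input = lyric, token = "sentences")
```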
---

## Tidying Text Data

* **Tokenize**: break text into individual tokens


```r
library(tidytext)
hey_jude_words <- hey_jude %>%
  unnest_tokens(output = word, input = lyric, token = "words")
hey_jude_words
```

```
## # A tibble: 544 x 3
##    track_title  line word
##    <chr>       <int> <chr>
##  1 Hey Jude        1 hey
##  2 Hey Jude        1 jude
##  3 Hey Jude        1 don't
##  4 Hey Jude        1 make
##  5 Hey Jude        1 it
##  6 Hey Jude        1 bad
##  7 Hey Jude        2 take
##  8 Hey Jude        2 a
##  9 Hey Jude        2 sad
## 10 Hey Jude        2 song
## # … with 534 more rows
```

---

## Tidying Text Data

* What is an `ngram`?


```r
hey_jude_ngram <- hey_jude %>%
  unnest_tokens(output = ngram, input = lyric, token = "ngrams", n = 2)
hey_jude_ngram
```

```
## # A tibble: 491 x 3
##    track_title  line ngram
##    <chr>       <int> <chr>
##  1 Hey Jude        1 hey jude
##  2 Hey Jude        1 jude don't
##  3 Hey Jude        1 don't make
##  4 Hey Jude        1 make it
##  5 Hey Jude        1 it bad
##  6 Hey Jude        2 take a
##  7 Hey Jude        2 a sad
##  8 Hey Jude        2 sad song
##  9 Hey Jude        2 song and
## 10 Hey Jude        2 and make
## # … with 481 more rows
```

---

## Word Frequencies

* A common text mining task.

* What have we learned about the frequency of words in "Hey Jude"?

* Which words in this list do we maybe not care about?


```r
hey_jude_words %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 94 x 2
##    word       n
##    <chr>  <int>
##  1 na       204
##  2 jude      43
##  3 hey       27
##  4 yeah      18
##  5 it        17
##  6 naa       17
##  7 you       13
##  8 better    12
##  9 make      12
## 10 to        10
## # … with 84 more rows
```

---

## Word Frequencies

* **Stop words**: common words that are not useful for analysis


```r
data("stop_words")
stop_words
```

```
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>
##  1 a           SMART
##  2 a's         SMART
##  3 able        SMART
##  4 about       SMART
##  5 above       SMART
##  6 according   SMART
##  7 accordingly SMART
##  8 across      SMART
##  9 actually    SMART
## 10 after       SMART
## # … with 1,139 more rows
```

---

## Word Frequencies

* I want to remove from `hey_jude_words` the rows that contain stop words.
    + Get to learn a new `join`!

--


```r
hey_jude_words <- hey_jude_words %>%
  anti_join(stop_words, by = "word")
```

---

## Word Frequencies

* What graph should we construct?


```r
hey_jude_words %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 43 x 2
##    word         n
##    <chr>    <int>
##  1 na         204
##  2 jude        43
##  3 hey         27
##  4 yeah        18
##  5 naa         17
##  6 ma           8
##  7 judy         5
##  8 bad          3
##  9 begin        3
## 10 remember     3
## # … with 33 more rows
```

---

## Word Frequencies

* Which `forcats` function should we use to reorder the bars?


```r
hey_jude_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  ggplot(mapping = aes(x = word, y = n)) +
  geom_col() +
  coord_flip()
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-10-1.png" width="360" />

---

## Word Frequencies

* Which `forcats` function should we use to reorder the bars?

```r
hey_jude_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(mapping = aes(x = word, y = n)) +
  geom_col() +
  coord_flip()
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-11-1.png" width="360" />

---

## Let's Get More Data


```r
white_album <- genius_album(artist = "The Beatles",
                            album = "The Beatles ('The White Album')")
```

<img src="img/TheBeatles.jpg" width="320" />

---


```r
white_album %>%
  unnest_tokens(output = word, input = lyric, token = "words") %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  filter(n > 12) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(mapping = aes(x = word, y = n)) +
  geom_col() +
  coord_flip()
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-14-1.png" width="360" />

---

I have so many questions.

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-15-1.png" width="360" />

--

* Do The Beatles really sing that much about a bungalow?

--

* What is "ob" or "mi"?

---

## bungalow Problem?


```r
str_subset(string = white_album$lyric, pattern = "bungalow")
```

```
## character(0)
```

---

## Bungalow


```r
str_subset(string = white_album$lyric, pattern = "Bungalow")
```

```
##  [1] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
##  [3] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
##  [5] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
##  [7] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
##  [9] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [11] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [13] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [15] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [17] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [19] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [21] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [23] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [25] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
## [27] "Hey, Bungalow Bill"                "What did you kill, Bungalow Bill?"
```

---

## mi

* How can we use regular expressions to get the word(?) "mi", not words that contain "mi"?


```r
str_subset(string = white_album$lyric, pattern = "mi|Mi")
```

```
##  [1] "Flew in from Miami Beach BOAC"
##  [2] "That Georgia's always on my mi-mi-mi-mi-mi-mi-mi-mi-mind"
##  [3] "That Georgia's always on my mi-mi-mi-mi-mi-mi-mi-mi-mind"
##  [4] "Let me see you smile"
##  [5] "So let me see you smile again"
##  [6] "Won't you let me see you smile?"
## [7] "Deep in the jungle, where the mighty tiger lies" ## [8] "With every mistake, we must surely be learning" ## [9] "She's not a girl who misses much" ## [10] "The man in the crowd with the multicoloured mirrors" ## [11] "I'm so tired, my mind is on the blink" ## [12] "I'm so tired, my mind is set on you" ## [13] "For a little peace of mind" ## [14] "For a little peace of mind" ## [15] "I'd give you everything I've got for a little peace of mind" ## [16] "I'd give you everything I've got for a little peace of mind" ## [17] "I'd give you everything I've got for a little peace of mind" ## [18] "Now somewhere in the black mining hills of DakotaThere lived a young boy named Rocky Raccoon" ## [19] "I listen for your footsteps coming up the drive" ## [20] "Windy smile calls me" ## [21] "I can only speak my mind, Julia" ## [22] "Black cloud crossed my mind" ## [23] "Blue mist round my soul" ## [24] "Just a smile would lighten everything" ## [25] "I'm coming down fast, but I'm miles above you" ## [26] "I'm coming down fast, but don't let me break you" ## [27] "I'm coming down fast, but don't let me break you" ## [28] "She's coming down fast!" ## [29] "Coming down fast" ## [30] "How can I ever misplace you?" ## [31] "But if you want money for people with minds that hate" ## [32] "You better free your mind instead" ## [33] "You might not feel it now" ## [34] "The duchess of Kircaldy always smiling" ## [35] "I've missed all of that" ## [36] "Them for themming and when for whimming" ## [37] "Close your eyes and I'll close mine" ## [38] "Close your eyes and I'll close mine" ``` --- ## mi ```r str_subset(string = white_album$lyric, pattern = "\\b(mi|Mi)\\b") ``` ``` ## [1] "That Georgia's always on my mi-mi-mi-mi-mi-mi-mi-mi-mind" ## [2] "That Georgia's always on my mi-mi-mi-mi-mi-mi-mi-mi-mind" ``` --- ## ob ```r str_subset(string = white_album$lyric, pattern = "\\b(ob|Ob)\\b") ``` ``` ## [1] "Ob-la-di, ob-la-da, life goes on, brah" ## [2] "Ob-la-di, ob-la-da, life goes on, brah" ## [3] "Ob-la-di, ob-la-da, life goes on, brah" ## [4] "Ob-la-di, ob-la-da, life goes on, brah" ## [5] "Ob-la-di, ob-la-da, life goes on, brah" ## [6] "Yeah, ob-la-di, ob-la-da, life goes on, brah" ## [7] "Ob-la-di, ob-la-da, life goes on, brah" ## [8] "Yeah, ob-la-di, ob-la-da, life goes on, brah" ## [9] "Take Ob-la-di-bla-da" ## [10] "We all know Ob-La-Di-Bla-Da" ``` --- ## Wordcloud * What's the `geom`? * What are the `aes`thetics of the `geom`? * How are the variables mapped to the `aes`thetics? <img src="slidesWk9Th_files/figure-html/unnamed-chunk-21-1.png" width="504" style="display: block; margin: auto;" /> --- ## Wordcloud ```r library(wordcloud) library(RColorBrewer) pal <- brewer.pal(9, "Set1") white_album_count <- white_album %>% unnest_tokens(output = word, input = lyric, token = "words") %>% anti_join(stop_words, by = "word") %>% count(word, sort = TRUE) ``` --- ```r white_album_count %>% with(wordcloud(word, n, colors = pal, min.freq = 7, random.order = FALSE, scale = c(4, 1))) ``` <img src="slidesWk9Th_files/figure-html/unnamed-chunk-23-1.png" width="576" style="display: block; margin: auto;" /> * Issue with the color palette? 
---


```r
library(viridis)
pal <- magma(n = 30, direction = -1)

white_album_count %>%
  with(wordcloud(word, n, scale = c(4, 1), colors = pal,
                 min.freq = 7, random.order = FALSE))
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-24-1.png" width="576" style="display: block; margin: auto;" />

---

## Comparisons Across Albums


```r
sweetener <- genius_album(artist = "Ariana Grande", album = "Sweetener") %>%
  mutate(album = "Sweetener")

thank_u_next <- genius_album(artist = "Ariana Grande", album = "thank u next") %>%
  mutate(album = "thank_u_next")
```

<img src="img/ariana_grande.png" width="80%" />

---

## Word Frequencies Across Albums


```r
ariana_grande <- bind_rows(sweetener, thank_u_next) %>%
  unnest_tokens(output = word, input = lyric, token = "words") %>%
  anti_join(stop_words, by = "word") %>%
  filter(!(word %in% c("ayy", "da", "eh"))) %>%
  count(album, word) %>%
  group_by(album) %>%
  mutate(prop = n/sum(n))
ariana_grande
```

```
## # A tibble: 887 x 4
## # Groups:   album [2]
##    album     word       n     prop
##    <chr>     <chr>  <int>    <dbl>
##  1 Sweetener afraid     1 0.000560
##  2 Sweetener ah        13 0.00728
##  3 Sweetener ahh        3 0.00168
##  4 Sweetener air        7 0.00392
##  5 Sweetener align      1 0.000560
##  6 Sweetener angel      2 0.00112
##  7 Sweetener ariana     1 0.000560
##  8 Sweetener asleep     1 0.000560
##  9 Sweetener awake      1 0.000560
## 10 Sweetener aww        1 0.000560
## # … with 877 more rows
```

---

## Word Frequencies Across Albums


```r
ariana_grande_wider <- ariana_grande %>%
  select(album, word, prop) %>%
  pivot_wider(names_from = album, values_from = prop)
ariana_grande_wider
```

```
## # A tibble: 758 x 3
##    word   Sweetener thank_u_next
##    <chr>      <dbl>        <dbl>
##  1 afraid  0.000560    NA
##  2 ah      0.00728      0.00992
##  3 ahh     0.00168     NA
##  4 air     0.00392     NA
##  5 align   0.000560     0.000661
##  6 angel   0.00112      0.00397
##  7 ariana  0.000560    NA
##  8 asleep  0.000560    NA
##  9 awake   0.000560    NA
## 10 aww     0.000560    NA
## # … with 748 more rows
```

---

## Word Frequencies Across Albums


```r
ariana_grande %>%
  group_by(album) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  ungroup() %>%
  mutate(word = factor(word),
         word = fct_reorder(word, n)) %>%
  ggplot(mapping = aes(x = word, y = n)) +
  geom_col() +
  facet_wrap(~album) +
  coord_flip()
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-28-1.png" width="540" />

---

## Word Frequencies Across Albums


```r
ariana_grande %>%
  group_by(album) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  ungroup() %>%
  mutate(word = factor(word),
         word = fct_reorder(word, n)) %>%
  ggplot(mapping = aes(x = word, y = n)) +
  geom_col() +
  facet_wrap(~album, scales = "free_y") +
  coord_flip()
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-29-1.png" width="540" />

---

### Word Frequencies Across Albums


```r
ariana_grande_wider %>%
  filter(Sweetener > 0.001, thank_u_next > 0.001) %>%
  ggplot(mapping = aes(x = Sweetener, y = thank_u_next, label = word)) +
  geom_text(size = 4, position = position_jitter(width = 0.08, height = 0.08)) +
  scale_x_log10() +
  scale_y_log10() +
  geom_abline()
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-30-1.png" width="432" />

---

## Sentiment Analysis

* Was `thank u next` a more negative album than `Sweetener`?

* Need to add a column that measures the sentiment of each token.
    + From [Bing Liu and collaborators](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
    + Does it generalize to other English-speaking countries or time periods?
    + `tidytext` ships this lexicon as `sentiments`:

```r
sentiments
```

```
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faces     negative
##  2 abnormal    negative
##  3 abolish     negative
##  4 abominable  negative
##  5 abominably  negative
##  6 abominate   negative
##  7 abomination negative
##  8 abort       negative
##  9 aborted     negative
## 10 aborts      negative
## # … with 6,776 more rows
```

---

## Sentiment Analysis

* Keep stop words this time.


```r
ariana_grande <- bind_rows(sweetener, thank_u_next) %>%
  unnest_tokens(output = word, input = lyric, token = "words") %>%
  count(album, word) %>%
  group_by(album) %>%
  mutate(prop = n/sum(n))
ariana_grande
```

```
## # A tibble: 1,403 x 4
## # Groups:   album [2]
##    album     word       n     prop
##    <chr>     <chr>  <int>    <dbl>
##  1 Sweetener a         89 0.0141
##  2 Sweetener about      8 0.00127
##  3 Sweetener above      3 0.000475
##  4 Sweetener afraid     1 0.000158
##  5 Sweetener after      3 0.000475
##  6 Sweetener again      2 0.000317
##  7 Sweetener ah        13 0.00206
##  8 Sweetener ahh        3 0.000475
##  9 Sweetener ain't     23 0.00364
## 10 Sweetener air        7 0.00111
## # … with 1,393 more rows
```

---

## Sentiment Analysis

What are the most common **negative words** on each album?


```r
ariana_grande %>%
  inner_join(sentiments, by = "word") %>%
  filter(sentiment == "negative") %>%
  arrange(desc(n))
```

```
## # A tibble: 106 x 5
## # Groups:   album [2]
##    album        word         n    prop sentiment
##    <chr>        <chr>    <int>   <dbl> <chr>
##  1 Sweetener    stole       32 0.00507 negative
##  2 Sweetener    bum         23 0.00364 negative
##  3 thank_u_next bad         19 0.00396 negative
##  4 Sweetener    darkness    18 0.00285 negative
##  5 Sweetener    twist       16 0.00253 negative
##  6 thank_u_next fake        16 0.00334 negative
##  7 thank_u_next shit        11 0.00229 negative
##  8 Sweetener    cry         10 0.00158 negative
##  9 Sweetener    hard        10 0.00158 negative
## 10 thank_u_next ruin        10 0.00208 negative
## # … with 96 more rows
```

---

## Sentiment Analysis

What are the most common **positive words** on each album?


```r
ariana_grande %>%
  inner_join(sentiments, by = "word") %>%
  filter(sentiment == "positive") %>%
  arrange(desc(n))
```

```
## # A tibble: 81 x 5
## # Groups:   album [2]
##    album        word       n    prop sentiment
##    <chr>        <chr>  <int>   <dbl> <chr>
##  1 thank_u_next like      45 0.00938 positive
##  2 Sweetener    like      43 0.00681 positive
##  3 thank_u_next love      41 0.00855 positive
##  4 thank_u_next thank     39 0.00813 positive
##  5 Sweetener    happy     25 0.00396 positive
##  6 thank_u_next good      19 0.00396 positive
##  7 Sweetener    love      17 0.00269 positive
##  8 thank_u_next smile     17 0.00354 positive
##  9 thank_u_next woo       13 0.00271 positive
## 10 Sweetener    better    12 0.00190 positive
## # … with 71 more rows
```

---

### What is the distribution of positive and negative words?

* Remember that words not in the lexicon are dropped!


```r
ariana_grande %>%
  inner_join(sentiments, by = "word") %>%
  group_by(album, sentiment) %>%
  summarize(n = sum(n)) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = album, y = prop, fill = sentiment)) +
  geom_col()
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-35-1.png" width="360" />

--

Issue with word-based sentiment analysis?
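--

One way to see the issue (a hedged sketch, not from the handout): word-level scoring ignores negation. Tokenizing into bigrams and joining the lexicon on the *second* word flags sentiment words that follow a negation term.

```r
# Hedged sketch: sentiment words preceded by a negation get scored with the
# wrong sign by word-level analysis; `separate()` comes from tidyr
bind_rows(sweetener, thank_u_next) %>%
  unnest_tokens(output = bigram, input = lyric, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("no", "not", "never", "ain't")) %>%
  inner_join(sentiments, by = c("word2" = "word")) %>%
  count(album, word1, word2, sentiment, sort = TRUE)
```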
---

## Sentiment Analysis


```r
thank_u_next %>%
  filter(str_detect(lyric, "love"))
```

```
## # A tibble: 34 x 5
##    track_n  line lyric                                    track_title album
##      <int> <int> <chr>                                    <chr>       <chr>
##  1       2     7 "I'ma scream and shout for what I love"  needy       thank_u_n…
##  2       2    11 "I'm obsessive and I love too hard"      needy       thank_u_n…
##  3       2    24 "I'ma scream and shout for what I love"  needy       thank_u_n…
##  4       2    28 "I'm obsessive and I love too hard"      needy       thank_u_n…
##  5       3     4 "You can say \"I love you\" through the… NASA        thank_u_n…
##  6       3    25 "Usually, I would love it if you stayed… NASA        thank_u_n…
##  7       3    50 "You can say \"I love you\" through the… NASA        thank_u_n…
##  8       4     8 "Love me, love me, baby"                 bloodline   thank_u_n…
##  9       4    12 "Get it like you love me"                bloodline   thank_u_n…
## 10       4    28 "I ain't lookin' for my one true love"   bloodline   thank_u_n…
## # … with 24 more rows
```

---

## Sentiment Analysis

* We should also try out other lexicons.


```r
nrc <- get_sentiments("nrc")
nrc
```

```
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 abacus      trust
##  2 abandon     fear
##  3 abandon     negative
##  4 abandon     sadness
##  5 abandoned   anger
##  6 abandoned   fear
##  7 abandoned   negative
##  8 abandoned   sadness
##  9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
```

---


```r
ariana_grande %>%
  inner_join(nrc, by = "word") %>%
  group_by(album, sentiment) %>%
  summarize(n = sum(n)) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(fill = album, y = prop, x = sentiment)) +
  geom_col(position = "dodge") +
  coord_flip()
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-38-1.png" width="504" />

---

## Measuring Differences: `tf_idf`

* tf (term frequency) = how often a word appears in a given text, as a proportion of that text's words

* idf (inverse document frequency) = log(number of texts / number of texts containing the word)

* tf `\(*\)` idf = a word's frequency within a text, downweighted for words that are common across texts

If we have 6 texts and "you" shows up in all of them, then tf `\(*\)` idf equals what?
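--

Since "you" appears in all 6 texts, idf = log(6/6) = log(1) = 0, so tf `\(*\)` idf = 0: a word that shows up everywhere carries no distinguishing weight, no matter how frequent it is.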
---

### Need Several Albums


```r
ts_albums <- c("Taylor Swift", "Fearless", "Speak Now",
               "1989", "Reputation", "Lover")

ts <- genius_album(artist = "Taylor Swift", album = ts_albums[1]) %>%
  mutate(album = ts_albums[1])

for (i in 2:length(ts_albums)) {
  next_album <- genius_album(artist = "Taylor Swift", album = ts_albums[i]) %>%
    mutate(album = ts_albums[i])
  ts <- bind_rows(ts, next_album)
}
```

<img src="img/taylor_swift.png" width="196" />

---

## Measuring Differences: `tf_idf`


```r
taylor_tidy <- ts %>%
  unnest_tokens(output = word, input = lyric, token = "words") %>%
  count(album, word, sort = TRUE) %>%
  filter(!(word %in% c("la", "ey", "e", "di", "da", "eeh", "ooh", "aah", "ah"))) %>%
  bind_tf_idf(word, album, n)

taylor_tidy %>%
  arrange(desc(tf_idf))
```

```
## # A tibble: 4,952 x 6
##    album      word         n      tf   idf  tf_idf
##    <chr>      <chr>    <int>   <dbl> <dbl>   <dbl>
##  1 1989       yet         64 0.0109  1.10  0.0120
##  2 1989       woods       39 0.00664 1.79  0.0119
##  3 Lover      daylight    40 0.00595 1.79  0.0107
##  4 Speak Now  grow        21 0.00374 1.79  0.00669
##  5 1989       york        30 0.00511 1.10  0.00561
##  6 Reputation getaway     22 0.00306 1.79  0.00548
##  7 1989       welcome     29 0.00494 1.10  0.00543
##  8 1989       shake       78 0.0133  0.405 0.00539
##  9 1989       blood       16 0.00272 1.79  0.00488
## 10 Fearless   belong      12 0.00267 1.79  0.00478
## # … with 4,942 more rows
```

---


```r
taylor_tidy %>%
  mutate(album = factor(album, levels = ts_albums)) %>%
  group_by(album) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup() %>%
  mutate(word = fct_reorder(word, tf_idf)) %>%
  ggplot(aes(x = word, y = tf_idf, fill = album)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~album, ncol = 3, scales = "free")
```

<img src="slidesWk9Th_files/figure-html/unnamed-chunk-42-1.png" width="648" />

---

## Further Text Analysis Topics

* Topic models: Latent Dirichlet allocation

* Sentence-level sentiment analysis with `coreNLP`, `cleanNLP`, and/or `sentimentr`

---

### National

---

### ~~National~~

---

### International Wear your Hat to Zoom Day 🎈: Thursday, April 1st

--

😜 Lesser-known holiday on the same day: April Fool's Day

--

**Any** headwear is welcome:

--

🧢 Hats

🎧 Headphones

👱‍♀️ Wigs

⛵ [Paper sailor hats](https://lifestyle.howstuffworks.com/crafts/recycled/how-to-make-paper-sailor-hat.htm)

🍩 Inflatable donut

--

Encouraged to wear the 🤠 all day.

--

Participation is optional.