class: center, middle

### String Manipulation with `stringr`

<img src="img/hero_wall_pink.png" width="800px"/>

### Kelly McConville

.large[Math 241 | Week 9 | Spring 2021]

---

## Announcements/Reminders

* Mini Project 2 is due Thursday.
    + Will start grading at noon on Sunday.

---

## And here we are!

**String**

--

```r
x <- "cat"
```

**Character vector**

--

```r
x <- c("dog", "cat", "mouse")
```

--

**Factor vector**

--

```r
x <- factor(x)
levels(x)
```

```
## [1] "cat" "dog" "mouse"
```

---

## Goals for Today: [Stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html)

* Learn how to handle character vectors!
    + Character manipulation
    + Pattern matching

* Let's look at some of the functionalities of `stringr` using a character vector of song lyrics.

```r
library(stringr)
```

---

## Our Toy Lyric

* Song?
* Artist?

```r
lyric <- c("But I would walk 500 miles,",
           "And I would walk 500 more,",
           "Just to be the man who walks a 1000 miles,",
           "To fall down at your door")
lyric
```

```
## [1] "But I would walk 500 miles,"
## [2] "And I would walk 500 more,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

--

<img src="img/The_Proclaimers_500_Miles.jpg" width="133" />

---

## String Length

```r
length(lyric)
```

```
## [1] 4
```

--

```r
str_length(lyric)
```

```
## [1] 27 26 42 25
```

---

## Accessing and Replacing

```r
str_sub(string = lyric[1], start = 18, end = 20)
```

```
## [1] "500"
```

--

```r
str_sub(string = lyric[1], start = 18, end = 20) <- "2"
```

--

```r
lyric
```

```
## [1] "But I would walk 2 miles,"
## [2] "And I would walk 500 more,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

---

## Change Cases

```r
str_to_upper(lyric)
```

```
## [1] "BUT I WOULD WALK 2 MILES,"
## [2] "AND I WOULD WALK 500 MORE,"
## [3] "JUST TO BE THE MAN WHO WALKS A 1000 MILES,"
## [4] "TO FALL DOWN AT YOUR DOOR"
```

```r
str_to_title(lyric)
```

```
## [1] "But I Would Walk 2 Miles,"
## [2] "And I Would Walk 500 More,"
## [3] "Just To Be The Man Who Walks A 1000 Miles,"
## [4] "To Fall Down At Your Door"
```

```r
str_to_lower(lyric)
```

```
## [1] "but i would walk 2 miles,"
## [2] "and i would walk 500 more,"
## [3] "just to be the man who walks a 1000 miles,"
## [4] "to fall down at your door"
```

---

## Sorting

```r
str_sort(lyric)
```

```
## [1] "And I would walk 500 more,"
## [2] "But I would walk 2 miles,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

---

## Pattern Matching

* Learn to:
    + Detect [pattern]()
    + Extract [pattern]()
    + Replace [pattern]()
    + Split [pattern]()

* Look at the [`stringr` cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf).

---

## Common Goal: Match a particular pattern

* I want to match the pattern `500` from `lyric`.

```r
lyric
```

```
## [1] "But I would walk 2 miles,"
## [2] "And I would walk 500 more,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

--

```r
str_view_all(string = lyric, pattern = "500")
```
---

## Let's make it more general.

* I want to locate all the numbers.

```r
lyric
```

```
## [1] "But I would walk 2 miles,"
## [2] "And I would walk 500 more,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

--

```r
str_view_all(lyric, "500|1000|2")
```

---

How should we modify the code to locate all the numbers from these lyrics of various songs?

```r
lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")
```

```r
str_view_all(lyrics, "500|1000|2")
```

---

How should we modify the code to locate all the numbers from these lyrics of various songs?

```r
lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")
```

```r
str_view_all(lyrics, "500|1000|0|2000|1|64|2|5|3|4")
```
---

But now imagine you had a very long vector and wanted to locate any number.

```r
str_view_all(lyrics, "1|2|3|4...")
```

Not a good approach!

---

## Regular Expressions

* A concise language for describing patterns in strings.
    + But not super easy to read.
    + Good to have cheatsheets and the internet for help!

* Neat RStudio Addin to help: [`RegExplain`](https://www.garrickadenbuie.com/project/regexplain/)

---

## Regular Expressions

* `[:digit:]` is a particular [Character Class](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html)

* A character class names a set of characters; the pattern matches any one character from that set.

```r
str_view_all(lyrics, "[:digit:]")
```
---

## Regular Expressions

* `+` is a quantifier

* `+`: One or more

```r
str_view_all(lyrics, "[:digit:]+")
```

---

## Regular Expressions

What does `{n}` do?

```r
str_view_all(lyrics, "[:digit:]{2}")
```
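---

The two quantifier patterns above can also be compared as plain text. The sketch below is an illustration rather than part of the handout: it swaps in `str_extract_all()`, which returns the matches instead of highlighting them, and it redefines `lyrics` so that it runs on its own.

```r
library(stringr)

lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")

# `+`: each run of consecutive digits is returned as one match
runs <- str_extract_all(lyrics, "[:digit:]+")
runs[[2]]
# [1] "2000" "0"    "0"

# `{2}`: each match is exactly two digits, so "2000" splits in two
# and the single digits are skipped
pairs <- str_extract_all(lyrics, "[:digit:]{2}")
pairs[[2]]
# [1] "20" "00"
```

Lyrics that contain only single digits, such as the third one, return an empty character vector under `{2}`.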
---

## Quantifiers

* `?`: 0 or 1
* `*`: 0 or more
* `+`: 1 or more
* `{n}`: Exactly n
* `{n,}`: n or more
* `{0,m}`: at most m
* `{n,m}`: between n and m

---

## Character Classes

* What pattern does this regexp match?

```r
str_view_all(lyrics, "[:alpha:]")
```
---

## Character Classes

* What pattern does this regexp match?

```r
str_view_all(lyrics, "[:upper:]")
```

---

## Character Classes

* What pattern does this regexp match?

```r
str_view_all(lyrics, "[:alnum:]")
```

---

## Character Classes

* What pattern does this regexp match?

```r
str_view_all(lyrics, "[:punct:]")
```

---

## Character Classes

* What pattern does this regexp match?

```r
str_view_all(lyrics, "[:graph:]")
```

---

## Character Classes

* What pattern does this regexp match?

```r
str_view_all(lyrics, "[:space:]")
```

---

## Character Classes

* Can also create your own.

* What pattern does this regexp match?

```r
str_view_all(lyrics, "[aeiou]")
```

---

## Other Handy Regexps

* What pattern does this regexp match?

* Why do we need an extra `\`?

```r
str_view_all(lyrics, "\\d")
```
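---

One way to see why the extra `\` is needed is to look at what the R string actually contains. This is a small sketch added for illustration; the variable names `pattern` and `digits` are made up for the example.

```r
library(stringr)

pattern <- "\\d"

# The string holds two characters: one backslash and the letter d
str_length(pattern)
# [1] 2

# writeLines() shows the text that is handed to the regex engine
writeLines(pattern)
# \d

# The engine reads \d as "any digit"
digits <- str_extract_all("When I'm 64", pattern)[[1]]
digits
# [1] "6" "4"
```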
---

## Escaping Meta Characters

* `\` is a special character in R strings: it starts an escape sequence, so a regexp backslash must be typed as `\\`.

* You can see all the special characters that need escaping in the help page for `'`:

```r
?"'"
```

---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, ".w.")
```
---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, "\\W")
```

---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, "Whe(n|re)")
```
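---

The last three patterns can be checked on individual lyrics. This is a sketch for illustration only: it uses `str_extract()` and `str_extract_all()` in place of the viewer, and the result names are hypothetical.

```r
library(stringr)

lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")

# `.` matches any single character, so ".w." is a "w" plus its neighbours
around_w <- str_extract_all(lyrics[1], ".w.")[[1]]
around_w
# [1] " wo" " wa"

# `\\W` matches anything that is not a letter, digit, or underscore
non_word <- str_extract_all(lyrics[4], "\\W")[[1]]
non_word
# [1] " " "'" " "

# Parentheses limit the alternation to the end of the word
whe <- str_extract(lyrics, "Whe(n|re)")
whe
# [1] NA      NA      NA      "When"  "Where" NA
```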
---

## Groups

* What pattern does this regexp match?

```r
str_view_all(lyrics, "(\\d)\\1")
```

---

## Groups

* What pattern does this regexp match?

```r
str_view_all(lyrics, "(\\d)\\1\\1")
```

---

## Groups

* What pattern does this regexp match?

```r
str_view_all(lyrics, "([:alnum:])(\\s)[:alnum:]+\\2\\1")
```
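---

A backreference such as `\\1` repeats whatever the first parenthesised group matched. The sketch below is an illustrative addition: it applies the three group patterns with `str_extract()`, which returns the first match per string or `NA`.

```r
library(stringr)

lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")

# A digit followed immediately by the same digit
doubles <- str_extract(lyrics, "(\\d)\\1")
doubles
# [1] "00" "00" NA   NA   NA   NA

# The same digit three times in a row
triples <- str_extract(lyrics, "(\\d)\\1\\1")
triples
# [1] NA    "000" NA    NA    NA    NA

# Same character and same whitespace on both sides of a word
sandwich <- str_extract(lyrics[5], "([:alnum:])(\\s)[:alnum:]+\\2\\1")
sandwich
# [1] "2 and 2"
```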
---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, "\\b")
```

---

## Anchors

* What pattern does this regexp match?

```r
str_view_all(lyrics, "^\\d+")
```

---

## Anchors

* What pattern does this regexp match?

```r
str_view_all(lyrics, "\\d+$")
```
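---

Anchors tie a pattern to the start (`^`) or end (`$`) of each string. As an illustrative sketch, `str_detect()` turns the two anchor patterns into logical vectors; the result names are hypothetical.

```r
library(stringr)

lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")

# Which lyrics start with a number?
starts_with_number <- str_detect(lyrics, "^\\d+")
starts_with_number
# [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE

# Which lyrics end with a number?
ends_with_number <- str_detect(lyrics, "\\d+$")
ends_with_number
# [1] FALSE FALSE FALSE  TRUE  TRUE FALSE
```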
---

## Alternates

* What pattern does this regexp match?

```r
str_view_all(lyrics, "[^aeiou]")
```

---

## Alternates

* What pattern does this regexp match?

```r
str_view_all(lyrics, "o[m-z]")
```
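---

Custom classes can list characters, exclude them with a leading `^`, or give a range with `-`. The sketch below is an added illustration on two of the lyrics; the variable names are made up for the example.

```r
library(stringr)

x <- "When I'm 64"

# Lowercase vowels only: the capital "I" does not count
n_vowels <- str_count(x, "[aeiou]")
n_vowels
# [1] 1

# `^` inside the brackets negates the class, so everything else matches,
# including spaces, digits, and the apostrophe
n_other <- str_count(x, "[^aeiou]")
n_other
# [1] 10

# An "o" followed by any letter from m through z
o_then_mz <- str_extract_all("But I would walk 500 miles", "o[m-z]")[[1]]
o_then_mz
# [1] "ou"
```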
---

## Look Arounds

```r
str_view_all(lyrics, "[:alpha:]+(?= \\d+)")
```

---

## Look Arounds

```r
str_view_all(lyrics, "(?<=\\d )[:alpha:]+")
```
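---

Look arounds check the text next to a match without including it in the match. The sketch below is an added illustration that applies both patterns with `str_extract()`; the names `before_number` and `after_number` are hypothetical.

```r
library(stringr)

lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")

# Look ahead: letters that are followed by a space and a number
before_number <- str_extract(lyrics, "[:alpha:]+(?= \\d+)")
before_number
# [1] "walk"  NA      NA      "m"     "Where" NA

# Look behind: letters that are preceded by a digit and a space
after_number <- str_extract(lyrics, "(?<=\\d )[:alpha:]+")
after_number
# [1] "miles" "party" "is"    NA      "and"   NA
```

In both cases the number and the space are required for a match but are left out of the extracted text.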
---

## Your Turn!

* Work on the `regexp.Rmd`.
    + In the Handouts folder in our shared RStudio folder (`/home/courses/math241s21/Handouts/regexp`).

---

## Pattern Matching

* `str_view_all()` is a nice helper function.

* Now we need to learn functions that take **action** based on our regular expression pattern matching.

* Learn to:
    + Detect [pattern]()
    + Extract [pattern]()
    + Replace [pattern]()
    + Split [pattern]()

---

## New Example: Whole Song from [`genius`](https://github.com/josiahparry/genius)

```r
library(genius)
hey_jude <- genius_lyrics(artist = "The Beatles",
                          song = "Hey Jude")
hey_jude
```

```
## # A tibble: 53 x 3
## track_title line lyric
## <chr> <int> <chr>
## 1 Hey Jude 1 Hey Jude, don't make it bad
## 2 Hey Jude 2 Take a sad song and make it better
## 3 Hey Jude 3 Remember to let her into your heart
## 4 Hey Jude 4 Then you can start to make it better
## 5 Hey Jude 5 Hey Jude, don't be afraid
## 6 Hey Jude 6 You were made to go out and get her
## 7 Hey Jude 7 The minute you let her under your skin
## 8 Hey Jude 8 Then you begin to make it better
## 9 Hey Jude 9 And anytime you feel the pain, hey Jude, refrain
## 10 Hey Jude 10 Don't carry the world upon your shoulders
## # … with 43 more rows
```

---

## Detect

```r
str_detect(string = hey_jude$lyric, pattern = "Jude")
```

```
## [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [13] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [25] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [49] TRUE FALSE TRUE TRUE TRUE
```

---

## Detect

```r
str_subset(string = hey_jude$lyric, pattern = "Jude")
```

```
## [1] "Hey Jude, don't make it bad"
## [2] "Hey Jude, don't be afraid"
## [3] "And anytime you feel the pain, hey Jude, refrain"
## [4] "Hey Jude, don't let me down"
## [5] "Remember (Hey Jude) to let her into your heart"
## [6] "So let it out and let it in, hey Jude, begin"
## [7] "And don't you know that it's just you, hey Jude, you'll do"
## [8] "Hey Jude, don't make it bad"
## [9] "Naa na na na na na na, na na na na, hey Jude"
## [10] "Naa na na na na na na, na na na na, hey Jude"
## [11] "Naa na na na na na na, na na na na, hey Jude"
## [12] "Naa na na na na na na, na na na na, hey Jude"
## [13] "(Jude Judy Judy Judy Judy Judy owwwww wowww)"
## [14] "Naa na na na na na na (Na na na), na na na na, hey Jude"
## [15] "(Jude Jude Jude Jude Jude)"
## [16] "Naa na na na na na na (Yeah yeah yeah), na na na na, hey Jude"
## [17] "(You know you can make, Jude Jude, You're not gonna break it)"
## [18] "Naa na (Don't make it bad Jude) na na na na na (Take a sad song and make it better), na na na na, hey Jude"
## [19] "Hey Jude, hey Jude wowwwwww"
## [20] "Naa na na na na na na, na na na na, hey Jude"
## [21] "Naa na na na na na na, na na na na, hey Jude"
## [22] "Jude Jude Jude Jude Jude Jude"
## [23] "Naa na na na na na na, na na na na, hey Jude"
## [24] "Naa na na na na na na, na na na na, hey Jude"
## [25] "Naa na na na na na na, na na na na, hey Jude"
## [26] "Naa na na na na na na, na na na na, hey Jude"
## [27] "Naa na na na na na na (Make it Jude), na na na na, hey Jude"
## [28] "Naa na na na na na na, na na na na, hey Jude(Go listen to ya ma ma ma ma ma ma ma ma)"
## [29] "Naa na na na na na na, na na na na, hey Jude"
## [30] "Naa na na na na na na, na na na na, hey Jude"
```

---

## Detect

```r
library(dplyr)  # provides filter() and the %>% pipe

hey_jude %>%
  filter(str_detect(string = lyric, pattern = "Jude"))
```

```
## # A tibble: 30 x 3
## track_title line lyric
## <chr> <int> <chr>
## 1 Hey Jude 1 Hey Jude, don't make it bad
## 2 Hey Jude 5 Hey Jude, don't be afraid
## 3 Hey Jude 9 And anytime you feel the pain, hey Jude, refrain
## 4 Hey Jude 14 Hey Jude, don't let me down
## 5 Hey Jude 17 Remember (Hey Jude) to let her into your heart
## 6 Hey Jude 19 So let it out and let it in, hey Jude, begin
## 7 Hey Jude 21 And don't you know that it's just you, hey Jude, you'll do
## 8 Hey Jude 24 Hey Jude, don't make it bad
## 9 Hey Jude 30 Naa na na na na na na, na na na na, hey Jude
## 10 Hey Jude 31 Naa na na na na na na, na na na na, hey Jude
## # … with 20 more rows
```

---

## Detect

```r
str_count(string = hey_jude$lyric, pattern = "(n|N)a+")
```

```
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 9 0 0
## [26] 0 0 0 0 11 11 11 11 0 14 0 11 1 11 0 11 11 0 11 11 12 11 11 11 0
## [51] 11 11 11
```

---

## Detect

```r
hey_jude %>%
  filter(str_count(string = lyric, pattern = "(n|N)a+") > 0)
```

```
## # A tibble: 21 x 3
## track_title line lyric
## <chr> <int> <chr>
## 1 Hey Jude 13 Na na na na na na na na na na
## 2 Hey Jude 23 Na na na na na na na na na yeah
## 3 Hey Jude 30 Naa na na na na na na, na na na na, hey Jude
## 4 Hey Jude 31 Naa na na na na na na, na na na na, hey Jude
## 5 Hey Jude 32 Naa na na na na na na, na na na na, hey Jude
## 6 Hey Jude 33 Naa na na na na na na, na na na na, hey Jude
## 7 Hey Jude 35 Naa na na na na na na (Na na na), na na na na, hey Jude
## 8 Hey Jude 37 Naa na na na na na na (Yeah yeah yeah), na na na na, hey J…
## 9 Hey Jude 38 (You know you can make, Jude Jude, You're not gonna break …
## 10 Hey Jude 39 Naa na (Don't make it bad Jude) na na na na na (Take a sad…
## # … with 11 more rows
```

---

## Extract

```r
str_subset(string = hey_jude$lyric, pattern = "[:punct:]") %>%
  str_extract(pattern = "[:punct:]")
```

```
## [1] "," "," "," "'" "'" "," "," "(" "(" "," "'" "'" "," "'" "," "," "," "," ","
## [20] "(" "(" "(" "(" "(" "(" "," "," "," "," "," "(" "," "," "(" "(" "," "," ","
```

---

## Extract

```r
str_subset(string = hey_jude$lyric, pattern = "[:punct:]") %>%
  str_extract_all(pattern = "[:punct:]")
```

```
## [[1]]
## [1] "," "'"
##
## [[2]]
## [1] "," "'"
##
## [[3]]
## [1] "," ","
##
## [[4]]
## [1] "'"
##
## [[5]]
## [1] "'"
##
## [[6]]
## [1] "," "'"
##
## [[7]]
## [1] ","
##
## [[8]]
## [1] "(" ")"
##
## [[9]]
## [1] "(" ")"
##
## [[10]]
## [1] "," ","
##
## [[11]]
## [1] "'"
##
## [[12]]
## [1] "'" "'" "," "," "'"
##
## [[13]]
## [1] "," "'"
##
## [[14]]
## [1] "'" "(" "," "!" ")"
##
## [[15]]
## [1] ","
##
## [[16]]
## [1] "," ","
##
## [[17]]
## [1] "," ","
##
## [[18]]
## [1] "," ","
##
## [[19]]
## [1] "," ","
##
## [[20]]
## [1] "(" ")"
##
## [[21]]
## [1] "(" ")" "," ","
##
## [[22]]
## [1] "(" ")"
##
## [[23]]
## [1] "(" ")" "," ","
##
## [[24]]
## [1] "(" "," "," "'" ")"
##
## [[25]]
## [1] "(" "'" ")" "(" ")" "," ","
##
## [[26]]
## [1] ","
##
## [[27]]
## [1] "," ","
##
## [[28]]
## [1] "," ","
##
## [[29]]
## [1] "," ","
##
## [[30]]
## [1] "," ","
##
## [[31]]
## [1] "(" ")"
##
## [[32]]
## [1] "," ","
##
## [[33]]
## [1] "," ","
##
## [[34]]
## [1] "(" ")" "," ","
##
## [[35]]
## [1] "(" ")"
##
## [[36]]
## [1] "," "," "(" ")"
##
## [[37]]
## [1] "," ","
##
## [[38]]
## [1] "," ","
```

---

## Extract

```r
str_view(hey_jude$lyric, pattern = "(?<= let )\\w+")
```
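---

The detect, count, and extract verbs can be tried without downloading the song. This sketch is an added illustration: the vector `jude` is a hypothetical stand-in holding four lines copied from the `hey_jude` tibble, so no call to `genius_lyrics()` is needed.

```r
library(stringr)

# Four lines copied from the lyrics shown earlier
jude <- c("Hey Jude, don't make it bad",
          "Take a sad song and make it better",
          "Remember to let her into your heart",
          "Na na na na na na na na na na")

# Detect: one TRUE/FALSE per line
has_jude <- str_detect(jude, "Jude")
has_jude
# [1]  TRUE FALSE FALSE FALSE

# Count: number of "na"-style matches in each line
na_count <- str_count(jude, "(n|N)a+")
na_count
# [1]  0  0  0 10

# Extract: the word that follows " let ", or NA when there is none
after_let <- str_extract(jude, "(?<= let )\\w+")
after_let
# [1] NA    NA    "her" NA
```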
---

## Extract

```r
hey_jude %>%
  filter(str_detect(lyric, "let")) %>%
  mutate(after = str_extract(lyric, "(?<= let )\\w+"))
```

```
## # A tibble: 7 x 4
## track_title line lyric after
## <chr> <int> <chr> <chr>
## 1 Hey Jude 3 Remember to let her into your heart her
## 2 Hey Jude 7 The minute you let her under your skin her
## 3 Hey Jude 14 Hey Jude, don't let me down me
## 4 Hey Jude 16 (Let it out and let it in) it
## 5 Hey Jude 17 Remember (Hey Jude) to let her into your heart her
## 6 Hey Jude 19 So let it out and let it in, hey Jude, begin it
## 7 Hey Jude 26 Remember to let her under your skin her
```

---

## Replace

```r
hey_kelly <- hey_jude %>%
  mutate(lyric = str_replace_all(lyric, "Judy|Jude", "Kelly"))
hey_kelly
```

```
## # A tibble: 53 x 3
## track_title line lyric
## <chr> <int> <chr>
## 1 Hey Jude 1 Hey Kelly, don't make it bad
## 2 Hey Jude 2 Take a sad song and make it better
## 3 Hey Jude 3 Remember to let her into your heart
## 4 Hey Jude 4 Then you can start to make it better
## 5 Hey Jude 5 Hey Kelly, don't be afraid
## 6 Hey Jude 6 You were made to go out and get her
## 7 Hey Jude 7 The minute you let her under your skin
## 8 Hey Jude 8 Then you begin to make it better
## 9 Hey Jude 9 And anytime you feel the pain, hey Kelly, refrain
## 10 Hey Jude 10 Don't carry the world upon your shoulders
## # … with 43 more rows
```

---

## Split

```r
str_split(hey_jude$lyric[1], " ")
```

```
## [[1]]
## [1] "Hey" "Jude," "don't" "make" "it" "bad"
```

* We will see a tidier way to do this with `tidytext` on Thursday.

---

## Up Next:

Now that we know how to handle text/strings as data, let's do some text analysis with `tidytext`.

Topics:

* Tokenizing to a tidy format
* Word frequencies
* Word clouds
* Sentiment analysis