class: center, middle, inverse ### SQL Joins <img src="img/hero_wall_pink.png" width="800px"/> ### Kelly McConville .large[Math 241 | Week 13 | Spring 2021] --- ## Announcements/Reminders * Extra credit lab posted and due by end of day Sunday, May 9th. * Final project due Tuesday, May 11th at noon. * Remaining Office Hours: + Thu, May 6th 2:30 - 3:30pm PT + Fri, May 7th 11:00am - noon PT + Mon, May 10th 2:00 - 3:00pm PT * Don't forget to complete the course evaluations. --- ## Goals * SQL joins * Comparing SQL and `dplyr` * Wrap-up activity --- ## Database Querying Recap * **Relational database**: collection of linkable **tables** that are linked together by **keys** <img src="img/imdb_schema.png" width="100%" style="display: block; margin: auto;" /> * **SQL**: Structured Query Language + We are using [MySQL](https://www.mysql.com/). --- ## Connecting to MySQL ```r library(RMySQL) db <- dbConnect(MySQL(), host = "scidb.smith.edu", user = "sds192", password = "DSismfc@S", dbname = "imdb") ``` --- ## Query Order in SQL: * `SELECT` * `FROM` * `JOIN`: Cover today! * `WHERE` * `GROUP BY` * `HAVING` * `ORDER BY` * `LIMIT` Relate to `dplyr` to help learn these new functions. --- ## `JOIN`S Part of SQL's success is the ability to efficiently join tables. What information do we need to join two tables? -- * The name of the first table. * The type of join you want. * The name of the second table. * The conditions under which you want the records in table 1 to match records in table 2. -- Let's look at some examples with the im(db) database. * We want to find the keywords associated with "Shrek 2". --- ## Schema Again As we go through these examples, I want you to ask yourself: * Why are there so many tables? * Why isn't there one movies table? <img src="img/imdb_schema.png" width="100%" style="display: block; margin: auto;" /> --- ## Keywords * Seems like we need to start with these two tables: `movie_keyword` and `keyword`. * What does `COUNT(*)` do again? ```r db %>% dbGetQuery("SELECT COUNT(*) FROM movie_keyword;") ``` ``` ## COUNT(*) ## 1 7342172 ``` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM keyword;") ``` ``` ## COUNT(*) ## 1 230407 ``` --- ```r db %>% dbGetQuery("SELECT * FROM movie_keyword LIMIT 0, 8;") ``` ``` ## id movie_id keyword_id ## 1 1 2 1 ## 2 2 35 2 ## 3 3 55 3 ## 4 4 55 4 ## 5 5 55 5 ## 6 6 57 6 ## 7 7 102 7 ## 8 8 102 8 ``` ```r db %>% dbGetQuery("SELECT * FROM keyword LIMIT 0, 8;") ``` ``` ## id keyword phonetic_code ## 1 4321 raining R52 ## 2 6618 handcuffed-to-a-bed H5321 ## 3 791 offer-refused O1612 ## 4 8228 reference-to-dagmar-havlova R1652 ## 5 352 jane-austen J5235 ## 6 31 woods W32 ## 7 5491 teenage-son T525 ## 8 5830 hanging H5252 ``` --- ## Keywords What is the SQL command to arrange the `keyword` dataset by `id`? ```r db %>% dbGetQuery("SELECT * FROM keyword LIMIT 0, 8;") ``` ``` ## id keyword phonetic_code ## 1 4321 raining R52 ## 2 6618 handcuffed-to-a-bed H5321 ## 3 791 offer-refused O1612 ## 4 8228 reference-to-dagmar-havlova R1652 ## 5 352 jane-austen J5235 ## 6 31 woods W32 ## 7 5491 teenage-son T525 ## 8 5830 hanging H5252 ``` --- ## Keywords What is the SQL command to arrange the `keyword` dataset by `id`? ```r db %>% dbGetQuery("SELECT * FROM keyword ORDER BY id LIMIT 0, 8;") ``` ``` ## id keyword phonetic_code ## 1 1 number-in-title N5165 ## 2 2 web-series W1262 ## 3 3 friend F653 ## 4 4 heroin H65 ## 5 5 vlog V42 ## 6 6 tv-mini-series T1526 ## 7 7 austin-texas A2353 ## 8 8 beer B6 ``` --- ## Key for Keywords? What is the key for linking `keyword` and `movie_keyword`? <img src="img/imdb_schema.png" width="100%" style="display: block; margin: auto;" /> --- ## `JOIN` * Notice: `movie_keyword.movie_id` + R syntax? ```r db %>% dbGetQuery("SELECT movie_keyword.movie_id, keyword.keyword FROM movie_keyword JOIN keyword ON movie_keyword.keyword_id = keyword.id LIMIT 0, 8;") ``` ``` ## movie_id keyword ## 1 2 number-in-title ## 2 35 web-series ## 3 55 friend ## 4 55 heroin ## 5 55 vlog ## 6 57 tv-mini-series ## 7 102 austin-texas ## 8 102 beer ``` --- ## `JOIN` * Can remove the dataset name if... ```r db %>% dbGetQuery("SELECT movie_id, keyword FROM movie_keyword JOIN keyword ON keyword_id = id LIMIT 0, 8;") ``` ``` ## Error in .local(conn, statement, ...): could not run statement: Column 'id' in on clause is ambiguous ``` --- ## `JOIN` * Can remove the dataset name if... ```r db %>% dbGetQuery("SELECT movie_id, keyword FROM movie_keyword JOIN keyword ON keyword_id = keyword.id LIMIT 0, 8;") ``` ``` ## movie_id keyword ## 1 2 number-in-title ## 2 35 web-series ## 3 55 friend ## 4 55 heroin ## 5 55 vlog ## 6 57 tv-mini-series ## 7 102 austin-texas ## 8 102 beer ``` --- ## `JOIN` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM movie_keyword JOIN keyword ON keyword_id = keyword.id;") ``` ``` ## COUNT(*) ## 1 7342172 ``` --- ## Table Aliases ```r db %>% dbGetQuery("SELECT mk.movie_id, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id LIMIT 0, 8;") ``` ``` ## movie_id keyword ## 1 2 number-in-title ## 2 35 web-series ## 3 55 friend ## 4 55 heroin ## 5 55 vlog ## 6 57 tv-mini-series ## 7 102 austin-texas ## 8 102 beer ``` * Now what table to do we want to `JOIN` to our current result? --- ## Keywords <img src="img/imdb_schema.png" width="100%" style="display: block; margin: auto;" /> --- ```r db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id LIMIT 0, 8;") ``` ``` ## title keyword ## 1 #1 Single number-in-title ## 2 #1MinuteNightmare web-series ## 3 #30Nods friend ## 4 #30Nods heroin ## 5 #30Nods vlog ## 6 #4Hire tv-mini-series ## 7 #ATown austin-texas ## 8 #ATown beer ``` --- Now we are close to finding the keywords in 'Shrek 2'. * If we want to filter to just 'Shrek 2', what SQL command do we need? ```r db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id LIMIT 0, 8;") ``` ``` ## title keyword ## 1 #1 Single number-in-title ## 2 #1MinuteNightmare web-series ## 3 #30Nods friend ## 4 #30Nods heroin ## 5 #30Nods vlog ## 6 #4Hire tv-mini-series ## 7 #ATown austin-texas ## 8 #ATown beer ``` --- Now we are close to finding the keywords in 'Shrek 2'. * If we want to filter to just 'Shrek 2', what SQL command do we need to add? ```r db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id WHERE t.title = 'Shrek 2' AND t.kind_id = 1 LIMIT 0, 8;") ``` ``` ## title keyword ## 1 Shrek 2 3d-animation ## 2 Shrek 2 arm-wrestling ## 3 Shrek 2 arrogance ## 4 Shrek 2 assassin ## 5 Shrek 2 bar ## 6 Shrek 2 based-on-book ## 7 Shrek 2 based-on-fairy-tale ## 8 Shrek 2 battle ``` --- ```r db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id WHERE t.title = 'Shrek 2' AND t.kind_id = 1 LIMIT 0, 300;") ``` ``` ## title keyword ## 1 Shrek 2 3d-animation ## 2 Shrek 2 arm-wrestling ## 3 Shrek 2 arrogance ## 4 Shrek 2 assassin ## 5 Shrek 2 bar ## 6 Shrek 2 based-on-book ## 7 Shrek 2 based-on-fairy-tale ## 8 Shrek 2 battle ## 9 Shrek 2 belch ## 10 Shrek 2 best-friend ## 11 Shrek 2 blind ## 12 Shrek 2 blockbuster ## 13 Shrek 2 brawl ## 14 Shrek 2 bubble ## 15 Shrek 2 cartoon-violence ## 16 Shrek 2 castle ## 17 Shrek 2 cat ## 18 Shrek 2 catching-an-object-in-one's-mouth ## 19 Shrek 2 cgi-animation ## 20 Shrek 2 cgi-film ## 21 Shrek 2 character-name-in-title ## 22 Shrek 2 computer-animation ## 23 Shrek 2 cult-film ## 24 Shrek 2 cyclops ## 25 Shrek 2 dance ## 26 Shrek 2 destiny ## 27 Shrek 2 dinner ## 28 Shrek 2 dog ## 29 Shrek 2 donkey ## 30 Shrek 2 drag-queen ## 31 Shrek 2 dragon ## 32 Shrek 2 elf ## 33 Shrek 2 enchanted-object ## 34 Shrek 2 factory ## 35 Shrek 2 fairy ## 36 Shrek 2 fairy-godmother ## 37 Shrek 2 fairy-tale ## 38 Shrek 2 fairy-tale-parody ## 39 Shrek 2 far-far-away ## 40 Shrek 2 father-daughter-relationship ## 41 Shrek 2 father-in-law ## 42 Shrek 2 fencing ## 43 Shrek 2 fire-breathing-dragon ## 44 Shrek 2 fistfight ## 45 Shrek 2 flatulence ## 46 Shrek 2 frog ## 47 Shrek 2 giant ## 48 Shrek 2 giant-food ## 49 Shrek 2 gingerbread ## 50 Shrek 2 gingerbread-man ## 51 Shrek 2 glamour ## 52 Shrek 2 godmother ## 53 Shrek 2 hand-to-hand-combat ## 54 Shrek 2 handsomeness ## 55 Shrek 2 harp ## 56 Shrek 2 hit-in-the-crotch ## 57 Shrek 2 honeymoon ## 58 Shrek 2 horse ## 59 Shrek 2 husband-wife-relationship ## 60 Shrek 2 hybrid-animal ## 61 Shrek 2 immaturity ## 62 Shrek 2 king ## 63 Shrek 2 kingdom ## 64 Shrek 2 knight ## 65 Shrek 2 lifting-someone-into-the-air ## 66 Shrek 2 love-potion ## 67 Shrek 2 magic ## 68 Shrek 2 magic-wand ## 69 Shrek 2 magical-potion ## 70 Shrek 2 makeup ## 71 Shrek 2 martial-arts ## 72 Shrek 2 mermaid ## 73 Shrek 2 mirror ## 74 Shrek 2 mistaken-identity ## 75 Shrek 2 mother-in-law ## 76 Shrek 2 mother-son-relationship ## 77 Shrek 2 mouse ## 78 Shrek 2 mud-bath ## 79 Shrek 2 newlywed ## 80 Shrek 2 no-opening-credits ## 81 Shrek 2 ogre ## 82 Shrek 2 party ## 83 Shrek 2 physical-appearance ## 84 Shrek 2 pig ## 85 Shrek 2 pinocchio ## 86 Shrek 2 pitchfork ## 87 Shrek 2 prejudice ## 88 Shrek 2 prince ## 89 Shrek 2 princess ## 90 Shrek 2 puppet ## 91 Shrek 2 puss-in-boots ## 92 Shrek 2 queen ## 93 Shrek 2 red-carpet ## 94 Shrek 2 redemption ## 95 Shrek 2 rivalry ## 96 Shrek 2 rock-concert ## 97 Shrek 2 satire ## 98 Shrek 2 scene-during-end-credits ## 99 Shrek 2 second-part ## 100 Shrek 2 secret ## 101 Shrek 2 secret-identity ## 102 Shrek 2 secret-plot ## 103 Shrek 2 sequel ## 104 Shrek 2 shark ## 105 Shrek 2 shrek ## 106 Shrek 2 sidekick ## 107 Shrek 2 siege ## 108 Shrek 2 singer ## 109 Shrek 2 slow-motion-scene ## 110 Shrek 2 stepsister ## 111 Shrek 2 storybook-in-opening-shot ## 112 Shrek 2 subjective-camera ## 113 Shrek 2 superficiality ## 114 Shrek 2 surprise-after-end-credits ## 115 Shrek 2 sword ## 116 Shrek 2 sword-and-sorcery ## 117 Shrek 2 sword-fight ## 118 Shrek 2 talking-animal ## 119 Shrek 2 transformation ## 120 Shrek 2 true-love ## 121 Shrek 2 villainess ## 122 Shrek 2 violence ## 123 Shrek 2 white-horse ## 124 Shrek 2 wolf ``` --- ## Just For Fun ```r db %>% dbGetQuery("SELECT title FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id WHERE k.keyword = 'portland-oregon';") ``` ``` ## title ## 1 Tinker Hatfield: Footwear Design ## 2 The Only Light in the Darkness ## 3 Backstrom ## 4 Ancient, Chinese, Secret ## 5 Bella ## 6 Bogeyman ## 7 Corkscrewed ## 8 Dragon Slayer ## 9 Enemy of My Enemies ## 10 Give 'Til It Hurts ## 11 I Am a Bird Now ## 12 I Like to Watch ## 13 Inescapable Truth ## 14 Love Is a Rose and You Better Not Pick It ## 15 Rock Bottom ## 16 Takes One to Know One ## 17 Scared to Death ## 18 (#1.1) ## 19 Meat Lover's Paradise ## 20 Grimm ## 21 Hello, Larry ## 22 Jamie and Sydney ## 23 Inside American Jail ## 24 Life Unexpected ## 25 Little People, Big World ## 26 Focused on Failure ## 27 Jail ## 28 Double Life ## 29 Portlandia ## 30 Family Emergency ## 31 Ramona ## 32 Deadcheck ## 33 Summer of the Ox ## 34 The Case of the Wholesale Burglaries ## 35 Andy Ngo ## 36 The Portland Trip ## 37 I Want a Threesome ## 38 Under Suspicion ## 39 Wage Slaves ## 40 Nevermore ## 41 Shadows ## 42 $100 & a T-Shirt: A Documentary About Zines in the Northwest ## 43 7-10 Split ## 44 ? ## 45 A Day Called X ## 46 A Good Whack: The Making of 'Tonya and Nancy: The Opera' ## 47 Act Your Age: The Kids of Widney High Story ## 48 Adventures in Plymptoons! ## 49 American Hardcore ## 50 Antitrust ## 51 Beatings ## 52 Behind the Mask: The Rise of Leslie Vernon ## 53 Bend of the River ## 54 Best Laid Plans ## 55 Bigger Than the Sky ## 56 Billy Graham: God's Ambassador ## 57 Blue Like Jazz ## 58 Body of Evidence ## 59 Bongwater ## 60 Brain Smasher... A Love Story ## 61 Broken Noses ## 62 Canyon Passage ## 63 City Baby ## 64 Cold Weather ## 65 Constant ## 66 Damon Blows America Volume 1: Portland Oregon ## 67 Deep Waters ## 68 Disc Jockey ## 69 Down and Out with the Dolls ## 70 Dr. Giggles ## 71 Drugstore Cowboy ## 72 Elephant ## 73 Feast of Love ## 74 Film Geek ## 75 Fine Arts ## 76 Foxfire ## 77 Green Room ## 78 Happy Birthday ## 79 Heaven Sent ## 80 Holed-Up ## 81 If I Stay ## 82 Kurt & Courtney ## 83 Lady .38 ## 84 Lead with Love: A Documentary Film by Logan Lynn ## 85 Leaving Normal ## 86 Lexie Cannes ## 87 Little Blue Pill ## 88 Lord, Save Us from Your Followers ## 89 Love, Mary ## 90 Mala Noche ## 91 Messiah Complex ## 92 MoonPi ## 93 Mr. Brooks ## 94 Mr. Holland's Opus ## 95 Murder C.O.D. ## 96 Music Within ## 97 My Own Private Idaho ## 98 My Story Lives Here ## 99 Nickel and Dimin' It with Buddy ## 100 On the Wing ## 101 One Night in Portland ## 102 Paranoid Park ## 103 Peekaboo Rose ## 104 Pendarvia ## 105 Permanent Record ## 106 Portland Exposé ## 107 Portland Mojo ## 108 Portlandia ## 109 Prayers for Bobby ## 110 Punk Love ## 111 Queens of Heart: Community Therapists in Drag ## 112 Reality Check ## 113 Reds ## 114 River's Edge ## 115 Saving Sarah Cain ## 116 Seven Girlfriends ## 117 Silent Hill: Revelation ## 118 Speech & Debate ## 119 Spiral ## 120 Stepfather II ## 121 Superfire ## 122 The Big One ## 123 The Burning Plain ## 124 The Buskers & Lou ## 125 The Darkest Corner of Paradise ## 126 The Future ## 127 The Graffiti Artist ## 128 The Hunted ## 129 The Iconographer ## 130 The Ladies Room ## 131 The Lathe of Heaven ## 132 The Old 1-2 Punch ## 133 The Roomies ## 134 The Stepfather ## 135 The Temp ## 136 Throw Away People ## 137 Tony Hawk's Pro Skater ## 138 Toy Room ## 139 Tying the Knot ## 140 Untraceable ## 141 We Don't Live Here Anymore ## 142 What the #$*! Do We (K)now!? ## 143 Whitepaddy ## 144 Wild ## 145 Without Orders ## 146 Yellow Pages ## 147 Zero Effect ``` --- ## Why are there so many datasets?! Why not just store all of the information about the movies in ONE dataset? --- What does a row represent for `title`? For `keyword`? For `movie_keyword`? ```r db %>% dbGetQuery("SELECT title FROM title LIMIT 4372899, 3;") ``` ``` ## title ## 1 The Master ## 2 The Heist ## 3 The Four Wizards at Deathly Side ``` ```r db %>% dbGetQuery("SELECT * FROM keyword LIMIT 0, 3;") ``` ``` ## id keyword phonetic_code ## 1 4321 raining R52 ## 2 6618 handcuffed-to-a-bed H5321 ## 3 791 offer-refused O1612 ``` ```r db %>% dbGetQuery("SELECT * FROM movie_keyword LIMIT 0, 3;") ``` ``` ## id movie_id keyword_id ## 1 1 2 1 ## 2 2 35 2 ## 3 3 55 3 ``` --- ## Why are there so many datasets?! Why not just store all of the information about the movies in ONE dataset? -- + Not very efficient. -- Database solution: + Store info about the movies in `title`. + Store info about the keywords in `keyword`. + Store short codes for the `keywords` in `keyword_id`. + Only `JOIN` when we need to. --- ## Last Step Don't forget to store your dataset in R! ```r shrek_2_keywords <- db %>% dbGetQuery("SELECT t.title, k.keyword FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id JOIN title t ON mk.movie_id = t.id WHERE t.title = 'Shrek 2' AND t.kind_id = 1;") ``` --- ## `R` Versus `SQL` * `SQL` Positives: + Great at accessing big(ger) data (and smaller data) + Very efficient at joining data, filtering data, selecting columns... * `SQL` Negatives: + Summary statistics and data wrangling are very limited in scope + Can't plot, fit models, run hypothesis tests, ... --- ## Interacting with the `MySQL` server via `dplyr` ```r title <- tbl(db, "title") keyword <- tbl(db, "keyword") movie_keyword <- tbl(db, "movie_keyword") class(title) ``` ``` ## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" ## [4] "tbl_lazy" "tbl" ``` --- ```r dat <- inner_join(keyword, movie_keyword, by = c("id" = "keyword_id")) %>% inner_join(title, by = c("movie_id" = "id")) %>% filter(title == "Shrek 2", kind_id == 1) %>% select(title, keyword) dat ``` ``` ## # Source: lazy query [?? x 2] ## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb] ## title keyword ## <chr> <chr> ## 1 Shrek 2 3d-animation ## 2 Shrek 2 arm-wrestling ## 3 Shrek 2 arrogance ## 4 Shrek 2 assassin ## 5 Shrek 2 bar ## 6 Shrek 2 based-on-book ## 7 Shrek 2 based-on-fairy-tale ## 8 Shrek 2 battle ## 9 Shrek 2 belch ## 10 Shrek 2 best-friend ## # … with more rows ``` --- ## Translation of `dplyr` to `SQL` ```r show_query(dat) ``` ``` ## <SQL> ## SELECT `title`, `keyword` ## FROM (SELECT `LHS`.`id.x` AS `id.x`, `LHS`.`keyword` AS `keyword`, `LHS`.`phonetic_code` AS `phonetic_code.x`, `LHS`.`id.y` AS `id.y`, `LHS`.`movie_id` AS `movie_id`, `RHS`.`title` AS `title`, `RHS`.`imdb_index` AS `imdb_index`, `RHS`.`kind_id` AS `kind_id`, `RHS`.`production_year` AS `production_year`, `RHS`.`imdb_id` AS `imdb_id`, `RHS`.`phonetic_code` AS `phonetic_code.y`, `RHS`.`episode_of_id` AS `episode_of_id`, `RHS`.`season_nr` AS `season_nr`, `RHS`.`episode_nr` AS `episode_nr`, `RHS`.`series_years` AS `series_years`, `RHS`.`md5sum` AS `md5sum` ## FROM (SELECT `LHS`.`id` AS `id.x`, `LHS`.`keyword` AS `keyword`, `LHS`.`phonetic_code` AS `phonetic_code`, `RHS`.`id` AS `id.y`, `RHS`.`movie_id` AS `movie_id` ## FROM `keyword` AS `LHS` ## INNER JOIN `movie_keyword` AS `RHS` ## ON (`LHS`.`id` = `RHS`.`keyword_id`) ## ) `LHS` ## INNER JOIN `title` AS `RHS` ## ON (`LHS`.`movie_id` = `RHS`.`id`) ## ) `dbplyr_001` ## WHERE ((`title` = 'Shrek 2') AND (`kind_id` = 1.0)) ``` --- ## Math 241 Reflections * Let's crowd source some reflections on the course in [this Google Doc](https://docs.google.com/document/d/1VSDjdHUkYka3YBYNnbDQEAhEsZ1h991YM8bB2dzWtL8/edit?usp=sharing)