class: center, middle, inverse ### Database Querying <img src="img/hero_wall_pink.png" width="800px"/> ### Kelly McConville .large[Math 241 | Week 13 | Spring 2021] --- ## Announcements/Reminders * Receive extra credit lab on SQL on Thursday. * Final project due Tuesday, May 11th at noon. --- ## Goals Database querying with SQL * Why learn `SQL` when I already know `dplyr` and `readr`? + [Check out this post](https://blog.sqlizer.io/posts/sql-43/). -- Not Covering: * Database Administration + MDSR Ch 16 + Useful chapter to read for those who are more interested in the CS side of data science. --- ## Local Installation of R/RStudio * I couldn't connect to the MySQL server we are going to use this week via Reed's RStudio Server. * So I am using my local RStudio. * If you haven't included R/RStudio on your computer yet, I recommend you do so: + First: [R](https://cran.r-project.org/) + Then: [RStudio](https://rstudio.com/products/rstudio/download/) + Then: All your favorite packages. --- ## Database Querying What is a **database query**? -- * A request for data from a set of tables in the database. Why learn database querying? -- + You will likely need the skill if you want at an organization whose has data. + For many organizations, they store their data in a relational database. Think Google, Facebook, LinkedIn, ... + To know how to handle data that are too big to store in memory. + Instead you want to store it in a database (on disk) and then query the database for the bits you want to load in memory. --- ## Memory Versus Hard Disk A computer has two main options for storing data: -- * Memory (RAM): amount of data that a computer can work with at once. + Likely have a few gigabytes -- * Hard Disk: amount of data that a computer can store permanently. + Likely have hundreds or thousands of gigabytes -- Trade-off between storage room (disks) and speed (memory is faster to access). --- ## Data Storage and `R` * `R` generally stores all objects in memory. -- * That can get to be a problem as our datasets get bigger: ```r library(pdxTrees) pdxTrees <- get_pdxTrees_streets() dim(pdxTrees) ``` ``` ## [1] 218602 23 ``` ```r print(object.size(pdxTrees), units = "Mb") ``` ``` ## 55.6 Mb ``` --- ## Data Storage and `R` ```r print(object.size(pdxTrees), units = "Mb") ``` ``` ## 55.6 Mb ``` * Once we get to 10M rows in our data frame, things will start to get problematic. -- * **Solution**: Use a system that stores all of the data on disk but allows us to access a portion in memory. + Will use a relational database to do this! + There are other solutions for really BIG data. --- ## Relational Database * **Relational database**: collection of linkable **tables** that are linked together by **keys** * **Table**: Two dimensional + Record = row + Field = column --- ## Relational Database * We will explore an open im(db) database today with the [following schema](https://imgur.com/pDq0n) <img src="img/imdb_schema.png" width="100%" style="display: block; margin: auto;" /> --- ## [SQL](https://en.wikipedia.org/wiki/SQL) * **SQL**: Structured Query Language * Language for interacting with data stored in a relational database. * We will focus on **querying** the database. * Can be used to create and modify databases. --- ## RDBMS * **RDBMS**: relational database management system * There are several RDBMSs out there. * Unfortunately, the SQL syntax does vary a little between systems, but not really for the basics. * We will use [MySQL](https://www.mysql.com/). + Open source but owned by Oracle + Used by Facebook, Google, LinkedIn, Twitter --- ## MySQL * Client-server model + Server: Where data are stored and queries are executed. + Located on the user's computer or a remote server. + Client: Program that connects to the server * I did not set up a MySQL server. * Instead we will use the Smith College MySQL server, `scidb.smith.edu`. + Will access it through RStudio using `dplyr`, `DBI`, and `RMySQL`. * If you find yourself using a MySQL server, you should check out the [MySQLWorkbench](https://www.mysql.com/products/workbench/). --- ## Connecting to MySQL * `dbConnect()` via `DBI` package (which is automatically loaded when you load `RMySQL`). ```r library(RMySQL) db <- dbConnect(MySQL(), host = "scidb.smith.edu", user = "sds192", password = "DSismfc@S", dbname = "imdb") ``` --- ## Database Connection * We now have a database connection. ```r class(db) ``` ``` ## [1] "MySQLConnection" ## attr(,"package") ## [1] "RMySQL" ``` --- ## Tables in Database * The retrieval function: `dbGetQuery()` + First argument: database connection object. + Second argument: SQL query as a character vector. Ends in semi-colon. ```r db %>% dbGetQuery("SHOW TABLES;") ``` ``` ## Tables_in_imdb ## 1 aka_name ## 2 aka_title ## 3 cast_info ## 4 char_name ## 5 comp_cast_type ## 6 company_name ## 7 company_type ## 8 complete_cast ## 9 info_type ## 10 keyword ## 11 kind_type ## 12 link_type ## 13 movie_companies ## 14 movie_info ## 15 movie_info_idx ## 16 movie_keyword ## 17 movie_link ## 18 name ## 19 person_info ## 20 role_type ## 21 title ``` --- ## Another Option: Use SQL chunks * Will set some chunk options: ```r knitr::opts_chunk$set(connection = db, max.print = 20) ``` * Insert `sql` chunks instead of `r` chunks ```sql SHOW TABLES; ``` <div class="knitsql-table"> <table> <caption>Displaying records 1 - 20</caption> <thead> <tr> <th style="text-align:left;"> Tables_in_imdb </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> aka_name </td> </tr> <tr> <td style="text-align:left;"> aka_title </td> </tr> <tr> <td style="text-align:left;"> cast_info </td> </tr> <tr> <td style="text-align:left;"> char_name </td> </tr> <tr> <td style="text-align:left;"> comp_cast_type </td> </tr> <tr> <td style="text-align:left;"> company_name </td> </tr> <tr> <td style="text-align:left;"> company_type </td> </tr> <tr> <td style="text-align:left;"> complete_cast </td> </tr> <tr> <td style="text-align:left;"> info_type </td> </tr> <tr> <td style="text-align:left;"> keyword </td> </tr> <tr> <td style="text-align:left;"> kind_type </td> </tr> <tr> <td style="text-align:left;"> link_type </td> </tr> <tr> <td style="text-align:left;"> movie_companies </td> </tr> <tr> <td style="text-align:left;"> movie_info </td> </tr> <tr> <td style="text-align:left;"> movie_info_idx </td> </tr> <tr> <td style="text-align:left;"> movie_keyword </td> </tr> <tr> <td style="text-align:left;"> movie_link </td> </tr> <tr> <td style="text-align:left;"> name </td> </tr> <tr> <td style="text-align:left;"> person_info </td> </tr> <tr> <td style="text-align:left;"> role_type </td> </tr> </tbody> </table> </div> --- ## Retrieving data: `SELECT` and `FROM` * `SELECT` + Always starts a query * `FROM` + Specifies table ```r db %>% dbGetQuery("SELECT * FROM kind_type;") ``` ``` ## id kind ## 1 1 movie ## 2 2 tv series ## 3 3 tv movie ## 4 4 video movie ## 5 5 tv mini series ## 6 6 video game ## 7 7 episode ``` --- ## Retrieving data: `SELECT` and `FROM` * **What is the `dplyr` version?** What does `*` mean? ```r db %>% dbGetQuery("SELECT * FROM kind_type;") ``` ``` ## id kind ## 1 1 movie ## 2 2 tv series ## 3 3 tv movie ## 4 4 video movie ## 5 5 tv mini series ## 6 6 video game ## 7 7 episode ``` * Returns the kinds of "movies" in the database --- ## Saving Output ```r movie_kinds <- db %>% dbGetQuery("SELECT * FROM kind_type;") class(movie_kinds) ``` ``` ## [1] "data.frame" ``` --- ## A `SELECT` Word of Caution * I did NOT run the following chunk: ```r db %>% dbGetQuery("SELECT * FROM title;) ``` --- ## Retrieving data: `LIMIT` * **What is the `R` version?** ```r db %>% dbGetQuery("SELECT * FROM title LIMIT 10;") ``` ``` ## id title imdb_index kind_id ## 1 78460 Adults Recat to the Simpsons (30th Anniversary) <NA> 7 ## 2 70273 (2016-05-18) <NA> 7 ## 3 60105 (2014-04-11) <NA> 7 ## 4 32120 (2008-05-01) <NA> 7 ## 5 97554 Schmölders Traum <NA> 7 ## 6 57966 (#1.1) <NA> 7 ## 7 76391 Anniversary <NA> 7 ## 8 11952 Angus Black/Lester Barrie/DC Curry <NA> 7 ## 9 1554 New Orleans <NA> 7 ## 10 58442 Kiss Me Kate <NA> 7 ## production_year imdb_id phonetic_code episode_of_id season_nr episode_nr ## 1 2017 NA A3432 78406 NA NA ## 2 2016 NA <NA> 68058 NA NA ## 3 2014 NA <NA> 59138 NA NA ## 4 2008 NA <NA> 32060 NA NA ## 5 2001 NA S2543 97302 10 1 ## 6 2013 NA <NA> 57965 1 1 ## 7 1971 NA A5162 76385 4 9 ## 8 2009 NA A5214 11937 4 7 ## 9 2003 NA N6452 1508 2 11 ## 10 2011 NA K2523 58436 1 10 ## series_years md5sum ## 1 <NA> 2ae09eed7d576cc2c24774fed5b18168 ## 2 <NA> 511dfc14cfff7589d29a95abb30cd66a ## 3 <NA> c6cdce7e667e07713e431805c407feed ## 4 <NA> 100df65742caf5afd092b2e0ead67d8e ## 5 <NA> 46862a2f96f9fb2d59e8c9a11ecfdd28 ## 6 <NA> 409c37703766c4b24f8a86162fd9cf85 ## 7 <NA> 5e12ce73fac1d1dcf94136b6e9acd8f8 ## 8 <NA> 9c38b9e5601dc154444b73b518034aa1 ## 9 <NA> 621bea735740a547e862e4a3226f35d2 ## 10 <NA> 293e8c75c7f35a4035abf617962be5a9 ``` --- ## Retrieving data: `SELECT` and `FROM` * What is the `dplyr` version of `AS`? -- * In SQL, this is called giving it an alias. + Only applied to the resulting data frame, not the data on the SQL server. -- * Why were these five movies selected? ```r db %>% dbGetQuery("SELECT title AS movie, production_year AS year FROM title LIMIT 11452, 5;") ``` ``` ## movie year ## 1 Fogar, dulce fogar 2007 ## 2 Tokyo Rose/Communism, Italian Style/Two Can Live as Cheaply as One 1976 ## 3 Insane Clown Posse 2011 ## 4 Agent Carter 2015 ## 5 (#2.32) 2012 ``` --- ## `DISTINCT` * **What is the `dplyr`/R version?** ```r db %>% dbGetQuery("SELECT DISTINCT role FROM role_type;") ``` ``` ## role ## 1 actor ## 2 actress ## 3 cinematographer ## 4 composer ## 5 costume designer ## 6 director ## 7 editor ## 8 guest ## 9 miscellaneous crew ## 10 producer ## 11 production designer ## 12 writer ``` --- ## `COUNT` * **What is the `dplyr`/R version?** + Why are they not equal? ```r db %>% dbGetQuery("SELECT COUNT(title), COUNT(production_year) FROM title;") ``` ``` ## COUNT(title) COUNT(production_year) ## 1 4626322 4342074 ``` --- ## `COUNT` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title;") ``` ``` ## COUNT(*) ## 1 4626322 ``` --- ## `COUNT` + `DISTINCT` ```r db %>% dbGetQuery("SELECT COUNT(DISTINCT production_year) FROM title;") ``` ``` ## COUNT(DISTINCT production_year) ## 1 144 ``` --- ## `WHERE` * **What is the `dplyr` version?** * What is different about the comparison operator syntax? * Single quotes are important. ```r db %>% dbGetQuery("SELECT * FROM title WHERE title = 'The Hobbit: An Unexpected Journey';") ``` ``` ## id title imdb_index kind_id production_year ## 1 368261 The Hobbit: An Unexpected Journey <NA> 7 2012 ## 2 1089471 The Hobbit: An Unexpected Journey <NA> 7 2012 ## 3 1176164 The Hobbit: An Unexpected Journey <NA> 7 2013 ## 4 1571622 The Hobbit: An Unexpected Journey <NA> 7 2017 ## 5 1680280 The Hobbit: An Unexpected Journey <NA> 7 2012 ## 6 1616172 The Hobbit: An Unexpected Journey <NA> 7 2012 ## 7 1680751 The Hobbit: An Unexpected Journey <NA> 7 2012 ## 8 2641984 The Hobbit: An Unexpected Journey <NA> 7 2013 ## 9 2743746 The Hobbit: An Unexpected Journey <NA> 7 2013 ## 10 4366574 The Hobbit: An Unexpected Journey <NA> 1 2012 ## imdb_id phonetic_code episode_of_id season_nr episode_nr series_years ## 1 NA H1352 368173 5 21 <NA> ## 2 NA H1352 1089359 2 23 <NA> ## 3 NA H1352 1175992 2 30 <NA> ## 4 NA H1352 1571578 1 51 <NA> ## 5 NA H1352 1680082 3 69 <NA> ## 6 NA H1352 1616064 1 98 <NA> ## 7 NA H1352 1680360 2 33 <NA> ## 8 NA H1352 2641969 2 1 <NA> ## 9 NA H1352 2743716 1 35 <NA> ## 10 NA H1352 NA NA NA <NA> ## md5sum ## 1 96911318791c86adbb4c1de214c00c0c ## 2 5fe0edda34566f38d1a7e86bba0d2153 ## 3 9e9d623365c7c40742b1e924272d434b ## 4 47d66403e79aaff72aaa13dd2731cdea ## 5 e2b2b91dfe2a5ac97f0c2e595eebbb54 ## 6 48cb3b29391d2318c49dbfcc8379348c ## 7 d8ae35d660c92630296f9ff86b5e18f1 ## 8 707a255ea3839f74f8dd0ef7339715fa ## 9 7cde871ef5bf395576c6c015884c25cd ## 10 75f8eb8f074f7d34fe40791cc580a87e ``` --- ## `WHERE` * What are all of these Hobbit entries?! -- ```r db %>% dbGetQuery("SELECT * FROM kind_type;") ``` ``` ## id kind ## 1 1 movie ## 2 2 tv series ## 3 3 tv movie ## 4 4 video movie ## 5 5 tv mini series ## 6 6 video game ## 7 7 episode ``` --- ## `WHERE` * What is different about the comparison operator syntax? ```r db %>% dbGetQuery("SELECT * FROM title WHERE title = 'The Hobbit: An Unexpected Journey' AND kind_id <> 7;") ``` ``` ## id title imdb_index kind_id production_year ## 1 4366574 The Hobbit: An Unexpected Journey <NA> 1 2012 ## imdb_id phonetic_code episode_of_id season_nr episode_nr series_years ## 1 NA H1352 NA NA NA <NA> ## md5sum ## 1 75f8eb8f074f7d34fe40791cc580a87e ``` --- ## Comparison syntax * `=` equal * `<>` not equal * `<` less than * `<=` less than or equal to --- ## `WHERE` + `AND` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE production_year = 2016 AND kind_id = 1;") ``` ``` ## COUNT(*) ## 1 59702 ``` --- ## `WHERE` + `OR` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE (production_year = 2016 OR production_year = 2015) AND kind_id = 1;") ``` ``` ## COUNT(*) ## 1 116263 ``` --- Why aren't these the same? ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE (production_year = 2016 OR production_year = 2015) AND kind_id = 1;") ``` ``` ## COUNT(*) ## 1 116263 ``` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE production_year = 2016 OR production_year = 2015 AND kind_id = 1;") ``` ``` ## COUNT(*) ## 1 318576 ``` --- ## `WHERE` + `BETWEEN` * Is `BETWEEN` inclusive? ```r db %>% dbGetQuery("SELECT * FROM kind_type WHERE id BETWEEN 2 AND 4;") ``` ``` ## id kind ## 1 2 tv series ## 2 3 tv movie ## 3 4 video movie ``` --- ## `WHERE` + `IN` * **What is the `dplyr` version?** ```r db %>% dbGetQuery("SELECT * FROM kind_type WHERE id IN (2, 4, 7);") ``` ``` ## id kind ## 1 2 tv series ## 2 4 video movie ## 3 7 episode ``` --- ## `WHERE` + `IS NULL` ```r db %>% dbGetQuery("SELECT COUNT(*) FROM title WHERE production_year IS NULL;") ``` ``` ## COUNT(*) ## 1 284248 ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM title WHERE title LIKE 'The Hobbit%' LIMIT 0, 10;") ``` ``` ## id title imdb_index kind_id ## 1 19392 The Hobbit World Premiere: 3 News Special <NA> 7 ## 2 211957 The Hobbit #3 <NA> 7 ## 3 211956 The Hobbit #2 <NA> 7 ## 4 211955 The Hobbit #1 <NA> 7 ## 5 393183 The Hobbit Debate/Quake-proofing Christchurch <NA> 7 ## 6 368261 The Hobbit: An Unexpected Journey <NA> 7 ## 7 461979 The Hobbit: The Battle of the Five Armies <NA> 7 ## 8 461980 The Hobbit: The Desolation of Smaug <NA> 7 ## 9 445509 The Hobbit Episode <NA> 7 ## 10 551575 The Hobbit <NA> 7 ## production_year imdb_id phonetic_code episode_of_id season_nr episode_nr ## 1 2012 NA H1364 19382 NA NA ## 2 2014 NA H13 211546 7 50 ## 3 2014 NA H13 211546 7 49 ## 4 2014 NA H13 211546 7 48 ## 5 2010 NA H1313 393064 NA NA ## 6 2012 NA H1352 368173 5 21 ## 7 2014 NA H1313 461323 4 116 ## 8 2013 NA H1324 461323 3 115 ## 9 2012 NA H1312 445487 NA NA ## 10 2008 NA H13 551565 1 5 ## series_years md5sum ## 1 <NA> c4263d236c4087c27a36ef812b2bc074 ## 2 <NA> 2d75f6c4b9c0229818d472789a2abcbd ## 3 <NA> 997b12a3e71540e42144a64838e04912 ## 4 <NA> b790775301e0905feac0301ae8ee5f45 ## 5 <NA> f0a69898f6a3f4be5719a15f96c57081 ## 6 <NA> 96911318791c86adbb4c1de214c00c0c ## 7 <NA> bbe3cfbff9a0bfbb41eac90a80975684 ## 8 <NA> 4ffac41f8ab21a9749b92aeb411f78e9 ## 9 <NA> 243fd1d3643971aa21e6eb75939dadbe ## 10 <NA> 085d887b8ab371a70b89a49380229f32 ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM title WHERE title LIKE '%Hobbit' LIMIT 0, 10;") ``` ``` ## id title imdb_index ## 1 269474 Drunk Hobbit <NA> ## 2 299412 Video Blogging from Behind the Scenes of the Hobbit <NA> ## 3 551575 The Hobbit <NA> ## 4 973474 'Twas the Week Before Hobbit <NA> ## 5 900600 Inside Look at the Hobbit <NA> ## 6 1179442 The Mystery of the Human Hobbit <NA> ## 7 1285159 The Hobbit <NA> ## 8 1219190 I Should Work on the Hobbit <NA> ## 9 1364649 Kids React to the Hobbit <NA> ## 10 1713486 The Real Hobbit <NA> ## kind_id production_year imdb_id phonetic_code episode_of_id season_nr ## 1 7 2013 NA D6521 269405 3 ## 2 7 2011 NA V3142 299322 2 ## 3 7 2008 NA H13 551565 1 ## 4 7 2012 NA T2321 973472 2 ## 5 7 2013 NA I5234 900552 1 ## 6 7 2005 NA M2361 1178497 42 ## 7 7 1979 NA H13 1284190 NA ## 8 2 2011 NA S4362 NA NA ## 9 7 2012 NA K3262 1364477 3 ## 10 7 2008 NA R413 1713423 1 ## episode_nr series_years md5sum ## 1 4 <NA> 3eb9dcbe0178ad9daab4d4b75c6c9147 ## 2 40 <NA> 8163c5df6efeead72cd117ceaa1def0a ## 3 5 <NA> 085d887b8ab371a70b89a49380229f32 ## 4 91 <NA> 0ad253e29933ef195586cf380726bb92 ## 5 6 <NA> 5c152175b72447167c20a900864addeb ## 6 3 <NA> 19ec7d540d2ce7e921551e5d6982cbc1 ## 7 NA <NA> 407c48d7fb754c473561ca68943a0704 ## 8 NA 2011-???? 93478ebe5765b85c4b3020500cd91a8d ## 9 26 <NA> 9433f440d520df003b499e86473c3f2f ## 10 12 <NA> 181a414589782bc180262c1d01300d88 ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM title WHERE title LIKE 'The Hob_' LIMIT 0, 10;") ``` ``` ## id title imdb_index kind_id production_year imdb_id phonetic_code ## 1 968252 The Hobo <NA> 7 1956 NA H1 ## 2 1377015 The Hobo <NA> 7 2015 NA H1 ## 3 1945566 The Hobo <NA> 7 1981 NA H1 ## 4 2461323 The Hobo <NA> 7 1956 NA H1 ## 5 4366600 The Hobo <NA> 1 1912 NA H1 ## 6 4366601 The Hobo <NA> 1 1917 NA H1 ## 7 4366602 The Hobo <NA> 1 2013 NA H1 ## episode_of_id season_nr episode_nr series_years ## 1 968174 1 13 <NA> ## 2 1377005 1 6 <NA> ## 3 1945555 2 3 <NA> ## 4 2461283 1 23 <NA> ## 5 NA NA NA <NA> ## 6 NA NA NA <NA> ## 7 NA NA NA <NA> ## md5sum ## 1 7ce44c0d081444ae51774df34c627ec9 ## 2 8342384768f6f8f1b4c02845005a79bd ## 3 04fca0738ed2f54a273180a1d878aeab ## 4 592b6e015394f2fb771a8097a4a356c9 ## 5 e17fe78b1a1d6440211c5df64a8d176c ## 6 f7830f670f82f0950622c0f83f9d892e ## 7 bb8b376479a2c5cdff6bafcd37093d03 ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM title WHERE title LIKE '%Hob%' LIMIT 0, 10;") ``` ``` ## id title imdb_index kind_id ## 1 31053 Secrets from Guantanamo Bay/The Hobbit/Goldie <NA> 7 ## 2 30953 Poisoned/Chief of Chobani/Gorilla Doctors <NA> 7 ## 3 50872 Wrestling with the Spirit - A Doukhobor Story <NA> 7 ## 4 26070 Phobias/Wonder Kids <NA> 7 ## 5 30418 Brain Hacking/Chief of Chobani/Japan's Babe Ruth <NA> 7 ## 6 12871 Hobo Hounds <NA> 7 ## 7 40553 Agoraphobic Sons and Lovers: Part 2 <NA> 7 ## 8 58234 Side by Side/Hooked on Phobics <NA> 7 ## 9 4788 Photo-Phobic to Photogenic <NA> 7 ## 10 57090 Touch Phobia <NA> 7 ## production_year imdb_id phonetic_code episode_of_id season_nr episode_nr ## 1 2005 NA S2632 30215 37 31 ## 2 2017 NA P2532 30215 49 38 ## 3 2001 NA W6234 50851 4 9 ## 4 1992 NA P1253 25599 5 38 ## 5 2017 NA B6525 30215 49 29 ## 6 1995 NA H1532 12861 2 13 ## 7 1982 NA A2612 40469 2 10 ## 8 1997 NA S3123 58187 4 9 ## 9 2005 NA P3123 100003 2 7 ## 10 2009 NA T21 57086 1 4 ## series_years md5sum ## 1 <NA> 0aa27b40a841bb0bdcca3c6dc2331018 ## 2 <NA> 753ddb89231f521279dbe9c46b31ad6d ## 3 <NA> c5102fe6d1535e69f4b4f41d96faa04b ## 4 <NA> cce0c3c9c15eb51bf75c168d366da2b4 ## 5 <NA> 452e6e50857cf2b58977b7f60bd8c5c2 ## 6 <NA> f20dae7d49d7579a208bf66459ed2e81 ## 7 <NA> 87ceeba25ddcb9a7e10214eeea59c2b3 ## 8 <NA> 9f05278c9308a9dae58c0ad10f680807 ## 9 <NA> 72433cdb105068730863a33c902a0188 ## 10 <NA> a1f2c9ce8cf4dc994a64ba98e9b54abf ``` --- ## Pattern Matching with `LIKE` and `NOT LIKE` ```r db %>% dbGetQuery("SELECT * FROM name WHERE name NOT LIKE 'A%' LIMIT 0, 10;") ``` ``` ## id name imdb_index imdb_id gender name_pcode_cf ## 1 235 -Alverio, Esteban Rodriguez <NA> NA m A4162 ## 2 619 78 Plus <NA> NA m P42 ## 3 258 ., Gong-Myoung <NA> NA m G5252 ## 4 336 10CC <NA> NA m C2 ## 5 236 -Ek, Ivan <NA> NA m E215 ## 6 255 ., Dian Aisyah <NA> NA m D52 ## 7 438 2Mex <NA> NA m M2 ## 8 246 ., Carlos <NA> NA m C642 ## 9 409 2000, Spoiler <NA> NA m S146 ## 10 65 'El Happy' Lora, Miguel <NA> NA m E4146 ## name_pcode_nf surname_pcode md5sum ## 1 E2315 A416 f5c410bff6839b545d04c531f776e8f2 ## 2 <NA> <NA> 1740dc46b0120dc8fd73af48c658430f ## 3 <NA> <NA> 78c74b0e1bdc4ea16ac6c313691d3ea5 ## 4 <NA> <NA> fef9ccd06e5a016078ac9d9ca705507f ## 5 I152 E2 fd4f55b18ef8efc1c58da2653260899d ## 6 <NA> <NA> 7107d4dce877b981e996078631bff681 ## 7 <NA> <NA> 9908e29f69fdd0ce98c50b69215846ec ## 8 <NA> <NA> 27dd23d89c453112b8fc0ffd2bda92b6 ## 9 <NA> <NA> ade2aad58b662b680f650b0d9f28bb52 ## 10 M2414 <NA> 4d04196e6b407211a1685f2556fd8da3 ``` --- ## Pattern Matching ```r db %>% dbGetQuery("SELECT * FROM title WHERE title REGEXP 'The Hobbit' LIMIT 0, 10;") ``` ``` ## id ## 1 31053 ## 2 19392 ## 3 30949 ## 4 90652 ## 5 211957 ## 6 298046 ## 7 211956 ## 8 299412 ## 9 211955 ## 10 393183 ## title ## 1 Secrets from Guantanamo Bay/The Hobbit/Goldie ## 2 The Hobbit World Premiere: 3 News Special ## 3 Plan B/The Hobbit/Mel Brooks on Broadway ## 4 Godzilla, the Hobbit and Evil Dead: Oh My! ## 5 The Hobbit #3 ## 6 Cast of 'The Hobbit: An Unexpected Journey'/Jesse Tyler Ferguson/Carmen Electra ## 7 The Hobbit #2 ## 8 Video Blogging from Behind the Scenes of the Hobbit ## 9 The Hobbit #1 ## 10 The Hobbit Debate/Quake-proofing Christchurch ## imdb_index kind_id production_year imdb_id phonetic_code episode_of_id ## 1 <NA> 7 2005 NA S2632 30215 ## 2 <NA> 7 2012 NA H1364 19382 ## 3 <NA> 7 2006 NA P4513 30215 ## 4 <NA> 7 2012 NA G3243 90639 ## 5 <NA> 7 2014 NA H13 211546 ## 6 <NA> 7 2012 NA C2313 297960 ## 7 <NA> 7 2014 NA H13 211546 ## 8 <NA> 7 2011 NA V3142 299322 ## 9 <NA> 7 2014 NA H13 211546 ## 10 <NA> 7 2010 NA H1313 393064 ## season_nr episode_nr series_years md5sum ## 1 37 31 <NA> 0aa27b40a841bb0bdcca3c6dc2331018 ## 2 NA NA <NA> c4263d236c4087c27a36ef812b2bc074 ## 3 38 38 <NA> f5ff5818078da0a204632e8d00ee8335 ## 4 1 17 <NA> 4c08c96d557841bfb22546ee04bd0720 ## 5 7 50 <NA> 2d75f6c4b9c0229818d472789a2abcbd ## 6 6 49 <NA> 4da07219ce38fb5cca3b7200dd513992 ## 7 7 49 <NA> 997b12a3e71540e42144a64838e04912 ## 8 2 40 <NA> 8163c5df6efeead72cd117ceaa1def0a ## 9 7 48 <NA> b790775301e0905feac0301ae8ee5f45 ## 10 NA NA <NA> f0a69898f6a3f4be5719a15f96c57081 ``` * Can get much more sophisticated. + See examples [here](https://dev.mysql.com/doc/refman/5.7/en/pattern-matching.html). --- ## `ORDER BY` ```r db %>% dbGetQuery("SELECT * FROM title ORDER BY production_year DESC LIMIT 0, 10;") ``` ``` ## id title imdb_index kind_id ## 1 3204054 100 Years <NA> 1 ## 2 4448363 The Zero Century: Maetel <NA> 1 ## 3 3313453 Avatar 5 <NA> 1 ## 4 3313452 Avatar 4 <NA> 1 ## 5 3500140 Despicable Me 4 <NA> 1 ## 6 3591414 Fantastic Beasts and Where to Find Them 5 <NA> 1 ## 7 4339206 The Dark Room <NA> 1 ## 8 4504125 Under the Red Hood <NA> 1 ## 9 4540635 War in Black <NA> 1 ## 10 3465504 Dag III <NA> 1 ## production_year imdb_id phonetic_code episode_of_id season_nr episode_nr ## 1 2115 NA Y62 NA NA NA ## 2 2026 NA Z6253 NA NA NA ## 3 2025 NA A136 NA NA NA ## 4 2024 NA A136 NA NA NA ## 5 2024 NA D2121 NA NA NA ## 6 2024 NA F5323 NA NA NA ## 7 2024 NA D6265 NA NA NA ## 8 2024 NA U5363 NA NA NA ## 9 2024 NA W6514 NA NA NA ## 10 2023 NA D2 NA NA NA ## series_years md5sum ## 1 <NA> 09bf2aeba5b3ad6f20a9189b24977160 ## 2 <NA> 0a6b3bb9ea4283dd675b9e80c9d746e5 ## 3 <NA> 6f162e78084f4eea4b41a60798a3b059 ## 4 <NA> 78f8c4c0db18d9ecc7bbe21ff280a7c5 ## 5 <NA> 05d26ac8773334b1e94d210b5bcb7202 ## 6 <NA> 4148d5a83722967b5bae325b7020f66b ## 7 <NA> e5ecb2acc42c86b9792bc4bc25187861 ## 8 <NA> 138f6db743aab2ad3d7e5c4835777bee ## 9 <NA> 8a9c42dde74fd4cbbeecc87f0eba9b1a ## 10 <NA> 20eaf52dc7e30522cc4b269edc90d6a4 ``` --- ## New Example * Want to see how to use the aggregate functions but the `imdb` database doesn't have a lot of quantitative data. ```r library(tidyverse) library(RMySQL) db <- dbConnect(MySQL(), host = "scidb.smith.edu", user = "waiuser", password = "smith_waiDB", dbname = "wai") ``` --- ## New Example * A database of wideband acoustic immitance variables from humans with normal hearing. + [More info here](http://www.science.smith.edu/wai-database/home/download-data/) ```r dbGetQuery(db, "SHOW TABLES;") ``` ``` ## Tables_in_wai ## 1 Codebook ## 2 Measurements ## 3 Measurements_pre2020 ## 4 PI_Info ## 5 PI_Info_OLD ## 6 Subjects ## 7 Subjects_pre2020 ``` ```r dbGetQuery(db, "EXPLAIN PI_Info;") ``` ``` ## Field Type Null Key Default Extra ## 1 Identifier varchar(50) NO PRI <NA> ## 2 Year int(11) NO <NA> ## 3 Authors text NO <NA> ## 4 AuthorsShortList text NO <NA> ## 5 Title text NO <NA> ## 6 Journal text NO <NA> ## 7 URL text NO <NA> ## 8 Abstract text NO <NA> ## 9 DataSubmitterName text NO <NA> ## 10 DataSubmitterEmail text NO <NA> ## 11 DateSubmitted text NO <NA> ## 12 PI_Notes text NO <NA> ``` --- ## New Example ```r dbGetQuery(db, "SELECT * FROM Measurements LIMIT 10;") ``` ``` ## Identifier SubjectNumber Session Ear Instrument Age AgeCategory EarStatus ## 1 Abur_2014 1 1 Left HearID 20 Adult Normal ## 2 Abur_2014 1 1 Left HearID 20 Adult Normal ## 3 Abur_2014 1 1 Left HearID 20 Adult Normal ## 4 Abur_2014 1 1 Left HearID 20 Adult Normal ## 5 Abur_2014 1 1 Left HearID 20 Adult Normal ## 6 Abur_2014 1 1 Left HearID 20 Adult Normal ## 7 Abur_2014 1 1 Left HearID 20 Adult Normal ## 8 Abur_2014 1 1 Left HearID 20 Adult Normal ## 9 Abur_2014 1 1 Left HearID 20 Adult Normal ## 10 Abur_2014 1 1 Left HearID 20 Adult Normal ## TPP AreaCanal PressureCanal SweepDirection Frequency Absorbance Zmag ## 1 -5 0.000044 0 Ambient 211 0.033 113780000 ## 2 -5 0.000044 0 Ambient 234 0.032 103585000 ## 3 -5 0.000044 0 Ambient 258 0.041 92951700 ## 4 -5 0.000044 0 Ambient 281 0.044 86058000 ## 5 -5 0.000044 0 Ambient 305 0.049 79492800 ## 6 -5 0.000044 0 Ambient 328 0.053 73326200 ## 7 -5 0.000044 0 Ambient 352 0.058 68793600 ## 8 -5 0.000044 0 Ambient 375 0.064 64088600 ## 9 -5 0.000044 0 Ambient 398 0.069 60200600 ## 10 -5 0.000044 0 Ambient 422 0.083 56990900 ## Zang ## 1 -0.23 ## 2 -0.24 ## 3 -0.23 ## 4 -0.23 ## 5 -0.23 ## 6 -0.23 ## 7 -0.23 ## 8 -0.23 ## 9 -0.23 ## 10 -0.23 ``` --- ## Aggregate Functions * We have already seen `COUNT` * What does this code seem to do? ```r db %>% dbGetQuery("SELECT Identifier, COUNT(*) AS n FROM Measurements GROUP BY Identifier ORDER BY n DESC;") ``` ``` ## Identifier n ## 1 Lewis_2018 245820 ## 2 Lewis_2015 114716 ## 3 Shahnaz_2006 58776 ## 4 Hunter_2016 44726 ## 5 Groon_2015 35469 ## 6 Feeney_2017 22134 ## 7 Abur_2014 21328 ## 8 Voss_2016 19344 ## 9 Voss_2010 14880 ## 10 Rosowski_2012 14384 ## 11 Werner_2010 7860 ## 12 Merchant_2010 7688 ## 13 Liu_2008 5520 ## 14 Voss_1994 5120 ## 15 Keefe_2017 3696 ## 16 Shaver_2013 2880 ## 17 Sun_2016 2604 ``` --- ## Aggregate Functions * **What is the `dplyr`/R version?** ```r db %>% dbGetQuery("SELECT AVG(Frequency) AS avgFreq, MAX(Frequency) AS maxFreq, SUM(Absorbance) AS totalAbsorbance FROM Measurements;") ``` ``` ## avgFreq maxFreq totalAbsorbance ## 1 8285 24000 251699 ``` --- ## Aggregate Functions ```r db %>% dbGetQuery("SELECT AVG(Frequency) FROM Measurements WHERE Ear = 'Left';") ``` ``` ## AVG(Frequency) ## 1 5175 ``` --- ## Aggregate Functions * **What is the `dplyr`/R version?** ```r db %>% dbGetQuery("SELECT AVG(Frequency), Ear FROM Measurements GROUP BY Ear;") ``` ``` ## AVG(Frequency) Ear ## 1 5175 Left ## 2 9493 Right ## 3 5051 Unknown ``` --- ## Aggregate Functions * But the aggregate functions are limited... ```r db %>% dbGetQuery("SELECT (Frequency - AVG(Frequency))/STD(Frequency) FROM Measurements;") ``` ``` ## (Frequency - AVG(Frequency))/STD(Frequency) ## 1 -1.2 ``` --- ## One More SQL Function for Today * Need to fix this error ```r db %>% dbGetQuery("SELECT AVG(Frequency) AS avgfreq FROM Measurements WHERE avgfreq > 6000 GROUP BY IDENTIFIER;") ``` ``` ## Error in .local(conn, statement, ...): could not run statement: Unknown column 'avgfreq' in 'where clause' ``` --- ## One More SQL Function for Today: `HAVING` ```r db %>% dbGetQuery("SELECT AVG(Frequency) AS avgfreq FROM Measurements GROUP BY IDENTIFIER HAVING avgfreq > 6000") ``` ``` ## avgfreq ## 1 12000 ## 2 12000 ## 3 10020 ``` --- ## Query Order in SQL: Order matters! * `SELECT` * `FROM` * `JOIN` * `WHERE` * `GROUP BY` * `HAVING` * `ORDER BY` * `LIMIT` --- ## Next Class * More comparing `dplyr` and `SQL` * Joins!