<!-- background-color: #006DAE --> <!-- class: middle center hide-slide-number --> <div class="shade_black" style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;"> <i class="fas fa-exclamation-circle"></i> These slides are viewed best by Chrome and occasionally need to be refreshed if elements did not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>. </div> <br> .white[Press the **right arrow** to progress to the next slide!] --- background-image: url(images/bg1.jpg) background-size: cover class: hide-slide-number split-70 title-slide count: false .column.shade_black[.content[ <br> # .monash-blue.outline-text[ETC5510: Introduction to Data Analysis] <h2 class="monash-blue2 outline-text" style="font-size: 30pt!important;">Week 5, part B</h2> <br> <h2 style="font-weight:900!important;">Web scraping</h2> .bottom_abs.width100[ Lecturer: *Nicholas Tierney & Stuart Lee* Department of Econometrics and Business Statistics
<i class="fas fa-envelope faa-float animated "></i>
ETC5510.Clayton-x@monash.edu April 2020 <br> ] ]] <div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);"> <div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;"> <svg class="clip-svg absolute"> <defs> <clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox"> <polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" /> </clipPath> </defs> </svg> </div> </div> --- background-image: url(https://www.kdnuggets.com/images/cartoon-turkey-data-science.jpg) background-size: contain background-position: 50% 50% --- # Overview - Different file formats - Audio / binary - Web data - ethics of web scraping - how to get data off the web - JSON --- class: transition # Recap on some tricky topics - assignment ("gets" - `<-`) - pipes (from the textbook) --- # The pipe operator: `%>%` - Code to tell a story about a little bunny foo foo (borrowed from https://r4ds.had.co.nz/pipes.html): - Using functions for each verb: `hop()`, `scoop()`, `bop()`. > Little bunny Foo Foo Went hopping through the forest Scooping up the field mice And bopping them on the head --- # Approach: Intermediate steps ```r foo_foo_1 <- hop(foo_foo, through = forest) foo_foo_2 <- scoop(foo_foo_1, up = field_mice) foo_foo_3 <- bop(foo_foo_2, on = head) ``` -- - Main downside: forces you to name each intermediate element. - Sometimes these steps form natural names. If this is the case - go ahead. - **But many times there are not natural names** - Adding number suffixes to make the names unique leads to problems. --- # Approach: Intermediate steps ```r foo_foo_1 <- hop(foo_foo, through = forest) foo_foo_2 <- scoop(foo_foo_1, up = field_mice) foo_foo_3 <- bop(foo_foo_2, on = head) ``` -- - Code is cluttered with unimportant names - Suffix has to be carefully incremented on each line. - I've done this! 
- 99% of the time I miss a number somewhere, and there goes my evening ... debugging my code.

---

# Another Approach: Overwrite the original

```r
foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)
```

--

- Overwrite originals instead of creating intermediate objects
- Less typing (and less thinking). Less likely to make mistakes?
- **Painful debugging**: need to re-run the code from the top.
- Repetition of the object name (`foo_foo` written 6 times!) obscures what changes.

---

# (Yet) Another approach: function composition

```r
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)
```

--

- You need to read inside-out, and right-to-left.
- Arguments are spread far apart
- Harder to read

---

# Pipe `%>%` can help!

.pull-left[

`f(x)`

`g(f(x))`

`h(g(f(x)))`

]

--

.pull-right[

`x %>% f()`

`x %>% f() %>% g()`

`x %>% f() %>% g() %>% h()`

]

---

# Solution: Use the pipe - `%>%`

```r
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
```

- focusses on verbs, not nouns.
- Can be read as a series of function compositions like actions.

> Foo Foo hops, then scoops, then bops.

- read more at: https://r4ds.had.co.nz/pipes.html

---
class: transition

# Assignment `<-` "gets"

---

# Assignment

We can perform calculations in R:

```r
1 + 1
read_csv("data.csv")
```

---

# Assignment

But what if we want to use that information later?

```r
1 + 1
read_csv("data.csv")
```

---

# Assignment

We can assign these things to an object using `<-`

This reads as "gets".

```r
x <- 1 + 1
my_data <- read_csv("data.csv")
```

--

- x 'gets' 1+1
- my_data 'gets' the output of read_csv...
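
---

# Assignment and pipes: a quick check

A minimal sketch (not from the original slides) tying the two recap ideas together: `<-` stores a result under a name, and `%>%` rewrites nested calls as a left-to-right chain. The vector `x` and the names `nested` and `piped` are made up for illustration, and the sketch assumes the magrittr package is installed (it provides `%>%`).

```r
library(magrittr) # provides the pipe, %>%

# x "gets" a small numeric vector (hypothetical example data)
x <- c(1, 4, 9, 16)

# Function composition: read inside-out
nested <- round(sum(sqrt(x)), 1)

# The same computation with the pipe: read left-to-right
piped <- x %>% sqrt() %>% sum() %>% round(1)

# Both objects should hold the same value
identical(nested, piped)
```

Because both results were assigned, they can be reused in later calculations, for example `piped * 2`.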
---

# Assignment

Then we can use those things in other calculations:

```r
x <- 1 + 1
my_data <- read_csv("data.csv")

x * x

my_data %>%
  select(age, height, weight) %>%
  mutate(bmi = weight / height^2)
```

---
class: transition

# Take 3 minutes to think about these two concepts

- What are pipes? `%>%`
- What is assignment? `<-`

---
class: transition

# The many shapes and sizes of data

---

# Data as an audio file

```r
library(tuneR)
library(tidyverse) # provides tibble() and %>%

# read the audio file and keep samples 25000 to 75000
wv_obj <- readWave("data/data3.wav") %>%
  extractWave(from = 25000, to = 75000)

# one row per sample, one column per channel
df_wav_data <- tibble(
  t = seq_len(length(wv_obj)),
  left = wv_obj@left,
  right = wv_obj@right,
  word = "data"
)
```

```
## Rows: 100,002
## Columns: 4
## $ t <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ left <int> 28, 27, 26, 24, 22, 15, 15, 12, 15, 18, 20, 27, 20, 18, 18, 12,…
## $ right <int> 29, 28, 24, 27, 18, 19, 13, 13, 16, 16, 21, 26, 18, 22, 13, 17,…
## $ word <chr> "data", "data", "data", "data", "data", "data", "data", "data",…
```

---

# Plotting audio data?

<img src="lecture_5b_files/figure-html/show-audio-1.png" width="80%" style="display: block; margin: auto;" />

---

# Compare left and right channels

<img src="lecture_5b_files/figure-html/gg-compare-left-and-right-1.png" width="80%" style="display: block; margin: auto;" />

???

Oh, same sound is on both channels! A tad drab.
--- # Compute statistics ``` ## # A tibble: 200,004 x 4 ## t word channel value ## <int> <chr> <chr> <int> ## 1 1 data left 28 ## 2 1 data right 29 ## 3 2 data left 27 ## 4 2 data right 28 ## 5 3 data left 26 ## 6 3 data right 24 ## 7 4 data left 24 ## 8 4 data right 27 ## 9 5 data left 22 ## 10 5 data right 18 ## # … with 199,994 more rows ``` <table> <thead> <tr> <th style="text-align:left;"> word </th> <th style="text-align:right;"> m </th> <th style="text-align:right;"> s </th> <th style="text-align:right;"> mx </th> <th style="text-align:right;"> mn </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> data </td> <td style="text-align:right;"> 0.004 </td> <td style="text-align:right;"> 1602.577 </td> <td style="text-align:right;"> 8393 </td> <td style="text-align:right;"> -15386 </td> </tr> <tr> <td style="text-align:left;"> statistics </td> <td style="text-align:right;"> 0.009 </td> <td style="text-align:right;"> 1506.626 </td> <td style="text-align:right;"> 6601 </td> <td style="text-align:right;"> -11026 </td> </tr> </tbody> </table> --- # Di's music ``` ## # A tibble: 62 x 8 ## X1 artist type lvar lave lmax lfener lfreq ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Dancing Queen Abba Rock 17600756. -90.0 29921 106. 59.6 ## 2 Knowing Me Abba Rock 9543021. -75.8 27626 103. 58.5 ## 3 Take a Chance Abba Rock 9049482. -98.1 26372 102. 125. ## 4 Mamma Mia Abba Rock 7557437. -90.5 28898 102. 48.8 ## 5 Lay All You Abba Rock 6282286. -89.0 27940 100. 74.0 ## 6 Super Trouper Abba Rock 4665867. -69.0 25531 100. 81.4 ## 7 I Have A Dream Abba Rock 3369670. -71.7 14699 105. 305. ## 8 The Winner Abba Rock 1135862 -67.8 8928 104. 278. ## 9 Money Abba Rock 6146943. -76.3 22962 102. 165. ## 10 SOS Abba Rock 3482882. -74.1 15517 104. 147. 
## # … with 52 more rows
```

---

# Plot Di's music

<img src="lecture_5b_files/figure-html/gg-di-music-1.png" width="80%" style="display: block; margin: auto;" />

---

# Plot Di's music

<img src="lecture_5b_files/figure-html/gg-di-music-points-1.png" width="80%" style="display: block; margin: auto;" />

Abba is just different from everyone else!

---

# Question time:

- "How does `data` appear different than `statistics` in the time series?"
- "What format is the data in an audio file?"
- "How is Abba different from the other music clips?"

---

# Why look at audio data?

- Data comes in many shapes and sizes
- Audio data can be transformed ("rectangled") into a data.frame
- Try on your own music with the [spotifyr](https://github.com/charlie86/spotifyr) package!

---

# Scraping the web: what? why?

- An increasing amount of data is available on the web.
- These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors.
- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.

---

# Scraping the web: what? why?

1. Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
2. Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.

- Why R? It includes all the tools necessary to do web scraping, you are already familiar with it, and you can analyse the data directly... But Python, Perl, and Java are also efficient tools.

---
class: transition

# Web Scraping with `rvest` and `polite`

---

# Hypertext Markup Language

Most of the data on the web is still available as HTML. While it is structured (hierarchical / tree based), it is often not available in a form useful for analysis (flat / tidy).
```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

# What if we want to extract parts of this text out?

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

--

- `read_html()`: read HTML in (like `read_csv` and co!)

--

- `html_nodes()`: select specified nodes from the HTML document using CSS selectors.

---

# Let's read it in with `read_html`

```r
library(rvest) # provides read_html() and html_nodes()
example <- read_html(here::here("slides/data/example.html"))
example
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <p align="center">Hello world!</p>\n </body>
```

--

- We have two parts - head and body - which makes sense:

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

# Now let's get the title

```r
example %>%
  html_nodes("title")
## {xml_nodeset (1)}
## [1] <title>This is a title</title>
```

--

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

# Now let's get the paragraph text

```r
example %>%
  html_nodes("p")
## {xml_nodeset (1)}
## [1] <p align="center">Hello world!</p>
```

--

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

# Rough summary

- `read_html` - read in an HTML file
- `html_nodes` - select the parts of the HTML file we want to look at
  - This requires knowing about the website structure
  - But it turns out websites are much, much more complicated than our little example file

---
class: transition

# rvest + polite: Simplify processing and manipulating HTML data

- `bow()` - check if the data can be scraped appropriately
- `scrape()` - scrape website data (with nice defaults)
- `html_nodes()` - select specified nodes from the HTML document using CSS
selectors.
- `html_table()` - parse an HTML table into a data frame.
- `html_text()` - extract the text content between a pair of tags.

---

# SelectorGadget: CSS selectors

- SelectorGadget is a tool that **helps** identify the HTML elements of interest
- It does this by constructing a CSS selector, which can be used to subset the HTML document.

---

# SelectorGadget: CSS selectors

.small[

Selector | Example | Description
------------ |------------------| ------------------------------------------------
element | `p` | Select all `<p>` elements
element element | `div p` | Select all `<p>` elements inside a `<div>` element
element>element | `div > p` | Select all `<p>` elements with `<div>` as a parent
.class | `.title` | Select all elements with class="title"
\#id | `#name` | Select all elements with id="name"
[attribute] | `[class]` | Select all elements with a class attribute
[attribute=value] | `[class=title]` | Select all elements with class="title"

]

---

# SelectorGadget

- SelectorGadget: Open source tool that eases CSS selector generation and discovery
- Install the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)
- A box will open in the bottom right of the website. Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
- Now click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector. Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.

---

# Top 250 movies on IMDB

Take a look at the source code and look for the `table` tag:

<br>

http://www.imdb.com/chart/top

<img src="images/imdb_top_250.png" width="80%" style="display: block; margin: auto;" />

---

# First check to make sure you're allowed!
```r
# install.packages("polite")
library(polite)
bow("http://www.imdb.com")
## <polite session> http://www.imdb.com
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 26 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
```

--

```r
bow("http://www.google.com")
## <polite session> http://www.google.com
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 275 rules are defined for 4 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
```

---

# Join in

- Open RStudio and download today's exercises

```r
# install.packages("usethis")
library(usethis)
use_course("https://mida.numbat.space/exercises/5b/mida-exercise-5b.zip")
```

---

# Demo

Let's go to http://www.imdb.com/chart/top

---

# Bow and scrape

```r
imdb_session <- bow("http://www.imdb.com/chart/top")
imdb_session
## <polite session> http://www.imdb.com/chart/top
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 26 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
imdb_data <- scrape(imdb_session)
imdb_data
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
```

---

# Select and format pieces: titles - `html_nodes()`

```r
library(rvest)
imdb_data %>%
  html_nodes(".titleColumn a")
## {xml_nodeset (250)}
## [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [10] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [11] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [14] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [15] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [17] <a href="/title/tt0099685/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [18] <a href="/title/tt0073486/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [19] <a href="/title/tt0047478/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## [20] <a href="/title/tt0114369/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ... ## ... 
``` --- # Select and format pieces: titles - `html_text() ` ```r imdb_data %>% html_nodes(".titleColumn a") %>% html_text() ## [1] "The Shawshank Redemption" ## [2] "The Godfather" ## [3] "The Godfather: Part II" ## [4] "The Dark Knight" ## [5] "12 Angry Men" ## [6] "Schindler's List" ## [7] "The Lord of the Rings: The Return of the King" ## [8] "Pulp Fiction" ## [9] "The Good, the Bad and the Ugly" ## [10] "The Lord of the Rings: The Fellowship of the Ring" ## [11] "Fight Club" ## [12] "Forrest Gump" ## [13] "Inception" ## [14] "Star Wars: Episode V - The Empire Strikes Back" ## [15] "The Lord of the Rings: The Two Towers" ## [16] "The Matrix" ## [17] "Goodfellas" ## [18] "One Flew Over the Cuckoo's Nest" ## [19] "Seven Samurai" ## [20] "Se7en" ## [21] "City of God" ## [22] "Life Is Beautiful" ## [23] "The Silence of the Lambs" ## [24] "It's a Wonderful Life" ## [25] "Star Wars" ## [26] "Parasite" ## [27] "Saving Private Ryan" ## [28] "Spirited Away" ## [29] "The Green Mile" ## [30] "Interstellar" ## [31] "Leon: The Professional" ## [32] "The Usual Suspects" ## [33] "Seppuku" ## [34] "The Lion King" ## [35] "American History X" ## [36] "The Pianist" ## [37] "Terminator 2: Judgment Day" ## [38] "Back to the Future" ## [39] "Modern Times" ## [40] "Psycho" ## [41] "Gladiator" ## [42] "City Lights" ## [43] "The Intouchables" ## [44] "The Departed" ## [45] "Whiplash" ## [46] "The Prestige" ## [47] "Once Upon a Time in the West" ## [48] "Grave of the Fireflies" ## [49] "Casablanca" ## [50] "Joker" ## [51] "Cinema Paradiso" ## [52] "Rear Window" ## [53] "Alien" ## [54] "Apocalypse Now" ## [55] "Memento" ## [56] "Raiders of the Lost Ark" ## [57] "The Great Dictator" ## [58] "The Lives of Others" ## [59] "Django Unchained" ## [60] "Paths of Glory" ## [61] "The Shining" ## [62] "Avengers: Infinity War" ## [63] "WALL·E" ## [64] "Sunset Boulevard" ## [65] "Spider-Man: Into the Spider-Verse" ## [66] "Princess Mononoke" ## [67] "Dr. 
Strangelove or: How I Learned to Stop Worrying and Love the Bomb" ## [68] "Oldboy" ## [69] "Witness for the Prosecution" ## [70] "Avengers: Endgame" ## [71] "The Dark Knight Rises" ## [72] "Once Upon a Time in America" ## [73] "Aliens" ## [74] "Your Name." ## [75] "Coco" ## [76] "American Beauty" ## [77] "Braveheart" ## [78] "1917" ## [79] "Das Boot" ## [80] "3 Idiots" ## [81] "Tengoku to jigoku" ## [82] "Toy Story" ## [83] "Taare Zameen Par" ## [84] "Star Wars: Episode VI - Return of the Jedi" ## [85] "Amadeus" ## [86] "Reservoir Dogs" ## [87] "Inglourious Basterds" ## [88] "Good Will Hunting" ## [89] "2001: A Space Odyssey" ## [90] "Requiem for a Dream" ## [91] "Vertigo" ## [92] "M - Eine Stadt sucht einen Mörder" ## [93] "Dangal" ## [94] "Eternal Sunshine of the Spotless Mind" ## [95] "Citizen Kane" ## [96] "The Hunt" ## [97] "Capharnaüm" ## [98] "Full Metal Jacket" ## [99] "North by Northwest" ## [100] "A Clockwork Orange" ## [101] "Snatch" ## [102] "The Kid" ## [103] "Bicycle Thieves" ## [104] "Singin' in the Rain" ## [105] "Scarface" ## [106] "Taxi Driver" ## [107] "Amelie" ## [108] "Lawrence of Arabia" ## [109] "The Sting" ## [110] "Toy Story 3" ## [111] "Metropolis" ## [112] "For a Few Dollars More" ## [113] "Ikiru" ## [114] "Jodaeiye Nader az Simin" ## [115] "Double Indemnity" ## [116] "The Apartment" ## [117] "To Kill a Mockingbird" ## [118] "Incendies" ## [119] "Indiana Jones and the Last Crusade" ## [120] "Up" ## [121] "L.A. 
Confidential" ## [122] "Monty Python and the Holy Grail" ## [123] "Heat" ## [124] "Rashomon" ## [125] "Die Hard" ## [126] "Yojimbo" ## [127] "Batman Begins" ## [128] "Green Book" ## [129] "Downfall" ## [130] "Unforgiven" ## [131] "Idi i smotri" ## [132] "Bacheha-Ye aseman" ## [133] "Some Like It Hot" ## [134] "Howl's Moving Castle" ## [135] "Ran" ## [136] "The Great Escape" ## [137] "All About Eve" ## [138] "A Beautiful Mind" ## [139] "Casino" ## [140] "Pan's Labyrinth" ## [141] "My Neighbor Totoro" ## [142] "El secreto de sus ojos" ## [143] "Raging Bull" ## [144] "Lock, Stock and Two Smoking Barrels" ## [145] "The Wolf of Wall Street" ## [146] "The Treasure of the Sierra Madre" ## [147] "Judgment at Nuremberg" ## [148] "There Will Be Blood" ## [149] "Babam ve Oglum" ## [150] "Three Billboards Outside Ebbing, Missouri" ## [151] "The Gold Rush" ## [152] "Chinatown" ## [153] "Dial M for Murder" ## [154] "V for Vendetta" ## [155] "Det sjunde inseglet" ## [156] "Inside Out" ## [157] "No Country for Old Men" ## [158] "Warrior" ## [159] "Shutter Island" ## [160] "Trainspotting" ## [161] "The Elephant Man" ## [162] "The Sixth Sense" ## [163] "The Thing" ## [164] "Room" ## [165] "Gone with the Wind" ## [166] "Jurassic Park" ## [167] "Blade Runner" ## [168] "The Bridge on the River Kwai" ## [169] "Smultronstället" ## [170] "Finding Nemo" ## [171] "The Third Man" ## [172] "On the Waterfront" ## [173] "Stalker" ## [174] "Fargo" ## [175] "Kill Bill: Vol. 1" ## [176] "The Truman Show" ## [177] "Gran Torino" ## [178] "Tôkyô monogatari" ## [179] "The Deer Hunter" ## [180] "Memories of Murder" ## [181] "Relatos salvajes" ## [182] "Eskiya" ## [183] "Andhadhun" ## [184] "Klaus" ## [185] "The Big Lebowski" ## [186] "Mary and Max" ## [187] "In the Name of the Father" ## [188] "Gone Girl" ## [189] "Hacksaw Ridge" ## [190] "The Grand Budapest Hotel" ## [191] "Ford v Ferrari" ## [192] "Persona" ## [193] "Mr. 
Smith Goes to Washington" ## [194] "How to Train Your Dragon" ## [195] "Before Sunrise" ## [196] "The General" ## [197] "Catch Me If You Can" ## [198] "Sherlock Jr." ## [199] "Prisoners" ## [200] "12 Years a Slave" ## [201] "Cool Hand Luke" ## [202] "Mad Max: Fury Road" ## [203] "Network" ## [204] "Stand by Me" ## [205] "Le salaire de la peur" ## [206] "Barry Lyndon" ## [207] "Into the Wild" ## [208] "Million Dollar Baby" ## [209] "Monty Python's Life of Brian" ## [210] "Platoon" ## [211] "Hachi: A Dog's Tale" ## [212] "Ben-Hur" ## [213] "Rush" ## [214] "La passion de Jeanne d'Arc" ## [215] "Andrei Rublev" ## [216] "Logan" ## [217] "Harry Potter and the Deathly Hallows: Part 2" ## [218] "Dead Poets Society" ## [219] "Les quatre cents coups" ## [220] "Rang De Basanti" ## [221] "Hotel Rwanda" ## [222] "Amores perros" ## [223] "Kaze no tani no Naushika" ## [224] "Spotlight" ## [225] "Ah-ga-ssi" ## [226] "Rocky" ## [227] "Rebecca" ## [228] "Portrait of a Lady on Fire" ## [229] "Monsters, Inc." 
## [230] "La haine" ## [231] "It Happened One Night" ## [232] "Faa yeung nin wa" ## [233] "Gangs of Wasseypur" ## [234] "Before Sunset" ## [235] "The Princess Bride" ## [236] "The Help" ## [237] "Ace in the Hole" ## [238] "Paris, Texas" ## [239] "The Invisible Guest" ## [240] "The Red Shoes" ## [241] "Drishyam" ## [242] "The Terminator" ## [243] "Lagaan: Once Upon a Time in India" ## [244] "Butch Cassidy and the Sundance Kid" ## [245] "Akira" ## [246] "Aladdin" ## [247] "PK" ## [248] "Kis Uykusu" ## [249] "Fanny och Alexander" ## [250] "Throne of Blood" ``` --- # Select and format pieces: save it ```r titles <- imdb_data %>% html_nodes(".titleColumn a") %>% html_text() ``` --- # Select and format pieces: years - nodes ```r imdb_data %>% html_nodes(".secondaryInfo") ## {xml_nodeset (250)} ## [1] <span class="secondaryInfo">(1994)</span> ## [2] <span class="secondaryInfo">(1972)</span> ## [3] <span class="secondaryInfo">(1974)</span> ## [4] <span class="secondaryInfo">(2008)</span> ## [5] <span class="secondaryInfo">(1957)</span> ## [6] <span class="secondaryInfo">(1993)</span> ## [7] <span class="secondaryInfo">(2003)</span> ## [8] <span class="secondaryInfo">(1994)</span> ## [9] <span class="secondaryInfo">(1966)</span> ## [10] <span class="secondaryInfo">(2001)</span> ## [11] <span class="secondaryInfo">(1999)</span> ## [12] <span class="secondaryInfo">(1994)</span> ## [13] <span class="secondaryInfo">(2010)</span> ## [14] <span class="secondaryInfo">(1980)</span> ## [15] <span class="secondaryInfo">(2002)</span> ## [16] <span class="secondaryInfo">(1999)</span> ## [17] <span class="secondaryInfo">(1990)</span> ## [18] <span class="secondaryInfo">(1975)</span> ## [19] <span class="secondaryInfo">(1954)</span> ## [20] <span class="secondaryInfo">(1995)</span> ## ... 
``` --- # Select and format pieces: years - text ```r imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() ## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)" ## [9] "(1966)" "(2001)" "(1999)" "(1994)" "(2010)" "(1980)" "(2002)" "(1999)" ## [17] "(1990)" "(1975)" "(1954)" "(1995)" "(2002)" "(1997)" "(1991)" "(1946)" ## [25] "(1977)" "(2019)" "(1998)" "(2001)" "(1999)" "(2014)" "(1994)" "(1995)" ## [33] "(1962)" "(1994)" "(1998)" "(2002)" "(1991)" "(1985)" "(1936)" "(1960)" ## [41] "(2000)" "(1931)" "(2011)" "(2006)" "(2014)" "(2006)" "(1968)" "(1988)" ## [49] "(1942)" "(2019)" "(1988)" "(1954)" "(1979)" "(1979)" "(2000)" "(1981)" ## [57] "(1940)" "(2006)" "(2012)" "(1957)" "(1980)" "(2018)" "(2008)" "(1950)" ## [65] "(2018)" "(1997)" "(1964)" "(2003)" "(1957)" "(2019)" "(2012)" "(1984)" ## [73] "(1986)" "(2016)" "(2017)" "(1999)" "(1995)" "(2019)" "(1981)" "(2009)" ## [81] "(1963)" "(1995)" "(2007)" "(1983)" "(1984)" "(1992)" "(2009)" "(1997)" ## [89] "(1968)" "(2000)" "(1958)" "(1931)" "(2016)" "(2004)" "(1941)" "(2012)" ## [97] "(2018)" "(1987)" "(1959)" "(1971)" "(2000)" "(1921)" "(1948)" "(1952)" ## [105] "(1983)" "(1976)" "(2001)" "(1962)" "(1973)" "(2010)" "(1927)" "(1965)" ## [113] "(1952)" "(2011)" "(1944)" "(1960)" "(1962)" "(2010)" "(1989)" "(2009)" ## [121] "(1997)" "(1975)" "(1995)" "(1950)" "(1988)" "(1961)" "(2005)" "(2018)" ## [129] "(2004)" "(1992)" "(1985)" "(1997)" "(1959)" "(2004)" "(1985)" "(1963)" ## [137] "(1950)" "(2001)" "(1995)" "(2006)" "(1988)" "(2009)" "(1980)" "(1998)" ## [145] "(2013)" "(1948)" "(1961)" "(2007)" "(2005)" "(2017)" "(1925)" "(1974)" ## [153] "(1954)" "(2005)" "(1957)" "(2015)" "(2007)" "(2011)" "(2010)" "(1996)" ## [161] "(1980)" "(1999)" "(1982)" "(2015)" "(1939)" "(1993)" "(1982)" "(1957)" ## [169] "(1957)" "(2003)" "(1949)" "(1954)" "(1979)" "(1996)" "(2003)" "(1998)" ## [177] "(2008)" "(1953)" "(1978)" "(2003)" "(2014)" "(1996)" "(2018)" "(2019)" ## [185] "(1998)" "(2009)" "(1993)" 
"(2014)" "(2016)" "(2014)" "(2019)" "(1966)" ## [193] "(1939)" "(2010)" "(1995)" "(1926)" "(2002)" "(1924)" "(2013)" "(2013)" ## [201] "(1967)" "(2015)" "(1976)" "(1986)" "(1953)" "(1975)" "(2007)" "(2004)" ## [209] "(1979)" "(1986)" "(2009)" "(1959)" "(2013)" "(1928)" "(1966)" "(2017)" ## [217] "(2011)" "(1989)" "(1959)" "(2006)" "(2004)" "(2000)" "(1984)" "(2015)" ## [225] "(2016)" "(1976)" "(1940)" "(2019)" "(2001)" "(1995)" "(1934)" "(2000)" ## [233] "(2012)" "(2004)" "(1987)" "(2011)" "(1951)" "(1984)" "(2016)" "(1948)" ## [241] "(2015)" "(1984)" "(2001)" "(1969)" "(1988)" "(1992)" "(2014)" "(2014)" ## [249] "(1982)" "(1957)" ``` --- # Select and format pieces: years - remove-brackets ```r imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() %>% str_remove("\\(") %>% # remove ( str_remove("\\)") %>% # remove ) as.numeric() ## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994 2010 1980 2002 ## [16] 1999 1990 1975 1954 1995 2002 1997 1991 1946 1977 2019 1998 2001 1999 2014 ## [31] 1994 1995 1962 1994 1998 2002 1991 1985 1936 1960 2000 1931 2011 2006 2014 ## [46] 2006 1968 1988 1942 2019 1988 1954 1979 1979 2000 1981 1940 2006 2012 1957 ## [61] 1980 2018 2008 1950 2018 1997 1964 2003 1957 2019 2012 1984 1986 2016 2017 ## [76] 1999 1995 2019 1981 2009 1963 1995 2007 1983 1984 1992 2009 1997 1968 2000 ## [91] 1958 1931 2016 2004 1941 2012 2018 1987 1959 1971 2000 1921 1948 1952 1983 ## [106] 1976 2001 1962 1973 2010 1927 1965 1952 2011 1944 1960 1962 2010 1989 2009 ## [121] 1997 1975 1995 1950 1988 1961 2005 2018 2004 1992 1985 1997 1959 2004 1985 ## [136] 1963 1950 2001 1995 2006 1988 2009 1980 1998 2013 1948 1961 2007 2005 2017 ## [151] 1925 1974 1954 2005 1957 2015 2007 2011 2010 1996 1980 1999 1982 2015 1939 ## [166] 1993 1982 1957 1957 2003 1949 1954 1979 1996 2003 1998 2008 1953 1978 2003 ## [181] 2014 1996 2018 2019 1998 2009 1993 2014 2016 2014 2019 1966 1939 2010 1995 ## [196] 1926 2002 1924 2013 2013 1967 2015 1976 1986 1953 1975 2007 
2004 1979 1986 ## [211] 2009 1959 2013 1928 1966 2017 2011 1989 1959 2006 2004 2000 1984 2015 2016 ## [226] 1976 1940 2019 2001 1995 1934 2000 2012 2004 1987 2011 1951 1984 2016 1948 ## [241] 2015 1984 2001 1969 1988 1992 2014 2014 1982 1957 ``` --- # Select and format pieces: years - `parse_number()` ```r imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() %>% parse_number() ## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994 2010 1980 2002 ## [16] 1999 1990 1975 1954 1995 2002 1997 1991 1946 1977 2019 1998 2001 1999 2014 ## [31] 1994 1995 1962 1994 1998 2002 1991 1985 1936 1960 2000 1931 2011 2006 2014 ## [46] 2006 1968 1988 1942 2019 1988 1954 1979 1979 2000 1981 1940 2006 2012 1957 ## [61] 1980 2018 2008 1950 2018 1997 1964 2003 1957 2019 2012 1984 1986 2016 2017 ## [76] 1999 1995 2019 1981 2009 1963 1995 2007 1983 1984 1992 2009 1997 1968 2000 ## [91] 1958 1931 2016 2004 1941 2012 2018 1987 1959 1971 2000 1921 1948 1952 1983 ## [106] 1976 2001 1962 1973 2010 1927 1965 1952 2011 1944 1960 1962 2010 1989 2009 ## [121] 1997 1975 1995 1950 1988 1961 2005 2018 2004 1992 1985 1997 1959 2004 1985 ## [136] 1963 1950 2001 1995 2006 1988 2009 1980 1998 2013 1948 1961 2007 2005 2017 ## [151] 1925 1974 1954 2005 1957 2015 2007 2011 2010 1996 1980 1999 1982 2015 1939 ## [166] 1993 1982 1957 1957 2003 1949 1954 1979 1996 2003 1998 2008 1953 1978 2003 ## [181] 2014 1996 2018 2019 1998 2009 1993 2014 2016 2014 2019 1966 1939 2010 1995 ## [196] 1926 2002 1924 2013 2013 1967 2015 1976 1986 1953 1975 2007 2004 1979 1986 ## [211] 2009 1959 2013 1928 1966 2017 2011 1989 1959 2006 2004 2000 1984 2015 2016 ## [226] 1976 1940 2019 2001 1995 1934 2000 2012 2004 1987 2011 1951 1984 2016 1948 ## [241] 2015 1984 2001 1969 1988 1992 2014 2014 1982 1957 ``` --- # Select and format pieces: years - remove-brackets ```r years <- imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() %>% str_remove("\\(") %>% # remove ( str_remove("\\)") %>% # remove ) as.numeric() 
``` --- # Select and format pieces: scores - nodes ```r imdb_data %>% html_nodes(".imdbRating strong") ## {xml_nodeset (250)} ## [1] <strong title="9.2 based on 2,223,871 user ratings">9.2</strong> ## [2] <strong title="9.1 based on 1,532,498 user ratings">9.1</strong> ## [3] <strong title="9.0 based on 1,072,747 user ratings">9.0</strong> ## [4] <strong title="9.0 based on 2,198,238 user ratings">9.0</strong> ## [5] <strong title="8.9 based on 649,952 user ratings">8.9</strong> ## [6] <strong title="8.9 based on 1,158,162 user ratings">8.9</strong> ## [7] <strong title="8.9 based on 1,575,677 user ratings">8.9</strong> ## [8] <strong title="8.8 based on 1,744,483 user ratings">8.8</strong> ## [9] <strong title="8.8 based on 657,904 user ratings">8.8</strong> ## [10] <strong title="8.8 based on 1,588,407 user ratings">8.8</strong> ## [11] <strong title="8.8 based on 1,772,401 user ratings">8.8</strong> ## [12] <strong title="8.8 based on 1,715,513 user ratings">8.8</strong> ## [13] <strong title="8.7 based on 1,950,039 user ratings">8.7</strong> ## [14] <strong title="8.7 based on 1,111,799 user ratings">8.7</strong> ## [15] <strong title="8.7 based on 1,423,109 user ratings">8.7</strong> ## [16] <strong title="8.6 based on 1,597,884 user ratings">8.6</strong> ## [17] <strong title="8.6 based on 967,841 user ratings">8.6</strong> ## [18] <strong title="8.6 based on 874,414 user ratings">8.6</strong> ## [19] <strong title="8.6 based on 300,682 user ratings">8.6</strong> ## [20] <strong title="8.6 based on 1,369,407 user ratings">8.6</strong> ## ... 
``` --- # Select and format pieces: scores - text ```r imdb_data %>% html_nodes(".imdbRating strong") %>% html_text() ## [1] "9.2" "9.1" "9.0" "9.0" "8.9" "8.9" "8.9" "8.8" "8.8" "8.8" "8.8" "8.8" ## [13] "8.7" "8.7" "8.7" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" ## [25] "8.6" "8.6" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" ## [37] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.4" ## [49] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" ## [61] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.3" "8.3" "8.3" ## [73] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" ## [85] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" ## [97] "8.3" "8.3" "8.3" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [109] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [121] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [133] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [145] "8.2" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [157] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [169] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [181] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [193] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [205] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.0" "8.0" "8.0" ## [217] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" ## [229] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" ## [241] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" ``` --- # Select and format pieces: scores - as-numeric ```r imdb_data %>% html_nodes(".imdbRating strong") %>% html_text() %>% as.numeric() ## [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.8 8.7 
8.7 8.7 8.6 8.6 8.6 ## [19] 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 ## [37] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 ## [55] 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 ## [73] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 ## [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [109] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [127] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [145] 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [163] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [181] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [199] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.0 8.0 8.0 ## [217] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 ## [235] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 ``` --- # Select and format pieces: scores - save ```r scores <- imdb_data %>% html_nodes(".imdbRating strong") %>% html_text() %>% as.numeric() ``` --- # Select and format pieces: put it all together ```r imdb_top_250 <- tibble(title = titles, year = years, score = scores) imdb_top_250 ## # A tibble: 250 x 3 ## title year score ## <chr> <dbl> <dbl> ## 1 The Shawshank Redemption 1994 9.2 ## 2 The Godfather 1972 9.1 ## 3 The Godfather: Part II 1974 9 ## 4 The Dark Knight 2008 9 ## 5 12 Angry Men 1957 8.9 ## 6 Schindler's List 1993 8.9 ## 7 The Lord of the Rings: The Return of the King 2003 8.9 ## 8 Pulp Fiction 1994 8.8 ## 9 The Good, the Bad and the Ugly 1966 8.8 ## 10 The Lord of the Rings: The Fellowship of the Ring 2001 8.8 ## # … with 240 more rows ``` --- <table> <thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> </tr> </thead> <tbody> 
<tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> </tr> <tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.1 </td> </tr> <tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... 
</td> </tr> </tbody> </table> --- # Aside: Yet another approach - pull the table with `html_table()` - requires notation we haven't used yet (e.g., what is `[[]]`) - requires substantial text cleaning - If there is time we can cover this at the end of class ```r imdb_table <- html_table(imdb_data) glimpse(imdb_table[[1]]) ## Rows: 250 ## Columns: 5 ## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ `Rank & Title` <chr> "1.\n The Shawshank Redemption\n (1994)", … ## $ `IMDb Rating` <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8,… ## $ `Your Rating` <chr> "12345678910\n \n \n \n … ## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ``` --- # Clean up / enhance May or may not be a lot of work depending on how messy the data are - See if you like what you got: ```r glimpse(imdb_top_250) ## Rows: 250 ## Columns: 3 ## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfather: Pa… ## $ year <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 2001, 199… ## $ score <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.8, 8.7… ``` --- # Clean up / enhance - Add a variable for rank ```r imdb_top_250 %>% mutate( rank = 1:nrow(imdb_top_250) ) ## # A tibble: 250 x 4 ## title year score rank ## <chr> <dbl> <dbl> <int> ## 1 The Shawshank Redemption 1994 9.2 1 ## 2 The Godfather 1972 9.1 2 ## 3 The Godfather: Part II 1974 9 3 ## 4 The Dark Knight 2008 9 4 ## 5 12 Angry Men 1957 8.9 5 ## 6 Schindler's List 1993 8.9 6 ## 7 The Lord of the Rings: The Return of the King 2003 8.9 7 ## 8 Pulp Fiction 1994 8.8 8 ## 9 The Good, the Bad and the Ugly 1966 8.8 9 ## 10 The Lord of the Rings: The Fellowship of the Ring 2001 8.8 10 ## # … with 240 more rows ``` --- <table> <thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> <th style="text-align:left;"> rank </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> The 
Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.1 </td> <td style="text-align:left;"> 2 </td> </tr> <tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 3 </td> </tr> <tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 7 </td> </tr> <tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 8 </td> </tr> <tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr> </tbody> </table> --- # Your Turn: Which 1995 movies made the list? 
--

```r
imdb_top_250 %>%
  filter(year == 1995)
## # A tibble: 8 x 3
##   title               year score
##   <chr>              <dbl> <dbl>
## 1 Se7en               1995   8.6
## 2 The Usual Suspects  1995   8.5
## 3 Braveheart          1995   8.3
## 4 Toy Story           1995   8.3
## 5 Heat                1995   8.2
## 6 Casino              1995   8.2
## 7 Before Sunrise      1995   8.1
## 8 La haine            1995   8
```

---

# Your Turn: Which years have the most movies on the list?

--

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(total = n()) %>%   # number of movies per year
  arrange(desc(total)) %>%     # most frequent years first
  head(5)
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  1957     7
## 3  2014     7
## 4  2019     7
## 5  2000     6
```

---

# Your Turn: Visualize top 250 yearly mean score over time

--

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%   # mean score per year
  ggplot(aes(y = avg_score, x = year)) +
  geom_point() +
  geom_smooth(method = "lm") +             # linear trend line
  xlab("year")
```

---

<img src="lecture_5b_files/figure-html/visualise-score-year-print-1.png" width="80%" style="display: block; margin: auto;" />

---

# Other common formats: JSON

- JavaScript Object Notation (JSON).
- A language-independent data format that has largely supplanted Extensible Markup Language (XML).
- Data are sometimes stored as JSON, which must be unpacked (parsed) before it can be analysed.

---

# Unpacking JSON: Example

JSON from [jsonlite](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html)

.pull-left[
```r
library(jsonlite)
json_mario <- '[
  {
    "Name": "Mario",
    "Age": 32,
    "Occupation": "Plumber"
  },
  {
    "Name": "Peach",
    "Age": 21,
    "Occupation": "Princess"
  },
  {},
  {
    "Name": "Bowser",
    "Occupation": "Koopa"
  }
]'
```
]

.pull-right[
```r
mydf <- fromJSON(json_mario)
mydf
##     Name Age Occupation
## 1  Mario  32    Plumber
## 2  Peach  21   Princess
## 3   <NA>  NA       <NA>
## 4 Bowser  NA      Koopa
```
]

---

# Potential challenges with web scraping

- Unreliable formatting at the source
- Data broken into many pages
- Data arriving in multiple Excel file formats
- ...

We will come back to this when we learn about functions next week.
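As a small preview of the "data broken into many pages" challenge, here is a minimal sketch, not a definitive recipe. It assumes each page can be scraped with the same steps used for the IMDb list. The `pages` vector, the `.ad` class, and the `scrape_page()` helper are hypothetical names invented for this illustration; the pages are written as literal HTML strings so the example runs without a network connection.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical stand-ins for two pages of a paginated listing.
# With a real site, each element would be a URL given to read_html().
pages <- c(
  '<ul><li class="ad">Flat A</li><li class="ad">Flat B</li></ul>',
  '<ul><li class="ad">Flat C</li></ul>'
)

# Hypothetical helper: scrape one page by parsing it,
# selecting the ad nodes, and keeping their text.
scrape_page <- function(page) {
  titles <- read_html(page) %>%
    html_nodes(".ad") %>%
    html_text()
  tibble(title = titles)
}

# Apply the same steps to every page, then stack the results.
listings <- map(pages, scrape_page) %>%
  bind_rows()
listings
```

Wrapping the steps in a function means the scraping logic is written once and reused for every page, which is why this topic fits naturally with functions.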
> Compare the display of information at [Gumtree Melbourne](https://www.gumtree.com.au/s-monash/l3001600) to the IMDb Top 250 list. What challenges can you foresee in scraping a list of the available apartments?

---

# Further exploring

People write R packages to access online data! Check out:

- [cricinfo by Sayani Gupta and Rob Hyndman](https://docs.ropensci.org/cricketdata/)
- [rwalkr by Earo Wang](https://github.com/earowang/rwalkr)
- [fitzRoy for AFL data](https://github.com/jimmyday12/fitzRoy/)
- [Top 40 lists of R packages by Joe Rickert](https://rviews.rstudio.com/2019/07/24/june-2019-top-40-r-packages/) - they usually include a "data" section.

---

# A note on the mid-semester test

- It will be made available next Wednesday after class
- It will be a multiple-choice and short-answer exam
- The test will be conducted on Moodle
- More details to come soon!