ETC5510: Introduction to Data Analysis

<div class="shade_black"  style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;">
<i class="fas fa-exclamation-circle"></i> These slides are best viewed in Chrome and occasionally need to be refreshed if elements do not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>.
</div>

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number split-70 title-slide
count: false

.column.shade_black[.content[

<br>

# .monash-blue.outline-text[ETC5510: Introduction to Data Analysis]

<br>

<h2 style="font-weight:900!important;">Web scraping</h2>

.bottom_abs.width100[

Lecturer: *Nicholas Tierney & Stuart Lee*

Department of Econometrics and Business Statistics

<span><i class="fas  fa-envelope faa-float animated "></i></span>  ETC5510.Clayton-x@monash.edu

April 2020

<br>
]

]]

---

background-image: url(https://www.kdnuggets.com/images/cartoon-turkey-data-science.jpg)
background-size: contain
background-position: 50% 50%

---
# Overview

- Different file formats
    - Audio / binary
- Web data
    - ethics of web scraping
    - how to get data off the web
    - JSON

---
class: transition
# Recap on some tricky topics

- assignment ("gets" - `<-`)
- pipes (from the textbook)

---
# The pipe operator: `%>%`

- Code that tells the story of Little Bunny Foo Foo (borrowed from https://r4ds.had.co.nz/pipes.html).
- Each verb is a function: `hop()`, `scoop()`, `bop()`.

> Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head

---
# Approach: Intermediate steps

```r
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```

- Main downside: it forces you to name each intermediate element.
- Sometimes these steps have natural names. If so, go ahead.
- **But often there are no natural names.**
- Adding number suffixes to make the names unique leads to problems.

---
# Approach: Intermediate steps

```r
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```
--

- Code is cluttered with unimportant names
- Suffix has to be carefully incremented on each line.
- I've done this! 
- 99% of the time I miss a number somewhere, and there goes my evening ... debugging my code.

---
# Another Approach: Overwrite the original

```r
foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)
```

- Overwrite originals instead of creating intermediate objects 
- Less typing (and less thinking). Less likely to make mistakes?
- **Painful debugging**: need to re-run the code from the top.
- Repetition of the object name (`foo_foo` is written 6 times!) obscures what changes.

---
# (Yet) Another approach: function composition

```r
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ), 
  on = head
)
```

- You need to read inside-out, and right-to-left.
- Arguments are spread far apart
- Harder to read

---
# Pipe `%>%` can help!

.pull-left[
`f(x)`

`g(f(x))`

`h(g(f(x)))`
]

.pull-right[
`x %>% f()`

`x %>% f() %>% g()`

`x %>% f() %>% g() %>% h()`
]
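
---
# Pipe `%>%`: a quick check

A minimal sketch, assuming the magrittr package is installed. The vector `x` and the choice of `sqrt()`, `mean()` and `round()` are only for illustration. It checks that the nested form and the piped form give the same result.

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9, 16)  # illustrative values

# Nested calls: read inside-out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read left-to-right
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both forms compute the same value
identical(nested, piped)
```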

---
# Solution: Use the pipe - `%>%`

```r
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
```

- Focuses on verbs, not nouns.
- Can be read as a series of actions, applied one after another.

> Foo Foo hops, then scoops, then bops.

- read more at: https://r4ds.had.co.nz/pipes.html

---
class: transition

# Assignment `<-`

"gets"

---
# Assignment

We can perform calculations in R:

```r
1 + 1
read_csv("data.csv")
```

---
# Assignment

But what if we want to use that information later?

```r
1 + 1
read_csv("data.csv")
```

---
# Assignment

We can assign these things to an object using `<-`

This reads as "gets".

```r
x <- 1 + 1
my_data <- read_csv("data.csv")
```

- `x` 'gets' `1 + 1`
- `my_data` 'gets' the output of `read_csv("data.csv")`

---
# Assignment

Then we can use those things in other calculations

```r
x <- 1 + 1
my_data <- read_csv("data.csv")

x * x

my_data %>% 
  select(age, height, weight) %>% 
  mutate(bmi = weight / height^2)
```
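
---
# Assignment: a runnable sketch

A self-contained sketch of the same idea, assuming dplyr is installed. The values in `my_data` are made up for illustration (height in metres, weight in kilograms), and `bmi_data` is a hypothetical name. It shows that the result of a pipe is only kept if it is assigned.

```r
library(dplyr)

# Made-up values, for illustration only
my_data <- tibble(
  age = c(25, 40),
  height = c(1.60, 1.80),  # metres
  weight = c(64, 81)       # kilograms
)

# Without `<-` the result is printed and then lost
my_data %>%
  mutate(bmi = weight / height^2)

# With `<-` the result is stored, so bmi_data 'gets' the new table
bmi_data <- my_data %>%
  mutate(bmi = weight / height^2)

names(my_data)   # unchanged: no bmi column
names(bmi_data)  # includes bmi
```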

---
class: transition
# Take 3 minutes to think about these two concepts

- What are pipes? `%>%`
- What is assignment? `<-`

---
class: transition
# The many shapes and sizes of data

---
# Data as an audio file

```r
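library(tidyverse)  # tibble() and %>%
library(tuneR)      # readWave() and extractWave()

# Read the recording and keep samples 25000 to 75000
wv_obj <- readWave("data/data3.wav") %>% 
    extractWave(from = 25000, to = 75000)

# One row per sample, with the left and right channels as columns
df_wav_data <- tibble(
  t = seq_len(length(wv_obj)),
  left = wv_obj@left,
  right = wv_obj@right,
  word = "data"
)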
```

```
## Rows: 100,002
## Columns: 4
## $ t     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ left  <int> 28, 27, 26, 24, 22, 15, 15, 12, 15, 18, 20, 27, 20, 18, 18, 12,…
## $ right <int> 29, 28, 24, 27, 18, 19, 13, 13, 16, 16, 21, 26, 18, 22, 13, 17,…
## $ word  <chr> "data", "data", "data", "data", "data", "data", "data", "data",…
```

---
# Plotting audio data?
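
Once the audio is in a data frame, it can be drawn as a time series. A rough sketch, assuming ggplot2 and tibble are installed. The data here is a synthetic decaying sine wave standing in for `df_wav_data`, not the real recording, and the names `df_sketch` and `p` are hypothetical.

```r
library(ggplot2)
library(tibble)

# Synthetic stand-in for one audio channel (not real audio)
df_sketch <- tibble(
  t = 1:2000,
  left = as.integer(round(1000 * sin(2 * pi * t / 100) * exp(-t / 800)))
)

# Amplitude against sample index, drawn as a line
p <- ggplot(df_sketch, aes(x = t, y = left)) +
  geom_line() +
  labs(x = "sample (t)", y = "amplitude (left channel)")

p
```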

---
# Compare left and right channels

???

Oh, same sound is on both channels! A tad drab.

---
# Compute statistics

```
## # A tibble: 200,004 x 4
##        t word  channel value
##    <int> <chr> <chr>   <int>
##  1     1 data  left       28
##  2     1 data  right      29
##  3     2 data  left       27
##  4     2 data  right      28
##  5     3 data  left       26
##  6     3 data  right      24
##  7     4 data  left       24
##  8     4 data  right      27
##  9     5 data  left       22
## 10     5 data  right      18
## # … with 199,994 more rows
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> word </th>
   <th style="text-align:right;"> m </th>
   <th style="text-align:right;"> s </th>
   <th style="text-align:right;"> mx </th>
   <th style="text-align:right;"> mn </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> data </td>
   <td style="text-align:right;"> 0.004 </td>
   <td style="text-align:right;"> 1602.577 </td>
   <td style="text-align:right;"> 8393 </td>
   <td style="text-align:right;"> -15386 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> statistics </td>
   <td style="text-align:right;"> 0.009 </td>
   <td style="text-align:right;"> 1506.626 </td>
   <td style="text-align:right;"> 6601 </td>
   <td style="text-align:right;"> -11026 </td>
  </tr>
</tbody>
</table>
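
---
# Compute statistics: a sketch

One way the long table and the summary could be produced, sketched with tidyr and dplyr. This is an assumption about the approach, not necessarily the code behind these slides. The five samples are the first values printed for the word "data"; the names `df_small`, `df_long` and `stats` are hypothetical.

```r
library(dplyr)
library(tidyr)

# First five samples of "data", copied from the printed output
df_small <- tibble(
  t = 1:5,
  left = c(28L, 27L, 26L, 24L, 22L),
  right = c(29L, 28L, 24L, 27L, 18L),
  word = "data"
)

# Wide to long: one row per sample per channel
df_long <- df_small %>%
  pivot_longer(
    cols = c(left, right),
    names_to = "channel",
    values_to = "value"
  )

# Mean, standard deviation, maximum and minimum for each word
stats <- df_long %>%
  group_by(word) %>%
  summarise(
    m = mean(value),
    s = sd(value),
    mx = max(value),
    mn = min(value)
  )

stats
```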

---
# Di's music

```
## # A tibble: 62 x 8
##    X1             artist type       lvar  lave  lmax lfener lfreq
##    <chr>          <chr>  <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl>
##  1 Dancing Queen  Abba   Rock  17600756. -90.0 29921   106.  59.6
##  2 Knowing Me     Abba   Rock   9543021. -75.8 27626   103.  58.5
##  3 Take a Chance  Abba   Rock   9049482. -98.1 26372   102. 125. 
##  4 Mamma Mia      Abba   Rock   7557437. -90.5 28898   102.  48.8
##  5 Lay All You    Abba   Rock   6282286. -89.0 27940   100.  74.0
##  6 Super Trouper  Abba   Rock   4665867. -69.0 25531   100.  81.4
##  7 I Have A Dream Abba   Rock   3369670. -71.7 14699   105. 305. 
##  8 The Winner     Abba   Rock   1135862  -67.8  8928   104. 278. 
##  9 Money          Abba   Rock   6146943. -76.3 22962   102. 165. 
## 10 SOS            Abba   Rock   3482882. -74.1 15517   104. 147. 
## # … with 52 more rows
```

---
# Plot Di's music

---
# Plot Di's Music

Abba is just different from everyone else!

---
# Question time:

- How does `data` look different from `statistics` in the time series?
- What format is the data in an audio file?
- How is Abba different from the other music clips?

---
# Why look at audio data?

- Data comes in many shapes and sizes
- Audio data can be transformed ("rectangled") into a data.frame
- Try on your own music with the [spotifyr](https://github.com/charlie86/spotifyr) package!

---
# Scraping the web: what? why?

- An increasing amount of data is available on the web.
- These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors.
- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.

---
# Scraping the web: what? why?

1. Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
2. Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.

- Why R? It includes all the tools needed for web scraping, you already know it, and the data can be analysed directly. Python, Perl and Java are also effective tools.
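
---
# Scraping the web: JSON from an API

A minimal sketch of the second route, assuming the jsonlite package is installed. The JSON text is a hypothetical API response written inline so that no network access is needed; real APIs differ in their fields and structure.

```r
library(jsonlite)

# Hypothetical response body from a web API
json_txt <- '[
  {"artist": "Abba", "song": "Dancing Queen"},
  {"artist": "Abba", "song": "Mamma Mia"}
]'

# An array of objects is simplified into a data frame
songs <- fromJSON(json_txt)

str(songs)
```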

---
class: transition
# Web Scraping with `rvest` and `polite`

---
# Hypertext Markup Language

Most data on the web is still published as HTML. HTML is structured (hierarchical, tree-based), but it is often not in a form that is useful for analysis (flat, tidy).
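
A small sketch of going from the HTML tree to a flat table, assuming a recent version of rvest (1.0 or later) is installed. The HTML is a made-up page written inline, so nothing is downloaded; `html_txt`, `page`, `title` and `songs` are hypothetical names.

```r
library(rvest)  # also provides %>%

# A made-up page: a heading and a table nested inside <body>
html_txt <- '
<html><body>
  <h1>Songs</h1>
  <table>
    <tr><th>song</th><th>artist</th></tr>
    <tr><td>Dancing Queen</td><td>Abba</td></tr>
    <tr><td>Mamma Mia</td><td>Abba</td></tr>
  </table>
</body></html>'

# Parse the text into a document tree
page <- read_html(html_txt)

# Pick out one node with a CSS selector and take its text
title <- page %>%
  html_element("h1") %>%
  html_text2()

# Convert the <table> node into a rectangular data frame
songs <- page %>%
  html_element("table") %>%
  html_table()

songs
```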