ETC5510: Introduction to Data Analysis

class: center, middle, inverse, title-slide

# ETC5510: Introduction to Data Analysis
## Week of Tidy Data + Style
### Stuart Lee & Nicholas Tierney
### 11th Mar 2020

---

# How to learn

I want to take some time to discuss ideas on learning, and how they tie into the course.

---
background-image: url(images/how-to-learn-img-page-1.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-2.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-3.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-4.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-5.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-6.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-7.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-8.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-9.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-10.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-11.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-12.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
class: transition
# (demo)

---
class: refresher
# recap

.pull-left[
- R + RStudio
- Functions are ___
- Columns in data frames are accessed with ___ ?
- Packages are installed with ___ ?
- Packages are loaded with ___ ?
]

.pull-right[
- Why do we care about Reproducibility?
- Output + input of rmarkdown
- I have an assignment group
- I have made contact with my assignment group
]

---
# Style guide

> "Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." -- Hadley Wickham

- Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/
- There's more to it than we'll cover today. We'll mention more as we introduce more functionality, and do a recap later in the semester

---
# File names and code chunk labels

- Do not use spaces in file names, use `-` or `_` to separate words
- Use all lowercase letters

```r
# Good
ucb-admit.csv

# Bad
UCB Admit.csv
```

---
# Object names

- Use `_` to separate words in object names
- Use informative but short object names
- Do not reuse object names within an analysis

```r
# Good
acs_employed

# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males
```

---
# Spacing

- Put a space before and after all infix operators (=, +, -, <-, etc.), and when naming arguments in function calls. 
- Always put a space after a comma, and never before (just like in regular English).

```r
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```

---
# ggplot

- When a plot spans several lines, end each line with `+` (never start a line with it)
- Indent every line after the first

```r
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()

# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()
```

---
# Long lines

- Limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font.
- Take advantage of RStudio editor's auto formatting for indentation at line breaks.
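
As a rough sketch of what this looks like in practice: when a call would run past 80 characters, break after the opening parenthesis and give each argument its own line. The data frame `heights` and its columns are made up for illustration.

```r
# Hypothetical data, made up for this example
heights <- data.frame(feet = c(5, 6, NA), inches = c(10, 1, 3))

# Bad: everything on one line, running past 80 characters
average_height <- mean(heights$feet * 12 + heights$inches, trim = 0, na.rm = TRUE)

# Good: one argument per line, each line well under 80 characters
average_height <- mean(
  heights$feet * 12 + heights$inches,
  trim = 0,
  na.rm = TRUE
)
```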

---
# Assignment

- Use `<-` not `=`

```r
# Good
x <- 2

# Bad
x = 2
```

---
# Quotes

Use `"`, not `'`, for quoting text. The only exception is when the text already contains double quotes and no single quotes.

```r
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  # Good
  labs(title = "`Shine bright like a diamond`",
  # Good
       x = "Diamond prices",
  # Bad
       y = 'Frequency')
```
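
The exception is easiest to see in a small example. This is only an illustration; the strings are made up.

```r
# Good: double quotes by default
greeting <- "Hello, world"

# Good: single quotes, because the text itself contains double quotes
lyric <- 'She said "shine bright like a diamond"'

# Also works, but the escaped quotes are harder to read
lyric_escaped <- "She said \"shine bright like a diamond\""

# Both strings hold exactly the same characters
identical(lyric, lyric_escaped)
## [1] TRUE
```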

---
background-image: url(images/allison-horst-dplyr-wrangling.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

.black.large[
Source: Artwork by @allison_horst
]

---
# Overview

.pull-left[
- `filter()`
- `select()`
- `mutate()`
- `arrange()`

]

.pull-right[
- `group_by()`
- `summarise()`
- `count()`
]

---
background-image: url(images/allison-horst-tidyverse-celestial.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, bg-black

.left.white.large[
Artwork by @allison_horst
]

---
class: transition
# R Packages

```r
avail_pkg <- available.packages(contriburl = contrib.url("https://cran.rstudio.com"))
dim(avail_pkg)
## [1] 15367    17
```

As of 2020-03-18, there were 15,367 R packages available on CRAN.

---

# Name clashes

```r
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0.9000     ✓ purrr   0.3.3     
## ✓ tibble  2.1.3          ✓ dplyr   0.8.5     
## ✓ tidyr   1.0.2          ✓ stringr 1.4.0     
## ✓ readr   1.3.1          ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
```
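
The conflicts listed above mean that, once dplyr is loaded, a bare `filter()` call refers to `dplyr::filter()` rather than `stats::filter()`. A minimal sketch of being explicit with the `package::function()` prefix, using a small made-up tibble (`scores` is not part of the course data):

```r
library(dplyr)

# Hypothetical data for illustration only
scores <- tibble(student = c("a", "b", "c"), mark = c(45, 72, 88))

# dplyr::filter() keeps the rows where the condition is TRUE
passed <- dplyr::filter(scores, mark >= 50)
passed$student
## [1] "b" "c"

# stats::filter() is a different function: it applies a linear filter
# (here a two-point moving average) to a numeric series
smoothed <- stats::filter(c(1, 3, 5, 7), filter = rep(1 / 2, 2))
```

Writing the prefix costs a few extra characters, but it removes any doubt about which function runs.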

---
# Many R packages

- A blessing & a curse! 
- So many packages available, it can make it hard to choose!
- Many of the packages are designed to solve a specific problem
- The tidyverse is designed to work with many other packages following a consistent philosophy
- What this means is that you shouldn't notice it!

???

Extra reading:

We have been loading the `tidyverse` package. It's actually a suite of packages, and you can learn more about the individual packages at https://www.tidyverse.org. You could load each individually.

Because so many people contribute packages to R, it is both a blessing and a curse: the best techniques are available, but there can be conflicts between function names. When you load tidyverse it prints a summary of the conflicts it knows about, between its functions and others.

For example, there is a `filter` function in the `stats` package that comes with the R distribution. This can cause confusion when you want to use the filter function in `dplyr` (part of tidyverse). To be sure the function you use is the one you want to use, you can prefix it with the package name, `dplyr::filter()`.
---
class: transition

# Let's talk about data

---
background-image: url(images/french_fries.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

???

This was an actual experiment in Food Sciences at Iowa State University. The goal was to find out whether some cheaper oils could be used to make hot chips, that is, whether people would be unable to tell chips fried in the new oils from those fried in the current market leader.

Twelve tasters were recruited to sample two chips from each batch, over a period of ten weeks. The same oil was kept for a period of 10 weeks! May be a bit gross by the end!

This data set was brought to R by Hadley Wickham, and was one of the problems that inspired the thinking about tidy data, and the evolution of the `tidyverse` tools.

---

# Example: french fries

- Experiment in Food Sciences at Iowa State University. 
- Aim: find if cheaper oil could be used to make hot chips
- Question: Can people distinguish chips fried in the new oils from those fried in the current market-leader oil?
- 12 tasters recruited 
- Each sampled two chips from each batch
- Over a period of ten weeks.

Same oil kept for a period of 10 weeks! May be a bit gross!

---
# Example: french-fries - pivoting into long form

```r
french_fries <- read_csv("data/french_fries.csv")
french_fries
```

```
## # A tibble: 6 x 9
##    time treatment subject   rep potato buttery grassy rancid painty
##   <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
## 1     1         1       3     1    2.9     0      0      0      5.5
## 2     1         1       3     2   14       0      0      1.1    0  
## 3     1         1      10     1   11       6.4    0      0      0  
## 4     1         1      10     2    9.9     5.9    2.9    2.2    0  
## 5     1         1      15     1    1.2     0.1    0      1.1    5.1
## 6     1         1      15     2    8.8     3      3.6    1.5    2.3
```

This data set was brought to R by Hadley Wickham, and was one of the problems that inspired the thinking about tidy data and the plyr tools.

---
# Example: french-fries - pivoting into long form

```r
fries_long <- french_fries %>% 
  pivot_longer(cols = potato:painty,
               names_to = "type", 
               values_to = "rating") %>%
  mutate(type = as.factor(type))
fries_long
## # A tibble: 3,480 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <fct>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 3,470 more rows
```
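
---
# `pivot_longer()`: a small example

To see what `pivot_longer()` does on a scale small enough to check by hand, here is a sketch with a made-up two-row table (`tiny` and its values are invented for illustration). Each of the 2 rows has 2 rating columns, so the long form has 2 x 2 = 4 rows. The same arithmetic explains the french fries result: 696 rows x 5 rating columns = 3,480 rows.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per rating type
tiny <- tibble(
  subject = c(3, 10),
  potato = c(2.9, 11),
  buttery = c(0, 6.4)
)

# Stack the rating columns: names go to `type`, values go to `rating`
tiny_long <- pivot_longer(
  tiny,
  cols = potato:buttery,
  names_to = "type",
  values_to = "rating"
)

nrow(tiny_long)
## [1] 4
```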

---
class: transition
# `filter()`

choose observations from your data

---
# `filter()`: example

```r
fries_long %>%
  filter(subject == 10)
## # A tibble: 300 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <fct>    <dbl>
##  1     1         1      10     1 potato    11  
##  2     1         1      10     1 buttery    6.4
##  3     1         1      10     1 grassy     0  
##  4     1         1      10     1 rancid     0  
##  5     1         1      10     1 painty     0  
##  6     1         1      10     2 potato     9.9
##  7     1         1      10     2 buttery    5.9
##  8     1         1      10     2 grassy     2.9
##  9     1         1      10     2 rancid     2.2
## 10     1         1      10     2 painty     0  
## # … with 290 more rows
```

---
# `filter()`: details

Filtering requires comparisons to find the subset of observations of interest. What do you think the following mean?

- `subject != 10` 
- `x > 10` 
- `x >= 10` 
- `class %in% c("A", "B")` 
- `!is.na(y)`

---
# `filter()`: details

`subject != 10`

finds all rows for every subject except subject 10

`x > 10`

finds all rows where variable `x` is greater than 10

`x >= 10`

finds all rows where variable `x` is greater than or equal to 10

`class %in% c("A", "B")`

finds all rows where variable `class` is either A or B

`!is.na(y)`

finds all rows that *DO NOT* have a missing value for variable `y`
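
---
# `filter()`: details

Each of these conditions evaluates to a vector of `TRUE`/`FALSE` values, and `filter()` keeps the rows where the result is `TRUE` (rows where the condition is `NA` are dropped). A quick way to check your reading of a condition is to run it on a small vector; the values below are made up for illustration.

```r
# Hypothetical values for illustration
x <- c(5, 10, 15, NA)
class <- c("A", "B", "C", "A")

x > 10
## [1] FALSE FALSE  TRUE    NA
x >= 10
## [1] FALSE  TRUE  TRUE    NA
class %in% c("A", "B")
## [1]  TRUE  TRUE FALSE  TRUE
!is.na(x)
## [1]  TRUE  TRUE  TRUE FALSE
```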

---
# Your turn: open french-fries.Rmd

Filter the french fries data to have:

- only week 1
- oil type 1 (oil type is called treatment)
- oil types 1 and 3 but not 2
- weeks 1-4 only

---
# French Fries Filter: only week 1

```r
fries_long %>% filter(time == 1)
## # A tibble: 360 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <fct>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 350 more rows
```

---
# French Fries Filter: oil type 1

```r
fries_long %>% filter(treatment == 1)
## # A tibble: 1,160 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <fct>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 1,150 more rows
```

---
# French Fries Filter: oil types 1 and 3 but not 2

```r
fries_long %>% filter(treatment != 2)
## # A tibble: 2,320 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <fct>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 2,310 more rows
```

---
# French Fries Filter: weeks 1-4 only

```r
fries_long %>% filter(time %in% 1:4)
## # A tibble: 1,440 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <fct>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 1,430 more rows
```

---
class: transition

# About `%in%`

[demo]
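
---
# `%in%`: example

A sketch of what `%in%` does, using a made-up vector: `x %in% table` asks, for each element of `x`, whether it appears anywhere in `table`. This differs from `==`, which compares element by element and recycles the shorter vector.

```r
# Hypothetical week numbers for illustration
week <- c(1, 2, 2, 1)

# For each week, is it either 1 or 2?
week %in% c(1, 2)
## [1] TRUE TRUE TRUE TRUE

# Not the same question: `==` lines the vectors up element by element,
# recycling c(1, 2) as 1, 2, 1, 2
week == c(1, 2)
## [1]  TRUE  TRUE FALSE FALSE

# `%in%` returns FALSE, not NA, for a missing value that is not in the table
c(1, NA) %in% 1:4
## [1]  TRUE FALSE
```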

---
# `select()`

- Chooses which variables to keep in the data set. 
- Useful when there are many variables but you only need some of them for an analysis.

---
# `select()`: a comma separated list of variables, by name.

```r
french_fries %>% 
  select(time, 
         treatment, 
         subject)
## # A tibble: 696 x 3
##     time treatment subject
##    <dbl>     <dbl>   <dbl>
##  1     1         1       3
##  2     1         1       3
##  3     1         1      10
##  4     1         1      10
##  5     1         1      15
##  6     1         1      15
##  7     1         1      16
##  8     1         1      16
##  9     1         1      19
## 10     1         1      19
## # … with 686 more rows
```

---
# `select()`: **drop** selected variables by prefixing with `-`

```r
french_fries %>% 
  select(-time, 
         -treatment, 
         -subject)
## # A tibble: 696 x 6
##      rep potato buttery grassy rancid painty
##    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1    2.9     0      0      0      5.5
##  2     2   14       0      0      1.1    0  
##  3     1   11       6.4    0      0      0  
##  4     2    9.9     5.9    2.9    2.2    0  
##  5     1    1.2     0.1    0      1.1    5.1
##  6     2    8.8     3      3.6    1.5    2.3
##  7     1    9       2.6    0.4    0.1    0.2
##  8     2    8.2     4.4    0.3    1.4    4  
##  9     1    7       3.2    0      4.9    3.2
## 10     2   13       0      3.1    4.3   10.3
## # … with 686 more rows
```

---
# `select()`

.left-code[
Inside `select()` you can use text-matching of the names like `starts_with()`, `ends_with()`, `contains()`, `matches()`, or `everything()`
]

.right-plot[

```r
french_fries %>% 
  select(contains("e"))
## # A tibble: 696 x 5
##     time treatment subject   rep buttery
##    <dbl>     <dbl>   <dbl> <dbl>   <dbl>
##  1     1         1       3     1     0  
##  2     1         1       3     2     0  
##  3     1         1      10     1     6.4
##  4     1         1      10     2     5.9
##  5     1         1      15     1     0.1
##  6     1         1      15     2     3  
##  7     1         1      16     1     2.6
##  8     1         1      16     2     4.4
##  9     1         1      19     1     3.2
## 10     1         1      19     2     0  
## # … with 686 more rows
```
]
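
---
# `select()`: more helpers

The other helpers work the same way as `contains()`. A small sketch on a made-up tibble (`tb_tiny` and its column names are hypothetical, chosen so that several columns share a prefix):

```r
library(dplyr)

# Hypothetical data for illustration only
tb_tiny <- tibble(
  country = c("A", "B"),
  year = c(2011, 2012),
  new_sp_m04 = c(1, 2),
  new_sp_f04 = c(3, 4)
)

# Columns whose names start with "new_sp_"
tb_tiny %>% select(starts_with("new_sp_")) %>% names()
## [1] "new_sp_m04" "new_sp_f04"

# Columns whose names match a regular expression (here: ending in "04")
tb_tiny %>% select(matches("04$")) %>% names()
## [1] "new_sp_m04" "new_sp_f04"

# Move year to the front, then keep everything else
tb_tiny %>% select(year, everything()) %>% names()
## [1] "year"       "country"    "new_sp_m04" "new_sp_f04"
```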

---
# `select()`: Using it

.left-code[
You can use the colon, `:`, to choose variables in order of the columns
]

.right-plot[

```r
french_fries %>% 
  select(time:subject)
## # A tibble: 696 x 3
##     time treatment subject
##    <dbl>     <dbl>   <dbl>
##  1     1         1       3
##  2     1         1       3
##  3     1         1      10
##  4     1         1      10
##  5     1         1      15
##  6     1         1      15
##  7     1         1      16
##  8     1         1      16
##  9     1         1      19
## 10     1         1      19
## # … with 686 more rows
```
]

---
class: transition
# Your turn: back to the french fries data

- `select()` time, treatment and rep
- `select()` subject through to rating
- drop subject

---
background-image: url(images/allison-horst-dplyr-mutate.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

.purple.large.right[
Artwork by @allison_horst
]

---
# `mutate()`: create a new variable; keep existing ones

```r
french_fries 
## # A tibble: 696 x 9
##     time treatment subject   rep potato buttery grassy rancid painty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5
##  2     1         1       3     2   14       0      0      1.1    0  
##  3     1         1      10     1   11       6.4    0      0      0  
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0  
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4  
##  9     1         1      19     1    7       3.2    0      4.9    3.2
## 10     1         1      19     2   13       0      3.1    4.3   10.3
## # … with 686 more rows
```

---
# `mutate()`: create a new variable; keep existing ones

```r
french_fries %>% 
* mutate(rainty = rancid + painty)
## # A tibble: 696 x 10
##     time treatment subject   rep potato buttery grassy rancid painty rainty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5   5.5 
##  2     1         1       3     2   14       0      0      1.1    0     1.1 
##  3     1         1      10     1   11       6.4    0      0      0     0   
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0     2.2 
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1   6.20
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3   3.8 
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2   0.3 
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4     5.4 
##  9     1         1      19     1    7       3.2    0      4.9    3.2   8.1 
## 10     1         1      19     2   13       0      3.1    4.3   10.3  14.6 
## # … with 686 more rows
```

---
class: transition

# Your turn: french fries

Compute a new variable called `lrating` by taking a log of the rating
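
---
# `mutate()`: one possible approach to `lrating`

One possible approach, sketched on a small made-up tibble rather than the real data (`ratings_tiny` is hypothetical). Ratings of 0 appear in the french fries data and `log(0)` is `-Inf`, so it is worth checking for infinite values afterwards. `log1p()`, which computes `log(1 + x)`, is shown as one alternative, not as the intended answer.

```r
library(dplyr)

# Hypothetical ratings, including a zero
ratings_tiny <- tibble(type = c("potato", "buttery"), rating = c(2.9, 0))

ratings_log <- ratings_tiny %>%
  mutate(
    lrating = log(rating),         # natural log; log(0) gives -Inf
    lrating_plus1 = log1p(rating)  # log(1 + rating); finite when rating is 0
  )

# Check for infinite values created by taking the log of zero
is.infinite(ratings_log$lrating)
## [1] FALSE  TRUE
```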

---
# `summarise()`: boil data down to a single row

```r
fries_long
```

```
## # A tibble: 6 x 6
##    time treatment subject   rep type    rating
##   <dbl>     <dbl>   <dbl> <dbl> <fct>    <dbl>
## 1     1         1       3     1 potato     2.9
## 2     1         1       3     1 buttery    0  
## 3     1         1       3     1 grassy     0  
## 4     1         1       3     1 rancid     0  
## 5     1         1       3     1 painty     5.5
## 6     1         1       3     2 potato    14
```

---
# `summarise()`: boil data down to a single row

```r
fries_long %>% 
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 1 x 1
##   rating
##    <dbl>
## 1   3.16
```
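
The `na.rm = TRUE` argument matters because a single missing value makes `mean()` return `NA`. A minimal sketch with made-up ratings (`ratings_na` is hypothetical):

```r
library(dplyr)

# Hypothetical ratings, one of them missing
ratings_na <- tibble(rating = c(2, 4, NA))

rating_summary <- ratings_na %>%
  summarise(
    with_na = mean(rating),                   # NA: one value is missing
    without_na = mean(rating, na.rm = TRUE),  # mean of 2 and 4
    n_missing = sum(is.na(rating))            # number of missing values
  )
rating_summary
```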

---
class: transition
# What if we want a summary for each `type`?

use `group_by()`

---
# Using `summarise()` + `group_by()`

Produce summaries for every group:

```r
fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 5 x 2
##   type    rating
##   <fct>    <dbl>
## 1 buttery  1.82 
## 2 grassy   0.664
## 3 painty   2.52 
## 4 potato   6.95 
## 5 rancid   3.85
```

---
class: transition
# Your turn: Back to french-fries.Rmd

- Compute the average rating by subject
- Compute the average rancid rating per week

---
# french fries answers

```r
fries_long %>% 
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 12 x 2
##    subject rating
##      <dbl>  <dbl>
##  1       3   2.46
##  2      10   4.24
##  3      15   2.16
##  4      16   3.00
##  5      19   4.54
##  6      31   4.00
##  7      51   4.39
##  8      52   2.72
##  9      63   3.48
## 10      78   1.94
## 11      79   1.94
## 12      86   2.94
```

---
# french fries answers

```r
fries_long %>% 
  filter(type == "rancid") %>%
  group_by(time) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 10 x 2
##     time rating
##    <dbl>  <dbl>
##  1     1   2.36
##  2     2   2.85
##  3     3   3.72
##  4     4   3.60
##  5     5   3.53
##  6     6   4.08
##  7     7   3.89
##  8     8   4.27
##  9     9   4.67
## 10    10   6.07
```

---
# `arrange()`: orders data by a given variable.

Useful for display of results (but there are other uses!)

```r
fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 5 x 2
##   type    rating
##   <fct>    <dbl>
## 1 buttery  1.82 
## 2 grassy   0.664
## 3 painty   2.52 
## 4 potato   6.95 
## 5 rancid   3.85
```

---
# `arrange()`

```r
fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(rating)
## # A tibble: 5 x 2
##   type    rating
##   <fct>    <dbl>
## 1 grassy   0.664
## 2 buttery  1.82 
## 3 painty   2.52 
## 4 rancid   3.85 
## 5 potato   6.95
```

---
class: transition
# Your turn: french-fries.Rmd - arrange

- Arrange the average rating by type in decreasing order
- Arrange the average subject rating in order lowest to highest.

---
# `arrange()` answers

```r
fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(rating))
## # A tibble: 5 x 2
##   type    rating
##   <fct>    <dbl>
## 1 potato   6.95 
## 2 rancid   3.85 
## 3 painty   2.52 
## 4 buttery  1.82 
## 5 grassy   0.664
```

---
# `arrange()` answers

```r
fries_long %>% 
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(rating)
## # A tibble: 12 x 2
##    subject rating
##      <dbl>  <dbl>
##  1      78   1.94
##  2      79   1.94
##  3      15   2.16
##  4       3   2.46
##  5      52   2.72
##  6      86   2.94
##  7      16   3.00
##  8      63   3.48
##  9      31   4.00
## 10      10   4.24
## 11      51   4.39
## 12      19   4.54
```

---
# `count()` the number of things in a given column

```r
fries_long %>% 
  count(type, sort = TRUE)
## # A tibble: 5 x 2
##   type        n
##   <fct>   <int>
## 1 buttery   696
## 2 grassy    696
## 3 painty    696
## 4 potato    696
## 5 rancid    696
```

---
class: transition left
# Your turn: `count()`

- count the number of subjects
- count the number of types

---
class: transition
# French Fries: Putting it together to problem solve

---
# French Fries: Are ratings similar?

.pull-left[

```r
fries_long %>% 
  group_by(type) %>%
  summarise(
    m = mean(rating, 
             na.rm = TRUE), 
    sd = sd(rating, 
            na.rm = TRUE)) %>%
  arrange(desc(m))
## # A tibble: 5 x 3
##   type        m    sd
##   <fct>   <dbl> <dbl>
## 1 potato  6.95   3.58
## 2 rancid  3.85   3.78
## 3 painty  2.52   3.39
## 4 buttery 1.82   2.41
## 5 grassy  0.664  1.32
```
]

.pull-right[

The scales of the ratings are quite different. Mostly the chips are rated highly on potato'y, but low on grassy.

]

---
# French Fries: Are ratings similar?

```r
ggplot(fries_long,
       aes(x = type, 
           y = rating)) +
  geom_boxplot()
```

---
# French Fries: Are reps like each other?

```r
fries_spread <- fries_long %>% 
  pivot_wider(names_from = rep, 
              values_from = rating)
  
fries_spread
## # A tibble: 1,740 x 6
##     time treatment subject type      `1`   `2`
##    <dbl>     <dbl>   <dbl> <fct>   <dbl> <dbl>
##  1     1         1       3 potato    2.9  14  
##  2     1         1       3 buttery   0     0  
##  3     1         1       3 grassy    0     0  
##  4     1         1       3 rancid    0     1.1
##  5     1         1       3 painty    5.5   0  
##  6     1         1      10 potato   11     9.9
##  7     1         1      10 buttery   6.4   5.9
##  8     1         1      10 grassy    0     2.9
##  9     1         1      10 rancid    0     2.2
## 10     1         1      10 painty    0     0  
## # … with 1,730 more rows
```

---
# French Fries: Are reps like each other?

```r
summarise(fries_spread,
          r = cor(`1`, `2`, use = "complete.obs"))
## # A tibble: 1 x 1
##       r
##   <dbl>
## 1 0.668
```
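
`use = "complete.obs"` tells `cor()` to drop any pair where either value is missing before computing the correlation. A small sketch with made-up replicate values (`rep_1` and `rep_2` are hypothetical):

```r
# Hypothetical replicate ratings, one of them missing
rep_1 <- c(1, 2, 3, 4, NA)
rep_2 <- c(2, 4, 6, 9, 10)

# The default (use = "everything") returns NA when any value is missing
cor(rep_1, rep_2)
## [1] NA

# Drop the incomplete pair by hand, then compare with use = "complete.obs"
complete <- !is.na(rep_1) & !is.na(rep_2)
r_by_hand <- cor(rep_1[complete], rep_2[complete])
r_complete <- cor(rep_1, rep_2, use = "complete.obs")
all.equal(r_by_hand, r_complete)
## [1] TRUE
```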

---
# French Fries: Are reps like each other?

```r
ggplot(fries_spread,
       aes(x = `1`,
           y = `2`)) +
  geom_point() + 
  labs(title = "Data is poor quality: the replicates do not look like each other!")
```

---
# French Fries: Replicates by rating type

```r
fries_spread %>%
  group_by(type) %>%
  summarise(r = cor(x = `1`, 
                    y = `2`, 
                    use = "complete.obs"))
## # A tibble: 5 x 2
##   type        r
##   <fct>   <dbl>
## 1 buttery 0.650
## 2 grassy  0.239
## 3 painty  0.479
## 4 potato  0.616
## 5 rancid  0.391
```

---

# French Fries: Replicates by rating type

```r
ggplot(fries_spread, aes(x = `1`, y = `2`)) +
  geom_point() +
  facet_wrap(~type, ncol = 5)
```

Potato'y and buttery have better replication than the other scales, but there is still a lot of variation from rep 1 to 2.

---
<iframe width="1040" height="650" src="https://www.youtube.com/embed/i4RGqzaNEtg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
# Lab exercise: Exploring PISA data

Open `pisa.Rmd` by downloading today's exercises.

---
class: transition
# Lab Quiz

Time to take the lab quiz.

???

# `select()` example

```r
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
  select(country, year, starts_with("new_sp_")) 
tb %>% top_n(20)
## # A tibble: 22 x 22
##    country  year new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 new_sp_m2534
##    <chr>   <dbl>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl>
##  1 Argent…  2008         11          58          69          633          611
##  2 Argent…  2009          8          36          44          546          483
##  3 Argent…  2011         50          93         143          664          657
##  4 Argent…  2012          8          51          59          533          484
##  5 Brazil   2010        130         168         298         4405         6381
##  6 Brazil   2012        112         165         277         5027         6811
##  7 Centra…  2010         23          55          78          379          633
##  8 Centra…  2011         14          56          70          362          576
##  9 Guinea…  2012          1           6           7          145          262
## 10 Italy    2005          7           1           8           93          191
## # … with 12 more rows, and 15 more variables: new_sp_m3544 <dbl>,
## #   new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>,
## #   new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sp_fu <dbl>
```

---

Source: A drawing made by Allison Horst [@allison_horst](https://twitter.com/allison_horst?lang=en)

# Learning is where you:

1. Receive information accurately
2. Remember the information (long term memory)
3. In such a way that you can reapply the information when appropriate

# Your Turn:

Go to the data source at this link: [bit.ly/dmac-noaa-data](https://bit.ly/dmac-noaa-data)

- "Which is the best description of the temperature units?"
- "What is the best description of the precipitation units?"
- "What does -9999 mean?"

???

"Which is the best description of the temperature units?"

- degrees Fahrenheit (F)
- kelvin (K)
- "degrees C x10"

"What is the best description of the precipitation units?"

- "mm x10"
- inches

"What does -9999 mean?"

- it was really cold
- the keyboard got stuck
- "the value was missing"
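
Once the units and the missing-value code are known, the cleaning step can be sketched as below. This is an illustration on made-up values, not the NOAA file itself: the tibble `weather` and its columns `tmax` and `prcp` are hypothetical, and the sketch assumes `tmax` is stored in tenths of a degree Celsius and `prcp` in tenths of a millimetre, with -9999 marking missing values.

```r
library(dplyr)

# Hypothetical readings; -9999 is assumed to be the missing-value code
weather <- tibble(
  tmax = c(250, -9999, 187),
  prcp = c(0, 12, -9999)
)

weather_clean <- weather %>%
  mutate(
    # Replace the missing-value code with NA before rescaling
    tmax = na_if(tmax, -9999),
    prcp = na_if(prcp, -9999),
    # Convert tenths of a unit to degrees C and mm
    tmax_c = tmax / 10,
    prcp_mm = prcp / 10
  )
weather_clean
```

Replacing the code before rescaling matters: dividing first would turn -9999 into -999.9, which no longer matches the documented missing-value code.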