<!-- background-color: #006DAE -->
<!-- class: middle center hide-slide-number -->

<div class="shade_black" style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;">
<i class="fas fa-exclamation-circle"></i> These slides are best viewed in Chrome and occasionally need to be refreshed if elements did not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>.
</div>

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number split-70 title-slide
count: false

.column.shade_black[.content[

<br>

# .monash-blue.outline-text[ETC5510: Introduction to Data Analysis]

<h2 class="monash-blue2 outline-text" style="font-size: 30pt!important;">Week 3, part A</h2>

<br>

<h2 style="font-weight:900!important;">Data Visualisation</h2>

.bottom_abs.width100[

Lecturer: *Nicholas Tierney & Stuart Lee*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope faa-float animated "></i>
ETC5510.Clayton-x@monash.edu

23rd March 2020

<br>
]

]]

<div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);">
<div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;">
<svg class="clip-svg absolute">
<defs>
<clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox">
<polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" />
</clipPath>
</defs>
</svg>
</div>
</div>

---

# Understanding learning

- Growth and fixed mindsets
- Reframe success + failure as opportunities for growth
- Growing area of research by [Carol Dweck of Stanford](https://www.youtube.com/watch?v=hiiEeMN7vbQ)

---

# Reframing

.pull-left[
# From

> "I'll never understand"

> "I just don't get programming"

> "I'm not a maths person"
]

--

.pull-right[
# To

> "I understand more than I did yesterday"

> "I can learn how to program"

> "Compared to last week, I've learnt quite a bit!"
]

---
class: transition

# Overview for today

- Going from tidy data to a data plot, using a grammar
- Mapping of variables from the data to graphical elements
- Using different geoms

---

# Example: Tuberculosis data

.left-code[
The case notifications table from [WHO](http://www.who.int/tb/country/data/download/en/).

Data is tidied here, with only counts for Australia.
]
.right-plot[

```r
tb_au
## # A tibble: 192 x 6
##    country   iso3   year count gender age  
##    <chr>     <chr> <dbl> <dbl> <chr>  <chr>
##  1 Australia AUS    1997     8 m      15-24
##  2 Australia AUS    1998    11 m      15-24
##  3 Australia AUS    1999    13 m      15-24
##  4 Australia AUS    2000    16 m      15-24
##  5 Australia AUS    2001    23 m      15-24
##  6 Australia AUS    2002    15 m      15-24
##  7 Australia AUS    2003    14 m      15-24
##  8 Australia AUS    2004    18 m      15-24
##  9 Australia AUS    2005    32 m      15-24
## 10 Australia AUS    2006    33 m      15-24
## # … with 182 more rows
```

]

---

# The "100% charts"

```r
ggplot(tb_au,
       aes(x = year, y = count, fill = gender)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(~ age) +
  scale_fill_brewer(palette="Dark2")
```

<img src="lecture_3a_files/figure-html/show-100-pct-1.png" width="100%" style="display: block; margin: auto;" />

???

"100% charts" is what Excel names these beasts. What do we learn?

---
class: transition

# Let's unpack a bit.

---

# Data Visualisation

.blockquote[
"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey
]

---

# Data Visualisation

- The creation and study of the visual representation of data.
- Many tools for visualizing data (R is one of them)
- Many approaches/systems within R for making data visualizations (**ggplot2** is one of them, and that's what we're going to use).
---

# ggplot2 `\(\in\)` tidyverse

.left-code[
<img src="images/ggplot2-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
]

.right-plot[
- **ggplot2** is tidyverse's data visualization package
- The `gg` in "ggplot2" stands for Grammar of Graphics
- It is inspired by the book **Grammar of Graphics** by Leland Wilkinson <sup>†</sup>
- A grammar of graphics is a tool that enables us to concisely describe the components of a graphic
- (Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html))
]

---

background-image: url(images/grammar-of-graphics.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

[From BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)

---

# Our first ggplot!

.left-code[

```r
library(ggplot2)
ggplot(tb_au)
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-1-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Our first ggplot!

.left-code[

```r
library(ggplot2)
ggplot(tb_au,
*      aes(x = year,
*          y = count))
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-2-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Our first ggplot!

.left-code[

```r
library(ggplot2)
ggplot(tb_au,
       aes(x = year,
           y = count)) +
* geom_point()
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-3-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Our first ggplot! (what's the data again?)

<table>
<thead>
<tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> iso3 </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> count </th> <th style="text-align:left;"> gender </th> <th style="text-align:left;"> age </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 1998 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2001 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2003 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2004 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2005 </td> <td style="text-align:right;"> 32 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2006 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15-24 </td> </tr>
</tbody>
</table>

---

# Our first ggplot!

.left-code[

```r
library(ggplot2)
ggplot(tb_au,
       aes(x = year,
           y = count)) +
* geom_col()
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-4-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Our first ggplot!

.left-code[

```r
library(ggplot2)
ggplot(tb_au,
       aes(x = year,
           y = count,
*          fill = gender)) +
  geom_col()
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-5-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Our first ggplot!

.left-code[

```r
library(ggplot2)
ggplot(tb_au,
       aes(x = year,
           y = count,
           fill = gender)) +
* geom_col(position = "fill")
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-6-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Our first ggplot!

.left-code[

```r
library(ggplot2)
ggplot(tb_au,
       aes(x = year,
           y = count,
           fill = gender)) +
  geom_col(position = "fill") +
* scale_fill_brewer(
*   palette = "Dark2"
* )
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-7-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Our first ggplot!

.left-code[

```r
library(ggplot2)
ggplot(tb_au,
       aes(x = year,
           y = count,
           fill = gender)) +
  geom_col(position = "fill") +
  scale_fill_brewer(
    palette = "Dark2"
  ) +
* facet_wrap(~ age)
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-8-out-1.png" width="100%" style="display: block; margin: auto;" />
]

???

- First argument provided is the name of the data, `tb_au`
- Variable mapping: year is mapped to x, count is mapped to y, gender is mapped to fill colour, and age is used to subset the data and make separate plots
- The column geom is used, `geom_col`
- We are mostly interested in proportions between gender, over years, separately by age. The `position = "fill"` option in `geom_col` sets the heights of the bars to be all at 100%. It ignores counts, and emphasizes the proportion of males and females.

---

# The "100% charts"

```r
ggplot(tb_au,
       aes(x = year, y = count, fill = gender)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(~ age) +
  scale_fill_brewer(palette="Dark2")
```

<img src="lecture_3a_files/figure-html/gg-show-100pct-1.png" width="576" style="display: block; margin: auto;" />

--

What do we learn?

???

"100% charts" is what Excel names these beasts. What do we learn?

---
class: transition left

# What do we learn?

- Focus is on **proportion** in each category.
- Across (almost) all ages and years, males make up a higher proportion of TB cases than females.
- The male proportion tends to be higher in the older age groups, for all years.
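---

# What `position = "fill"` computes

A minimal sketch of the rescaling behind a 100% chart, assuming **dplyr** and **ggplot2** are installed. The `toy` data and its counts are hypothetical, made up for illustration; they are not the WHO figures. Within each bar (here, each year), every count is divided by the bar's total, so the segments of a bar sum to 1.

```r
library(dplyr)
library(ggplot2)

# Hypothetical counts, made up for illustration; NOT the WHO figures.
toy <- tibble(
  year   = c(2005, 2005, 2006, 2006),
  gender = c("m", "f", "m", "f"),
  count  = c(30, 10, 24, 16)
)

# Divide each count by the total for its year,
# which is the rescaling that position = "fill" applies to each bar.
toy_prop <- toy %>%
  group_by(year) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

toy_prop

# Plotting the pre-computed proportions with a plain stacked geom_col()
p_prop <- ggplot(toy_prop, aes(x = year, y = prop, fill = gender)) +
  geom_col()
```

Drawing `p_prop` should give the same picture as plotting `count` with `geom_col(position = "fill")`: the bar heights are all 1, and only the split between genders varies.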
---

# Code structure of ggplot

- `ggplot()` is the main function
- Plots are constructed in layers
- Structure of code for plots can often be summarised as

```r
ggplot(data = [dataset],
       mapping = aes(x = [x-variable],
                     y = [y-variable])) +
   geom_xxx() +
   other options
```

---

# How to use ggplot

- To use ggplot2 functions, first load tidyverse

```r
library(tidyverse)
```

- For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)

---
class: transition

# Let's look at some more options to emphasise different features

---

.left-code[

```r
ggplot(tb_au,
       aes(x = year,
           y = count,
           fill = gender)) +
  geom_col(position = "fill") +
  scale_fill_brewer(
    palette = "Dark2"
  ) +
* facet_wrap(~ age)
```
]

.right-plot[
<img src="lecture_3a_files/figure-html/first-gg-9-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Emphasizing different features with ggplot2

```r
ggplot(tb_au,
       aes(x = year,
           y = count,
           fill = gender)) +
  geom_col(position = "fill") +
  scale_fill_brewer( palette = "Dark2") +
* facet_grid(~ age)
```

<img src="lecture_3a_files/figure-html/first-gg-10-1.png" width="100%" style="display: block; margin: auto;" />

---

# Emphasise ... ?

```r
ggplot(tb_au,
       aes(x = year,
           y = count,
           fill = gender)) +
* geom_col() +
  scale_fill_brewer( palette = "Dark2") +
  facet_grid(~ age)
```

<img src="lecture_3a_files/figure-html/first-gg-11-1.png" width="100%" style="display: block; margin: auto;" />

---

# What do we learn?

- `, position = "fill"` was removed
- Focus is on **counts** in each category.
- Counts differ across ages and years, and tend to be lower in middle age (45-64)
- 1999 saw a bit of an outbreak in most age groups, with numbers double or triple those of other years.
- Incidence has been increasing among younger age groups in recent years.

---

# Emphasise ... ?

```r
ggplot(tb_au,
       aes(x = year,
           y = count,
           fill = gender)) +
* geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  facet_grid(~ age)
```

<img src="lecture_3a_files/figure-html/gg-side-by-side-1.png" width="100%" style="display: block; margin: auto;" />

---

# What do we learn?

- `, position="dodge"` is used in `geom_col`
- Focus is on **counts by gender**, predominantly male incidence.
- Incidence among males is higher relative to females from middle age on.
- There is similar incidence between males and females in younger age groups.

---

# Separate bar charts

```r
ggplot(tb_au,
       aes(x = year,
           y = count,
           fill = gender)) +
  geom_col() +
  scale_fill_brewer(palette = "Dark2") +
* facet_grid(gender ~ age)
```

<img src="lecture_3a_files/figure-html/gg-separate-1.png" width="576" style="display: block; margin: auto;" />

---
class: transition

# What do we learn?

- `facet_grid(gender ~ age) +` faceted by gender as well as age
- note `facet_grid` vs `facet_wrap`
- Easier to focus separately on males and females.
- 1999 outbreak mostly affected males.
- Growing incidence in the 25-34 age group is still affecting females but seems to have stabilised for males.

---

# ~~Pie charts?~~ Rose Charts

```r
ggplot(tb_au,
       aes(x = year, y = count, fill = gender)) +
  geom_col() +
  scale_fill_brewer(palette="Dark2") +
  facet_grid(gender ~ age) +
* coord_polar() +
  theme(axis.text = element_blank())
```

<img src="lecture_3a_files/figure-html/gg-rose-1.png" width="576" style="display: block; margin: auto;" />

---

# What do we learn?

- Bar charts in polar coordinates produce rose charts.
- `coord_polar() +` plot is made in polar coordinates, rather than the default Cartesian coordinates
- Emphasizes the middle years as low incidence.

---

# Rainbow charts?

```r
ggplot(tb_au,
       aes(x = 1,
           y = count,
*          fill = factor(year))) +
  geom_col(position = "fill") +
  facet_grid(gender ~ age)
```

<img src="lecture_3a_files/figure-html/gg-rainbow-1.png" width="60%" style="display: block; margin: auto;" />

---

# What do we see in the code?

- A single stacked bar, in each facet.
- Notice how the mappings are different: a single number is mapped to x, which makes a single stacked bar in each facet.
- Year is now mapped to fill colour (that's what gives us the rainbow charts!)

---
class: transition

# What do we learn?

- Pretty chart but not easy to interpret.

---

# (Actual) Pie charts

```r
ggplot(tb_au,
       aes(x = 1,
           y = count,
           fill = factor(year))) +
  geom_col(position = "fill") +
* facet_grid(gender ~ age) +
* coord_polar(theta = "y") +
  theme(axis.text = element_blank())
```

<img src="lecture_3a_files/figure-html/gg-pie2-1.png" width="576" style="display: block; margin: auto;" />

---

# What is different in the code?

- `coord_polar(theta = "y")` maps the y variable to the angle in polar coordinates, which gives a pie chart.

---
class: transition

# What do we learn?

- Pretty chart but not easy to interpret, or make comparisons across age groups.

---

# Why?

[The various looks of David Bowie](https://www.wired.com/wp-content/uploads/2016/01/DB-Transformation-Colour.gif)

.left-code[
<img src="https://www.wired.com/wp-content/uploads/2016/01/DB-Transformation-Colour.gif" style="width:50%" />
]

- Using named plots, e.g. pie chart, bar chart, scatterplot, is like seeing animals in the zoo.
- The grammar of graphics allows you to define the mapping between variables in the data and elements of the plot.
- It allows us to see and understand how plots are similar or different.
- And you can see how variations in the definition create variations in the plot.

---
class: transition

# Your Turn:

- Do the lab exercises
- Take the lab quiz
- Use the rest of the lab time to coordinate with your group on the first assignment.
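---

# Recap: one definition, two plots

A minimal sketch of how a small change in the plot definition changes the plot, assuming **ggplot2** is installed. The `toy` data frame is hypothetical, made up for illustration; it is not the WHO data. The stacked bar and the pie chart share the same data, mappings and geom, and differ only in the coordinate system.

```r
library(ggplot2)

# Hypothetical counts, made up for illustration; NOT the WHO figures.
toy <- data.frame(
  year  = c(2004, 2005, 2006),
  count = c(10, 30, 20)
)

# One definition: a single stacked bar, with year mapped to fill.
stacked <- ggplot(toy, aes(x = 1, y = count, fill = factor(year))) +
  geom_col(position = "fill")

# Changing only the coordinate system turns the stacked bar into a pie chart.
pie <- stacked + coord_polar(theta = "y")

stacked
pie
```

Because `stacked` is stored as an object, the pie chart is written as "the same plot, plus polar coordinates", which mirrors how the grammar describes the two charts.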
---

# References

- [Chapter 3 of R for Data Science](https://r4ds.had.co.nz/data-visualisation.html)
- [Data made available from WHO](https://www.who.int/tb/country/data/download/en/)
- [Garrick Aden-Buie's gentle introduction to ggplot2](https://pkg.garrickadenbuie.com/gentle-ggplot2/#1)
- [Mine Çetinkaya-Rundel's introduction to ggplot using Star Wars](https://github.com/rstudio-education/datascience-box/tree/master/slides/u1_d02-data-and-viz)