class: center, middle, inverse, title-slide

# ETC5510: Introduction to Data Analysis
## Week of Tidy Data
### Stuart Lee and Nick Tierney
### 16th Mar 2020

---
class: transition middle

# About your instructors

---

# Stuart

.pull-left[
* Bachelor of Mathematical Sciences at University of Adelaide
* PhD Candidate in Statistics at Monash EBS.
* Research: genomics, data visualisation, statistical computing
* ❤️: board games, cooking, music, reading and video games
]

.pull-right[
<img src="images/stuart.jpeg" width="80%" style="display: block; margin: auto;" />
]

---

# Steph

.pull-left[
* Bachelor of Economics and Bachelor of Commerce from Monash
* Studying a Masters of Statistics at QUT, based at Monash.
* Loves to read, any and all recommendations are welcome.
* Has an R package called [taipan](https://github.com/srkobakian/taipan), and another called [sugarbag](https://github.com/srkobakian/sugarbag).
]

.pull-right[
<img src="images/steff.jpeg" width="80%" style="display: block; margin: auto;" />
]

---

# Sherry

.pull-left[
- Bachelor of Commerce 2018
- Honours in Econometrics 2019 with Di Cook
- Commenced PhD programme 2020
- Created her first ever R package, `quickdraw`
- Loves puzzle games like jigsaws 🧩.
]

.pull-right[
<img src="images/sherry.jpeg" width="80%" style="display: block; margin: auto;" />
]

---

# Nick

.pull-left[
* Bachelor of Psychological Sciences UQ
* PhD in Statistics at QUT.
* Research: missing data, data visualisation, statistical computing
* R 📦: `naniar`, `visdat`
* `#rstats` podcast: Credibly Curious w Saskia Freytag
* ❤️ outdoors, especially: 🥾, 🏃, and 🧗.
]

.pull-right[
<img src="images/njtierney.jpg" width="80%" style="display: block; margin: auto;" />
]

---

# Di

.pull-left[
- Professor at Monash University in Melbourne Australia, doing research in statistics, data science, visualisation, and statistical computing.
- Created the current version of the course
- Likes to play all sorts of sports, tennis, soccer, hockey, cricket, and go boogie boarding.
]

.pull-right[
<img src="images/di.png" width="80%" style="display: block; margin: auto;" />
]

---
class: transition left

# Your Turn: Making the groups

We are going to set up the groups for doing assignment work.

1. Find your name from the list at [this link](https://mida.numbat.space/groups.html)
2. Find the other people in the class with the same group name as you (feel free to wander around the class!)
3. Grab your gear and claim a table to work together at.

---
class: transition left

# Your Turn: Ask your team mates these questions:

1. What is one food you'd never want to taste again?
2. If you were a comic strip character, who would you be and why?

LASTLY, come up with a name for your team (we have provided a suggested name, but you are free to change it!) and tell this to a tutor, along with the names of members of the team.
---

# Traffic Light System

<img src="gifs/help-me-help-you.gif" width="90%" style="display: block; margin: auto;" />

---

# Traffic Light System

.pull-left.middle[
.red[
# Red Post-it
]
* I need a hand
* Slow down
]

--

.pull-right.middle[
.green[
# Green Post-it
]
* I am up to speed
* I have completed the thing
]

---
class: refresher

# Recap

- packages are installed with ___ ?
- packages are loaded with ___ ?
- Why do we care about Reproducibility?
- Output + input of rmarkdown
- I have an assignment group
- If I have an assignment group, I have recorded it in the ED survey

---

# Today: Outline

- An aside on learning
- Tidy Data
- Terminology of data
- Different examples of data
- Steps in making data tidy
- Lots of examples

---

# A note on difficulty

* This is not a programming course - it is a course about **data, modelling, and computing**.

--

- At the moment, you might be sitting there, feeling a bit confused about where we are, what we are doing, what R is, and how it even works.
- That is OK!

--

- The theory of this class will only get you so far
- The real learning happens from doing the data analysis - the **pressure of a deadline can also help.**

---

# Tidy Data

<img src="images/cleaning-data.jpg" width="533" style="display: block; margin: auto;" />

.blockquote[
You're ready to sit down with a newly-obtained dataset, excited about how it will open a world of insight and understanding, and then find you can't use it. You'll first have to spend a significant amount of time to restructure the data to even begin to produce a set of basic descriptive statistics or link it to other data you've been using.
--John Spencer ([Measure Evaluation](https://www.measureevaluation.org/resources/newsroom/blogs/tidy-data-and-how-to-get-it))
]

---

# Tidy Data

<img src="images/cleaning-data.jpg" width="533" style="display: block; margin: auto;" />

.blockquote[
"Tidy data" is a term meant to provide a framework for producing data that conform to standards that make data easier to use. Tidy data may still require some cleaning for analysis, but the job will be much easier.

--John Spencer ([Measure Evaluation](https://www.measureevaluation.org/resources/newsroom/blogs/tidy-data-and-how-to-get-it))
]

---

# Example: US graduate programs

- Data from a study on US grad programs.
- Originally came in an Excel file containing rankings of many different programs.
- Contains information on four programs:
  1. Astronomy
  1. Economics
  1. Entomology, and
  1. Psychology

---

# Example: US graduate programs

```r
library(tidyverse)
grad <- read_csv(here::here("slides/data/graduate-programs.csv"))
grad
## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…       0.9      1.57         31.3          31.7              5.6
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8                5
##  4 econom… BOST…      0.49      2.66         36.9          34.2              5.5
##  5 econom… BRAN…       0.3      3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6                6
##  7 econom… CALI…      0.99      2.31         56.4          83.3                4
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9              5.2
## 10 econom… CLAR…      0.47       0.7         24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>,
## #   AvNumPhDs <dbl>, AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>,
## #   NumStud <dbl>
```

---

# Example: US graduate programs

Good things about the format:

.pull-left[
```
## # A tibble: 6 x 16
##   subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##   <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
## 1 econom… ARIZ…       0.9      1.57         31.3          31.7              5.6
## 2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
## 3 econom… BOST…      0.51      1.03         43.5          46.8                5
## 4 econom… BOST…      0.49      2.66         36.9          34.2              5.5
## 5 econom… BRAN…       0.3      3.03         36.8          48.7             5.29
## 6 econom… BROW…      0.84      2.31         27.1          54.6                6
## # … with 9 more variables: PctMinorityFac <dbl>, PctFemaleFac <dbl>,
## #   PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>, AvGREs <dbl>,
## #   TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>
```
]

.pull-right[
**Rows** contain information about the institution

**Columns** contain types of information, like average number of publications, average number of citations, % completion.
]

---

# Example: US graduate programs

Easy to make summaries:

```r
grad %>% count(subject)
## # A tibble: 4 x 2
##   subject        n
##   <chr>      <int>
## 1 astronomy     32
## 2 economics    117
## 3 entomology    27
## 4 psychology   236
```

---

# Example: US graduate programs

Easy to make summaries:

```r
grad %>%
  filter(subject == "economics") %>%
  summarise(
    mean = mean(NumStud),
    s = sd(NumStud)
  )
## # A tibble: 1 x 2
##    mean     s
##   <dbl> <dbl>
## 1  60.7  39.4
```

---

# Example: US graduate programs

.pull-left[
Easy to make a plot

```r
grad %>%
  filter(subject == "economics") %>%
  ggplot(aes(x = NumStud, y = MedianTimetoDegree)) +
  geom_point() +
  theme(aspect.ratio = 1)
```
]

.pull-right[
<img src="lecture_2a_files/figure-html/gra-dplot-out-1.png" height="100%" style="display: block; margin: auto;" />
]

---
class: transition left

## Your Turn: download exercises for today's lecture!

- Notice the `data/` directory with many datasets!
- Open `graduate-programs.Rmd`
- Answer these questions:
  - "What is the average number of graduate students per economics program?"
  - "What is the best description of the relationship between number of students and median time to degree?"
- Use the traffic light system if you need a hand.
???

- "The average number of graduate students per economics program is:"
- "about 61" (correct)
- about 39

"What is the best description of the relationship between number of students and median time to degree?"

- "as the number of students increases the median time to degree increases, weakly" (correct)
- as the number of students increases the variability in median time to degree decreases

---
class: refresher

.left-code[
What could this image say about R?
]

.right-plot[
<img src="images/tower-of-babel.jpg" width="100%" style="display: block; margin: auto;" />
]

<!-- There can be many ways to achieve the same result. I don't know everything. You might find a better solution than I have given you. Your tutors might give you different ways to do it than I told you. -->

---

# Terminology of data: Variable

- A quantity, quality, or property that you can measure.
- For the grad programs, these would be all the column headers.

```
## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…       0.9      1.57         31.3          31.7              5.6
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8                5
##  4 econom… BOST…      0.49      2.66         36.9          34.2              5.5
##  5 econom… BRAN…       0.3      3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6                6
##  7 econom… CALI…      0.99      2.31         56.4          83.3                4
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9              5.2
## 10 econom… CLAR…      0.47       0.7         24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>,
## #   AvNumPhDs <dbl>, AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>,
## #   NumStud <dbl>
```

---

# Terminology of data: Observation

- A set of measurements made under similar conditions
- Contains several values, each associated with a different variable.
- For the grad programs, the institution and the program together uniquely define the observation.
```
## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…       0.9      1.57         31.3          31.7              5.6
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8                5
##  4 econom… BOST…      0.49      2.66         36.9          34.2              5.5
##  5 econom… BRAN…       0.3      3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6                6
##  7 econom… CALI…      0.99      2.31         56.4          83.3                4
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9              5.2
## 10 econom… CLAR…      0.47       0.7         24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>,
## #   AvNumPhDs <dbl>, AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>,
## #   NumStud <dbl>
```

---

# Terminology of data: Value

- Is the state of a variable when you measure it.
- The value of a variable typically changes from observation to observation.
- For the grad programs, this is the value in each cell

```
## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…       0.9      1.57         31.3          31.7              5.6
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8                5
##  4 econom… BOST…      0.49      2.66         36.9          34.2              5.5
##  5 econom… BRAN…       0.3      3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6                6
##  7 econom… CALI…      0.99      2.31         56.4          83.3                4
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9              5.2
## 10 econom… CLAR…      0.47       0.7         24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>,
## #   AvNumPhDs <dbl>, AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>,
## #   NumStud <dbl>
```

---

# Tidy tabular form

__Tabular data__ is a set of values, each associated with a variable and an observation.

Tabular data is __tidy__ iff (if and only if):

* Each variable is in its own column,
* Each observation is in its own row,
* Each value is placed in its own `cell`.

---
background-image: url(https://imgs.njtierney.com/tidy-data.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, black

---

# The grad program

Is in **tidy** tabular form.

```
## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…       0.9      1.57         31.3          31.7              5.6
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8                5
##  4 econom… BOST…      0.49      2.66         36.9          34.2              5.5
##  5 econom… BRAN…       0.3      3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6                6
##  7 econom… CALI…      0.99      2.31         56.4          83.3                4
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9              5.2
## 10 econom… CLAR…      0.47       0.7         24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>,
## #   AvNumPhDs <dbl>, AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>,
## #   NumStud <dbl>
```

---
class: transition

# Different examples of data

For each of these data examples, **let's try together to identify the variables and the observations** - some are HARD!

---

# Your Turn: Genes experiment

```
## # A tibble: 3 x 12
##   id    `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 Gene…      2.18      2.20      4.20      2.63      5.06       4.54       5.53
## 2 Gene…      1.46     0.585      1.86     0.515      2.88       1.36       2.96
## 3 Gene…      2.03     0.870      3.28     0.533      4.63       2.18       5.56
## # … with 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>,
## #   `WM-12.R2` <dbl>, `WM-12.R4` <dbl>
```
---

# Melbourne weather

```
## # A tibble: 1,593 x 12
##    X1             X2 X3    X4       X5    X9   X13   X17   X21   X25   X29   X33
##    <chr>       <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ASN00086282  1970 07    TMAX    141   124   113   123   148   149   139   153
##  2 ASN00086282  1970 07    TMIN     80    63    36    57    69    47    84    78
##  3 ASN00086282  1970 07    PRCP      3    30     0     0    36     3     0     0
##  4 ASN00086282  1970 08    TMAX    145   128   150   122   109   112   116   142
##  5 ASN00086282  1970 08    TMIN     50    61    75    67    41    51    48    -7
##  6 ASN00086282  1970 08    PRCP      0    66     0    53    13     3     8     0
##  7 ASN00086282  1970 09    TMAX    168   168   162   162   162   150   184   179
##  8 ASN00086282  1970 09    TMIN     19    29    62    81    81    55    73    97
##  9 ASN00086282  1970 09    PRCP      0     0     0     0     3     5     0    38
## 10 ASN00086282  1970 10    TMAX    189   194   204   267   256   228   237   144
## # … with 1,583 more rows
```
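
One possible way to tidy a table shaped like this is sketched below. It is only a sketch on a hand-typed miniature (`weather_toy`, a hypothetical name holding the first three rows and the `X5` and `X9` columns from the output above), and it assumes that each of the `X5`, `X9`, ... columns holds the reading for one day and that `X4` stores variable names rather than values.

```r
library(tidyr)

# Hypothetical miniature of the weather table: first three rows,
# identifier columns X1-X4 plus two of the value columns (X5, X9).
weather_toy <- tibble::tribble(
  ~X1,           ~X2,  ~X3,  ~X4,    ~X5, ~X9,
  "ASN00086282", 1970, "07", "TMAX", 141, 124,
  "ASN00086282", 1970, "07", "TMIN",  80,  63,
  "ASN00086282", 1970, "07", "PRCP",   3,  30
)

weather_tidy <- weather_toy %>%
  # The X5/X9 columns hold values, so stack them into a single `day` variable
  pivot_longer(cols = c(X5, X9), names_to = "day", values_to = "value") %>%
  # TMAX/TMIN/PRCP are variable names stored in a column, so spread them out
  pivot_wider(names_from = X4, values_from = value)

weather_tidy
```

After both steps each row is one station-month-day, and `TMAX`, `TMIN` and `PRCP` each get their own column.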
---

# Tuberculosis notifications data taken from [WHO](http://www.who.int/tb/country/data/download/en/) 🤧

```
## # A tibble: 3,202 x 22
##    country  year new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 new_sp_m2534
##    <chr>   <dbl>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl>
##  1 Afghan…  1997         NA          NA           0           10            6
##  2 Afghan…  1998         NA          NA          30          129          128
##  3 Afghan…  1999         NA          NA           8           55           55
##  4 Afghan…  2000         NA          NA          52          228          183
##  5 Afghan…  2001         NA          NA         129          379          349
##  6 Afghan…  2002         NA          NA          90          476          481
##  7 Afghan…  2003         NA          NA         127          511          436
##  8 Afghan…  2004         NA          NA         139          537          568
##  9 Afghan…  2005         NA          NA         151          606          560
## 10 Afghan…  2006         NA          NA         193          837          791
## # … with 3,192 more rows, and 15 more variables: new_sp_m3544 <dbl>,
## #   new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>,
## #   new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sp_fu <dbl>
```
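
A hedged sketch of how column names like these could be unpacked is given below. It uses a hypothetical two-row miniature (`tb_toy`, with a placeholder country label) carrying the `new_sp_m014` and `new_sp_m1524` counts shown above, and it assumes that the first character after the `new_sp_` prefix encodes sex and that the remainder encodes the age group.

```r
library(tidyr)

# Hypothetical miniature of the TB table; "Country A" is a placeholder label,
# the counts are the first two rows of the two columns shown above.
tb_toy <- tibble::tribble(
  ~country,    ~year, ~new_sp_m014, ~new_sp_m1524,
  "Country A", 1997,  0,            10,
  "Country A", 1998,  30,           129
)

tb_tidy <- tb_toy %>%
  # The new_sp_* columns hold counts, so stack them and drop the shared prefix
  pivot_longer(
    cols = starts_with("new_sp_"),
    names_to = "sex_age",
    names_prefix = "new_sp_",
    values_to = "count"
  ) %>%
  # Split after the first character: "m014" becomes sex "m" and age "014"
  separate(col = sex_age, into = c("sex", "age"), sep = 1)

tb_tidy
```

The result has one row per country, year, sex and age group, with the count in a single column.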
---

# French fries

.pull-left[
- 10 week sensory experiment
- 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?)
- fried in one of 3 different oils, replicated twice.
]

.pull-right[
<img src="images/french_fries.png" width="100%" style="display: block; margin: auto;" />
]

---

# French fries: Variables? Observations?

```
## # A tibble: 696 x 9
##     time treatment subject   rep potato buttery grassy rancid painty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9       0      0      0    5.5
##  2     1         1       3     2     14       0      0    1.1      0
##  3     1         1      10     1     11     6.4      0      0      0
##  4     1         1      10     2    9.9     5.9    2.9    2.2      0
##  5     1         1      15     1    1.2     0.1      0    1.1    5.1
##  6     1         1      15     2    8.8       3    3.6    1.5    2.3
##  7     1         1      16     1      9     2.6    0.4    0.1    0.2
##  8     1         1      16     2    8.2     4.4    0.3    1.4      4
##  9     1         1      19     1      7     3.2      0    4.9    3.2
## 10     1         1      19     2     13       0    3.1    4.3   10.3
## # … with 686 more rows
```

---

# Rude Recliners data

- data is collated from this story: [41% Of Fliers Think You're Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/)
- What are the variables?

```
## # A tibble: 3 x 6
##   V1      `V2:Always` `V2:Usually` `V2:About half t… `V2:Once in a w… `V2:Never`
##   <chr>         <dbl>        <dbl>             <dbl>            <dbl>      <dbl>
## 1 No, no…         124          145                82              116         35
## 2 Yes, s…           9           27                35              129         81
## 3 Yes, v…           3            3                NA               11         54
```

---

# Messy vs tidy

.pull-left[
Messy data is messy in its own way. You can make unique solutions, but then another data set comes along, and you have to again make a unique solution.
]

.pull-right[
Tidy data can be thought of as legos. Once you have this form, you can put it together in so many different ways, to make different analyses.

<img src="images/lego.png" width="100%" style="display: block; margin: auto;" />
]

---

# Data Tidying verbs

- `pivot_longer`: Specify the **names_to** (identifiers) and the **values_to** (measures) to make longer form data.
- `pivot_wider`: Variables split out in columns
- `separate`: Split one column into many

---

# one more time: `pivot_longer`

```r
pivot_longer(<DATA>, <COLS>, <NAMES_TO>, <VALUES_TO>)
```

- **cols** to select are those that represent values, not variables.
- **names_to** is the name of the variable whose values form the column names.
- **values_to** is the name of the variable whose values are spread over the cells.

---

# pivot_longer: example

.pull-left[
```
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```
]

.pull-right[
```r
table4a %>%
  pivot_longer(cols = c("1999", "2000"),
               names_to = "year",
               values_to = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
```
]

---

# Tidying genes data

Tell me what to put in the following?

- **cols** are the columns that represent values, not variables.
- **names_to** is the name of the new variable whose values form the column names.
- **values_to** is the name of the new variable whose values are spread over the cells.
```
## # A tibble: 3 x 12
##   id    `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 Gene…      2.18      2.20      4.20      2.63      5.06       4.54       5.53
## 2 Gene…      1.46     0.585      1.86     0.515      2.88       1.36       2.96
## 3 Gene…      2.03     0.870      3.28     0.533      4.63       2.18       5.56
## # … with 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>,
## #   `WM-12.R2` <dbl>, `WM-12.R4` <dbl>
```

---

# Tidy genes data

.pull-left[
```
## # A tibble: 3 x 12
##   id    `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 Gene…      2.18      2.20      4.20      2.63      5.06       4.54       5.53
## 2 Gene…      1.46     0.585      1.86     0.515      2.88       1.36       2.96
## 3 Gene…      2.03     0.870      3.28     0.533      4.63       2.18       5.56
## # … with 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>,
## #   `WM-12.R2` <dbl>, `WM-12.R4` <dbl>
```
]

```r
genes_long <- genes %>%
  pivot_longer(cols = -id,
               names_to = "variable",
               values_to = "expr")
genes_long
## # A tibble: 33 x 3
##    id     variable  expr
##    <chr>  <chr>    <dbl>
##  1 Gene 1 WI-6.R1   2.18
##  2 Gene 1 WI-6.R2   2.20
##  3 Gene 1 WI-6.R4   4.20
##  4 Gene 1 WM-6.R1   2.63
##  5 Gene 1 WM-6.R2   5.06
##  6 Gene 1 WI-12.R1  4.54
##  7 Gene 1 WI-12.R2  5.53
##  8 Gene 1 WI-12.R4  4.41
##  9 Gene 1 WM-12.R1  3.85
## 10 Gene 1 WM-12.R2  4.18
## # … with 23 more rows
```

---

# Separate columns

.pull-left[
```
## # A tibble: 33 x 3
##    id     variable  expr
##    <chr>  <chr>    <dbl>
##  1 Gene 1 WI-6.R1   2.18
##  2 Gene 1 WI-6.R2   2.20
##  3 Gene 1 WI-6.R4   4.20
##  4 Gene 1 WM-6.R1   2.63
##  5 Gene 1 WM-6.R2   5.06
##  6 Gene 1 WI-12.R1  4.54
##  7 Gene 1 WI-12.R2  5.53
##  8 Gene 1 WI-12.R4  4.41
##  9 Gene 1 WM-12.R1  3.85
## 10 Gene 1 WM-12.R2  4.18
## # … with 23 more rows
```
]

.pull-right[
```r
genes_long %>%
  separate(col = variable,
           into = c("trt", "leftover"), "-")
## # A tibble: 33 x 4
##    id     trt   leftover  expr
##    <chr>  <chr> <chr>    <dbl>
##  1 Gene 1 WI    6.R1      2.18
##  2 Gene 1 WI    6.R2      2.20
##  3 Gene 1 WI    6.R4      4.20
##  4 Gene 1 WM    6.R1      2.63
##  5 Gene 1 WM    6.R2      5.06
##  6 Gene 1 WI    12.R1     4.54
##  7 Gene 1 WI    12.R2     5.53
##  8 Gene 1 WI    12.R4     4.41
##  9 Gene 1 WM    12.R1     3.85
## 10 Gene 1 WM    12.R2     4.18
## # … with 23 more rows
```
]

---

# Separate columns

```r
genes_long_tidy <- genes_long %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.")
genes_long_tidy
## # A tibble: 33 x 5
##    id     trt   time  rep    expr
##    <chr>  <chr> <chr> <chr> <dbl>
##  1 Gene 1 WI    6     R1     2.18
##  2 Gene 1 WI    6     R2     2.20
##  3 Gene 1 WI    6     R4     4.20
##  4 Gene 1 WM    6     R1     2.63
##  5 Gene 1 WM    6     R2     5.06
##  6 Gene 1 WI    12    R1     4.54
##  7 Gene 1 WI    12    R2     5.53
##  8 Gene 1 WI    12    R4     4.41
##  9 Gene 1 WM    12    R1     3.85
## 10 Gene 1 WM    12    R2     4.18
## # … with 23 more rows
```

---
class: transition

# Now let's use `pivot_wider` to examine different aspects

---

# Examine treatments against each other

.pull-left[
```r
genes_long_tidy %>%
  pivot_wider(id_cols = c(id, rep, time),
              names_from = trt,
              values_from = expr) %>%
  ggplot(aes(x=WI, y=WM, colour=id)) +
  geom_point()
```
]

.pull-right[
<img src="lecture_2a_files/figure-html/plot-genes-out-1.png" height="100%" style="display: block; margin: auto;" />

Generally, some negative association within each gene, WM is low if WI is high.
]

---

# Examine replicates against each other

.pull-left[
```r
genes_long_tidy %>%
  pivot_wider(id_cols = c(id, trt, time),
              names_from = rep,
              values_from = expr) %>%
  ggplot(aes(x=R1, y=R4, colour=id)) +
  geom_point() +
  coord_equal()
```
]

.pull-right[
<img src="lecture_2a_files/figure-html/shoe-replicates-out-1.png" height="100%" style="display: block; margin: auto;" />

Roughly, replicate 4 is like replicate 1, e.g. if one is low, the other is low. That's a good thing, that the replicates are fairly similar.
]

---

## Your turn: Demonstrate with koala bilby data (live code)

Here is a little data to practice `pivot_longer`, `pivot_wider` and `separate` on.

- Read over `koala-bilby.Rmd`
- pivot_longer the data into long form, naming the two new variables, `label` and `count`
- Separate the labels into two new variables, `animal`, `state`
- pivot_wider the long form data into wide form, where the columns are the states.
- pivot_wider the long form data into wide form, where the columns are the animals.

---

# Exercise 1: Rude Recliners

- Open `rude-recliners.Rmd`
- This contains data from the article [41% Of Fliers Think You're Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/).
- V1 is the response to question: "Is it rude to recline your seat on a plane?"
- V2 is the response to question: "Do you ever recline your seat when you fly?".

```
## # A tibble: 3 x 6
##   V1      `V2:Always` `V2:Usually` `V2:About half t… `V2:Once in a w… `V2:Never`
##   <chr>         <dbl>        <dbl>             <dbl>            <dbl>      <dbl>
## 1 No, no…         124          145                82              116         35
## 2 Yes, s…           9           27                35              129         81
## 3 Yes, v…           3            3                NA               11         54
```

---

# Exercise 1: Rude Recliners (15 minutes)

Answer the following questions in the rmarkdown document.

- 1A) What are the variables and observations in this data?
- 1B) Put the data in tidy long form (using the names `V2` as the key variable, and `count` as the value).
- 1C) Use the `rename` function to make the variable names a little shorter.

---
class: transition left

# Exercise 1: Answers

---
class: transition left

# Your Turn: Turn to the people next to you and ask 2 questions:

- Are you more of a dog or a cat person?
- What languages do you know how to speak?
---

# Exercise 2: Tuberculosis Incidence data (15 minutes)

Open: `tb-incidence.Rmd`

Tidy the TB incidence data, using the Rmd to prompt questions.

---

# Exercise 3: Currency rates (15 minutes)

- open `currency-rates.Rmd`
- read in `rates.csv`
- Answer the following questions:
  1. What are the variables and observations?
  2. pivot_longer the five currencies, AUD, GBP, JPY, CNY, CAD, make it into tidy long form.
  3. Make line plots of the currencies, describe the similarities and differences between the currencies.

---

# Exercise 4: Australian Airport Passengers (optional!)

- Open `oz-airport.Rmd`
- Contains data from the web site [Department of Infrastructure, Regional Development and Cities](https://bitre.gov.au/publications/ongoing/airport_traffic_data.aspx), containing data on Airport Traffic Data 1985–86 to 2017–18.
- Read the dataset into R, naming it `passengers`
- Tidy the data, to produce a data set with these columns
  - airport: all of the airports.
  - year
  - type_of_flight: DOMESTIC, INTERNATIONAL
  - bound: IN or OUT

---

# Lab quiz

Time to take the lab quiz.

---

# Learning is where you:

1. Receive information accurately
2. Remember the information (long term memory)
3. In such a way that you can reapply the information when appropriate

---

# Your Turn: Go to the data source at this link: [bit.ly/dmac-noaa-data](https://bit.ly/dmac-noaa-data)

- "Which is the best description of the temperature units?"
- "What is the best description of the precipitation units"
- "What does -9999 mean?"

???

"Which is the best description of the temperature units?"

- degrees Fahrenheit F
- degrees Kelvin K
- "degrees C x10"

"What is the best description of the precipitation units"

- "mm x10"
- inches

"What does -9999 mean?"

- it was really cold
- the keyboard got stuck
- "the value was missing"

---
class: refresher

# Recap

- Traffic Light System: Green = "good!" ; Red = "Help!"
- R + Rstudio
- Functions are ___
- columns in data frames are accessed with ___ ?

.red[If you have questions, place a red sticky note on your laptop.]

.white[If you are done, place a green sticky on your laptop]

---
class: refresher

# Traffic Light System

# .red[Red] Post it

- I need a hand
- Slow down

--

# Green Post it

* I am up to speed
* I have completed the thing

---
class: middle, center

# That's it!