background-color: #006DAE class: middle center hide-slide-number <div class="shade_black" style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;"> <i class="fas fa-exclamation-circle"></i> These slides are viewed best by Chrome and occasionally need to be refreshed if elements did not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>. </div> <br> .white[Press the **right arrow** to progress to the next slide!] --- background-image: url(images/bg1.jpg) background-size: cover class: hide-slide-number split-70 title-slide count: false .column.shade_black[.content[ <br> # .monash-blue.outline-text[ETC5510: Introduction to Data Analysis] <h2 class="monash-blue2 outline-text" style="font-size: 30pt!important;">Week 1</h2> <br> <h2 style="font-weight:900!important;">Week of introduction</h2> .bottom_abs.width100[ Lecturer: *Stuart Lee & Nicholas Tierney* Department of Econometrics and Business Statistics
<i class="fas fa-envelope faa-float animated "></i>
ETC5510.Clayton-x@monash.edu 9th Mar 2020 <br> ] ]] <div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);"> <div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;"> <svg class="clip-svg absolute"> <defs> <clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox"> <polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" /> </clipPath> </defs> </svg> </div> </div> --- background-image: url(https://media.giphy.com/media/OkJat1YNdoD3W/giphy.gif) background-size: contain background-position: 50% 50% class: center, bottom, bg-yellow --- # What is this course? This is an introductory course on **data analysis**. -- You can also think of it as an introduction to data science. -- **Q - What data analysis background does this course assume?** A - None. -- **Q - Is this an intro stat course?** A - Statistics `\(\ne\)` data science. BUT they are closely related. This course is a great way to get started with statistics. But it is **not** your typical high school statistics course. -- **Q - Will we be doing computing?** A - Yes. --- # What is this course? **Q - Is this an intro Computer Science course?** A - No, but there are some shared themes. -- **Q - What computing language will we learn?** A - R. -- **Q - Why not language X?** A - This course gives you the skills to hopefully learn *X* later! -- Taught as a **lectorial** (Lecture + Tutorial) -- It is **not** (typically) recorded because **you** are doing the work -- You have to show up to class to practice! --- class: motivator # The _language_ of data analysis .left-code[ This course is brought to you today by the letter "R"! ] .right-plot[ <img src="images/grover_R.png" width="70%" style="display: block; margin: auto;" /> Grover image sourced from https://en.wikipedia.org/wiki/Grover. ] --- # What is R? 
.blockquote[ R is a language for data analysis. If R seems a bit confusing, disorganized, and perhaps incoherent at times, in some ways that's because so is data analysis. -- Roger Peng, 12/07/2018 ] --- # Why R? - __Free__ - __Powerful__: Over 15000 contributed packages on the main repository (CRAN), as of March 2020, provided by top international researchers and programmers. - __Flexible__: It is a language, and thus allows you to create your own solutions - __Community__: A large, friendly and helpful global community, with lots of resources --- # Community .left-code[ The R Consortium conducted a survey of R users in 2017. These are the locations of the respondents. **8% of R users are aged between 18-24 BUT 45% of R users are aged between 25-34!** ] .right-plot[ <img src="images/R_community.png" width="90%" style="display: block; margin: auto;" /> ] --- # Sample of Australian organisations/companies that sent employees to [useR! 2018](https://user2018.r-project.org) .blockquote[ ABS, **CSIRO**, ATO, **Microsoft**, Energy Qld, Auto and General, Bank of Qld, BHP, AEMO, Google, Flight Centre, Youi, Amadeus Investment Partners, Yahoo, Sydney Trains, Tennis Australia, Rio Tinto, Reserve Bank of Australia, PwC, Oracle, **Netflix**, NOAA Fisheries, NAB, Menulog, Macquarie Bank, Honeywell, Geoscience Australia, DFAT, DPI, CBA, Bank of Italy, Australian Red Cross Blood Service, **Amazon**, **Bunnings**. ] --- class: center middle # R and RStudio .pull-left[ <img src="images/Rlogo.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/rstudio-logo.png" width="100%" style="display: block; margin: auto;" /> ] --- class: informative # What is R/RStudio? 
- R is a statistical programming language - RStudio is a convenient interface for R (an integrated development environment, IDE) -- .blockquote[ If R were **an airplane**, RStudio would be **the airport**, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer -- [Julie Lowndes](http://jules32.github.io/resources/RStudio_intro/) ] --- class: transition middle # Let's take a tour of R and RStudio --- class: bg-main1 background-image: url(images/rstudio-screenshot.png) background-size: contain background-position: 50% 50% class: center, bottom, white ??? - __Source code editor__: This is where you edit your script! Experiment with writing your data analysis pipeline, ... You can have multiple files open, there are useful shortcuts (eg "Run", "Knit"), code is highlighted usefully and there is tab-completion when you start typing. *Also, if you want to look at any of the data you have read or created, it can be viewed in a tab on this pane.* - __Console window__: This is where the code is executed. There is a prompt ">" which says *R is waiting for your command*. You don't actually need to type anything in this pane, you can run code directly from the Editor pane. The functions will show up in this window, and the results will be here. (Although, if you use an R notebook the result will show up in that notebook.) - __Help__: *?function* will show the help pages for the function here. - __Plot__: Plots you ask R to make will show up here, and you can zoom to make them bigger in a separate window if you want. - __Environment__: Data that you have read into R, or data or functions that you create will be listed here. **When you quit R, you will be asked if you want to save the environment. 
I suggest that you always answer NO.** Because we are scripting everything, it is always easy to re-create the objects in the environment. The only time saving the environment can be useful is if you have created something that took a while, but a better option is to save this object to a file, and read it in at the start of the next R session. --- class: transition # End of part 1 of Lecture 1A --- class: transition # Start of part 2 of Lecture 1A --- # Let's start writing... In your own time, read over the handout linked [here](https://mida.numbat.space/slides/setup.html). Once you have R and RStudio installed, we can begin the first exercise! .small[ This section is based on an exercise from [data science in a box by Mine Çetinkaya-Rundel](https://github.com/rstudio-education/datascience-box/blob/c2e3ed13417c896fb49b78fd4a5a551392286351/extras/exercises/01-unvotes/unvotes.Rmd) ] --- # Create your first data visualisation - Once you have opened RStudio, download and unzip the linked lab exercise into your course project [download link](https://mida.numbat.space/exercises/1a/mida-exercise-1a.zip) - Open the folder called "mida-exercise-1a" and click on the .Rproj file. This will open a new session in RStudio. - In the Files pane in the bottom right corner, open the file called `unvotes.Rmd`. Then click on the "Knit" button. - Go back to the file and change your name on top (in the `yaml` -- we'll talk about what this means later) and knit again. - Change the country names to those you're interested in. Spelling and capitalisation should match the data, so take a peek at the Appendix to see how the country names are spelled. Knit again. And voila, your first data visualisation! 
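---

# Why spelling and capitalisation matter

Country names are matched exactly, so a name that differs only in capitalisation will silently return nothing. The sketch below illustrates this with a small made-up data frame called `votes` (an assumption for illustration, not the data used in `unvotes.Rmd`) and base R's `subset()` and `%in%`.

```r
# Made-up data for illustration only; the lab exercise has its own data
votes <- data.frame(
  country = c("Australia", "New Zealand", "United Kingdom"),
  value   = c(10, 20, 30)
)

# Exact spelling and capitalisation: one matching row
subset(votes, country %in% c("Australia"))

# Lower-case "australia" is a different string, so no rows match
subset(votes, country %in% c("australia"))

# List the spellings that are actually present in the data
unique(votes$country)
```

If a country you typed does not show up in your plot, comparing it against the spellings listed in the Appendix is a good first check.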
---
class: transition

# End of part 2 of Lecture 1A

---
class: transition

# Start of part 3 of Lecture 1A

---

# R essentials: A short list (for now)

- Functions are (most often) verbs, followed by what they will be applied to in parentheses:

```r
do_this(to_this)
do_that(to_this, to_that, with_those)
```

--

For example:

--

```r
mean(c(1,2,1,2))
## [1] 1.5
```

---

# R essentials: A short list (for now)

- Columns (variables) in data frames are accessed with `$`:

```r
dataframe$var_name
```

--

For example:

--

```r
library(dplyr) # the starwars data frame ships with the dplyr package
starwars$name
##  [1] "Luke Skywalker"        "C-3PO"                 "R2-D2"
##  [4] "Darth Vader"           "Leia Organa"           "Owen Lars"
##  [7] "Beru Whitesun lars"    "R5-D4"                 "Biggs Darklighter"
## [10] "Obi-Wan Kenobi"        "Anakin Skywalker"      "Wilhuff Tarkin"
## [13] "Chewbacca"             "Han Solo"              "Greedo"
## [16] "Jabba Desilijic Tiure" "Wedge Antilles"        "Jek Tono Porkins"
## [19] "Yoda"                  "Palpatine"             "Boba Fett"
## [22] "IG-88"                 "Bossk"                 "Lando Calrissian"
## [25] "Lobot"                 "Ackbar"                "Mon Mothma"
## [28] "Arvel Crynyd"          "Wicket Systri Warrick" "Nien Nunb"
## [31] "Qui-Gon Jinn"          "Nute Gunray"           "Finis Valorum"
## [34] "Jar Jar Binks"         "Roos Tarpals"          "Rugor Nass"
## [37] "Ric Olié"              "Watto"                 "Sebulba"
## [40] "Quarsh Panaka"         "Shmi Skywalker"        "Darth Maul"
## [43] "Bib Fortuna"           "Ayla Secura"           "Dud Bolt"
## [46] "Gasgano"               "Ben Quadinaros"        "Mace Windu"
## [49] "Ki-Adi-Mundi"          "Kit Fisto"             "Eeth Koth"
## [52] "Adi Gallia"            "Saesee Tiin"           "Yarael Poof"
## [55] "Plo Koon"              "Mas Amedda"            "Gregar Typho"
## [58] "Cordé"                 "Cliegg Lars"           "Poggle the Lesser"
## [61] "Luminara Unduli"       "Barriss Offee"         "Dormé"
## [64] "Dooku"                 "Bail Prestor Organa"   "Jango Fett"
## [67] "Zam Wesell"            "Dexter Jettster"       "Lama Su"
## [70] "Taun We"               "Jocasta Nu"            "Ratts Tyerell"
## [73] "R4-P17"                "Wat Tambor"            "San Hill"
## [76] "Shaak Ti"              "Grievous"              "Tarfful"
## [79] "Raymus Antilles"       "Sly Moore"             "Tion Medon"
## [82] "Finn"                  "Rey"                   "Poe Dameron"
## [85] "BB8"                   "Captain Phasma"        "Padmé Amidala"
```

---

# R essentials: A short list (for now)

- Packages are installed with the
`install.packages` function (usually needed only once) and loaded with the `library` function (needed once per session):

```r
install.packages("package_name")
library(package_name)
```

---

# What can you do at the end of semester?

Some of our best final projects:

* [instagram](https://ebsmonash.shinyapps.io/Instagram/)
* [babynames](https://ebsmonash.shinyapps.io/BabyNames/)
* [oztourism](https://ebsmonash.shinyapps.io/OzTourism/)
* [salary gaps](https://dmac.dicook.org/project/project_python_r/project#introduction)
* [FantasyAFL](https://ebsmonash.shinyapps.io/FantasyAFL/)

---

# What you need to learn

.blockquote[
Data preparation accounts for about 80% of the work of data scientists

-- [Gil Press, Forbes 2016](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/##47cbbbf46f63)
]

--

## Data Preparation

* Data preparation is one of the least taught parts of data science and business analytics, yet it is what data scientists spend most of their time on.
* By the end of this semester, you will have the tools to be more efficient and effective in this area, so that you have more time to spend on your mining and modeling.

---

# Learning objectives

The learning goals associated with this unit are to:

1. Learn to read different data formats, learn about tidy data and wrangling techniques
2. Apply effective visualisation and modelling to understand relationships between variables, and make decisions with data
3. Develop communication skills using reproducible reporting.

---

# Philosophy

.blockquote[
If you feed a person a fish, they eat for a day. If you teach a person to fish, they eat for a lifetime.
]

Whatever I do in the data analysis that is shown to you during the class, you can do it, too.
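---

# R essentials: try it yourself

The three R essentials from part 3 can be tried together in the console. This is a minimal sketch, not a course exercise: it assumes the built-in `mtcars` data frame for the `$` example, and it only uses dplyr (the package that provides `starwars`) if it is already installed.

```r
# 1. A function call: the verb first, then what it applies to
mean(c(1, 2, 1, 2))

# 2. `$` pulls one column (variable) out of a data frame;
#    mtcars is a data frame that ships with R
head(mtcars$mpg)

# 3. Packages: installing is a one-off step, library() is needed each session.
#    This only checks whether dplyr is available; it does not install anything.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  print(head(starwars$name, 3))
} else {
  message('dplyr is not installed; run install.packages("dplyr") first')
}
```

Checking with `requireNamespace()` first is a design choice for this sketch, so the code runs either way and never installs software without you asking for it.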
--- class: informative # Course Website: mida.numbat.space - "mida" = Masters Introduction to Data Analysis - "numbat" = Non-Uniform-Monash-Business-Analytics-Team - [unit guide](https://unitguidemanager.monash.edu/view?unitCode=ETC5510&tpCode=S1-01&tpYear=2020) (authority on course structure). - Lecture notes for each class - Assignment and project instructions - Textbook + other online resources related to topics - Consultation times (4 x 1Hr consultations) --- # Using laptops - We will assume that you have R & RStudio installed on your own computer. - This course is also set up as a "MoVE unit", which means you can borrow a laptop from the university for class hours. - It is also possible to set up R and RStudio on a USB stick to use with your borrowed laptop. --- # Grading <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:center;position: sticky; top:0; background-color: #FFFFFF;"> Assessment </th> <th style="text-align:center;position: sticky; top:0; background-color: #FFFFFF;"> Weight </th> <th style="text-align:center;position: sticky; top:0; background-color: #FFFFFF;"> Task </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Reading Quiz </td> <td style="text-align:center;"> 5% </td> <td style="text-align:center;"> Complete on ED prior to each class, for the first 8 weeks. Quiz needs to be completed by class time. No mulligans. One can be missed without penalty. </td> </tr> <tr> <td style="text-align:center;"> Lab Exercise </td> <td style="text-align:center;"> 5% </td> <td style="text-align:center;"> Each class period will have a quiz to be completed individually. Two can be missed without penalty. </td> </tr> </tbody> </table> --- # Grading Example: Reading Quiz - Before 6pm on Wednesday, you need to complete the 5-question **reading quiz** on ED. - Before 6pm **next Monday**, you need to complete the 5-question **reading quiz** on ED. 
--- # Grading Example: Lab Exercise There is time at the end of class to complete the **lab exercise on ED**: - Before 8pm **next Monday (16th March)**, you need to complete the 10-question **Lab Exercise** on ED. - Before 8pm **next Wednesday (18th March)**, you need to complete the 10-question **Lab Exercise** on ED. --- # Grading <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Assessment </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Weight </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Task </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Assignment </td> <td style="text-align:left;"> 20% </td> <td style="text-align:left;"> Teamwork, data analysis challenge, due in weeks 4 and 8 </td> </tr> <tr> <td style="text-align:left;"> Mid-Sem Theory + Concept exam </td> <td style="text-align:left;"> 20% </td> <td style="text-align:left;"> Due week 6 </td> </tr> <tr> <td style="text-align:left;"> Data Analysis Exam </td> <td style="text-align:left;"> 20% </td> <td style="text-align:left;"> Due week 11 </td> </tr> <tr> <td style="text-align:left;"> Project </td> <td style="text-align:left;"> 30% </td> <td style="text-align:left;"> Due week 11 </td> </tr> </tbody> </table> --- # Textbook .pull-left[ <img src="https://raw.githubusercontent.com/hadley/r4ds/master/cover.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ * Free * Written by authors of [Tidyverse R packages](http://tidyverse.org/) ] --- class: informative # Ed System - Online quizzes - Conduct discussions - Ask questions about the course material and exercises, and turn in assignments and the project. 
*Only your name and email address are recorded in the ED system.* -- (DEMO) --- background-image: url(https://imgs.njtierney.com/edstem.png) background-size: contain class: center, bottom, bg-indigo --- # Tips for asking questions - First search existing discussions for answers. If the question has already been answered, you're done! If it has already been asked but you're not satisfied with the answer, add to the thread. -- - Give your question context from course concepts, not course assignments. - Good context: "I have a question on filtering data" - Bad context: "I have a question on Assignment 1" --- # Tips for asking questions - Be precise in your description: - Good description: "I am getting the following error and I'm not sure how to resolve it - `Error: could not find function "ggplot"`" - Bad description: "R giving errors, help me! Aaaarrrrrgh!" - Remember: you can edit a question after posting it. --- class: informative # How do you do well in this class - Do the reading prior to each class period. -- - Participate actively in this class. -- - Ask questions on **ED**. --- class: informative # How do you do well in this class - Come to consultation if you have questions. -- - Practice the materials taught in each lectorial by doing more exercises from the textbook. -- - Be curious, be positive, be engaged. --- class: motivator # Remember: All information is on the website 😄 -- Post questions on ED **instead of** sending them by email --- # Diversity & Inclusiveness: - Intent: Students from all backgrounds and perspectives should be well served by this course, students' learning needs should be addressed both in and out of class, and the diversity that students bring to this class should be viewed as a resource, strength and benefit. - It is my intent to present materials and activities that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture. 
Let me know of ways to improve the effectiveness of the course for you personally, or for other students or student groups. --- # Diversity & Inclusiveness: - If you have a name and/or set of pronouns that differ from those that appear in your official Monash records, please let me know! - If you feel like your performance in the class is being impacted by your experiences outside of class, please don't hesitate to come and talk with me. I want to be a resource for you. If you prefer to speak with someone outside of the course, talk to Di Cook, or look at the services available to you through the [Monash student support services](https://www.monash.edu/students/support). --- # Diversity & Inclusiveness: - I (like many people) am still in the process of learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it. --- # Sharing / Reusing code - I am well aware that a huge volume of code is available on the web to solve any number of problems. - Unless I explicitly tell you not to use something, the course's policy is that you may make use of any online resources (e.g. StackOverflow), but you must explicitly cite where you obtained any code you directly use (or use as inspiration). This can be as simple as pasting the link in a references section. --- # Sharing / Reusing code - Any recycled code not explicitly cited will be treated as plagiarism. - Assignment groups may not directly share code with another group. - You are welcome to discuss the problems together and ask for advice, but you may not make direct use of code from another team. --- # Group Assignments What we expect: - Conducted according to the [Monash policies](https://www.monash.edu/__data/assets/pdf_file/0011/1098659/Team-Assessment-Guidelines.pdf). - Each member of the group completes the entire assignment, as best they can. - Group members compare answers and combine them into one document for the final submission. 
- 25% of the assignment grade will come from peer evaluation. - Peer evaluation is an important learning tool. --- # Group Assignments: Peer evaluation Each student will be randomly assigned another team's submission to provide feedback on three things: 1. Could you reproduce the analysis? 2. Did you learn something new from the other team's approach? 3. What would you suggest to improve their work? --- # Group Assignments: Working in groups - Conflicts can arise in group work. -- - They can be both productive and destructive. -- - Teams need to work on managing conflicts and building on the strengths of all team members. --- # Group Assignments: Working in groups - For each assignment, you will be given the option to comment on the efforts of your other group members. -- - If a team member has not contributed to an assignment submission, they might score a 0. -- - In this situation the team will need to discuss team function and dysfunction with the instructor. --- class: transition middle # Group Assignments Assignment 1 will be announced at class on Monday Week 2 --- class: refresher # Concepts introduced: .pull-left[ - How to edit R code - Creating Data Visualisations - R - RStudio ] .pull-right[ - Console - Using R as a calculator - Environment - Loading and viewing a data frame - Accessing a variable in a data frame - R functions ] --- class: bg-main1 # Lab Exercise Check your knowledge and comprehension by taking your first lab quiz on Ed Go to the ED page, and complete the lab quiz before next Monday, 16th March. --- background-image: url(images/bg1.jpg) background-size: cover class: hide-slide-number split-70 count: false .column.shade_black[.content[ <br><br> # That's it! .bottom_abs.width100[ Lecturer: Stuart Lee & Nicholas Tierney Department of Econometrics and Business Statistics<br>
<i class="fas fa-envelope faa-float animated "></i>
ETC5510.Clayton-x@monash.edu 9th Mar 2020 ] ]] <div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);"> <div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;"> <svg class="clip-svg absolute"> <defs> <clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox"> <polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" /> </clipPath> </defs> </svg> </div> </div> ??? ### Cheatsheets RStudio started a trend by writing really concise summaries, and others have added to the collection. You can find the RStudio collection in the "Help" menu on the IDE, and at https://www.rstudio.com/resources/cheatsheets/. Start with the [RStudio IDE cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf).
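
### Saving one object instead of the whole environment

The RStudio tour notes suggest answering NO when R asks to save the environment, and saving a slow-to-create object to a file instead. The notes do not name specific functions, so the following is only a sketch of one way to do it, assuming base R's `saveRDS()` and `readRDS()`. The object `slow_result` and the temporary file path are hypothetical stand-ins.

```r
# Hypothetical object standing in for something that took a while to create
slow_result <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

# Save the single object to a file. A temporary file is used here so that
# nothing is left behind; in a project you would choose your own path.
path <- tempfile(fileext = ".rds")
saveRDS(slow_result, file = path)

# At the start of the next session, read it back in (under any name you like)
restored <- readRDS(path)

# Check that the restored copy matches the original
identical(slow_result, restored)
```

Because the object is read back explicitly in the script, the analysis stays reproducible: anyone running the script can see where every object comes from.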