<!-- background-color: #006DAE --> <!-- class: middle center hide-slide-number --> <div class="shade_black" style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;"> <i class="fas fa-exclamation-circle"></i> These slides are viewed best by Chrome and occasionally need to be refreshed if elements did not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>. </div> <br> .white[Press the **right arrow** to progress to the next slide!] --- background-image: url(images/bg1.jpg) background-size: cover class: hide-slide-number split-70 title-slide count: false .column.shade_black[.content[ <br> # .monash-blue.outline-text[ETC5510: Introduction to Data Analysis] <h2 class="monash-blue2 outline-text" style="font-size: 30pt!important;">Week 4, part A</h2> <br> <h2 style="font-weight:900!important;">Relational data, and joins</h2> .bottom_abs.width100[ Lecturer: *Nicholas Tierney & Stuart Lee* Department of Econometrics and Business Statistics
<i class="fas fa-envelope faa-float animated "></i>
ETC5510.Clayton-x@monash.edu April 2020 <br> ] ]] <div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);"> <div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;"> <svg class="clip-svg absolute"> <defs> <clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox"> <polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" /> </clipPath> </defs> </svg> </div> </div> --- class: refresher # Recap - consultation hours - assignment 1 - tidy data --- class: refresher # Recap: Tidy data <img src="images/original-dfs-tidy.png" width="75%" style="display: block; margin: auto;" /> --- class: refresher # Overview - What is relational data? - Keys - Different sorts of joins - Using joins to follow an aircraft flight path --- # Relational data - Data analysis **rarely involves** only a single table of data. - To answer questions you generally need to combine many tables of data - Multiple tables of data are called *relational data* - It is the **relations**, not just the individual datasets, that are important. --- # `nycflights13` - Data set of flights that departed NYC in 2013 from https://www.transtats.bts.gov - a public database of all USA commercial airline flights. It has five tables: 1. flights 1. airlines 1. airports 1. planes 1. weather --- # flights ```r library(nycflights13) flights ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- # airlines ```r airlines ## # A tibble: 16 x 2 ## carrier name ## <chr> <chr> ## 1 9E Endeavor Air Inc. ## 2 AA American Airlines Inc. ## 3 AS Alaska Airlines Inc. ## 4 B6 JetBlue Airways ## 5 DL Delta Air Lines Inc. ## 6 EV ExpressJet Airlines Inc. ## 7 F9 Frontier Airlines Inc. ## 8 FL AirTran Airways Corporation ## 9 HA Hawaiian Airlines Inc. ## 10 MQ Envoy Air ## 11 OO SkyWest Airlines Inc. ## 12 UA United Air Lines Inc. ## 13 US US Airways Inc. ## 14 VX Virgin America ## 15 WN Southwest Airlines Co. ## 16 YV Mesa Airlines Inc. ``` --- # airports ```r airports ## # A tibble: 1,458 x 8 ## faa name lat lon alt tz dst tzone ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Yo… ## 2 06A Moton Field Municipal A… 32.5 -85.7 264 -6 A America/Chicago ## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chicago ## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_Yo… ## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_Yo… ## 6 0A9 Elizabethton Municipal … 36.4 -82.2 1593 -5 A America/New_Yo… ## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/New_Yo… ## 8 0G7 Finger Lakes Regional A… 42.9 -76.8 492 -5 A America/New_Yo… ## 9 0P2 Shoestring Aviation Air… 39.8 -76.6 1000 -5 U America/New_Yo… ## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_An… ## # … with 1,448 more rows ``` --- # planes ```r planes ## # A tibble: 3,322 x 9 ## tailnum year type manufacturer model engines seats speed engine ## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> ## 1 N10156 2004 Fixed wing m… EMBRAER EMB-1… 2 55 NA Turbo-… ## 2 N102UW 1998 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 3 N103US 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 4 N104UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 5 N10575 2002 Fixed wing m… EMBRAER EMB-1… 2 55 NA Turbo-… ## 6 N105UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 7 N107US 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 8 N108UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 9 N109UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 10 N110UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## # … with 3,312 more rows ``` --- # weather ```r weather ## # A tibble: 26,115 x 15 ## origin year month day hour temp dewp humid wind_dir wind_speed ## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 ## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 ## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 ## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 ## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 ## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 ## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 ## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 ## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 ## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 ## # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>, ## # pressure <dbl>, visib <dbl>, time_hour <dttm> ``` --- # Concept map of tables and joins from the text <img src="images/relational-nycflights.png" width="80%" style="display: block; margin: auto;" /> --- # Keys 🔑 - Keys = variables used to connect records in one table to another. - In the `nycflights13` data, - `flights` connects to `planes` by a single variable `tailnum` - `flights` connects to `airlines` by a single variable `carrier` - `flights` connects to `airports` by two variables, `origin` and `dest` - `flights` connects to `weather` using multiple variables, `origin`, and `year`, `month`, `day` and `hour`. --- # Your turn: go to Rstudio and open today's exercises - Load the `Lahman` package, which contains multiple tables of baseball data. - What key(s) connect the batting table with the salary table? - Can you draw out a diagram of the connections amongst the tables?
04
:
00
--- # Joins - "mutating joins", add variables from one table to another. - There is always a decision on what observations are copied to the new table as well. - Let's discuss how joins work using some [lovely animations](https://github.com/gadenbuie/tidyexplain) provided by [Garrick Aden-Buie](https://www.garrickadenbuie.com/). --- # Example data <img src="images/original-dfs.png" width="75%" style="display: block; margin: auto;" /> --- # Left Join (Generally the one you want to use) .left-code[ All observations from the "left" table, but only the observations from the "right" table that match those in the left. ] .right-plot[ <img src="gifs/left-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- # Right Join .left-code[ Same as left join, but in reverse. ] .right-plot[ <img src="gifs/right-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- # Inner join .left-code[ Intersection between the two tables, only the observations that are in both ] .right-plot[ <img src="gifs/inner-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- # Outer (full) join .left-code[ Union of the two tables, all observations from both, and missing values might get added ] .right-plot[ <img src="gifs/full-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- # Combine full airline name with flights data? ```r flights ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- # Combine full airline name with flights data? ```r airlines ## # A tibble: 16 x 2 ## carrier name ## <chr> <chr> ## 1 9E Endeavor Air Inc. ## 2 AA American Airlines Inc. ## 3 AS Alaska Airlines Inc. ## 4 B6 JetBlue Airways ## 5 DL Delta Air Lines Inc. ## 6 EV ExpressJet Airlines Inc. ## 7 F9 Frontier Airlines Inc. ## 8 FL AirTran Airways Corporation ## 9 HA Hawaiian Airlines Inc. ## 10 MQ Envoy Air ## 11 OO SkyWest Airlines Inc. ## 12 UA United Air Lines Inc. ## 13 US US Airways Inc. ## 14 VX Virgin America ## 15 WN Southwest Airlines Co. ## 16 YV Mesa Airlines Inc. ``` --- # Combine `airlines` & `flights` using `left_join()` .left-code[ ```r flights %>% left_join(airlines, by = "carrier") %>% glimpse() ``` ] .right-plot[ ``` ## Observations: 336,776 ## Variables: 20 ## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, … ## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,… ## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,… ## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -… ## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849… ## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851… ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -… ## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", … ## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, … ## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"… ## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"… ## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, … ## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,… ## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, … ## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, … ## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 … ## $ name <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Ame… ``` ] --- # Example: flights joining to airports .left-code[ ```r flights %>% left_join( airports, by = c("origin" = "faa")) %>% glimpse() ``` ] .right-plot[ ``` ## Observations: 336,776 ## Variables: 26 ## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, … ## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,… ## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,… ## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -… ## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849… ## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851… ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -… ## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", … ## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, … ## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"… ## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"… ## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, … ## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,… ## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, … ## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, … ## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 … ## $ name <chr> "Newark Liberty Intl", "La Guardia", "John F Kennedy I… ## $ lat <dbl> 40.69250, 40.77725, 40.63975, 40.63975, 40.77725, 40.6… ## $ lon <dbl> -74.16867, -73.87261, -73.77893, -73.77893, -73.87261,… ## $ alt <dbl> 18, 22, 13, 13, 22, 18, 18, 22, 13, 22, 13, 13, 13, 18… ## $ tz <dbl> -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5… ## $ dst <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",… ## $ tzone <chr> "America/New_York", "America/New_York", "America/New_Y… ``` ] --- # Airline travel, ontime data ```r plane_N4YRAA <- read_csv("data/plane_N4YRAA.csv") glimpse(plane_N4YRAA) ## Observations: 145 ## Variables: 8 ## $ FL_DATE <date> 2017-05-26, 2017-05-02, 2017-05-05, 2017-05-11, 2017-05-03,… ## $ CARRIER <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", … ## $ FL_NUM <dbl> 2246, 2276, 2278, 2287, 2288, 2291, 2297, 2297, 2297, 2297, … ## $ ORIGIN <chr> "CVG", "DFW", "DFW", "STL", "IND", "CHS", "DFW", "DFW", "MKE… ## $ DEST <chr> "DFW", "IND", "OKC", "ORD", "DFW", "DFW", "MKE", "MKE", "DFW… ## $ DEP_TIME <chr> "0748", "2020", "0848", "0454", "0601", "0807", "0700", "065… ## $ ARR_TIME <chr> "0917", "2323", "0941", "0600", "0719", "0947", "0905", "090… ## $ DISTANCE <dbl> 812, 761, 175, 258, 761, 987, 853, 853, 853, 853, 447, 447, … ``` --- # Airline travel, airport location ```r airport_raw <- read_csv("data/airports.csv") airport_raw %>% select(AIRPORT, LATITUDE, LONGITUDE, AIRPORT_STATE_NAME) %>% glimpse() ## Observations: 13,094 ## Variables: 4 ## $ AIRPORT <chr> "01A", "03A", "04A", "05A", "06A", "07A", "08A", "… ## $ LATITUDE <dbl> 58.10944, 65.54806, 68.08333, 67.57000, 57.74528, … ## $ LONGITUDE <dbl> -152.90667, -161.07167, -163.16667, -148.18389, -1… ## $ AIRPORT_STATE_NAME <chr> "Alaska", "Alaska", "Alaska", "Alaska", "Alaska", … ``` --- class: transition # Our Turn: Joining the two tables to show flight movements - Go to Rstudio and open "flight-movements.Rmd" and complete exercise - the aim is to show flight movement on the map - Next: Open "nycflights.Rmd" --- # Learning more - The coat explanation of joins: Different types of joins explained using a person and a coat, by [Leight Tami](https://twitter.com/leigh_tami18/status/1021471889309487105/photo/1) <img src="images/joins_using_coat.jpg" width="75%" style="display: block; margin: auto;" /> --- # References - Chapter 13 of R4DS