This assignment is designed to simulate a scenario where you are given a dataset, take over someone's existing work, and continue with it to draw further insights. In that respect it is similar to Assignment 1, but it provides less scaffolding, asks you to draw more insights, and requires more communication. Your previous submission on the crime data was well received!
You've now been given your next task to work on. Your colleague at your consulting firm, Amelia (in the text treatment below), has written some helpful hints throughout the assignment to help guide you.
Questions that are worth marks are indicated with "Q" at the start and end of the question, as well as the number of marks in parentheses. For example,
## Q1A some text (0.5 marks)
is question one, part A, worth 0.5 marks.
This assignment will be worth 10% of your total grade, and is marked out of 58 marks total.
9 Marks for presentation of the data visualisations
Your marks will be weighted according to peer evaluation.
As of week 6, you have seen most of the code needed for parts 1 - 2, and Week 7 will give you the skills to complete part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering modelling skills in the coming weeks, but this assignment is designed to simulate a real-life work situation, which means there are some things you will need to "learn on the job". The vast majority of the assignment, however, covers things that you will have seen in class or in the readings.
Remember, you can look up the help file for functions by typing `?function_name`. For example, `?mean`. Feel free to google questions you have about how to do other kinds of plots, and post on ED if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with `***` or `___`. At a minimum, your assignment should be able to be "knitted" using the knit button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk option eval = FALSE like so:
```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```
If you do this, please remember to remove this chunk option or set it to eval = TRUE when you submit the assignment, to ensure all your R code runs.
You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:
Your assignments will be peer reviewed, and results checked for reproducibility. This means:
Each student will be randomly assigned another team’s submission to provide feedback on three things:
This assignment is due by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team's name. For example, if you are team dplyr, your assignment file name could read: "assignment-2-2020-s1-team-dplyr.Rmd"
You work as a data scientist at the aptly named consulting company, "Consulting for You".
On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:
Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner-city Melbourne! Amelia is still in New Zealand, and won't be back for a while now. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.
We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.
Most importantly, can you get this to me by Wednesday 20th May, COB? (COB = Close of Business, 6pm.)
I've given this dataset to some of the other new hire data scientists as well; you'll all be working on it as a team. I'd like you all to work on the questions separately, and then combine your answers to provide the best results.
From here, you are handed a USB stick. You plug it into your computer and see a folder called "melbourne-walk", containing a folder called "data-raw" and an Rmarkdown file with the start of a data analysis. Your job is to explore the data and answer the questions in the document.
Note that the text was originally written by Amelia; make sure that their name is kept up top, and pay attention to what they have to say in the document!
The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and available for download from Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.
There are three parts to this work:
#install.packages('sugrrants')
#install.packages('timeDate')
#install.packages('tsibble')
#install.packages("kableExtra")
library(broom)
library(ggmap)
library(knitr)
library(lubridate)
library(rwalkr)
library(sugrrants)
library(timeDate)
library(tsibble)
library(here)
library(readr)
library(tidyverse)
library(kableExtra)
Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to create different shapes for the following sensors:
First we download the data on the pedestrian sensor locations around Melbourne.
# only select active sensors
ped_loc <- pull_sensor() %>%
filter(status == "A")
selected_sensors <- c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")
# identify which sensors are in the selected list, and which are not
selected <- ped_loc %>% filter(sensor %in% selected_sensors)
nonselected <- ped_loc %>% filter(!(sensor %in% selected_sensors))
And now we draw a plot on a map tile of the pedestrian sensor locations
melb_map <- read_rds(here::here("data-raw/melb-map.rds"))
ggmap(melb_map) +
geom_point(data = nonselected,
aes(x = longitude,
y = latitude),
colour = "#2b8cbe",
alpha = 0.6,
size = 3) +
geom_point(data = selected,
aes(x = longitude,
y = latitude,
colour = sensor),
size = 4,
shape = 17) +
labs(x = "Longitude",
y = "Latitude") +
scale_colour_brewer(palette = "Dark2",
name = "Sensor") +
guides(col = guide_legend(nrow = 2,
byrow = TRUE)) +
theme(legend.position = "bottom")
!> Answer: The map above shows the sensor locations in and around inner-city Melbourne. The four selected sensors are marked as triangles and differentiated by colour. Three of these four sensors (all except Melbourne Central) sit near the Yarra River, whereas Melbourne Central is further inland. The north-west region around Flagstaff Gardens does not appear to have any sensors in view, while the area around the Melbourne Central sensor has the densest coverage of sensors.
ped_loc %>%
  # calculate the year from the installation date
  mutate(year = year(installation_date)) %>%
  # group the observations by year
  group_by(year) %>%
  # count up the number of sensors installed in each year
  summarise('Number of sensors installed' = n()) %>%
  # then use `kable()` to create a markdown table
  kable() %>%
  kable_styling(bootstrap_options = c("bordered"))
year | Number of sensors installed |
---|---|
2009 | 13 |
2013 | 12 |
2014 | 2 |
2015 | 7 |
2016 | 1 |
2017 | 9 |
2018 | 5 |
2019 | 6 |
2020 | 4 |
Additionally, how many sensors were added in 2016, and in 2017?
!> Answer: 1 sensor was added in 2016 and 9 sensors were added in 2017.
We would like you to focus on the foot traffic at 4 sensors:
Your task is to read in the hourly pedestrian count data for 2018 using rwalkr (you might want to cache this chunk so it doesn't run every time you knit). A sketch of how this might be done is shown after the output below.
## # A tibble: 181,440 x 5
## Sensor Date_Time Date Time Count
## <chr> <dttm> <date> <int> <int>
## 1 Bourke Street Mall (North) 2018-01-01 00:00:00 2018-01-01 0 895
## 2 Bourke Street Mall (South) 2018-01-01 00:00:00 2018-01-01 0 734
## 3 Melbourne Central 2018-01-01 00:00:00 2018-01-01 0 2996
## 4 Town Hall (West) 2018-01-01 00:00:00 2018-01-01 0 3052
## 5 Princes Bridge 2018-01-01 00:00:00 2018-01-01 0 1757
## 6 Flinders Street Station Underpass 2018-01-01 00:00:00 2018-01-01 0 3443
## 7 Birrarung Marr 2018-01-01 00:00:00 2018-01-01 0 1828
## 8 Webb Bridge 2018-01-01 00:00:00 2018-01-01 0 605
## 9 Southern Cross Station 2018-01-01 00:00:00 2018-01-01 0 1411
## 10 Victoria Point 2018-01-01 00:00:00 2018-01-01 0 963
## # ... with 181,430 more rows
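The chunk that downloaded this data is not shown in the handover document. Below is a minimal sketch of how it might be recreated with rwalkr - the function and arguments are an assumption (check ?melb_walk or ?melb_walk_fast for the current interface), and the date range is inferred from the output above. Caching the chunk stops it re-downloading on every knit.
```{r download-walk-2018, cache = TRUE}
# Hypothetical sketch: pull hourly pedestrian counts for 1 Jan - 30 Apr 2018.
# The exact call Amelia used is not shown; this assumes rwalkr's melb_walk().
walk_2018 <- melb_walk(from = as.Date("2018-01-01"),
                       to   = as.Date("2018-04-30"))
```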
Next, take the data downloaded with rwalkr and filter it down to the four sensors of interest, adding some date-based columns with lubridate:
walk_2018 <- walk_2018 %>%
# Filter the data down to include only the four sensors above
filter(Sensor %in% selected_sensors) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
day_of_year = yday(Date))
walk_2018
## # A tibble: 11,520 x 9
## Sensor Date_Time Date Time Count mday month year
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl>
## 1 Melbo~ 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018
## 2 Flind~ 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018
## 3 Birra~ 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018
## 4 South~ 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018
## 5 Melbo~ 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018
## 6 Flind~ 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018
## 7 Birra~ 2018-01-01 01:00:00 2018-01-01 1 1143 1 1 2018
## 8 South~ 2018-01-01 01:00:00 2018-01-01 1 436 1 1 2018
## 9 Melbo~ 2018-01-01 02:00:00 2018-01-01 2 1721 1 1 2018
## 10 Flind~ 2018-01-01 02:00:00 2018-01-01 2 3157 1 1 2018
## # ... with 11,510 more rows, and 1 more variable: day_of_year <dbl>
Now we can plot the pedestrian count information for January - April in 2018
ggplot(walk_2018,
aes(x = Date_Time,
y = Count)) +
geom_line(size = 0.3) +
facet_grid(Sensor ~ .,
# this code presents the facets in a nice way
labeller = labeller(Sensor = label_wrap_gen(20))) +
# this code makes the x axis a bit nicer to read
scale_x_datetime(date_labels = "%d %b %Y",
date_minor_breaks = "1 month") +
labs(x = "Date Time")
We can see that there are quite different patterns in each of the sensors. Let’s explore this further.
!> Answer: From the above plot, we can see that most of the activity is regular and consistent from day to day (normal commuting, say), with repeating patterns. At Melbourne Central there is one irregular spike in foot traffic some time after mid-February. At Birrarung Marr we can see several irregular bursts of activity (such as events or festivals) between January and April, across different days and times, which produce large swings in the pedestrian foot traffic.
We're primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people's daily schedules, let's plot the counts against time of day for each sensor.
ggplot(walk_2018,
aes(x = Time,
y = Count,
group = Sensor,
colour = Sensor)) +
geom_line() +
facet_wrap(~Sensor, labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2", name = "Sensor") +
theme(legend.position = "none")
Write a short paragraph that describes what the plot shows:
!> Answer: The above graph shows the time of day (in hours) on the x axis and the pedestrian counts on the y axis, with the four sensors shown as facets. It shows the rise and fall in pedestrian counts at each sensor over the course of the day.
Use the data inside the hols_2018 data frame to identify weekdays, weekends, and holidays.
hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")
walk_2018_hols <- walk_2018 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday=='Sun' | weekday=='Sat' | Date %in% hols_2018$date,
true = 'Non-workday',
false = 'Workday'
))
walk_2018_hols
## # A tibble: 11,520 x 11
## Sensor Date_Time Date Time Count mday month year
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl>
## 1 Melbo~ 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018
## 2 Flind~ 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018
## 3 Birra~ 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018
## 4 South~ 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018
## 5 Melbo~ 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018
## 6 Flind~ 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018
## 7 Birra~ 2018-01-01 01:00:00 2018-01-01 1 1143 1 1 2018
## 8 South~ 2018-01-01 01:00:00 2018-01-01 1 436 1 1 2018
## 9 Melbo~ 2018-01-01 02:00:00 2018-01-01 2 1721 1 1 2018
## 10 Flind~ 2018-01-01 02:00:00 2018-01-01 2 3157 1 1 2018
## # ... with 11,510 more rows, and 3 more variables: day_of_year <dbl>,
## # weekday <ord>, workday <chr>
Now create a plot to compare the workdays to the non-workdays.
ggplot(walk_2018_hols,
aes(x = Time, y = Count)) +
geom_line(size = 0.3,
alpha = 0.3) +
facet_grid(workday~Sensor,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2", name = "Sensor") +
theme(legend.position = "none")
Write a short paragraph that describes what the plot shows and helps us answer these questions:
!> Answer:
This graph shows the hourly pedestrian counts at different times of the day, split into working and non-working days, for the four selected locations.
On working days, Flinders Street Station Underpass and Southern Cross Station show a similar trend: in the early morning the counts are very low, then they rise through the morning to a peak around 9am, after which they gradually decrease through the afternoon.
The counts start to rise again in the late afternoon, peaking around 5pm, and then gradually fall away with some minor ups and downs.
On working days, Melbourne Central and Birrarung Marr show a different trend to Flinders Street and Southern Cross Station.
On non-working days, Flinders Street and Melbourne Central show the same trend, where the counts gradually decrease through the morning and then slowly start to increase after midday. At Southern Cross Station, on the other hand, the counts remain fairly consistent throughout the day, with only slight ups and downs.
The most striking pattern is that of Southern Cross Station, where the observed foot traffic is considerably low.
To locate those unusual moments, the Flinders Street Station data is calendarised on the canvas, using the sugrrants package. We can spot the unusual weekday patterns on public holidays using their colour. Using the calendar plot, try to spot another unusual pattern, and do a google search to try to explain the change in foot traffic. (Hint: look at the plot carefully - does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)
# filter to just look at flinders st station
flinders <- walk_2018_hols %>%
filter(Sensor == 'Flinders Street Station Underpass')
flinders_cal <- flinders %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>%
ggplot(aes(x = .Time, y = .Count, colour = workday, group = Date)) +
geom_line()
prettify(gg_cal) +
theme(legend.position = "bottom")
!> Answer: On 1 January the pattern is a bit unusual compared to other working days: it was New Year's Day, and foot traffic was high in the early morning because of New Year events such as the fireworks. On 26 January the pattern is again different, due to Australia Day. On 17 February the pattern differs from the normal weekend pattern: the counts keep increasing from the evening until the early hours of the next morning, after which they start to decrease. The reason for this is the White Night festival, which runs from 7pm until late at night.
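The chunk that pulled the 2020 pedestrian data is also not shown. A hedged sketch mirroring the 2018 download is below (again assuming rwalkr's melb_walk(); the date range is inferred from the output that follows).
```{r download-walk-2020, cache = TRUE}
# Hypothetical sketch: pull hourly pedestrian counts for 1 Jan - 30 Apr 2020,
# matching the window used for 2018.
walk_2020 <- melb_walk(from = as.Date("2020-01-01"),
                       to   = as.Date("2020-04-30"))
```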
## # A tibble: 182,952 x 5
## Sensor Date_Time Date Time Count
## <chr> <dttm> <date> <int> <int>
## 1 Bourke Street Mall (North) 2020-01-01 00:00:00 2020-01-01 0 1012
## 2 Bourke Street Mall (South) 2020-01-01 00:00:00 2020-01-01 0 725
## 3 Melbourne Central 2020-01-01 00:00:00 2020-01-01 0 3553
## 4 Town Hall (West) 2020-01-01 00:00:00 2020-01-01 0 3120
## 5 Princes Bridge 2020-01-01 00:00:00 2020-01-01 0 998
## 6 Flinders Street Station Underpass 2020-01-01 00:00:00 2020-01-01 0 2372
## 7 Birrarung Marr 2020-01-01 00:00:00 2020-01-01 0 2944
## 8 Webb Bridge 2020-01-01 00:00:00 2020-01-01 0 778
## 9 Southern Cross Station 2020-01-01 00:00:00 2020-01-01 0 1619
## 10 Victoria Point 2020-01-01 00:00:00 2020-01-01 0 1194
## # ... with 182,942 more rows
You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.
walk_2020 <- walk_2020 %>%
filter(Sensor %in% selected_sensors) %>%
# now use `mutate` to add four columns containing the day of the month, month, year, and day of the year, using functions from lubridate.
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
day_of_year = yday(Date))
walk_2020
## # A tibble: 11,616 x 9
## Sensor Date_Time Date Time Count mday month year
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl>
## 1 Melbo~ 2020-01-01 00:00:00 2020-01-01 0 3553 1 1 2020
## 2 Flind~ 2020-01-01 00:00:00 2020-01-01 0 2372 1 1 2020
## 3 Birra~ 2020-01-01 00:00:00 2020-01-01 0 2944 1 1 2020
## 4 South~ 2020-01-01 00:00:00 2020-01-01 0 1619 1 1 2020
## 5 Melbo~ 2020-01-01 01:00:00 2020-01-01 1 4370 1 1 2020
## 6 Flind~ 2020-01-01 01:00:00 2020-01-01 1 2036 1 1 2020
## 7 Birra~ 2020-01-01 01:00:00 2020-01-01 1 412 1 1 2020
## 8 South~ 2020-01-01 01:00:00 2020-01-01 1 388 1 1 2020
## 9 Melbo~ 2020-01-01 02:00:00 2020-01-01 2 1897 1 1 2020
## 10 Flind~ 2020-01-01 02:00:00 2020-01-01 2 2347 1 1 2020
## # ... with 11,606 more rows, and 1 more variable: day_of_year <dbl>
Now add the holiday data
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")
walk_2020_hols <- walk_2020 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday=='Sun' | weekday=='Sat' | Date %in% hols_2020$date,
true = 'Non-workday',
false = 'Workday'
))
melb_walk_hols <- bind_rows(walk_2018_hols, walk_2020_hols)
melb_walk_hols
## # A tibble: 23,136 x 11
## Sensor Date_Time Date Time Count mday month year
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl>
## 1 Melbo~ 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018
## 2 Flind~ 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018
## 3 Birra~ 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018
## 4 South~ 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018
## 5 Melbo~ 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018
## 6 Flind~ 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018
## 7 Birra~ 2018-01-01 01:00:00 2018-01-01 1 1143 1 1 2018
## 8 South~ 2018-01-01 01:00:00 2018-01-01 1 436 1 1 2018
## 9 Melbo~ 2018-01-01 02:00:00 2018-01-01 2 1721 1 1 2018
## 10 Flind~ 2018-01-01 02:00:00 2018-01-01 2 3157 1 1 2018
## # ... with 23,126 more rows, and 3 more variables: day_of_year <dbl>,
## # weekday <ord>, workday <chr>
!> Answer: Repeating lines of code can lead to errors and inconsistencies. Two approaches are used below to avoid repetition. 1. Functions: functions make the code easier to understand and reduce the number of lines of code. As requirements change, you only need to update the code in one place instead of many, and you eliminate incidental mistakes such as updating a variable name in one place but not in another. 2. Combining the datasets first (with bind_rows()) and then running the processing steps once over the combined data, rather than once per year.
filter_sensor <- function(data, sensors){
data %>% filter(Sensor %in% sensors)
}
add_day_info <- function(data){
  # use `mutate` to add four columns: day of the month, month, year,
  # and day of the year, using functions from lubridate
  data %>%
    mutate(mday = mday(Date),
           month = month(Date),
           year = year(Date),
           day_of_year = yday(Date))
}
add_working_day <- function(data){
walk_years <- unique(data$year)
hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
data %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday=='Sun'| weekday=='Sat' | Date %in% hols$date,
true = 'Non-workday',
false = 'Workday'
))
}
# Step one, combine the walking data
bind_rows(walk_2018, walk_2020) %>%
# Step two, filter the sensors
filter_sensor(sensors = selected_sensors) %>%
# step three, add the info on day of the year, month, etc
add_day_info() %>%
# step four, add info on working days.
add_working_day()
## # A tibble: 23,136 x 11
## Sensor Date_Time Date Time Count mday month year
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl>
## 1 Melbo~ 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018
## 2 Flind~ 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018
## 3 Birra~ 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018
## 4 South~ 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018
## 5 Melbo~ 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018
## 6 Flind~ 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018
## 7 Birra~ 2018-01-01 01:00:00 2018-01-01 1 1143 1 1 2018
## 8 South~ 2018-01-01 01:00:00 2018-01-01 1 436 1 1 2018
## 9 Melbo~ 2018-01-01 02:00:00 2018-01-01 2 1721 1 1 2018
## 10 Flind~ 2018-01-01 02:00:00 2018-01-01 2 3157 1 1 2018
## # ... with 23,126 more rows, and 3 more variables: day_of_year <int>,
## # weekday <ord>, workday <chr>
Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to see the pattern clearly.)
melb_walk_hols_flinders_april <- melb_walk_hols %>%
filter(month==4, Sensor == 'Flinders Street Station Underpass')
ggplot(melb_walk_hols_flinders_april,
aes(x = Time, y = Count,
colour = as.factor(year))) +
geom_line() +
facet_wrap(~weekday, ncol = 5) +
theme(legend.position = "bottom") +
labs(colour = "Year")
!> Answer: The above plot shows the hourly pedestrian counts across all days of the week at Flinders Street Station Underpass, in April 2018 and April 2020.
In April 2018, the weekday pattern is consistent from day to day. In the early morning, before 5am, the counts are very low. After 5am the counts begin to rise gradually, with a peak around 8am.
The counts then settle down, with minor rises and falls, until about 3pm, after which there is a sudden increase peaking at 5pm. After 5pm the counts fall away again. On weekends the pattern is a bit different: the counts decrease initially, pick up after 5am, and keep increasing until about 4pm, after which they fall away.
The reason for the weekday pattern is that 8am and 5pm are the times when most people travel to and from work; many of these people live in the suburbs and commute to offices in the city. On weekends, people prefer to stay out late, which is why the counts around midnight are higher on weekends than on weekdays.
April 2020 looks like a period where everything froze: the counts are very low across both weekdays and weekends, with much the same pattern throughout. The reason is that most people were working from home because of a major state lockdown due to the coronavirus pandemic.
What do you learn? Which sensors seem the most similar, or the most different?
melb_walk_hols_april <- melb_walk_hols %>% filter(month == 4)
ggplot(melb_walk_hols_april,
aes(x = Time, y = Count, color = as.factor(year))) +
geom_line() +
facet_grid(Sensor~weekday) +
theme(legend.position = "bottom") +
labs(colour = "Year")
!> Answer:
From the above set of plots, we see that Flinders Street Station Underpass and Melbourne Central show a similar pattern on weekdays. At Melbourne Central, the pedestrian counts are low in the early morning and rise slowly until around noon, after which there are a few ups and downs before the counts fall away in the evening. On weekends, the two places show almost exactly the same pattern.
Birrarung Marr and Southern Cross Station show quite different patterns altogether.
Far more pedestrians were recorded in April 2018 than in April 2020; the COVID-19 lockdown has of course caused this fall in numbers.
Combining weather data with pedestrian counts
One question we want to answer is: “Does the weather make a difference to the number of people walking out?”
Time of day and day of week are the predominant driving forces of the number of pedestrians, as depicted in the previous data plots. Apart from these temporal factors, the weather conditions could also affect how many people are walking in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or when it is raining hard.
Daily meteorological data from a separate source, available from the National Climatic Data Center, is used and joined to the main pedestrian data table using the common dates.
Binary variables are created to serve as the tipping points
We have pulled information on weather stations for the Melbourne area - can you combine it together into one dataset?
- prcp > 5 (if yes, "rain", if no, "none")
- tmax > 33 (if yes, "hot", if no, "not")
- tmin < 6 (if yes, "cold", if no, "not")
# Now create some flag variables
melb_weather_2018 <- read_csv(
here::here("data-raw/melb_ncdc_2018.csv")
) %>%
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.
Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.
melb_daily_walk_2018 <- melb_walk_hols %>%
filter(year == 2018) %>%
group_by(Sensor, Date) %>%
summarise(Count = sum(Count, na.rm = TRUE)) %>%
ungroup() %>%
rename(date = Date)
melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
inner_join(melb_weather_2018, by = 'date')
head(melb_daily_walk_weather_2018)
## # A tibble: 6 x 10
## Sensor date Count station tmax tmin prcp high_prcp high_temp low_temp
## <chr> <date> <int> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Birra~ 2018-01-01 8385 ASN000~ 26.2 14 0 none not not
## 2 Birra~ 2018-01-02 9844 ASN000~ 23.6 15.5 0 none not not
## 3 Birra~ 2018-01-03 4091 ASN000~ 22.3 11.2 0 none not not
## 4 Birra~ 2018-01-04 1386 ASN000~ 25.5 11.5 0 none not not
## 5 Birra~ 2018-01-05 1052 ASN000~ 30.5 12.2 0 none not not
## 6 Birra~ 2018-01-06 6672 ASN000~ 41.5 16.6 0 none hot not
Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (high_prcp, high_temp, and low_temp). Write a paragraph that tells us what you learn from these plots and how you think the weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots that summarise the data like this - what assumptions do they make?
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_prcp)) +
geom_boxplot() +
theme(legend.position = "bottom")
# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_temp)) +
geom_boxplot() +
theme(legend.position = "bottom")
# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = low_temp)) +
geom_boxplot() +
theme(legend.position = "bottom")
!> Answer:
As one would expect, pedestrian counts are higher when there is less rain. It is worth noting, though, that at Birrarung Marr the median pedestrian count is actually higher on rainy days. Southern Cross Station shows the starkest difference between rainy and non-rainy days.
Again, Birrarung Marr is the odd one out: its median foot traffic is lower on cooler days, whereas all the other sensors show less foot traffic when the daytime temperature exceeds 33 degrees Celsius.
Here we can also observe a limitation of this data. At first glance, we might infer from the low-temperature plot that there were barely any pedestrians travelling on colder days. In reality, between January and April 2018 there was only one day with a minimum temperature below 6 degrees Celsius, so this plot doesn't offer much insight.
The data above is summarised in the form of box plots. These kinds of plots can handle and summarise a large amount of data: they show the median (the horizontal line inside each box), the upper and lower quartiles, and the minimum and maximum values, which makes them an efficient way to understand the spread of the data, and they also highlight outliers.
A limitation of plots like this is that the exact values are not retained. From the first plot, we can speculate that many people prefer to shelter in the underpass when precipitation is high.
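One way to soften the loss of individual values is to overlay the daily totals on the boxes as points. This is just a sketch using the rain flag (the same idea works for the temperature flags):
```{r daily-count-rain-points}
# Overlay the individual daily totals so single days remain visible;
# outlier.shape = NA stops the boxplot drawing those points twice.
ggplot(melb_daily_walk_weather_2018,
       aes(y = Count,
           x = Sensor,
           colour = high_prcp)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitterdodge(jitter.width = 0.2),
             alpha = 0.3) +
  theme(legend.position = "bottom")
```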
The visualisations tell us something interesting about the data, but to really understand the data, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020, combine with the pedestrian data for 2018 and 2020.
melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv"))
melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv"))
# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
melb_weather_2020) %>%
# remember to add info about high precipitation, high temperature, + low temps
mutate(
high_prcp = if_else(condition = prcp > 5, true = "rain", false = "none"),
high_temp = if_else(condition = tmax > 33, true = "hot", false = "not"),
low_temp = if_else(condition = tmin < 6, true = "cold", false = "not")
) %>%
rename(Date = date)
# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>%
inner_join(melb_weather, by = 'Date')
head(melb_walk_weather)
## # A tibble: 6 x 18
## Sensor Date_Time Date Time Count mday month year
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl>
## 1 Melbo~ 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018
## 2 Flind~ 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018
## 3 Birra~ 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018
## 4 South~ 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018
## 5 Melbo~ 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018
## 6 Flind~ 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018
## # ... with 10 more variables: day_of_year <dbl>, weekday <ord>, workday <chr>,
## # station <chr>, tmax <dbl>, tmin <dbl>, prcp <dbl>, high_prcp <chr>,
## # high_temp <chr>, low_temp <chr>
We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.
We are going to process the data somewhat so we can fit a linear model to it. First, let's set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.
We also add one to the count and then take the natural log of it. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model to fit to this data, and we can help it along by taking the log of the counts, which helps stabilise the residuals (observed - predicted values) when we fit the model.
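As a quick illustration of why log1p() (rather than a plain log()) is used here: a count of zero would give -Inf under log(), while log1p() keeps it finite, and expm1() undoes the transformation so we can return to the count scale later. This is just base R, run on made-up numbers:
```{r log1p-demo}
# log() of a zero count is -Inf, which would break the linear model
log(0)
# log1p(x) computes log(1 + x), so zero counts map to 0
log1p(0)
# expm1() is the inverse, recovering the original (made-up) count of 250
expm1(log1p(250))
```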
melb_walk_weather_prep_lm <- melb_walk_weather %>%
mutate_at(.vars = vars(Sensor,
Time,
month,
year,
workday,
high_prcp,
high_temp,
low_temp),
as_factor) %>%
mutate(log_count = log1p(Count))
Now we fit a linear model, predicting log_count using Time, month, weekday, and the weather flag variables (high_prcp, high_temp, and low_temp).
walk_fit_lm <- lm (
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Provide some summary statistics on how well this model fits the data. What do you see? What statistics tell us about our model fit?
glance(walk_fit_lm)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.524 0.524 1.21 669. 0 36 -34259. 68592. 68887.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
!> Answer: R-squared, also called the coefficient of determination, measures the strength of the linear model fit. We see that the R-squared value is about 0.524, which means that roughly 52.4% of the variability in our target/response variable, log_count, is explained by the model.
AIC and BIC are also useful indicators, and lower values are better. Both of these values are quite high for this model, suggesting a relatively poor fit to this data.
We have had one look at the model fit statistics, but let's now look at the fitted and observed (log_count) values, for each sensor:
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
ggplot(peds_aug_lm,
aes(x = log_count,
y = .fitted)) +
geom_point() +
facet_wrap(~Sensor)
There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the data. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components. We can do this with a calendar plot.
Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, Model, and log_count.
flinders_lm <- peds_aug_lm %>%
# filter the data to only look at flinder st
filter(Sensor == 'Flinders Street Station Underpass') %>%
# Select the Date, Time, Count, .fitted, and log_count
select(Date, Time, Count, .fitted, log_count) %>%
# Now pivot the log count and fitted columns into a column called "model"
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(c(`.fitted`, `log_count`), names_to = "model", values_to = "log_count") %>%
# Now we're going to undo the initial data transformation
mutate(Count = expm1(log_count))
head(flinders_lm)
## # A tibble: 6 x 5
## Date Time Count model log_count
## <date> <fct> <dbl> <chr> <dbl>
## 1 2018-01-01 0 109. .fitted 4.70
## 2 2018-01-01 0 3443. log_count 8.14
## 3 2018-01-01 1 61.5 .fitted 4.13
## 4 2018-01-01 1 3579. log_count 8.18
## 5 2018-01-01 2 37.1 .fitted 3.64
## 6 2018-01-01 2 3157 log_count 8.06
flinders_lm_cal <- flinders_lm %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- ggplot(flinders_lm_cal, aes(x = .Time, y = .Count, group = Date)) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal, model == 'log_count'),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal, model == '.fitted'),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) + theme(legend.position = "bottom")
Write a paragraph answering these questions:
!> Answer:
In January 2020, from the second week onwards, the fitted values follow much the same pattern as the observed values across weekdays and weekends until 17 March, except on 27 January and 9 March. 27 January was the Australia Day public holiday, which is why the fitted and observed patterns differ, and 9 March was Labour Day. From around 16 March until mid-April, the observed pedestrian counts fall away as the government imposed bans on large gatherings due to the COVID-19 pandemic, and the fitted values decrease along with them.
Compared with 2018, the model behaves in much the same way: the fitted values follow the same trend as the observed values on most days, except on public holidays and other events or special occasions.
Across the sensors, there is no real difference in the pattern that the model follows.
We could add Sensor as a variable to the model, since it is likely to be well correlated with the target variable. We could also add mday, day_of_year and workday as features, since these are derived from existing features such as month, year, and weekday. Derived features like these can help reveal hidden patterns in the dataset: the existing features may not correlate directly with the target variable, but the derived features can, so it is worth adding them to make the model better.
What sort of improvements do you think you could make to the model? Fit two more models, trying out additional variables in the linear model.
walk_fit_lm_mod1 <- lm(
formula = log_count ~ Time + month + Sensor + weekday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
walk_fit_lm_mod2 <- lm(
formula = log_count ~ Time + month + mday + day_of_year + workday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Why did you add those variables?
!> Answer: Adding Sensor as an explanatory variable can help explain some of the variation in our target variable. Adding derived features to the model, as a form of feature creation, can also help reveal hidden patterns in the dataset.
bind_rows(
first = glance(walk_fit_lm),
second = glance(walk_fit_lm_mod1),
third = glance(walk_fit_lm_mod2),
.id = "type"
)
## # A tibble: 3 x 12
## type r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 first 0.524 0.524 1.21 669. 0 36 -34259. 68592.
## 2 seco~ 0.690 0.689 0.981 1242. 0 39 -29719. 59518.
## 3 third 0.569 0.568 1.16 875. 0 33 -33216. 66500.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Which model does the best? Why do you think that is? What statistics are you basing this decision off?
!> Answer: walk_fit_lm_mod1 does a better job than the other models, since its R-squared value is the highest, at 0.690; it also has the lowest AIC of the three.
(Suggestion - Perhaps write this as a function to speed up comparison)
peds_aug_lm_sensor_mod1 <- augment(walk_fit_lm_mod1, data = melb_walk_weather_prep_lm)
head(peds_aug_lm_sensor_mod1)
## # A tibble: 6 x 26
## Sensor Date_Time Date Time Count mday month year
## <fct> <dttm> <date> <fct> <int> <int> <fct> <fct>
## 1 Melbo~ 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018
## 2 Flind~ 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018
## 3 Birra~ 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018
## 4 South~ 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018
## 5 Melbo~ 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018
## 6 Flind~ 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018
## # ... with 18 more variables: day_of_year <dbl>, weekday <ord>, workday <fct>,
## # station <chr>, tmax <dbl>, tmin <dbl>, prcp <dbl>, high_prcp <fct>,
## # high_temp <fct>, low_temp <fct>, log_count <dbl>, .fitted <dbl>,
## # .se.fit <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## # .std.resid <dbl>
pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
lm_fit %>%
filter(Sensor == sensor) %>%
select(Date, Time, Count, .fitted, log_count) %>%
pivot_longer(c(`.fitted`, `log_count`), names_to = "model", values_to = "log_count") %>%
mutate(Count = expm1(log_count))
}
calendar_fit_obs <- function(lm_fit_aug){
data_cal <- lm_fit_aug %>%
frame_calendar(x = Time, y = Count, date = Date)
ggcal <- ggplot(data_cal, aes(x = .Time, y = .Count, group = Date)) +
geom_line(data = filter(data_cal, model == "log_count"),
aes(x = .Time, y = .Count, colour = model, group = Date)) +
geom_line(data = filter(data_cal, model == ".fitted"),
aes(x = .Time, y = .Count, colour = model, group = Date))
prettify(ggcal) + theme(legend.position = "bottom")
}
pivot_sensor(peds_aug_lm_sensor_mod1, sensor = "Flinders Street Station Underpass") %>%
filter(year(Date) == "2020") %>%
calendar_fit_obs()
What do you see? How does it compare to the previous model?
!> Answer: Compared to the previous model, this best model (with the added Sensor variable) shows a better fit to the observed values, and the model errors decrease as well.
Compare the fitted values against the residuals; perhaps write a function to help you do this in a more readable way.
walk_fit_lm_first <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
walk_fit_lm_second <- augment(walk_fit_lm_mod1, data = melb_walk_weather_prep_lm)
walk_fit_lm_third <- augment(walk_fit_lm_mod2, data = melb_walk_weather_prep_lm)
plot_fit_resid <- function(data){
ggplot(data,
aes(x = .fitted,
y = .resid)) +
geom_point() +
facet_wrap(~Sensor)
}
plot_fit_resid(walk_fit_lm_first)
plot_fit_resid(walk_fit_lm_second)
plot_fit_resid(walk_fit_lm_third)
We’ve looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?
!> Answer: The model with the added Sensor variable (the second model) is the best model, since its residuals are smaller across the range of fitted values (as we see in the plots above). This means that the model generalises the patterns well.
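To make this comparison explicit, here is a sketch plotting fitted against observed log counts for the second model, reusing the augmented data frame walk_fit_lm_second created above. Points hugging the dashed 45-degree line indicate close agreement between predictions and observations.
```{r fitted-vs-observed-best}
# Fitted vs observed log counts for the model that includes Sensor
ggplot(walk_fit_lm_second,
       aes(x = log_count,
           y = .fitted)) +
  geom_point(alpha = 0.2) +
  # dashed line marks perfect agreement between observed and fitted values
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~Sensor) +
  labs(x = "Observed log(Count + 1)",
       y = "Fitted log(Count + 1)")
```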
Make sure to reference all of the R packages that you used here, along with any links or blog posts that helped you answer these questions.
The data of sensors in City of Melbourne were downloaded from Melbourne Open Data Portal, https://data.melbourne.vic.gov.au/Transport/Pedestrian-Counting-System-2009-to-Present-counts-/b2ak-trbp
Daily meteorological data, downloaded from National Climatic Data Center, https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/
Inspiration of Question 3F – Use in conjunction with group_by, from https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
David Robinson and Alex Hayes (2020). broom: Convert Statistical Analysis Objects into Tidy Tibbles. R package version 0.5.5. https://CRAN.R-project.org/package=broom
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.28.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/.
Earo Wang (2020). rwalkr: API to Melbourne Pedestrian Data. R package version 0.5.2. https://CRAN.R-project.org/package=rwalkr
Wang, E, D Cook, and RJ Hyndman (2020). Calendar-based graphics for visualizing people’s daily schedules. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2020.1715226.
Diethelm Wuertz, Tobias Setz, Yohan Chalabi, Martin Maechler and Joe W. Byers (2018). timeDate: Rmetrics - Chronological and Calendar Objects. https://CRAN.R-project.org/package=timeDate
The code below is what was used to retrieve the data in the data-raw folder.
# melb_bbox <- c(min(ped_loc$longitude) - .001,
# min(ped_loc$latitude) - 0.001,
# max(ped_loc$longitude) + .001,
# max(ped_loc$latitude) + 0.001)
#
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
# path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
# compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
# file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
# col_names = c("ID",
# "lat",
# "lon",
# "elev",
# "state",
# "name",
# "v1",
# "v2",
# "v3"),
# skip = 353,
# n_max = 17081
# ) %>%
# filter(state == "MELBOURNE AIRPORT")
# #
# get_ncdc <- function(year){
# vroom::vroom(
# glue::glue(
# "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
# ),
# col_names = FALSE,
# col_select = 1:4
# )
# }
#
# clean_ncdc <- function(x){
# x %>%
# filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
# rename_all(~ c("station", "date", "variable", "value")) %>%
# mutate(date = ymd(date), value = value / 10) %>%
# pivot_wider(names_from = variable, values_from = value) %>%
# rename_all(tolower)
# }
# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
#
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))