This assignment is designed to simulate a scenario where you are given a dataset, are taking over someone’s existing work, and are continuing with it to draw further insights. In that respect it is similar to Assignment 1, but it provides less scaffolding and asks you to draw more insights, as well as do more communication.
Your previous submission of crime data was well received!
You’ve now been given a different next task to work on. Your colleague at your consulting firm, Amelia (in the text treatment below) has written some helpful hints throughout the assignment to help guide you.
Questions that are worth marks are indicated with “Q” at the start of the question, along with the number of marks in parentheses. For example
## Q1A some text (0.5 marks)
is question one, part A, worth 0.5 marks.
This assignment will be worth 10% of your total grade, and is marked out of 58 marks total.
9 Marks for grammar and clarity. You must write in complete sentences and do a spell check.
9 Marks for presentation of the data visualisations
40 marks for the questions
Your marks will be weighted according to peer evaluation.
As of week 6, you have seen most of the code for parts 1 - 2 that needs to be used here, and Week 7 will give you the skills to complete part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering skills on modelling in the next weeks, but this assignment is designed to simulate a real life work situation - this means that there are some things where you need to “learn on the job”. But the vast majority of the assignment will cover things that you will have seen in class, or the readings.
Remember, you can look up the help file for functions by typing `?function_name`. For example, `?mean`. Feel free to google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with `***` or `___`. At a minimum, your assignment should be able to be “knitted” using the knit button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to `eval = FALSE`, like so:
```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```
If you do this, please remember to remove this chunk option or set it to `eval = TRUE` when you submit the assignment, to ensure all your R code runs.
You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:
Your assignments will be peer reviewed, and results checked for reproducibility. This means:
Each student will be randomly assigned another team’s submission to provide feedback on three things:
This assignment is due by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team’s name. For example, if you are team `dplyr`, your assignment file name could read: “assignment-2-2020-s1-team-dplyr.Rmd”
You work as a data scientist at the well-named consulting company, “Consulting for You”.
On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:
Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner city Melbourne! Amelia is still in New Zealand, and won’t be back for a while. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.
We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.
Most importantly, can you get this to me by Wednesday 20th May, COB? (COB = close of business, 6pm.)
I’ve given this dataset to some of the other new hire data scientists as well; you’ll all be working as a team on this dataset. I’d like you to all try to work on the questions separately, and then combine your answers together to provide the best results.
From here, you are handed a USB stick. You load this into your computer, and you see a folder called “melbourne-walk”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.
Note that the text was originally written by Amelia, so make sure that their name is kept up top, and pay attention to what they have to say in the document!
The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and available for download from Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.
There are three parts to this work:
Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to create different shapes for the following sensors:
First we download the data on the pedestrian sensor locations around Melbourne.
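Below is a minimal sketch of that download step, assuming the `pull_sensor()` helper from rwalkr returns one row per sensor with (among others) `longitude`, `latitude` and `installation_date` columns; if the live API is unavailable, a cached copy in `data-raw` could be read instead.

```{r sensor-locations, message = FALSE}
library(rwalkr)

# Sketch: pull the sensor location table from the Melbourne Open Data Portal.
# Column names are assumed - check names(ped_loc) after running.
ped_loc <- pull_sensor()
```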
And now we draw a plot on a map tile of the pedestrian sensor locations
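A sketch of that plotting step is below, assuming the map tile is stored as `data-raw/melb-map.rds` (file name assumed) and that the sensor names sit in a column called `sensor_description` (also an assumption - adjust to the actual column name in `ped_loc`).

```{r sensor-map, message = FALSE}
library(tidyverse)
library(ggmap)

# read the map tile saved in data-raw (file name assumed)
melb_map <- read_rds(here::here("data-raw/melb-map.rds"))

# the four sensors we want to show with a different shape
sensors_of_interest <- c("Southern Cross Station",
                         "Melbourne Central",
                         "Flinders Street Station Underpass",
                         "Birrarung Marr")

ggmap(melb_map) +
  geom_point(data = ped_loc %>%
               # `sensor_description` is an assumed column name
               mutate(highlight = if_else(sensor_description %in% sensors_of_interest,
                                          "sensor of interest",
                                          "other sensor")),
             aes(x = longitude,
                 y = latitude,
                 shape = highlight),
             colour = "#d95f02",
             size = 2) +
  labs(x = "Longitude", y = "Latitude", shape = "")
```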
!> Answer: This map displays the locations of the various pedestrian sensors around Melbourne, with longitude plotted on the x-axis and latitude on the y-axis. From this, we can visualise the distribution of the sensors around the city, particularly the four of interest for this analysis - Birrarung Marr, Flinders Street Station Underpass, Melbourne Central and Southern Cross Station.
ped_loc %>%
# calculate the year from the date information
mutate(year = year(installation_date)) %>%
# count up the number of sensors
count(year) %>%
# then use `kable()` to create a markdown table
kable(caption = "Number of sensors installed each year", col.names = c("Year", "Count")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Year | Count |
---|---|
2009 | 13 |
2013 | 12 |
2014 | 2 |
2015 | 7 |
2016 | 1 |
2017 | 9 |
2018 | 5 |
2019 | 6 |
2020 | 4 |
Additionally, how many sensors were added in 2016, and in 2017?
!> Answer: From this table it appears that 1 sensor was added in 2016 and 9 were added in 2017.
We would like you to focus on the foot traffic at 4 sensors:
Your task is to:

- Extract the data for the year 2018 from Jan 1 to April 30 using rwalkr, as sketched below (you might want to cache this so it doesn’t run every time you knit)
- Filter the data down to include only the four sensors above; note that you could do this directly with rwalkr
- Add variables that contain the day of the month, month, year, and day of the year.
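Here is a minimal sketch of the extraction step, assuming `melb_walk_fast()` from rwalkr accepts a `year` argument and returns hourly counts with `Sensor`, `Date_Time`, `Date`, `Time` and `Count` columns (the chunk is cached so the download does not rerun on every knit):

```{r download-walk-2018, cache = TRUE, message = FALSE}
library(rwalkr)
library(dplyr)

# Pull the 2018 counts and keep January through April only.
# melb_walk(from, to) could alternatively be used to request just this range.
walk_2018 <- melb_walk_fast(year = 2018) %>%
  filter(Date >= as.Date("2018-01-01"),
         Date <= as.Date("2018-04-30"))
```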
walk_2018 <- walk_2018 %>%
# Filter the data down to include only the four sensors above
filter(Sensor %in% c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")
) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
yday = yday(Date)
)
Now we can plot the pedestrian count information for January - April in 2018
ggplot(walk_2018,
aes(x = Date_Time,
y = Count)) +
geom_line(size = 0.3) +
facet_grid(Sensor ~.,
             # this code presents the facets in a nice way
labeller = labeller(Sensor = label_wrap_gen(20))) +
  # this code makes the x axis a bit nicer to read
scale_x_datetime(date_labels = "%d %b %Y",
date_minor_breaks = "1 month") +
labs(x = "Date Time") +
ggtitle("Pedestrian count against time for the selected sensors")
We can see that there are quite different patterns in each of the sensors. Let’s explore this further.
!> Answer: The sensors capture two different types of activity at each location: people entering from one direction, and people exiting from the other.
We’re primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people’s daily schedules, let’s plot the counts against time of day for each sensor.
ggplot(walk_2018,
aes(x = Time,
y = Count,
group = Date,
colour = Sensor)) +
geom_line() +
facet_wrap(~ Sensor,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2",
name = "Sensor") +
theme(legend.position = "none") +
ggtitle("Counts for each sensor dependent on time of day")
Write a short paragraph that describes what the plot shows:
!> Answer: This plot depicts the number of pedestrians counted by the various sensors at different times of the day, with Time on the x-axis and Count on the y-axis. Each line represents a different day recorded over the time period. The data for Flinders Street Station Underpass and Southern Cross Station are quite similar, with significant peaks at around 8am and 5pm - likely due to increased traffic around the railway stations at the start and end of the work day. This pattern remains quite consistent across the whole period, except that Southern Cross Station has quite a few days with almost no traffic. These days possibly represent weekends, when fewer pedestrians are accessing the station. Meanwhile, Birrarung Marr experiences much more variation in the count variable, with no clear trend emerging. Melbourne Central is the most consistent of the four sensors, experiencing a steady level of traffic from 12pm to 9pm on most days recorded. This is perhaps due to most people shopping at the centre during these times.
Use the data inside the `hols_2018` object to identify weekdays, weekends, and holidays.
hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")
walk_2018_hols <- walk_2018 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = Date %in% hols_2018$date | weekday %in% c("Sat", "Sun"),
true = "no",
false = "yes"
))
Now create a plot to compare the workdays to the non-workdays.
ggplot(walk_2018_hols,
aes(x = Time,
y = Count,
group = Date,
colour = workday)) +
geom_line(size = 0.3,
alpha = 0.3) +
facet_grid(weekday ~ Sensor,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2", name = "Sensor") +
theme(legend.position = "none") +
ggtitle("Counts for working days against non-working days")
Write a short paragraph that describes what the plot shows, and helps us answer these questions:
!> Answer: This plot depicts the number of pedestrians counted by the various sensors at different times of the day, with Time on the x-axis and Count on the y-axis. Each line represents a different day over the period, with the orange lines depicting working days and the green lines depicting weekends or public holidays. The Flinders Street Station Underpass and Southern Cross Station both experience higher counts on working days, confirming the assumption made above that these sensors rely on workday traffic, with Southern Cross Station in particular experiencing almost no traffic on Saturday and Sunday. Meanwhile, the consistent trend for Melbourne Central holds true for both working and non-working days. With Birrarung Marr, it appears that pedestrian traffic is much lower on working days and correspondingly higher on weekends and public holidays in particular. This may be due to various events being held in the area on these days.
To locate those unusual moments, the Flinders Street Station data is calendarised on the canvas, using the sugrrants package. We can spot the unusual weekday patterns on public holidays using their colour. Using the calendar plot, try to spot another unusual pattern, and do a google search to try to explain the change in foot traffic. (Hint: look at the plot carefully - does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)
# filter to just look at flinders st station
flinders <- walk_2018_hols %>%
filter(Sensor == "Flinders Street Station Underpass")
flinders_cal <- flinders %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>%
ggplot(aes(x = .Time, y = .Count, colour = workday, group = Date)) +
geom_line()
prettify(gg_cal) +
theme(legend.position = "bottom") +
  ggtitle("Calendar plot of pedestrian counts for the Flinders Street Station Underpass")
!> Answer: The weekend of the third week of February exhibits a distinct pattern in foot traffic, rising until the end of the day at 12am and then declining over the following few hours of the next day. This is likely due to the White Night event held on February 17th of that year, which took place from 7pm on Saturday until 7am on Sunday and was described as a “twelve hour overnight celebration of creativity held in Melbourne’s CBD”. As such, there were more pedestrians moving around the city than usual, contributing to the higher counts recorded. Additionally, this weekend fell during Chinese New Year, which may also have contributed to the higher than average number of people.
You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.
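As with 2018, here is a sketch of how `walk_2020` might be pulled (same assumptions about `melb_walk_fast()` as before):

```{r download-walk-2020, cache = TRUE, message = FALSE}
# Pull the 2020 counts and keep January through April only.
walk_2020 <- melb_walk_fast(year = 2020) %>%
  filter(Date >= as.Date("2020-01-01"),
         Date <= as.Date("2020-04-30"))
```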
walk_2020 <- walk_2020 %>%
filter(Sensor %in% c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")
) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
yday = yday(Date)
)
Now add the holiday data
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")
walk_2020_hols <- walk_2020 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
           condition = Date %in% hols_2020$date | weekday %in% c("Sat", "Sun"),
true = "no",
false = "yes")
)
melb_walk_hols <- bind_rows(walk_2018_hols, walk_2020_hols)
The code from the previous section can be repeated in a more efficient manner by using functions. Three functions can be created to achieve this, and they are displayed below. The first function is designed to filter the data so that only the four sensors of interest are included. The second adds the required columns for the day of the month, the month number, the year and the day of the year. The third function uses the holiday dates from the tsibble package to distinguish between working and non-working days.
filter_sensor <- function(data, sensors){
  # keep only the rows belonging to the sensors of interest
  data %>% filter(Sensor %in% sensors)
}
add_day_info <- function(data){
  # now add four columns using `mutate`, containing the day of the month, month, year, and day of the year, using functions from lubridate.
data %>% mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
yday = yday(Date))
}
add_working_day <- function(data){
walk_years <- unique(data$year)
hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
data %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
             condition = Date %in% hols$date | weekday %in% c("Sat", "Sun"),
true = "no",
false = "yes"
))
}
# Step one, combine the walking data
bind_rows(walk_2018, walk_2020) %>%
  # Step two, filter to just the four sensors of interest
  filter_sensor(sensors = c("Southern Cross Station",
                            "Melbourne Central",
                            "Flinders Street Station Underpass",
                            "Birrarung Marr")) %>%
# step three, add the info on day of the year, month, etc
add_day_info() %>%
  # step four, add info on working days.
add_working_day() %>%
head(10) %>%
kable(caption = "Pedestrian count data for 2018 & 2020 from Melbourne Central, Flinders Street Station Underpass, Southern Cross Station & Birrarung Marr") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Sensor | Date_Time | Date | Time | Count | mday | month | year | yday | weekday | workday |
---|---|---|---|---|---|---|---|---|---|---|
Melbourne Central | 2018-01-01 00:00:00 | 2018-01-01 | 0 | 2996 | 1 | 1 | 2018 | 1 | Mon | no |
Flinders Street Station Underpass | 2018-01-01 00:00:00 | 2018-01-01 | 0 | 3443 | 1 | 1 | 2018 | 1 | Mon | no |
Birrarung Marr | 2018-01-01 00:00:00 | 2018-01-01 | 0 | 1828 | 1 | 1 | 2018 | 1 | Mon | no |
Southern Cross Station | 2018-01-01 00:00:00 | 2018-01-01 | 0 | 1411 | 1 | 1 | 2018 | 1 | Mon | no |
Melbourne Central | 2018-01-01 01:00:00 | 2018-01-01 | 1 | 3481 | 1 | 1 | 2018 | 1 | Mon | no |
Flinders Street Station Underpass | 2018-01-01 01:00:00 | 2018-01-01 | 1 | 3579 | 1 | 1 | 2018 | 1 | Mon | no |
Birrarung Marr | 2018-01-01 01:00:00 | 2018-01-01 | 1 | 1143 | 1 | 1 | 2018 | 1 | Mon | no |
Southern Cross Station | 2018-01-01 01:00:00 | 2018-01-01 | 1 | 436 | 1 | 1 | 2018 | 1 | Mon | no |
Melbourne Central | 2018-01-01 02:00:00 | 2018-01-01 | 2 | 1721 | 1 | 1 | 2018 | 1 | Mon | no |
Flinders Street Station Underpass | 2018-01-01 02:00:00 | 2018-01-01 | 2 | 3157 | 1 | 1 | 2018 | 1 | Mon | no |
Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to clearly see the pattern.)
melb_walk_hols_flinders_april <- melb_walk_hols %>%
filter(Sensor == "Flinders Street Station Underpass", month == 4)
ggplot(melb_walk_hols_flinders_april,
aes(x = Time,
y = Count,
group = Date,
colour = as.factor(year))) +
facet_wrap(~ mday, ncol = 5) +
geom_line() +
theme(legend.position = "bottom") +
labs(colour = "Year") +
  ggtitle("Comparison of pedestrian counts for April 2018 and April 2020 for the Flinders Street Station Underpass")
!> Answer: This plot depicts the number of pedestrians counted by the Flinders Street Station Underpass sensor over the month of April, with Time on the x-axis and Count on the y-axis. Each line represents an individual day of the month, with the orange lines for 2018 and the blue ones for 2020. Quite clearly, the counts for 2018 are consistently higher than those for 2020 on every day of the month, with the values for 2020 being close to 0. This is likely due to the COVID-19 outbreak and the subsequent lockdown laws in the city, which prevented travel for non-essential reasons and caused the closure of many workplaces, so that fewer pedestrians passed the sensors. 2018, meanwhile, saw no such restrictions in place. Counts for the 2nd and the 25th are lower than average for 2018 due to the public holidays on Easter Monday and Anzac Day respectively, whereas 2020 saw no significant drops on these days compared to comparable days of the week.
What do you learn? Which Sensors seem the most similar? Or the most different?
melb_walk_hols_april <- melb_walk_hols %>% filter(month == 4)
ggplot(melb_walk_hols_april,
aes(x = Time,
y = Count,
group = Date,
colour = as.factor(year)
)) +
geom_line() +
facet_grid(mday ~ Sensor) +
theme(legend.position = "bottom") +
labs(colour = "Year") +
  ggtitle("Comparison of pedestrian counts for April 2018 and April 2020 for all 4 sensors")
!> Answer: This plot depicts the number of pedestrians counted by each of the four sensors over the month of April, with Time on the x-axis and Count on the y-axis. Each line represents an individual day of the month; the orange lines depict 2018 and the blue lines depict 2020. It appears that Birrarung Marr is the most similar across the two years, with both experiencing periods of low pedestrian traffic, although 2020 is missing the various peaks seen in 2018. Meanwhile, Flinders Street Station Underpass shows the greatest difference between the two years, having a consistently low level of traffic throughout the day in 2020 rather than the two peaks and generally higher level seen in 2018. Likewise, both Melbourne Central and Southern Cross Station are down in 2020 compared to their 2018 levels.
Combining weather data with pedestrian counts
One question we want to answer is: “Does the weather make a difference to the number of people walking out?”
Time of day and day of week are the predominant driving forces of the number of pedestrians, as depicted in the previous data plots. Apart from these temporal factors, weather conditions could possibly affect how many people are walking in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or it is raining hard.
Daily meteorological data from a separate source, available from the National Climatic Data Center, is used and joined to the main pedestrian data table using common dates.
Binary variables are created to serve as the tipping points:

- `prcp` > 5 (if yes, “rain”, if no, “none”)
- `tmax` > 33 (if yes, “hot”, if no, “not”)
- `tmin` < 6 (if yes, “cold”, if no, “not”)

We have pulled information on weather stations for the Melbourne area - can you combine it together into one dataset?

# Now create some flag variables
melb_weather_2018 <- read_csv(
here::here("data-raw/melb_ncdc_2018.csv")
) %>%
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.
Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.
melb_daily_walk_2018 <- melb_walk_hols %>%
filter(year == 2018) %>%
group_by(Sensor, Date) %>%
summarise(Count = sum(Count, na.rm = TRUE)) %>%
ungroup()
melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
left_join(melb_weather_2018, by = c("Date" = "date"))
melb_daily_walk_weather_2018 %>%
head(10) %>%
kable(caption = "Joined pedestrian and weather data for 2018") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Sensor | Date | Count | station | tmax | tmin | prcp | high_prcp | high_temp | low_temp |
---|---|---|---|---|---|---|---|---|---|
Birrarung Marr | 2018-01-01 | 8385 | ASN00086282 | 26.2 | 14.0 | 0 | none | not | not |
Birrarung Marr | 2018-01-02 | 9844 | ASN00086282 | 23.6 | 15.5 | 0 | none | not | not |
Birrarung Marr | 2018-01-03 | 4091 | ASN00086282 | 22.3 | 11.2 | 0 | none | not | not |
Birrarung Marr | 2018-01-04 | 1386 | ASN00086282 | 25.5 | 11.5 | 0 | none | not | not |
Birrarung Marr | 2018-01-05 | 1052 | ASN00086282 | 30.5 | 12.2 | 0 | none | not | not |
Birrarung Marr | 2018-01-06 | 6672 | ASN00086282 | 41.5 | 16.6 | 0 | none | hot | not |
Birrarung Marr | 2018-01-07 | 1567 | ASN00086282 | 22.0 | 15.7 | 0 | none | not | not |
Birrarung Marr | 2018-01-08 | 6201 | ASN00086282 | 23.6 | 15.9 | 0 | none | not | not |
Birrarung Marr | 2018-01-09 | 8403 | ASN00086282 | 22.8 | 13.9 | 0 | none | not | not |
Birrarung Marr | 2018-01-10 | 12180 | ASN00086282 | 25.5 | 12.1 | 0 | none | not | not |
Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (`high_prcp`, `high_temp`, and `low_temp`). Write a paragraph that tells us what you learn from these plots, and how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots summarised like this - what assumptions do they make?
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_prcp)) +
geom_boxplot() +
theme(legend.position = "bottom") +
  ggtitle("Distribution of pedestrian counts for days classified as rainy")
# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_temp)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Distribution of pedestrian counts for days classified as hot")
# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = low_temp)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Distribution of pedestrian counts for days classified as cold")
!> Answer: The above plots depict summaries of the pedestrian count data recorded across the 4 sensors in Melbourne for the first four months of 2018, with Sensor on the x-axis and Count on the y-axis. The first plot compares days with rain and days without. From this plot it can be seen that generally people are less likely to go out on rainy days; however, there is much greater variation on non-rainy days in comparison. Meanwhile, the second plot, depicting hot days (above 33 degrees) against non-hot days, displays somewhat less difference when this variable is accounted for, with the median being lower for some sensors and higher for others. Likewise, the variation changes depending on which sensor is chosen. The final graph depicts cold days (less than 6 degrees) vs non-cold days. As can be seen on the graph, there is limited data available for these types of days. This may be because the chosen period covers the final two months of summer and the first two months of autumn; during these warmer months only one day below 6 degrees was recorded, making comparisons somewhat more difficult. However, the median values do appear to be quite similar for these days across all four sensors. Furthermore, while these plots are well equipped to summarise large amounts of data and identify the presence of outliers, they are limited in that they don’t provide insight into each individual data point, only summary statistics. This is particularly evident in the cold weather plot.
The visualisations tell us something interesting about the data, but to really understand the data, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020; combine it with the pedestrian data for 2018 and 2020.
melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv"))
melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv"))
# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
melb_weather_2020) %>%
# remember to add info about high precipitation, high temperature, + low temps
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>%
left_join(melb_weather, by = c("Date" = "date"))
We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.
We are going to process the data somewhat so we can fit a linear model to it. First, let’s set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.
We also add one to the count and then take its natural log. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model to fit to this data, and we can help it be more ideal by taking the log of the counts, which helps stabilise the residuals (observed - predicted) when we fit the model.
melb_walk_weather_prep_lm <- melb_walk_weather %>%
mutate_at(.vars = vars(Sensor,
Time,
month,
year,
workday,
high_prcp,
high_temp,
low_temp),
as_factor) %>%
mutate(log_count = log1p(Count))
Now we fit a linear model, predicting `log_count` using Time, month, weekday and the weather flag variables (`high_prcp`, `high_temp`, and `low_temp`).
walk_fit_lm <- lm (
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Provide some summary statistics on how well this model fits the data. What do you see? What statistics tell us about our model fit?
glance(walk_fit_lm) %>%
kable(caption = "Model fit statistics for walk_fit_lm linear model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.5244989 | 0.5237145 | 1.213939 | 668.6655 | 0 | 36 | -34259.01 | 68592.02 | 68886.7 | 31266.37 | 21217 |
!> Answer: The r.squared value is one of the most common measurements of the strength of a linear model fit. This value gives the proportion of the variability in the response variable explained by the model, with the remaining variation explained by variables not in the model. Here, the r.squared value means that approximately 52.4% of the variability in the log count of pedestrians is explained by the Time, month, day of the week, and whether the day was rainy, hot or cold.
We have had one look at the model fit statistics, but let’s now look at the fitted and observed (`log_count`) values for each sensor:
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
ggplot(peds_aug_lm,
aes(x = log_count,
y = .fitted)) +
geom_point() +
facet_wrap(~Sensor) +
  ggtitle("Fitted vs observed log counts for each sensor from the walk_fit_lm linear model")
There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the data. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components. We can do this directly with a calendar plot.
Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, Model, and log_count.
flinders_lm <- peds_aug_lm %>%
# filter the data to only look at flinder st
filter(Sensor == "Flinders Street Station Underpass") %>%
# Select the Date, Time, Count, .fitted, and log_count
select(Date,
Time,
Count,
.fitted,
log_count) %>%
# Now pivot the log count and fitted columns into a column called "model"
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(.fitted:log_count, names_to = "model", values_to = "log_count") %>%
  # Now we're going to undo the initial data transformation
mutate(Count = expm1(log_count))
flinders_lm_cal <- flinders_lm %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- ggplot(flinders_lm_cal) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) + theme(legend.position = "bottom") +
  ggtitle("Calendar plot of fitted counts vs observed counts for the walk_fit_lm linear model")
Write a paragraph answering these questions:
!> Answer: This plot depicts the counts of pedestrians recorded by the Flinders Street Station Underpass sensor in 2020, for each day between January and the end of April. The orange line depicts the value fitted by the model while the blue line depicts the log of the observed count data. Generally the observed values are higher than the fitted values across the recorded period, up until the end of March and beginning of April. This may be because some variables are not being captured by the model. As mentioned previously, the lockdown restrictions in 2020 significantly altered the pedestrian counts, and these laws came into effect around the period where the observed and predicted lines flip. Similarly, as seen previously, each sensor has a different pattern in its pedestrian counts. Thus these two variables should be considered for the model, as it is clear they affect the counts of pedestrians, which is the response variable for the model.
What sort of improvements do you think you could make to the model? Fit two more models, trying to add more variables to the linear model.
walk_fit_lm_sensor <- lm(
formula = log_count ~ Time + month + weekday + Sensor + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
walk_fit_lm_year <- lm(
formula = log_count ~ Time + month + weekday + year + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Why did you add those variables?
!> Answer: As mentioned in the above section, Sensor was chosen because each of the four sensors exhibits a different pattern of pedestrian traffic to varying degrees. Meanwhile, the year has a clear impact on the counts of pedestrians recorded due to the lockdown laws introduced in 2020. As such, the response variable, log_count, is likely to be affected by these variables, as a correlation might exist.
bind_rows(
  first = glance(walk_fit_lm),
sensor = glance(walk_fit_lm_sensor),
year = glance(walk_fit_lm_year),
.id = "type"
) %>%
kable(caption = "Model fit statistics for additional linear models") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
type | r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|---|
sensor | 0.6898277 | 0.6892721 | 0.9805127 | 1241.5851 | 0 | 39 | -29718.93 | 59517.86 | 59836.43 | 20395.25 | 21214 |
year | 0.5542828 | 0.5535264 | 1.1753331 | 732.8801 | 0 | 37 | -33571.64 | 67219.28 | 67521.92 | 29307.95 | 21216 |
Which model does the best? Why do you think that is? What statistics are you basing this decision off?
!> Answer: From these two models, the one incorporating the Sensor variable appears to be the best, with this model explaining roughly 68.9% of the variability in log_count compared to only 55.4% for the model incorporating the year. Thus, more of the variability is explained by the sensor model. Similarly, AIC, BIC and deviance can also be used to compare models, with lower values indicating a better model fit. As can be seen, the Sensor model has a lower value for all of these statistics, supporting its position as the superior model.
(Suggestion - Perhaps write this as a function to speed up comparison)
peds_aug_lm_sensor <- augment(walk_fit_lm_sensor, data = melb_walk_weather_prep_lm)
pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
lm_fit %>%
filter(Sensor == sensor) %>%
select(Date,
Time,
Count,
.fitted,
log_count) %>%
pivot_longer(.fitted:log_count, names_to = "model", values_to = "log_count") %>%
mutate(Count = expm1(log_count))
}
calendar_fit_obs <- function(lm_fit_aug){
data_cal <- lm_fit_aug %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <-
ggplot(data_cal) +
geom_line(data = filter(data_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(data_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) + theme(legend.position = "bottom")
}
pivot_sensor(peds_aug_lm_sensor) %>%
filter(year(Date) == "2020") %>%
calendar_fit_obs() +
  ggtitle("Calendar plot of fitted counts vs observed counts for the walk_fit_lm_sensor linear model")
What do you see? How does it compare to the previous model?
!> Answer: This plot depicts the counts of pedestrians recorded by the Flinders Street Station Underpass sensor in 2020, for each day between January and the end of April. The orange line depicts the value fitted by the model while the blue line depicts the log of the observed count data. Compared to the previous calendar plot, the fitted model is much closer to the observed values for most of the time period. However, the issue still exists for the period from the end of March to April, where the observed values fall well below the fitted ones. This is likely because the year variable was not included. Thus in future models both the Sensor and year variables could be included.
Compare the fitted against the residuals, perhaps write a function to help you do this in a more readable way.
walk_fit_lm_original_aug <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
walk_fit_lm_year_aug <- augment(walk_fit_lm_year, data = melb_walk_weather_prep_lm)
walk_fit_lm_sensor_aug <- augment(walk_fit_lm_sensor, data = melb_walk_weather_prep_lm)
plot_fit_resid <- function(data){
ggplot(data,
aes(x = .fitted,
y = .std.resid)) +
geom_point() +
geom_hline(yintercept = 0, colour = "red") +
facet_wrap(~Sensor)
}
plot_fit_resid(walk_fit_lm_original_aug) +
ggtitle("Residual plot for the walk_fit_lm linear model")
plot_fit_resid(walk_fit_lm_year_aug) +
ggtitle("Residual plot for the walk_fit_lm_year linear model")
plot_fit_resid(walk_fit_lm_sensor_aug) +
ggtitle("Residual plot for the walk_fit_lm_sensor linear model")
We’ve looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?
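One way to make this comparison (a sketch, reusing the augmented data for the Sensor model created above) is to plot the observed log counts against the fitted values, with a 45-degree reference line:

```{r fitted-vs-observed-sensor}
ggplot(walk_fit_lm_sensor_aug,
       aes(x = log_count,
           y = .fitted)) +
  geom_point(alpha = 0.2) +
  # points on this line are predicted exactly; points far from it are poorly predicted
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  facet_wrap(~ Sensor) +
  ggtitle("Observed vs fitted log counts for the walk_fit_lm_sensor model")
```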
!> Answer: These plots depict the residuals for each of the fitted models, with the standardised residuals on the y-axis and the fitted values on the x-axis, where each point relates to one observation. Such plots are used to assess the appropriateness of linear models: if the points are randomly dispersed around the horizontal axis (depicted by the red line), then a linear model is appropriate. Beginning with the first graph, for the original model, the residuals for Flinders Street Station Underpass and Melbourne Central sit generally above the horizontal axis while the other two sit generally below. The second graph, for the year model, exhibits a similar pattern, suggesting that neither is a good fit for a linear model. Meanwhile, the final graph, for the Sensor model, depicts the most evenly dispersed residuals around the horizontal axis and is thus the most random. This suggests that it is the most appropriate model of the three.
Make sure to reference all of the R packages that you used here, along with any links or blog posts that you used to help you answer these questions
Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/.
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
Kirill Müller (2017). here: A Simpler Way to Find Your Files. R package version 0.1. https://CRAN.R-project.org/package=here
David Robinson and Alex Hayes (2020). broom: Convert Statistical Analysis Objects into Tidy Tibbles. R package version 0.5.6. https://CRAN.R-project.org/package=broom
Earo Wang (2020). rwalkr: API to Melbourne Pedestrian Data. R package version 0.5.2. https://CRAN.R-project.org/package=rwalkr
Wang, E, D Cook, and RJ Hyndman (2020). Calendar-based graphics for visualizing people’s daily schedules. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2020.1715226.
Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support exploration and modeling of temporal data. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2019.1695624.
Hadley Wickham, Jim Hester and Romain Francois (2018). readr: Read Rectangular Text Data. R package version 1.3.1. https://CRAN.R-project.org/package=readr
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Diethelm Wuertz, Tobias Setz, Yohan Chalabi, Martin Maechler and Joe W. Byers (2018). timeDate: Rmetrics - Chronological and Calendar Objects. R package version 3043.102. https://CRAN.R-project.org/package=timeDate
Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.28.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra
The code below is what was used to retrieve the data in the `data-raw` folder.
# melb_bbox <- c(min(ped_loc$longitude) - .001,
# min(ped_loc$latitude) - 0.001,
# max(ped_loc$longitude) + .001,
# max(ped_loc$latitude) + 0.001)
#
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
# path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
# compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
# file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
# col_names = c("ID",
# "lat",
# "lon",
# "elev",
# "state",
# "name",
# "v1",
# "v2",
# "v3"),
# skip = 353,
# n_max = 17081
# ) %>%
# filter(state == "MELBOURNE AIRPORT")
# #
# get_ncdc <- function(year){
# vroom::vroom(
# glue::glue(
# "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
# ),
# col_names = FALSE,
# col_select = 1:4
# )
# }
#
# clean_ncdc <- function(x){
# x %>%
# filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
# rename_all(~ c("station", "date", "variable", "value")) %>%
# mutate(date = ymd(date), value = value / 10) %>%
# pivot_wider(names_from = variable, values_from = value) %>%
# rename_all(tolower)
# }
# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
#
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))