This assignment is designed to simulate a scenario where you are given a dataset, are taking over someone's existing work, and are continuing with it to draw further insights. In that way it is similar to Assignment 1, but it provides less scaffolding, and asks you to draw more insights as well as do more communication.
Your previous submission of crime data was well received!
You've now been given a new task to work on. Your colleague at your consulting firm, Amelia (in the text treatment below), has written some helpful hints throughout the assignment to guide you.
Questions that are worth marks are indicated with "Q" at the start and end of the question, as well as the number of marks in parentheses. For example,

## Q1A some text (0.5 marks)

is question one, part A, worth 0.5 marks.
This assignment will be worth 10% of your total grade, and is marked out of 58 marks total.
- 9 marks for grammar and clarity. You must write in complete sentences and run a spell check.
- 9 marks for presentation of the data visualisations
- 40 marks for the questions
Your marks will be weighted according to peer evaluation.
As of week 6, you have seen most of the code needed for parts 1 and 2, and week 7 will give you the skills to complete part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering modelling skills in the next weeks, but this assignment is designed to simulate a real-life work situation - there are some things you will need to "learn on the job". The vast majority of the assignment, however, covers things you have seen in class or in the readings.
Remember, you can look up the help file for functions by typing `?function_name`. For example, `?mean`. Feel free to google questions you have about how to do other kinds of plots, and post on the ED forum if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with `***` or `___`. At a minimum, your assignment should be able to be "knitted" using the knit button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to `eval = FALSE` like so:
```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```
If you do this, please remember to remove this chunk option or set it to `eval = TRUE` when you submit the assignment, to ensure all your R code runs.
You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:
Your assignments will be peer reviewed, and results checked for reproducibility. This means:
Each student will be randomly assigned another team’s submission to provide feedback on three things:
This assignment is due by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team's name. For example, if you are team `dplyr`, your assignment file name could read: "assignment-2-2020-s1-team-dplyr.Rmd".
You work as a data scientist at the aptly named consulting company, "Consulting for You".
On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:
Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner-city Melbourne! Amelia is still in New Zealand, and won't be back for a while. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.
We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.
Most importantly, can you get this to me by Wednesday 20th May, COB? (COB = Close of Business, 6pm.)
I've given this dataset to some of the other new hire data scientists as well; you'll all be working as a team on this dataset. I'd like you each to work on the questions separately, and then combine your answers together to provide the best results.
From here, you are handed a USB stick. You load this into your computer, and you see a folder called “melbourne-walk”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.
Note that the text was originally written by Amelia; make sure that their name is kept up top, and pay attention to what they have to say in the document!
The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and is available for download from the Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.
There are three parts to this work:
Amelia: I've downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to create different shapes for the following sensors:

- Birrarung Marr
- Flinders Street Station Underpass
- Melbourne Central
- Southern Cross Station
First we download the data on the pedestrian sensor locations around Melbourne.
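Amelia's download chunk is not shown here. A minimal sketch, assuming the `melb_sensor()` helper in rwalkr returns the sensor metadata (including `installation_date`, `latitude`, and `longitude`), could be:

```{r download-sensor-locations}
library(rwalkr)
# download the sensor location metadata (a sketch; melb_sensor() is assumed
# to return one row per sensor, with installation dates and coordinates)
ped_loc <- melb_sensor()
```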
And now we draw a plot on a map tile of the pedestrian sensor locations
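A sketch of the plotting step, assuming the map tile was saved as `data-raw/melb-map.rds` (see the appendix) and that `ped_loc` has `sensor_description`, `longitude`, and `latitude` columns - the path and column names are assumptions:

```{r plot-sensor-map}
library(ggmap)
# load the map tile Amelia saved (path is an assumption from the folder layout)
melb_map <- readr::read_rds(here::here("data-raw/melb-map.rds"))
key_sensors <- c("Birrarung Marr",
                 "Flinders Street Station Underpass",
                 "Melbourne Central",
                 "Southern Cross Station")
# flag the four sensors of interest so they get their own plotting shapes
ped_loc_flagged <- ped_loc %>%
  mutate(sensor_flag = if_else(sensor_description %in% key_sensors,
                               sensor_description,
                               "Other sensors"))
ggmap(melb_map) +
  geom_point(data = ped_loc_flagged,
             aes(x = longitude, y = latitude, shape = sensor_flag),
             colour = "#d95f02",
             size = 2.5) +
  labs(shape = "Sensor")
```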
!> Answer: This map shows the locations of all active pedestrian sensors in Melbourne, plotted on top of a map tile of the city. The sensors at Birrarung Marr, Flinders Street Station Underpass, Melbourne Central, and Southern Cross Station are given different shapes from the rest.
ped_loc %>%
# calculate the year from the date information
mutate(year = year(installation_date)) %>%
# count up the number of sensors
count(year) %>%
# then use `kable()` to create a markdown table
kable(caption = "Number of Sensors Installed Each Year", col.names = c("Year", "Sensors Installed"))
Year | Sensors Installed |
---|---|
2009 | 13 |
2013 | 12 |
2014 | 2 |
2015 | 7 |
2016 | 1 |
2017 | 9 |
2018 | 5 |
2019 | 6 |
2020 | 4 |
Additionally, how many sensors were added in 2016, and in 2017?
!> Answer: 1 sensor was installed in 2016, and 9 sensors in 2017.
We would like you to focus on the foot traffic at 4 sensors:

- Southern Cross Station
- Melbourne Central
- Flinders Street Station Underpass
- Birrarung Marr

Your task is to:

- Extract the data for the year 2018 from Jan 1 to April 30 using `rwalkr` (you might want to cache this so it doesn't run every time you knit; a sketch follows this list)
- Filter the data down to include only the four sensors above (note that you could do this directly with `rwalkr`)
- Add variables that contain the day of the month, month, year, and day of the year
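A minimal sketch of the download step, assuming rwalkr's `melb_walk_fast()`, which pulls a whole year of counts that we then trim to the window we need:

```{r download-walk-2018, cache = TRUE}
# cache = TRUE stops the download re-running on every knit
walk_2018 <- melb_walk_fast(year = 2018) %>%
  filter(Date >= as.Date("2018-01-01"),
         Date <= as.Date("2018-04-30"))
```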
walk_2018 <- walk_2018 %>%
# Filter the data down to include only the four sensors above
filter(Sensor %in% c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
yday = yday(Date))
Now we can plot the pedestrian count information for January - April in 2018
ggplot(walk_2018,
aes(x = Date_Time,
y = Count)) +
geom_line(size = 0.3) +
facet_grid(Sensor ~ .,
# this code presents the facets in a nice way
labeller = labeller(Sensor = label_wrap_gen(20))) +
# this code makes the x axis a bit nicer to read
scale_x_datetime(date_labels = "%d %b %Y",
date_minor_breaks = "1 month") +
labs(x = "Date Time")
We can see that there are quite different patterns in each of the sensors. Let’s explore this further.
!> Answer: At the four selected locations, this plot may capture patterns such as peak-hour commuting, holiday effects, weekday/weekend differences, and special events such as the Night Noodle Market at Birrarung Marr.
We're primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people's daily schedules, let's plot the counts against time of day for each sensor.
ggplot(walk_2018,
aes(x = Time,
y = Count,
group = Date,
colour = Sensor)) +
geom_line() +
facet_wrap(~Sensor,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2",
name = "Sensor") +
theme(legend.position = "none")
Write a short paragraph that describes what the plot shows:
!> Answer: The graph plots the number of pedestrians against time of day, faceted across the four sensors. Each line represents one date on which pedestrian counts were observed.
The pedestrian pattern at Flinders Street Station Underpass resembles a combination of patterns from Melbourne Central and Southern Cross Station. On the other hand, the pedestrian patterns in Birrarung Marr are unique compared to the other three sensors.
Flinders Street Station Underpass shows much the same trend as Southern Cross Station: two peaks, in the morning and evening, when people commute to work and back home during peak hours. Birrarung Marr shows a general increase in pedestrians over the course of the day, with larger fluctuations during the later half of the day. This trend has some similarities with Melbourne Central, though with much higher fluctuations in pedestrian traffic.
Southern Cross Station shows high spikes in pedestrians during peak hours of the day, suggesting that most pedestrians passing the sensor are commuters. The lines without such spikes likely reflect pedestrian traffic on non-workdays such as weekends. Melbourne Central shows a relatively constant pattern each day, with the number of pedestrians increasing from early morning and peaking around 7 in the evening. This may be a combination of commuters on workdays, and pedestrians who come into the city in the evenings and on weekends for recreation.
Use the data inside `hols_2018` to identify weekdays, weekends, and holidays.
hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")
walk_2018_hols <- walk_2018 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols_2018$date,
true = "Non-workday",
false = "Workday"
))
Now create a plot to compare the workdays to the non-workdays.
ggplot(walk_2018_hols,
aes(x = Time,
y = Count)) +
geom_line(size = 0.3,
alpha = 0.3) +
facet_grid(vars(Sensor), vars(workday),
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2", name = "Sensor") +
theme(legend.position = "none")
Write a short paragraph that describes what the plot shows, and helps us answer these questions:
!> Answer: This graph compares the pedestrian counts at the four sensors between working and non-working days, over the time of day. Each line represents the pedestrian count over one day.
The sensors at Melbourne Central and Flinders Street Station show similar patterns between non-workdays and workdays, while the sensors at Southern Cross and Birrarung Marr show vastly different patterns between the two. Except at Southern Cross Station, pedestrian counts on non-workdays tend to peak in the later half of the day, mostly in the evenings and at night, most likely for recreational activities in the city.
Most of the panels show trends that vary with the time of day, except for Southern Cross Station on non-working days, where traffic is relatively flat across the day. This supports the conclusion that most pedestrians at Southern Cross Station are commuters. Birrarung Marr shows a larger number of pedestrians on non-workdays than on workdays, suggesting most pedestrians pass through Birrarung Marr for recreational activities.
To locate those unusual moments, the Flinders Street Station data is calendarised on the canvas, using the sugrrants package. We can spot the unusual weekday patterns on public holidays using their colour. Using the calendar plot, try to spot another unusual pattern, and do a Google search to try to explain the change in foot traffic. (Hint: look at the plot carefully. Does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)
# filter to just look at flinders st station
flinders <- walk_2018_hols %>% filter(Sensor == "Flinders Street Station Underpass")
flinders_cal <- flinders %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>%
ggplot(aes(x = .Time, y = .Count, colour = workday, group = Date)) +
geom_line()
prettify(gg_cal) +
theme(legend.position = "bottom")
!> Answer: The daily pedestrian count for Flinders Street Station Underpass follows a mostly consistent pattern between workdays and non-workdays, with a few exceptions.
1 January 2018 started with a high pedestrian count at midnight - this is expected, as people are out celebrating New Year's Day. Foot traffic in the week following New Year's Day is lower than on the other workdays, perhaps because people are taking leave or resting over the new year holiday. Daily foot traffic returns to normal at the beginning of the second week of January.
On the evenings of 16 and 17 February 2018, we can see a larger number of pedestrians through Flinders Street Station Underpass. This is possibly due to people out celebrating the Lunar New Year, which fell on 16 February in 2018.
In March, during the Labour Day long weekend, when the Monday was a non-workday, we also see a slight spike in pedestrian counts. This could be due to people travelling into the city for the long weekend.
You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.
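As with 2018, a sketch of the download step, again assuming rwalkr's `melb_walk_fast()`:

```{r download-walk-2020, cache = TRUE}
walk_2020 <- melb_walk_fast(year = 2020) %>%
  filter(Date >= as.Date("2020-01-01"),
         Date <= as.Date("2020-04-30"))
```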
walk_2020 <- walk_2020 %>%
filter(Sensor %in% c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
yday = yday(Date))
Now add the holiday data
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")
walk_2020_hols <- walk_2020 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols_2020$date,
true = "Non-workday",
false = "Workday"
))
melb_walk_hols <- bind_rows(walk_2018_hols, walk_2020_hols)
filter_sensor <- function(data, sensors = c("Southern Cross Station",
                                            "Melbourne Central",
                                            "Flinders Street Station Underpass",
                                            "Birrarung Marr")){
  # keep only the requested sensors (defaults to the four we focus on)
  data %>% filter(Sensor %in% sensors)
}
add_day_info <- function(data){
  # use `mutate` to add four columns: day of the month, month, year, and day of the year, using functions from lubridate.
data %>%
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
yday = yday(Date))
}
add_working_day <- function(data){
walk_years <- unique(data$year)
hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
data %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols$date,
true = "Non-workday",
false = "Workday"
))
}
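The three helpers above are not called in the chunks shown, but as a usage sketch (assuming `walk_2018` and `walk_2020` as downloaded earlier), the combined table printed below could be rebuilt in one pipeline:

```{r use-helpers, eval = FALSE}
melb_walk_hols <- bind_rows(walk_2018, walk_2020) %>%
  filter_sensor(sensors = c("Southern Cross Station",
                            "Melbourne Central",
                            "Flinders Street Station Underpass",
                            "Birrarung Marr")) %>%
  add_day_info() %>%
  add_working_day()
```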
## # A tibble: 23,136 x 11
## Sensor Date_Time Date Time Count mday month year yday
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Melbo~ 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018 1
## 2 Flind~ 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018 1
## 3 Birra~ 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018 1
## 4 South~ 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018 1
## 5 Melbo~ 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018 1
## 6 Flind~ 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018 1
## 7 Birra~ 2018-01-01 01:00:00 2018-01-01 1 1143 1 1 2018 1
## 8 South~ 2018-01-01 01:00:00 2018-01-01 1 436 1 1 2018 1
## 9 Melbo~ 2018-01-01 02:00:00 2018-01-01 2 1721 1 1 2018 1
## 10 Flind~ 2018-01-01 02:00:00 2018-01-01 2 3157 1 1 2018 1
## # ... with 23,126 more rows, and 2 more variables: weekday <ord>, workday <chr>
Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to see the pattern clearly.)
melb_walk_hols_flinders_april <- melb_walk_hols %>%
filter(Sensor == "Flinders Street Station Underpass", month == 4)
ggplot(melb_walk_hols_flinders_april,
aes(x = Time, y = Count,
colour = as.factor(year))) +
geom_line() +
facet_wrap(~ weekday, ncol = 5) +
theme(legend.position = "bottom") +
labs(colour = "Year")
!> Answer: The pedestrian count in 2018 is much higher than in 2020. The graph shows that in 2018 the pattern of pedestrian counts over the time of day varies depending on the day of the week. In contrast, this weekday/weekend difference cannot be easily seen in 2020.
The reason for such a difference between 2018 and 2020 is that social distancing measures and restrictions were put in place due to the global COVID-19 pandemic in April 2020, requiring Australians to stay at home as much as possible to minimise contact with others. Many Australians worked from home, resulting in much lower pedestrian traffic through the Flinders Street Station Underpass, even on workdays. Social gatherings and recreational activities were also discouraged, causing a decline in pedestrians on weekends as well.
What do you learn? Which Sensors seem the most similar? Or the most different?
melb_walk_hols_april <- melb_walk_hols %>% filter(month == 4)
ggplot(melb_walk_hols_april,
aes(x = Time, y = Count,
colour = as.factor(year))) +
geom_line() +
facet_grid(vars(Sensor), vars(weekday)) +
theme(legend.position = "bottom") +
labs(colour = "Year")
!> Answer: The sensors at Flinders Street Station and Melbourne Central show the most similar patterns to each other in 2018, whether on working days or not.
Birrarung Marr and Southern Cross Station seem to have the most distinctive patterns of pedestrian counts in 2018. The pedestrian count at Southern Cross Station appears to be higher on weekdays than on weekends in 2018, suggesting that most pedestrians there are commuters. This pattern also seems to occur in 2020, albeit with much lower numbers.
In 2020, all four sensors show similar patterns, with very low pedestrian counts across all four locations. This mirrors what was previously observed at Flinders Street Station in 2020, and reflects the social distancing measures imposed during the COVID-19 pandemic.
Combining weather data with pedestrian counts
One question we want to answer is: “Does the weather make a difference to the number of people walking out?”
Time of day and day of week are the predominant driving forces of pedestrian numbers, as depicted in the previous plots. Apart from these temporal factors, weather conditions could also affect how many people walk in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or when it is raining hard.
Daily meteorological data, as a separate source available from the National Climatic Data Center, is used and joined to the main pedestrian data table using common dates.
Binary variables are created to serve as the tipping points:

- `prcp` > 5 (if yes, "rain"; if no, "none")
- `tmax` > 33 (if yes, "hot"; if no, "not")
- `tmin` < 6 (if yes, "cold"; if no, "not")

We have pulled information on weather stations for the Melbourne area - can you combine it together into one dataset?

# Now create some flag variables
melb_weather_2018 <- read_csv(
here::here("data-raw/melb_ncdc_2018.csv")
) %>%
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.
Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.
melb_daily_walk_2018 <- melb_walk_hols %>%
filter(year == 2018) %>%
group_by(Sensor, Date) %>%
summarise(Count = sum(Count, na.rm = TRUE)) %>%
ungroup()
melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
left_join(melb_weather_2018, by = c("Date" = "date"))
melb_daily_walk_weather_2018
## # A tibble: 480 x 10
## Sensor Date Count station tmax tmin prcp high_prcp high_temp
## <chr> <date> <int> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Birra~ 2018-01-01 8385 ASN000~ 26.2 14 0 none not
## 2 Birra~ 2018-01-02 9844 ASN000~ 23.6 15.5 0 none not
## 3 Birra~ 2018-01-03 4091 ASN000~ 22.3 11.2 0 none not
## 4 Birra~ 2018-01-04 1386 ASN000~ 25.5 11.5 0 none not
## 5 Birra~ 2018-01-05 1052 ASN000~ 30.5 12.2 0 none not
## 6 Birra~ 2018-01-06 6672 ASN000~ 41.5 16.6 0 none hot
## 7 Birra~ 2018-01-07 1567 ASN000~ 22 15.7 0 none not
## 8 Birra~ 2018-01-08 6201 ASN000~ 23.6 15.9 0 none not
## 9 Birra~ 2018-01-09 8403 ASN000~ 22.8 13.9 0 none not
## 10 Birra~ 2018-01-10 12180 ASN000~ 25.5 12.1 0 none not
## # ... with 470 more rows, and 1 more variable: low_temp <chr>
Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (`high_prcp`, `high_temp`, and `low_temp`). Write a paragraph that tells us what you learn from these plots, and how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots summarised like this - what assumptions do they make?
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_prcp)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Plot 1: Plot of count for each sensor against high rain")
# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_temp)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Plot 2: Plot of count for each sensor against high temperature")
# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = low_temp)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Plot 3: Plot of count for each sensor against low temperature")
!> Answer: The boxplots suggest that different weather conditions do influence pedestrian counts and traffic at the four sensor locations.
Plot 1 shows that at all locations except Birrarung Marr, pedestrians are generally more inclined to go outside when it is not raining, with the median pedestrian count higher on days without heavy rain. Birrarung Marr also has many large outliers on days without rain. These could be due to the combination of good weather and special events held at Birrarung Marr, such as the Night Noodle Market.
Plot 2 shows that hot weather does impact pedestrian counts at our different sensors. At Flinders Street Station, Southern Cross Station and Melbourne Central, pedestrians generally prefer to stay at home when the weather is hot, with higher median counts when it is not hot. At Birrarung Marr, however, there is little difference between hot and not-hot days. This may also reflect special events held at Birrarung Marr, such as the Night Noodle Market, which attract visitors regardless of the temperature.
Plot 3 indicates that, on average, the number of people travelling to Flinders Street Station Underpass, Melbourne Central, and Southern Cross Station is higher when the temperature is cold, while fewer people travel to Birrarung Marr on cold days. However, Plot 3 shows almost no variation in foot traffic across the four sensors on cold days, because there was only one cold-day observation per sensor in the sample selected for this analysis. Plot 3 therefore does not provide much insight into the relationship between pedestrian counts and low temperature.
The visualisations tell us something interesting about the data, but to really understand it, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020; combine it with the pedestrian data for 2018 and 2020.
melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv"))
melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv"))
# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
melb_weather_2020) %>%
# remember to add info about high precipitation, high temperature, + low temps
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>%
left_join(melb_weather, by = c("Date" = "date"))
We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.
We are going to process the data somewhat so we can fit a linear model to it. First, let's set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.
We also add one to the count and then take the natural log of it. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model for this data, and we can help it along by taking the log of the counts, which helps stabilise the residuals (observed - predicted) when we fit the model.
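As a quick illustration of why we add one before taking the log (a zero count would otherwise map to negative infinity), and of how `expm1()` later reverses the transformation:

```{r log1p-demo}
log(0)            # -Inf: a plain log breaks on zero counts
log1p(0)          # 0: log(1 + 0) is well defined
expm1(log1p(10))  # 10: expm1() undoes log1p(), as used later in the analysis
```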
melb_walk_weather_prep_lm <- melb_walk_weather %>%
mutate_at(.vars = vars(Sensor,
Time,
month,
year,
workday,
high_prcp,
high_temp,
low_temp),
as_factor) %>%
mutate(log_count = log1p(Count))
Now we fit a linear model, predicting `log_count` using Time, month, weekday, and the weather flag variables (`high_prcp`, `high_temp`, and `low_temp`):
walk_fit_lm <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Provide some summary statistics on how well this model fits the data. What do you see? What statistics tell us about our model fit?
glance(walk_fit_lm)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.524 0.524 1.21 669. 0 36 -34259. 68592. 68887.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
!> Answer: The adjusted R squared tells us how well the model performs while accounting for the number of predictors. The value of R squared never decreases when new independent variables are added to a model; adjusted R squared corrects for this effect (Tierney & Lee, 2020). The adjusted R squared for this model is 52.37% - this means that about 52.37% of the variation in the data is explained by the model.

The R squared value shows us how well the model fits the data that we have, suggesting that approximately 52.45% of the variance in the data is explained by the model. Both the R squared and adjusted R squared are relatively low, which implies that this model may need to be improved to better represent the data.
We have had one look at the model fit statistics, but let's now look at the fitted and observed (`log_count`) values for each sensor:
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
ggplot(peds_aug_lm,
aes(x = log_count,
y = .fitted)) +
geom_point() +
facet_wrap(~Sensor)
There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the data. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components, which we can do with a calendar plot.
To get the data into the right format for analysis, we need to pivot it into a longer format, so that we have columns of Date, Time, Count, model, and log_count.
flinders_lm <- peds_aug_lm %>%
# filter the data to only look at flinder st
filter(Sensor == "Flinders Street Station Underpass") %>%
# Select the Date, Time, Count, .fitted, and log_count
select("Date", "Time", "Count", ".fitted", "log_count") %>%
# Now pivot the log count and fitted columns into a column called "model"
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(cols = .fitted:log_count,
             names_to = "model", values_to = "log_count") %>%
# Now we're going to undo the initial data transformation
mutate(Count = expm1(log_count))
flinders_lm_cal <- flinders_lm %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
flinders_lm_cal2018 <- flinders_lm %>%
filter(year(Date) == "2018") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal2018 <- ggplot(flinders_lm_cal2018) +
geom_line(data = filter(flinders_lm_cal2018, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal2018, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal2018) +
theme(legend.position = "bottom") +
ggtitle("Flinders Street Station Underpass 2018")
gg_cal <- ggplot(flinders_lm_cal) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) +
theme(legend.position = "bottom") +
ggtitle("Flinders Street Station Underpass 2020")
melbcen_lm <- peds_aug_lm %>%
filter(Sensor == "Melbourne Central") %>%
select("Date", "Time", "Count", ".fitted", "log_count") %>%
pivot_longer(cols = .fitted: log_count,
names_to = "model", values_to = "log_count") %>%
mutate(Count = expm1(log_count))
melbcen_lm_cal2018 <- melbcen_lm %>%
filter(year(Date) == "2018") %>%
frame_calendar(x = Time, y = Count, date = Date)
melb_cal2018 <- ggplot(melbcen_lm_cal2018) +
geom_line(data = filter(melbcen_lm_cal2018, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(melbcen_lm_cal2018, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(melb_cal2018) +
theme(legend.position = "bottom") +
ggtitle("Melbourne Central 2018")
melbcen_lm_cal2020 <- melbcen_lm %>%
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
melb_cal2020 <- ggplot(melbcen_lm_cal2020) +
geom_line(data = filter(melbcen_lm_cal2020, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(melbcen_lm_cal2020, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(melb_cal2020) +
theme(legend.position = "bottom") +
ggtitle("Melbourne Central 2020")
Write a paragraph answering these questions:
!> Answer: Across all four calendar plots, in January, February, and March the model under-predicts pedestrian counts at Flinders Street Station Underpass and Melbourne Central in both 2018 and 2020. However, the model overestimates the foot traffic that occurred in April 2020.
Except for the end of March and April 2020, the model consistently under-predicts the pedestrian counts for Flinders Street Station Underpass. From the end of March 2020, model accuracy seems higher, as the difference between the fitted and actual values narrows. As a whole, the difference between the predicted and observed pedestrian counts is relatively large, which shows that the model's predictions are too conservative and need to be improved.
At Melbourne Central, the fitted values under-predict the pedestrian counts in both 2018 and 2020. This is more so in 2018, where the model greatly under-predicts pedestrian counts on both weekdays and weekends, sometimes at almost half the actual counts. In 2020, in light of the global pandemic, pedestrian counts dropped from mid-March, which led our model to over-predict by April.
In general, the fluctuations in pedestrian counts at Flinders Street Station Underpass during weekdays were much greater than on weekends. At Melbourne Central, however, there seems to be no significant difference in pedestrian traffic between weekdays and weekends.
As noted above, trends and patterns of pedestrian counts differ across years (2018 and 2020) and sensor locations (here, Flinders Street Station Underpass and Melbourne Central). It would therefore be reasonable to add the year and Sensor variables to the model, as they will likely allow the model to explain more of the variation in the dataset.
What sort of improvements do you think you could make to the model? Fit two more models, trying additional variables in the linear model.
walk_fit_lm_sensor <- lm(
formula = log_count ~ Time + month + weekday + workday + Sensor + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
walk_fit_lm_year <- lm(
formula = log_count ~ Time + month + weekday + year + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Why did you add those variables?
!> Answer: Sensor: People use different stations for different purposes (e.g. work, recreation, personal errands, etc.). For instance, people are likely to go to Birrarung Marr for recreation, and Southern Cross is mainly used by Australians who commute to work via train from outside the city.
Year: Significant events may happen during different years. For example, in 2020, the COVID-19 outbreak meant that social distancing laws were put in place, and many people were forced to work from home, resulting in a drastic decrease in pedestrians throughout the country.
Workday: From our previous analysis, certain sensors such as Flinders Street Station Underpass and Southern Cross Station may record larger pedestrian counts during weekdays and peak hour. This suggests that workday may play a significant role in influencing pedestrian counts.
bind_rows(
first = glance(walk_fit_lm),
sensor = glance(walk_fit_lm_sensor),
year = glance(walk_fit_lm_year),
.id = "type"
)
## # A tibble: 3 x 12
## type r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 first 0.524 0.524 1.21 669. 0 36 -34259. 68592.
## 2 sens~ 0.690 0.689 0.980 1210. 0 40 -29716. 59514.
## 3 year 0.554 0.554 1.18 733. 0 37 -33572. 67219.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Which model does the best? Why do you think that is? What statistics are you basing this decision off?
!> Answer: Comparing the R squared and adjusted R squared values of the three models, the model with the sensor variable (the Sensor model) has the highest R squared and adjusted R squared. This means the Sensor model explains the highest percentage of the variation in the data: approximately 68.99% by R squared and 68.93% by adjusted R squared.

Comparing the AIC, BIC, and deviance values of all three models, the Sensor model has the lowest values on all three measures. This implies that the Sensor model has the best fit to our data (Tierney & Lee, 2020).

The Sensor model fits best because each sensor location serves a different travel purpose. For instance, the main reason people go to Southern Cross Station is to commute to work or study. In contrast, it is highly unlikely that people travel to Birrarung Marr for work or study; they are more likely to travel there for recreation.
(Suggestion - Perhaps write this as a function to speed up comparison)
peds_aug_lm_sensor <- augment(walk_fit_lm_sensor, data = melb_walk_weather_prep_lm)
pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
lm_fit %>%
filter(Sensor == sensor) %>%
select("Date", "Time", "Count", ".fitted", "log_count") %>%
pivot_longer(cols = .fitted:log_count,
names_to = "model",
values_to = "log_count") %>%
mutate(Count = expm1(log_count))
}
calendar_fit_obs <- function(lm_fit_aug){
data_cal <- lm_fit_aug %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <-
ggplot(data_cal) +
geom_line(data = filter(data_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(data_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) + theme(legend.position = "bottom")
}
pivot_sensor(peds_aug_lm_sensor) %>%
filter(year(Date) == "2020") %>%
calendar_fit_obs()
What do you see? How does it compare to the previous model?
!> Answer: Compared to the previous model, where the workday and Sensor variables were not included, the new model fits the actual data much more closely. However, it still overestimates the pedestrian counts and their fluctuations on non-workdays. The fluctuation within each workday is now better captured, until the last week of March, when the model starts to over-predict pedestrian counts. This suggests either that more adjustments are needed for a better fit, or that the global pandemic is an unprecedented event that this model cannot predict.
Compare the fitted values against the residuals; perhaps write a function to help you do this in a more readable way.
# augment each fitted model (named walk_aug_* so the model objects are not overwritten)
walk_aug_lm_first <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
walk_aug_lm_year <- augment(walk_fit_lm_year, data = melb_walk_weather_prep_lm)
walk_aug_lm_sensor <- augment(walk_fit_lm_sensor, data = melb_walk_weather_prep_lm)
plot_fit_resid <- function(data){
ggplot(data,
aes(x = .fitted,
y = .resid)) +
geom_point() +
facet_wrap(~Sensor)
}
plot_fit_resid(walk_aug_lm_first)
plot_fit_resid(walk_aug_lm_year)
plot_fit_resid(walk_aug_lm_sensor)
We’ve looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?
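One way to make this comparison for the Sensor model is sketched below, plotting observed log counts against fitted values; points near the dashed 1:1 line are well predicted:

```{r obs-vs-fitted-sensor}
ggplot(walk_aug_lm_sensor,
       aes(x = log_count,
           y = .fitted)) +
  geom_point(alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~Sensor)
```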
!> Answer: Looking at the residual plots of the three models, the Sensor model produces the most accurate predictions of foot traffic. The residuals from the Sensor model are the least dispersed of the three, and for each sensor they mostly fluctuate around 0, indicating better accuracy than the other models.

Although the Sensor model is the best of those fitted, the spread of its residuals is still relatively large, especially for Birrarung Marr, which shows multiple outliers with large residual values. Our model may therefore not be well suited to data like that collected at Birrarung Marr, where special events attract an influx of visitors.
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
David Robinson and Alex Hayes (2020). broom: Convert Statistical Analysis Objects into Tidy Tibbles. R package version 0.5.6. https://CRAN.R-project.org/package=broom
Diethelm Wuertz, Tobias Setz, Yohan Chalabi, Martin Maechler and Joe W. Byers (2018). timeDate: Rmetrics - Chronological and Calendar Objects. R package version 3043.102. https://CRAN.R-project.org/package=timeDate
Earo Wang (2020). rwalkr: API to Melbourne Pedestrian Data. R package version 0.5.2. https://CRAN.R-project.org/package=rwalkr
Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/.
Hadley Wickham, Jim Hester and Romain Francois (2018). readr: Read Rectangular Text Data. R package version 1.3.1. https://CRAN.R-project.org/package=readr
Kirill Müller (2017). here: A Simpler Way to Find Your Files. R package version 0.1. https://CRAN.R-project.org/package=here
Tierney, N., & Lee, S. (2020, April). ETC5510: Introduction to Data Analysis. Retrieved from Monash University: https://mida.numbat.space/slides/lecture_7b.pdf
Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support exploration and modeling of temporal data. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2019.1695624.
Wang, E, D Cook, and RJ Hyndman (2020). Calendar-based graphics for visualizing people’s daily schedules. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2020.1715226.
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
The code below is what was used to retrieve the data in the `data-raw` folder.
# melb_bbox <- c(min(ped_loc$longitude) - .001,
# min(ped_loc$latitude) - 0.001,
# max(ped_loc$longitude) + .001,
# max(ped_loc$latitude) + 0.001)
#
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
# path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
# compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
# file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
# col_names = c("ID",
# "lat",
# "lon",
# "elev",
# "state",
# "name",
# "v1",
# "v2",
# "v3"),
# skip = 353,
# n_max = 17081
# ) %>%
# filter(state == "MELBOURNE AIRPORT")
# #
# get_ncdc <- function(year){
# vroom::vroom(
# glue::glue(
# "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
# ),
# col_names = FALSE,
# col_select = 1:4
# )
# }
#
# clean_ncdc <- function(x){
# x %>%
# filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
# rename_all(~ c("station", "date", "variable", "value")) %>%
# mutate(date = ymd(date), value = value / 10) %>%
# pivot_wider(names_from = variable, values_from = value) %>%
# rename_all(tolower)
# }
# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
#
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))