This assignment is designed to simulate a scenario where you are given a dataset, take over someone’s existing work, and continue with it to draw further insights. In this respect it is similar to Assignment 1, but it provides less scaffolding, asks you to draw more insights, and requires more communication.
Your previous submission of crime data was well received!
You’ve now been given a different next task to work on. Your colleague at your consulting firm, Amelia (in the text treatment below) has written some helpful hints throughout the assignment to help guide you.
Questions that are worth marks are indicated with “Q” at the start and end of the question, as well as the number of marks in parentheses. For example,
## Q1A some text (0.5 marks)
is question one, part A, worth 0.5 marks.
This assignment will be worth 10% of your total grade, and is marked out of 58 marks total.
- 9 marks for presentation of the data visualisations
Your marks will be weighted according to peer evaluation.
As of week 6, you have seen most of the code for parts 1 - 2 that needs to be used here, and Week 7 will give you the skills to complete part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering skills on modelling in the next weeks, but this assignment is designed to simulate a real life work situation - this means that there are some things where you need to “learn on the job”. But the vast majority of the assignment will cover things that you will have seen in class, or the readings.
Remember, you can look up the help file for functions by typing `?function_name`. For example, `?mean`. Feel free to google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with `***` or `___`. At a minimum, your assignment should be able to be “knitted” using the knit button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to `eval = FALSE` like so:
```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```
If you do this, please remember to ensure that you remove this chunk option or set it to `eval = TRUE` when you submit the assignment, to ensure all your R code runs.
You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:
Your assignments will be peer reviewed, and results checked for reproducibility. This means:
Each student will be randomly assigned another team’s submission to provide feedback on three things:
This assignment is due by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team’s name. For example, if you are team `dplyr`, your assignment file name could read: “assignment-2-2020-s1-team-dplyr.Rmd”.
You work as a data scientist at the aptly named consulting company, “Consulting for You”.
On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:
Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner city Melbourne! Amelia is still in New Zealand, and won’t be back for a while. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.
We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.
Most importantly, can you get this to me by Wednesday 20th May, COB? (COB = close of business, at 6pm.)
I’ve given this dataset to some of the other new hire data scientists as well, you’ll all be working as a team on this dataset. I’d like you to all try and work on the questions separately, and then combine your answers together to provide the best results.
From here, you are handed a USB stick. You load this into your computer, and you see a folder called “melbourne-walk”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.
Note that the text here was originally written by Amelia; make sure that their name is kept up top, and pay attention to what they have to say in the document!
The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and is available for download from the Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.
There are three parts to this work:
Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to create different shapes for the following sensors:
First we download the data on the pedestrian sensor locations around Melbourne.
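The download chunk itself isn’t shown here. Below is a minimal sketch, assuming the `pull_sensor()` helper from rwalkr returns one row per sensor with `longitude`, `latitude`, and `installation_date` columns (the function name and column names are assumptions based on the rest of the document, and may differ between rwalkr versions).

```{r download-sensor-locations, cache = TRUE}
library(rwalkr)

# Assumed helper: pulls the pedestrian sensor locations from the
# Melbourne Open Data Portal, one row per sensor
ped_loc <- pull_sensor()
```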
And now we draw a plot of the pedestrian sensor locations on top of a map tile.
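The plotting chunk was not included either, so here is a hedged sketch of one way to draw the figure described in the answer below. It assumes the cached map tile in `data-raw/melb-map.rds` (written by the appendix code), a sensor-name column called `sensor` in `ped_loc`, and the four sensors of interest; all of these names are assumptions rather than confirmed parts of Amelia’s code.

```{r plot-sensor-map, eval = FALSE}
library(ggmap)
library(readr)
library(dplyr)

melb_map <- read_rds(here::here("data-raw/melb-map.rds"))

selected_sensors <- c("Southern Cross Station",
                      "Melbourne Central",
                      "Flinders Street Station Underpass",
                      "Birrarung Marr")

ped_loc_flag <- ped_loc %>%
  # NOTE: the sensor-name column is assumed to be called `sensor`
  mutate(selected = if_else(sensor %in% selected_sensors,
                            sensor,
                            "Other sensors"))

ggmap(melb_map) +
  # unselected sensors as blue circles
  geom_point(data = filter(ped_loc_flag, selected == "Other sensors"),
             aes(x = longitude, y = latitude),
             colour = "blue", shape = 16, alpha = 0.6) +
  # selected sensors as coloured triangles
  geom_point(data = filter(ped_loc_flag, selected != "Other sensors"),
             aes(x = longitude, y = latitude, colour = selected),
             shape = 17, size = 3) +
  labs(colour = "Selected sensor")
```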
Answer: The map shows a chunk of Melbourne with all active sensors in the area. Blue circles show unselected sensors, while triangles (coloured to distinguish between them) represent the selected sensors.
ped_loc %>%
# calculate the year from the date information
mutate(year = lubridate::year(installation_date)) %>%
# count up the number of sensors
count(year) %>%
# then use `kable()` to create a markdown table
  kable(caption = "number of sensors installed each year") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
year | n |
---|---|
2009 | 13 |
2013 | 12 |
2014 | 2 |
2015 | 7 |
2016 | 1 |
2017 | 9 |
2018 | 5 |
2019 | 6 |
2020 | 4 |
Additionally, how many sensors were added in 2016, and in 2017?
Answer: 1 and 9 sensors were added in 2016 and 2017 respectively.
We would like you to focus on the foot traffic at 4 sensors:
Your task is to:
- Extract the data for the year 2018 from Jan 1 to April 30 using `rwalkr` (you might want to cache this so it doesn’t run every time you knit); a sketch is given below.
- Add variables that contain the day of the month, month, year, and day of the year.
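A minimal sketch of the extraction step, assuming the `melb_walk_fast()` helper from rwalkr (the exact function and arguments may differ between rwalkr versions). The chunk is cached so the download does not re-run on every knit:

```{r pull-walk-2018, cache = TRUE}
library(rwalkr)
library(dplyr)
library(lubridate)

# Pull the 2018 hourly counts and keep 1 January to 30 April
walk_2018 <- melb_walk_fast(year = 2018) %>%
  filter(Date >= ymd("2018-01-01"),
         Date <= ymd("2018-04-30"))
```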
walk_2018 <- walk_2018 %>%
# Filter the data down to include only the four sensors above
filter(Sensor %in% c("Southern Cross Station", "Melbourne Central", "Flinders Street Station Underpass", "Birrarung Marr")) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
mutate(mday = mday(Date), # month day
month = month(Date, label = TRUE, abbr = TRUE),# month
year = year(Date), # year
yday = yday(Date)) # day of the year
Now we can plot the pedestrian count information for January to April 2018.
ggplot(walk_2018,
aes(x = Date_Time,
y = Count)) +
geom_line(size = 0.3) +
facet_grid(Sensor ~ .,
# this code presents the facets in a nice way
labeller = labeller(Sensor = label_wrap_gen(20))) +
  # this code makes the x axis a bit nicer to read
scale_x_datetime(date_labels = "%d %b %Y",
date_minor_breaks = "1 month") +
labs(x = "Date Time")
We can see that there are quite different patterns in each of the sensors. Let’s explore this further.
Answer: Three kinds of activity can be seen in the graph: weekday activity, weekend activity, and time-of-day activity (busy versus non-busy hours).
At Southern Cross and Flinders Street Station, pedestrian counts on working and non-working days are also clearly distinguishable.
The highest average pedestrian counts are at Flinders Street, which suggests that most people commute through Flinders Street Station.
We’re primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people’s daily schedules, let’s plot the counts against time of day for each sensor.
ggplot(walk_2018,
       aes(x = Time,
           y = Count,
           group = Date,
           colour = Sensor)) +
  # alpha belongs in the layer, not in ggplot(), so that it actually takes effect
  geom_line(alpha = 0.6) +
facet_wrap(~ Sensor,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2",
name = "Sensor") +
theme(legend.position = "none")
Write a short paragraph that describes what the plot shows:
What is plotted on the graph? What does each line represent?
Is the data in these sensors similar, or different?
What do you learn?

Answer: The graphs show hourly pedestrian counts, where each line represents a date ranging from 1 Jan 2018 to 30 Apr 2018.
Southern Cross Station and Flinders Street have similar pedestrian count patterns, with two peaks (at roughly 0800 and 1700) indicating rush hours. These peaks are absent at the other two sensors.
Melbourne Central has a pattern similar to Flinders Street and Southern Cross Station, but without the sharp peaks.
Birrarung Marr has its own distinct pattern: it has no rush hour and its traffic seems irregular, perhaps because it is more like a park than a place people commute through.
In the Flinders Street panel there appear to be two crowds: one of working commuters and one of general traffic. The general-traffic trend is mirrored by Melbourne Central.
Notably, the graphs also show unusual patterns at some locations on some days. In particular, a few lines show high pedestrian counts starting from midnight.
Use the hols_2018 data to identify weekdays, weekends, and holidays.
hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")
walk_2018_hols <- walk_2018 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols_2018$date,
true = "no", # not a workday
false = "yes" # workday
))
Now create a plot to compare the workdays to the non-workdays.
ggplot(walk_2018_hols,
aes(x = Time,
y = Count,
group = Date,
colour = Sensor)) +
geom_line(size = 0.3, alpha = 0.3) +
facet_grid(workday ~ Sensor,
             labeller = labeller(Sensor = label_wrap_gen(20))) +
  scale_colour_brewer(palette = "Dark2", name = "Sensor") +
  theme(legend.position = "none")
Write a short paragraph that describes what the plot shows, and helps us answer these questions:
Answer: The graphs show hourly pedestrian counts at the selected sensors. Each line represents a date ranging from 01-Jan-2018 to 30-Apr-2018, split by workday/non-workday.
Generally, working days are busier, with the exception of Birrarung Marr. On working days, Flinders Street and Southern Cross Station show a similar trend, with two high peaks during the rush hours (0800 and 1700) and one shorter peak in the afternoon. Melbourne Central and Flinders Street show similar patterns on non-working days, while Southern Cross Station is generally quiet on non-working days. Birrarung Marr is generally busier on non-working days, but has rather irregular patterns with no obvious similarity to the other sensors.
Each panel seems to follow a fairly consistent trend with few outliers, except for Birrarung Marr. Notably, Melbourne Central shows much the same trend on both working and non-working days.
We learn that pedestrian counts are determined largely by working versus non-working days. These graphs capture two crowds: working commuters and the general public.
To locate those unusual moments, the Flinders Street Station data is laid out on a calendar canvas using the sugrrants package. We can spot the unusual weekday patterns on public holidays by their colour. Using the calendar plot, try to spot another unusual pattern, and do a google search to explain the change in foot traffic. (Hint: look at the plot carefully. Does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)
# filter to just look at flinders st station
flinders <- walk_2018_hols %>%
filter(Sensor == "Flinders Street Station Underpass")
flinders_cal <- flinders %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>%
ggplot(aes(x = .Time, y = .Count, colour = workday, group = Date)) +
geom_line()
prettify(gg_cal) +
theme(legend.position = "bottom")
Answer: 17 Feb 2018 is the only day on the calendar that shows an increase in pedestrian counts towards midnight. This was the date of the White Night festival, an overnight event, which explains the pattern. Roads in the CBD were also closed during this event, with limited access to parking, so pedestrian counts on the following day were significantly higher than on other days as participants made their way home.
You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.
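The chunk that pulls the 2020 counts is not shown above. Here is a minimal sketch mirroring the 2018 pull, assuming the same `melb_walk_fast()` helper and the same January to April window (both are assumptions):

```{r pull-walk-2020, cache = TRUE}
# Pull the 2020 hourly counts and keep 1 January to 30 April
walk_2020 <- melb_walk_fast(year = 2020) %>%
  filter(Date >= ymd("2020-01-01"),
         Date <= ymd("2020-04-30"))
```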
walk_2020 <- walk_2020 %>%
filter(Sensor %in% c("Southern Cross Station", "Melbourne Central", "Flinders Street Station Underpass", "Birrarung Marr")) %>%
  # now add four columns using `mutate`, containing the day of the month, month, year, and day of the year, using functions from lubridate.
mutate(mday = mday(Date), # month day
month = month(Date, label = TRUE, abbr = TRUE),# month
year = year(Date), # year
yday = yday(Date)) # day of the year
Now add the holiday data
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")
walk_2020_hols <- walk_2020 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols_2020$date,
true = "no", # not a workday
false = "yes" # workday
))
melb_walk_hols <- bind_rows(walk_2018_hols, walk_2020_hols)
filter_sensor <- function(data){
data %>% filter(Sensor %in% c("Southern Cross Station", "Melbourne Central", "Flinders Street Station Underpass", "Birrarung Marr"))
}
add_day_info <- function(data){
  # now add four columns using `mutate`, containing the day of the month, month, year, and day of the year, using functions from lubridate.
data %>%
mutate(mday = mday(Date), # month day
month = month(Date, label = TRUE, abbr = TRUE),# month
year = year(Date), # year
yday = yday(Date)) # day of the year
}
add_working_day <- function(data){
walk_years <- unique(data$year)
hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
data %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols$date,
true = "no", # not a workday
false = "yes" # workday
))
}
# Step one, combine the walking data
bind_rows(walk_2018, walk_2020) %>%
# Step two, filter the sensors
filter_sensor() %>%
# step three, add the info on day of the year, month, etc
add_day_info() %>%
# step four, add info on working days.
add_working_day()
## # A tibble: 23,136 x 11
## Sensor Date_Time Date Time Count mday month year yday
## <chr> <dttm> <date> <int> <int> <int> <ord> <dbl> <dbl>
## 1 Melbo… 2018-01-01 00:00:00 2018-01-01 0 2996 1 Jan 2018 1
## 2 Flind… 2018-01-01 00:00:00 2018-01-01 0 3443 1 Jan 2018 1
## 3 Birra… 2018-01-01 00:00:00 2018-01-01 0 1828 1 Jan 2018 1
## 4 South… 2018-01-01 00:00:00 2018-01-01 0 1411 1 Jan 2018 1
## 5 Melbo… 2018-01-01 01:00:00 2018-01-01 1 3481 1 Jan 2018 1
## 6 Flind… 2018-01-01 01:00:00 2018-01-01 1 3579 1 Jan 2018 1
## 7 Birra… 2018-01-01 01:00:00 2018-01-01 1 1143 1 Jan 2018 1
## 8 South… 2018-01-01 01:00:00 2018-01-01 1 436 1 Jan 2018 1
## 9 Melbo… 2018-01-01 02:00:00 2018-01-01 2 1721 1 Jan 2018 1
## 10 Flind… 2018-01-01 02:00:00 2018-01-01 2 3157 1 Jan 2018 1
## # … with 23,126 more rows, and 2 more variables: weekday <ord>, workday <chr>
We used pipes so that it is easy to discern the functions applied to the data; pipes emphasise the verbs rather than the nouns, which makes the steps easier to follow. Each function is written for a specific job, such as adding day information or flagging working days. Another way we could have reduced repetition was to string all the function calls together by nesting them, but that makes the code much harder to read and reproduce.
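For comparison, a sketch of the nested alternative mentioned above, using the same helper functions defined earlier; it produces the same result, but the order of the steps has to be read inside-out:

```{r nested-alternative, eval = FALSE}
# Same pipeline as above, written as nested calls rather than pipes
add_working_day(add_day_info(filter_sensor(bind_rows(walk_2018, walk_2020))))
```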
Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to clearly see the pattern.)
melb_walk_hols_flinders_april <- melb_walk_hols %>%
filter(Sensor == "Flinders Street Station Underpass",
month == "Apr")
ggplot(melb_walk_hols_flinders_april,
aes(x = Time,
y = Count,
group = Date,
colour = as.factor(year))) +
geom_line() +
facet_wrap(~weekday, ncol = 5) +
theme(legend.position = "bottom") +
labs(colour = "Year")
Answer: To start, 2020’s pedestrian counts are significantly lower than 2018’s. However, after adjusting the figure’s height and width, we can make out that 2020 displays patterns similar to 2018.
On weekdays we can identify two peaks in 2020, just as there were in 2018; this may be explained by essential workers still being required to travel to work despite quarantine measures.
The weekends in both years also exhibit the same trend, with a number of peaks in the late afternoon. Perhaps people still preferred to head out in the afternoon.
What do you learn? Which Sensors seem the most similar? Or the most different?
melb_walk_hols_april <- melb_walk_hols %>%
filter(month == "Apr")
ggplot(melb_walk_hols_april,
aes(x = Time,
y = Count,
group = Date,
colour = as.factor(year))) +
geom_line() +
facet_grid(vars(Sensor),
vars(weekday)) +
theme(legend.position = "bottom") +
labs(colour = "Year")
Answer: The most similar sensors are Southern Cross and Flinders Street on weekdays; Melbourne Central also resembles Flinders Street on weekends, and is fairly consistent across the whole week.
Again, Birrarung Marr shows a rather irregular pattern with no clear similarity to the other sensors. Southern Cross Station is generally quiet on weekends.
Across all sensors, pedestrian counts in 2020 were consistently low.
Combining weather data with pedestrian counts
One question we want to answer is: “Does the weather make a difference to the number of people walking out?”
Time of day and day of week are the predominant drivers of the number of pedestrians, as depicted in the previous plots. Apart from these temporal factors, weather conditions could also affect how many people walk in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or when it is raining hard.
Daily meteorological data as a separate source, available on National Climatic Data Center, is used and joined to the main pedestrian data table using common dates.
Binary variables are created to serve as the tipping points.
We have pulled information on weather stations for the Melbourne area - can you combine it into one dataset?
- `prcp` > 5 (if yes, "rain", if no, "none")
- `tmax` > 33 (if yes, "hot", if no, "not")
- `tmin` < 6 (if yes, "cold", if no, "not")

# Now create some flag variables
melb_weather_2018 <- read_csv(
here::here("data-raw/melb_ncdc_2018.csv")
) %>%
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.
Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.
melb_daily_walk_2018 <- melb_walk_hols %>%
filter(year == "2018") %>%
group_by(Sensor, Date) %>%
summarise(daily_count = sum(Count, na.rm = TRUE)) %>%
ungroup()
melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
left_join(melb_weather_2018, by = c("Date" = "date"))
melb_daily_walk_weather_2018
## # A tibble: 480 x 10
## Sensor Date daily_count station tmax tmin prcp high_prcp high_temp
## <chr> <date> <int> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Birra… 2018-01-01 8385 ASN000… 26.2 14 0 none not
## 2 Birra… 2018-01-02 9844 ASN000… 23.6 15.5 0 none not
## 3 Birra… 2018-01-03 4091 ASN000… 22.3 11.2 0 none not
## 4 Birra… 2018-01-04 1386 ASN000… 25.5 11.5 0 none not
## 5 Birra… 2018-01-05 1052 ASN000… 30.5 12.2 0 none not
## 6 Birra… 2018-01-06 6672 ASN000… 41.5 16.6 0 none hot
## 7 Birra… 2018-01-07 1567 ASN000… 22 15.7 0 none not
## 8 Birra… 2018-01-08 6201 ASN000… 23.6 15.9 0 none not
## 9 Birra… 2018-01-09 8403 ASN000… 22.8 13.9 0 none not
## 10 Birra… 2018-01-10 12180 ASN000… 25.5 12.1 0 none not
## # … with 470 more rows, and 1 more variable: low_temp <chr>
Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (`high_prcp`, `high_temp`, and `low_temp`). Write a paragraph that tells us what you learn from these plots and how you think weather might be impacting whether people go outside. Make sure to discuss the strengths and limitations of plots summarised like this: what assumptions do they make?
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
aes(y = daily_count,
x = Sensor,
colour = high_prcp)) +
geom_boxplot() +
theme(legend.position = "bottom")
# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = daily_count,
x = Sensor,
colour = high_temp)) +
geom_boxplot() +
theme(legend.position = "bottom")
# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = daily_count,
x = Sensor,
colour = low_temp)) +
geom_boxplot() +
theme(legend.position = "bottom")
Answer: There was only one day of cold weather in the period (18-April-2018), so it is not really possible to determine whether cold weather keeps people from going out.
Overall, whether weather deters people from going out appears to be very location-specific.
People seem to be undeterred by weather at Birrarung Marr, given that the typical daily count there is roughly the same regardless of weather.
On the other hand, rain and hot weather seem to lower the number of people going out at the other three locations (Flinders Street, Melbourne Central and Southern Cross Station).
Although these plots make it easy to visualise the spread, drawing inferences from them assumes that, regardless of weather, each day can be treated the same. This may not be true, since different types of days (workdays, public holidays, weekends) capture different demographics, as discussed earlier.
Furthermore, we need to take into account that weather can affect different times of the day; for example, rain can fall at night, when pedestrian counts are generally lower anyway.
The visualisations tell us something interesting about the data, but to really understand the data, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020; combine it with the pedestrian data for 2018 and 2020.
melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv"))
melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv"))
# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
melb_weather_2020) %>%
# remember to add info about high precipitation, high temperature, + low temps
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>%
left_join(melb_weather, by = c("Date" = "date"))
# note: weather data not available from 12-04-20 onwards
We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.
We are going to process the data somewhat so we can fit a linear model to it. First, let’s set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.
We also add one to Count and then take the natural log of it. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model for this data, and we can help it be more ideal by taking the log of the counts, which helps stabilise the residuals (observed minus predicted values) when we fit the model.
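As a small illustration of that transformation (base R only, nothing assignment-specific): `log1p(x)` computes log(1 + x), so zero counts stay finite, and `expm1()` undoes it later when we want to go back to the count scale.

```{r log1p-demo}
log1p(0)           # 0, whereas log(0) would be -Inf
log1p(100)         # log(101)
expm1(log1p(100))  # back to 100
```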
melb_walk_weather_prep_lm <- melb_walk_weather %>%
mutate_at(.vars = vars(Sensor,
Time,
month,
year,
workday,
high_prcp,
high_temp,
low_temp),
as_factor) %>%
mutate(log_count = log1p(Count))
Now we fit a linear model, predicting log_count using Time, month, weekday, and the weather flag variables (high_prcp, high_temp, and low_temp).
walk_fit_lm <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Provide some summary statistics on how well this model fits the data. What do you see? What statistics tell us about our model fit?
glance(walk_fit_lm)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.524 0.524 1.21 669. 0 36 -34259. 68592. 68887.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
Answer: The adjusted R-squared indicates the model fit for a linear model with multiple variables. R-squared increases each time a variable is added, while the adjusted R-squared penalises that increase.
We see that approximately 52.4% of the variability in log count can be explained by the variables listed above. This is a reasonable fit, but it can be improved, which in turn will reduce the residuals of the model.
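If you want to see where that number comes from, the adjusted R-squared can be reproduced by hand from the ordinary R-squared and the residual degrees of freedom. The formula is standard; treat this as a sanity check rather than part of Amelia’s analysis:

```{r adj-r2-by-hand}
# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / df_residual
n  <- nobs(walk_fit_lm)
r2 <- summary(walk_fit_lm)$r.squared
1 - (1 - r2) * (n - 1) / df.residual(walk_fit_lm)  # should match glance()'s adj.r.squared
```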
We have had one look at the model fit statistics, but let’s now look at the fitted and observed (log_count) values, for each sensor:
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
ggplot(peds_aug_lm,
aes(x = .fitted,
y = log_count)) +
geom_point() +
facet_wrap(~Sensor)
There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the data. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components. We can do this directly with a calendar plot.
Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, Model, and log_count.
flinders_lm <- peds_aug_lm %>%
# filter the data to only look at flinder st
filter(Sensor == "Flinders Street Station Underpass") %>%
# Select the Date, Time, Count, .fitted, and log_count
select(Date, Time, Count, .fitted, log_count) %>%
# Now pivot the log count and fitted columns into a column called "model"
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(cols = .fitted:log_count,
names_to = "model",
values_to = "log_count") %>%
# Now we're going to undo the intitial data transformation
mutate(Count = expm1(log_count))
flinders_lm_cal <- flinders_lm %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal_2020 <- ggplot(flinders_lm_cal) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify_gg_cal_2020 <- prettify(gg_cal_2020) +
theme(legend.position = "bottom")
prettify_gg_cal_2020
flinders_lm_cal <- flinders_lm %>%
# Let's just look at 2018 to make it a bit easier
filter(year(Date) == "2018") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal_2018 <- ggplot(flinders_lm_cal) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify_gg_cal_2018 <- prettify(gg_cal_2018) +
theme(legend.position = "bottom")
prettify_gg_cal_2018
gridExtra::grid.arrange(prettify_gg_cal_2020, prettify_gg_cal_2018,
top = "2020", bottom = "2018")
Write a paragraph answering these questions:
The fitted values are generally lower than the observed values (log_count).
The patterns in 2020 and 2018 display rather similar trends; however, 2020’s counts decline significantly once lockdown measures begin in March.
From the graph of fitted against observed (log_count) values, the pattern of the relationship differs by sensor. The patterns for Flinders Street Station Underpass and Melbourne Central are quite similar, though not exactly the same, while the values are more spread out for Birrarung Marr.
Since the year and sensor variables make a difference to pedestrian counts, we might consider adding these variables to the model. We would also consider adding lockdown and workday, as the exploratory plots show different patterns of pedestrian counts when these variables are taken into account.
What sort of improvements do you think you could make to the model? Fit two more models, trying to add more variables to the linear model.
melb_walk_lockdown_prep <- melb_walk_weather_prep_lm %>%
# add lockdown stages variable
# lockdown stage 1 from 23 Mar 2020 to 30 Mar 2020
mutate(
lockdown_stage_1 = if_else(condition = Date >= dmy("23/03/2020") & Date <= dmy("30/03/2020"),
true = "1",
false = "0"),
#lockdown stage 3 from 31 Mar 2020 through to April
lockdown_stage_3 = if_else(condition = Date >= dmy("31/03/2020"),
true = "1",
false = "0")) %>%
mutate(lockdown_stage_1 = as.factor(lockdown_stage_1), lockdown_stage_3 = as.factor(lockdown_stage_3))
# Model 2: included lockdown, workday, year, sensor variables
walk_fit_lm_lockdown <- lm(formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp + lockdown_stage_1 + lockdown_stage_3 + workday + year + Sensor,
data = melb_walk_lockdown_prep
)
# Model 3: included continuous weather, workday, sensor, year variables
walk_fit_lm_weather <- lm(formula = log_count ~ Time + month + weekday + tmin + tmax + prcp + workday + Sensor + year,
data = melb_walk_lockdown_prep
)
Why did you add those variables?
Year is added to those two models because the exploratory plots show a different pattern in the number of pedestrians in 2018 and 2020. The same reasoning applies to adding the sensor location and workday variables.
In the “walk_fit_lm_lockdown” model we specifically added lockdown as a dummy variable, since there is a significant decline in pedestrian counts once the government imposed stage 1 restrictions from 23 March 2020 and stage 3 restrictions from 31 March 2020 due to COVID-19.
In the “walk_fit_lm_weather” model we replaced the categorised “high_prcp”, “high_temp”, and “low_temp” variables with the original continuous variables (“tmax”, “tmin”, and “prcp”), because a continuous variable reflects the variation in the data better than a categorised one, and information can be lost when treating a continuous variable as categorical.
walk_fit_lm_lockdown_summary <- glance(walk_fit_lm_lockdown)
walk_fit_lm_summary <- glance(walk_fit_lm)
bind_rows(
first = glance(walk_fit_lm),
second = glance(walk_fit_lm_lockdown),
third = glance(walk_fit_lm_weather),
.id = "type"
)
## # A tibble: 3 x 12
## type r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 first 0.524 0.524 1.21 669. 0 36 -34259. 68592.
## 2 seco… 0.758 0.758 0.866 1584. 0 43 -27069. 54226.
## 3 third 0.721 0.721 0.930 1371. 0 41 -28589. 57263.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Which model does the best? Why do you think that is? What statistics are you basing this decision off?
walk_fit_lm_lockdown is the best model because it has the largest adjusted R-squared, which means this model explains the variation in pedestrian counts better than the other models.
We can also refer to the AIC (Akaike Information Criterion), the BIC (Bayesian Information Criterion), and the logLik (the data’s log-likelihood under the model). The greater the logLik, the better the model fits the data; the smaller the AIC and BIC, the better the model.
walk_fit_lm_lockdown has the largest logLik and the smallest AIC and BIC, so it is the best of the three models.
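As a quick cross-check of the table above, base R’s `AIC()` and `BIC()` accept several fitted models at once and return one row per model (lower values indicate a better trade-off between fit and complexity):

```{r compare-ic}
AIC(walk_fit_lm, walk_fit_lm_lockdown, walk_fit_lm_weather)
BIC(walk_fit_lm, walk_fit_lm_lockdown, walk_fit_lm_weather)
```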
(Suggestion - Perhaps write this as a function to speed up comparison)
peds_aug_lm_sensor_lockdown <- augment(walk_fit_lm_lockdown, melb_walk_lockdown_prep)
pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
lm_fit %>%
filter(Sensor == sensor) %>%
select(Date, Time, Count, .fitted, log_count) %>%
pivot_longer(cols = .fitted:log_count,
names_to = "model",
values_to = "log_count") %>%
mutate(Count = expm1(log_count))
}
calendar_fit_obs <- function(lm_fit_aug){
data_cal <- lm_fit_aug %>%
frame_calendar(x = Time,
y = Count,
date = Date)
gg_cal <-
ggplot(data_cal) +
geom_line(data = filter(data_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(data_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) +
theme(legend.position = "bottom")
}
pivot_sensor(peds_aug_lm_sensor_lockdown, sensor = "Flinders Street Station Underpass" ) %>%
filter(year(Date) == "2020") %>%
calendar_fit_obs()
What do you see? How does it compare to the previous model?
Answer: We can see that the fitted line coincides more closely with the actual values (the log_count line). Hence, this model is better than the previous one.
Compare the fitted against the residuals, perhaps write a function to help you do this in a more readable way.
walk_fit_lm_lockdown <- augment(walk_fit_lm_lockdown, melb_walk_lockdown_prep)
walk_fit_lm_weather <- augment(walk_fit_lm_weather, melb_walk_lockdown_prep)
walk_fit_lm <- augment(walk_fit_lm,
melb_walk_weather_prep_lm)
plot_fit_resid <- function(data){
ggplot(data,
aes(x = .fitted,
y = .resid)) +
geom_point() +
facet_wrap(~Sensor)
}
plot_fit_resid(walk_fit_lm_lockdown) +
ggtitle("walk_fit_lm_lockdown")
plot_fit_resid(walk_fit_lm_weather) +
ggtitle("walk_fit_lm_weather")
plot_fit_resid(walk_fit_lm) +
ggtitle("walk_fit_lm")
We’ve looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?
ggplot(peds_aug_lm_sensor_lockdown, aes(x = .fitted,
y = log_count)) +
geom_point() +
facet_wrap(~Sensor)
Answer: The lockdown model, shown in the graph above, has the strongest correlation between its fitted and observed values.
The lockdown model also gives less variation between fitted values and residuals than the previous model, so we can conclude that it does a relatively good job of fitting the data.
Make sure to reference all of the R packages that you used here, along with any links or blog posts that you used to help you answer these questions.
The code below is what was used to retrieve the data in the data-raw folder.
# melb_bbox <- c(min(ped_loc$longitude) - .001,
# min(ped_loc$latitude) - 0.001,
# max(ped_loc$longitude) + .001,
# max(ped_loc$latitude) + 0.001)
#
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
# path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
# compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
# file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
# col_names = c("ID",
# "lat",
# "lon",
# "elev",
# "state",
# "name",
# "v1",
# "v2",
# "v3"),
# skip = 353,
# n_max = 17081
# ) %>%
# filter(state == "MELBOURNE AIRPORT")
# #
# get_ncdc <- function(year){
# vroom::vroom(
# glue::glue(
# "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
# ),
# col_names = FALSE,
# col_select = 1:4
# )
# }
#
# clean_ncdc <- function(x){
# x %>%
# filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
# rename_all(~ c("station", "date", "variable", "value")) %>%
# mutate(date = ymd(date), value = value / 10) %>%
# pivot_wider(names_from = variable, values_from = value) %>%
# rename_all(tolower)
# }
# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
#
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))