This assignment is designed to simulate a scenario where you are given a dataset, take over someone’s existing work, and continue with it to draw further insights. In that respect it is similar to Assignment 1, but it provides less scaffolding, asks you to draw more insights, and requires more communication.
Your previous submission of crime data was well received!
You’ve now been given a new task to work on. Your colleague at your consulting firm, Amelia (in the text treatment below), has written some helpful hints throughout the assignment to help guide you.
Questions that are worth marks are indicated with “Q” at the start and end of the question, as well as the number of marks in parentheses. For example,
## Q1A some text (0.5 marks)
is question one, part A, worth 0.5 marks.
This assignment will be worth 10% of your total grade, and is marked out of 58 marks total.
9 marks for grammar and clarity. You must write in complete sentences and do a spell check.
9 marks for presentation of the data visualisations.
40 marks for the questions
Your marks will be weighted according to peer evaluation.
As of week 6, you have seen most of the code for parts 1 - 2 that needs to be used here, and Week 7 will give you the skills to complete part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering skills on modelling in the next weeks, but this assignment is designed to simulate a real life work situation - this means that there are some things where you need to “learn on the job”. But the vast majority of the assignment will cover things that you will have seen in class, or the readings.
Remember, you can look up the help file for functions by typing `?function_name`. For example, `?mean`. Feel free to google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with `***` or `___`. At a minimum, your assignment should be able to be “knitted” using the knit button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk option `eval = FALSE` like so:
```{r this-chunk-will-not-run, eval = FALSE}`r''`
ggplot()
```
If you do this, please remember to remove this chunk option or set it to `eval = TRUE` when you submit the assignment, to ensure all your R code runs.
You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:
Your assignments will be peer reviewed, and results checked for reproducibility. This means:
Each student will be randomly assigned another team’s submission to provide feedback on three things:
This assignment is due by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team’s name. For example, if you are team `dplyr`, your assignment file name could read: “assignment-2-2020-s1-team-dplyr.Rmd”.
You work as a data scientist in the well-named consulting company, “Consulting for You”.
On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:
Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner-city Melbourne! Amelia is still in New Zealand and won’t be back for a while. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.
We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.
Most importantly, can you get this to me by Wednesday 20th May, COB (COB = close of business, 6pm)?
I’ve given this dataset to some of the other new hire data scientists as well, you’ll all be working as a team on this dataset. I’d like you to all try and work on the questions separately, and then combine your answers together to provide the best results.
From here, you are handed a USB stick. You load this into your computer, and you see a folder called “melbourne-walk”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.
Note that the text was originally written by Amelia; make sure that their name is kept up top, and pay attention to what they have to say in the document!
The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and is available for download from the Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.
There are three parts to this work:
Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to create different shapes for the following sensors:
First we download the data on the pedestrian sensor locations around Melbourne.
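The download chunk itself isn’t shown here; below is a minimal sketch of how it might look, assuming your installed version of rwalkr provides `pull_sensor()` and that it returns the `longitude`, `latitude`, and `installation_date` columns used later in this document.

```{r download-sensor-locations, eval = FALSE}
# Sketch only (not evaluated): pull the pedestrian sensor locations.
# Assumes rwalkr::pull_sensor() exists in your version and returns
# longitude, latitude and installation_date, as used further below.
library(rwalkr)

ped_loc <- pull_sensor()
```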
And now we draw a plot on a map tile of the pedestrian sensor locations
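A sketch of how the sensors might be overlaid on Amelia’s map tile, assuming the map saved in `data-raw` (see the appendix at the end of this document) and a sensor-name column called `sensor` in `ped_loc` (rename to match your data):

```{r plot-sensor-map, eval = FALSE}
# Sketch only (not evaluated): plot every sensor, with distinct shapes
# for the four sensors of interest.
library(ggmap)
library(readr)
library(dplyr)

melb_map <- read_rds("data-raw/melb-map.rds")

key_sensors <- c("Southern Cross Station",
                 "Melbourne Central",
                 "Flinders Street Station Underpass",
                 "Birrarung Marr")

# `sensor` is an assumed column name for the sensor description
ped_loc_shapes <- ped_loc %>%
  mutate(highlight = if_else(sensor %in% key_sensors, sensor, "Other sensors"))

ggmap(melb_map) +
  geom_point(data = ped_loc_shapes,
             aes(x = longitude, y = latitude, shape = highlight),
             colour = "#b45b00", size = 2) +
  labs(shape = "Sensor")
```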
!> Answer: The map shows the locations of all active sensors in inner-city Melbourne that keep hourly tallies of pedestrians. The following sensors are highlighted on the map:
Southern Cross Station
Melbourne Central
Flinders Street Station Underpass
Birrarung Marr
ped_loc %>%
# calculate the year from the date information
mutate(year = year(installation_date)) %>%
group_by(year) %>%
# count up the number of sensors
summarise(n = n()) %>%
# then use `kable()` to create a markdown table
kable(caption = "Number of Sensors installed each year") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
full_width = FALSE)
| year | n  |
|------|----|
| 2009 | 13 |
| 2013 | 12 |
| 2014 | 2  |
| 2015 | 7  |
| 2016 | 1  |
| 2017 | 9  |
| 2018 | 5  |
| 2019 | 6  |
| 2020 | 4  |
Additionally, how many sensors were added in 2016, and in 2017?
!> Answer: The City of Melbourne installed one sensor in 2016 and nine sensors in 2017.
We would like you to focus on the foot traffic at 4 sensors:
Your task is to:
- Extract the data for the year 2018, from Jan 1 to April 30, using `rwalkr` (you might want to cache this so it doesn’t run every time you knit) - a sketch of this step is given below.
- Filter the data down to include only the four sensors above (note that you could do this directly with `rwalkr`).
- Add variables that contain the day of the month, month, year, and day of the year.
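Here is a minimal sketch of the extraction step (not evaluated), assuming `rwalkr::melb_walk()` in your version accepts `from` and `to` dates - check `?melb_walk`, since the rwalkr interface has changed between releases:

```{r download-walk-2018, eval = FALSE}
# Sketch only: pull hourly pedestrian counts for 1 Jan - 30 Apr 2018.
# Set eval = TRUE (or better, cache = TRUE) once the call works for you,
# so the download doesn't rerun on every knit.
library(rwalkr)

walk_2018 <- melb_walk(from = as.Date("2018-01-01"),
                       to   = as.Date("2018-04-30"))
```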
walk_2018 <- walk_2018 %>%
# Filter the data down to include only the four sensors above
filter(Sensor %in% c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")
) %>%
# now add four columns containing the day of the month, month, year, and day of the year
# using functions from lubridate.
mutate(day_month = mday(Date),
monthname = month(Date, label = TRUE, abbr = TRUE),
year = year(Date),
day_of_the_year = yday(Date)
)
Now we can plot the pedestrian count information for January - April in 2018
ggplot(walk_2018,
aes(x = Date_Time,
y = Count)) +
geom_line(size = 0.3) +
facet_grid(Sensor ~ .,
# this code presents the facets in a nice way
labeller = labeller(Sensor = label_wrap_gen(20))) +
# this code makes the x axis a bit nicer to read
scale_x_datetime(date_labels = "%d %b %Y",
date_minor_breaks = "1 month") +
labs(x = "Date Time")
We can see that there are quite different patterns in each of the sensors. Let’s explore this further.
!> Answer: There is a spike in pedestrian counts at Birrarung Marr from mid to late January, most likely due to the Australian Open, which ran from 15 to 28 January 2018, whereas the increase in counts during March is most likely due to the Moomba Festival held at that time. The Flinders Street sensor records consistently high pedestrian counts, as it is a relatively crowded area. The Southern Cross Station counts are consistent across the months. The Melbourne Central sensor shows a spike on 17 February, when no trams ran between 10am and 2pm because of protected industrial action, producing an increased influx of foot traffic. Since Docklands is a prominent business area, the counts there are mostly people going to work, and there is a dip in pedestrian counts on weekends.
We’re primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people’s daily schedules, let’s plot the counts against time of day for each sensor.
ggplot(walk_2018,
aes(x = Time,
y = Count,
group = Date,
colour = Sensor)) +
geom_line() +
facet_wrap(Sensor~.,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2",
name = "Sensor") +
theme(legend.position = "none")
Write a short paragraph that describes what the plot shows:
!> Answer: The figure above shows the hourly pedestrian counts recorded by the sensors at the four locations for January to April 2018. Each line represents the hourly profile of pedestrian counts for a single day at one of the four sensors.
The pattern at every station except Birrarung Marr remains roughly the same across the four months. The Melbourne Central and Flinders Street sensors occasionally record spikes in counts, whereas the Birrarung Marr counts vary continuously across the four months with no clear trend.
All locations start to see pedestrians after 5am. Flinders Street peaks between 5am and 10am as people travel to work and university, decreases afterwards, rises again over lunch, increases once more after 3pm, and then gradually falls as people head home.
Melbourne Central sees counts increase steadily through the day from around 5am to 5pm, after which they decline.
Southern Cross Station is consistent across the months: commuters arrive after 5am, there is a morning rush between 5am and 10am, a smaller lunchtime bump, and an evening rush peaking around 5pm before counts fall away.
Birrarung Marr sees pedestrians after 5am, but activity picks up after 10am and varies widely; counts there depend on nearby events such as the Australian Open and the Moomba Festival, and tend to be highest in the evening. March recorded the highest counts, and one day in February shows unusually high early-morning counts at all locations except Southern Cross, probably due to an event.
Use the data inside `hols_2018` to identify weekdays, weekends, and holidays.
hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")
walk_2018_hols <- walk_2018 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = Date %in% hols_2018$date | weekday %in% c("Sat","Sun"),
true = "holiday",
false = "workday"
)
)
Now create a plot to compare the workdays to the non workdays.
ggplot(walk_2018_hols,
aes(x=Time,y=Count, group = Date, colour = Sensor)) +
geom_line(size = 0.3,
alpha = 0.3) +
facet_grid(Sensor~workday,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2", name = "Sensor") +
theme(legend.position = "none")
Write a short paragraph that describes what the plot shows and helps us answer these questions:
!> Answer: The figure above shows the hourly pedestrian counts on working days and non-working days (weekends and holidays), recorded by the sensors at the four locations for January to April 2018. Each line represents the hourly profile for a single day. The data vary between sensors and, within each sensor, the pattern differs between working and non-working days.
The Southern Cross Station sensor records far fewer pedestrians on non-working days and holidays than on working days.
On working days both Southern Cross Station and Flinders Street show a morning rush starting around 5am and peaking around 8am, after which counts fall until about 10am. There is another bump over the late-morning and lunchtime hours, and an evening rush from around 3pm to 5pm as workplaces close, after which counts decline. The Flinders Street sensor sees more traffic overall, but the shape of the pattern is similar.
Flinders Street and Melbourne Central show quite similar patterns on non-working days and holidays, with more pedestrians than the Southern Cross Station area, whereas on working days Flinders Street sees more pedestrians. Both also record activity in the early hours after midnight on non-working days and holidays.
The Melbourne Central counts rise steadily through the day until around 8pm; because the station connects to the mall and shopping district, more people pass the sensor as the day goes on and into the evening, after which counts fall slowly towards midnight.
The Birrarung Marr data show that more people visit this area on non-working days than on working days. Traffic there is inconsistent, since the area contains tourist sites that are not visited frequently but attract large crowds on special occasions.
To locate those unusual moments, the Flinders Street Station data is calendarised on the canvas using the sugrrants package. We can spot the unusual weekday patterns on public holidays using their colour. Using the calendar plot, try to spot another unusual pattern, and do a Google search to try to explain the change in foot traffic. (Hint: look at the plot carefully - does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)
# filter to just look at flinders st station
flinders <- walk_2018_hols %>% filter(Sensor=="Flinders Street Station Underpass")
flinders_cal <- flinders %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>%
ggplot(aes(x = .Time, y = .Count, colour = workday, group = Date)) +
geom_line()
prettify(gg_cal) +
theme(legend.position = "bottom")
!> Answer: The pedestrian counts on 1 January 2018 look different from those on other dates: more people were detected by the sensors during and after midnight, which is explained by the New Year celebrations.
On 17 February 2018 there was an increase in pedestrian counts from the morning up until midnight, probably due to people attending the White Night festival held in the Melbourne CBD; the commuters present in the early hours of 18 February are most likely there for the same reason.
17 March shows a late spike before midnight, which may be due to St Patrick’s Day celebrations that also carried over into Sunday 18 March.
On 12 April there appears to be a period when no data were recorded, possibly due to a technical problem with the sensor. On 14 April, which was Easter, there were also some discrepancies.
You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.
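If `walk_2020` isn’t already in your environment, a matching sketch for the 2020 window (same assumptions about `melb_walk()` as before) is:

```{r download-walk-2020, eval = FALSE}
# Sketch only: the equivalent pull for 1 Jan - 30 Apr 2020.
library(rwalkr)

walk_2020 <- melb_walk(from = as.Date("2020-01-01"),
                       to   = as.Date("2020-04-30"))
```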
walk_2020 <- walk_2020 %>%
filter(Sensor %in% c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")) %>%
# now use `mutate` to add four columns containing the day of the month, month, year, and day of the year, using functions from lubridate.
mutate(day_month = mday(Date),
monthname = month(Date, label = TRUE, abbr = TRUE),
year = year(Date),
day_of_the_year = yday(Date))
Now add the holiday data
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")
walk_2020_hols <- walk_2020 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = Date %in% hols_2020$date | weekday %in% c("Sat","Sun"),
true = "holiday",
false = "workday"
))
melb_walk_hols <- bind_rows(walk_2018_hols, walk_2020_hols)
filter_sensor <- function(data, sensors){
data %>% filter(Sensor %in% sensors)
}
add_day_info <- function(data){
# now use `mutate` to add four columns containing the day of the month, month, year, and day of the year, using functions from lubridate.
data %>%
mutate(day_month = mday(Date),
monthname = month(Date,label = TRUE, abbr = TRUE),
year = year(Date),
day_of_the_year = yday(Date))
}
add_working_day <- function(data){
walk_years <- unique(data$year)
hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
data %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = Date %in% hols$date | weekday %in% c("Sat","Sun"),
true = "holiday",
false = "workday",
))
}
# Step one, combine the walking data
new <- bind_rows(walk_2018, walk_2020) %>%
# Step two, filter the sensors
filter_sensor(sensors = c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")) %>%
# step three, add the info on day of the year, month, etc
add_day_info() %>%
# step four, add info on working days.
add_working_day()
Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences among the plots, and why they might be similar or different? (You might need to adjust the plot output size to see the pattern clearly.)
melb_walk_hols_flinders_april <- new %>%
filter(Sensor=="Flinders Street Station Underpass", monthname=="Apr")
ggplot(melb_walk_hols_flinders_april,
aes(x=Time,y=Count,group=Date,
colour = as.factor(year))) +
geom_line() +
facet_wrap(~ year, ncol = 2) +
theme(legend.position = "bottom") +
labs(colour = "Year")
!> Answer: The pedestrian count in 2020 is substantially lower than in 2018. This reflects the state of emergency declared across Australia to control the coronavirus pandemic, which required the majority of the population to remain indoors and banned public events, non-essential business activity, entertainment, and public gatherings.
What do you learn? Which Sensors seem the most similar? Or the most different?
melb_walk_hols_april <- new %>% filter(monthname == "Apr")
ggplot(melb_walk_hols_april,
aes(x=Time,y=Count,group=Date,
colour = as.factor(Sensor))) +
geom_line() +
facet_grid(year~Sensor,labeller = labeller(Sensor = label_wrap_gen(20))) +
theme(legend.position = "bottom") +
labs(colour = "Year")
!> Answer: The difference between the two years is striking. In 2020 all four sensors show almost identical, very low counts. On closer inspection, the 2020 counts follow a similar daily shape to 2018, but dialled down dramatically.
Combining weather data with pedestrian counts
One question we want to answer is: “Does the weather make a difference to the number of people walking out?”
Time of day and day of week are the predominant driving forces of the number of pedestrians, as the previous plots show. Apart from these temporal factors, weather conditions could also affect how many people walk in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or when it is raining hard.
Daily meteorological data from a separate source, available from the National Climatic Data Center, is joined to the main pedestrian data table using the common dates.
Binary variables are created to serve as the tipping points:

- `prcp` > 5 (if yes, “rain”, if no, “none”)
- `tmax` > 33 (if yes, “hot”, if no, “not”)
- `tmin` < 6 (if yes, “cold”, if no, “not”)

We have pulled information on weather stations for the Melbourne area - can you combine it together into one dataset?

# Now create some flag variables
melb_weather_2018 <- read_csv("data-raw/melb_ncdc_2018.csv") %>%
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.
Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.
melb_daily_walk_2018 <- melb_walk_hols %>%
filter(year==2018) %>%
group_by(Sensor, Date) %>%
summarise(Count=sum(Count)) %>%
ungroup()
melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
left_join(melb_weather_2018, by = c("Date"="date"))
melb_daily_walk_weather_2018
## # A tibble: 480 x 10
## Sensor Date Count station tmax tmin prcp high_prcp high_temp
## <chr> <date> <int> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Birra~ 2018-01-01 8385 ASN000~ 26.2 14 0 none not
## 2 Birra~ 2018-01-02 9844 ASN000~ 23.6 15.5 0 none not
## 3 Birra~ 2018-01-03 NA ASN000~ 22.3 11.2 0 none not
## 4 Birra~ 2018-01-04 NA ASN000~ 25.5 11.5 0 none not
## 5 Birra~ 2018-01-05 1052 ASN000~ 30.5 12.2 0 none not
## 6 Birra~ 2018-01-06 6672 ASN000~ 41.5 16.6 0 none hot
## 7 Birra~ 2018-01-07 1567 ASN000~ 22 15.7 0 none not
## 8 Birra~ 2018-01-08 6201 ASN000~ 23.6 15.9 0 none not
## 9 Birra~ 2018-01-09 8403 ASN000~ 22.8 13.9 0 none not
## 10 Birra~ 2018-01-10 12180 ASN000~ 25.5 12.1 0 none not
## # ... with 470 more rows, and 1 more variable: low_temp <chr>
Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (`high_prcp`, `high_temp`, and `low_temp`). Write a paragraph that tells us what you learn from these plots and how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots summarised like this - what assumptions do they make?
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_prcp)) +
geom_boxplot() +
theme(legend.position = "bottom")
# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_temp)) +
geom_boxplot() +
theme(legend.position = "bottom")
# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = low_temp)) +
geom_boxplot() +
theme(legend.position = "bottom")
!> Answer: Rain has a clear impact on whether people go outside. Southern Cross Station is the most affected, with the lowest median count on rainy days; its single outlier suggests a day when the rain arrived without warning on a working day. Melbourne Central also shows a fall in counts on rainy days, with a similar explanation for its outlier, and Flinders Street Station shows a decline with no outliers. The most interesting result is Birrarung Marr: the spread of counts does not decline much, and the median actually increases, which may reflect how unpredictable that sensor’s counts are.
High temperature has a predictable effect at Southern Cross Station, which records more pedestrians when the weather is not hot; on hot days the counts decrease and the median shifts towards the centre of the distribution, with no outliers. Melbourne Central shows a rise in the bulk of its counts on hot days, with a small decline in the total. Flinders Street Station shows a decrease overall, with the median shifting downwards and no outliers. Birrarung Marr shows a slightly opposite effect, with an increase in the bulk of its counts and a slight upward shift of the median on hot days, and one outlier.
Low temperature has a drastic effect, with counts close to zero across all four sensors on cold days - the plot suggests Southern Cross Station all but shuts down - leaving little variation to compare.
A limitation of these visualisations is that they hide the dates, which could relate to special occasions that explain the outliers; the time of year also matters, since a sudden change in temperature will have a different effect than a typical summer or winter day. Refining the thresholds for high and low temperature and for rainfall would help clarify the trends further. The plots also assume that a single daily total and a single daily weather summary adequately represent each day.
A strength of the visualisations is that the boxplots show the shift in the median and in the bulk of the daily counts, which helps us understand the effect across weekdays, holidays, and weekends.
The visualisations tell us something interesting about the data, but to really understand the data, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020, combine with the pedestrian data for 2018 and 2020.
melb_weather_2018 <- read_csv("data-raw/melb_ncdc_2018.csv")
melb_weather_2020 <- read_csv("data-raw/melb_ncdc_2020.csv")
# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
melb_weather_2020) %>%
# remember to add info about high precipitation, high temperature, + low temps
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none" ),
high_temp = if_else(condition = tmax>33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin<6,
true = "cold",
false = "not")
)
# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>%
left_join(melb_weather, by = c("Date"="date"
))
We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.
We are going to process the data somewhat so we can fit a linear model to it. First, let’s set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.
We also add one to the count and then take the natural log of it. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model for this data, and we can help it be more appropriate by taking the log of the counts, which helps stabilise the residuals (observed minus predicted) when we fit the model.
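A quick illustration of why `log1p()` is used rather than a plain `log()`: hourly counts of zero are common, and `log(0)` is `-Inf`, whereas `log1p()` handles zeros and `expm1()` reverses the transformation exactly.

```{r log1p-illustration, eval = FALSE}
# Why log(count + 1) rather than log(count):
log(0)             # -Inf, which would break the linear model
log1p(0)           # 0
log1p(100)         # log(101) = 4.615...
expm1(log1p(100))  # 100, so fitted values can be mapped back to counts
```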
melb_walk_weather_prep_lm <- melb_walk_weather %>%
mutate_at(.vars = vars(Sensor,
Time,
monthname,
year,
weekday,
high_prcp,
high_temp,
low_temp),
as_factor) %>%
mutate(log_count = log1p(Count))
Now we fit a linear model, predicting `log_count` using Time, month, weekday, and the weather flag variables (`high_prcp`, `high_temp`, and `low_temp`).
walk_fit_lm <- lm (
formula = log_count ~ Time + monthname + weekday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Provide some summary statistics on how well this model fits the data. What do you see? What statistics tell us about our model fit?
glance(walk_fit_lm)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.524 0.524 1.21 669. 0 36 -34259. 68592. 68887.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
!> Answer: r.squared measures the strength of the linear model fit; it ranges between 0 and 1, with 1 being a perfect fit. Here r.squared is 0.524, meaning that around 52.4 per cent of the variability in the observed log pedestrian counts is explained by the model’s predictors: time of day, month, weekday, and the weather flags.
We have had one look at the model fit statistics, but let’s now look at the fitted and observed (`log_count`) values for each sensor:
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
ggplot(peds_aug_lm,
aes(x = log_count,
y = .fitted)) +
geom_point() +
facet_wrap(~Sensor)
There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the residuals. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components. We can do this directly with a calendar plot.
Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, Model, and log_count.
flinders_lm <- peds_aug_lm %>%
# filter the data to only look at flinder st
filter(Sensor == "Flinders Street Station Underpass") %>%
# Select the Date, Time, Count, .fitted, and log_count
select(Date, Time, Count, .fitted, log_count) %>%
# Now pivot the log count and fitted columns into a column called "Model"
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(cols = c(log_count:.fitted),
names_to = "model",
values_to = "log_count") %>%
# Now we're going to undo the initial data transformation
mutate(Count = expm1(log_count))
flinders_lm_cal <- flinders_lm %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- ggplot(flinders_lm_cal) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date), color = "Blue") +
geom_line(data = filter(flinders_lm_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date), color = "Red")
prettify(gg_cal) + theme(legend.position = "bottom")
Write a paragraph answering these questions:
flinders_lm_cal <- flinders_lm %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2018") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- ggplot(flinders_lm_cal) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date), color = "Blue") +
geom_line(data = filter(flinders_lm_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date), color = "Red")
prettify(gg_cal) + theme(legend.position = "bottom")
tot_sensor <- peds_aug_lm %>%
select(Sensor,Date,Time,Count,.fitted,log_count) %>%
pivot_longer(cols = c(log_count:.fitted), names_to = "model", values_to = "log_count") %>%
mutate(Count = expm1(log_count))
tot_sensor_cal <- tot_sensor %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2020") %>% group_by(Sensor) %>%
frame_calendar(x = Time, y = Count, date = Date)
ggcaltot <- ggplot(data=tot_sensor_cal) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(tot_sensor_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date),colour="red") +
geom_line(data = filter(tot_sensor_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date),colour="blue") +
facet_grid(Sensor ~ .) +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "bottom")
prettify(ggcaltot) + theme(legend.position = "bottom")
!> Answer: According to the linear model, in January, February, and early March 2020 the observed and predicted values differ most on weekdays, while they are relatively similar on weekends in January and February; this continues until the second weekend of March 2020. Because of the COVID-19 pandemic and the restrictions enforced from mid March, the predicted and observed values are then almost the same regardless of working hours or holidays.
In 2018 the predicted and observed values generally follow a similar trend to the first two months of 2020. Because of the COVID-19 pandemic, however, from mid March onwards 2020 differs markedly from the corresponding period in 2018.
The figure shows the predicted and observed pedestrian counts across the sensors, and there is a visible difference in the patterns between them. The Southern Cross Station and Flinders Street Station sensors follow a similar pattern, which differs from the patterns at Birrarung Marr and Melbourne Central, as the latter are places of leisure.
What other variables might you consider adding to the model, and why?
!> Answer: We could consider adding the variables year and Sensor to the model, as there are clear differences in the observations across these variables.
What sort of improvements do you think you could make to the model? Fit two more models, trying to add more variables to the linear model.
walk_fit_lm1 <- lm (
formula = log_count ~ Time + monthname + weekday + year + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
walk_fit_lm2 <- lm (
formula = log_count ~ Time + monthname + weekday + year + Sensor + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Why did you add those variables?
!> Answer: The year and Sensor variables influence the pedestrian count patterns, which is a good reason to add them to the model.
bind_rows(
first = glance(walk_fit_lm),
year = glance(walk_fit_lm1),
sensor = glance(walk_fit_lm2),
.id = "type"
)
## # A tibble: 3 x 12
## type r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 first 0.524 0.524 1.21 669. 0 36 -34259. 68592.
## 2 year 0.554 0.554 1.18 733. 0 37 -33572. 67219.
## 3 sens~ 0.720 0.719 0.932 1397. 0 40 -28639. 57360.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Which model does the best? Why do you think that is? What statistics are you basing this decision off?
!> Answer: The model that includes the Sensor and year variables is the best of the three. This decision is based on the fit statistics: the sensor model has an r.squared of 0.720, higher than the other models (0.524 and 0.554), and it also has the lowest AIC.
(Suggestion - Perhaps write this as a function to speed up comparison)
peds_aug_lm_sensor_1 <- augment(walk_fit_lm1, data = melb_walk_weather_prep_lm)
peds_aug_lm_sensor_2 <- augment(walk_fit_lm2, data = melb_walk_weather_prep_lm)
pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
lm_fit %>%
filter(Sensor == sensor) %>%
select(Date, Time, Count, .fitted, log_count) %>%
pivot_longer(cols = c(log_count:.fitted),
names_to = "model",
values_to = "log_count") %>%
mutate(Count = expm1(log_count))
}
calendar_fit_obs <- function(lm_fit_aug){
# calendar plot of observed vs fitted values for the supplied augmented data
data_cal <- lm_fit_aug %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- ggplot(data_cal) +
geom_line(data = filter(data_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date), color = "Blue") +
geom_line(data = filter(data_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date), color = "Red")
prettify(gg_cal) + theme(legend.position = "bottom")
}
pivot_sensor(peds_aug_lm_sensor_2) %>%
filter(year(Date) == "2018") %>%
calendar_fit_obs()
What do you see? How does it compare to the previous model?
!> Answer: We can observe that there is very little difference between the predicted and observed pedestrian counts in the new model once the Sensor variable is added. The pattern deviates from mid March 2020 because of the COVID-19 pandemic. The new model appears better fitted after adding the Sensor variable, but it cannot account for the pandemic; if not for this unusual circumstance, this model is clearly better than the previous one.
Compare the fitted values against the residuals; perhaps write a function to help you do this in a more readable way.
walk_fit_lm_0 <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
walk_fit_lm_1 <- augment(walk_fit_lm1, data = melb_walk_weather_prep_lm)
walk_fit_lm_2 <- augment(walk_fit_lm2, data = melb_walk_weather_prep_lm)
plot_fit_resid <- function(data){
ggplot(data,
aes(x = .resid,
y = .fitted)) +
geom_point() +
facet_wrap(~Sensor)
}
plot_fit_resid(walk_fit_lm_0)
plot_fit_resid(walk_fit_lm_1)
plot_fit_resid(walk_fit_lm_2)
We’ve looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?
!> Answer: From these observations it is clear that the sensor model, which includes both the year and Sensor variables, is the best of the three. As more variables are added, the difference between the predicted and observed values decreases. The sensor model has relatively more points clustered around zero residual, indicating very little difference between the predicted and observed pedestrian counts, whereas the other two models have more points scattered away from zero. We can therefore say the sensor model, with the most variables, is the best, apart from the outliers caused by the COVID-19 pandemic. The sensor model’s statistics are: AIC = 57359.92, BIC = 57686.46, deviance = 18424.33.
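One way to make that comparison explicit is an observed-versus-fitted plot for the sensor model, reusing `walk_fit_lm_2` from the chunk above (a sketch only; points near the dashed line indicate close agreement between observed and predicted values):

```{r best-model-obs-vs-fitted, eval = FALSE}
# Sketch: observed log counts against the sensor model's fitted values.
library(ggplot2)

ggplot(walk_fit_lm_2,
       aes(x = log_count,
           y = .fitted)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~ Sensor)
```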
[1] Tierney, N., & Lee, S. (2020). Linear Models [Ebook]. Retrieved from https://mida.numbat.space/slides/lecture_7a.pdf
[2] Wang, E., Cook, D., & Hyndman, R. (2020). Sugrrants package [Ebook]. Retrieved from https://cran.r-project.org/web/packages/sugrrants/sugrrants.pdf
[3] Wang, E., Cook, D., & Hyndman, R. (2020). Calendar-Based Graphics for Visualizing People’s Daily Schedules. Journal of Computational and Graphical Statistics, 1-13. doi: 10.1080/10618600.2020.1715226
[4] Wang, E. (2020). Calendar-based graphics. Retrieved 19 May 2020, from https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
[5] Wang, E., Cook, D., & Hyndman, R. (2020). A New Tidy Data Structure to Support Exploration and Modeling of Temporal Data. Journal Of Computational And Graphical Statistics, 1-13. doi: 10.1080/10618600.2019.1695624
The code below is what was used to retrieve the data in the `data-raw` folder.
# melb_bbox <- c(min(ped_loc$longitude) - .001,
# min(ped_loc$latitude) - 0.001,
# max(ped_loc$longitude) + .001,
# max(ped_loc$latitude) + 0.001)
#
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
# path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
# compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
# file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
# col_names = c("ID",
# "lat",
# "lon",
# "elev",
# "state",
# "name",
# "v1",
# "v2",
# "v3"),
# skip = 353,
# n_max = 17081
# ) %>%
# filter(state == "MELBOURNE AIRPORT")
# #
# get_ncdc <- function(year){
# vroom::vroom(
# glue::glue(
# "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
# ),
# col_names = FALSE,
# col_select = 1:4
# )
# }
#
# clean_ncdc <- function(x){
# x %>%
# filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
# rename_all(~ c("station", "date", "variable", "value")) %>%
# mutate(date = ymd(date), value = value / 10) %>%
# pivot_wider(names_from = variable, values_from = value) %>%
# rename_all(tolower)
# }
# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
#
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))