This assignment is designed to simulate a scenario where you are given a dataset, take over someone's existing work, and continue with it to draw some further insights. This aspect is similar to Assignment 1, but this assignment provides less scaffolding and asks you to draw more insights, as well as do more communication.
Your previous submission of crime data was well received!
You've now been given a new task to work on. Your colleague at your consulting firm, Amelia (in the text treatment below), has written some helpful hints throughout the assignment to help guide you.
Questions that are worth marks are indicated with "Q" at the start and end of the question, as well as the number of marks in parentheses. For example
## Q1A some text (0.5 marks)
is question one, part A, worth 0.5 marks.
This assignment will be worth 10% of your total grade, and is marked out of 58 marks total.
9 Marks for grammar and clarity. You must write in complete sentences and do a spell check.
9 Marks for presentation of the data visualisations
40 marks for the questions
Your marks will be weighted according to peer evaluation.
As of Week 6, you have seen most of the code for Parts 1-2 that needs to be used here, and Week 7 will give you the skills to complete Part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering skills on modelling in the coming weeks, but this assignment is designed to simulate a real-life work situation - this means that there are some things where you need to "learn on the job". But the vast majority of the assignment will cover things that you will have seen in class or in the readings.
Remember, you can look up the help file for functions by typing `?function_name`. For example, `?mean`. Feel free to Google questions you have about how to do other kinds of plots, and post on ED if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with `***` or `___`. At a minimum, your assignment should be able to be "knitted" using the knit button for your R Markdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk option to `eval = FALSE`, like so:
```{r this-chunk-will-not-run, eval = FALSE}`r''`
ggplot()
```
If you do this, please remember to remove this chunk option or set it to `eval = TRUE` when you submit the assignment, to ensure all your R code runs.
You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:
Your assignments will be peer reviewed, and results checked for reproducibility. This means:
Each student will be randomly assigned another team’s submission to provide feedback on three things:
This assignment is due by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team's name. For example, if you are team `dplyr`, your assignment file name could read: "assignment-2-2020-s1-team-dplyr.Rmd".
You work as a data scientist at the aptly named consulting company, "Consulting for You".
On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:
Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner city Melbourne! Amelia is still in New Zealand, and won't be back for a while. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.
We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.
Most importantly, can you get this to me by Wednesday 20th May, COB (COB = Close of Business at 6pm).
I’ve given this dataset to some of the other new hire data scientists as well, you’ll all be working as a team on this dataset. I’d like you to all try and work on the questions separately, and then combine your answers together to provide the best results.
From here, you are handed a USB stick. You load this into your computer, and you see a folder called “melbourne-walk”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.
Note that the text here was originally written by Amelia; make sure that their name is kept up top, and pay attention to what they have to say in the document!
The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and available for download from Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.
There are three parts to this work:
Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to create different shapes for the following sensors:
First we download the data on the pedestrian sensor locations around Melbourne.
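Amelia's download chunk isn't included here, so below is a minimal sketch of how the sensor locations might be pulled, assuming `rwalkr::pull_sensor()` is used to create the `ped_loc` object used in the rest of this part (the column names are assumptions based on the code that follows).

```{r ped-loc-download, eval = FALSE}
# Sketch: download the pedestrian sensor locations with rwalkr.
# `pull_sensor()` is assumed to return one row per sensor, with (at least)
# sensor, longitude, latitude and installation_date columns.
library(rwalkr)
ped_loc <- pull_sensor()
```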
And now we draw a plot on a map tile of the pedestrian sensor locations
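The plotting code is one of the blanks to fill in. A minimal sketch, assuming the `ggmap` package is loaded, the map tile Amelia saved is in `data-raw/melb-map.rds` (see the download code at the end of this document), and `ped_loc` has `sensor`, `longitude` and `latitude` columns:

```{r sensor-map-sketch, eval = FALSE}
library(ggmap)

# The four sensors to highlight with coloured triangles (from the caption below)
highlight_sensors <- c("Birrarung Marr",
                       "Melbourne Central",
                       "Southern Cross Station",
                       "Flinders Street Station Underpass")

# Read the saved map tile, then layer the sensor locations on top of it
melb_map <- read_rds(here::here("data-raw/melb-map.rds"))

ggmap(melb_map) +
  # all other sensors as blue circles
  geom_point(data = filter(ped_loc, !sensor %in% highlight_sensors),
             aes(x = longitude, y = latitude),
             colour = "blue", shape = 16, size = 2) +
  # the four sensors of interest as coloured triangles
  geom_point(data = filter(ped_loc, sensor %in% highlight_sensors),
             aes(x = longitude, y = latitude, colour = sensor),
             shape = 17, size = 3) +
  scale_colour_brewer(palette = "Dark2", name = "Sensor") +
  labs(x = "Longitude", y = "Latitude")
```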
Pedestrian Sensors in Melbourne
This map shows all the active pedestrian sensors in Melbourne, which are used to count the number of pedestrians each hour. Each sensor is depicted by a blue circle or a coloured triangle. The coloured triangles are the sensors at the following locations: Birrarung Marr, Melbourne Central, Southern Cross Station and Flinders Street Station Underpass, with the respective colours referenced in the key.
ped_loc %>%
# calculate the year from the date information
mutate(year = year(installation_date)) %>%
# count up the number of sensors
group_by(year) %>%
summarise(count = n()) %>%
# then use `kable()` to create a markdown table
kable(col.names = c("Year",
"Number of installations"),
caption = "Number of Sensors Installed each year") %>%
kable_styling("striped")
| Year | Number of installations |
|---|---|
| 2009 | 13 |
| 2013 | 12 |
| 2014 | 2 |
| 2015 | 7 |
| 2016 | 1 |
| 2017 | 9 |
| 2018 | 5 |
| 2019 | 6 |
| 2020 | 4 |
Additionally, how many sensors were added in 2016, and in 2017?
There was one sensor added in 2016 and nine sensors added in 2017.
We would like you to focus on the foot traffic at 4 sensors:
Your task is to:
* Extract the data for the year 2018, from Jan 1 to April 30, using `rwalkr` (you might want to cache this so it doesn't run every time you knit) - see the sketch after this list.
* Filter the data down to include only the four sensors above; note that you could do this directly with `rwalkr`.
* Add variables that contain the day of the month, month, year, and day of the year.
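A minimal sketch of the download step, assuming `rwalkr::melb_walk_fast()` is used to pull the hourly counts and that the chunk is cached with `cache = TRUE` so it does not re-run on every knit:

```{r walk-2018-download, eval = FALSE}
# The four sensors we focus on for the rest of the analysis
selected_sensors <- c("Birrarung Marr",
                      "Melbourne Central",
                      "Southern Cross Station",
                      "Flinders Street Station Underpass")

# Pull the hourly counts for 2018 and keep 1 January - 30 April
walk_2018 <- rwalkr::melb_walk_fast(year = 2018) %>%
  filter(Date >= as.Date("2018-01-01"),
         Date <= as.Date("2018-04-30"))
```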
walk_2018 <- walk_2018 %>%
# Filter the data down to include only the four sensors above
filter(Sensor %in% selected_sensors) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
mutate(mday = mday(Date_Time),
month = month(Date_Time),
year = year(Date_Time),
yday = yday(Date_Time))
Now we can plot the pedestrian count information for January - April in 2018
ggplot(walk_2018,
aes(x = Date_Time,
y = Count)) +
geom_line(size = 0.3) +
facet_grid(Sensor ~ .,
# this code presents the facets in a nice way
labeller = labeller(Sensor = label_wrap_gen(20))) +
# this code makes the x axis a bit nicer to read
scale_x_datetime(date_labels = "%d %b %Y",
date_minor_breaks = "1 month") +
labs(x = "Date Time")
Pedestrian Counts for January - April 2018
We can see that there are quite different patterns in each of the sensors. Let’s explore this further.
There would be many activities that are captured at these places, including but not limited to:
* Morning commute to work
* Evening commute from work
* Shopping
* Sight-seeing
* Festivals
* Events at Birrarung Marr
* Nightlife
We're primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people's daily schedules, let's plot the counts against time of day for each sensor.
ggplot(walk_2018,
aes(x = Time,
y = Count,
group = Date,
colour = Sensor)) +
geom_line() +
facet_wrap(~Sensor,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2",
name = "Sensor") +
theme(legend.position = "none")
Daily trends of pedestrian patterns in the first quarter of 2018
Write a short paragraph that describes what the plot shows:
The plot shows the pedestrian counts at specific times of day for each sensor. Each panel represents a sensor, and each line represents a day within the time frame (2018-01-01 to 2018-04-30).
The data for Southern Cross Station and Flinders St Station Underpass are similar, while Birrarung Marr and Melbourne Central are different. At Southern Cross and Flinders Street Station Underpass, pedestrian counts begin to increase around 5am with a peak mid-morning, and increase again around 12pm with a peak around 3-4pm. These peaks are likely associated with pedestrians' daily commutes to work or school.
Birrarung Marr shows no clear trend, and is very different from the other plots. There are lower counts in the early morning and much higher counts later at night, with a peak around 7pm. A higher pedestrian count in the evening could be due to the fact that it is a hot spot for different festivals and events, particularly around the evening.
Melbourne Central has high pedestrian counts towards the middle of the day, which would be interpreted as pedestrians going shopping or to lunch. There is a peak around 5pm, which is likely showing commuters.
Use the data inside the `hols_2018` object to identify weekdays, weekends, and holidays.
hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")
walk_2018_hols <- walk_2018 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols_2018$date,
true = "No",
false = "Yes"))
Now create a plot to compare the workdays to the non workdays.
ggplot(walk_2018_hols,
aes(x = Time,
y = Count,
group = Date)) +
geom_line(size = 0.3,
alpha = 0.3) +
facet_grid(workday ~ Sensor,
labeller = labeller(Sensor = label_wrap_gen(20),
workday = as_labeller(c(`No` = "non-workday", `Yes` = "workday")))) +
scale_colour_brewer(palette = "Dark2", name = "Sensor") +
theme(legend.position = "none")
Daily trends of pedestrian patterns on working and non-working days in the first quarter of 2018
Write a short paragraph that describes what the plot shows, and helps us answer these questions:
This plot shows the pedestrian counts at different times throughout the day between 1 January 2018 and 30 April 2018, categorised by location and type of day: workday or non-workday. Each line represents the pedestrian count for a given day, at a specific location, depending on whether it is a workday or not.
Birrarung Marr has much higher pedestrian counts on a non-workday compared to a workday, while Southern Cross Station and Flinders Street Station Underpass experience much higher pedestrian counts on a workday than a non-workday. It is likely this is associated with the high number of festivals and events Birrarung Marr is home to, particularly on the weekends. Similarly, since Southern Cross is a station and Flinders Street Station offers V/Line services, it is likely the high pedestrian counts are linked to daily regional commuters, as the peaks are at the start of the work/school day (8am) and the end of the work/school day (4pm), with a smaller peak in the middle of the day which might be people going out during lunch.
Melbourne Central has similar trends for both working days and non-working days. Given this similarity, it is likely the station is primarily used for travelling to shopping spots or eateries, as there is no distinct difference in 9-5 work-associated travel.
To locate those unusual moments, the Flinders Street Station data is calendarised on the canvas, using the sugrrants package. We can spot the unusual weekday patterns on public holidays using their colour. Using the calendar plot, try to spot another unusual pattern, and do a Google search to try to explain the change in foot traffic. (Hint: Look at the plot carefully. Does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)
# filter to just look at flinders st station
flinders <- walk_2018_hols %>%
filter(Sensor == "Flinders Street Station Underpass")
flinders_cal <- flinders %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>%
ggplot(aes(x = .Time, y = .Count, colour = workday, group = Date)) +
geom_line()
prettify(gg_cal) +
theme(legend.position = "bottom")
Calendar plot of daily pedestrian patterns at Flinders Street Station Underpass in the first quarter of 2018
The third weekend of February 2018, specifically Saturday the 17th, distinctly stands out as different in terms of pedestrian count. This is likely due to the White Night festival, which attracts large crowds as it is a 12-hour overnight celebration showcasing performers and illuminations.
You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.
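As with 2018, the download itself isn't shown. A minimal sketch, again assuming `rwalkr::melb_walk_fast()` and the same January-April window:

```{r walk-2020-download, eval = FALSE}
# Pull the hourly counts for 2020 and keep 1 January - 30 April
walk_2020 <- rwalkr::melb_walk_fast(year = 2020) %>%
  filter(Date >= as.Date("2020-01-01"),
         Date <= as.Date("2020-04-30"))
```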
walk_2020 <- walk_2020 %>%
filter(Sensor %in% selected_sensors) %>%
# now use `mutate` to add four columns containing the day of the month, month, year, and day of the year, using functions from lubridate.
mutate(mday = mday(Date_Time),
month = month(Date_Time),
year = year(Date_Time),
yday = yday(Date_Time))
Now add the holiday data
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")
walk_2020_hols <- walk_2020 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols_2020$date,
true = "No",
false = "Yes"))
melb_walk_hols <- bind_rows(walk_2018_hols, walk_2020_hols)
filter_sensor <- function(data, sensors){
data %>% filter(Sensor %in% sensors)
}
add_day_info <- function(data){
# now use `mutate` to add four columns containing the day of the month, month, year, and day of the year, using functions from lubridate.
data %>%
mutate(mday = mday(Date_Time),
month = month(Date_Time),
year = year(Date_Time),
yday = yday(Date_Time))
}
add_working_day <- function(data){
walk_years <- unique(data$year)
hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
data %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(
condition = weekday %in% c("Sat", "Sun") | Date %in% hols$date,
true = "No",
false = "Yes"
))
}
# Step one, combine the walking data
bind_rows(walk_2018, walk_2020) %>%
# Step two, filter the sensors
filter_sensor(sensors = selected_sensors) %>%
# step three, add the info on day of the year, month, etc
add_day_info() %>%
# step four, add info on working days.
add_working_day()
## # A tibble: 23,136 x 11
## Sensor Date_Time Date Time Count mday month year yday
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Melbo~ 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018 1
## 2 Flind~ 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018 1
## 3 Birra~ 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018 1
## 4 South~ 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018 1
## 5 Melbo~ 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018 1
## 6 Flind~ 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018 1
## 7 Birra~ 2018-01-01 01:00:00 2018-01-01 1 1143 1 1 2018 1
## 8 South~ 2018-01-01 01:00:00 2018-01-01 1 436 1 1 2018 1
## 9 Melbo~ 2018-01-01 02:00:00 2018-01-01 2 1721 1 1 2018 1
## 10 Flind~ 2018-01-01 02:00:00 2018-01-01 2 3157 1 1 2018 1
## # ... with 23,126 more rows, and 2 more variables: weekday <ord>, workday <chr>
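If we want the rest of the analysis to use this refactored pipeline rather than the step-by-step construction above, we could assign its result back to `melb_walk_hols`; this is only a convenience, as the two approaches produce essentially the same data.

```{r melb-walk-hols-refactor, eval = FALSE}
melb_walk_hols <- bind_rows(walk_2018, walk_2020) %>%
  filter_sensor(sensors = selected_sensors) %>%
  add_day_info() %>%
  add_working_day()
```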
Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to clearly see the pattern.)
melb_walk_hols_flinders_april <- melb_walk_hols %>%
filter(Sensor == "Flinders Street Station Underpass",
month == 4)
ggplot(melb_walk_hols_flinders_april,
aes(x = Time,
y = Count,
group = mday,
colour = as.factor(year))) +
geom_line() +
facet_wrap(~ year, ncol = 5) +
theme(legend.position = "bottom") +
labs(colour = "Year")
Comparison of pedestrian patterns across Jan-Apr 2018 and 2020
The plot shows a comparison of the pedestrian counts at different times throughout the days in April 2018 and 2020 at Flinders Street Station Underpass. Both years have peaks in the morning (7am) and afternoon (3pm); however, the peaks in the 2020 plot are significantly lower. The lower pedestrian counts in 2020 are likely associated with the Coronavirus restrictions, which saw many daily commuters working from home instead of in the CBD, reducing foot traffic in the CBD dramatically.
What do you learn? Which Sensors seem the most similar? Or the most different?
melb_walk_hols_april <- melb_walk_hols %>% filter(month == 4)
ggplot(melb_walk_hols_april,
aes(x = Time,
y = Count,
group = mday,
color = as.factor(year))) +
geom_line() +
facet_grid(Sensor ~ year) +
theme(legend.position = "bottom") +
labs(colour = "Year")
Breakdown of pedestrian patterns into locations and year
Southern Cross Station and Flinders Street Station Underpass have the most similar trends. While the 2020 pedestrian counts are dramatically lower, they still depict similar peaks, with Southern Cross and Flinders Street Station Underpass showing peaks at the beginning of the day (approx 7am) and in the afternoon (around 3-4pm). Melbourne Central has a gradual increase from morning until around 4pm and decreases gradually thereafter in both 2018 and 2020, but with lower numbers in 2020. Birrarung Marr shows a significant difference in pedestrian counts, which is likely due to the lack of events due to the social distancing and associated Coronavirus regulations implemented by the Victorian Government in 2020.
Combining weather data with pedestrian counts
One question we want to answer is: “Does the weather make a difference to the number of people walking out?”
Time of day and day of week are the predominant driving forces behind the number of pedestrians, as depicted in the previous data plots. Apart from these temporal factors, weather conditions could also affect how many people are walking in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or when it is raining hard.
Daily meteorological data as a separate source, available on National Climatic Data Center, is used and joined to the main pedestrian data table using common dates.
Binary variables are created to serve as the tipping points
We have pulled information on weather stations for the Melbourne area - can you combine it together into one dataset?
* `prcp` > 5 (if yes, "rain", if no, "none")
* `tmax` > 33 (if yes, "hot", if no, "not")
* `tmin` < 6 (if yes, "cold", if no, "not")

# Now create some flag variables
melb_weather_2018 <- read_csv(
here::here("data-raw/melb_ncdc_2018.csv")
) %>%
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.
Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.
melb_daily_walk_2018 <- melb_walk_hols %>%
filter(year == 2018) %>%
group_by(Sensor, Date) %>%
summarise(Count = sum(Count, na.rm = TRUE)) %>%
ungroup()
melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
left_join(melb_weather_2018, by = c("Date" = "date"))
melb_daily_walk_weather_2018
## # A tibble: 480 x 10
## Sensor Date Count station tmax tmin prcp high_prcp high_temp
## <chr> <date> <int> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Birra~ 2018-01-01 8385 ASN000~ 26.2 14 0 none not
## 2 Birra~ 2018-01-02 9844 ASN000~ 23.6 15.5 0 none not
## 3 Birra~ 2018-01-03 4091 ASN000~ 22.3 11.2 0 none not
## 4 Birra~ 2018-01-04 1386 ASN000~ 25.5 11.5 0 none not
## 5 Birra~ 2018-01-05 1052 ASN000~ 30.5 12.2 0 none not
## 6 Birra~ 2018-01-06 6672 ASN000~ 41.5 16.6 0 none hot
## 7 Birra~ 2018-01-07 1567 ASN000~ 22 15.7 0 none not
## 8 Birra~ 2018-01-08 6201 ASN000~ 23.6 15.9 0 none not
## 9 Birra~ 2018-01-09 8403 ASN000~ 22.8 13.9 0 none not
## 10 Birra~ 2018-01-10 12180 ASN000~ 25.5 12.1 0 none not
## # ... with 470 more rows, and 1 more variable: low_temp <chr>
Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (`high_prcp`, `high_temp`, and `low_temp`). Write a paragraph that tells us what you learn from these plots, and how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots summarised like this: what assumptions do they make?
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_prcp)) +
geom_boxplot() +
theme(legend.position = "bottom")
Summary of pedestrian counts over rainy and non-rainy days in the first quarter of 2018
# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = high_temp)) +
geom_boxplot() +
theme(legend.position = "bottom")
Summary of pedestrian counts over hot and not-hot days in the first quarter of 2018
# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = low_temp)) +
geom_boxplot() +
theme(legend.position = "bottom")
Summary of pedestrian counts over cold and not-cold days in the first quarter of 2018
Write a paragraph that tells us what you learn from these plots, and how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots summarised like this: what assumptions do they make?
The three box plots above show a summary of the pedestrian counts across the four selected locations (Birrarung Marr, Flinders Street Station Underpass, Melbourne Central and Southern Cross Station) under different weather conditions. A strength of using a boxplot to represent this information is the ability to get a direct view of summary statistics such as the median, quartiles, interquartile range and any outliers across a large number of observations, and to compare them with each other. However, a limitation of using a boxplot is that deviations of specific observations can be hidden among the summarised values.
The first boxplot shows how precipitation/rain impacts whether individuals will go outside or not. Birrarung Marr attracted similar crowds on average regardless of rain. However, the counts of pedestrians at Birrarung Marr on rainy days seem to be less dispersed, whereas the counts are more dispersed on other days, with quite a few outliers. Flinders Street Station, Southern Cross Station and Melbourne Central had much higher pedestrian counts when there was no rain. The single outlier of high pedestrian counts at Southern Cross Station was recorded on the 30th of January 2018, which may be due to a temporary bus/tram disruption attracting more crowds to the train station.
The second boxplot shows how hot weather impacts whether pedestrians will go outside or not. Birrarung Marr attracted much larger crowds when it was hot than when it was not. However, the plot shows a few outliers on days which weren't hot, which may be attributed to events such as the Moomba festival in February and the festival with the Canadian Club Racquet Club in January. Flinders Street Station Underpass, Melbourne Central and Southern Cross Station had higher pedestrian counts when it was not hot. This may be because pedestrians are reluctant to catch public transport when it is hot, as it is more uncomfortable.
The third boxplot shows how cold weather impacts whether pedestrians will go outside or not. This plot offers limited insights, as the temperature is very rarely below 6 degrees between 1 January and 30 April, since these are the warmer months of the year. Therefore, few insights can be drawn from it.
Missing values in some of the observations used to calculate the aggregated data behind these plots could also be considered a limitation. A breakdown of the missing pedestrian count observations for each sensor is shown below. However, these missing values could be considered negligible, as all of them are lower than 1% of the total observations at each location.
melb_walk_hols %>%
group_by(Sensor) %>%
miss_var_summary() %>%
filter(variable == "Count") %>%
select(-variable) %>%
kable(col.names = c("Sensor",
"Number of missing values",
"Percentage of missing values"),
caption = "Missing pedestrian count values on each sensor") %>%
kable_styling("striped")
| Sensor | Number of missing values | Percentage of missing values |
|---|---|---|
| Melbourne Central | 1 | 0.0172891 |
| Flinders Street Station Underpass | 1 | 0.0172891 |
| Birrarung Marr | 49 | 0.8471646 |
| Southern Cross Station | 33 | 0.5705394 |
The visualisations tell us something interesting about the data, but to really understand the data, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020, combine with the pedestrian data for 2018 and 2020.
melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv"))
melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv"))
# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
melb_weather_2020) %>%
# remember to add info about high precipitation, high temperature, + low temps
mutate(high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>%
left_join(melb_weather, by = c("Date" = "date"))
We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.
We are going to process the data somewhat so we can fit a linear model to it. First, let's set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.
We also add one to the count and then take the natural log of it. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model to fit to this data, and we can help it be more ideal by taking the log of the counts, which helps stabilise the residuals (observed - predicted) when we fit the model.
melb_walk_weather_prep_lm <- melb_walk_weather %>%
mutate_at(.vars = vars(Sensor,
Time,
month,
year,
workday,
high_prcp,
high_temp,
low_temp),
as_factor) %>%
mutate(log_count = log1p(Count))
Now we fit a linear model, predicting `log_count` using Time, month, weekday and the weather flag variables (`high_prcp`, `high_temp`, and `low_temp`).
walk_fit_lm <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
Provide some summary statistics on how well this model fits the data. What do you see? What statistics tell us about our model fit?
glance(walk_fit_lm)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.524 0.524 1.21 669. 0 36 -34259. 68592. 68887.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
The R squared value is a measure of the strength of the linear model's fit to the data; here it is approximately 0.525. This means that approximately 52.5% of the variability in the log pedestrian counts can be explained by time, month, weekday, rain, high temperature and low temperature.
We have had one look at the model fit statistics, but let's now look at the fitted and observed (`log_count`) values, for each sensor:
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
ggplot(peds_aug_lm,
aes(x = log_count,
y = .fitted)) +
geom_point() +
facet_wrap(~Sensor)
Comparison of observed and fitted values for the pedestrian counts based on the linear model
There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the data. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components. We can do this with a calendar plot.
Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, Model, and log_count.
flinders_lm <- peds_aug_lm %>%
# filter the data to only look at flinder st
filter(Sensor == "Flinders Street Station Underpass") %>%
# Select the Date, Time, Count, .fitted, and log_count
select(c("Date", "Time", "Count", ".fitted", "log_count")) %>%
# Now pivot the log count and fitted columns into a column called "model"
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(c("log_count", ".fitted"), names_to = "model", values_to = "log_count") %>%
# Now we're going to undo the initial data transformation
mutate(Count = expm1(log_count))
flinders_lm_cal_2018 <- flinders_lm %>%
# Let's just look at 2018 to make it a bit easier
filter(year(Date) == "2018") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- ggplot(flinders_lm_cal_2018) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal_2018, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal_2018, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) + theme(legend.position = "bottom")
Observed and fitted values for Flinders Street Station Underpass in 2018
flinders_lm_cal_2020 <- flinders_lm %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- ggplot(flinders_lm_cal_2020) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal_2020, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal_2020, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) + theme(legend.position = "bottom")
Observed and fitted values for Flinders Street Station Underpass in 2020
Write a paragraph answering these questions:
Comparing the plots for the first quarters of 2018 and 2020 at Flinders Street Station Underpass, we can see that in the 2020 plot the observed values are much higher than the fitted values for January, February and the first half of March, whereas for the second half of March and April the fitted and observed values are quite similar. In the 2018 plot, the observed values are higher than the fitted values throughout the entire first quarter, which makes it look very different from the 2020 plot over April and the second half of March.
A suggestion to improve the model would be to include workday, year and Sensor as variables within the model. Whether a day is a workday strongly affects pedestrian counts, with Southern Cross Station busier on workdays and Birrarung Marr busier on non-workdays. Furthermore, 2020 has shown a reduction in pedestrian counts due to the Coronavirus restrictions. The sensor has also been shown to impact the pedestrian counts, with some sensors being more widely used depending on the circumstance, day, time, and so on. Therefore, there seem to be significant differences in pedestrian counts with respect to the workday, year and sensor variables, and adding these variables to the model would be a suggested improvement.
What sort of improvements do you think you could make to the model? Fit two more models, trying to add more variables to the linear model.
walk_fit_lm_sensor <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp + Sensor,
data = melb_walk_weather_prep_lm
)
walk_fit_lm_year <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp + year,
data = melb_walk_weather_prep_lm
)
Why did you add those variables?
Year has recently been shown to impact pedestrian counts due to the 2020 Coronavirus restrictions, so the model should be adjusted to reflect this. Similarly, the pedestrian counts vary between sensors at different locations, so it is appropriate to include the sensor in the model.
bind_rows(
first = glance(walk_fit_lm),
sensor = glance(walk_fit_lm_sensor),
year = glance(walk_fit_lm_year),
.id = "type"
)
## # A tibble: 3 x 12
## type r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 first 0.524 0.524 1.21 669. 0 36 -34259. 68592.
## 2 sens~ 0.690 0.689 0.981 1242. 0 39 -29719. 59518.
## 3 year 0.554 0.554 1.18 733. 0 37 -33572. 67219.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Which model does the best? Why do you think that is? What statistics are you basing this decision off?

The model including Sensor as a variable does the best, because it has the highest R squared value (0.690) and the lowest AIC of the three models. This means that about 69% of the variability in log pedestrian counts can be explained by time, month, weekday, rain, high temperature, low temperature and sensor. This is likely because each sensor has vastly different pedestrian counts, so to predict the pedestrian count more accurately, consideration must be given to the relevant sensor.
(Suggestion - Perhaps write this as a function to speed up comparison)
peds_aug_lm_sensor <- augment(walk_fit_lm_sensor, data = melb_walk_weather_prep_lm)
pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
lm_fit %>%
filter(Sensor == sensor) %>%
select(c("Date", "Time", "Count", ".fitted", "log_count")) %>%
pivot_longer(c("log_count", ".fitted"), names_to = "model", values_to = "log_count") %>%
mutate(Count = expm1(log_count))
}
calendar_fit_obs <- function(lm_fit_aug){
data_cal <- lm_fit_aug %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <-
ggplot(data_cal) +
geom_line(data = filter(data_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(data_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) + theme(legend.position = "bottom")
}
pivot_sensor(peds_aug_lm_sensor, "Flinders Street Station Underpass") %>%
filter(year(Date) == "2020") %>%
calendar_fit_obs()
Observed and fitted values for Flinders Street Station Underpass in 2020 using the improved model
What do you see? How does it compare to the previous model?
!> Answer: The new model seems to be very accurate over January and February, with the observed and fitted values being very similar. However, the accuracy seems to decline over the days in March and continues to do so in April, with the fitted values being higher than the observed values.
If we compare the first model (considering only Time, month, weekday, high_prcp, high_temp and low_temp as explanatory variables in predicting the pedestrian counts at Flinders Street Station Underpass in the first quarter of 2020) with the new model (which also uses Sensor along with the rest of the explanatory variables of the first model), we can see that the new model is much more accurate from January to mid-March. However, the first model seems to be more accurate from mid-March through April.
An interesting observation is that social distancing restrictions due to COVID-19 were imposed mid-March which coincides with the reduction in accuracy of the new model, and also the improvement of accuracy of the first model. The new model doesn’t seem to have captured the sudden decrease in pedestrian counts due to restrictions imposed due to COVID-19.
Compare the fitted values against the residuals; perhaps write a function to help you do this in a more readable way.
walk_fit_lm_aug <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
walk_fit_lm_sensor_aug <- augment(walk_fit_lm_sensor, melb_walk_weather_prep_lm)
walk_fit_lm_year_aug <- augment(walk_fit_lm_year, melb_walk_weather_prep_lm)
plot_fit_resid <- function(data){
ggplot(data,
aes(x = .fitted,
y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, colour = "white", size=1) +
facet_wrap(~Sensor)
}
plot_fit_resid(walk_fit_lm_aug) +
ggtitle("The dispersion of Residuls when time, month, weekday, precipitation, high temperature and low temperature were used as explanatory variables for pedestrian counts")
Dispersion of residuals over the fitted values for the base model
plot_fit_resid(walk_fit_lm_sensor_aug) +
ggtitle("The dispersion of Residuls when sensor was used in addition to time, month, weekday, precipitation, high temperature and low temperature were used as explanatory variables for pedestrian counts")
Dispersion of residuals over the fitted values for the model that adds sensor
plot_fit_resid(walk_fit_lm_year_aug) +
ggtitle("The dispersion of Residuls when year was used in addition to time, month, weekday, precipitation, high temperature and low temperature were used as explanatory variables for pedestrian counts")
Dispersion of residuals over the fitted values for the model that adds year
We’ve looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?
!> Answer: Observing the dispersion of residuals across the three models, all three look quite similar, with only minor differences.
The first model, using time, month, weekday, precipitation, high temperature and low temperature as explanatory variables, has more positive residuals (fitted values lower than the observed values) for pedestrian counts at Melbourne Central and Flinders Street Station Underpass, and more negative residuals (fitted values higher than the observed values) at Birrarung Marr and Southern Cross Station. The majority of the residuals lie between -4 and 4. This model therefore does not seem the best fit for any of the locations.
The second model, which adds the sensor to the explanatory variables of the first model, has its residuals dispersed fairly evenly around zero, with the majority of the residuals lying between -2.5 and 2.5.
The third model, which adds the year to the explanatory variables of the first model, has more positive residuals (fitted values lower than the observed values) at Melbourne Central and Flinders Street Station Underpass, and more negative residuals (fitted values higher than the observed values) at Birrarung Marr and Southern Cross Station. However, the majority of the residuals lie between -3 and 3.
Therefore, the second model, using time, month, weekday, precipitation, high temperature, low temperature and sensor as explanatory variables, is the best fit of the three models for predicting pedestrian counts. It can also be concluded that there is a strong dependency between the location (the sensor) and the pedestrian counts.
A further suggestion would be to include both the year and sensor as explanatory variables in determining the pedestrian counts.
walk_fit_lm_sensor_year <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp + Sensor + year,
data = melb_walk_weather_prep_lm
)
glance(walk_fit_lm_sensor_year)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.720 0.719 0.932 1397. 0 40 -28639. 57360. 57686.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
walk_fit_lm_sensor_year_aug <- augment(walk_fit_lm_sensor_year, data = melb_walk_weather_prep_lm)
plot_fit_resid(walk_fit_lm_sensor_year_aug) +
ggtitle("The dispersion of Residuls when time, month, weekday, precipitation, high temperature, low temperature, sensor and year are used as explanatory variables for pedestrian counts")
On using year and sensor in addition to the initial explanatory variables, an R squared value of approximately 0.72 was obtained. This means that about 72% of the variability in log pedestrian counts can be explained by time, month, weekday, rain, high temperature, low temperature, sensor and year. This is likely because pedestrian counts are strongly related to the location of the sensor and also to the year, as a result of the social distancing restrictions imposed in 2020.
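To compare this best model's predicted values against the observed values over time (as asked above), we could reuse the helper functions defined earlier. A minimal sketch, again for Flinders Street Station Underpass in 2020:

```{r best-model-calendar, eval = FALSE}
# Calendar plot of observed vs fitted values from the sensor + year model
pivot_sensor(walk_fit_lm_sensor_year_aug, "Flinders Street Station Underpass") %>%
  filter(year(Date) == "2020") %>%
  calendar_fit_obs()
```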
Make sure to reference all of the R packages that you used here, along with any links or blog posts that you used to help you answer these questions.
David Robinson and Alex Hayes (2020). broom: Convert Statistical Analysis Objects into Tidy Tibbles. R package version 0.5.6. https://CRAN.R-project.org/package=broom
Diethelm Wuertz, Tobias Setz, Yohan Chalabi, Martin Maechler and Joe W. Byers (2018). timeDate: Rmetrics - Chronological and Calendar Objects. R package version 3043.102. https://CRAN.R-project.org/package=timeDate
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
Earo Wang (2020). rwalkr: API to Melbourne Pedestrian Data. R package version 0.5.2. https://CRAN.R-project.org/package=rwalkr
Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/.
Hadley Wickham, Jim Hester and Romain Francois (2018). readr: Read Rectangular Text Data. R package version 1.3.1. https://CRAN.R-project.org/package=readr
Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra
Kirill Müller (2017). here: A Simpler Way to Find Your Files. R package version 0.1. https://CRAN.R-project.org/package=here
Wang, E, D Cook, and RJ Hyndman (2020). Calendar-based graphics for visualizing people’s daily schedules. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2020.1715226.
Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support exploration and modeling of temporal data. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2019.1695624.
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.28.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
Caption all the figures and spell check.
The code below is what was used to retrieve the data in the `data-raw` folder.
# melb_bbox <- c(min(ped_loc$longitude) - .001,
# min(ped_loc$latitude) - 0.001,
# max(ped_loc$longitude) + .001,
# max(ped_loc$latitude) + 0.001)
#
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
# path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
# compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
# file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
# col_names = c("ID",
# "lat",
# "lon",
# "elev",
# "state",
# "name",
# "v1",
# "v2",
# "v3"),
# skip = 353,
# n_max = 17081
# ) %>%
# filter(state == "MELBOURNE AIRPORT")
# #
# get_ncdc <- function(year){
# vroom::vroom(
# glue::glue(
# "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
# ),
# col_names = FALSE,
# col_select = 1:4
# )
# }
#
# clean_ncdc <- function(x){
# x %>%
# filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
# rename_all(~ c("station", "date", "variable", "value")) %>%
# mutate(date = ymd(date), value = value / 10) %>%
# pivot_wider(names_from = variable, values_from = value) %>%
# rename_all(tolower)
# }
# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
#
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))