Instructions to Students

This assignment is designed to simulate a scenario where you are given a dataset, take over someone’s existing work, and continue it to draw some further insights. In that sense it is similar to Assignment 1, but it provides less scaffolding and asks you to draw more insights, as well as do more communication.

Your previous submission of crime data was well received!

You’ve now been given a different next task to work on. Your colleague at your consulting firm, Amelia (in the text treatment below) has written some helpful hints throughout the assignment to help guide you.

Questions that are worth marks are indicated with “Q” at the start of the question, and the number of marks in parentheses at the end. For example

## Q1A some text (0.5 marks)

is question one, part A, worth 0.5 marks.

Marking + Grades

This assignment will be worth 10% of your total grade, and is marked out of 58 marks total.

  • 9 Marks for grammar and clarity. You must write in complete sentences and do a spell check.

  • 9 Marks for presentation of the data visualisations

  • 40 marks for the questions

    • Part 1: 7 Marks
    • Part 2: 17 Marks
    • Part 3: 16 Marks
  • Your marks will be weighted according to peer evaluation.

A Note on skills

As of week 6, you have seen most of the code needed for parts 1 and 2, and Week 7 will give you the skills to complete part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering modelling skills in the coming weeks, but this assignment is designed to simulate a real-life work situation - this means there are some things you will need to “learn on the job”. The vast majority of the assignment, however, covers things that you will have seen in class, or in the readings.

Remember, you can look up the help file for functions by typing ?function_name. For example, ?mean. Feel free to google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.

How to complete this assignment

To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with *** or ___. At a minimum, your assignment should be able to be “knitted” using the knit button for your Rmarkdown document.

If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to eval = FALSE like so:

```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```

If you do this, please remember to ensure that you remove this chunk option or set it to eval = TRUE when you submit the assignment, to ensure all your R code runs.

You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:

  • Each member of the group completes the entire assignment, as best they can.
  • Group members compare answers and combine them into one document for the final submission.

Your assignments will be peer reviewed, and results checked for reproducibility. This means:

  • 25% of the assignment grade will come from peer evaluation.
  • Peer evaluation is an important learning tool.

Each student will be randomly assigned another team’s submission to provide feedback on three things:

  1. Could you reproduce the analysis?
  2. Did you learn something new from the other team’s approach?
  3. What would you suggest to improve their work?

Due Date

This assignment is due by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team’s name. For example, if you are team dplyr, your assignment file name could read: “assignment-2-2020-s1-team-dplyr.Rmd”

Treatment

You work as a data scientist at the well-named consulting company, “Consulting for You”.

On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:

Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner city Melbourne! Amelia is still in New Zealand, and won’t be back for a while. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.

We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.

Most importantly, can you get this to me by Wednesday 20th May, COB? (COB = Close of Business, at 6pm.)

I’ve given this dataset to some of the other new hire data scientists as well; you’ll all be working as a team on this dataset. I’d like you to all try and work on the questions separately, and then combine your answers together to provide the best results.

From here, you are handed a USB stick. You load this into your computer, and you see a folder called “melbourne-walk”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.

Note that the text here was originally written by Amelia, so make sure that their name is kept up top, and pay attention to what they have to say in the document!

Overview

The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and available for download from Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.

There are three parts to this work:

  1. Data wrangling and data visualisation of the pedestrian data
  2. Joining the pedestrian data with weather data
  3. Performing preliminary modelling

Part 1: Data wrangling and data visualisation of the pedestrian data

Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to use different shapes for the following sensors: Southern Cross Station, Melbourne Central, Flinders Street Station Underpass, and Birrarung Marr.

First we download the data on the pedestrian sensor locations around Melbourne.
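
A minimal sketch of what that download chunk might look like, assuming the `pull_sensor()` function from rwalkr (check `?pull_sensor`, as the exact column names can differ across package versions):

```r
library(rwalkr)

# one row per sensor, with its coordinates and installation date
ped_loc <- pull_sensor()
```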

And now we draw a plot on a map tile of the pedestrian sensor locations
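
The plotting chunk is hidden here; a hedged sketch of one way to draw it, assuming the map tile was saved as `melb-map.rds` in `data-raw` (see the download code at the end of this document) and that `ped_loc` has `longitude` and `latitude` columns:

```r
library(ggmap)

melb_map <- readr::read_rds(here::here("data-raw/melb-map.rds"))

# overlay every sensor on the map tile; a `shape` aesthetic mapped to a
# flag column could then highlight the four sensors of interest
ggmap(melb_map) +
  geom_point(data = ped_loc,
             aes(x = longitude, y = latitude),
             colour = "#337ab7", alpha = 0.7) +
  labs(x = "Longitude", y = "Latitude")
```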

Q1A What does this map show us? (0.5 Mark)

!> Answer: This map displays the locations of the pedestrian sensors around Melbourne, with longitude plotted on the x-axis and latitude on the y-axis. From this, we can visualise the distribution of the sensors around the city, particularly the four of interest for this analysis: Birrarung Marr, Flinders Street Station Underpass, Melbourne Central, and Southern Cross Station.

Q1B Create a markdown table of the number of sensors installed each year (0.5 Mark)

```r
ped_loc %>% 
  # calculate the year from the date information
  mutate(year = year(installation_date)) %>%
  # count up the number of sensors
  count(year) %>%
  # then use `kable()` to create a markdown table
  kable(caption = "Number of sensors installed each year", col.names = c("Year", "Count")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

Number of sensors installed each year

| Year | Count |
|------|-------|
| 2009 | 13    |
| 2013 | 12    |
| 2014 | 2     |
| 2015 | 7     |
| 2016 | 1     |
| 2017 | 9     |
| 2018 | 5     |
| 2019 | 6     |
| 2020 | 4     |

Additionally, how many sensors were added in 2016, and in 2017?

!> Answer: From this table it appears that 1 sensor was added in 2016 and 9 were added in 2017.

Q1C Filter the data down to just look at the selected four sensors (1 mark)

We would like you to focus on the foot traffic at 4 sensors:

  • “Southern Cross Station”
  • “Melbourne Central”
  • “Flinders Street Station Underpass”
  • “Birrarung Marr”

Your task is to:

  • Extract the data for the year 2018 from Jan 1 to April 30 using rwalkr (you might want to cache this so it doesn’t run every time you knit - see the sketch after this list)

  • Filter the data down to include only the four sensors above; note that you could do this directly with rwalkr

  • Add variables that contain the day of the month, month, year, and day of the year.
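
A sketch of the download step, assuming the `melb_walk()` function from rwalkr with `from`/`to` date arguments (setting the chunk option `cache = TRUE` stops it re-running on every knit); this creates the `walk_2018` object used below:

```r
library(rwalkr)

# hourly pedestrian counts from 1 January to 30 April 2018
walk_2018 <- melb_walk(from = as.Date("2018-01-01"),
                       to   = as.Date("2018-04-30"))
```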

```r
walk_2018 <- walk_2018 %>% 
  # Filter the data down to include only the four sensors above
  filter(Sensor %in% c("Southern Cross Station",
                       "Melbourne Central",
                       "Flinders Street Station Underpass",
                       "Birrarung Marr")) %>%
  # now add four columns, containing the day of the month, month, year,
  # and day of the year, using functions from lubridate
  mutate(mday = mday(Date),
         month = month(Date),
         year = year(Date),
         yday = yday(Date))
```

Now we can plot the pedestrian count information for January - April in 2018

```r
ggplot(walk_2018,
       aes(x = Date_Time, 
           y = Count)) +
  geom_line(size = 0.3) +
  facet_grid(Sensor ~ ., 
             # this code presents the facets in a nice way
             labeller = labeller(Sensor = label_wrap_gen(20))) +
  # this code makes the x axis a bit easier to read
  scale_x_datetime(date_labels = "%d %b %Y", 
                   date_minor_breaks = "1 month") +
  labs(x = "Date Time") + 
  ggtitle("Pedestrian count against time for the selected sensors")
```

We can see that there are quite different patterns in each of the sensors. Let’s explore this further.

Q1D How many types of activities might this capture at the four selected places? (0.5 marks)

!> Answer: The sensors capture two different types of activities at each location: people entering from one direction, and people exiting from another.

Q1E Create a plot of the counts of each sensor over the day (2 Marks)

We’re primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people’s daily schedules, let’s plot the counts against time of day for each sensor.

```r
ggplot(walk_2018,
       aes(x = Time,
           y = Count,
           group = Date,
           colour = Sensor)) +
  geom_line() +
  facet_wrap(~ Sensor,
             labeller = labeller(Sensor = label_wrap_gen(20))) +
  scale_colour_brewer(palette = "Dark2",
                      name = "Sensor") +
  theme(legend.position = "none") +
  ggtitle("Counts for each sensor dependent on time of day")
```

Write a short paragraph that describes what the plot shows:

  • What is plotted on the graph? What does each line represent?
  • Is the data in these sensors similar, or different?
  • Does each panel show the same trend within it, or is there variation?
  • What do you learn?

!> Answer: This plot depicts the number of pedestrians counted by the various sensors at different times of the day, with time on the x-axis and count on the y-axis. Each line represents a different day recorded over the period. The data for Flinders Street Station Underpass and Southern Cross Station are quite similar, with significant peaks at around 8am and 5pm - likely due to increased traffic around the railway stations at the start and end of the work day. This pattern remains quite consistent across the whole period, except that Southern Cross Station has quite a few days with almost no traffic; these days possibly represent weekends, when fewer pedestrians are accessing the station. Meanwhile, Birrarung Marr experiences much more variation in the count variable, with no clear trend emerging. Melbourne Central is the most consistent of the four sensors, experiencing a steady level of traffic from 12pm to 9pm on most days recorded - perhaps because most people shop at the centre during these times.

Q1F Exploring non work days (2 marks)

Use the hols_2018 holiday data to identify weekdays, weekends, and holidays.

```r
hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")

walk_2018_hols <- walk_2018 %>% 
  mutate(weekday = wday(Date, label = TRUE, week_start = 1),
         workday = if_else(
           condition = Date %in% hols_2018$date | weekday %in% c("Sat", "Sun"),
           true = "no",
           false = "yes"))
```

Now create a plot to compare the workdays to the non workdays.

```r
ggplot(walk_2018_hols,
       aes(x = Time,
           y = Count,
           group = Date,
           colour = workday)) +
  geom_line(size = 0.3,
            alpha = 0.3) +
  facet_grid(weekday ~ Sensor,
             labeller = labeller(Sensor = label_wrap_gen(20))) +
  scale_colour_brewer(palette = "Dark2", name = "Workday") +
  theme(legend.position = "none") +
  ggtitle("Counts for working days against non-working days")
```

Write a short paragraph that describes what the plot shows, and helps us answer these questions:

  • What is plotted on the graph? What does each line represent?
  • How are the data in these sensors similar or different?
  • Does each panel show the same trend within it, or is there variation?
  • What do you learn?

!> Answer: This plot depicts the number of pedestrians counted by the various sensors at different times of the day, with time on the x-axis and count on the y-axis. Each line represents a different day over the period, with the orange lines depicting work days and the green lines depicting weekends or public holidays. Flinders Street Station Underpass and Southern Cross Station both experience higher counts on working days, confirming the assumption made above that these sensors rely on workday traffic, with Southern Cross Station in particular experiencing almost no traffic on Saturday and Sunday. Meanwhile, the consistent trend for Melbourne Central holds true for both working and non-working days. With Birrarung Marr, by contrast, pedestrian traffic is much lower on working days and correspondingly higher on weekends and public holidays in particular. This may be due to various events being held in the area on these days.

Q1G Calendar plot (0.5 mark)

To locate those unusual moments, the Flinders Street Station data is calendarised on the canvas, using the sugrrants package. We can spot the unusual weekday patterns on public holidays using their colour. Using the calendar plot, try to spot another unusual pattern, and do a google search to try to explain the change in foot traffic. (Hint: look at the plot carefully - does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)

```r
# filter to just look at Flinders St station
flinders <- walk_2018_hols %>%
  filter(Sensor == "Flinders Street Station Underpass")

flinders_cal <- flinders %>% 
  frame_calendar(x = Time, y = Count, date = Date)

gg_cal <- flinders_cal %>% 
  ggplot(aes(x = .Time, y = .Count, colour = workday, group = Date)) +
  geom_line()

prettify(gg_cal) +
  theme(legend.position = "bottom") +
  ggtitle("Calendar plot of pedestrian counts for the Flinders Street Station Underpass")
```

!> Answer: The weekend of the third week of February exhibits a distinct pattern in foot traffic, rising until the end of the day at 12am and then declining over the following few hours of the next day. This is likely due to the White Night event held on February 17th of that year, which took place from 7pm on Saturday until 7am on Sunday and was described as a “twelve hour overnight celebration of creativity held in Melbourne’s CBD”. As such, there was a higher number of pedestrians moving around the city than usual, contributing to the higher counts recorded. Additionally, this weekend also fell during Chinese New Year, which may have contributed to the higher-than-average number of people.

Part 2: Combining data sources

Q2A Extract the pedestrian count data for 2020 from Jan 1 to April 30, and add it to the data (2 marks)

You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.
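
A sketch of the 2020 download, under the same assumptions as the 2018 one (`melb_walk()` from rwalkr, cached so it does not re-run on every knit):

```r
# hourly pedestrian counts from 1 January to 30 April 2020
walk_2020 <- melb_walk(from = as.Date("2020-01-01"),
                       to   = as.Date("2020-04-30"))
```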

```r
walk_2020 <- walk_2020 %>% 
  # Filter the data down to include only the four sensors above
  filter(Sensor %in% c("Southern Cross Station",
                       "Melbourne Central",
                       "Flinders Street Station Underpass",
                       "Birrarung Marr")) %>%
  # now add four columns, containing the day of the month, month, year,
  # and day of the year, using functions from lubridate
  mutate(mday = mday(Date),
         month = month(Date),
         year = year(Date),
         yday = yday(Date))
```

Now add the holiday data

```r
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")

walk_2020_hols <- walk_2020 %>% 
  mutate(weekday = wday(Date, label = TRUE, week_start = 1),
         workday = if_else(
           condition = Date %in% hols_2020$date | weekday %in% c("Sat", "Sun"),
           true = "no",
           false = "yes"))

melb_walk_hols <- bind_rows(walk_2018_hols, walk_2020_hols)
```

Q2B There is some repetition in the code above from the previous section; describe two (or more) ways to limit that repetition, and demonstrate the use of this as a function (1 Mark)

The code from the previous section can be written more efficiently using functions. Three functions are created to achieve this and are displayed below. The first function filters the data so that only the four sensors of interest are included. The second adds the required columns for the day of the month, the month, the year, and the day of the year. The third function uses the holiday dates from the tsibble package to distinguish between working and non-working days.

```r
sensors_of_interest <- c("Southern Cross Station",
                         "Melbourne Central",
                         "Flinders Street Station Underpass",
                         "Birrarung Marr")

filter_sensor <- function(data, sensors){
  # keep only the rows for the sensors we are interested in
  data %>% filter(Sensor %in% sensors)
}

add_day_info <- function(data){
  # add four columns with `mutate`: the day of the month, month, year,
  # and day of the year, using functions from lubridate
  data %>% mutate(mday = mday(Date),
                  month = month(Date),
                  year = year(Date),
                  yday = yday(Date))
}

add_working_day <- function(data){
  # look up the public holidays for the years present in the data
  walk_years <- unique(data$year)
  hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
  
  data %>% 
    mutate(weekday = wday(Date, label = TRUE, week_start = 1),
           workday = if_else(
             condition = Date %in% hols$date | weekday %in% c("Sat", "Sun"),
             true = "no",
             false = "yes"))
}

# Step one, combine the walking data
bind_rows(walk_2018, walk_2020) %>%
  # Step two, filter the sensors
  filter_sensor(sensors = sensors_of_interest) %>%
  # step three, add the info on day of the year, month, etc
  add_day_info() %>%
  # step four, add info on working days
  add_working_day() %>%
  head(10) %>%
  kable(caption = "Pedestrian count data for 2018 & 2020 from Melbourne Central, Flinders Street Station Underpass, Southern Cross Station & Birrarung Marr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```
Pedestrian count data for 2018 & 2020 from Melbourne Central, Flinders Street Station Underpass, Southern Cross Station & Birrarung Marr

| Sensor | Date_Time | Date | Time | Count | mday | month | year | yday | weekday | workday |
|---|---|---|---|---|---|---|---|---|---|---|
| Melbourne Central | 2018-01-01 00:00:00 | 2018-01-01 | 0 | 2996 | 1 | 1 | 2018 | 1 | Mon | no |
| Flinders Street Station Underpass | 2018-01-01 00:00:00 | 2018-01-01 | 0 | 3443 | 1 | 1 | 2018 | 1 | Mon | no |
| Birrarung Marr | 2018-01-01 00:00:00 | 2018-01-01 | 0 | 1828 | 1 | 1 | 2018 | 1 | Mon | no |
| Southern Cross Station | 2018-01-01 00:00:00 | 2018-01-01 | 0 | 1411 | 1 | 1 | 2018 | 1 | Mon | no |
| Melbourne Central | 2018-01-01 01:00:00 | 2018-01-01 | 1 | 3481 | 1 | 1 | 2018 | 1 | Mon | no |
| Flinders Street Station Underpass | 2018-01-01 01:00:00 | 2018-01-01 | 1 | 3579 | 1 | 1 | 2018 | 1 | Mon | no |
| Birrarung Marr | 2018-01-01 01:00:00 | 2018-01-01 | 1 | 1143 | 1 | 1 | 2018 | 1 | Mon | no |
| Southern Cross Station | 2018-01-01 01:00:00 | 2018-01-01 | 1 | 436 | 1 | 1 | 2018 | 1 | Mon | no |
| Melbourne Central | 2018-01-01 02:00:00 | 2018-01-01 | 2 | 1721 | 1 | 1 | 2018 | 1 | Mon | no |
| Flinders Street Station Underpass | 2018-01-01 02:00:00 | 2018-01-01 | 2 | 3157 | 1 | 1 | 2018 | 1 | Mon | no |

Q2C For Flinders St, in the month of April, can you compare 2018 to 2020? (2 Marks)

Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to clearly see the pattern.)

```r
melb_walk_hols_flinders_april <- melb_walk_hols %>%
  filter(Sensor == "Flinders Street Station Underpass", month == 4)

ggplot(melb_walk_hols_flinders_april,
       aes(x = Time,
           y = Count,
           group = Date,
           colour = as.factor(year))) +
  geom_line() +
  facet_wrap(~ mday, ncol = 5) +
  theme(legend.position = "bottom") +
  labs(colour = "Year") +
  ggtitle("Comparison of pedestrian counts for April 2018 and April 2020 for the Flinders Street Station Underpass")
```

!> Answer: This plot depicts the number of pedestrians counted by the Flinders Street Station Underpass sensor over the month of April, with time on the x-axis and count on the y-axis. Each line represents an individual day of the month, with the orange lines showing 2018 and the blue ones 2020. Quite clearly, the counts for 2018 are consistently higher than those for 2020 on every day of the month, with the values for 2020 being close to 0. This is likely due to the COVID-19 outbreak and the subsequent lockdown laws in the city, which prevented travel for non-essential reasons and closed many workplaces, so fewer pedestrians passed the sensor; 2018 saw no such restrictions. Counts for the 2nd and the 25th are lower than average in 2018 due to the public holidays on Easter Monday and Anzac Day respectively, whereas 2020 saw no significant drops on these days compared to similar days of the week.

Q2D Produce a similar plot (to 2C) that will also allow us to contrast the patterns across the four sensors. (2 marks)

What do you learn? Which Sensors seem the most similar? Or the most different?

```r
melb_walk_hols_april <- melb_walk_hols %>% filter(month == 4)

ggplot(melb_walk_hols_april,
       aes(x = Time,
           y = Count,
           group = Date,
           colour = as.factor(year))) +
  geom_line() +
  facet_grid(mday ~ Sensor) +
  theme(legend.position = "bottom") +
  labs(colour = "Year") +
  ggtitle("Comparison of pedestrian counts for April 2018 and April 2020 for all four sensors")
```

!> Answer: This plot depicts the number of pedestrians counted by each of the four sensors over the month of April, with time on the x-axis and count on the y-axis. Each line represents an individual day of the month; the orange lines depict 2018 and the blue lines 2020. Birrarung Marr appears the most similar across the two years, both experiencing periods of low pedestrian traffic, although 2020 is missing the various peaks seen in 2018. Flinders Street Station Underpass shows the greatest difference between the two years, with a consistently low level of traffic throughout the day in 2020 rather than the two peaks and generally higher level seen in 2018. Likewise, both Melbourne Central and Southern Cross Station are down in 2020 compared to their 2018 levels.

Combining weather data with pedestrian counts

One question we want to answer is: “Does the weather make a difference to the number of people walking out?”

Time of day and day of week are the predominant driving forces of the number of pedestrians, as depicted in the previous data plots. Apart from these temporal factors, the weather could possibly affect how many people are walking in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or when it is raining hard.

Daily meteorological data from a separate source, available from the National Climatic Data Center, is used and joined to the main pedestrian data table using common dates.

Binary variables are created to serve as the tipping points

We have pulled information on weather stations for the Melbourne area - can you combine it together into one dataset?

Q2E Read in and add flagging variables to the weather data (2 marks)

  • high_prcp if prcp > 5 (if yes, “rain”, if no, “none”)
  • high_temp if tmax > 33 (if yes, “hot”, if no, “not”)
  • low_temp if tmin < 6 (if yes, “cold”, if no, “not”)
```r
# Now create some flag variables 
melb_weather_2018 <- read_csv(
  here::here("data-raw/melb_ncdc_2018.csv")
  ) %>% 
  mutate(
    high_prcp = if_else(condition = prcp > 5,
                        true = "rain",
                        false = "none"),
    high_temp = if_else(condition = tmax > 33, 
                        true = "hot",
                        false = "not"),
    low_temp = if_else(condition = tmin < 6,
                       true = "cold",
                       false = "not")
  )
```

Q2F Summarise the pedestrian count data and join it to the weather data (2 marks)

The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.

Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.

```r
melb_daily_walk_2018 <- melb_walk_hols %>%
  filter(year == 2018) %>%
  group_by(Sensor, Date) %>%
  summarise(Count = sum(Count, na.rm = TRUE)) %>%
  ungroup()

melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
  left_join(melb_weather_2018, by = c("Date" = "date"))

melb_daily_walk_weather_2018 %>%
  head(10) %>%
  kable(caption = "Joined pedestrian and weather data for 2018") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```
Joined pedestrian and weather data for 2018

| Sensor | Date | Count | station | tmax | tmin | prcp | high_prcp | high_temp | low_temp |
|---|---|---|---|---|---|---|---|---|---|
| Birrarung Marr | 2018-01-01 | 8385 | ASN00086282 | 26.2 | 14.0 | 0 | none | not | not |
| Birrarung Marr | 2018-01-02 | 9844 | ASN00086282 | 23.6 | 15.5 | 0 | none | not | not |
| Birrarung Marr | 2018-01-03 | 4091 | ASN00086282 | 22.3 | 11.2 | 0 | none | not | not |
| Birrarung Marr | 2018-01-04 | 1386 | ASN00086282 | 25.5 | 11.5 | 0 | none | not | not |
| Birrarung Marr | 2018-01-05 | 1052 | ASN00086282 | 30.5 | 12.2 | 0 | none | not | not |
| Birrarung Marr | 2018-01-06 | 6672 | ASN00086282 | 41.5 | 16.6 | 0 | none | hot | not |
| Birrarung Marr | 2018-01-07 | 1567 | ASN00086282 | 22.0 | 15.7 | 0 | none | not | not |
| Birrarung Marr | 2018-01-08 | 6201 | ASN00086282 | 23.6 | 15.9 | 0 | none | not | not |
| Birrarung Marr | 2018-01-09 | 8403 | ASN00086282 | 22.8 | 13.9 | 0 | none | not | not |
| Birrarung Marr | 2018-01-10 | 12180 | ASN00086282 | 25.5 | 12.1 | 0 | none | not | not |

Q2G Exploring sensor count against weather flagging variables (4 marks)

Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (high_prcp, high_temp, and low_temp). Write a paragraph that tells us what you learn from these plots, and how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots that summarise the data like this - what assumptions do they make?

```r
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
       aes(y = Count,
           x = Sensor,
           colour = high_prcp)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Distribution of pedestrian counts for days classified as rain")

# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
       aes(y = Count,
           x = Sensor,
           colour = high_temp)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Distribution of pedestrian counts for days classified as hot")

# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
       aes(y = Count,
           x = Sensor,
           colour = low_temp)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Distribution of pedestrian counts for days classified as cold")
```

!> Answer: The above plots summarise the pedestrian count data recorded across the four sensors in Melbourne for the first four months of 2018, with the sensor on the x-axis and the daily count on the y-axis. The first plot compares days with rain to days without; generally, people appear less likely to go out on rainy days, but there is much greater variation on non-rainy days. The second plot, comparing hot days (above 33 degrees) to non-hot days, displays less difference when this variable is accounted for, with the median lower for some sensors and higher for others; likewise, the variation changes depending on which sensor is chosen. The final plot compares cold days (below 6 degrees) to non-cold days. There is limited data available for cold days, since the chosen period covers the final two months of summer and the first two months of autumn; only one day below 6 degrees was recorded, making comparisons more difficult, although the medians do appear quite similar across all four sensors. Furthermore, while these boxplots are well equipped to summarise large amounts of data and identify outliers, they are limited in that they only show summary statistics rather than each individual data point - something particularly evident in the cold weather plot.

Q2H Combine the weather data with the pedestrian count data (2 marks)

The visualisations tell us something interesting about the data, but to really understand the data we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020; combine it with the pedestrian data for 2018 and 2020.

```r
melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv"))
melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv"))

# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
                          melb_weather_2020) %>%
  # remember to add info about high precipitation, high temperature, + low temps
  mutate(
    high_prcp = if_else(condition = prcp > 5,
                        true = "rain",
                        false = "none"),
    high_temp = if_else(condition = tmax > 33, 
                        true = "hot",
                        false = "not"),
    low_temp = if_else(condition = tmin < 6,
                       true = "cold",
                       false = "not")
  )

# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>% 
  left_join(melb_weather, by = c("Date" = "date"))
```

Part 3: Modelling Count data

We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.

We are going to process the data somewhat so we can fit a linear model to it. First, let’s set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.

We also add one to count and then take the natural log of it. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model for count data, and we can help it along by taking the log of the counts, which helps stabilise the residuals (observed - predicted) when we fit the model.
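
As a quick sanity check of that transformation, `log1p()` keeps zero counts finite on the log scale, and `expm1()` undoes it exactly:

```r
log1p(0)          # 0, so hours with zero pedestrians stay finite
expm1(log1p(42))  # 42, recovering the original count
```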

```r
melb_walk_weather_prep_lm <- melb_walk_weather %>% 
  mutate_at(.vars = vars(Sensor, 
                         Time,
                         month,
                         year,
                         workday,
                         high_prcp,
                         high_temp,
                         low_temp),
            as_factor) %>% 
  mutate(log_count = log1p(Count))
```

Q3A: Fit a linear model (1 mark)

Now we fit a linear model, predicting log_count using Time, month, weekday and the weather flag variables (high_prcp, high_temp, and low_temp).

```r
walk_fit_lm <- lm(
  formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp,
  data = melb_walk_weather_prep_lm
)
```

Q3B: Evaluate the model fit statistics (1 mark)

Provide some summary statistics on how well this model fits the data. What do you see? Which statistics tell us about our model fit?

```r
glance(walk_fit_lm) %>%
  kable(caption = "Model fit statistics for walk_fit_lm linear model") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

Model fit statistics for walk_fit_lm linear model

| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5244989 | 0.5237145 | 1.213939 | 668.6655 | 0 | 36 | -34259.01 | 68592.02 | 68886.7 | 31266.37 | 21217 |

!> Answer: The r.squared value is one of the most common measurements of the strength of a linear model fit. It gives the proportion of variability in the response variable explained by the model, with the remaining variation due to variables not in the model. Here, the r.squared value means that approximately 52.4% of the variability in the log count of pedestrians is explained by the time, month, day of the week, and whether the day was rainy, hot or cold.

Q3C: Look at the model predictions (1 mark)

We have had one look at the model fit statistics, but let’s now look at the fitted and observed (log_count) values, for each sensor:

```r
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)

ggplot(peds_aug_lm,
       aes(x = log_count,
           y = .fitted)) +
  geom_point() +
  facet_wrap(~Sensor) +
  ggtitle("Fitted versus observed log counts for each sensor (walk_fit_lm)")
```

There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the data. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components, which we can do with a calendar plot.

Q3D: Arrange the data so we can examine predictions (1 Mark)

Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, Model, and log_count.

```r
flinders_lm <- peds_aug_lm %>% 
  # filter the data to only look at Flinders St
  filter(Sensor == "Flinders Street Station Underpass") %>%
  # select the Date, Time, Count, .fitted, and log_count columns
  select(Date, 
         Time, 
         Count, 
         .fitted, 
         log_count) %>%
  # now pivot the fitted and log count columns into a column called "model",
  # so that we have columns of Date, Time, Count, model, and log_count
  pivot_longer(.fitted:log_count, names_to = "model", values_to = "log_count") %>%
  # now undo the initial log transformation
  mutate(Count = expm1(log_count))
```

Q3F: Plot the observed and fitted values in a calendar plot (3 Marks)

```r
flinders_lm_cal <- flinders_lm %>% 
  # let's just look at 2020 to make it a bit easier
  filter(year(Date) == "2020") %>% 
  frame_calendar(x = Time, y = Count, date = Date)

gg_cal <- ggplot(flinders_lm_cal) +
  # Part of the interface to overlaying multiple lines in a calendar plot
  # is drawing two separate `geom_line`s - see
  # https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
  # for details
  geom_line(data = filter(flinders_lm_cal, model == ".fitted"),
            aes(x = .Time, 
                y = .Count, 
                colour = model, 
                group = Date)) +  
  geom_line(data = filter(flinders_lm_cal, model == "log_count"),
            aes(x = .Time, 
                y = .Count, 
                colour = model, 
                group = Date)) 

prettify(gg_cal) + theme(legend.position = "bottom") +
  ggtitle("Calendar plot of fitted counts vs observed counts for the walk_fit_lm linear model")
```

Write a paragraph answering these questions:

  • What do you see in the difference between the fitted and the observed (log_count)?
  • What is a difference between 2020 and 2018?
  • Is there a difference across sensors?
  • What other variables might you consider adding to the model, and why?

!> Answer: This plot depicts the counts of pedestrians recorded by the Flinders Street Station Underpass sensor in 2020 for each day between January and the end of April. The orange line depicts the values fitted by the model, while the blue line depicts the observed count data. Generally, the observed values are higher than the fitted ones up until the end of March and beginning of April, which suggests some variables are not being captured by the model. As mentioned previously, the lockdown restrictions in 2020 significantly altered the pedestrian counts, and these laws came into effect around the period where the observed and predicted values flip. Similarly, as seen previously, each sensor has a different pattern in its pedestrian counts. These two variables (year and sensor) should therefore be considered for the model, as they clearly affect the pedestrian counts that form the response variable.

Q3G: Fit one more (or more!) models to explore and compare (2 marks)

What sort of improvements do you think you could make to the model? Fit two more models, trying to add more variables to the linear model.

```r
walk_fit_lm_sensor <- lm(
  formula = log_count ~ Time + month + weekday + Sensor + high_prcp + high_temp + low_temp,
  data = melb_walk_weather_prep_lm
)

walk_fit_lm_year <- lm(
  formula = log_count ~ Time + month + weekday + year + high_prcp + high_temp + low_temp,
  data = melb_walk_weather_prep_lm
)
```

Why did you add those variables?

!> Answer: As mentioned in the section above, Sensor was chosen because each of the four sensors shows a different pattern of pedestrian traffic. Meanwhile, the year has a clear impact on the counts of pedestrians recorded, due to the lockdown laws introduced in 2020. As such, the response variable, log_count, is likely to be affected by both of these variables.

Q3H Compare the model fit statistics (2 marks)

```r
bind_rows(
  first = glance(walk_fit_lm),
  sensor = glance(walk_fit_lm_sensor),
  year = glance(walk_fit_lm_year),
  .id = "type"
) %>%
  kable(caption = "Model fit statistics for the three linear models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```
Model fit statistics for the three linear models

| type | r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| first | 0.5244989 | 0.5237145 | 1.2139390 | 668.6655 | 0 | 36 | -34259.01 | 68592.02 | 68886.70 | 31266.37 | 21217 |
| sensor | 0.6898277 | 0.6892721 | 0.9805127 | 1241.5851 | 0 | 39 | -29718.93 | 59517.86 | 59836.43 | 20395.25 | 21214 |
| year | 0.5542828 | 0.5535264 | 1.1753331 | 732.8801 | 0 | 37 | -33571.64 | 67219.28 | 67521.92 | 29307.95 | 21216 |

Which model does the best? Why do you think that is? What statistics are you basing this decision off?

!> Answer: Of these two new models, the one incorporating Sensor appears to be the best: it explains roughly 68.9% of the variability in log_count, compared to only 55.4% for the model incorporating the year. Similarly, AIC, BIC and deviance can be used to compare models, with lower values indicating a better fit. The Sensor model has the lowest value for all three of these statistics, supporting its position as the superior model.
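
Because the first model is nested inside the sensor model (the latter only adds the Sensor term), one further check we could run is a formal F-test between them; a minimal sketch:

```r
# a small p-value indicates the extra Sensor term significantly improves the fit
anova(walk_fit_lm, walk_fit_lm_sensor)
```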

Q3I Recreate the same calendar plot for your best model and explore the model fit. (2 Marks)

(Suggestion - Perhaps write this as a function to speed up comparison)

```r
peds_aug_lm_sensor <- augment(walk_fit_lm_sensor, data = melb_walk_weather_prep_lm)

pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
  lm_fit %>% 
    filter(Sensor == sensor) %>%
    select(Date,
           Time,
           Count,
           .fitted,
           log_count) %>%
    pivot_longer(.fitted:log_count, names_to = "model", values_to = "log_count") %>%
    mutate(Count = expm1(log_count))
}

calendar_fit_obs <- function(lm_fit_aug){
  data_cal <- lm_fit_aug %>% 
    frame_calendar(x = Time, y = Count, date = Date)

  gg_cal <- ggplot(data_cal) +
    geom_line(data = filter(data_cal, model == ".fitted"),
              aes(x = .Time,
                  y = .Count,
                  colour = model,
                  group = Date)) +
    geom_line(data = filter(data_cal, model == "log_count"),
              aes(x = .Time,
                  y = .Count,
                  colour = model,
                  group = Date))

  prettify(gg_cal) + theme(legend.position = "bottom")
}

pivot_sensor(peds_aug_lm_sensor) %>%
  filter(year(Date) == "2020") %>%
  calendar_fit_obs() +
  ggtitle("Calendar plot of fitted counts vs observed counts for the walk_fit_lm_sensor linear model")
```

What do you see? How does it compare to the previous model?

!> Answer: This plot depicts the counts of pedestrians recorded by the Flinders Street Station Underpass sensor in 2020 for each day between January and the end of April. The orange line depicts the values fitted by the model, while the blue line depicts the observed count data. Compared to the previous calendar plot, the fitted values are much closer to the observed values for most of the period. However, the issue remains for the period from the end of March through April, where the observed values fall well below the fitted ones. This is likely because the year variable was not included, so future models could include both the Sensor and year variables.

Q3J Compare model residuals against the fit (3 marks)

Compare the fitted against the residuals, perhaps write a function to help you do this in a more readable way.

```r
walk_fit_lm_original_aug <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
walk_fit_lm_year_aug <- augment(walk_fit_lm_year, data = melb_walk_weather_prep_lm)
walk_fit_lm_sensor_aug <- augment(walk_fit_lm_sensor, data = melb_walk_weather_prep_lm)

plot_fit_resid <- function(data){
  ggplot(data,
         aes(x = .fitted,
             y = .std.resid)) +
    geom_point() +
    geom_hline(yintercept = 0, colour = "red") +
    facet_wrap(~Sensor)
}

plot_fit_resid(walk_fit_lm_original_aug) +
  ggtitle("Residual plot for the walk_fit_lm linear model")

plot_fit_resid(walk_fit_lm_year_aug) +
  ggtitle("Residual plot for the walk_fit_lm_year linear model")

plot_fit_resid(walk_fit_lm_sensor_aug) +
  ggtitle("Residual plot for the walk_fit_lm_sensor linear model")
```

We’ve looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?

!> Answer: These plots depict the residuals for each of the fitted models, with the standardised residuals on the y-axis and the fitted values on the x-axis; each point relates to one observation. Such plots are used to assess the appropriateness of a linear model: if the points are randomly dispersed around the horizontal axis (the red line), the model is appropriate. Beginning with the first plot, for the original model, the residuals for Flinders Street Station Underpass and Melbourne Central sit generally above the horizontal axis while the other two sensors sit generally below it. The second plot, for the year model, exhibits a similar pattern, suggesting that neither is a good fit. Meanwhile, the final plot, for the Sensor model, shows the most evenly dispersed residuals around the horizontal axis, and thus the most random. This suggests that it is the most appropriate model of the three.

References

Make sure to reference all of the R packages that you used here, along with any links or blog posts that you used to help you answer these questions

Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/.

D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

Kirill Müller (2017). here: A Simpler Way to Find Your Files. R package version 0.1. https://CRAN.R-project.org/package=here

David Robinson and Alex Hayes (2020). broom: Convert Statistical Analysis Objects into Tidy Tibbles. R package version 0.5.6. https://CRAN.R-project.org/package=broom

Earo Wang (2020). rwalkr: API to Melbourne Pedestrian Data. R package version 0.5.2. https://CRAN.R-project.org/package=rwalkr

Wang, E, D Cook, and RJ Hyndman (2020). Calendar-based graphics for visualizing people’s daily schedules. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2020.1715226.

Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support exploration and modeling of temporal data. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2019.1695624.

Hadley Wickham, Jim Hester and Romain Francois (2018). readr: Read Rectangular Text Data. R package version 1.3.1. https://CRAN.R-project.org/package=readr

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Diethelm Wuertz, Tobias Setz, Yohan Chalabi, Martin Maechler and Joe W. Byers (2018). timeDate: Rmetrics - Chronological and Calendar Objects. R package version 3043.102. https://CRAN.R-project.org/package=timeDate

Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.28.

Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra

Extra code

This code below here is what was used to retrieve the data in the data-raw folder.

```r
# melb_bbox <- c(min(ped_loc$longitude) - .001,
#                min(ped_loc$latitude) - 0.001,
#                max(ped_loc$longitude) + .001,
#                max(ped_loc$latitude) + 0.001)
# 
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
#           path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
#           compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
#   file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
#   col_names = c("ID",
#                 "lat",
#                 "lon",
#                 "elev",
#                 "state",
#                 "name",
#                 "v1",
#                 "v2",
#                 "v3"),
#   skip = 353,
#   n_max = 17081
#   ) %>%
#   filter(state == "MELBOURNE AIRPORT")
# # 
# get_ncdc <- function(year){
#   vroom::vroom(
#     glue::glue(
#       "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
#     ),
#     col_names = FALSE,
#     col_select = 1:4
#     )
# }
# 
# clean_ncdc <- function(x){
#   x %>%
#   filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
#   rename_all(~ c("station", "date", "variable", "value")) %>%
#   mutate(date = ymd(date), value = value / 10) %>%
#   pivot_wider(names_from = variable, values_from = value) %>%
#   rename_all(tolower)
#   }

# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
#           path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
# 
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
#           path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))
```