Instructions to Students

This assignment is designed to simulate a scenario where you are given a dataset, take over someone's existing work, and continue with it to draw some further insights. This aspect is similar to Assignment 1, but this assignment provides less scaffolding, and asks you to draw more insights, as well as do more communication.

Your previous submission of crime data was well received!

You’ve now been given a new task to work on. Your colleague at your consulting firm, Amelia (in the treatment text below), has written some helpful hints throughout the assignment to help guide you.

Questions that are worth marks are indicated with “Q” at the start of the question, and the number of marks in parentheses at the end. For example:

## Q1A some text (0.5 marks)

is question one, part A, worth 0.5 marks.

Marking + Grades

This assignment will be worth 10% of your total grade, and is marked out of 58 marks total.

  • 9 Marks for grammar and clarity. You must write in complete sentences and do a spell check.
  • 9 Marks for presentation of the data visualisations

  • 40 marks for the questions
    • Part 1: 7 Marks
    • Part 2: 17 Marks
    • Part 3: 16 Marks
  • Your marks will be weighted according to peer evaluation.

A Note on skills

As of Week 6, you have seen most of the code for parts 1 and 2 that needs to be used here, and Week 7 will give you the skills to complete part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering modelling skills in the next weeks, but this assignment is designed to simulate a real-life work situation - this means that there are some things where you need to “learn on the job”. But the vast majority of the assignment will cover things that you will have seen in class, or in the readings.

Remember, you can look up the help file for functions by typing ?function_name. For example, ?mean. Feel free to google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.

How to complete this assignment

To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with *** or ___. At a minimum, your assignment should be able to be “knitted” using the knit button for your Rmarkdown document.

If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to eval = FALSE like so:

```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```

If you do this, please remember to remove this chunk option or set it to eval = TRUE when you submit the assignment, to ensure all your R code runs.

You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:

  • Each member of the group completes the entire assignment, as best they can.
  • Group members compare answers and combine them into one document for the final submission.

Your assignments will be peer reviewed, and results checked for reproducibility. This means:

  • 25% of the assignment grade will come from peer evaluation.
  • Peer evaluation is an important learning tool.

Each student will be randomly assigned another team’s submission to provide feedback on three things:

  1. Could you reproduce the analysis?
  2. Did you learn something new from the other team’s approach?
  3. What would you suggest to improve their work?

Due Date

This assignment is due in by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team's name. For example, if you are team dplyr, your assignment file name could read: “assignment-2-2020-s1-team-dplyr.Rmd”

Treatment

You work as a data scientist at the aptly named consulting company, “Consulting for You”.

On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:

Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner city Melbourne! Amelia is still in New Zealand, and won’t be back for a while. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.

We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.

Most importantly, can you get this to me by Wednesday 20th May, COB? (COB = Close of Business, 6pm.)

I’ve given this dataset to some of the other new hire data scientists as well; you’ll all be working as a team on this dataset. I’d like you to all try and work on the questions separately, and then combine your answers together to provide the best results.

From here, you are handed a USB stick. You load this into your computer, and you see a folder called “melbourne-walk”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.

Note that the text here was originally written by Amelia, so you need to make sure that their name is kept up top, and to pay attention to what they have to say in the document!

Overview

The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and available for download from Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.
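If you haven't used rwalkr before, here is a minimal sketch of the two calls this analysis relies on (assuming the rwalkr interface at the time of writing - check ?pull_sensor and ?melb_walk before running):

```{r rwalkr-sketch, eval = FALSE}
library(rwalkr)
# sensor metadata: sensor id, installation date, longitude/latitude
ped_loc <- pull_sensor()
# hourly pedestrian counts between two dates
walk <- melb_walk(from = as.Date("2018-01-01"), to = as.Date("2018-04-30"))
```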

There are three parts to this work:

  1. Data wrangling and data visualisation of the pedestrian data
  2. Joining the pedestrian data with weather data
  3. Performing preliminary modelling

Part 1: Data wrangling and data visualisation of the pedestrian data

Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the locations of the sensors on top of it? We want to be able to see all of the sensors, and we also want different shapes for a selection of four sensors (listed in Q1C below).

First we download the data on the pedestrian sensor locations around Melbourne.

And now we draw a plot of the pedestrian sensor locations on a map tile.

Q1A What does this map show us? (0.5 Mark)

Answer: This is a map of Melbourne with longitude on the x axis and latitude on the y axis. The dots and triangles represent the sensors, and we can see that the sensors are located in different areas of the city. The four triangles represent the four selected sensors.

Q1B Create a markdown table of the number of sensors installed each year (0.5 Mark)

ped_loc %>% 
  # calculate the year from the date information
  mutate(year = format(installation_date, "%Y")) %>%
  # count up the number of sensors
  count(year) %>%
  # then use `kable()` to create a markdown table
  kable()
| year |  n |
|:-----|---:|
| 2009 | 13 |
| 2013 | 12 |
| 2014 |  2 |
| 2015 |  7 |
| 2016 |  1 |
| 2017 |  9 |
| 2018 |  5 |
| 2019 |  6 |
| 2020 |  4 |

Additionally, how many sensors were added in 2016, and in 2017?

Answer: There was one sensor added in 2016, and nine sensors added in 2017. …

Q1C Filter the data down to just look at the selected four sensors (1 mark)

We would like you to focus on the foot traffic at 4 sensors:

  • “Southern Cross Station”
  • “Melbourne Central”
  • “Flinders Street Station Underpass”
  • “Birrarung Marr”

Your task is to:

  • Extract the data for the year 2018 from Jan 1 to April 30 using rwalkr (you might want to cache this so it doesn’t run every time you knit; a sketch follows this list)

  • Filter the data down to include only the four sensors above, note that you could do this directly with rwalkr
  • Add variables that contain the day of the month, month, year, and day of the week.
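A hedged sketch of the extraction step, assuming the rwalkr melb_walk() interface (when you run it for real, drop eval = FALSE and keep cache = TRUE so the download is stored between knits; rwalkr can also filter to particular sensors at download time, see the package help):

```{r download-walk-2018, eval = FALSE, cache = TRUE}
# pull hourly counts for 1 Jan - 30 Apr 2018; cache = TRUE stores the result
walk_2018 <- rwalkr::melb_walk(from = as.Date("2018-01-01"),
                               to   = as.Date("2018-04-30"))
```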

walk_2018 <- walk_2018 %>% 
  # Filter the data down to include only the four sensors above
  filter(Sensor == "Southern Cross Station" |
           Sensor == "Melbourne Central" |
           Sensor == "Flinders Street Station Underpass" |
           Sensor == "Birrarung Marr") %>%
  # now add four columns, containing the day of the month, month, year,
  # and day of the week, using functions from lubridate
  mutate(mday = mday(Date),
         month = month(Date),
         year = year(Date),
         wday = wday(Date))

Now we can plot the pedestrian count information for January–April 2018.

ggplot(walk_2018,
       aes(x = Date_Time, 
           y = Count)) +
  geom_line(size = 0.3) +
  facet_grid(Sensor ~ ., 
             # this code presents the facets in a nice way
             labeller = labeller(Sensor = label_wrap_gen(20))) +
  # this code makes the x axis a bit easier to read
  scale_x_datetime(date_labels = "%d %b %Y", 
                   date_minor_breaks = "1 month") +
  labs(x = "Date Time")

We can see that there are quite different patterns in each of the sensors. Let’s explore this further.

Q1D How many types of activities might this capture at the four selected places? (0.5 marks)

Answer: The plot can capture several types of activity at the four locations:

  • Daily patterns, e.g. peak hours during the day
  • Weekly patterns, e.g. weekdays vs weekends
  • Unusual activity, e.g. public holidays and events
  • Differences in behaviour across the different sensors

Q1E Create a plot of the counts of each sensor over the day (2 Marks)

We’re primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people’s daily schedules, let’s plot the counts against the time of day for each sensor.

ggplot(walk_2018,
       aes(x = Time,
           y = Count,
           group = factor(Date))) +
  geom_line(size=0.3) +
  facet_wrap(~ Sensor,
             labeller = labeller(Sensor = label_wrap_gen(20))) +
  scale_colour_brewer(palette = "Dark2",
                      name = "Sensor") +
  theme(legend.position = "none")

Write a short paragraph that describes what the plot shows:

  • What is plotted on the graph? What does each line represent?
  • Is the data in these sensors similar, or different?
  • Does each panel show the same trend within it, or is there variation?
  • What do you learn?

Answer:
  1. The graph shows the number of pedestrians passing through each sensor at each time of day. Each line represents one date; the x axis shows the time of day and the y axis shows the pedestrian count.
  2. The data across the sensors have the same structure (each records count, date, and time), but the patterns differ by location.
  3. The within-day trends at Southern Cross Station and Flinders Street Station Underpass appear similar, probably because these sensors mainly capture commuters travelling to and from work. Birrarung Marr and the Melbourne Central sensors appear to have different trends.
  4. There is a clear pedestrian pattern at Melbourne Central, Flinders Street Station and Southern Cross Station, as most dates at these stations follow one or two common patterns.

Q1F Exploring non-work days (2 marks)

Use the hols_2018 data to identify weekdays, weekends, and holidays.

hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")

walk_2018_hols <- walk_2018 %>% 
  mutate(weekday = wday(Date, label = TRUE, week_start = 1), 
         workday = if_else(
           condition = Date %in% hols_2018$date,
           true = "Holiday",
           false = "Workday"
           )) %>%
  # an extra condition to flag weekends, which are not working days either
  mutate(workday = if_else(
           condition = weekday %in% c("Sat", "Sun"),
           true = "Weekend",
           false = workday
           ))
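As an aside, the nested if_else() calls above can be written as a single dplyr::case_when(), where the first matching condition wins; a sketch using the same column names:

```{r workday-case-when, eval = FALSE}
walk_2018_hols <- walk_2018 %>%
  mutate(weekday = wday(Date, label = TRUE, week_start = 1),
         workday = case_when(
           Date %in% hols_2018$date     ~ "Holiday",  # holidays take precedence
           weekday %in% c("Sat", "Sun") ~ "Weekend",
           TRUE                         ~ "Workday"
         ))
```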

Now create a plot to compare the workdays to the non-workdays.

ggplot(walk_2018_hols,
       aes(x = Time,
           y = Count,
           group = factor(Date),
           colour = workday)) +
  geom_line(size = 0.3,
            alpha = 0.3) +
  facet_grid(~ Sensor,
             labeller = labeller(Sensor = label_wrap_gen(20))) +
  scale_colour_brewer(palette = "Dark2", name = "Workday") +
  theme(legend.position = "bottom")

Write a short paragraph that describes what the plot shows, and helps us answer these questions:

  • What is plotted on the graph? What does each line represent?
  • How are the data in these sensors similar or different?
  • Does each panel show the same trend within it, or is there variation?
  • What do you learn?

Answer:
  1. The graph shows the number of pedestrians passing through each sensor at each time of day. Each line represents one date; the x axis shows the time of day and the y axis shows the pedestrian count. The three colours distinguish the day types (purple = workday, orange = weekend, green = holiday).
  2. The data in each sensor have the same structure (count, date, time, and day type), but the balance of patterns differs between locations.
  3. The weekday trends at Flinders Street Station Underpass and Southern Cross Station appear similar, with Flinders Street having higher magnitudes. This may be because during the week these sensors mainly capture commuters. The weekend trends at Flinders Street Station Underpass and Melbourne Central also appear similar.
  4. There is a clear workday trend (purple), as most workday dates at each location follow a similar pattern. From the Southern Cross Station data we can see there are fewer pedestrians on weekends and holidays. We also notice there is little movement through the Birrarung Marr sensor during the week, but large counts on public holidays and some weekends.

Q1G Calendar plot (0.5 mark)

To locate those unusual moments, Flinders Street Station data is calendarised on the canvas, using the sugrrants package. We can spot the unusual weekday patterns on public holidays using their color. Using the calendar plot, try to spot another unusual pattern, do a google search to try to explain the change in foot traffic. (Hint: Look at the plot carefully, does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)

# filter to just look at flinders st station
flinders <- walk_2018_hols %>% filter(Sensor == "Flinders Street Station Underpass")

flinders_cal <- flinders %>% 
  frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>% 
  ggplot(aes(x = .Time, y = .Count, colour = workday, group = Date)) +
  geom_line()
prettify(gg_cal) +
  theme(legend.position = "bottom")

Answer: The graph shows the daily pedestrian pattern in each month at Flinders Street Station. There are three colours (blue = workday, green = weekend, red-orange = holiday). Each line represents one date; within each calendar cell the x axis shows the time of day and the y axis shows the pedestrian count. Overall, Monday to Friday follow one pattern and weekends follow another. The unusual dates are 01/01 (Mon, New Year's Day), 26/01 (Fri, Australia Day), 12/03 (Mon, Labour Day), 30/03 (Fri, Good Friday), 02/04 (Mon, Easter Monday) and 25/04 (Wed, Anzac Day); these are public holidays, so the pedestrian pattern differs between workdays, weekends and holidays. One additional unusual day is 17/02 (Sat), which shows foot traffic continuing late into the night; this is likely the White Night Melbourne all-night arts festival held that evening.

Part 2: Combining data sources

Q2A Extract the pedestrian count data for 2020 from Jan 1 to April 30; can you add that to the data? (2 marks)

You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.

walk_2020 <- walk_2020 %>% 
  filter(Sensor == "Southern Cross Station" |
           Sensor == "Melbourne Central" |
           Sensor == "Flinders Street Station Underpass" |
           Sensor == "Birrarung Marr") %>% 
  # now add four columns using `mutate`, containing the day of the month,
  # month, year, and day of the week, using functions from lubridate
  mutate(mday = mday(Date),
         month = month(Date),
         year = year(Date),
         wday = wday(Date))

Now add the holiday data

# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")

walk_2020_hols <- walk_2020 %>% 
  mutate(weekday = wday(Date, label = TRUE, week_start = 1),
         workday = if_else(
           condition = Date %in% hols_2020$date,
           true = "Holiday",
           false = "Workday"
           )) %>%
  mutate(workday = if_else(
           condition = weekday %in% c("Sat", "Sun"),
           true = "Weekend",
           false = workday)
  )

melb_walk_hols <- bind_rows(walk_2020_hols, walk_2018_hols) # combine the two datasets

Q2B There is some repetition in the code above compared with the earlier sections. Describe two (or more) ways to limit that repetition, and demonstrate the use of this as a function (1 Mark)

Answer: Two ways to limit the repetition are (1) writing a small function for each repeated step (filtering the sensors, adding date information, and flagging working days), and (2) combining the 2018 and 2020 data first with bind_rows() so each step only needs to be applied once. Both are demonstrated below.

filter_sensor <- function(data, sensors = c("Southern Cross Station", "Melbourne Central",
                                            "Flinders Street Station Underpass", "Birrarung Marr")){
  # default to the four sensors of interest so `filter_sensor()` works unchanged
  data %>% filter(Sensor %in% sensors)
}

add_day_info <- function(data){
  # add four columns using `mutate`: day of the month, month, year,
  # and day of the week, using functions from lubridate
  data %>% 
    mutate(mday = mday(Date),
           month = month(Date),
           year = year(Date),
           wday = wday(Date))
}

add_working_day <- function(data){
  
  walk_years <- unique(data$year)
  
  hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
  
  data %>% 
    mutate(weekday = wday(Date, label = TRUE, week_start = 1),
           workday = if_else(
             condition = Date %in% hols$date,
             true = "Holiday",
             false = "Workday"
             )) %>%
    mutate(workday = if_else(
             condition = weekday %in% c("Sat", "Sun"),
             true = "Weekend",
             false = workday)
    )
}
# Step one, combine the walking data
bind_rows(walk_2018, walk_2020) %>%
  # Step two, filter the sensors
  filter_sensor() %>%
  # Step three, add the info on day of the month, month, etc.
  add_day_info() %>%
  # Step four, add info on working days
  add_working_day()
## # A tibble: 23,136 x 11
##    Sensor Date_Time           Date        Time Count  mday month  year  wday
##    <chr>  <dttm>              <date>     <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Melbo… 2018-01-01 00:00:00 2018-01-01     0  2996     1     1  2018     2
##  2 Flind… 2018-01-01 00:00:00 2018-01-01     0  3443     1     1  2018     2
##  3 Birra… 2018-01-01 00:00:00 2018-01-01     0  1828     1     1  2018     2
##  4 South… 2018-01-01 00:00:00 2018-01-01     0  1411     1     1  2018     2
##  5 Melbo… 2018-01-01 01:00:00 2018-01-01     1  3481     1     1  2018     2
##  6 Flind… 2018-01-01 01:00:00 2018-01-01     1  3579     1     1  2018     2
##  7 Birra… 2018-01-01 01:00:00 2018-01-01     1  1143     1     1  2018     2
##  8 South… 2018-01-01 01:00:00 2018-01-01     1   436     1     1  2018     2
##  9 Melbo… 2018-01-01 02:00:00 2018-01-01     2  1721     1     1  2018     2
## 10 Flind… 2018-01-01 02:00:00 2018-01-01     2  3157     1     1  2018     2
## # … with 23,126 more rows, and 2 more variables: weekday <ord>, workday <chr>

Q2C For Flinders St, in the month of April, can you compare 2018 to 2020? (2 Marks)

Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to see the pattern clearly.)

melb_walk_hols_flinders_april <- melb_walk_hols %>%
  filter(month == 4, Sensor == "Flinders Street Station Underpass")

ggplot(melb_walk_hols_flinders_april,
       aes(x = Time,
           y = Count,
           group = Date,
           colour = as.factor(year))) +
  geom_line(size = 0.1) +
  facet_wrap(~ weekday, ncol = 5) +
  theme(legend.position = "bottom") +
  labs(colour = "Year")

Answer: The graph shows the daily pedestrian pattern for each day of the week in April at Flinders Street Station. The colour represents the year, each line represents one date, the x axis shows the time of day, and the y axis shows the pedestrian count. There is a different pattern between workdays and weekends. There is also a clear difference between 2018 and 2020: the 2020 counts are much lower, consistent with the COVID-19 restrictions in place in April 2020.

Q2D Produce a similar plot (to 2C) that will also allow us to contrast the patterns across the four sensors. (2 marks)

What do you learn? Which sensors seem the most similar? Or the most different?

melb_walk_hols_april <- melb_walk_hols %>% filter(month == 4)

ggplot(melb_walk_hols_april,
       aes(x = Time,
           y = Count,
           group = Date,
           colour = as.factor(year))) +
  geom_line(size = 0.1) +
  facet_grid(~ Sensor,
             labeller = labeller(Sensor = label_wrap_gen(20))) +
  theme(legend.position = "bottom") +
  labs(colour = "Year")

Answer: The graph shows the daily pedestrian pattern at each sensor in April. The colour represents the year, each line represents one date, the x axis shows the time of day, and the y axis shows the pedestrian count. There are more pedestrians in 2018. The pedestrian patterns at Flinders Street Station and Southern Cross Station look very similar, with peaks around 8–9am and 5–6pm.

Combining weather data with pedestrian counts

One question we want to answer is: “Does the weather make a difference to the number of people walking out?”

Time of day and day of week are the predominant driving forces of the number of pedestrians, as depicted in the previous data plots. Apart from these temporal factors, the weather conditions could also affect how many people are walking in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or when it is raining hard.

Daily meteorological data from a separate source, available from the National Climatic Data Center, is used and joined to the main pedestrian data table using common dates.

Binary variables are created to serve as the tipping points.

We have pulled information on weather stations for the Melbourne area - can you combine it together into one dataset?

Q2E Read in and add flagging variables to the weather data (2 marks)

  • high_prcp if prcp > 5 (if yes, “rain”, if no, “none”)
  • high_temp if tmax > 33 (if yes, “hot”, if no, “none”)
  • low_temp if tmin < 6 (if yes, “cold”, if no, “none”)
# Now create some flag variables
melb_weather_2018 <- read_csv(
  here::here("data-raw/melb_ncdc_2018.csv")
  ) %>% 
  mutate(
    high_prcp = if_else(condition = prcp > 5,
                        true = "rain",
                        false = "none"),
    high_temp = if_else(condition = tmax > 33,
                        true = "hot",
                        false = "none"),
    low_temp = if_else(condition = tmin < 6,
                       true = "cold",
                       false = "none")
  )

Q2F Summarise the pedestrian count data and join it to the weather data (2 marks)

The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.

Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.

melb_daily_walk_2018 <- melb_walk_hols %>%
  filter(year == 2018) %>%
  group_by(Sensor, Date) %>%
  summarise(daily_count = sum(Count, na.rm = TRUE)) %>%
  ungroup()

melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
  left_join(melb_weather_2018, by = c("Date" = "date"))

melb_daily_walk_weather_2018
## # A tibble: 480 x 10
##    Sensor Date       daily_count station  tmax  tmin  prcp high_prcp high_temp
##    <chr>  <date>           <int> <chr>   <dbl> <dbl> <dbl> <chr>     <chr>    
##  1 Birra… 2018-01-01        8385 ASN000…  26.2  14       0 none      none     
##  2 Birra… 2018-01-02        9844 ASN000…  23.6  15.5     0 none      none     
##  3 Birra… 2018-01-03        4091 ASN000…  22.3  11.2     0 none      none     
##  4 Birra… 2018-01-04        1386 ASN000…  25.5  11.5     0 none      none     
##  5 Birra… 2018-01-05        1052 ASN000…  30.5  12.2     0 none      none     
##  6 Birra… 2018-01-06        6672 ASN000…  41.5  16.6     0 none      hot      
##  7 Birra… 2018-01-07        1567 ASN000…  22    15.7     0 none      none     
##  8 Birra… 2018-01-08        6201 ASN000…  23.6  15.9     0 none      none     
##  9 Birra… 2018-01-09        8403 ASN000…  22.8  13.9     0 none      none     
## 10 Birra… 2018-01-10       12180 ASN000…  25.5  12.1     0 none      none     
## # … with 470 more rows, and 1 more variable: low_temp <chr>

Q2G Exploring sensor counts against the weather flagging variables (4 marks)

Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (high_prcp, high_temp, and low_temp). Write a paragraph that tells us what you learn from these plots, and how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots that summarise the data like this: what assumptions do they make?

# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
       aes(y = daily_count,
           x = Sensor,
           colour = high_prcp)) +
  geom_boxplot() +
  theme(legend.position = "bottom")

# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
       aes(y = daily_count,
           x = Sensor,
           colour = high_temp)) +
  geom_boxplot() +
  theme(legend.position = "bottom")

# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
       aes(y = daily_count,
           x = Sensor,
           colour = low_temp)) +
  geom_boxplot() +
  theme(legend.position = "bottom")

Answer: We created three boxplots; the line inside each box represents the median of that group. The first boxplot shows the relationship between high_prcp and the daily total pedestrian count. The median for no rain (“none”) is higher than for “rain”, which suggests there are more pedestrians when there is no rain. The second and third boxplots relate to temperature. There is little difference in pedestrian counts between cold and non-cold days, but there appear to be fewer pedestrians when the weather is hot. Therefore, hot temperatures and rain seem to discourage people from going outside. A strength of the boxplots is that we can easily compare the median daily count under each condition. However, a boxplot is not useful when there is not enough data: in the third boxplot there are very few cold days, so we cannot draw any useful conclusions from it. These plots also assume that a daily total is a fair summary, hiding any within-day variation in both the weather and the foot traffic.

Q2H Combine the weather data with the pedestrian count data (2 marks)

The visualisations tell us something interesting about the data, but to really understand it, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020; combine it with the pedestrian data for 2018 and 2020.

melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv"))
melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv"))

# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
                          melb_weather_2020) %>%
  # remember to add info about high precipitation, high temperature, + low temps
  mutate(
    high_prcp = if_else(condition = prcp > 5,
                        true = "rain",
                        false = "none"),
    high_temp = if_else(condition = tmax > 33,
                        true = "hot",
                        false = "none"),
    low_temp = if_else(condition = tmin < 6,
                       true = "cold",
                       false = "none")
  )

# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>% 
  left_join(melb_weather, by = c("Date" = "date"))

Part 3: Modelling Count data

We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.

We are going to process the data somewhat so we can fit a linear model to it. First, let's set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.

We also add one to Count and then take its natural log. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model to fit for this data, and we can help it be more ideal by taking the log of the counts, which helps stabilise the residuals (observed - predicted) when we fit the model.
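A quick illustration of the transformation pair (log1p() and expm1() are base R): log1p() is defined at zero counts, unlike log(), and expm1() exactly undoes it.

```{r log1p-demo, eval = FALSE}
log1p(0)          # 0, whereas log(0) would be -Inf
log1p(99)         # equal to log(100)
expm1(log1p(99))  # 99: expm1() is the inverse of log1p()
```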

melb_walk_weather_prep_lm <- melb_walk_weather %>% 
  mutate_at(.vars = vars(Sensor, 
                         Time,
                         month,
                         year,
                         workday,
                         high_prcp,
                         high_temp,
                         low_temp),
            as_factor) %>% 
  mutate(log_count = log1p(Count))
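As an aside, if you are on dplyr 1.0.0 or later, the mutate_at() pattern above is superseded by across(); a sketch of the equivalent call, assuming the same starting object:

```{r across-sketch, eval = FALSE}
melb_walk_weather_prep_lm <- melb_walk_weather %>%
  mutate(across(c(Sensor, Time, month, year, workday,
                  high_prcp, high_temp, low_temp),
                as_factor)) %>%
  mutate(log_count = log1p(Count))
```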

Q3A: Fit a linear model (1 mark)

Now we fit a linear model, predicting log_count using Time, month, weekday, and the weather flag variables (high_prcp, high_temp, and low_temp).

walk_fit_lm <- lm(
  formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp,
  data = melb_walk_weather_prep_lm
)

Q3B: Evaluate the model fit statistics (1 mark)

Provide some summary statistics on how well this model fits the data. What do you see? Which statistics tell us about our model fit?

glance(walk_fit_lm)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>  <dbl>
## 1     0.524         0.524  1.21      669.       0    36 -34259. 68592. 68887.
## # … with 2 more variables: deviance <dbl>, df.residual <int>

Answer: The adjusted R-squared is 0.524, which indicates that about 52% of the variation in log_count is explained by the predictors. As the R-squared value is fairly low, this may indicate that the model does not fit the data well.

Q3C: Look at the model predictions (1 mark)

We have had one look at the model fit statistics, but let’s now look at the fitted and observed (log_count) values, for each sensor:

peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)

ggplot(peds_aug_lm,
       aes(x = log_count,
       y = .fitted)) +
  geom_point() +
  facet_wrap(~Sensor)

There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the data. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components, which we can do with a calendar plot.

Q3D: Arrange the data so we can examine predictions (1 Mark)

Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, model, and log_count.

flinders_lm <- peds_aug_lm %>% 
  # filter the data to only look at flinder st
  filter(Sensor == 'Flinders Street Station Underpass') %>%
  # Select the Date, Time, Count, .fitted, and log_count
  select(Date, Time, Count, .fitted, log_count) %>%
  # Now pivot the log_count and .fitted columns into a longer format,
  # so we have columns of Date, Time, Count, model, and log_count
  pivot_longer(cols = c(log_count, .fitted),
               names_to = "model",
               values_to = "log_count") %>%
  # Now we're going to undo the initial data transformation
  mutate(Count = expm1(log_count))


Q3F: Plot the observed and fitted values in a calendar plot (3 Marks)

flinders_lm_cal <- flinders_lm %>% 
# Let's just look at 2020 to make it a bit easier
  filter(year(Date) == "2020") %>% 
  frame_calendar(x = Time, y = Count, date = Date)

gg_cal <- ggplot(flinders_lm_cal) +
  # Part of the interface to overlaying multiple lines in a calendar plot
  # is drawing two separate `geom_line`s -
  # See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
  # for details
  geom_line(data = filter(flinders_lm_cal,  model == 'log_count'),
         aes(x = .Time, 
             y = .Count, 
             colour = model, 
             group = Date)) +  
  geom_line(data = filter(flinders_lm_cal, model == '.fitted'),
         aes(x = .Time, 
             y = .Count, 
             colour = model, 
             group = Date)) 

prettify(gg_cal) + theme(legend.position = "bottom")

Write a paragraph answering these questions:

  • What do you see in the difference between the fitted and the observed (log_count)?
  • What is a difference between 2020 and 2018?
  • Is there a difference across sensors?
  • What other variables might you consider adding to the model, and why?

Answer: The fitted values are generally lower than the observed log_count, and the fitted series misses some of the observed structure, such as the peaks during weekdays and the differing pattern on weekends. Interestingly, from late March 2020 the observed log_count decreases sharply, and from then on the fitted values sit higher than the observed ones. There are some differences across sensors, but the overall trend is similar. Other variables worth considering include the workday flag and the Sensor itself, since the plots show clear differences between working and non-working days and between locations.

Q3G: Fit one more model (or more!) to explore and compare (2 marks)

What sort of improvements do you think you could make to the model? Fit two more models, trying to add more variables to the linear model.

walk_fit_lm_sensor <- lm(
  formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp + workday + Count,
  data = melb_walk_weather_prep_lm
  ) # adds workday (and Count) to the first model

walk_fit_lm_year <- lm(
  formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp + workday + Count + tmax + tmin + prcp,
  data = melb_walk_weather_prep_lm
  ) # additionally adds tmax, tmin and prcp

Why did you add those variables?

Answer: For the first model (walk_fit_lm_sensor), I added ‘workday’ and ‘Count’ to the model, because I think these two variables are very important. For the second model (walk_fit_lm_year), I added three further variables: tmax, tmin and prcp. I tried adding as many variables as I could, to see whether more variables would improve the model. One caution: including Count as a predictor is questionable, since log_count is computed directly from Count, so much of the apparent improvement in fit comes from the model partly predicting the response from itself.

Q3H Compare the model fit statistics (2 marks)

bind_rows(
  first = glance(walk_fit_lm),
  sensor = glance(walk_fit_lm_sensor),
  year = glance(walk_fit_lm_year),
  .id = "type"
)
## # A tibble: 3 x 12
##   type  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
##   <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1 first     0.524         0.524 1.21       669.       0    36 -34259. 68592.
## 2 sens…     0.786         0.786 0.814     2112.       0    38 -25752. 51583.
## 3 year      0.788         0.787 0.811     1969.       0    41 -25686. 51456.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>

Which model does the best? Why do you think that is? What statistics are you basing this decision off?

Answer: The year model does best: it has the highest R-squared and adjusted R-squared (0.788 and 0.787) and the lowest AIC, meaning it explains the most variation in log_count while accounting for the number of predictors.
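Because the three models are nested (each adds terms to the previous one), a formal comparison could also use an F-test via base R's anova(); a sketch, assuming the fitted model objects are still in the workspace:

```{r compare-models-anova, eval = FALSE}
# sequential F-tests comparing the nested models, smallest first
anova(walk_fit_lm, walk_fit_lm_sensor, walk_fit_lm_year)
```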

Q3I Recreate the same calendar plot for your best model and explore the model fit. (2 Marks)

(Suggestion - Perhaps write this as a function to speed up comparison)

peds_aug_lm_sensor_year <- augment(walk_fit_lm_year, data = melb_walk_weather_prep_lm)
pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
  lm_fit %>% 
  filter(Sensor == sensor) %>%
  select(Date, Time, Count, .fitted, log_count) %>%
  pivot_longer(-c("Date","Time","Count"), names_to = "model", values_to = "log_count") %>%
  mutate(Count = expm1(log_count))
}

calendar_fit_obs <- function(lm_fit_aug){
  data_cal <- lm_fit_aug %>% 
    frame_calendar(x = Time, y = Count, date = Date)

gg_cal <-
  ggplot(data_cal) +
  geom_line(data = filter(data_cal,  model == ".fitted"),
         aes(x = .Time, 
             y = .Count, 
             colour = model, 
             group = Date)) +
  geom_line(data = filter(data_cal, model == "log_count"),
         aes(x = .Time, 
             y = .Count, 
             colour = model, 
             group = Date))

prettify(gg_cal) + theme(legend.position = "bottom")

}

pivot_sensor(peds_aug_lm_sensor_year) %>%
  filter(year(Date) == "2020") %>%
  calendar_fit_obs()

What do you see? How does it compare to the previous model?

Answer: Most of the fitted values are close to the observed log_count values, and the fit tracks the daily patterns more closely than the previous model, which is consistent with its higher R-squared.

Q3J Compare model residuals against the fit (3 marks)

Compare the fitted against the residuals, perhaps write a function to help you do this in a more readable way.

# augment each model; use new names so the model objects are not overwritten
walk_aug_lm <- augment(walk_fit_lm, melb_walk_weather_prep_lm)
walk_aug_lm_sensor <- augment(walk_fit_lm_sensor, melb_walk_weather_prep_lm)
walk_aug_lm_year <- augment(walk_fit_lm_year, melb_walk_weather_prep_lm)

plot_fit_resid <- function(data){
  # plot the residuals against the fitted values, one panel per sensor
  ggplot(data,
         aes(x = .fitted,
             y = .resid)) +
    geom_point() +
    facet_wrap(~ Sensor)
}

plot_fit_resid(walk_aug_lm)

plot_fit_resid(walk_aug_lm_sensor)

plot_fit_resid(walk_aug_lm_year)

We’ve looked at all these models; now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?

Answer: walk_fit_lm_year is the best of the three models, but there are still clear differences between the predicted and the observed values, so it is not an especially good model either. Perhaps a regression method designed for count data would give better predictions.
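As a sketch of one such alternative (the formula and family here are illustrative assumptions, not part of the original analysis), a quasi-Poisson GLM models the raw counts directly and allows for overdispersion:

```{r glm-count-sketch, eval = FALSE}
# a generalised linear model for the hourly counts
walk_fit_glm <- glm(
  Count ~ Time + month + weekday + high_prcp + high_temp + low_temp + workday,
  family = quasipoisson(),
  data = melb_walk_weather_prep_lm
)
summary(walk_fit_glm)
```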

References

Make sure to reference all of the R packages that you used here, along with any links or blog posts that you used to help you answer these questions.
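One way to gather these is with base R's citation(); the package list below is an assumption based on what this analysis loads, so adjust it to match your setup:

```{r package-citations, eval = FALSE}
citation("rwalkr")
citation("sugrrants")
citation("tsibble")
citation("lubridate")
citation("broom")
citation("tidyverse")
```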

Extra code

The code below is what was used to retrieve the data in the data-raw folder.

# melb_bbox <- c(min(ped_loc$longitude) - .001,
#                min(ped_loc$latitude) - 0.001,
#                max(ped_loc$longitude) + .001,
#                max(ped_loc$latitude) + 0.001)
# 
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
#           path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
#           compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
#   file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
#   col_names = c("ID",
#                 "lat",
#                 "lon",
#                 "elev",
#                 "state",
#                 "name",
#                 "v1",
#                 "v2",
#                 "v3"),
#   skip = 353,
#   n_max = 17081
#   ) %>%
#   filter(state == "MELBOURNE AIRPORT")
# # 
# get_ncdc <- function(year){
#   vroom::vroom(
#     glue::glue(
#       "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
#     ),
#     col_names = FALSE,
#     col_select = 1:4
#     )
# }
# 
# clean_ncdc <- function(x){
#   x %>%
#   filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
#   rename_all(~ c("station", "date", "variable", "value")) %>%
#   mutate(date = ymd(date), value = value / 10) %>%
#   pivot_wider(names_from = variable, values_from = value) %>%
#   rename_all(tolower)
#   }

# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
#           path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
# 
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
#           path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))