This assignment is designed to simulate a scenario where you are given a dataset, taking over someone's existing work, and continuing with it to draw some further insights. This aspect is similar to Assignment 1, but it provides less scaffolding, and asks you to draw more insights, as well as do more communication.
Your previous submission of crime data was well received!
You’ve now been given a different next task to work on. Your colleague at your consulting firm, Amelia (in the text treatment below) has written some helpful hints throughout the assignment to help guide you.
Questions that are worth marks are indicated with "Q" at the start and end of the question, as well as the number of marks in parentheses. For example:
## Q1A some text (0.5 marks)
is question one, part A, worth 0.5 marks.
This assignment is worth 10% of your total grade, and is marked out of 58 marks total, with 9 marks for the presentation of the data visualisations.
Your marks will be weighted according to peer evaluation.
As of Week 6, you have seen most of the code for Parts 1-2 that needs to be used here, and Week 7 will give you the skills to complete Part 3. I do not expect you to know immediately what the code below does - this is a challenge for you! We will be covering modelling skills in the coming weeks, but this assignment is designed to simulate a real-life work situation - this means that there are some things where you need to "learn on the job". But the vast majority of the assignment covers things that you will have seen in class, or in the readings.
Remember, you can look up the help file for functions by typing ?function_name. For example, ?mean. Feel free to Google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with *** or ___. At a minimum, your assignment should be able to be "knitted" using the knit button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to eval = FALSE like so:
```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```
If you do this, please remember to remove this chunk option or set it to eval = TRUE when you submit the assignment, to ensure all your R code runs.
You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:
Your assignments will be peer reviewed, and results checked for reproducibility. This means:
Each student will be randomly assigned another team’s submission to provide feedback on three things:
This assignment is due by 6pm on Wednesday 20th May. You will submit the assignment via ED. Please change the file name to include your team's name. For example, if you are team dplyr, your assignment file name could read: "assignment-2-2020-s1-team-dplyr.Rmd"
You work as a data scientist at the aptly named consulting company, "Consulting for You".
On your second day at the company, you impressed the team with your work on crime data. Your boss says to you:
Amelia has managed to find yet another treasure trove of data - get this: pedestrian count data in inner city Melbourne! Amelia is still in New Zealand, and won't be back for a while. They discovered this dataset the afternoon before they left on holiday, and got started on some data analysis.
We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it.
Most importantly, can you get this to me by Wednesday 20th May, COB? (COB = close of business, 6pm.)
I’ve given this dataset to some of the other new hire data scientists as well, you’ll all be working as a team on this dataset. I’d like you to all try and work on the questions separately, and then combine your answers together to provide the best results.
From here, you are handed a USB stick. You load this into your computer, and you see a folder called “melbourne-walk”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.
Note that the text here was originally written by Amelia, so make sure that their name is kept up top, and pay attention to what they have to say in the document!
The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and available for download from Melbourne Open Data Portal. The rwalkr package provides an API in R to easily access sensor counts and geographic locations.
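If you want to see where the data comes from, a minimal sketch of pulling it with rwalkr might look like the following (assuming the pull_sensor() and melb_walk_fast() functions - check the package documentation for the current interface):

```{r rwalkr-sketch, eval = FALSE}
library(rwalkr)
# sensor locations, including installation dates (assumed interface)
ped_loc <- pull_sensor()
# hourly pedestrian counts for a given year (assumed interface)
walk <- melb_walk_fast(year = 2018)
```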
There are three parts to this work:
Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to create different shapes for the following sensors:
First we download the data on the pedestrian sensor locations around Melbourne.
And now we draw a plot, on a map tile, of the pedestrian sensor locations.
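The plotting code itself is not shown here; a minimal sketch, assuming the map tile is stored in `melb_map` (see the appendix) and that `ped_loc` has `longitude`, `latitude`, and a `sensor` name column (the name column is an assumption), might look like:

```{r sensor-map-sketch, eval = FALSE}
library(ggmap)
library(dplyr)
# the four sensors we want to highlight with different shapes
key_sensors <- c("Birrarung Marr",
                 "Flinders Street Station Underpass",
                 "Melbourne Central",
                 "Southern Cross Station")
ggmap(melb_map) +
  # all sensors as transparent blue circles
  geom_point(data = ped_loc,
             aes(x = longitude, y = latitude),
             colour = "blue", alpha = 0.4) +
  # the four key sensors as coloured triangles (shape 17)
  geom_point(data = filter(ped_loc, sensor %in% key_sensors),
             aes(x = longitude, y = latitude, colour = sensor),
             shape = 17, size = 3)
```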
!> Answer: The map above depicts the city of Melbourne, with longitude on the x axis and latitude on the y axis, and shows the locations of the pedestrian sensors. The transparent blue circles indicate the sensors in the specified area, with the multi-coloured triangles identifying the locations of the four key sensors:

1. Green triangle - Birrarung Marr sensor
2. Orange triangle - Flinders Street Station Underpass sensor
3. Purple triangle - Melbourne Central sensor
4. Pink triangle - Southern Cross Station sensor
Birrarung Marr and Flinders Street Station are relatively close together along the bank of the Yarra River, with the Birrarung Marr sensor appearing to be near parkland. The Southern Cross Station sensor is furthest away, on the west end of the map, while Melbourne Central is located on the east end of the map in an area that appears to be primarily populated with buildings.
ped_loc %>%
# calculate the year from the date information
mutate(year = year(installation_date)) %>%
# count up the number of sensors
count(year) %>%
rename(Year = year,
Sensors_installed = n) %>%
# then use `kable()` to create a markdown table
kable(caption = "The number of sensors installed each year") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Year | Sensors_installed |
---|---|
2009 | 13 |
2013 | 12 |
2014 | 2 |
2015 | 7 |
2016 | 1 |
2017 | 9 |
2018 | 5 |
2019 | 6 |
2020 | 4 |
Additionally, how many sensors were added in 2016, and in 2017?
!> Answer: The table displays the year on the left hand side and the count of sensors installed on the right hand side. It shows that in 2016, 1 sensor was installed, and in 2017, 9 were installed.
We would like you to focus on the foot traffic at 4 sensors:

- Southern Cross Station
- Melbourne Central
- Flinders Street Station Underpass
- Birrarung Marr
Your task is to:
- Extract the data for the year 2018 from Jan 1 to April 30 using rwalkr (you might want to cache this so it doesn't run every time you knit) - a sketch follows this list
- Add variables that contain the day of the month, month, year, and day of the year
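A sketch of the extraction step, assuming the melb_walk_fast() function from rwalkr (an assumed interface); the cache = TRUE chunk option stops the download re-running on every knit:

```{r walk-2018-sketch, eval = FALSE, cache = TRUE}
library(rwalkr)
library(dplyr)
walk_2018 <- melb_walk_fast(year = 2018) %>%
  # keep 1 January to 30 April 2018
  filter(Date >= as.Date("2018-01-01"),
         Date <= as.Date("2018-04-30"))
```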
walk_2018 <- walk_2018 %>%
# Filter the data down to include only the four sensors above
filter(Sensor %in% c("Southern Cross Station",
"Melbourne Central",
"Flinders Street Station Underpass",
"Birrarung Marr")) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
ungroup() %>%
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
year_day = yday(Date))
Now we can plot the pedestrian count information for January - April in 2018
ggplot(walk_2018,
aes(x = Date_Time,
y = Count)) +
geom_line(size = 0.3) +
facet_grid(Sensor ~ .,
             # this code presents the facets in a nice way
labeller = labeller(Sensor = label_wrap_gen(20))) +
  # this code makes the x axis a bit nicer to read
scale_x_datetime(date_labels = "%d %b %Y",
date_minor_breaks = "1 month") +
labs(x = "Date Time")
We can see that there are quite different patterns in each of the sensors. Let’s explore this further.
!> Answer:
There is a vast array of activities and events whose patterns could be captured across the locations where the sensors are positioned. These may include:

- The working week:
  - People going to and from work in traditional Monday to Friday occupations
  - People going to and from work in the retail and service industry, which usually operates seven days a week
- Public events such as White Night
- Sporting events such as the Australian Open
- Public holidays such as New Year's Day and Anzac Day
- Patterns of people's movement across the weekend
We're primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people's daily schedules, let's plot the counts against time of day for each sensor.
ggplot(walk_2018,
aes(x = Time,
y = Count,
group = Date,
colour = Sensor)) +
geom_line() +
facet_grid(month ~ Sensor,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = "Dark2",
name = "Sensor") +
scale_x_continuous(breaks = seq(from = 0, to = 24, by = 4)) +
ggtitle("2018 Pedestrian count for each sensor location by month") +
theme(legend.position = "bottom")
Figure 4.1: The figure displays the pedestrian count for each sensor across the observed period of January 1st to April 30th in 2018
Write a short paragraph that describes what the plot shows:
!> Answer:
Figure 4.1 above visualises the count of pedestrians picked up by each sensor across every hour of every day from January through to April. The pedestrian count is plotted on the y axis and the time of day along the x axis. The plot is divided into a grid to show the count data for each of the four sensors in each month, as indicated by the labels across the top and the right hand side.
Flinders Street Station, Melbourne Central and Southern Cross Station keep a relatively consistent shape across each month, while Birrarung Marr shows a more irregular distribution of pedestrian counts from month to month. Its highest counts come in January, with the largest recordings usually arriving after 10:00, likely associated with the Australian Open taking place between the 15th and 28th of January.
Both Flinders Street and Southern Cross Station show spikes of pedestrian counts between 7:00 and 8:00 and again between 16:00 and 18:00, which may reflect the patterns of the traditional working day. Southern Cross Station also shows a band of recordings at very low counts, indicating little travel across certain periods, while Flinders Street shows a secondary pattern that follows a smoother, slightly negatively skewed shape similar to that of the Melbourne Central plot. As noted, Melbourne Central follows a relatively smooth, negatively skewed curve that records a fairly consistent number of pedestrian counts from 12:00 to around 20:00. Its shape may be indicative of a location populated with service and retail industries that operate later into the day, as well as being surrounded by many attractions that make it a destination for people at leisure.
All four sensors show a spike in pedestrian counts at the beginning of January in the early morning hours, around 12:00am, with the Flinders Street and Melbourne Central sensors having the largest counts, stretching out to around 5am; this is likely associated with activity around New Year's Eve and New Year's Day. Birrarung Marr, Flinders Street and Melbourne Central have a relatively similar spike in pedestrian counts in February at around 22:00, which may be related to a specific event taking place in the CBD.
Use the data inside the hols_2018 data to identify weekdays, weekends, and holidays.
hols_2018 <- tsibble::holiday_aus(year = 2018, state = "VIC")
walk_2018_hols <- walk_2018 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(condition = Date %in% hols_2018$date | weekday %in% c("Sat","Sun"),
true = "no",
false = "yes"))
Now create a plot to compare the workdays to the non-workdays.
ggplot(walk_2018_hols,
aes(x = Time,
y = Count,
group = Date,
color = workday)) +
geom_line(size = 1,
alpha = 0.3) +
facet_grid(Sensor ~ weekday,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_viridis(discrete = TRUE, name = "Workday") +
ggtitle("Pedestrian count across the working and non-working per each location") +
scale_x_continuous(breaks = seq(from = 0, to = 24, by = 4)) +
theme_dark() +
theme(legend.position = "right")
Figure 4.2: The figure presents the pedestrian counts for each sensor location across the 7 day week, displaying the difference between working day and non-working day counts, for the period of 1st Jan to 30th April 2018
Write a short paragraph that describes what the plot shows, and helps us answer these questions:

- What is plotted on the graph? What does each line represent?
- How are the data in these sensors similar or different?
- Does each panel show the same trend within it, or is there variation?
- What do you learn?
!> Answer:
Figure 4.2 visualises the pedestrian count at each of the four sensor locations across every hour of every day, for both working and non-working days, over the period 01-01-2018 to 30-04-2018. The pedestrian count is plotted on the y axis and the time of day along the x axis. The plot is divided into a grid, with the sensor location names along the right hand side and the day of the week listed along the top. The figure depicts working days with yellow lines and non-working days (public holidays and weekends) in purple. The figure shows that in this period of 2018 there were no public holidays on Tuesdays or Thursdays.
The plot shows similarities between the working day pedestrian counts for Flinders Street Station and Southern Cross Station, with a consistent shape containing two peaks for both sensors, occurring around 8:00 and between 16:00 and 18:00, Monday to Friday. Birrarung Marr indicates a similar pattern on a smaller scale: the density of the pedestrian count lines indicates peaks around similar times between Monday and Thursday, but for fewer people. The Birrarung Marr sensor also shows far less dense lines with higher peaks around those times, with a third peak present on most days around 22:00, and the highest peaks coming on non-working days. It can also be seen that on weekends Birrarung Marr records relatively higher counts across a longer period of the day than on working days; the results are fairly inconsistent from Monday to Friday, with some relative consistency across Saturday and Sunday.
Melbourne Central's working day pattern takes a different shape from the other three. Comparing working and non-working days, both on the weekend and midweek, it is clear that much the same pattern exists, whereby the highest counts fall between 12:00 and 20:00. This may be a result of the area being populated with service and retail businesses, which usually operate 7 days a week and through most holidays, drawing employees and customers to the area regardless of the day. Furthermore, on non-working days Flinders Street Station takes a different shape from its working day pattern, one relatively similar to Melbourne Central, following a smoother curve that dips down in the early hours and rises to its highest pedestrian counts between 12:00 and 22:00.
Conversely, Southern Cross Station registers a minimal pedestrian count on non-working days that is consistent across all of them and starkly contrasts with its working day pattern, clearly showing that in 2018, when people were not going to work, they were not coming to this station. Birrarung Marr displays higher pedestrian count peaks on non-working days relative to working days, with relative inconsistency between Monday, Wednesday and Friday; for Saturday and Sunday the plot indicates a relatively similar overall shape, showing people mostly travel around this area early in the morning or between 12:00 and around 22:00.
To locate those unusual moments, the Flinders Street Station data is calendarised on the canvas, using the sugrrants package. We can spot the unusual weekday patterns on public holidays using their colour. Using the calendar plot, try to spot another unusual pattern, and do a Google search to try to explain the change in foot traffic. (Hint: Look at the plot carefully - does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)
# filter to just look at flinders st station
flinders <- walk_2018_hols %>%
filter(Sensor == "Flinders Street Station Underpass")
flinders_cal <- flinders %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>%
ggplot(aes(x = .Time,
y = .Count,
colour = workday,
group = Date)) +
geom_line() +
ggtitle("Calander plot displaying the working and non-working days for the 2018 period")
prettify(gg_cal) +
theme(legend.position = "bottom")
Figure 4.3: Calendar plot displaying the working and non-working days across the period of 1st Jan to 30th April 2018
!> Answer:
Figure 4.3 calendarises the pedestrian count per day, coloured by workdays and non-workdays, from January to April. There is a unique pattern on the third Saturday of February, the 17th, which shows the pedestrian count increasing across the whole day until about 23:55, breaking the pattern of every other Saturday recorded in this period. This is likely due to the public event "White Night" being held on this date, where the city and many of its businesses and venues are open throughout the night, leading to a higher number of pedestrians requiring public transport late at night and higher counts being recorded.
You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.
walk_2020 <- walk_2020 %>%
filter(Sensor %in% c("Southern Cross Station","Melbourne Central","Flinders Street Station Underpass","Birrarung Marr")) %>%
  # now use `mutate` to add four columns containing the day of the month, month, year, and day of the year, using functions from lubridate.
mutate(mday = mday(Date),
month = month(Date),
year = year(Date),
dayoyear = yday(Date))
Now add the holiday data
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = 2020, state = "VIC")
walk_2020_hols <- walk_2020 %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(condition = Date %in% hols_2020$date | weekday %in% c("Sat","Sun"),
true = "no",
false = "yes"))
melb_walk_hols <- bind_rows(walk_2018_hols, walk_2020_hols)
filter_sensor <- function(data, sensors){
  # filter down to the sensors passed in, rather than a hard-coded list
  data %>% filter(Sensor %in% sensors)
}
add_day_info <- function(data){
  # use `mutate` to add four columns containing the day of the month, month, year, and day of the year, using functions from lubridate.
data %>%
mutate(mday = mday(Date), month = month(Date), year = year(Date), dayoyear = yday(Date))
}
add_working_day <- function(data){
walk_years <- unique(data$year)
hols <- tsibble::holiday_aus(year = walk_years, state = "VIC")
data %>%
mutate(weekday = wday(Date, label = TRUE, week_start = 1),
workday = if_else(condition = Date %in% hols$date | weekday %in% c("Sat","Sun"),
true = "no",
false = "yes"))
}
# Step one, combine the walking data
bind_rows(walk_2018, walk_2020) %>%
  # Step two, filter the sensors
  filter_sensor(sensors = c("Southern Cross Station",
                            "Melbourne Central",
                            "Flinders Street Station Underpass",
                            "Birrarung Marr")) %>%
  # step three, add the info on day of the year, month, etc
  add_day_info() %>%
  # step four, add info on working days.
  add_working_day()
## # A tibble: 23,136 x 12
## Sensor Date_Time Date Time Count mday month year year_day
## <chr> <dttm> <date> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Melbo… 2018-01-01 00:00:00 2018-01-01 0 2996 1 1 2018 1
## 2 Flind… 2018-01-01 00:00:00 2018-01-01 0 3443 1 1 2018 1
## 3 Birra… 2018-01-01 00:00:00 2018-01-01 0 1828 1 1 2018 1
## 4 South… 2018-01-01 00:00:00 2018-01-01 0 1411 1 1 2018 1
## 5 Melbo… 2018-01-01 01:00:00 2018-01-01 1 3481 1 1 2018 1
## 6 Flind… 2018-01-01 01:00:00 2018-01-01 1 3579 1 1 2018 1
## 7 Birra… 2018-01-01 01:00:00 2018-01-01 1 1143 1 1 2018 1
## 8 South… 2018-01-01 01:00:00 2018-01-01 1 436 1 1 2018 1
## 9 Melbo… 2018-01-01 02:00:00 2018-01-01 2 1721 1 1 2018 1
## 10 Flind… 2018-01-01 02:00:00 2018-01-01 2 3157 1 1 2018 1
## # … with 23,126 more rows, and 3 more variables: dayoyear <dbl>, weekday <ord>,
## # workday <chr>
Write a paragraph that describes what you learn from these plots. Can you describe any similarities and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to clearly see the pattern)
melb_walk_hols_flinders_april <- melb_walk_hols %>%
filter(Sensor == "Flinders Street Station Underpass",
month == "4")
ggplot(melb_walk_hols_flinders_april,
aes(x = Time,
y = Count,
colour = as.factor(year))) +
geom_line() +
  facet_grid(workday ~ weekday) +
theme(legend.position = "bottom") +
labs(colour = "Year")
!> Answer: The above line graph represents the daily sensor counts for the month of April in the years 2018 and 2020 at Flinders Street Station. The graph is plotted with the count of pedestrians on the y-axis and time of day on the x-axis, and each panel shows the days of April for a given weekday and workday status. Using the holiday_aus() function and an if_else condition helps us find the different patterns of counts on each day, comparing workdays, non-workdays, and holidays.
It is very clear in the plot above that there are significant differences between the workdays and the non-workdays and holidays. Observing the weekends (dates: 1, 2, 7, 8, 14, 15, 21, 22, 28, 29), the counts are relatively low compared to those of the working days in 2018. Considering the holidays (dates: 1, 2, 3, 25) fall in Easter week or on the Anzac Day public holiday, they show almost the same patterns as the non-workdays, while the workdays show stable changes and a higher pedestrian count than the non-workdays and holidays. There is a sequence that can be observed within each workday: a peak in counts around 8:00 and another around 16:00, which could represent people getting to their workplaces and getting back home respectively. However, in the year 2020, there are no significant changes in the month of April whether it is a workday or not; this is due to the COVID-19 restrictions. The non-workdays and holidays all look similar to the workdays, with counts almost equal to 0. A notable feature within the graphs is the missing data on the 12th of April 2018, which could be due to the sensors being down at that time, or the data being lost due to unforeseen circumstances. In conclusion, the time of day, the day of the week, and whether the day is a holiday drive the frequency of pedestrians at specific sensor locations.
What do you learn? Which sensors seem the most similar? Or the most different?
melb_walk_hols_april <- melb_walk_hols %>% filter(month == 4)
ggplot(melb_walk_hols_april,
       aes(x = Time, y = Count, color = as.factor(year), group = Date)) +
  geom_line() +
  facet_grid(Sensor ~ weekday) +
  theme(legend.position = "bottom") +
  labs(colour = "Year")
!> Answer: The plot above illustrates the count of pedestrians throughout the weekdays for each sensor location at hourly intervals for the month of April. This is a line graph with the pedestrian count on the y-axis and the time of day on the x-axis. The plot is faceted by sensor and weekday as a grid to show the activity at each sensor on each weekday for the entire month of April, across the four sensors (depicted on the right hand side of the grid), for the two years 2018 and 2020, as shown by the legend at the bottom.
Flinders Street Station Underpass and Southern Cross Station have almost similar patterns of pedestrian counts during the workday, whereas Melbourne Central and Flinders Street Station have similar patterns during the non-workdays. Birrarung Marr is highly varied throughout the weekdays; it has peaks on working days around 17:00, and it seems that Birrarung Marr is a hotspot for non-workday leisure - it hosts many events and has places you can visit with family and friends, which is supported by the distinct count variation shown on the non-workdays. It seems that Southern Cross Station is not a spot people go to on non-workdays, as the counts seem almost insignificant. Flinders Street and Southern Cross Stations have workday peaks at 8:00 - 9:00 in the morning, when people might be getting to work, and at 16:00 - 17:00 in the evening, when they head back home.
Combining weather data with pedestrian counts
One question we want to answer is: “Does the weather make a difference to the number of people walking out?”
Time of day and day of week are the predominant driving forces of the number of pedestrians, as depicted in the previous data plots. Apart from these temporal factors, the weather conditions could possibly affect how many people are walking in the city. In particular, people are likely to stay indoors when the day is too hot or too cold, or raining hard.
Daily meteorological data as a separate source, available on National Climatic Data Center, is used and joined to the main pedestrian data table using common dates.
Binary variables are created to serve as the tipping points:

- prcp > 5 (if yes, "rain", if no, "none")
- tmax > 33 (if yes, "hot", if no, "not")
- tmin < 6 (if yes, "cold", if no, "not")

We have pulled information on weather stations for the Melbourne area - can you combine it together into one dataset?

# Now create some flag variables
melb_weather_2018 <- read_csv(
here::here("data-raw/melb_ncdc_2018.csv")
) %>%
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
The weather data is per day, and the pedestrian count data is per hour. One way to explore this data is to collapse the pedestrian counts down to daily totals, so we can compare the total number of people each day to the weather for that day. This means each row is the total count at each sensor, for each day.
Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only.
melb_daily_walk_2018 <- melb_walk_hols %>%
filter(year == 2018) %>%
group_by(Sensor, Date) %>%
summarise(count = sum(Count,na.rm = TRUE)) %>%
ungroup()
melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
left_join(melb_weather_2018,by = c("Date" = "date"))
melb_daily_walk_weather_2018
## # A tibble: 480 x 10
## Sensor Date count station tmax tmin prcp high_prcp high_temp
## <chr> <date> <int> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Birra… 2018-01-01 8385 ASN000… 26.2 14 0 none not
## 2 Birra… 2018-01-02 9844 ASN000… 23.6 15.5 0 none not
## 3 Birra… 2018-01-03 4091 ASN000… 22.3 11.2 0 none not
## 4 Birra… 2018-01-04 1386 ASN000… 25.5 11.5 0 none not
## 5 Birra… 2018-01-05 1052 ASN000… 30.5 12.2 0 none not
## 6 Birra… 2018-01-06 6672 ASN000… 41.5 16.6 0 none hot
## 7 Birra… 2018-01-07 1567 ASN000… 22 15.7 0 none not
## 8 Birra… 2018-01-08 6201 ASN000… 23.6 15.9 0 none not
## 9 Birra… 2018-01-09 8403 ASN000… 22.8 13.9 0 none not
## 10 Birra… 2018-01-10 12180 ASN000… 25.5 12.1 0 none not
## # … with 470 more rows, and 1 more variable: low_temp <chr>
Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (high_prcp, high_temp, and low_temp).
Write a paragraph that tells us what you learn from these plots, and how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of plots summarised like this - what assumptions do they make?
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
aes(y = count,
x = Sensor,
colour = high_prcp )) +
geom_boxplot() +
theme(legend.position = "bottom")
Figure 5.1: high precipitation
# Plot against high temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = count,
x = Sensor,
colour = high_temp)) +
geom_boxplot()
Figure 5.2: high temperatures
# Plot of low temperature
ggplot(melb_daily_walk_weather_2018,
aes(y = count,
x = Sensor,
colour = low_temp)) +
geom_boxplot()
Figure 5.3: low temperatures
!> Answer: All three of the above plots display daily pedestrian count figures for each sensor, with the daily count on the y axis and the sensor location on the x axis.
Figure 5.1 displays the count data for each sensor on days where the weather had high precipitation, indicated by the red boxplot, or not, indicated by the blue. The plot shows that on rainy days there is far less pedestrian traffic at all four locations compared with dry days, indicating that rain has a large impact on people's decisions to come to these areas.
The results for Birrarung Marr show that the interquartile range is relatively similar on both wet and dry days, indicating that there is a certain level of pedestrian travel that happens regardless of precipitation. For both Flinders Street Station and Southern Cross Station, the minimum counts on rainy and non-rainy days look relatively similar, indicating a baseline level of pedestrian traffic that takes place regardless of precipitation.
For rainy days, Melbourne Central shows the least variance in its results, with Flinders Street Station recording the highest counts. Conversely, on non-rainy days Birrarung Marr records the largest maximum count of pedestrians; however, Flinders Street Station records the highest number of pedestrians when outliers are excluded.
Figure 5.2 displays the count data for each sensor on days where the weather was considered "hot", indicated by the red boxplot, and days where it was "not", indicated by the blue. The plot shows that days not considered hot registered the highest pedestrian counts, indicating that more people are likely to pass through these locations on days where the temperature is below 33 degrees. Specifically, Birrarung Marr registered its highest maximum counts on days below 33 degrees, which could be due to the parkland in the area and people feeling uncomfortable outdoors on days above 33 degrees. As in the first plot, both Flinders Street Station and Southern Cross Station have relatively similar minimum counts, which indicates that regardless of the temperature there will be a minimum level of pedestrian traffic. Birrarung Marr shows a relatively similar first quartile and Melbourne Central a relatively similar third quartile across the two conditions, indicating that whether or not the day exceeds 33 degrees, Birrarung Marr will have 25% of observations below that level and Melbourne Central will have 75% below that point.
Figure 5.3 displays the count data for each sensor on days where the weather was considered "cold" (temperature below 6 degrees), indicated by the red boxplot, and days where it was "not", indicated by the blue. The plot shows that on days where the temperature was above 6 degrees there was far more pedestrian traffic, indicating that weather not considered cold leads to relatively more foot traffic. However, the observation period only recorded 4 days with temperatures below 6 degrees, as indicated by the relatively small number of observations in red.
Furthermore, all three plots display the count data relative to the specified conditions and locations in a clear and interpretable manner. However, the plots implicitly treat the variable in focus as the sole contributor to the count result, as there are no additional variables (such as day of the week, or whether it is a working day) on which to base further inferences. The observation period also covers only four months at the start of the Australian year, spanning Summer and Autumn; since these are warmer periods, there are minimal observations for days below 6 degrees. Given winter would contain many more cold observations, this data is not very representative of the actual impact of weather below 6 degrees on people's decisions to visit these locations, so the plot is not sufficient for inferences about the impact of cold weather.
The visualisations tell us something interesting about the data, but to really understand the data, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020, combine with the pedestrian data for 2018 and 2020.
melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv"))
melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv"))
# task: combine the weather data together into an object, `melb_weather`
melb_weather <- bind_rows(melb_weather_2018,
melb_weather_2020) %>%
# remember to add info about high precipitation, high temperature, + low temps
mutate(
high_prcp = if_else(condition = prcp > 5,
true = "rain",
false = "none"),
high_temp = if_else(condition = tmax > 33,
true = "hot",
false = "not"),
low_temp = if_else(condition = tmin < 6,
true = "cold",
false = "not")
)
# now combine this weather data with the walking data
melb_walk_weather <- melb_walk_hols %>%
left_join(melb_weather, by = c("Date" = "date"))
We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.
We are going to process the data somewhat so we can fit a linear model to it. First, let's set the relevant variables to be factors. This ensures that the linear model interprets them appropriately.
We also add one to the count and then take the natural log of it. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model for this data, and we can help it be more ideal by taking the log of the counts, which helps stabilise the residuals (observed - predicted) when we fit the model. A small illustration of the transformation follows below.
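To see why we add one before taking the log: log(0) is -Inf, which would break the model for hours with zero pedestrians, while base R's log1p() maps zero counts to zero and expm1() reverses the transformation exactly:

```{r log1p-demo, eval = FALSE}
log(0)            # -Inf: an hour with zero pedestrians would be unusable
log1p(0)          # 0: log(0 + 1) keeps zero counts in the model
expm1(log1p(42))  # 42: expm1() undoes log1p() exactly
```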
melb_walk_weather_prep_lm <- melb_walk_weather %>%
mutate_at(.vars = vars(Sensor,
Time,
month,
year,
workday,
high_prcp,
high_temp,
low_temp),
as_factor) %>%
mutate(log_count = log1p(Count))
Now we fit a linear model, predicting log_count using Time, month, weekday, and the weather flag variables (high_prcp, high_temp, and low_temp).
walk_fit_lm <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp,
data = melb_walk_weather_prep_lm
)
walk_fit_lm
##
## Call:
## lm(formula = log_count ~ Time + month + weekday + high_prcp +
## high_temp + low_temp, data = melb_walk_weather_prep_lm)
##
## Coefficients:
## (Intercept) Time1 Time2 Time3 Time4
## 4.692082 -0.562636 -1.056344 -1.283673 -1.237656
## Time5 Time6 Time7 Time8 Time9
## -0.301751 0.706176 1.481418 2.040289 1.893324
## Time10 Time11 Time12 Time13 Time14
## 1.802180 1.945258 2.303573 2.315245 2.217888
## Time15 Time16 Time17 Time18 Time19
## 2.287660 2.507296 2.737825 2.490345 2.021638
## Time20 Time21 Time22 Time23 month2
## 1.653232 1.406048 1.208926 0.741751 0.026044
## month3 month4 weekday.L weekday.Q weekday.C
## -0.217182 -0.416541 -0.159779 -0.206872 -0.057006
## weekday^4 weekday^5 weekday^6 high_prcprain high_temphot
## 0.065570 0.120913 0.042203 -0.302129 -0.002752
## low_tempcold
## 0.556686
Provide some summary statistics on how well this model fits the data. What do you see? What statistics tell us about our model fit?
glance(walk_fit_lm) %>%
kable(caption = "Summary statistics") %>%
kable_styling(bootstrap_options = c("hover","stripped"),
latex_options = c("hold_position", "scale_down"))
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.5244989 | 0.5237145 | 1.213939 | 668.6655 | 0 | 36 | -34259.01 | 68592.02 | 68886.7 | 31266.37 | 21217 |
!> Answer: The model produced an R squared of 52.4%, which says that 52.4% of the variation in the dependent variable is explained by the independent variables. A p-value below 5% is considered statistically significant, indicating that we can reject the null hypothesis that the coefficients are all zero. Regardless of a meaningful p-value, the model has quite a low R squared, suggesting a high level of unexplained variability which may affect the predictive power of the model.
We have had one look at the model fit statistics, but let's now look at the fitted and observed (log_count) values, for each sensor:
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm)
ggplot(peds_aug_lm,
aes(x = .fitted,
y = log_count)) +
geom_point() +
facet_wrap(~Sensor) +
ggtitle("Fitted and log_count by Sensors")
There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the data. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components. We can do this directly with a calendar plot.
Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, Model, and log_count.
flinders_lm <- peds_aug_lm %>%
# filter the data to only look at flinder st
filter(Sensor == "Flinders Street Station Underpass") %>%
# Select the Date, Time, Count, .fitted, and log_count
select(Date, Time, Count, .fitted, log_count) %>%
# Now pivot the log count and fitted columns into a column called "model"
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(cols = c(.fitted, log_count),
names_to = "model",
values_to = "log_count") %>%
# Now we're going to undo the initial data transformation
mutate(Count = expm1(log_count))
flinders_lm_cal_2020 <- flinders_lm %>%
# Let's just look at 2020 to make it a bit easier
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal_2020 <- ggplot(flinders_lm_cal_2020) +
# Part of the interface to overlaying multiple lines in a calendar plot
# is drawing two separate `geom_line`s -
# See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by
# for details
geom_line(data = filter(flinders_lm_cal_2020, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal_2020, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal_2020) + theme(legend.position = "bottom") +
  ggtitle("Flinders pedestrian count - 2020")
Write a paragraph answering these questions:
flinders_lm_cal_2018 <- flinders_lm %>%
filter(year(Date) == "2018") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal_2018 <- ggplot(flinders_lm_cal_2018) +
geom_line(data = filter(flinders_lm_cal_2018, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(flinders_lm_cal_2018, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal_2018) + theme(legend.position = "bottom") +
  ggtitle("Flinders pedestrian count - 2018")
!> Answer: While the patterns for January and February are much the same for both 2018 and 2020, changes start to appear from the end of the second week of March, as the log count for 2020 dropped below that of 2018 and even flattened in April. This is mostly due to the outbreak of the Coronavirus at the beginning of March, when people were instructed to refrain from going out and social distancing was applied to prevent the spread of the pandemic.
- Is there a difference across sensors?
melbourne_lm <- peds_aug_lm %>%
filter(Sensor == "Melbourne Central") %>%
# Select the Date, Time, Count, .fitted, and log_count
select(Date, Time, Count, .fitted, log_count) %>%
# Now pivot the log count and fitted columns into a column called "model"
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(cols = c(.fitted, log_count),
names_to = "model",
values_to = "log_count") %>%
# Now we're going to undo the initial data transformation
mutate(Count = expm1(log_count))
melbourne_lm_cal_18 <- melbourne_lm %>%
filter(year(Date) == "2018") %>%
frame_calendar(x = Time, y = Count, date = Date)
gg_cal_melb_2018 <- ggplot(melbourne_lm_cal_18) +
geom_line(data = filter(melbourne_lm_cal_18, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(melbourne_lm_cal_18, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal_melb_2018) + theme(legend.position = "bottom") +
  ggtitle("Melbourne Central pedestrian count - 2018")
melbourne_lm_cal_20 <- melbourne_lm %>%
filter(year(Date) == "2020") %>%
frame_calendar(x = Time, y = Count, date = Date)
# the aesthetics are set inside each geom_line below, so only the data is
# passed to ggplot() here
gg_cal_melb_2020 <- ggplot(melbourne_lm_cal_20) +
geom_line(data = filter(melbourne_lm_cal_20, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(melbourne_lm_cal_20, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal_melb_2020) + theme(legend.position = "bottom") +
  ggtitle("Melbourne Central pedestrian count - 2020")
!> Answer:
What sort of improvements do you think you could make to the model? Fit two more models, trying to add more variables to the linear model.
# Add variable "Sensor" and "workday" to model
walk_fit_lm_sensor <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp + Sensor + workday,
data = melb_walk_weather_prep_lm
)
walk_fit_lm_sensor
##
## Call:
## lm(formula = log_count ~ Time + month + weekday + high_prcp +
## high_temp + low_temp + Sensor + workday, data = melb_walk_weather_prep_lm)
##
## Coefficients:
## (Intercept)
## 5.241775
## Time1
## -0.569393
## Time2
## -1.065477
## Time3
## -1.298391
## Time4
## -1.237719
## Time5
## -0.300149
## Time6
## 0.707778
## Time7
## 1.483020
## Time8
## 2.042730
## Time9
## 1.895764
## Time10
## 1.804620
## Time11
## 1.947698
## Time12
## 2.306013
## Time13
## 2.317685
## Time14
## 2.220329
## Time15
## 2.290101
## Time16
## 2.509733
## Time17
## 2.740265
## Time18
## 2.492785
## Time19
## 2.024078
## Time20
## 1.655672
## Time21
## 1.408489
## Time22
## 1.211366
## Time23
## 0.744191
## month2
## 0.018813
## month3
## -0.220986
## month4
## -0.419011
## weekday.L
## -0.087576
## weekday.Q
## -0.158326
## weekday.C
## -0.064994
## weekday^4
## 0.043385
## weekday^5
## 0.093973
## weekday^6
## 0.038317
## high_prcprain
## -0.306521
## high_temphot
## -0.006546
## low_tempcold
## 0.544489
## SensorFlinders Street Station Underpass
## 0.216828
## SensorBirrarung Marr
## -1.340151
## SensorSouthern Cross Station
## -1.286440
## workdayyes
## 0.081310
# Add variable "year" on top of "Sensor" and "workday" to the model
walk_fit_lm_year <- lm(
formula = log_count ~ Time + month + weekday + high_prcp + high_temp + low_temp + Sensor + year + workday,
data = melb_walk_weather_prep_lm
)
walk_fit_lm_year
##
## Call:
## lm(formula = log_count ~ Time + month + weekday + high_prcp +
## high_temp + low_temp + Sensor + year + workday, data = melb_walk_weather_prep_lm)
##
## Coefficients:
## (Intercept)
## 5.53628
## Time1
## -0.57322
## Time2
## -1.06787
## Time3
## -1.30072
## Time4
## -1.23768
## Time5
## -0.29933
## Time6
## 0.70860
## Time7
## 1.48384
## Time8
## 2.04320
## Time9
## 1.89623
## Time10
## 1.80509
## Time11
## 1.94817
## Time12
## 2.30648
## Time13
## 2.31816
## Time14
## 2.22080
## Time15
## 2.29057
## Time16
## 2.51088
## Time17
## 2.74074
## Time18
## 2.49326
## Time19
## 2.02455
## Time20
## 1.65614
## Time21
## 1.40896
## Time22
## 1.21184
## Time23
## 0.74466
## month2
## 0.02903
## month3
## -0.21376
## month4
## -0.56750
## weekday.L
## -0.08128
## weekday.Q
## -0.17279
## weekday.C
## -0.07523
## weekday^4
## 0.03912
## weekday^5
## 0.08368
## weekday^6
## 0.05284
## high_prcprain
## -0.08247
## high_temphot
## -0.04148
## low_tempcold
## 0.37801
## SensorFlinders Street Station Underpass
## 0.21683
## SensorBirrarung Marr
## -1.34123
## SensorSouthern Cross Station
## -1.28700
## year2020
## -0.63553
## workdayyes
## 0.09025
Why did you add those variables?
!> Answer: Using findings from previous sections, we understand that year, sensor, and workday have quite an impact on the pedestrian count:

- Sensors represent different locations, which correspond to different pedestrian behaviour. The pedestrian counts for the sensors take unique shapes across working days. During 2018, Flinders Street Station follows a pattern with two significant count peaks throughout the day, indicating the traditional working day, while Melbourne Central displays a rounder, relatively negatively skewed curve that is much the same across working and non-working days, which may be a result of the location being populated with retail and service businesses that stay open during holidays and weekends. However, across weekends and non-working days, Flinders Street takes a relatively similar shape to Melbourne Central, which reflects the location of Flinders Street Station, surrounded by a mixture of industries and places to visit during people's leisure time. This holds true all the way through to the midway point of March 2020 when, as we are aware, COVID-19 restrictions came into full effect, with both locations showing heavily reduced pedestrian counts on a relatively similar scale. Both display a miniature version of their previously observed patterns, reflecting the reduction in pedestrian traffic on account of the restrictions.
- The years 2018 and 2020 show a striking difference in pedestrian count, mostly due to COVID-19 in 2020 keeping people from going out to the streets. With that in mind, we want to add year and sensor into the model, hoping their impact may help increase its explanatory power.
- From the analysis in section 2, the pedestrian count is also impacted by the workday: the pattern observed during the week was different from the weekend, and that pattern repeated consistently almost every week.
bind_rows(
first = glance(walk_fit_lm),
sensor = glance(walk_fit_lm_sensor),
year = glance(walk_fit_lm_year),
.id = "type"
)
## # A tibble: 3 x 12
## type r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 first 0.524 0.524 1.21 669. 0 36 -34259. 68592.
## 2 sens… 0.690 0.689 0.980 1210. 0 40 -29716. 59514.
## 3 year 0.720 0.719 0.932 1363. 0 41 -28635. 57354.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Which model does the best? Why do you think that is? What statistics are you basing this decision off?
!> Answer: The model with the two new variables "year" and "sensor" - the "year" model - has the highest performance, which we can see from a high adjusted R squared of about 72% and a p-value of less than 5%. The other two models, even though statistically significant, have a lower R squared than the "year" model, indicating weaker explanatory power.
(Suggestion - Perhaps write this as a function to speed up comparison)
peds_aug_lm_sensor_year <- augment(walk_fit_lm_year, data = melb_walk_weather_prep_lm)
pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){
lm_fit %>%
filter(Sensor == sensor) %>%
select(Date, Time, Count, .fitted, log_count) %>%
pivot_longer(cols = c(.fitted, log_count),
names_to = "model",
values_to = "log_count") %>%
mutate(Count = expm1(log_count))
}
calendar_fit_obs <- function(lm_fit_aug){
data_cal <- lm_fit_aug %>%
frame_calendar(x = Time, y = Count, date = Date)
  # the aesthetics are set inside each geom_line below, so only the data is
  # passed to ggplot() here
  gg_cal <-
    ggplot(data_cal) +
geom_line(data = filter(data_cal, model == ".fitted"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date)) +
geom_line(data = filter(data_cal, model == "log_count"),
aes(x = .Time,
y = .Count,
colour = model,
group = Date))
prettify(gg_cal) + theme(legend.position = "bottom")
}
pivot_sensor(peds_aug_lm_sensor_year) %>%
filter(year(Date) == "2020") %>%
calendar_fit_obs()
What do you see? How does it compare to the previous model?
!> Answer: At first glance, the model bears little, if any, variation from the first model. However, by comparing the coefficients of both models, we can observe changes in the intercept and slopes. Such changes were small in value, which explains why we can hardly make out the variations in the calendar plot visually.
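One way to check this claim is to put the coefficient estimates of the models side by side; a minimal sketch using broom's tidy() (note this should run before the model objects are overwritten with their augmented data below):

```{r compare-coefficients, eval = FALSE}
library(broom)
library(dplyr)
# coefficient estimates side by side for the three fitted models
bind_rows(
  first  = tidy(walk_fit_lm),
  sensor = tidy(walk_fit_lm_sensor),
  year   = tidy(walk_fit_lm_year),
  .id = "type"
)
```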
Compare the fitted against the residuals, perhaps write a function to help you do this in a more readable way.
walk_fit_lm <- augment(walk_fit_lm, melb_walk_weather_prep_lm)
walk_fit_lm_year <- augment(walk_fit_lm_year, melb_walk_weather_prep_lm)
walk_fit_lm_sensor <- augment(walk_fit_lm_sensor, melb_walk_weather_prep_lm)
plot_fit_resid <- function(data){
ggplot(data,
aes(x = .fitted,
y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, colour = "yellow") +
facet_wrap(~Sensor) +
theme_classic()
}
plot_fit_resid(walk_fit_lm)
plot_fit_resid(walk_fit_lm_sensor)
plot_fit_resid(walk_fit_lm_year)
We’ve looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad?
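A minimal sketch of such a comparison for the fullest model (note that walk_fit_lm_year was overwritten above with its augmented data, so it already holds the .fitted and log_count columns):

```{r obs-vs-fitted-sketch, eval = FALSE}
ggplot(walk_fit_lm_year,
       aes(x = .fitted, y = log_count)) +
  geom_point(alpha = 0.1) +
  # the line observed = fitted; points near it are well predicted
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  facet_wrap(~ Sensor) +
  ggtitle("Observed vs fitted log counts, sensor + year + workday model")
```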
!> Answer: The plots above are residual plots, with residuals on the y-axis and fitted values on the x-axis, which helps detect problems with the linear models rather than with the data itself.
Looking at the plot for the original walk_fit_lm model, there are some scattered values, for instance in Birrarung Marr and Melbourne Central. The model is reasonable, but the Melbourne Central panel leans slightly towards a negative trend, which means the model slightly over-predicts there, although the trend is not large.
The walk_fit_lm with sensor plot is constructed from the linear model with the sensor variable added to the original model. Here the residuals are much less scattered and cluster around zero, a better fit for the model; although it is good, there is still a slight negative trend for Melbourne Central and Flinders Street Station Underpass.
However, with the addition of new variables into the model, we can see in the sensor-and-year plot that the slight negative trend has been reduced: we have a relatively random scatter of points centred around zero, which indicates that the model's predictions are reasonably accurate. Looking at these graphs, it can be inferred that there is no significant pattern left in the residuals; they are symmetrically distributed, tending to cluster towards the middle of the plot around zero.
Make sure to reference all of the R packages that you used here, along with any links or blog posts that you used to help you answer these questions.
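Base R's citation() function can generate the reference entries for the packages used; for example:

```{r package-citations, eval = FALSE}
citation("rwalkr")
citation("sugrrants")
citation("lubridate")
```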
The code below is what was used to retrieve the data in the data-raw folder.
# melb_bbox <- c(min(ped_loc$longitude) - .001,
# min(ped_loc$latitude) - 0.001,
# max(ped_loc$longitude) + .001,
# max(ped_loc$latitude) + 0.001)
#
# melb_map <- get_map(location = melb_bbox, source = "osm")
# write_rds(melb_map,
# path = here::here("2020/assignment-2/data-raw/melb-map.rds"),
# compress = "xz")
# code to download the stations around the airport and the weather times
# this is purely here so you can see how we downloaded this data
# it is not needed for you to complete the assignment, so it is commented out
# melb_stns <- read_table(
# file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
# col_names = c("ID",
# "lat",
# "lon",
# "elev",
# "state",
# "name",
# "v1",
# "v2",
# "v3"),
# skip = 353,
# n_max = 17081
# ) %>%
# filter(state == "MELBOURNE AIRPORT")
# #
# get_ncdc <- function(year){
# vroom::vroom(
# glue::glue(
# "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz"
# ),
# col_names = FALSE,
# col_select = 1:4
# )
# }
#
# clean_ncdc <- function(x){
# x %>%
# filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
# rename_all(~ c("station", "date", "variable", "value")) %>%
# mutate(date = ymd(date), value = value / 10) %>%
# pivot_wider(names_from = variable, values_from = value) %>%
# rename_all(tolower)
# }
# ncdc_2018 <- get_ncdc(2018)
# melb_ncdc_2018 <- clean_ncdc(ncdc_2018)
# write_csv(melb_ncdc_2018,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv"))
#
# ncdc_2020 <- get_ncdc(2020)
# beepr::beep(sound = 4)
# melb_ncdc_2020 <- clean_ncdc(ncdc_2020)
# beepr::beep(sound = 4)
# write_csv(melb_ncdc_2020,
# path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv"))