Smart meters are now installed at every property in Melbourne. The Victorian government has an initiative to encourage households to examine their energy use and particularly to get a better deal on energy pricing.
You can actually download your own energy usage data for free - today we are going to be analysing someone’s electricity data, which they have kindly provided to us for this practical exam.
The data contains half hourly measurements of electricity usage for a few years, but it is definitely not tidy.
To make it a little easier for you, a skeleton of R code is provided in the Rmd file. Where you see ???
or ___
means that something is missing and you will need to fill it in with the appropriate function, argument or operator.
You will also need to rearrange the code as necessary to do the calculations needed. You can write your own code, if you prefer.
Original work is expected. Any material used from external sources needs to be acknowledged.
Spell check using the RStudio spell-checker before submission.
What to turn in:
REMEMBER to knit your document as you go
Total points for the practical exam is 40.
5 points of the score from the assignment will be given based on whether the marker can compile your report, and get the same answers as you, and find your explanations of the plots understandable and informative.
5 points will be reserved for readability, clearly written R code, and appropriate citing of external sources.
Accuracy and completeness of answers, and clarity of explanations will be the basis for the remaining 30 points.
The goals of this exam are for you to demonstrate:
Remember, you are able to look up functions on the internet, from the lecture slides, wherever.
But you cannot communicate with each other.
library(tidyverse)
library(lubridate)
library(tsibble)
library(sugrrants)
library(glue)
# We perform some initial data cleaning here for you
# you do not need to do any coding here
elec_raw <- read_csv("CITIPOWER_DETAILED.csv",
skip = 1,
col_names = c("id",
"date",
paste0("t", 1:48),
paste0("extra", 1:5)),
col_types = "ccddddddddddddddddddddddddddddddddddddddddddddddddccccc") %>%
select(id:t48) %>%
mutate(meter = if_else(grepl("meter",date),
date,
NA_character_)) %>%
fill(meter) %>%
filter(meter == "meter1") %>%
select(-meter)
elec_raw
# A tibble: 756 x 50
id date t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 200 meter1 NA NA NA NA NA NA 30 NA NA NA
2 300 20171… 0 0 0 0 0 0 0 0 0 0
3 300 20171… 0 0 0 0 0 0 0 0 0 0
4 300 20171… 0 0 0 0 0 0 0 0 0 0
5 300 20171… 0 0 0 0 0 0 0 0 0 0
6 300 20171… 0 0 0 0 0 0 0 0 0 0
7 300 20171… 0 0 0 0 0 0 0 0 0 0
8 400 1 43 NA NA NA NA NA NA NA NA NA
9 400 44 44 NA 89 NA NA NA NA NA NA NA
10 400 45 48 NA NA NA NA NA NA NA NA NA
# … with 746 more rows, and 38 more variables: t11 <dbl>, t12 <dbl>,
# t13 <dbl>, t14 <dbl>, t15 <dbl>, t16 <dbl>, t17 <dbl>, t18 <dbl>,
# t19 <dbl>, t20 <dbl>, t21 <dbl>, t22 <dbl>, t23 <dbl>, t24 <dbl>,
# t25 <dbl>, t26 <dbl>, t27 <dbl>, t28 <dbl>, t29 <dbl>, t30 <dbl>,
# t31 <dbl>, t32 <dbl>, t33 <dbl>, t34 <dbl>, t35 <dbl>, t36 <dbl>,
# t37 <dbl>, t38 <dbl>, t39 <dbl>, t40 <dbl>, t41 <dbl>, t42 <dbl>,
# t43 <dbl>, t44 <dbl>, t45 <dbl>, t46 <dbl>, t47 <dbl>, t48 <dbl>
We have provided the elec_raw
dataset, and done some intial cleaning for you. Look at it, and then in a few sentences, describe the steps that you will need to take to read in your electricity usage, and tidy it.
The end product should look something like this:
date hour kwh
<date> <dbl> <dbl>
1 2017-11-24 0.5 0
2 2017-11-24 0.5 0
3 2017-11-24 1 0
4 2017-11-24 1 0
5 2017-11-24 1.5 0
6 2017-11-24 1.5 0
(although your hour
variable may be more explicit like 12:00-12:30, 12:30-01:00, 01:00-01:30. It is up to you.)
Your answer, a few (2-3) sentence describe in in words the cleaning you need to do to clean the data.
Before you start cleaning, you need to do the following:
id
column. There are some rows with a lot of missing values that contain junk. Use the information in the id
column to only keep the most useful data (Hint: keep only the id
that provides the most information).lubridate
# fill in the blanks here to follow the steps above
elec_prep <- elec_raw %>%
# think about which number id contains the most information
filter(id == 300) %>%
# convert character format into date format using a function from lubridate
mutate(date = ymd(date)) %>%
# Filter the date to only contain all dates the start of the year (dates equal to or larger than 01-01-2019).
filter(date >= dmy("01-01-2019"))
elec_prep
# A tibble: 288 x 50
id date t1 t2 t3 t4 t5 t6 t7 t8 t9
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 300 2019-01-01 0.55 0.138 0.131 0.106 0.1 0.069 0.056 0.056 0.056
2 300 2019-01-02 0.063 0.056 0.056 0.056 0.044 0.056 0.05 0.056 0.05
3 300 2019-01-03 0.038 0.031 0.038 0.038 0.038 0.031 0.031 0.038 0.038
4 300 2019-01-04 0.044 0.05 0.038 0.044 0.044 0.044 0.038 0.031 0.038
5 300 2019-01-05 0.031 0.044 0.044 0.038 0.031 0.038 0.1 0.063 0.044
6 300 2019-01-06 0.056 0.031 0.038 0.044 0.025 0.038 0.044 0.038 0.025
7 300 2019-01-07 0.038 0.031 0.038 0.038 0.038 0.038 0.031 0.031 0.056
8 300 2019-01-08 0.031 0.038 0.038 0.031 0.031 0.038 0.038 0.031 0.031
9 300 2019-01-09 0.113 0.056 0.038 0.038 0.038 0.038 0.031 0.031 0.031
10 300 2019-01-10 0.031 0.031 0.038 0.038 0.038 0.031 0.031 0.038 0.038
# … with 278 more rows, and 39 more variables: t10 <dbl>, t11 <dbl>,
# t12 <dbl>, t13 <dbl>, t14 <dbl>, t15 <dbl>, t16 <dbl>, t17 <dbl>,
# t18 <dbl>, t19 <dbl>, t20 <dbl>, t21 <dbl>, t22 <dbl>, t23 <dbl>,
# t24 <dbl>, t25 <dbl>, t26 <dbl>, t27 <dbl>, t28 <dbl>, t29 <dbl>,
# t30 <dbl>, t31 <dbl>, t32 <dbl>, t33 <dbl>, t34 <dbl>, t35 <dbl>,
# t36 <dbl>, t37 <dbl>, t38 <dbl>, t39 <dbl>, t40 <dbl>, t41 <dbl>,
# t42 <dbl>, t43 <dbl>, t44 <dbl>, t45 <dbl>, t46 <dbl>, t47 <dbl>,
# t48 <dbl>
(remember to knit your document as you work on this)
The columns t1 through to t48 represent half hourly times (t1 = first half hour, t2 = the second half hour … t48 is the 48th half hour).
Gather the data into a tidy format where you have columns: id
, date
, halfhour
, and kwh
(kilowatt hours).
elec_gather <- elec_prep %>%
gather(key = halfhour, #; remove all the gather stuff, they should know how
#; to do this.
value = kwh,
t1:t48)
elec_gather
# A tibble: 13,824 x 4
id date halfhour kwh
<chr> <date> <chr> <dbl>
1 300 2019-01-01 t1 0.55
2 300 2019-01-02 t1 0.063
3 300 2019-01-03 t1 0.038
4 300 2019-01-04 t1 0.044
5 300 2019-01-05 t1 0.031
6 300 2019-01-06 t1 0.056
7 300 2019-01-07 t1 0.038
8 300 2019-01-08 t1 0.031
9 300 2019-01-09 t1 0.113
10 300 2019-01-10 t1 0.031
# … with 13,814 more rows
hour
variable, by extracting out the number from the halfhour
column (see parse_number()
), and convert that into hours.lubridate
elec_hh <- elec_gather %>%
# Create an `hour` variable, by extracting out the number from the `halfhour` column (see `parse_number()`), and convert that into hours.
mutate(hour = parse_number(halfhour) / 2) %>%
# drop half hour
select(-halfhour) %>%
# We arrange by date and hour for you, so that the data is in chronological order
# you don't need to add anything here
arrange(date,
hour) %>%
#; give them the variable names for mutate, but not much else
# Also add columns for weekday, month, and year, using functions from `lubridate`
mutate(wday = wday(date,
label = TRUE,
abbr = TRUE,
week_start = 1),
month = month(date,
label = TRUE,
abbr = TRUE),
year = year(date))
elec_hh
# A tibble: 13,824 x 7
id date kwh hour wday month year
<chr> <date> <dbl> <dbl> <ord> <ord> <dbl>
1 300 2019-01-01 0.55 0.5 Tue Jan 2019
2 300 2019-01-01 0.138 1 Tue Jan 2019
3 300 2019-01-01 0.131 1.5 Tue Jan 2019
4 300 2019-01-01 0.106 2 Tue Jan 2019
5 300 2019-01-01 0.1 2.5 Tue Jan 2019
6 300 2019-01-01 0.069 3 Tue Jan 2019
7 300 2019-01-01 0.056 3.5 Tue Jan 2019
8 300 2019-01-01 0.056 4 Tue Jan 2019
9 300 2019-01-01 0.056 4.5 Tue Jan 2019
10 300 2019-01-01 0.056 5 Tue Jan 2019
# … with 13,814 more rows
Now add a column that is “workday” if a weekday is Mon-Fri, and “holiday” otherwise (hint: look at if_else
and %in%
).
In the same, work
column, use information from the vic_holidays
data, so that we add information so it is “holiday” if it is in the vic_holidays$date
, else it stays the same.
vic_holidays <- holiday_aus(2017:2019, state = "VIC")
# first mutate to create a column, `work`, which is "workday" if the
# condition is true, and "holiday" if it is false
elec <- elec_hh %>%
mutate(work = if_else(condition = wday %in% c("Mon", #; replace this vector
"Tue", #; with nothing
"Wed", #; but leave everything
"Thu", #; else
"Fri"),
true = "workday",
false = "holiday")) %>%
# then mutate work again with a new condition, where the date is in the
# `vic_holiday$date` - which indicates if it is a holiday, or a work day
#
mutate(work = if_else(condition = date %in% vic_holidays$date,
true = "holiday",
false = work)) #; leave work here
(remember to knit your document as you work on this)
Aggregate the daily kwh usage to the total kwH for each day of the year (assuming no missing data).
Add a column describing the day of the week to this data.
elec_daily <- elec %>%
# group by each day of the day
group_by(date) %>% #; leave blank
# summarise the kilowat hours to be the total
summarise(kwh = sum(kwh)) %>% #; place summarise, leave rest blank
# create the weekday from the date column
mutate(wday = wday(date, #; leave blank
label = TRUE,
abbr = TRUE,
week_start = 1))
(remember to knit your document as you work on this)
Make a side-by-side boxplot of kilowat hours for each week day.
ggplot(elec_daily,
aes(x = wday, #; replace basically most code with blanks
y = kwh)) +
geom_boxplot()
Write what you learn, and answer the question: “Are some days of the week using more electricity than others?”
Your Answer here - what do you learn from this plot?
Using the tidied up electricity data (elec
), filter the dates to be between the 25th September and the 15th October.
Make a line plot of half-hourly usage.
# filter out the data to look at data from 25-09-2019 to 15-10-2019
elec_sep_oct <- elec %>%
filter(date >= dmy("25-09-2019"), #; remove most of this
date <= dmy("15-10-2019"))
#;omit everything except for one thing
ggplot(elec_sep_oct,
aes(x = hour,
y = kwh,
group = date)) + #; keep this
geom_line() +
# show a facet for each date
facet_wrap(~date) #; omit this
Describe what you learn about electricity use during this period.
Your Answer:
(remember to knit your document as you work on this)
We have extracted temperature data at Melbourne airport from the Bureau of Meterology, and saved it in the max_temp.csv
. The next questions relate to this weather data.
Join the temperature and electricity data so you can draw insights on how maximum temperature and electricity usage data are connected.
max_temp <- readr::read_csv("max_temp.csv")
# join the elec data with max temp.
elec_temp <- elec %>%
left_join(max_temp, by = "date") #; omit this, they should know how to do a join
elec_temp
# A tibble: 13,824 x 11
id date kwh hour wday month year work station_number
<chr> <date> <dbl> <dbl> <ord> <ord> <dbl> <chr> <dbl>
1 300 2019-01-01 0.55 0.5 Tue Jan 2019 holi… 86282
2 300 2019-01-01 0.138 1 Tue Jan 2019 holi… 86282
3 300 2019-01-01 0.131 1.5 Tue Jan 2019 holi… 86282
4 300 2019-01-01 0.106 2 Tue Jan 2019 holi… 86282
5 300 2019-01-01 0.1 2.5 Tue Jan 2019 holi… 86282
6 300 2019-01-01 0.069 3 Tue Jan 2019 holi… 86282
7 300 2019-01-01 0.056 3.5 Tue Jan 2019 holi… 86282
8 300 2019-01-01 0.056 4 Tue Jan 2019 holi… 86282
9 300 2019-01-01 0.056 4.5 Tue Jan 2019 holi… 86282
10 300 2019-01-01 0.056 5 Tue Jan 2019 holi… 86282
# … with 13,814 more rows, and 2 more variables: max_temperature <dbl>,
# quality <chr>
Explore the dates February 2nd - Feburary 7th, and create graphics below:
library(viridis)
# filter the data to contain just dates between the 2nd Feb and 7th Feb
elec_temp_feb <- elec_temp %>%
filter(date >= dmy("02-02-2019"), #; omit, they should know how to do this now
date <= dmy("07-02-2019"))
# plot a line plot of the
ggplot(elec_temp_feb,
aes(x = hour, #; omit this
y = kwh, #; omit this
group = date, #; keep this
colour = max_temperature)) + #; omit this
geom_line() + #; omit this
# you can explore with facets if you like
# facet_wrap(~date, ncol=3) + #; keep this commented out
# this plots the temperature as a colour, where brighteer colours are
# hotter temperatures
scale_colour_viridis_c("temperature", option = "inferno") + #; keep this
theme_dark() + #; keep
theme(legend.position = "bottom") #; keep
Your answer here: what do you learn?
Decide on one (or more) more plots to describe about your energy usage and weather. Make these plots and write a few (2-4) sentences describing what you have learned from these data visualisations.
The code below contains the code used to extract the BoM data - for your own interest.
# This code does not need to run, it is only here for your own interests
library(bomrang)
library(viridis)
stations <- sweep_for_stations(latlon = c(-37.8136, 144.9631)) #Melbourne lat/long
max_temp <- get_historical(stationid = "086282", type = "max") %>%
as.data.frame() %>%
mutate(date = paste(year,
month,
day,
sep = "-")) %>%
mutate(date = ymd(date)) %>%
filter(year >= 2019,
month >= 1) %>%
select(station_number,
max_temperature,
quality,
date)
readr::write_csv(max_temp, "max_temp.csv")