Background

Smart meters are now installed at every property in Melbourne. The Victorian government has an initiative to encourage households to examine their energy use and particularly to get a better deal on energy pricing.

You can actually download your own energy usage data for free - today we are going to be analysing someone’s electricity data, which they have kindly provided to us for this practical exam.

The data contains half hourly measurements of electricity usage for a few years, but it is definitely not tidy.

Instructions

Marks

Exam Goals

The goals of this exam are for you to demonstrate:

Exam conditions

Remember, you are able to look up functions on the internet, from the lecture slides, wherever.

But you cannot communicate with each other.

Q1: Scoping out initial data cleaning (2 points)

library(tidyverse)
library(lubridate)
library(tsibble)
library(sugrrants)
library(glue)
# We perform some initial data cleaning here for you
# you do not need to do any coding here
elec_raw <- read_csv("CITIPOWER_DETAILED.csv",
                 skip = 1,
                 col_names = c("id", 
                               "date", 
                               paste0("t", 1:48), 
                               paste0("extra", 1:5)),
                 col_types = "ccddddddddddddddddddddddddddddddddddddddddddddddddccccc") %>% 
  select(id:t48) %>% 
  mutate(meter = if_else(grepl("meter",date), 
                         date, 
                         NA_character_)) %>% 
  fill(meter) %>% 
  filter(meter == "meter1") %>% 
  select(-meter)

elec_raw
# A tibble: 756 x 50
   id    date      t1    t2    t3    t4    t5    t6    t7    t8    t9   t10
   <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 200   meter1    NA    NA    NA    NA    NA    NA    30    NA    NA    NA
 2 300   20171…     0     0     0     0     0     0     0     0     0     0
 3 300   20171…     0     0     0     0     0     0     0     0     0     0
 4 300   20171…     0     0     0     0     0     0     0     0     0     0
 5 300   20171…     0     0     0     0     0     0     0     0     0     0
 6 300   20171…     0     0     0     0     0     0     0     0     0     0
 7 300   20171…     0     0     0     0     0     0     0     0     0     0
 8 400   1         43    NA    NA    NA    NA    NA    NA    NA    NA    NA
 9 400   44        44    NA    89    NA    NA    NA    NA    NA    NA    NA
10 400   45        48    NA    NA    NA    NA    NA    NA    NA    NA    NA
# … with 746 more rows, and 38 more variables: t11 <dbl>, t12 <dbl>,
#   t13 <dbl>, t14 <dbl>, t15 <dbl>, t16 <dbl>, t17 <dbl>, t18 <dbl>,
#   t19 <dbl>, t20 <dbl>, t21 <dbl>, t22 <dbl>, t23 <dbl>, t24 <dbl>,
#   t25 <dbl>, t26 <dbl>, t27 <dbl>, t28 <dbl>, t29 <dbl>, t30 <dbl>,
#   t31 <dbl>, t32 <dbl>, t33 <dbl>, t34 <dbl>, t35 <dbl>, t36 <dbl>,
#   t37 <dbl>, t38 <dbl>, t39 <dbl>, t40 <dbl>, t41 <dbl>, t42 <dbl>,
#   t43 <dbl>, t44 <dbl>, t45 <dbl>, t46 <dbl>, t47 <dbl>, t48 <dbl>

We have provided the elec_raw dataset, and done some intial cleaning for you. Look at it, and then in a few sentences, describe the steps that you will need to take to read in your electricity usage, and tidy it.

The end product should look something like this:

  date           hour   kwh
  <date>        <dbl> <dbl>
1 2017-11-24      0.5     0
2 2017-11-24      0.5     0
3 2017-11-24      1       0
4 2017-11-24      1       0
5 2017-11-24      1.5     0
6 2017-11-24      1.5     0

(although your hour variable may be more explicit like 12:00-12:30, 12:30-01:00, 01:00-01:30. It is up to you.)

Your answer, a few (2-3) sentence describe in in words the cleaning you need to do to clean the data.

Q2: Next steps in data cleaning (2 points)

Before you start cleaning, you need to do the following:

# fill in the blanks here to follow the steps above
elec_prep <- elec_raw %>% 
  # think about which number id contains the most information
  filter(id == 300) %>% 
  # convert character format into date format using a function from lubridate
  mutate(date = ymd(date)) %>% 
  # Filter the date to only contain all dates the start of the year (dates equal to or larger than 01-01-2019).
  filter(date >= dmy("01-01-2019"))

elec_prep
# A tibble: 288 x 50
   id    date          t1    t2    t3    t4    t5    t6    t7    t8    t9
   <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 300   2019-01-01 0.55  0.138 0.131 0.106 0.1   0.069 0.056 0.056 0.056
 2 300   2019-01-02 0.063 0.056 0.056 0.056 0.044 0.056 0.05  0.056 0.05 
 3 300   2019-01-03 0.038 0.031 0.038 0.038 0.038 0.031 0.031 0.038 0.038
 4 300   2019-01-04 0.044 0.05  0.038 0.044 0.044 0.044 0.038 0.031 0.038
 5 300   2019-01-05 0.031 0.044 0.044 0.038 0.031 0.038 0.1   0.063 0.044
 6 300   2019-01-06 0.056 0.031 0.038 0.044 0.025 0.038 0.044 0.038 0.025
 7 300   2019-01-07 0.038 0.031 0.038 0.038 0.038 0.038 0.031 0.031 0.056
 8 300   2019-01-08 0.031 0.038 0.038 0.031 0.031 0.038 0.038 0.031 0.031
 9 300   2019-01-09 0.113 0.056 0.038 0.038 0.038 0.038 0.031 0.031 0.031
10 300   2019-01-10 0.031 0.031 0.038 0.038 0.038 0.031 0.031 0.038 0.038
# … with 278 more rows, and 39 more variables: t10 <dbl>, t11 <dbl>,
#   t12 <dbl>, t13 <dbl>, t14 <dbl>, t15 <dbl>, t16 <dbl>, t17 <dbl>,
#   t18 <dbl>, t19 <dbl>, t20 <dbl>, t21 <dbl>, t22 <dbl>, t23 <dbl>,
#   t24 <dbl>, t25 <dbl>, t26 <dbl>, t27 <dbl>, t28 <dbl>, t29 <dbl>,
#   t30 <dbl>, t31 <dbl>, t32 <dbl>, t33 <dbl>, t34 <dbl>, t35 <dbl>,
#   t36 <dbl>, t37 <dbl>, t38 <dbl>, t39 <dbl>, t40 <dbl>, t41 <dbl>,
#   t42 <dbl>, t43 <dbl>, t44 <dbl>, t45 <dbl>, t46 <dbl>, t47 <dbl>,
#   t48 <dbl>

(remember to knit your document as you work on this)

Q3: Put the data into tidy format (2 points)

The columns t1 through to t48 represent half hourly times (t1 = first half hour, t2 = the second half hour … t48 is the 48th half hour).

Gather the data into a tidy format where you have columns: id, date, halfhour, and kwh (kilowatt hours).

elec_gather <- elec_prep %>% 
  gather(key = halfhour,  #; remove all the gather stuff, they should know how
                          #; to do this.
         value = kwh, 
         t1:t48)

elec_gather
# A tibble: 13,824 x 4
   id    date       halfhour   kwh
   <chr> <date>     <chr>    <dbl>
 1 300   2019-01-01 t1       0.55 
 2 300   2019-01-02 t1       0.063
 3 300   2019-01-03 t1       0.038
 4 300   2019-01-04 t1       0.044
 5 300   2019-01-05 t1       0.031
 6 300   2019-01-06 t1       0.056
 7 300   2019-01-07 t1       0.038
 8 300   2019-01-08 t1       0.031
 9 300   2019-01-09 t1       0.113
10 300   2019-01-10 t1       0.031
# … with 13,814 more rows

Q3A: Add time information (2 points)

  • Create an hour variable, by extracting out the number from the halfhour column (see parse_number()), and convert that into hours.
  • We arrange by date and hour for you, so that the data is in chronological order
  • Also add columns for weekday, month, and year, using functions from lubridate
elec_hh <- elec_gather %>% 
  # Create an `hour` variable, by extracting out the number from the `halfhour` column (see `parse_number()`), and convert that into hours. 
  mutate(hour = parse_number(halfhour) / 2) %>% 
  # drop half hour
  select(-halfhour) %>% 
  #  We arrange by date and hour for you, so that the data is in chronological order
  # you don't need to add anything here
  arrange(date, 
          hour) %>% 
  #; give them the variable names for mutate, but not much else
  # Also add columns for weekday, month, and year, using functions from `lubridate`
  mutate(wday = wday(date, 
                     label = TRUE, 
                     abbr = TRUE,
                     week_start = 1),
         month = month(date, 
                       label = TRUE, 
                       abbr = TRUE),
         year = year(date))

elec_hh  
# A tibble: 13,824 x 7
   id    date         kwh  hour wday  month  year
   <chr> <date>     <dbl> <dbl> <ord> <ord> <dbl>
 1 300   2019-01-01 0.55    0.5 Tue   Jan    2019
 2 300   2019-01-01 0.138   1   Tue   Jan    2019
 3 300   2019-01-01 0.131   1.5 Tue   Jan    2019
 4 300   2019-01-01 0.106   2   Tue   Jan    2019
 5 300   2019-01-01 0.1     2.5 Tue   Jan    2019
 6 300   2019-01-01 0.069   3   Tue   Jan    2019
 7 300   2019-01-01 0.056   3.5 Tue   Jan    2019
 8 300   2019-01-01 0.056   4   Tue   Jan    2019
 9 300   2019-01-01 0.056   4.5 Tue   Jan    2019
10 300   2019-01-01 0.056   5   Tue   Jan    2019
# … with 13,814 more rows

Q3B: Add further weekday information (3 points)

Now add a column that is “workday” if a weekday is Mon-Fri, and “holiday” otherwise (hint: look at if_else and %in%).

In the same, work column, use information from the vic_holidays data, so that we add information so it is “holiday” if it is in the vic_holidays$date, else it stays the same.

vic_holidays <- holiday_aus(2017:2019, state = "VIC")

# first mutate to create a column, `work`, which is "workday" if the 
# condition is true, and "holiday" if it is false
elec <- elec_hh %>% 
  mutate(work = if_else(condition = wday %in% c("Mon", #; replace this vector
                                                "Tue", #; with nothing
                                                "Wed", #; but leave everything 
                                                "Thu", #; else
                                                "Fri"), 
                        true = "workday", 
                        false = "holiday")) %>% 
  # then mutate work again with a new condition, where the date is in the 
  # `vic_holiday$date` - which indicates if it is a holiday, or a work day
  # 
  mutate(work = if_else(condition = date %in% vic_holidays$date, 
                        true = "holiday", 
                        false = work)) #; leave work here

(remember to knit your document as you work on this)

Q4: Summarise the data (2 points)

Aggregate the daily kwh usage to the total kwH for each day of the year (assuming no missing data).

Add a column describing the day of the week to this data.

elec_daily <- elec %>%
  # group by each day of the day
  group_by(date) %>% #; leave blank
  # summarise the kilowat hours to be the total
  summarise(kwh = sum(kwh)) %>% #; place summarise, leave rest blank
  # create the weekday from the date column
  mutate(wday = wday(date,  #; leave blank
                     label = TRUE, 
                     abbr = TRUE,
                     week_start = 1))

(remember to knit your document as you work on this)

Q5: Visualise electricty (3 points)

Make a side-by-side boxplot of kilowat hours for each week day.

ggplot(elec_daily, 
       aes(x = wday, #; replace basically most code with blanks
           y = kwh)) + 
  geom_boxplot()

Write what you learn, and answer the question: “Are some days of the week using more electricity than others?”

Your Answer here - what do you learn from this plot?

Q6: Line plot of half hourly usage (3 points)

Using the tidied up electricity data (elec), filter the dates to be between the 25th September and the 15th October.

Make a line plot of half-hourly usage.

# filter out the data to look at data from 25-09-2019 to 15-10-2019
elec_sep_oct <- elec %>% 
  filter(date >= dmy("25-09-2019"), #; remove most of this
         date <= dmy("15-10-2019"))

#;omit everything  except for one thing
ggplot(elec_sep_oct, 
       aes(x = hour, 
           y = kwh, 
           group = date)) + #; keep this
  geom_line() + 
  # show a facet for each date
  facet_wrap(~date) #; omit this

Describe what you learn about electricity use during this period.

Your Answer:

(remember to knit your document as you work on this)

Q7: Add insight with weather data

We have extracted temperature data at Melbourne airport from the Bureau of Meterology, and saved it in the max_temp.csv. The next questions relate to this weather data.

Q7A: Join the data (2 points)

Join the temperature and electricity data so you can draw insights on how maximum temperature and electricity usage data are connected.

max_temp <- readr::read_csv("max_temp.csv")

# join the elec data with max temp.
elec_temp <- elec %>% 
  left_join(max_temp, by = "date") #; omit this, they should know how to do a join

elec_temp
# A tibble: 13,824 x 11
   id    date         kwh  hour wday  month  year work  station_number
   <chr> <date>     <dbl> <dbl> <ord> <ord> <dbl> <chr>          <dbl>
 1 300   2019-01-01 0.55    0.5 Tue   Jan    2019 holi…          86282
 2 300   2019-01-01 0.138   1   Tue   Jan    2019 holi…          86282
 3 300   2019-01-01 0.131   1.5 Tue   Jan    2019 holi…          86282
 4 300   2019-01-01 0.106   2   Tue   Jan    2019 holi…          86282
 5 300   2019-01-01 0.1     2.5 Tue   Jan    2019 holi…          86282
 6 300   2019-01-01 0.069   3   Tue   Jan    2019 holi…          86282
 7 300   2019-01-01 0.056   3.5 Tue   Jan    2019 holi…          86282
 8 300   2019-01-01 0.056   4   Tue   Jan    2019 holi…          86282
 9 300   2019-01-01 0.056   4.5 Tue   Jan    2019 holi…          86282
10 300   2019-01-01 0.056   5   Tue   Jan    2019 holi…          86282
# … with 13,814 more rows, and 2 more variables: max_temperature <dbl>,
#   quality <chr>

Q7B: Visualise electricity and temperature (3 points)

Explore the dates February 2nd - Feburary 7th, and create graphics below:

  • Colour the lines, based on the maximum daily temperature.
  • What do you learn about the relationship between your energy use and the maximum daily temperature.
library(viridis)

# filter the data to contain just dates between the 2nd Feb and 7th Feb
elec_temp_feb <- elec_temp %>% 
  filter(date >= dmy("02-02-2019"), #; omit, they should know how to do this now
         date <= dmy("07-02-2019"))

# plot a line plot of the 
ggplot(elec_temp_feb,
       aes(x = hour, #; omit this
           y = kwh,  #; omit this
           group = date, #; keep this
           colour = max_temperature)) +  #; omit this
  geom_line() + #; omit this
  # you can explore with facets if you like
  # facet_wrap(~date, ncol=3) + #; keep this commented out
  # this plots the temperature as a colour, where brighteer colours are
  # hotter temperatures
  scale_colour_viridis_c("temperature", option = "inferno") + #; keep this
  theme_dark() + #; keep
  theme(legend.position = "bottom") #; keep

Your answer here: what do you learn?

Q8: You Decide (5 points)

Decide on one (or more) more plots to describe about your energy usage and weather. Make these plots and write a few (2-4) sentences describing what you have learned from these data visualisations.

Appendix

The code below contains the code used to extract the BoM data - for your own interest.

# This code does not need to run, it is only here for your own interests
library(bomrang)
library(viridis)
stations <- sweep_for_stations(latlon = c(-37.8136, 144.9631)) #Melbourne lat/long 

max_temp <- get_historical(stationid = "086282", type = "max") %>% 
  as.data.frame() %>% 
  mutate(date = paste(year, 
                      month, 
                      day, 
                      sep = "-")) %>%
  mutate(date = ymd(date)) %>% 
  filter(year >= 2019,  
         month >= 1) %>% 
  select(station_number, 
         max_temperature,
         quality, 
         date)

readr::write_csv(max_temp, "max_temp.csv")