<!-- background-color: #006DAE --> <!-- class: middle center hide-slide-number --> <div class="shade_black" style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;"> <i class="fas fa-exclamation-circle"></i> These slides are best viewed in Chrome and occasionally need to be refreshed if elements do not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>. </div> <br> .white[Press the **right arrow** to progress to the next slide!] --- background-image: url(images/bg1.jpg) background-size: cover class: hide-slide-number split-70 title-slide count: false .column.shade_black[.content[ <br> # .monash-blue.outline-text[ETC5510: Introduction to Data Analysis] <h2 class="monash-blue2 outline-text" style="font-size: 30pt!important;">Week 10, part B</h2> <br> <h2 style="font-weight:900!important;">Classification Trees</h2> .bottom_abs.width100[ Lecturers: *Professor Di Cook, Nicholas Tierney & Stuart Lee* Department of Econometrics and Business Statistics
<i class="fas fa-envelope faa-float animated "></i>
ETC5510.Clayton-X@monash.edu May 2020 <br> ] ]] <div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);"> <div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;"> <svg class="clip-svg absolute"> <defs> <clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox"> <polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" /> </clipPath> </defs> </svg> </div> </div> --- # Recap: Decision Trees --- # Admin - Assignment 2 peer evaluation, to be completed on ED by next Monday - Project - Talk to us about your data in class and at consults; the next milestone is due on Friday and is available on the assessments tab on ED - Practical exam - Opens 6pm next Wednesday, closes 6pm Thursday --- # What is a decision tree? .pull-left[ Tree-based models consist of one or more nested `if-then` statements for the predictors that partition the data. Within these partitions, a model is used to predict the outcome. ] .pull-right[ <img src="images/tree.jpg" width="100%" style="display: block; margin: auto;" /> .small[Source: [Egor Dezhic](becominghuman.ai)] ] --- # Regression Tree .pull-left[ <img src="lecture_10b_files/figure-html/reg-tree-split-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="lecture_10b_files/figure-html/show-split-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Regression Tree .pull-left[ <img src="lecture_10b_files/figure-html/show-split-again-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="lecture_10b_files/figure-html/rpart-plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Regression tree - What if we want to predict membership of a particular group? Say, predicting whether someone passes a course based on two exam scores: - Moving from a continuous to a categorical response. <img src="lecture_10b_files/figure-html/unnamed-chunk-1-1.png" width="80%" style="display: block; margin: auto;" /> --- # Regression? Classification? - Regression trees give the predicted response for an observation by using the mean response of the observations that belong to the same terminal node: <img src="lecture_10b_files/figure-html/show-reg-pred-1.png" width="100%" style="display: block; margin: auto;" /> --- # Classification A classification tree predicts each observation as belonging to the most commonly occurring class among the observations in its region. However, when we interpret a classification tree, we are often interested not only in the class prediction (what is most common), but also in the proportion of correct classifications. --- # Building a classification tree - We take a similar approach to building a classification tree as for a regression tree - We use the same "recursive binary splitting" approach - But we don't use the residual sum of squares $$ SS_T = \sum (y_i-\bar{y})^2 $$ Since we now have a categorical response, we need some other way to measure how good a split is! --- # Classification tree - We can use the "classification error". - We count the number of misclassified observations, and choose the split with the fewest misclassifications. - We can represent this in an equation as the .orange[fraction of observations in a region which don't belong to the most common class].
`$$E = 1 - \text{max}_{k}(\hat{p}_{mk})$$` Here, `\(\hat{p}_{mk}\)` refers to the proportion of observations in the `\(m\)`th region that are from the `\(k\)`th class. --- # Understanding classification Another way to think about this is to understand when `\(E\)` is zero, and when `\(E\)` is large: `\(E = 1 - \text{max}_{k}(\hat{p}_{mk})\)` `\(E\)` is zero when `\(\text{max}_{k}(\hat{p}_{mk})\)` is 1, which happens when all observations in the region belong to the same class. --- # Classification trees - A classification tree is used to predict a .orange[categorical response] and a regression tree is used to predict a quantitative response - Use recursive binary splitting to grow a classification tree. That is, sequentially break the data into two subsets, typically using a single variable each time. - The predicted value for a new observation, `\(x_0\)`, will be the .orange[most commonly occurring class] of observations in the sub-region in which `\(x_0\)` falls --- # Predicting pass or fail? Consider the dataset `Exam`, where two exam scores are given for each student, and a class `Label` represents whether they passed or failed the course. .pull-left[ ``` ## Exam1 Exam2 Label ## 1 34.62366 78.02469 0 ## 2 30.28671 43.89500 0 ## 3 35.84741 72.90220 0 ## 4 60.18260 86.30855 1 ``` ] .pull-right[ <img src="lecture_10b_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Your turn: Open "10b-exercise-intro.Rmd" and let's decide on a point to split the data. --- # Calculate the number of misclassifications Along all splits for `Exam1`, classifying according to the majority class for the left and right splits: <img src="gifs/two_d_cart.gif" width="80%" style="display: block; margin: auto;" /> Red dots are .orange["fails"], blue dots are .green["passes"], and crosses indicate misclassifications. .small[Source: John Ormerod, U.Syd] --- # Calculate the number of misclassifications Along all splits for `Exam2`, classifying according to the majority class for the top and bottom splits: <img src="gifs/two_d_cart2.gif" width="80%" style="display: block; margin: auto;" /> Red dots are .orange["fails"], blue dots are .green["passes"], and crosses indicate misclassifications. .small[Source: John Ormerod, U.Syd] --- # Combining the results from `Exam1` and `Exam2` splits - The minimum number of misclassifications from using all possible splits of `Exam1` was 19, when the value of `Exam1` was **56.7** - The minimum number of misclassifications from using all possible splits of `Exam2` was 23, when the value of `Exam2` was .orange[52.5] -- So we split on the best of these, i.e., split the data on `Exam1` at 56.7. --- # Split criteria - purity/impurity metrics It turns out that classification error is not sufficiently sensitive for tree-growing. In practice, two other measures are preferable, as they are more sensitive: - the Gini Index, and - Information Entropy. They are both quite similar numerically. Small values mean that a node contains mostly observations of a single class, referred to as .orange[node purity].
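The Gini Index `\(G\)` and the Information Entropy `\(D\)` have simple forms in the notation used earlier. For the `\(m\)`th region:

`$$G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) \qquad D = -\sum_{k=1}^{K} \hat{p}_{mk}\log(\hat{p}_{mk})$$`

---

# Calculating impurity in R

As a small sketch of how the three criteria behave (the function name `impurity` and the example proportions are made up for illustration), they can be computed directly in R:

```r
# Impurity measures for a single node, given a vector of
# class proportions p_hat (assumed positive, summing to 1)
impurity <- function(p_hat) {
  c(
    error   = 1 - max(p_hat),           # classification error
    gini    = sum(p_hat * (1 - p_hat)), # Gini Index
    entropy = -sum(p_hat * log(p_hat))  # Information Entropy
  )
}

impurity(c(0.5, 0.5)) # maximally impure two-class node: all measures large
impurity(c(0.9, 0.1)) # nearly pure node: all measures small
```

---

# Growing a tree in R

Trees like the ones pictured in these slides are grown with the `rpart` package. A minimal sketch, assuming the `Exam` data frame from the earlier slides and the `rpart.plot` package for plotting:

```r
library(rpart)
library(rpart.plot)

# method = "class" requests a classification (not regression) tree;
# rpart grows it by recursive binary splitting, using the Gini Index
# by default to choose each split
exam_rpart <- rpart(factor(Label) ~ Exam1 + Exam2,
                    data = Exam,
                    method = "class")

rpart.plot(exam_rpart)
```

Because the default split criterion is the Gini Index rather than classification error, the first split may sit near, but not necessarily exactly at, the `Exam1` value of 56.7 found by hand above.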
--- # Example - predicting heart disease `\(Y\)`: presence of heart disease (Yes/No) `\(X\)`: heart and lung function measurements ``` ## [1] "Age" "Sex" "ChestPain" "RestBP" "Chol" "Fbs" ## [7] "RestECG" "MaxHR" "ExAng" "Oldpeak" "Slope" "Ca" ## [13] "Thal" "AHD" ``` <img src="lecture_10b_files/figure-html/rpart-heart-1.png" width="70%" style="display: block; margin: auto;" /> --- # Deeper trees Trees can be built deeper by: - decreasing the value of the complexity parameter `cp`, which sets the minimum improvement in impurity required to continue splitting. - reducing the `minsplit` and `minbucket` parameters, which control the number of observations below which splits are forbidden. <img src="lecture_10b_files/figure-html/deeper-trees-1.png" width="70%" style="display: block; margin: auto;" /> --- # Tabulate true vs predicted to make a .orange[confusion table]. <center> <table> <tr> <td> </td><td> </td> <td colspan="2" align="center" > true </td> </tr> <tr> <td> </td><td> </td> <td align="center" bgcolor="#daf2e9" width="80px"> C1 (positive) </td> <td align="center" bgcolor="#daf2e9" width="80px"> C2 (negative) </td> </tr> <tr height="50px"> <td> pred- </td><td bgcolor="#daf2e9"> C1 </td> <td align="center" bgcolor="#D3D3D3"> <em>a</em> </td> <td align="center" bgcolor="#D3D3D3"> <em>b</em> </td> </tr> <tr height="50px"> <td>icted </td><td bgcolor="#daf2e9"> C2</td> <td align="center" bgcolor="#D3D3D3"> <em>c</em> </td> <td align="center" bgcolor="#D3D3D3"> <em>d</em> </td> </tr> </table> </center> - .orange[Accuracy: *(a+d)/(a+b+c+d)*] - .orange[Error: *(b+c)/(a+b+c+d)*] - Sensitivity: *a/(a+c)* (true positive rate, recall) - Specificity: *d/(b+d)* (true negative rate) - .orange[Balanced accuracy: *(sensitivity+specificity)/2*] --- # Confusion and error ``` ## Reference ## Prediction No Yes ## No 75 5 ## Yes 11 58 ## Accuracy ## 0.8926174 ``` --- # Example - Crabs Physical measurements on WA crabs, males and females. .small[*Data source*: Campbell, N. A. & Mahon, R. J. (1974)] <img src="lecture_10b_files/figure-html/read-crabs-1.png" width="50%" style="display: block; margin: auto;" /> --- # Example - Crabs <img src="lecture_10b_files/figure-html/crabs-plot-1.png" width="80%" style="display: block; margin: auto;" /> --- # Comparing models .pull-left[ Classification tree <img src="lecture_10b_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ Linear classifier <img src="lecture_10b_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Strengths and Weaknesses Strengths: - The decision rules provided by trees are very easy to explain and follow: a simple classification model. - Trees can handle a mix of predictor types, categorical and quantitative. - Trees operate efficiently when there are missing values in the predictors. Weaknesses: - The algorithm is greedy; a better final solution might be obtained by taking a second-best split earlier. - When the separation is in linear combinations of variables, trees struggle to provide a good classification. --- # 👩‍💻 Made by a human with a computer - Slides inspired by [https://iml.numbat.space](https://iml.numbat.space), [https://github.com/numbats/iml](https://github.com/numbats/iml). - Created using [R Markdown](https://rmarkdown.rstudio.com) with flair by [**xaringan**](https://github.com/yihui/xaringan), and [**kunoichi** (female ninja) style](https://github.com/emitanaka/ninja-theme).
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.