Decision Trees

Decision trees can operate directly on non-numerical (categorical) data.

Building a decision tree amounts to deciding the order in which the decisions should be made.

Surprise

Surprise is inversely related to probability:

\[ s(x) = \log \Bigg\lparen\frac{1}{p(x)}\Bigg\rparen \]

Writing surprise this way has a few useful benefits:

  * A certain event (\(p(x) = 1\)) carries zero surprise, since \(\log(1) = 0\).
  * The rarer an event, the greater the surprise, growing without bound as \(p(x) \to 0\).
  * Because logarithms turn products into sums, the surprises of independent events simply add, as shown in the next section.

Calculating / Estimating Surprise

The surprise of a sequence of events is the sum of the surprise for each individual event:

\[ s(a,a,b) = s(a) + s(a) + s(b) \]
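As a quick illustration, here is a minimal Python sketch of both ideas, using base-2 logarithms so surprise is measured in bits (the probabilities below are made up):

```python
import math

def surprise(p: float) -> float:
    """Surprise, in bits, of an event that occurs with probability p."""
    return math.log2(1 / p)

# Made-up probabilities for two outcomes.
p = {"a": 0.75, "b": 0.25}

print(surprise(p["a"]))  # ~0.415 bits: a likely event is not very surprising
print(surprise(p["b"]))  # 2.0 bits: a rarer event is more surprising

# The surprise of the sequence (a, a, b) is the sum of the individual surprises.
print(sum(surprise(p[x]) for x in ["a", "a", "b"]))  # ~2.83 bits
```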

Entropy

Entropy Clearly Explained

The above video explains that entropy is the probability-weighted average (expected) surprise over all possible outcomes:

\[ \text{Entropy} = \sum_{i = 0}^n p(x_i) \log_2 \Bigg\lparen\frac{1}{p(x_i)}\Bigg\rparen \]

\[ \text{Entropy} = -\sum_{i = 0}^n p(x_i) \log_2 \Big(p(x_i)\Big) \]
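A minimal sketch of the calculation in Python, assuming the outcome probabilities are already known:

```python
import math

def entropy(probs: list[float]) -> float:
    """Entropy in bits: the probability-weighted average surprise."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (fair four-sided die)
print(entropy([0.5, 0.25, 0.25]))         # 1.5 bits (one outcome more likely)
print(entropy([1.0]))                     # 0.0 bits (a certain outcome)
```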

Heads & Tails Example

Say we have a random binary variable \(x \in \{\text{Heads},\text{Tails}\}\), and \(\theta\) is the probability \(p(x = \text{Heads})\). The entropy \(H\) can be plotted as follows:

(Figure: entropy of the Heads/Tails variable plotted against \(\theta\))

You can see that entropy falls towards 0 as the probability approaches 0 or 1, and peaks at \(\theta = 0.5\). That is to say, the more certain we are that a particular outcome will happen, the lower the entropy.
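For instance, writing the binary entropy out explicitly and plugging in a few values of \(\theta\):

\[ H(\theta) = -\theta \log_2(\theta) - (1-\theta)\log_2(1-\theta) \]

\[ H(0.5) = 1 \text{ bit}, \qquad H(0.9) \approx 0.47 \text{ bits}, \qquad H(1) = 0 \text{ bits} \]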

Information Gain

When building a decision tree, we want each decision to push the entropy towards 0 as much as possible. Information Gain tells us how useful a given attribute \(x_i\) is for this, and is used to decide where in the tree to place each attribute.

Information Gain \(G(S,F)\) is defined as follows, where \(S\) is the set of examples and \(F\) is a feature whose possible values \(f\) split \(S\) into subsets \(S_f\):

\[ G(S,F) = \text{Entropy}(S) - \sum_{f \in \text{values}(F)} \frac{|S_f|}{|S|} \text{Entropy}(S_f) \]

At every split we pick the feature with the highest Information Gain, so that entropy falls as quickly as possible as we move down the tree.
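A minimal sketch of the gain calculation in Python (the toy feature name and labels below are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels: list[str]) -> float:
    """Entropy (in bits) of the class labels in a set."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(rows: list[dict], labels: list[str], feature: str) -> float:
    """G(S, F): entropy of S minus the weighted entropy of each subset S_f."""
    gain = entropy(labels)
    for value in {row[feature] for row in rows}:
        subset = [label for row, label in zip(rows, labels) if row[feature] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Toy data: does a made-up "windy" feature predict whether we go outside?
rows = [{"windy": "yes"}, {"windy": "yes"}, {"windy": "no"}, {"windy": "no"}]
labels = ["stay in", "stay in", "go out", "go out"]
print(information_gain(rows, labels, "windy"))  # 1.0: the feature separates the classes perfectly
```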

Gini Impurity of a Set
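Gini impurity is an alternative to entropy for measuring how mixed a set is. For a set \(S\) whose elements belong to class \(i\) with proportion \(p_i\), the standard definition is:

\[ \text{Gini}(S) = 1 - \sum_{i} p_i^2 \]

Like entropy, it is 0 for a pure set and largest when the classes are evenly mixed.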

Algorithms

ID3

ID3 computes the Information Gain for every available feature, chooses the one that produces the highest value, and recurses until it outputs a complete decision tree (a minimal code sketch follows the steps below).

  1. Calculate the Information Gain of each remaining feature.
  2. If the rows do not all belong to the same class, split the dataset \(S\) into subsets using the feature for which the Information Gain is maximum.
  3. Make a decision tree node using the feature with the maximum Information Gain.
  4. If all rows in a subset belong to the same class, make the corresponding node a leaf node with that class as its label.
  5. Repeat on each subset with the remaining features until we run out of features, or every branch of the tree ends in a leaf node.
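A minimal, self-contained sketch of these steps in Python. The helpers repeat the earlier Information Gain sketch, the toy dataset is made up, and there is no pruning or handling of numeric features:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the whole set minus the weighted entropy of each subset."""
    gain = entropy(labels)
    for value in {row[feature] for row in rows}:
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def id3(rows, labels, features):
    """Return a nested dict decision tree, with class labels at the leaves."""
    # Step 4: if every row has the same class, this node is a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 5: if no features remain, fall back to the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-3: pick the feature with the maximum Information Gain and split on it.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset_rows = [r for r in rows if r[best] == value]
        subset_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(subset_rows, subset_labels, remaining)
    return tree

# Toy dataset with made-up features.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["go out", "go out", "go out", "stay in"]
# Prints a nested dict mapping feature values to subtrees or class labels.
print(id3(rows, labels, ["outlook", "windy"]))
```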

CART

CART (Classification and Regression Trees) builds binary trees, typically choosing splits that minimise Gini impurity for classification (or variance for regression) rather than maximising Information Gain.

Glossary