Decision trees can operate with non-numerical data.
Building a decision tree amounts to deciding the order in which the decisions should be made.
Surprise
Surprise is inversely related to probability:

$$S(x) = \log_2\!\left(\frac{1}{P(x)}\right)$$

Writing surprise as a logarithm has a few useful benefits:
If $P(x) = 0$, $S(x)$ is undefined, which is okay because it is not useful to define surprise for something that can never happen.
If $P(x) = 1$, $S(x) = 0$, showing that there is no surprise in an event that will always occur.
Calculating / Estimating Surprise
The surprise of a sequence of events is the sum of the surprise for each individual event:

$$S(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \log_2\!\left(\frac{1}{P(x_i)}\right)$$
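As a small sketch of this (log base 2 is assumed, so surprise is measured in bits; the `surprise` helper and the coin probabilities are illustrative):

```python
from math import log2

def surprise(p):
    """S(x) = log2(1 / P(x)); undefined for p == 0, zero for p == 1."""
    return log2(1 / p)

# A coin that lands heads 90% of the time.
print(surprise(0.9))  # ~0.15 bits: heads is barely surprising
print(surprise(0.1))  # ~3.32 bits: tails is very surprising

# The surprise of the sequence heads, heads, tails is the sum of the individual surprises.
print(sum(surprise(p) for p in [0.9, 0.9, 0.1]))  # ~3.63 bits
```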
Entropy
Entropy Clearly Explained
The above video explains:
Entropy is the total surprise of a sequence of events divided by the number of events, i.e. the average (expected) surprise per event.
Because calculating the total surprise multiplies each outcome's surprise by the number of times that outcome occurs (roughly $n \cdot P(x)$ out of $n$ events), and calculating the entropy divides that total by $n$, the $n$ cancels out and entropy becomes the following formula:

$$H(X) = \sum_x P(x) \log_2\!\left(\frac{1}{P(x)}\right)$$

Since $\log_2\!\left(\frac{1}{p}\right) = -\log_2(p)$, this is more commonly written as:

$$H(X) = -\sum_x P(x) \log_2 P(x)$$
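A minimal sketch of this formula in Python, using an illustrative `entropy` helper and a loaded-coin example:

```python
from math import log2

def entropy(probabilities):
    """H = sum(p * log2(1 / p)) = -sum(p * log2(p)) over the outcome probabilities."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

# A loaded coin: heads 90% of the time, tails 10% of the time.
print(entropy([0.9, 0.1]))  # ~0.47 bits: little surprise on average
print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally unpredictable
```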
Heads & Tails Example
Say we have a random binary variable $X$ (heads or tails), and $p$ is the probability of heads. The entropy can be plotted as follows:
Entropy of Heads/Tails
You can see that entropy decreases as the probability approaches 0 or 1, and peaks at $p = 0.5$. That is to say, the more certain we are that a particular outcome will happen, the lower the entropy.
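A sketch of how this curve could be reproduced, assuming NumPy and matplotlib are available:

```python
# Binary entropy curve: H(p) = -p*log2(p) - (1-p)*log2(1-p).
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, H)
plt.xlabel("p = P(heads)")
plt.ylabel("Entropy (bits)")
plt.title("Entropy of Heads/Tails")
plt.show()
```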
Information Gain
When building a decision tree, we want the next decision to push the entropy as far towards 0 as possible. The Information Gain tells us how important a given attribute is, and is used to decide the order in which to place attributes.
Information Gain is defined as follows, where $S$ is the set of examples, $A$ is a feature, and $S_v$ is the subset of $S$ for which feature $A$ takes the value $v$:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
Information Gain should be as high as possible at each split in the decision tree.
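A minimal sketch of this calculation over a labelled dataset; the helper names and toy data are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum(p * log2(p)) over the class proportions of `labels`."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """IG(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the values v of the feature."""
    total = len(labels)
    # Group the labels by the value the feature takes in each row.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(subset) / total * entropy(subset) for subset in subsets.values())
    return entropy(labels) - weighted

# Toy example: feature 0 = "windy"; label = what we do.
rows = [("yes",), ("yes",), ("no",), ("no",)]
labels = ["stay in", "stay in", "go out", "go out"]
print(information_gain(rows, labels, 0))  # 1.0 bit: the feature separates the classes perfectly
```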
Gini Impurity of a Set
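For a set $S$ whose elements fall into classes with proportions $p_i$, the Gini impurity is commonly defined as the probability of misclassifying a randomly chosen element if it were labelled at random according to those proportions:

$$G(S) = \sum_i p_i (1 - p_i) = 1 - \sum_i p_i^2$$

A pure set has $G(S) = 0$; a 50/50 binary split has $G(S) = 0.5$.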
Algorithms
ID3
ID3 computes the Information Gain for every possible feature in the set, chooses the one that produces the highest value, and repeats this recursively, outputting a complete decision tree (a sketch follows the steps below).
1. Calculate the Information Gain of each feature.
2. If not all rows belong to the same class, split the dataset $S$ into subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information Gain.
4. If all rows belong to the same class, make the current node a leaf node with the class as its label.
5. Repeat for the remaining features until we run out of features, or the decision tree consists entirely of leaf nodes.
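A compact Python sketch of these steps, assuming categorical features and some simplifications (ties broken arbitrarily, majority label used when features run out); the helper names and toy data are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets.values())

def id3(rows, labels, features):
    # Leaf: all rows belong to the same class, or no features are left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Pick the feature with the maximum Information Gain and split on it.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    node = {"feature": best, "branches": {}}
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        remaining = [f for f in features if f != best]
        node["branches"][value] = id3(list(sub_rows), list(sub_labels), remaining)
    return node

# Tiny toy dataset: feature 0 = outlook, feature 1 = windy; label = play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["yes", "yes", "yes", "no"]
print(id3(rows, labels, features=[0, 1]))
```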
CART
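CART (Classification and Regression Trees) grows binary trees and typically uses Gini impurity rather than entropy for classification splits. As an illustrative usage sketch, scikit-learn's DecisionTreeClassifier (which is based on CART) could be used like this:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a CART-style tree; Gini impurity is the default split criterion.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned decision rules.
print(export_text(tree, feature_names=load_iris().feature_names))
```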
Glossary
Information: The amount of surprise, $\log_2\!\left(\frac{1}{p}\right)$, in an outcome with a given probability $p$.
Entropy: The expected amount of surprise across the outcomes of a random variable, i.e. the average surprise per event.