HOME

Decision tree learning is one of the most commonly known machine learning algorithms out there. One of the advantages of decision trees are that there are quite staright forward, easily understandable by humans. Decision trees provide a way to approximate discrete valued functions and are robust to noisy data. Decision trees can be represented using the typical Tree Data Structure.

Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine learning.

In decision tree learning, a decision tree - now known by the umbrella term CART (Classification and Regression Tree) - can be used to visually and explicitly represent decisions and decision making. Though, it is common to use a tree-like model for decisions, learned trees can also be represented as sets of `if-else-then`

rules.

Though decision trees can be utilized for both classification and regression, it’s primarily used for classification.

Decision trees perform classification after sorting the instances in a top-down approach - from the root to the leaf. Each non-leaf node *splits* the set of instances based on a test of an attribute. Each branch emanting from a node corresponds to one of the possible values of the said attribute in the node. The leaves of the decision tree specifies the label or the class in which a given instance belongs to.

Here’s an example of a classification tree (Titanic Dataset):

Image courtesy: Wikipedia

The above model uses three attributes namely : *Gender, age and number of spouses/children*. As can be seen from the example, the internal nodes have an attribute test associated with them. This test splits the data set based on the value of the said attribute of the incoming instance. The branches correspond to the values of the attribute in question. At the end, the leaf node represent the class of the instance - in this case the fate of the titanic passengers.

Decision Trees represent a disjunction of conjunctions of constraints on attributes values of instances.

That is, Decision Trees represent a bunch of `AND`

‘statements’ chained by `OR`

statements. For example, let’s look at the titanic example above. The given tree can be represented by a disjunction of conjuctions as: ( female ) OR ( male AND less than 9.5 years of age AND more than 2.5 siblings)

- When it is imperative for the humans to understand and communicate the model.
- When you’d like to make minimalistic assumptions from the dataset.
- When you don’t want to normalize the data.
- When the dataset contains ample amount of noise (but not too much).
- Presence of Skewed variables in the dataset.
- When there are many missing attribute values in the dataset.
- When
*disjunctive*descriptions are required - When you need to build and test fast
- When the dataset is small in size

Before we start classifying, we first need to build the tree from the available dataset.

Most algorithms that have been developed for learning decision trees are variations of the core algorithm that employs a

top down,greedysearch through the possible space of decision trees.

In this article, I will be focussing on the Iterative Dichotomiser 3, commonly know as the ID3 algorithm. Variants/Extensions of the ID3 algorithm, such as C4.5, are very much in practical use today.

The ID3 algorithm builds the tree top-down, starting from the root by meticulously choosing which attribute that will be tested at each given node. Each attribute is evaluated through statistical means as to see **which attribute splits the dataset the best.** The best attribute is made the root, with it’s attribute values branching out. The process continues with the rest of the attributes. Once an attribute is selected, it is not possible to *backtrack*.

Entropy is a measure of

unpredictabilityof the state, or equivalently, of itsaverage information content.

Entropy is a statistical metric that measures the **impurity.** Given a collection of S, which contains two classes: *Positive* and *Negative*, of some arbitrary target concept, entropy with respect to this boolean classification is:

**Entropy(S) = E(S) = -p _{positive}log_{2} p _{positive} - p_{negative}log_{2} p _{negative}**

Where p_{positive} is the proportion (probability) of positive examples in S and p_{negative} is the proportion of negative examples in S. Entropy is 1 if the collection S contains equal number of examples from both classes, Entropy is 0 if all the examples in S contain the same example.

The entropy values vs the probabilities for a collection S follows a parabolic curve:

Image courtesy: MATLAB data science

One interpretation of entropy is that, entropy specifies the minimum number of bits required to **encode** the classification of any member of a collection S.

In general terms, when the classes of the target function may not always be boolean, entropy is defined as

Now that we know what entropy is, let’s look at an attribute that is more attached to the building of the decision tree - *Information Gain*. Information gain is a metric that measures the expected reduction in the impurity of the collection S, caused by splitting the data according to any given attribute.

The information gain IG(S,A) of an attribute A, from the collection S, can be defined as

Image courtesy: Abhyast

where `i`

spans through the entire set of all possible values for attribute A, and S _{i} is the portion of S for which attribute A has the value *i*. The first term is the entropy of the entire collection S. One way to think about IG is that, the value of IG is the number of bits saved when encoding a target value of an arbitrary member of the collection.

Whilst building the decision tree, the information gain metric is used by the ID3 algorithm to select the best attribute - the attribute the provides the “best split” - at each level.

Let’s take the example of a dataset. This dataset assesses the risk of tumour in a patient. We will be generating a decision tree using the ID3 algorithm.

HEADACHE | DIZZYNESS | BLOOD PRESSURE | RISK |

YES | NO | HIGH | YES |

YES | YES | HIGH | YES |

NO | NO | NORMAL | NO |

YES | YES | NORMAL | YES |

YES | NO | NORMAL | NO |

NO | YES | NORMAL | YES |

First, let’s find the entropy of the entire collection:

**E(S) = -p _{yes} log_{2} p_{yes} - p_{no} log_{2} p_{no}**

From the dataset: *p _{yes}* = 4/6 and

So, E(S) = - { (4/6) log_{2}(4/6) } - { (2/6) log_{2}(2/6) }

This gives us **E(S) = 0.9182**

Now, the information gain, let’s consider the attribute `HEADACHE`

. This attribute has two values `YES`

and `NO`

. Now, the proportion of `YES`

in the attribute: 4/6 and the proportion of `NO`

in the attribute: 2/6

Hence, the split:

S_{YES} - [3+, 1-] ( 3 positive and 1 negative classification when HEADACHE has the value YES)

S_{NO} - [0+, 2-] ( 2 negative classifications when HEADACHE has the value NO)

Therefore,

IG(S, HEADACHE) = E(S) - (4/6) * E(S_{YES}) - (2/6) * E(S_{NO}).

After Calculation: **IG(S, HEADACHE) = 0.37734**.

Similarly, **IG(S, DIZZYNESS) = 0.4590** and **IG(S, BP) = 0.5848**

After we have calculated the information for these attributes, we choose the attribute with the highest information gain as the splitting attribute for the node. This process goes on top-down until we are left we just leaves - the classification. From above, it is clear that the attribute `BLOOD PRESSURE`

will be our attribute of choice at the root node.

Splitting at the root node using `BLOOD PRESSURE`

gives us the following tree:

If the value of `BLOOD PRESSURE`

is `YES`

then we directly arrive at the class being `YES`

. If the value is no, we need to split the dataset again and the child node. As said, this is a top-down approach.

Now, let’s move to choosing the next splitting attribute. The process is essentially the same, except we now consider the rows where the value of the `BLOOD PRESSURE`

attribute is `NO`

.

*p _{yes}* = 2/4 and

So, E(S) = - { (2/4) log_{2}(2/4) } - { (2/4) log_{2}(2/4) }

This gives us **E(S) = 1.0**

After calculations similar as above, we find **IG(HEADACHE) = 0** and **IG(DIZZYNESS) = 1**. We now choose the `DIZZYNESS`

attribute as the splitting attribute. This gives us the final decision tree, using which we can predict the classification. Notice that we don’t have to use the `HEADACHE`

attribute since, from the given data, we can make predictions based upon the other two attributes.

The ID3 algorithm usually prefers shorter, wider trees over the taller ones. The trees which has a high information gain are placed closer to the root.

Still have questions? Find me on Codementor

As the data that is fed becomes larger, the decision tree tends to become longer. In such cases, noise and corrupt/incorrect data can have a detrimental impact on the decision tree. This results in the decision tree **overfitting** the dataset, that is decision tree performs satisfactory for the training data, but fails to produce an appropriate approximation of the target concept when it encounters actual data. Overfitting can also occur when insufficent data is provided to build the decision tree (like perhaps, our previous with only 6 rows.)

In order overcome the overfitting scenario, one of the following two things must be done. Either the decision tree should stop growing before it overfits the data or an overfitting tree should be pruned to reduce the error.

That’s it for now, if you have any comments, please leave then below.