Classification using decision tree

Prologue: I have been working and practicing various skills and algorithms as a progress to show on my road-map to become as a matured data scientist. As a part of this expedition I have decided to document all those stuffs I am going through. So whatever you read under this column will be either be summary of my understanding or a post explaining the details of my experiment.

Objective of this writing is to show and explain my understating over Classification technique – The Decision tree.


Classification or categorization is problem in library science, information science and computer science Where the task is to assign a label/class/category to a particular document/record/data-set.

Solving classification problem will help to understand and process the data better for further business/research analysis.

Today there are many techniques and algorithms used to solve the classification problems and Decision tree is one among those algorithms that is simple and widely used approach among the classification practitioners.

To put simply Decision tree is an ordinary tree that has node, edged and leaf. Where each node correspond to a place where decisions are made. Each edged are the tree traversal path to reach another node or leaf. Each leaf correspond to the label or class or category of the classification problem itself. 


Above picture is a simplified representation of the decision tree for a classification problem to solve(classify) the given input as Even number E or Odd number O or Not a number N.

As you see, at first node a decision is made whether the given input is a number and based on the decision input is either assigned to a label N(Not a number) or moved to another node for further decision. At second leave at node two, again a decision is made whether the given input(a number) is divisible by 2. If yes input is assigned to label E (event number) or input is assignment to label O(odd number).

When solving real time problems decisions are made on the attribute or property of the document/recode/dataset. Here the attribute or property values can be either continuous or discrete value. But class/label/category values need to be discrete value only. Technically the class/label/category is again an attribute in the dataset which is called special attribute by convention. 

It is also important to mention that Decision trees have hight variance to its predictions, meaning it may get you a results that are slightly insignificant sometime. Due to this reason practitioners have developed a powerful technique called Random forest keeping decision tree has base to solve much more complex classification problems.

In terms of machine learning terminology the classification problems comes under the umbrella of supervised leaning chart. This is because the learning algorithms we will build is build upon the user defined training set.

Building Decision tree.

The prime step in this classification problem is building the decision tree. The decision tree is built in a top-down fashion, but the question is how do you choose which attribute or property to split at each node?

This is where entropy and Information gain comes into picture. And the answer is to find the attribute or property that has maximum information again. In other words find the attribute that has less impurity.

Going further, attribute that has highest information gain has to be considered as first decision making criteria. And then subsequence attributes that has hight Information gain has to be consider for building the rest of the decision tree.

Computing Entropy and Information again.

Information again is kind of measure of purity of information. It represents the expected amount of information that would be needed to specify whether a new instance should be classified as so and so are not.

Entropy is a measure of impurity (the opposite). It is defined as (for a binary class with values a/b).

Entropy = - p(a)*log(p(a)) – p(b)*log(p(b))
Thus information gain will be 

Information gain = Entropy before splitting – Entropy after splitting.

Formula representation for information gain.

= ∃(parent) – kj=1(N(Vj)/N)∃ (Vj)


is information gain.

is entropy.

N is total number of records at the parent node.

K is the number of attribute values.

N(Vj) is number of records associated with the child node Vj

At each node of the tree, this calculation is done for each attribute. And the attribute with the largest information gain is chosen for the split. This process continues iteratively until the end.

Post a Comment

Your email is never published nor shared. Required fields are marked *