Complete Overview of Decision Tree

Kranthi kumar
Published in Analytics Vidhya
4 min read · May 12, 2020


The decision tree algorithm is a type of supervised learning used for classification problems. Classification is the process of dividing a data set into different categories or groups by adding labels (e.g., spam/not spam).

Decision Tree

A decision tree is a graphical representation of all the possible solutions to a decision, based on certain conditions. It works for both categorical and continuous dependent variables. In this algorithm, we split the total sample data into two or more homogeneous sets.

[Image: an example decision tree]

Important terminology

  1. Decision node: also known as the root node; the starting point of the algorithm, holding the entire data set
  2. Leaf node: a terminal node; from here, the data cannot be split any further
  3. Splitting: dividing a node into two or more sub-nodes based on a condition
  4. Branch / sub-tree: a section of the tree formed after splitting
  5. Pruning: removing unwanted branches, which reduces the complexity of the tree
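
To make these terms concrete, here is a minimal sketch using scikit-learn; the Iris data set and the max_depth value are illustrative choices of mine, not part of the article.

```python
# Minimal sketch: fit a small tree and print its structure.
# Assumes scikit-learn is installed; Iris and max_depth=2 are
# illustrative choices, not part of the article's example.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The first split printed is the decision (root) node; indented
# "class: ..." lines are leaf nodes; each indent level is a branch.
print(export_text(tree, feature_names=iris.feature_names))
```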

CART Algorithm

All is well so far, but you may be wondering: how should we pick the best attribute, and where do we start?

This algorithm helps you answer those questions. Don't worry if you get stuck at any point; just read till the end :)

CART: the Classification And Regression Tree algorithm

How do we decide where to split?

For this, we need to know the following:

  1. Gini index: a measure of impurity (degree of randomness) used in building decision trees (a hand-rolled computation is sketched after this list)
  2. Information gain: the decrease in entropy after a data set is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain
  3. Reduction in variance: used for continuous target variables (regression problems). The split with the lower variance is selected as the criterion to split the population (sample)
  4. Chi-square: finds the statistical significance of the differences between sub-nodes and the parent node
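
As a rough illustration of the Gini index from item 1, here is a hand-rolled computation; the gini() helper below is my own name, not a library function.

```python
# Hedged sketch: the Gini index of a node, computed by hand.
# The gini() helper is an illustrative name, not a library function.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities.
    0.0 means a pure node; higher values mean more mixing."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes"]))       # 0.0  (pure node)
print(gini(["yes", "no", "yes", "no"]))  # 0.5  (maximum for two classes)
```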

Entropy

This is a metric that measures the impurity (degree of randomness) of a data set. Computing it is the first step in the algorithm.

Entropy(S) = -P(yes) * log2(P(yes)) - P(no) * log2(P(no))
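
A minimal Python sketch of this formula; the entropy() helper name is mine, and it generalizes the two-class formula above to any number of classes.

```python
# Hedged sketch of the formula above; entropy() is my own helper name.
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))  # 1.0 (maximum impurity, two classes)
print(entropy(["yes", "yes", "yes"]))       # 0.0 (pure node)
```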

Information Gain

It measures the reduction in entropy and decides which attribute should be selected as the decision node.

Formula for Information Gain

IG(S) = Entropy(S) - [(weighted average) * (Entropy of each subset produced by the split)]

Apply this formula for all the attributes and select the attribute with the maximum gain.
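
Here is a hedged sketch of that procedure for a single attribute; the (attribute value, label) data layout and the play/no-play values are illustrative, and the helper names are mine.

```python
# Hedged sketch: information gain of splitting on one attribute,
# using an illustrative (attribute value, label) layout.
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(pairs):
    """Entropy of all labels minus the weighted average entropy
    of the subsets produced by splitting on the attribute value."""
    labels = [label for _, label in pairs]
    groups = defaultdict(list)
    for value, label in pairs:
        groups[value].append(label)
    weighted = sum(len(g) / len(pairs) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy 'outlook' attribute against play/no-play labels.
outlook = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
           ("rain", "yes"), ("rain", "no"), ("overcast", "yes")]
print(information_gain(outlook))  # ~0.667
```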

[Image: a decision tree splitting the population into four groups to predict whether they will play]

In the image above, you can see that the population is classified into four different groups based on multiple attributes, to identify ‘if they will play or not’. To split the population into different homogeneous groups, it uses various techniques like the Gini index, information gain, chi-square, and entropy.

Some useful features and advantages of CART

  • CART is non-parametric and therefore does not rely on data belonging to a particular type of distribution.
  • CART is not significantly impacted by outliers in the input variables.
  • You can relax stopping rules to “overgrow” decision trees and then prune the tree back to its optimal size (a pruning sketch follows this list). This approach minimizes the probability that important structure in the data set will be overlooked by stopping too soon.
  • CART incorporates both testing with a test data set and cross-validation to assess the goodness of fit more accurately.
  • CART can use the same variables more than once in different parts of the tree. This capability can uncover complex interdependencies between sets of variables.
  • CART can be used in conjunction with other prediction methods to select the input set of variables.
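
As a sketch of the “overgrow, then prune back” idea from the list above, scikit-learn exposes cost-complexity pruning via the ccp_alpha parameter. The data set and the single held-out split are simplifications of mine, not the article's method.

```python
# Hedged sketch of "overgrow, then prune back" with scikit-learn's
# cost-complexity pruning. The data set and single train/test split
# are simplifications; cross-validation would be more thorough.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Overgrow: no depth limit, so the tree fits the training data closely.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Refit once per candidate alpha and keep the best pruned tree.
# max(a, 0.0) guards against tiny negative alphas from float error.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=max(a, 0.0))
         .fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print("leaves:", best.get_n_leaves(), "test accuracy:", best.score(X_test, y_test))
```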

Pros of Decision Trees

  • Easy to understand
  • Easy to generate rules
  • There are almost no hyper-parameters to be tuned
  • Complex decision tree models can be significantly simplified by visualizing them

Cons of Decision Trees

  • Might suffer from over-fitting
  • Does not easily work with non-numerical data
  • When there are many class labels, calculations can be complex

Some Key suggestions-

  • We can use any classification algorithm in place of the decision tree algorithm, depending on the data set. You can apply different algorithms and select whichever gives the best accuracy (see the sketch after this list)
  • If the relationship between the dependent and independent variables is well approximated by a linear model, then a linear model will outperform a tree-based model
  • If there is high non-linearity and a complex relationship between the dependent and independent variables, then a tree model will outperform a classical regression model
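
One hedged way to act on these suggestions is to cross-validate a linear model against a tree on the same data and keep whichever scores better; the diabetes data set here is just an example of mine.

```python
# Hedged sketch: cross-validate a linear model against a tree and
# keep whichever scores better. The diabetes data set is illustrative.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
candidates = {
    "linear model": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}
for name, model in candidates.items():
    # Default scoring for regressors is R^2; higher is better.
    print(f"{name}: mean R^2 = {cross_val_score(model, X, y, cv=5).mean():.3f}")
```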

This scikit-learn algorithm cheat sheet may help when implementing these models in Python.

Further reading materials

  • https://alex.smola.org/drafts/thebook.pdf
  • http://cs229.stanford.edu/notes/cs229-notes-dt.pdf
  • Decision Tree analysis with example: https://www.youtube.com/watch?v=wr9gUr-eWdA&list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU&index=10
  • Handbook of Statistical Analysis and Data Mining Applications
