
Decision Tree Algorithm#

Decision trees can be applied to both regression and classification problems. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
[2]:
dummy_train_data = np.array([
    ['Green', 3, 'Apple'],
    ['Red', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
],dtype='O')
headers = ['color','diameter','fruit']

CART (Classification and Regression Trees)#

The representation for the CART model is a binary tree.

This is the familiar binary tree from algorithms and data structures, nothing too fancy. Each internal (decision) node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).

The leaf nodes of the tree contain an output variable (y) which is used to make a prediction.

Given the small fruit dataset above, with two inputs (x) of color and diameter and the fruit name as the output, below is a crude example of a binary decision tree (for demonstration purposes only).

                         Is diameter >= 3 ?
                                  |
                     _____________|_____________
                True|                           |False
                    |                           |
           Is color == Yellow ?               Grape
                  |
     _____________|_____________
True|                           |False
    |                           |
   Lemon                      Apple

Question (partition concept of the dataset)#

  • acts like a threshold

  • a Question is the rule used to split a node.

  • It represents a True or False question.

  • match checks whether a row goes to the true branch or the false branch.

[3]:

class Question:
    def __init__(self, column_index, value, header):
        self.column_index = column_index
        self.value = value
        self.header = header

    def match(self, example):
        if isinstance(example, list):
            example = np.array(example, dtype="O")
        val = example[self.column_index]
        # adding numpy int and float data types as well
        if isinstance(val, (int, float, np.int64, np.float64)):
            # a condition for question to return True or False for numeric value
            return float(val) >= float(self.value)
        else:
            return str(val) == str(self.value)  # categorical data comparison

    def __repr__(self):
        condition = "=="
        if isinstance(self.value, (int, float, np.int64, np.float64)):
            condition = ">="
        return f"Is {self.header} {condition} {self.value} ?"
[4]:
q1 = Question(column_index=0,value='apple',header='fruit')
q1
[4]:
Is fruit == apple ?
[5]:
q1.match(['apple',10])
[5]:
True
[6]:
q2 = Question(column_index=1,value=30,header='weight')
q2
[6]:
Is weight >= 30 ?
[7]:
q2.match(['apple',10])
[7]:
False

Partition rows#

  • partition rows into a true side and a false side using the Question we have already defined.

  • here fruit is the target; questions split the rows and hence the values of the target

  • the first question here is ‘Is color == Red ?’

              ['color','diameter','fruit']
    
               [['Green', 3, 'Apple'],
               ['Red', 3, 'Apple'],
               ['Red', 1, 'Grape'],
               ['Red', 1, 'Grape'],
               ['Yellow', 3, 'Lemon']]
    
                  Is color == Red ?
                           |
              _____________|_____________
         True|                           |False
             |                           |
    [['Red', 3, 'Apple'],          [['Green', 3, 'Apple'],
    ['Red', 1, 'Grape'],           ['Yellow', 3, 'Lemon']]
    ['Red', 1, 'Grape']]
    
[8]:
def partition(rows, question):
    true_idx, false_idx = [], []
    for idx, row in enumerate(rows):
        if question.match(row):
            true_idx.append(idx)
        else:
            false_idx.append(idx)
    return true_idx, false_idx
[9]:
dummy_train_data
[9]:
array([['Green', 3, 'Apple'],
       ['Red', 3, 'Apple'],
       ['Red', 1, 'Grape'],
       ['Red', 1, 'Grape'],
       ['Yellow', 3, 'Lemon']], dtype=object)
[10]:
Question(column_index=0,value='Red',header=headers[0])
[10]:
Is color == Red ?
[11]:
left, right = partition(dummy_train_data,Question(column_index=0,value='Red',header=headers[0]))
[12]:
dummy_train_data[left,:]
[12]:
array([['Red', 3, 'Apple'],
       ['Red', 1, 'Grape'],
       ['Red', 1, 'Grape']], dtype=object)
[13]:
dummy_train_data[right,:]
[13]:
array([['Green', 3, 'Apple'],
       ['Yellow', 3, 'Lemon']], dtype=object)

Tree Node - Leaf node, Decision Node#

                 Decision Node ->     Is diameter >= 3 ?

                                              |
                                 _____________|_____________
                            True|                           |False
                                |                           |

        Decision Node -> Is color == Yellow ?               Grape  <- Leaf Node

                              |
                 _____________|_____________
            True|                           |False
                |                           |

Leaf Node ->  Lemon                      Apple
  • Decision Node will have

    • a question

    • a true branch

    • a false branch

    • uncertainty value

  • Leaf Node will have

    • leaf value

[14]:
class Node:
    def __init__(self, question=None, true_branch=None, false_branch=None, uncertainty=None, *, leaf_value=None):
        self.question = question
        self.true_branch = true_branch
        self.false_branch = false_branch
        self.uncertainty = uncertainty
        self.leaf_value = leaf_value

    @property
    def _is_leaf_node(self):
        return self.leaf_value is not None
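
To make the two roles concrete, here is a small illustrative example (the values are made up for demonstration, not taken from a trained tree):

    leaf = Node(leaf_value={'Apple': 1})                    # terminal node: stores a prediction
    decision = Node(question=Question(1, 3, 'diameter'),    # internal node: stores a question,
                    true_branch=leaf, false_branch=leaf,    # its two branches
                    uncertainty=0.64)                       # and the node's impurity
    leaf._is_leaf_node, decision._is_leaf_node              # (True, False)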

Build Tree#

  • start recursion

    • find the best split for the data (actually the best question)

    • if gain == 0 then the node is already a leaf node

    • else partition the rows based on the question’s answer into true_rows and false_rows

    • build the tree recursively from true_rows and false_rows and return a Decision node with the question, the true branch and the false branch (a sketch of this recursion follows below)
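
A minimal sketch of this recursion, assuming the Node class above and the partition / find_best_split helpers developed later in this notebook (build_tree and its leaf handling are illustrative only, not mightypy's exact implementation):

    def build_tree(X, y):
        """Recursively grow the tree (illustrative sketch only)."""
        gain, question, uncertainty = find_best_split(X, y)

        # no question splits the data, or no split improves purity -> leaf node
        if question is None or gain <= 0:
            return Node(leaf_value=y)

        t_idx, f_idx = partition(X, question)

        # grow the true and false branches from the partitioned rows
        true_branch = build_tree(X[t_idx, :], y[t_idx, :])
        false_branch = build_tree(X[f_idx, :], y[f_idx, :])

        return Node(question=question, true_branch=true_branch,
                    false_branch=false_branch, uncertainty=uncertainty)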

Classification Tree#

Gini Impurity#

Used by the CART (classification and regression tree) algorithm for classification trees, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. The Gini impurity can be computed by summing the probability \(p_{i}\) of an item with label i being chosen times the probability

\(\sum_{k \neq i}{p_k} = 1 - {p_i}\)

of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.

\begin{align*} {I_G(p)} =& {\sum_{i=1}^{J}{(p_i\sum_{k \neq i}p_k)}}\\ =& {\sum_{i=1}^{J}p_i(1 - p_i)}\\ =& {\sum_{i=1}^{J}(p_i - p_i^2)}\\ =& {\sum_{i=1}^{J}p_i - \sum_{i=1}^{J}p_i^2}\\ {I_G(p)} =& {1 - \sum_{i=1}^{J}p_i^2} \end{align*}
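
As a quick sanity check with the toy fruit data used below (two apples, two grapes, one lemon): \(I_G = 1 - (0.4^2 + 0.4^2 + 0.2^2) = 1 - 0.36 = 0.64\), which matches the parent uncertainty computed for dummy_train_data later in this notebook.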

[20]:
def gini_impurity(arr):
    classes, counts = np.unique(arr, return_counts=True)
    gini_score = 1 - np.square(counts / arr.shape[0]).sum(axis=0)
    return gini_score
[21]:
print(gini_impurity(np.array([0,0,0,0,0,0,0,1,1,1,1,1,1])))
0.4970414201183432
[22]:
print(gini_impurity(np.array(['apple','apple','apple','apple'])))
0.0
[23]:
print(gini_impurity(np.array(['apple','apple','apple','orange'])))
0.375
[24]:
print(gini_impurity(np.array(['apple','apple','apple','orange','orange','orange'])))
0.5

Entropy#

  • an information theory concept used to measure the uncertainty in the data

\(H(T) = I_E(p) = - \sum_{i=1}^{J} p_i \log_2 p_i\)
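
For example, with three apples and three oranges each class has \(p_i = 0.5\), so \(H = -(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1\), which matches the last output below.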

[25]:
def entropy(arr):
    classes, counts = np.unique(arr, return_counts=True)
    p = counts / arr.shape[0]
    entropy_score = (-p * np.log2(p)).sum(axis=0)
    return entropy_score
[26]:
print(entropy(np.array([0,0,0,0,0,0,0,1,1,1,1,1,1])))
0.9957274520849255
[27]:
print(entropy(np.array(['apple','apple','apple','apple'])))
0.0
[28]:
print(entropy(np.array(['apple','apple','apple','orange'])))
0.8112781244591328
[29]:
print(entropy(np.array(['apple','apple','apple','orange','orange','orange'])))
1.0

Information Gain#

Information gain is the reduction in entropy (or, more generally, impurity) obtained by transforming a dataset, and is often used in training decision trees. It is calculated by comparing the uncertainty of the dataset before the split with the weighted average uncertainty of the partitions after the split.

Information Gain = H(Before splitting) - H(After splitting)

[30]:
def info_gain(left, right, parent_uncertainty):

    # weight of the true (left) partition
    pr = left.shape[0] / (left.shape[0] + right.shape[0])

    # weighted average uncertainty of the two children after the split
    child_uncertainty = pr * \
        gini_impurity(left) + (1 - pr) * gini_impurity(right)

    # information gain = parent uncertainty - weighted child uncertainty
    info_gain_value = parent_uncertainty - child_uncertainty
    return info_gain_value
[31]:
headers,dummy_train_data
[31]:
(['color', 'diameter', 'fruit'],
 array([['Green', 3, 'Apple'],
        ['Red', 3, 'Apple'],
        ['Red', 1, 'Grape'],
        ['Red', 1, 'Grape'],
        ['Yellow', 3, 'Lemon']], dtype=object))
[32]:
curr_uncert = gini_impurity(dummy_train_data[:,-1])
curr_uncert
[32]:
0.6399999999999999
[33]:
ti , fi = partition(dummy_train_data,Question(0,'Green',headers[0]))
[34]:
ts = dummy_train_data[ti,:]
ts
[34]:
array([['Green', 3, 'Apple']], dtype=object)
[35]:
fs = dummy_train_data[fi,:]
fs
[35]:
array([['Red', 3, 'Apple'],
       ['Red', 1, 'Grape'],
       ['Red', 1, 'Grape'],
       ['Yellow', 3, 'Lemon']], dtype=object)
[36]:
info_gain(ts[:,-1],fs[:,-1],curr_uncert)
[36]:
0.1399999999999999

Find best split in all options#

  • find the best split based on the maximum information gain

  • for every candidate question, the values of X split the rows of y into a true branch and a false branch

  • the question that gives the maximum information gain (impurity before the split minus weighted impurity after the split) is taken as the best split

  • some artifacts such as the best question, the best gain and the parent uncertainty are returned

[37]:
def find_best_split(X, y):
    max_gain = -1
    best_split_question = None

    parent_uncertainty = gini_impurity(y)

    n_samples, n_features = X.shape

    for col_index in range(n_features):  # iterate over feature columns
        # get unique values from the feature
        unique_values = np.unique(X[:, col_index])
        for val in unique_values:  # check for every value and find maximum info gain

            ques = Question(
                column_index=col_index,
                value=val,
                header=headers[col_index]
            )

            t_idx, f_idx = partition(X, ques)
            # if it does not split the data
            # skip it
            if len(t_idx) == 0 or len(f_idx) == 0:
                continue

            true_y = y[t_idx, :]
            false_y = y[f_idx, :]

            gain = info_gain(true_y, false_y, parent_uncertainty)
            if gain > max_gain:
                max_gain, best_split_question = gain, ques

    return max_gain, best_split_question, parent_uncertainty
[38]:
find_best_split(dummy_train_data[:,:-1],dummy_train_data[:,[-1]])
[38]:
(0.37333333333333324, Is diameter >= 3 ?, 0.6399999999999999)

Classification/ Prediction#

  • traverse the tree, using each node’s question to route an example to the true or false branch until a leaf is reached; the leaf value is the classification (a sketch of this traversal is shown below)
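
A minimal sketch of that traversal, reusing the Node and Question classes from above (classify here is illustrative; mightypy's DecisionTreeClassifier, used next, wraps this idea with training, depth limits and probability estimates):

    def classify(row, node):
        """Route one example down the tree until a leaf is reached (illustrative sketch)."""
        if node._is_leaf_node:
            return node.leaf_value      # the stored prediction for this leaf
        if node.question.match(row):    # follow the branch chosen by the question
            return classify(row, node.true_branch)
        return classify(row, node.false_branch)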

Creating a Classification Model#

https://mightypy.readthedocs.io/en/main/api/mightypy.ml.html

[40]:
from mightypy.ml import DecisionTreeClassifier

Testing with Data#

[63]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

dataset = load_iris()

X = dataset.data
y = dataset.target
feature_name = dataset.feature_names
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42,test_size=0.05)
[64]:
dt = DecisionTreeClassifier(max_depth=10, min_samples_split=10)
dt.train(X=X_train,y=y_train,feature_name=feature_name)
dt.print_tree()
|-Is sepal length (cm) >= 7.9 ? | gini :0.6665344177742512
|---> True:
  |--  Predict : {2: 1}
|---> False:
  |--Is sepal width (cm) >= 4.4 ? | gini :0.6665660681052261
  |----> True:
    |---  Predict : {0: 1}
  |----> False:
    |---Is sepal width (cm) >= 4.2 ? | gini :0.6666326530612245
    |-----> True:
      |----  Predict : {0: 1}
    |-----> False:
      |----Is sepal width (cm) >= 4.1 ? | gini :0.6666321618963822
      |------> True:
        |-----  Predict : {0: 1}
      |------> False:
        |-----Is sepal width (cm) >= 4.0 ? | gini :0.6665616467128754
        |-------> True:
          |------  Predict : {0: 1}
        |-------> False:
          |------Is petal length (cm) >= 6.7 ? | gini :0.6664180297298737
          |--------> True:
            |-------  Predict : {2: 2}
          |--------> False:
            |-------Is sepal length (cm) >= 7.7 ? | gini :0.666556927297668
            |---------> True:
              |--------  Predict : {2: 1}
            |---------> False:
              |--------Is sepal length (cm) >= 7.6 ? | gini :0.6665181554912007
              |----------> True:
                |---------  Predict : {2: 1}
              |----------> False:
                |---------Is sepal length (cm) >= 7.4 ? | gini :0.666402849228334
                |-----------> True:
                  |----------  Predict : {2: 1}
                |-----------> False:
                  |----------Is sepal length (cm) >= 7.3 ? | gini :0.6662075298438934
                  |------------> True:
                    |-----------  Predict : {2: 1}
                  |------------> False:
                    |-----------Is petal length (cm) >= 6.1 ? | gini :0.6659285589417866
                    |-------------> True:
                      |------------  Predict : {2: 1}
                    |-------------> False:
                      |------------  Predict : {0: 44, 1: 46, 2: 40}
[66]:
dt = DecisionTreeClassifier(min_samples_split=10)
dt.train(X=X_train,y=y_train,feature_name=feature_name)

y_pred = dt.predict(X_test)

pd.DataFrame(np.c_[y_test.reshape(-1,1),y_pred], columns=['original','predicted'])
[66]:
   original  predicted
0         1          1
1         0          0
2         2          2
3         1          1
4         1          2
5         0          0
6         1          1
7         2          2
[67]:
y_pred = dt.predict_probability(X_test)
pd.DataFrame(np.c_[y_test.reshape(-1,1),y_pred], columns=['original','predicted'])
[67]:
  original           predicted
0        1          {1: 100.0}
1        0          {0: 100.0}
2        2          {2: 100.0}
3        1          {1: 100.0}
4        1          {2: 100.0}
5        0          {0: 100.0}
6        1          {1: 100.0}
7        2  {1: 50.0, 2: 50.0}

Regression Tree#

Variance Reduction#

  • similar to classification, we need an information-gain-like measure to compare a node before and after splitting; for regression trees this is variance reduction.

  • the weighted variance of the child nodes (pr (percentage weight) * var) is subtracted from the variance of the parent, as shown in the sketch below.
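
A minimal sketch of that measure, mirroring the shape of info_gain above (variance_reduction and its use of np.var are illustrative assumptions, not necessarily mightypy's exact implementation):

    def variance_reduction(left, right, parent_variance):
        """Variance reduction used as the 'gain' for regression trees (illustrative sketch)."""
        pr = left.shape[0] / (left.shape[0] + right.shape[0])
        # weighted variance of the child nodes after the split
        child_variance = pr * np.var(left) + (1 - pr) * np.var(right)
        return parent_variance - child_variance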

Find best split in all options#

  • as in the classification tree, find the best split based on the maximum gain, here measured by variance reduction

  • for every candidate question, the values of X split the rows of y into a true branch and a false branch

  • the question that gives the maximum variance reduction (parent variance minus weighted child variance) is taken as the best split

  • some artifacts such as the best question, the best gain and the parent uncertainty are kept

Creating a Regression Model#

https://mightypy.readthedocs.io/en/main/api/mightypy.ml.html

[53]:
from mightypy.ml import DecisionTreeRegressor

Testing with Data#

[68]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
[93]:
X, y = make_regression(n_samples=500, n_features=4, random_state=2222)

X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=2222, test_size=0.10)

dt = DecisionTreeRegressor(max_depth=300, min_samples_split=3)
dt.train(X=X_train,y=y_train)

y_pred = dt.predict(X_test)
df = pd.DataFrame(np.c_[y_test.reshape(-1,1),y_pred],columns=['original','predicted'])
[94]:
print("Mean Squared Error :", mean_squared_error(df['original'],df['predicted']))
plt.plot(df['original'], '.-', label='original')
plt.plot(df['predicted'], '.-',label='predicted')
plt.legend()
plt.show()
Mean Squared Error : 8162.831645444925
../_images/TreeBasedModels_ExploreDecisionTree_58_1.png