Decision Tree Implementations

Note_Udacity: Intro to ML with Tensor Flow – Scholarship Program Nanodegree Program

Calculating Information Gain of Different Split Criteria:

The data lists as below, consisting of twenty-four made-up insects measured on their length and color. Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?

Color = Brown
Color = Blue
Color = Green
Length < 17 mm
Length < 20 mm

Here is my solution:

The output looks like this:

Color = Brown : 0.062
Color = Blue : 0.001
Color = Green : 0.043
Length < 17 : 0.113
Length < 20 : 0.101

We gain more information from the splitting criteria- Length < 17.

import numpy as np
import pandas as pd

df = pd.read_csv('ml-bugs.csv')
df.head()

def count_item(df, title, item):
    number = df[df[title] == item].shape[0]
    return number

def get_species_entropy(df):
    number_lobug = count_item(df, 'Species', 'Lobug')
    number_mobug = count_item(df, 'Species', 'Mobug')
    total_bugs = number_lobug + number_mobug

    num_bugs_list = [number_lobug, number_mobug]
    props = np.array([i/ total_bugs for i in num_bugs_list])
    logs = np.log2(props)
    entropy = np.sum(-props * logs)
    return entropy

def get_info_gain(item):
    sub = df[item].copy()
    not_sub = df[~item].copy()
    number_sub = sub.shape[0]
    number_not_sub = not_sub.shape[0]
    total = number_sub + number_not_sub

    sub_weight = number_sub / total
    not_sub_weight = number_not_sub / total
    sub_entropy = get_species_entropy(sub)
    not_sub_entropy = get_species_entropy(not_sub)
    parent_entropy = get_species_entropy(df)

    return parent_entropy - (sub_weight * sub_entropy + \
                             not_sub_weight * not_sub_entropy)

for item, string in [(df['Color'] == 'Brown', 'Color = Brown'),
                     (df['Color'] == 'Blue', 'Color = Blue'),
                     (df['Color'] == 'Green', 'Color = Green'),
                     (df['Length (mm)'] < 17, 'Length < 17'),
                     (df['Length (mm)'] < 20, 'Length < 20')]:
    info_gain = get_info_gain(item)
    print(string,':','{:.3f}'.format(info_gain))

Reference:

https://ryanwingate.com/intro-to-machine-learning/supervised/decision-tree-implementations/
Information gain in decision trees: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees

Gina Hsu

gina@ginahsu.com

Decision Tree Implementations

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply