Contents
  1. Entropy
  2. Information Gain
  3. The Problem with Information Gain
  4. Gain Ratio
  5. Split Information (Intrinsic Value)
  6. Worked Tree Example
  7. Why Not Misclassification Error?
  8. Gini Index
  9. What You Can Do Now
← All posts

Entropy and Information Gain in Decision Trees

A decision tree splits data by asking questions. The right question is the one that reduces uncertainty the most. Entropy measures that uncertainty. Information gain measures how much a split reduces it.

These notes come from a university coursework assignment on decision tree learning.

Entropy

Entropy measures the impurity of a set of data instances, i.e. how mixed the class labels are:

H(S)=ipilog2piH(S) = -\sum_i p_i \log_2 p_i

For binary classification (positive/negative):

H(S)=p+log2p+plog2pH(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-

Example from the assignment: a dataset that is 5/8 positive, 3/8 negative:

H(S)=58log25838log2380.954H(S) = -\frac{5}{8}\log_2\frac{5}{8} - \frac{3}{8}\log_2\frac{3}{8} \approx 0.954

This is close to 1, meaning highly impure. A pure set (all one class) has H=0H = 0. Maximum impurity (50/50 split) gives H=1H = 1.

Information Gain

Information gain measures how much a split on attribute AA reduces the entropy of SS:

IG(S,A)=H(S)vvalues(A)SvSH(Sv)\text{IG}(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)

It is the entropy before the split minus the weighted average entropy of the subsets after the split.

From the assignment:

  • For x1x_1: IG(S,x1)=0.01\text{IG}(S, x_1) = 0.01 (splitting on x1x_1 barely helps)
  • For x2x_2: IG(S,x2)=0.54\text{IG}(S, x_2) = 0.54 (splitting on x2x_2 reduces uncertainty significantly)

x2x_2 is the better split.

The Problem with Information Gain

Information gain favours attributes with many distinct values. An attribute that gives every example its own unique branch achieves the maximum possible IG but is completely useless. The tree memorises training data and generalises to nothing.

Gain Ratio

Gain ratio corrects for this by penalising attributes that produce many branches:

GR(S,A)=IG(S,A)IV(S,A)\text{GR}(S, A) = \frac{\text{IG}(S, A)}{\text{IV}(S, A)}

Split Information (Intrinsic Value)

IV measures how broadly the attribute splits the data:

IV(S,A)=vvalues(A)SvSlog2SvS\text{IV}(S, A) = -\sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

From the assignment: IV(S,x1)=1.56\text{IV}(S, x_1) = 1.56, IV(S,x2)=1.5\text{IV}(S, x_2) = 1.5

Gain ratios:

  • Attribute x1x_1: GR(x1)=0.011.56=0.006\text{GR}(x_1) = \frac{0.01}{1.56} = 0.006
  • Attribute x2x_2: GR(x2)=0.541.5=0.86\text{GR}(x_2) = \frac{0.54}{1.5} = 0.86 ← best split
  • Attribute x3x_3: GR(x3)=0.15\text{GR}(x_3) = 0.15

x2x_2 wins decisively at 0.86.

Worked Tree Example

Root entropy: H(S)=0.954H(S) = 0.954

After splitting on x2x_2:

H({S1,V1})=23log22313log213=0.901H(\{S_1, V_1\}) = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.901
H({R,R2})=12log21212log212=1.0H(\{R, R_2\}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1.0
Gain(x2)=0.95438(0.901)58(0.922)=0.949\text{Gain}(x_2) = 0.954 - \frac{3}{8}(0.901) - \frac{5}{8}(0.922) = 0.949

x2x_2 is confirmed as the best first split.

Why Not Misclassification Error?

Misclassification error: 1max(p+,p)1 - \max(p_+, p_-)

The problem: entropy and Gini index both decrease monotonically as a node gets purer through splits. Misclassification error does not. It plateaus in certain split configurations and fails to guide the tree toward better splits. This makes it a poor criterion for split selection, even though it is the ultimate metric we care about at test time.

Gini Index

An alternative to entropy, same purpose:

Gini(S)=1ipi2\text{Gini}(S) = 1 - \sum_i p_i^2

Both entropy and Gini decrease to zero for a pure node. In practice they produce similar trees; entropy is more commonly used in theoretical treatments.

What You Can Do Now

The code below computes entropy, information gain, and gain ratio from scratch, then identifies the best split attribute from a small dataset.

import numpy as np

def entropy(labels):
    labels = np.array(labels)
    n = len(labels)
    if n == 0: return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    return -np.sum(probs * np.log2(probs + 1e-12))

def information_gain(parent_labels, subsets):
    """subsets: list of label arrays for each branch after the split."""
    n = len(parent_labels)
    weighted = sum((len(s) / n) * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

def intrinsic_value(subsets, n_total):
    fracs = np.array([len(s) / n_total for s in subsets if len(s) > 0])
    return -np.sum(fracs * np.log2(fracs + 1e-12))

# Dataset: 8 examples, labels Y/N, two binary attributes x1 and x2
labels = ['Y','Y','Y','Y','Y','N','N','N']
x1     = [ 0,  0,  1,  1,  0,  1,  0,  1]   # binary attribute 1
x2     = [ 0,  1,  0,  1,  0,  1,  1,  0]   # binary attribute 2

print(f"Root entropy: {entropy(labels):.4f}")

for attr_name, attr in [('x1', x1), ('x2', x2)]:
    # Split into subsets by attribute value
    labels_arr = np.array(labels)
    attr_arr = np.array(attr)
    subsets = [labels_arr[attr_arr == v] for v in np.unique(attr_arr)]

    ig = information_gain(labels, subsets)
    iv = intrinsic_value(subsets, len(labels))
    gr = ig / iv if iv > 0 else 0.0
    print(f"{attr_name}: IG={ig:.4f}, IV={iv:.4f}, GainRatio={gr:.4f}")

Swap in your own labels and attribute arrays to evaluate splits on real data. Add a third attribute to see how gain ratio corrects for attributes with many distinct values that information gain alone would favour.

← All posts