Entropy and Information Gain in Decision Trees

These notes come from a university coursework assignment on decision tree learning.

Entropy

Entropy measures the impurity of a set of data instances, i.e. how mixed the class labels are:

H(S) = -\sum_i p_i \log_2 p_i

For binary classification (positive/negative):

H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-

Example from the assignment: a dataset that is 5/8 positive, 3/8 negative:

H(S) = -\frac{5}{8}\log_2\frac{5}{8} - \frac{3}{8}\log_2\frac{3}{8} \approx 0.954

This is close to 1, meaning highly impure. A pure set (all one class) has $H = 0$ . Maximum impurity (50/50 split) gives $H = 1$ .

Information Gain

Information gain measures how much a split on attribute $A$ reduces the entropy of $S$ :

\text{IG}(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)

It is the entropy before the split minus the weighted average entropy of the subsets after the split.

From the assignment:

For $x_1$ : $\text{IG}(S, x_1) = 0.01$ (splitting on $x_1$ barely helps)
For $x_2$ : $\text{IG}(S, x_2) = 0.54$ (splitting on $x_2$ reduces uncertainty significantly)

$x_2$ is the better split.

The Problem with Information Gain

Information gain favours attributes with many distinct values. An attribute that gives every example its own unique branch achieves the maximum possible IG but is completely useless. The tree memorises training data and generalises to nothing.

Gain Ratio

Gain ratio corrects for this by penalising attributes that produce many branches:

\text{GR}(S, A) = \frac{\text{IG}(S, A)}{\text{IV}(S, A)}

Split Information (Intrinsic Value)

IV measures how broadly the attribute splits the data:

\text{IV}(S, A) = -\sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

From the assignment: $\text{IV}(S, x_1) = 1.56$ , $\text{IV}(S, x_2) = 1.5$

Gain ratios:

Attribute $x_1$ : $\text{GR}(x_1) = \frac{0.01}{1.56} = 0.006$
Attribute $x_2$ : $\text{GR}(x_2) = \frac{0.54}{1.5} = 0.86$ ← best split
Attribute $x_3$ : $\text{GR}(x_3) = 0.15$

$x_2$ wins decisively at 0.86.

Worked Tree Example

Root entropy: $H(S) = 0.954$

After splitting on $x_2$ :

H(\{S_1, V_1\}) = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.901

H(\{R, R_2\}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1.0

\text{Gain}(x_2) = 0.954 - \frac{3}{8}(0.901) - \frac{5}{8}(0.922) = 0.949

$x_2$ is confirmed as the best first split.

Why Not Misclassification Error?

Misclassification error: $1 - \max(p_+, p_-)$

The problem: entropy and Gini index both decrease monotonically as a node gets purer through splits. Misclassification error does not. It plateaus in certain split configurations and fails to guide the tree toward better splits. This makes it a poor criterion for split selection, even though it is the ultimate metric we care about at test time.

Gini Index

An alternative to entropy, same purpose:

\text{Gini}(S) = 1 - \sum_i p_i^2

Both entropy and Gini decrease to zero for a pure node. In practice they produce similar trees; entropy is more commonly used in theoretical treatments.

What You Can Do Now

The code below computes entropy, information gain, and gain ratio from scratch, then identifies the best split attribute from a small dataset.

import numpy as np

def entropy(labels):
    labels = np.array(labels)
    n = len(labels)
    if n == 0: return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    return -np.sum(probs * np.log2(probs + 1e-12))

def information_gain(parent_labels, subsets):
    """subsets: list of label arrays for each branch after the split."""
    n = len(parent_labels)
    weighted = sum((len(s) / n) * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

def intrinsic_value(subsets, n_total):
    fracs = np.array([len(s) / n_total for s in subsets if len(s) > 0])
    return -np.sum(fracs * np.log2(fracs + 1e-12))

# Dataset: 8 examples, labels Y/N, two binary attributes x1 and x2
labels = ['Y','Y','Y','Y','Y','N','N','N']
x1     = [ 0,  0,  1,  1,  0,  1,  0,  1]   # binary attribute 1
x2     = [ 0,  1,  0,  1,  0,  1,  1,  0]   # binary attribute 2

print(f"Root entropy: {entropy(labels):.4f}")

for attr_name, attr in [('x1', x1), ('x2', x2)]:
    # Split into subsets by attribute value
    labels_arr = np.array(labels)
    attr_arr = np.array(attr)
    subsets = [labels_arr[attr_arr == v] for v in np.unique(attr_arr)]

    ig = information_gain(labels, subsets)
    iv = intrinsic_value(subsets, len(labels))
    gr = ig / iv if iv > 0 else 0.0
    print(f"{attr_name}: IG={ig:.4f}, IV={iv:.4f}, GainRatio={gr:.4f}")

Swap in your own labels and attribute arrays to evaluate splits on real data. Add a third attribute to see how gain ratio corrects for attributes with many distinct values that information gain alone would favour.