Entropy and Information Gain in Decision Trees
A decision tree splits data by asking questions. The right question is the one that reduces uncertainty the most. Entropy measures that uncertainty. Information gain measures how much a split reduces it.
These notes come from a university coursework assignment on decision tree learning.
Entropy
Entropy measures the impurity of a set of data instances, i.e. how mixed the class labels are:
For binary classification (positive/negative):
Example from the assignment: a dataset that is 5/8 positive, 3/8 negative:
This is close to 1, meaning highly impure. A pure set (all one class) has . Maximum impurity (50/50 split) gives .
Information Gain
Information gain measures how much a split on attribute reduces the entropy of :
It is the entropy before the split minus the weighted average entropy of the subsets after the split.
From the assignment:
- For : (splitting on barely helps)
- For : (splitting on reduces uncertainty significantly)
is the better split.
The Problem with Information Gain
Information gain favours attributes with many distinct values. An attribute that gives every example its own unique branch achieves the maximum possible IG but is completely useless. The tree memorises training data and generalises to nothing.
Gain Ratio
Gain ratio corrects for this by penalising attributes that produce many branches:
Split Information (Intrinsic Value)
IV measures how broadly the attribute splits the data:
From the assignment: ,
Gain ratios:
- Attribute :
- Attribute : ← best split
- Attribute :
wins decisively at 0.86.
Worked Tree Example
Root entropy:
After splitting on :
is confirmed as the best first split.
Why Not Misclassification Error?
Misclassification error:
The problem: entropy and Gini index both decrease monotonically as a node gets purer through splits. Misclassification error does not. It plateaus in certain split configurations and fails to guide the tree toward better splits. This makes it a poor criterion for split selection, even though it is the ultimate metric we care about at test time.
Gini Index
An alternative to entropy, same purpose:
Both entropy and Gini decrease to zero for a pure node. In practice they produce similar trees; entropy is more commonly used in theoretical treatments.
What You Can Do Now
The code below computes entropy, information gain, and gain ratio from scratch, then identifies the best split attribute from a small dataset.
import numpy as np
def entropy(labels):
labels = np.array(labels)
n = len(labels)
if n == 0: return 0.0
_, counts = np.unique(labels, return_counts=True)
probs = counts / n
return -np.sum(probs * np.log2(probs + 1e-12))
def information_gain(parent_labels, subsets):
"""subsets: list of label arrays for each branch after the split."""
n = len(parent_labels)
weighted = sum((len(s) / n) * entropy(s) for s in subsets)
return entropy(parent_labels) - weighted
def intrinsic_value(subsets, n_total):
fracs = np.array([len(s) / n_total for s in subsets if len(s) > 0])
return -np.sum(fracs * np.log2(fracs + 1e-12))
# Dataset: 8 examples, labels Y/N, two binary attributes x1 and x2
labels = ['Y','Y','Y','Y','Y','N','N','N']
x1 = [ 0, 0, 1, 1, 0, 1, 0, 1] # binary attribute 1
x2 = [ 0, 1, 0, 1, 0, 1, 1, 0] # binary attribute 2
print(f"Root entropy: {entropy(labels):.4f}")
for attr_name, attr in [('x1', x1), ('x2', x2)]:
# Split into subsets by attribute value
labels_arr = np.array(labels)
attr_arr = np.array(attr)
subsets = [labels_arr[attr_arr == v] for v in np.unique(attr_arr)]
ig = information_gain(labels, subsets)
iv = intrinsic_value(subsets, len(labels))
gr = ig / iv if iv > 0 else 0.0
print(f"{attr_name}: IG={ig:.4f}, IV={iv:.4f}, GainRatio={gr:.4f}")
Swap in your own labels and attribute arrays to evaluate splits on real data. Add a third attribute to see how gain ratio corrects for attributes with many distinct values that information gain alone would favour.