The Perceptron: Linear Separability and Convergence
The perceptron is the simplest linear classifier. It learns a decision boundary by correcting mistakes. If the data is linearly separable, it is guaranteed to converge. The number of mistakes it makes is bounded.
The Linear Threshold Unit
The perceptron is built on the linear threshold unit:
The weight vector encodes the importance of each feature:
- Large → feature is important
- Small → feature is irrelevant
- Positive weight → feature pushes toward the positive class
The Algorithm
Assumption: data is linearly separable.
- Initialise
- For each misclassified point : update
- Repeat until all points are correctly classified
This is an online, mistake-driven algorithm. The weight vector only changes when the current hypothesis makes an error.
What the update does geometrically: the weight vector defines the normal to the decision boundary. Each update rotates the boundary toward the misclassified point, reducing the chance it will be misclassified on the next pass.
The Dot Product and Projection
The classifier scores a point as:
The sign of determines which side of the boundary falls on. Geometrically, this is the projection of onto . Positive projection means one class, negative means the other.
Learning Rate
- Too large → overshoots the correct boundary
- Too small → slow convergence
The perceptron is guaranteed to converge regardless of as long as the data is linearly separable. In the convergence proof, cancels out. In practice is standard.
Convergence Bound
Let:
- Radius , the radius of the data
- Margin , the margin (there exists with such that for all )
Then the number of mistakes the perceptron makes before converging is bounded by:
A larger margin → fewer mistakes before convergence. A larger data spread → more mistakes needed.
Generalisation
For the perceptron to generalise (perform well on unseen data), two conditions must hold:
- : in-sample error must be close to out-of-sample error
- must be small: the hypothesis must fit the training data
The VC dimension of a linear classifier in dimensions is . For sufficiently larger than , holds with high probability.
Since the perceptron converges only when the data is linearly separable (), and for large this implies , it generalises well when its assumptions hold.
What You Can Do Now
The code below trains a perceptron from scratch on a linearly separable dataset, printing the weight update at each mistake and confirming convergence.
import numpy as np
np.random.seed(0)
# Generate linearly separable data: two clusters
X_pos = np.random.randn(10, 2) + np.array([2, 2])
X_neg = np.random.randn(10, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*10 + [-1]*10)
# Augment with bias term: x -> [x, 1]
X_aug = np.hstack([X, np.ones((len(X), 1))])
w = np.zeros(X_aug.shape[1]) # initialise weights to zero
eta = 1.0 # learning rate (cancels in convergence proof)
mistakes, max_passes = 0, 100
for pass_num in range(max_passes):
errors_this_pass = 0
for i in range(len(X_aug)):
if y[i] * (w @ X_aug[i]) <= 0: # misclassified
w = w + eta * y[i] * X_aug[i]
mistakes += 1
errors_this_pass += 1
if errors_this_pass == 0:
print(f"Converged after {pass_num+1} pass(es), {mistakes} total mistakes")
break
# Verify: all points correctly classified
preds = np.sign(X_aug @ w)
print(f"Training accuracy: {np.mean(preds == y):.0%}")
print(f"Final weights: {w}")
Try replacing the cluster centres with overlapping ones (e.g. [1, 1] and [-1, -1]) to observe that the perceptron never converges on non-separable data. The bound can be computed from the data to predict the maximum number of mistakes before convergence.