The Perceptron: Linear Separability and Convergence

The Linear Threshold Unit

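The perceptron is built on the linear threshold unit, which labels a point by the sign of a weighted sum of its features:

$$h(x) = \text{sign}(w^T x)$$

Any bias term is absorbed into $w$ by appending a constant 1 to $x$.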

The weight vector $w$ encodes the importance of each feature:

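Large $|w_i|$ → feature $x_i$ strongly influences the score (assuming features are on comparable scales)
Small $|w_i|$ → feature $x_i$ has little influence
Positive weight $w_i > 0$ → larger $x_i$ pushes toward the positive class
Negative weight $w_i < 0$ → larger $x_i$ pushes toward the negative class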
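
A minimal sketch of the threshold unit, using made-up numbers. The weights, the inputs, and the helper name `predict` are illustrative assumptions, and mapping a score of exactly zero to $-1$ is just one possible convention.

```python
import numpy as np

def predict(w, x):
    """Linear threshold unit: +1 if w^T x > 0, otherwise -1 (ties go to -1 here)."""
    return 1 if float(np.dot(w, x)) > 0 else -1

# Hypothetical weights: feature 0 pushes positive, feature 1 pushes negative,
# feature 2 has zero weight and is ignored.
w = np.array([2.0, -0.5, 0.0])

x_a = np.array([1.0, 1.0, 5.0])   # score = 2.0 - 0.5 + 0.0 = 1.5
x_b = np.array([0.0, 3.0, 5.0])   # score = 0.0 - 1.5 + 0.0 = -1.5

print(predict(w, x_a), predict(w, x_b))
```

Changing the third feature of either input leaves the prediction unchanged, because its weight is zero.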

The Algorithm

Assumption: data is linearly separable.

Initialise $w(0) = 0$
For each misclassified point $(x_i, y_i)$ : update $w(t+1) = w(t) + \eta \, y_i x_i$
Repeat until all points are correctly classified

This is an online, mistake-driven algorithm. The weight vector only changes when the current hypothesis makes an error.

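What the update does geometrically: the weight vector is the normal to the decision boundary. Each update adds $\eta \, y_i x_i$ to $w$, which raises $y_i \, w^T x_i$ by exactly $\eta \|x_i\|^2$ and so rotates the boundary toward classifying $x_i$ correctly. A single update does not guarantee that $x_i$ ends up on the correct side, but it always moves the score in the right direction.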
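
The effect of one update can be checked numerically. The sketch below uses hypothetical values for `w`, `x_i`, and `y_i`, and confirms that the signed score $y_i \, w^T x_i$ grows by $\eta \|x_i\|^2$.

```python
import numpy as np

w = np.array([1.0, -1.0])     # hypothetical current weights
x_i = np.array([1.0, 2.0])    # hypothetical training point
y_i = 1                       # its true label
eta = 1.0

score_before = y_i * (w @ x_i)        # 1*1 + (-1)*2 = -1, so x_i is misclassified
w_new = w + eta * y_i * x_i           # perceptron update
score_after = y_i * (w_new @ x_i)

gain = score_after - score_before     # equals eta * ||x_i||^2
print(score_before, score_after, gain)
```

In this example the point becomes correctly classified after one step; in general several updates may be needed.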

The Dot Product and Projection

The classifier scores a point $x$ as:

$$s = w^T x = \|w\| \|x\| \cos\theta$$

The sign of $s$ determines which side of the boundary $x$ falls on. Geometrically, $s / \|w\|$ is the signed length of the projection of $x$ onto $w$, where $\theta$ is the angle between the two vectors. A positive projection means one class, a negative projection the other.
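
As a quick check of the identity, the sketch below computes the score both as a dot product and from norms and the angle. The vectors are arbitrary illustrative values, and the angle is obtained from polar angles, which only works in two dimensions.

```python
import numpy as np

w = np.array([3.0, 4.0])   # norm is 5
x = np.array([2.0, 1.0])

s = w @ x                  # 3*2 + 4*1 = 10

# Angle between x and w, computed independently from their polar angles (2-D only)
theta = np.arctan2(x[1], x[0]) - np.arctan2(w[1], w[0])
s_geometric = np.linalg.norm(w) * np.linalg.norm(x) * np.cos(theta)

# Signed length of the projection of x onto w
signed_projection = s / np.linalg.norm(w)   # 10 / 5 = 2

print(s, s_geometric, signed_projection)
```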

Learning Rate

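In gradient-based learners, a step size that is too large overshoots and one that is too small slows convergence. Neither concern applies to the perceptron started at $w(0) = 0$: there, $\eta$ only rescales $w$, so the sign of $w^T x$, and hence every prediction and every mistake, is unchanged.

The perceptron is therefore guaranteed to converge for any $\eta > 0$ as long as the data is linearly separable. In the convergence proof, $\eta$ cancels out. In practice $\eta = 1$ is standard.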
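
The independence from $\eta$ can be checked empirically. The sketch below is an illustration on a small hand-made dataset (the data and the helper name `train_perceptron` are assumptions made for this example); it trains from $w = 0$ with three learning rates and compares the results. Powers of two are used for $\eta$ so that the rescaling is exact in floating point.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_passes=100):
    """Cycle through the data, updating on each mistake; return (weights, mistake count)."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:          # misclassified (or on the boundary)
                w = w + eta * y_i * x_i
                mistakes += 1
                errors += 1
        if errors == 0:                        # clean pass: converged
            break
    return w, mistakes

# Hand-made separable data; the last column is the constant bias input
X = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [3.0, 2.5, 1.0],
              [0.0, -0.5, 1.0],
              [-1.0, -0.5, 1.0]])
y = np.array([1, -1, 1, 1, -1])

w_1, m_1 = train_perceptron(X, y, eta=1.0)
w_small, m_small = train_perceptron(X, y, eta=0.25)
w_large, m_large = train_perceptron(X, y, eta=4.0)

print(m_1, m_small, m_large)   # identical mistake counts
print(w_1, w_small, w_large)   # same direction, different lengths
```

This argument relies on the zero initialisation. With a nonzero starting vector, $\eta$ changes the relative weight of the initial guess and the updates, so the mistake sequence can differ.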

Convergence Bound

Let:

$R = \max_i \|x_i\|$, the radius of the data
$\gamma > 0$, the margin: there exists $w^*$ with $\|w^*\| = 1$ such that $y_i(w^{*T} x_i) \geq \gamma$ for all $i$

Then the number of mistakes the perceptron makes before converging is bounded by:

$$\text{mistakes} \leq \left(\frac{R}{\gamma}\right)^2$$

A larger margin $\gamma$ → a smaller bound, so fewer mistakes in the worst case. A larger data radius $R$ → a larger bound. This is an upper limit; the actual number of mistakes is often far lower.
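
The bound can be evaluated for a concrete dataset. In the sketch below, the data and the reference separator `w_ref` are hypothetical values chosen for illustration. Any unit-norm vector that separates the data yields a valid bound; the tightest one would come from the maximum-margin separator, which is not computed here.

```python
import numpy as np

# Hand-made separable data; the last column is the constant bias input
X = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [3.0, 2.5, 1.0],
              [0.0, -0.5, 1.0],
              [-1.0, -0.5, 1.0]])
y = np.array([1, -1, 1, 1, -1])

# Hypothetical reference separator, normalised to unit length
w_ref = np.array([1.0, -1.0, 0.0])
w_ref = w_ref / np.linalg.norm(w_ref)

R = np.max(np.linalg.norm(X, axis=1))   # radius of the (augmented) data
gamma = np.min(y * (X @ w_ref))         # margin achieved by w_ref; must be > 0
bound = (R / gamma) ** 2

# Run the perceptron and count mistakes. Every pass that is not clean contains
# at least one mistake, so int(bound) + 1 passes always suffice.
w = np.zeros(X.shape[1])
mistakes = 0
for _ in range(int(bound) + 1):
    errors = 0
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w = w + y_i * x_i
            mistakes += 1
            errors += 1
    if errors == 0:
        break

print(f"R = {R:.3f}, gamma = {gamma:.3f}, bound = {bound:.1f}, mistakes = {mistakes}")
```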

Generalisation

For the perceptron to generalise (perform well on unseen data), two conditions must hold:

$E_{\text{in}}(g) \approx E_{\text{out}}(g)$ : in-sample error must be close to out-of-sample error
$E_{\text{in}}$ must be small: the hypothesis must fit the training data

The VC dimension of a linear classifier in $d$ dimensions is $d + 1$ when a bias term is included. For $N$ sufficiently larger than $d + 1$ , $E_{\text{in}} \approx E_{\text{out}}$ holds with high probability.

When the data is linearly separable, the perceptron converges to a hypothesis with $E_{\text{in}} = 0$ . For large $N$ this implies $E_{\text{out}} \approx 0$ with high probability, so the perceptron generalises well when its assumptions hold.
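
A rough empirical illustration of this argument, under stated assumptions: labels come from a hypothetical target `w_true`, inputs are standard normal in two dimensions, and points closer than `min_margin` to the target boundary are discarded so that the training set is separable with a known margin. Fresh samples from the same process stand in for unseen data.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([1.0, -1.0, 0.5])          # hypothetical target; last entry is the bias
w_unit = w_true / np.linalg.norm(w_true)

def sample(n, min_margin=0.1):
    """Draw up to n augmented points [x1, x2, 1], labelled by the target, away from its boundary."""
    X = np.hstack([rng.standard_normal((4 * n, 2)), np.ones((4 * n, 1))])
    scores = X @ w_unit
    keep = np.abs(scores) >= min_margin       # enforce a margin so training converges
    X, scores = X[keep][:n], scores[keep][:n]
    return X, np.where(scores > 0, 1, -1)

def train_perceptron(X, y, max_passes=5000):
    """Cycle through the data, updating on mistakes, until a pass is clean."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w = w + y_i * x_i
                errors += 1
        if errors == 0:
            break
    return w

def error_rate(w, X, y):
    """Fraction of points whose predicted sign differs from the label."""
    return float(np.mean(np.where(X @ w > 0, 1, -1) != y))

X_train, y_train = sample(200)
X_test, y_test = sample(10_000)

w = train_perceptron(X_train, y_train)
e_in = error_rate(w, X_train, y_train)
e_out = error_rate(w, X_test, y_test)
print(f"E_in = {e_in:.3f}, E_out (estimated on fresh data) = {e_out:.3f}")
```

Repeating the run with a smaller training set should, on average, widen the gap between the two error rates.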

What You Can Do Now

The code below trains a perceptron from scratch on a dataset of two well-separated clusters, counting the mistakes made along the way and confirming convergence.

import numpy as np

np.random.seed(0)

# Generate linearly separable data: two clusters
X_pos = np.random.randn(10, 2) + np.array([2, 2])
X_neg = np.random.randn(10, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*10 + [-1]*10)

# Augment with bias term: x -> [x, 1]
X_aug = np.hstack([X, np.ones((len(X), 1))])

w = np.zeros(X_aug.shape[1])   # initialise weights to zero
eta = 1.0                        # learning rate (cancels in convergence proof)

mistakes, max_passes = 0, 100
for pass_num in range(max_passes):
    errors_this_pass = 0
    for i in range(len(X_aug)):
        if y[i] * (w @ X_aug[i]) <= 0:   # misclassified
            w = w + eta * y[i] * X_aug[i]
            mistakes += 1
            errors_this_pass += 1
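    if errors_this_pass == 0:
        print(f"Converged after {pass_num+1} pass(es), {mistakes} total mistakes")
        break
else:
    # for/else: runs only if the loop finished without hitting `break`
    print(f"No convergence within {max_passes} passes ({mistakes} mistakes so far)")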

# Verify: all points correctly classified
preds = np.sign(X_aug @ w)
print(f"Training accuracy: {np.mean(preds == y):.0%}")
print(f"Final weights: {w}")

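Try moving the cluster centres closer together (e.g. [1, 1] and [-1, -1]) so that the classes are likely to overlap. On non-separable data the perceptron never converges, and the loop simply runs until max_passes is exhausted. For separable data, the bound $(R/\gamma)^2$ can be evaluated using any unit-norm separating vector in place of $w^*$ (the tightest value uses the maximum-margin separator), with $R$ and $\gamma$ computed on the augmented vectors.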