Contents
  1. The Linear Threshold Unit
  2. The Algorithm
  3. The Dot Product and Projection
  4. Learning Rate
  5. Convergence Bound
  6. Generalisation
  7. What You Can Do Now
← All posts

The Perceptron: Linear Separability and Convergence

The perceptron is the simplest linear classifier. It learns a decision boundary by correcting mistakes. If the data is linearly separable, it is guaranteed to converge. The number of mistakes it makes is bounded.

The Linear Threshold Unit

The perceptron is built on the linear threshold unit:

h(x)=sign(wTx)h(x) = \text{sign}(w^T x)

The weight vector ww encodes the importance of each feature:

  • Large wi|w_i| → feature xix_i is important
  • Small wi|w_i| → feature xix_i is irrelevant
  • Positive weight wi>0w_i > 0 → feature xix_i pushes toward the positive class

The Algorithm

Assumption: data is linearly separable.

  1. Initialise w(0)=0w(0) = 0
  2. For each misclassified point (xi,yi)(x_i, y_i): update w(t+1)=w(t)+ηyixiw(t+1) = w(t) + \eta \, y_i x_i
  3. Repeat until all points are correctly classified

This is an online, mistake-driven algorithm. The weight vector only changes when the current hypothesis makes an error.

What the update does geometrically: the weight vector defines the normal to the decision boundary. Each update rotates the boundary toward the misclassified point, reducing the chance it will be misclassified on the next pass.

The Dot Product and Projection

The classifier scores a point xx as:

s=wTx=wxcosθs = w^T x = \|w\| \|x\| \cos\theta

The sign of ss determines which side of the boundary xx falls on. Geometrically, this is the projection of xx onto ww. Positive projection means one class, negative means the other.

Learning Rate

  • Too large → overshoots the correct boundary
  • Too small → slow convergence

The perceptron is guaranteed to converge regardless of η\eta as long as the data is linearly separable. In the convergence proof, η\eta cancels out. In practice η=1\eta = 1 is standard.

Convergence Bound

Let:

  • Radius R=maxixiR = \max_i \|x_i\|, the radius of the data
  • Margin γ>0\gamma > 0, the margin (there exists ww^* with w=1\|w^*\| = 1 such that yi(wTxi)γy_i(w^{*T} x_i) \geq \gamma for all ii)

Then the number of mistakes the perceptron makes before converging is bounded by:

mistakes(Rγ)2\text{mistakes} \leq \left(\frac{R}{\gamma}\right)^2

A larger margin γ\gamma → fewer mistakes before convergence. A larger data spread RR → more mistakes needed.

Generalisation

For the perceptron to generalise (perform well on unseen data), two conditions must hold:

  1. Ein(g)Eout(g)E_{\text{in}}(g) \approx E_{\text{out}}(g): in-sample error must be close to out-of-sample error
  2. EinE_{\text{in}} must be small: the hypothesis must fit the training data

The VC dimension of a linear classifier in dd dimensions is d+1d + 1. For NN sufficiently larger than d+1d + 1, EinEoutE_{\text{in}} \approx E_{\text{out}} holds with high probability.

Since the perceptron converges only when the data is linearly separable (Ein=0E_{\text{in}} = 0), and for large NN this implies Eout0E_{\text{out}} \approx 0, it generalises well when its assumptions hold.

What You Can Do Now

The code below trains a perceptron from scratch on a linearly separable dataset, printing the weight update at each mistake and confirming convergence.

import numpy as np

np.random.seed(0)

# Generate linearly separable data: two clusters
X_pos = np.random.randn(10, 2) + np.array([2, 2])
X_neg = np.random.randn(10, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*10 + [-1]*10)

# Augment with bias term: x -> [x, 1]
X_aug = np.hstack([X, np.ones((len(X), 1))])

w = np.zeros(X_aug.shape[1])   # initialise weights to zero
eta = 1.0                        # learning rate (cancels in convergence proof)

mistakes, max_passes = 0, 100
for pass_num in range(max_passes):
    errors_this_pass = 0
    for i in range(len(X_aug)):
        if y[i] * (w @ X_aug[i]) <= 0:   # misclassified
            w = w + eta * y[i] * X_aug[i]
            mistakes += 1
            errors_this_pass += 1
    if errors_this_pass == 0:
        print(f"Converged after {pass_num+1} pass(es), {mistakes} total mistakes")
        break

# Verify: all points correctly classified
preds = np.sign(X_aug @ w)
print(f"Training accuracy: {np.mean(preds == y):.0%}")
print(f"Final weights: {w}")

Try replacing the cluster centres with overlapping ones (e.g. [1, 1] and [-1, -1]) to observe that the perceptron never converges on non-separable data. The bound (R/γ)2(R/\gamma)^2 can be computed from the data to predict the maximum number of mistakes before convergence.

← All posts