Contents
  1. Setup
  2. In-Sample Error
  3. Closed-Form Solution: The Pseudoinverse
  4. Why Use Gradient Descent Instead?
  5. Gradient Descent
  6. Stochastic Gradient Descent
  7. Why Squared Loss? A Probabilistic Justification
  8. What You Can Do Now
← All posts

Linear Regression: Least Squares, Pseudoinverse, and Gradient Descent

Linear regression minimises squared error. There is a closed-form solution using the pseudoinverse. Gradient descent solves the same problem iteratively and scales better to large feature sets.

Setup

Given xRdx \in \mathbb{R}^d, wRdw \in \mathbb{R}^d, yRy \in \mathbb{R}. Goal: find ww such that wTxyw^T x \approx y across all training examples.

In-Sample Error

Ein(w)=1mi(wTxiyi)2=1mXwy2E_{\text{in}}(w) = \frac{1}{m}\sum_i (w^T x_i - y_i)^2 = \frac{1}{m}\|Xw - y\|^2

Expanded matrix form:

Ein(w)=1m(wTXTXw2wTXTy+yTy)E_{\text{in}}(w) = \frac{1}{m}(w^T X^T X w - 2w^T X^T y + y^T y)

Closed-Form Solution: The Pseudoinverse

Set Ein(w)=0\nabla E_{\text{in}}(w) = 0:

XTXw=XTyX^T X w = X^T y
w=(XTX)1XTyw = (X^T X)^{-1} X^T y

(XTX)1XT(X^T X)^{-1} X^T is the pseudoinverse of XX. This gives a unique solution when XTXX^T X is invertible, most likely when N>d+1N > d + 1. If XTXX^T X is not invertible, multiple solutions exist.

Why Use Gradient Descent Instead?

Inverting XTXX^T X is an O(d3)O(d^3) operation. For large feature spaces (large dd), this is prohibitively slow. Gradient descent avoids the inversion entirely.

Gradient Descent

E(w)=12Xwy2E(w) = \frac{1}{2}\|Xw - y\|^2
E(w)=XT(Xwy)\nabla E(w) = X^T(Xw - y)

Update rule:

wwηXT(Xwy)w \leftarrow w - \eta \, X^T(Xw - y)

Convergence condition: for small η\eta,

ΔEincEin2<0\Delta E_{\text{in}} \approx -c\|\nabla E_{\text{in}}\|^2 < 0

The error decreases at each step. If η\eta is too large, the O(η2)O(\eta^2) term dominates and EinE_{\text{in}} can increase. Adaptive heuristic: ηt=ηEin1\eta_t = \eta \|\nabla E_{\text{in}}\|^{-1} normalises the step by the gradient magnitude.

Stochastic Gradient Descent

Instead of using all NN points (batch), use one randomly sampled point (xn,yn)(x_n, y_n):

wwη(wTxnyn)xnw \leftarrow w - \eta (w^T x_n - y_n) x_n

Advantages:

  1. Fast for streaming data. Each update is O(d)O(d) not O(Nd)O(Nd)
  2. The noise in the gradient can help escape local minima in non-convex problems

Why Squared Loss? A Probabilistic Justification

Assume each observation has Gaussian noise:

yi=wTxi+εi,εiN(0,σ2)y_i = w^T x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

The log-likelihood:

lnL(wD)=12σ2i(yiwTxi)2+const\ln L(w | D) = -\frac{1}{2\sigma^2}\sum_i (y_i - w^T x_i)^2 + \text{const}

Maximising log-likelihood = minimising i(yiwTxi)2\sum_i (y_i - w^T x_i)^2. Squared loss is the maximum likelihood estimator under Gaussian noise. It is the right loss when that assumption holds.

What You Can Do Now

The code below fits a linear model using both the closed-form pseudoinverse and gradient descent, then compares the two solutions to confirm they converge to the same weights.

import numpy as np

np.random.seed(4)
N, d = 50, 2

# Generate data: y = 3*x1 - 2*x2 + 1 + noise
X = np.random.randn(N, d)
w_true = np.array([3.0, -2.0, 1.0])    # includes bias
X_aug = np.hstack([X, np.ones((N, 1))])  # augment with bias column
y = X_aug @ w_true + np.random.randn(N) * 0.5

# --- Closed-form solution (pseudoinverse) ---
w_closed = np.linalg.lstsq(X_aug, y, rcond=None)[0]
# Equivalent: w_closed = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y

# --- Gradient descent ---
def loss(w):      return np.mean((X_aug @ w - y)**2)
def grad(w):      return 2 * X_aug.T @ (X_aug @ w - y) / N

w_gd = np.zeros(d + 1)
eta = 0.1
for step in range(500):
    w_gd = w_gd - eta * grad(w_gd)

print(f"True weights:         {w_true}")
print(f"Closed-form solution: {w_closed}")
print(f"Gradient descent:     {w_gd}")
print(f"\nLoss (closed-form):   {loss(w_closed):.6f}")
print(f"Loss (grad descent):  {loss(w_gd):.6f}")
print(f"Max weight diff:      {np.max(np.abs(w_closed - w_gd)):.6f}")

Increase d to a large number (e.g. 500) and time both methods. The closed-form solution will slow dramatically as it inverts a d×dd \times d matrix, while gradient descent scales linearly with d. Try adding stochastic gradient descent (one random sample per step) to see noisier but faster convergence on large datasets.

← All posts