Linear Regression: Least Squares, Pseudoinverse, and Gradient Descent
Linear regression minimises squared error. There is a closed-form solution using the pseudoinverse. Gradient descent solves the same problem iteratively and scales better to large feature sets.
Setup
Given , , . Goal: find such that across all training examples.
In-Sample Error
Expanded matrix form:
Closed-Form Solution: The Pseudoinverse
Set :
is the pseudoinverse of . This gives a unique solution when is invertible, most likely when . If is not invertible, multiple solutions exist.
Why Use Gradient Descent Instead?
Inverting is an operation. For large feature spaces (large ), this is prohibitively slow. Gradient descent avoids the inversion entirely.
Gradient Descent
Update rule:
Convergence condition: for small ,
The error decreases at each step. If is too large, the term dominates and can increase. Adaptive heuristic: normalises the step by the gradient magnitude.
Stochastic Gradient Descent
Instead of using all points (batch), use one randomly sampled point :
Advantages:
- Fast for streaming data. Each update is not
- The noise in the gradient can help escape local minima in non-convex problems
Why Squared Loss? A Probabilistic Justification
Assume each observation has Gaussian noise:
The log-likelihood:
Maximising log-likelihood = minimising . Squared loss is the maximum likelihood estimator under Gaussian noise. It is the right loss when that assumption holds.
What You Can Do Now
The code below fits a linear model using both the closed-form pseudoinverse and gradient descent, then compares the two solutions to confirm they converge to the same weights.
import numpy as np
np.random.seed(4)
N, d = 50, 2
# Generate data: y = 3*x1 - 2*x2 + 1 + noise
X = np.random.randn(N, d)
w_true = np.array([3.0, -2.0, 1.0]) # includes bias
X_aug = np.hstack([X, np.ones((N, 1))]) # augment with bias column
y = X_aug @ w_true + np.random.randn(N) * 0.5
# --- Closed-form solution (pseudoinverse) ---
w_closed = np.linalg.lstsq(X_aug, y, rcond=None)[0]
# Equivalent: w_closed = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y
# --- Gradient descent ---
def loss(w): return np.mean((X_aug @ w - y)**2)
def grad(w): return 2 * X_aug.T @ (X_aug @ w - y) / N
w_gd = np.zeros(d + 1)
eta = 0.1
for step in range(500):
w_gd = w_gd - eta * grad(w_gd)
print(f"True weights: {w_true}")
print(f"Closed-form solution: {w_closed}")
print(f"Gradient descent: {w_gd}")
print(f"\nLoss (closed-form): {loss(w_closed):.6f}")
print(f"Loss (grad descent): {loss(w_gd):.6f}")
print(f"Max weight diff: {np.max(np.abs(w_closed - w_gd)):.6f}")
Increase d to a large number (e.g. 500) and time both methods. The closed-form solution will slow dramatically as it inverts a matrix, while gradient descent scales linearly with d. Try adding stochastic gradient descent (one random sample per step) to see noisier but faster convergence on large datasets.