Linear Regression: Least Squares, Pseudoinverse, and Gradient Descent

Setup

Given $x \in \mathbb{R}^d$ , $w \in \mathbb{R}^d$ , $y \in \mathbb{R}$ . Goal: find $w$ such that $w^T x \approx y$ across all training examples.

In-Sample Error

E_{\text{in}}(w) = \frac{1}{m}\sum_i (w^T x_i - y_i)^2 = \frac{1}{m}\|Xw - y\|^2

Expanded matrix form:

E_{\text{in}}(w) = \frac{1}{m}(w^T X^T X w - 2w^T X^T y + y^T y)

Closed-Form Solution: The Pseudoinverse

Set $\nabla E_{\text{in}}(w) = 0$ :

X^T X w = X^T y

w = (X^T X)^{-1} X^T y

$(X^T X)^{-1} X^T$ is the pseudoinverse of $X$ . This gives a unique solution when $X^T X$ is invertible, most likely when $N > d + 1$ . If $X^T X$ is not invertible, multiple solutions exist.

Why Use Gradient Descent Instead?

Inverting $X^T X$ is an $O(d^3)$ operation. For large feature spaces (large $d$ ), this is prohibitively slow. Gradient descent avoids the inversion entirely.

Gradient Descent

E(w) = \frac{1}{2}\|Xw - y\|^2

\nabla E(w) = X^T(Xw - y)

Update rule:

w \leftarrow w - \eta \, X^T(Xw - y)

Convergence condition: for small $\eta$ ,

\Delta E_{\text{in}} \approx -c\|\nabla E_{\text{in}}\|^2 < 0

The error decreases at each step. If $\eta$ is too large, the $O(\eta^2)$ term dominates and $E_{\text{in}}$ can increase. Adaptive heuristic: $\eta_t = \eta \|\nabla E_{\text{in}}\|^{-1}$ normalises the step by the gradient magnitude.

Stochastic Gradient Descent

Instead of using all $N$ points (batch), use one randomly sampled point $(x_n, y_n)$ :

w \leftarrow w - \eta (w^T x_n - y_n) x_n

Advantages:

Fast for streaming data. Each update is $O(d)$ not $O(Nd)$
The noise in the gradient can help escape local minima in non-convex problems

Why Squared Loss? A Probabilistic Justification

Assume each observation has Gaussian noise:

y_i = w^T x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

The log-likelihood:

\ln L(w | D) = -\frac{1}{2\sigma^2}\sum_i (y_i - w^T x_i)^2 + \text{const}

Maximising log-likelihood = minimising $\sum_i (y_i - w^T x_i)^2$ . Squared loss is the maximum likelihood estimator under Gaussian noise. It is the right loss when that assumption holds.

What You Can Do Now

The code below fits a linear model using both the closed-form pseudoinverse and gradient descent, then compares the two solutions to confirm they converge to the same weights.

import numpy as np

np.random.seed(4)
N, d = 50, 2

# Generate data: y = 3*x1 - 2*x2 + 1 + noise
X = np.random.randn(N, d)
w_true = np.array([3.0, -2.0, 1.0])    # includes bias
X_aug = np.hstack([X, np.ones((N, 1))])  # augment with bias column
y = X_aug @ w_true + np.random.randn(N) * 0.5

# --- Closed-form solution (pseudoinverse) ---
w_closed = np.linalg.lstsq(X_aug, y, rcond=None)[0]
# Equivalent: w_closed = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y

# --- Gradient descent ---
def loss(w):      return np.mean((X_aug @ w - y)**2)
def grad(w):      return 2 * X_aug.T @ (X_aug @ w - y) / N

w_gd = np.zeros(d + 1)
eta = 0.1
for step in range(500):
    w_gd = w_gd - eta * grad(w_gd)

print(f"True weights:         {w_true}")
print(f"Closed-form solution: {w_closed}")
print(f"Gradient descent:     {w_gd}")
print(f"\nLoss (closed-form):   {loss(w_closed):.6f}")
print(f"Loss (grad descent):  {loss(w_gd):.6f}")
print(f"Max weight diff:      {np.max(np.abs(w_closed - w_gd)):.6f}")

Increase d to a large number (e.g. 500) and time both methods. The closed-form solution will slow dramatically as it inverts a $d \times d$ matrix, while gradient descent scales linearly with d. Try adding stochastic gradient descent (one random sample per step) to see noisier but faster convergence on large datasets.