Contents
  1. Three Classifiers Compared
  2. The Sigmoid
  3. Why Squared Loss Comes from Gaussian Noise
  4. Deriving Cross-Entropy
  5. Gradient Descent
  6. Newton’s Method for Faster Convergence
  7. Discriminative vs Generative
  8. What You Can Do Now

Logistic Regression: Sigmoid, Cross-Entropy, and Why Squared Loss Fails

Logistic regression applies a sigmoid to a linear score to produce probabilities. The right loss function is cross-entropy, derived from maximum likelihood. Squared loss is the wrong tool for classification.

Three Classifiers Compared

All three start from the same linear score $s = w^T x$, but treat it differently:

| Model | Output |
| --- | --- |
| Perceptron | $h(x) = \text{sign}(s)$ |
| Linear Regression | $h(x) = s$ |
| Logistic Regression | $h(x) = \theta(s)$ |
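
As a rough illustration of the comparison, the sketch below computes one linear score and passes it through each of the three output rules. The weight and input values are made up for this example, and the function names are hypothetical helpers rather than anything from a library.

```python
import numpy as np

def linear_score(w, x):
    # Shared linear score s = w^T x
    return float(np.dot(w, x))

def perceptron_output(s):
    # Hard decision in {-1, +1}; sending s = 0 to +1 is a convention assumed here
    return 1.0 if s >= 0 else -1.0

def linear_regression_output(s):
    # The score itself is the prediction
    return s

def logistic_output(s):
    # Sigmoid of the score, read as a probability of the +1 class
    return 1.0 / (1.0 + np.exp(-s))

# Illustrative values only (assumed for this sketch)
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

s = linear_score(w, x)  # 0.5 - 2.0 + 1.0 = -0.5
print("score:      ", s)
print("perceptron: ", perceptron_output(s))
print("linear reg.:", linear_regression_output(s))
print("logistic:   ", logistic_output(s))
```

The same negative score gives a hard label of -1 from the perceptron, the raw value from linear regression, and a probability below 0.5 from logistic regression.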

The Sigmoid

$$\theta(s) = \frac{e^s}{1 + e^s}$$

The sigmoid squashes any real number to $(0, 1)$, producing a probability. When $s = 0$, $\theta = 0.5$. As $s \to \infty$, $\theta \to 1$; as $s \to -\infty$, $\theta \to 0$.
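
The limits above can be checked numerically. The sketch below is one possible implementation, not the only one: it uses the equivalent form $1 / (1 + e^{-s})$ for non-negative inputs and $e^s / (1 + e^s)$ for negative inputs, so that the exponential never overflows for large $|s|$. The branching scheme is a choice made for this example.

```python
import numpy as np

def sigmoid(s):
    # Accept scalars or arrays; always work on a 1-D float array
    s = np.atleast_1d(np.asarray(s, dtype=float))
    out = np.empty_like(s)
    pos = s >= 0
    # For s >= 0, exp(-s) <= 1, so this form cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-s[pos]))
    # For s < 0, exp(s) < 1, so this form cannot overflow either
    exp_s = np.exp(s[~pos])
    out[~pos] = exp_s / (1.0 + exp_s)
    return out

s_grid = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])
theta = sigmoid(s_grid)
print(theta)

# The output is non-decreasing in s and stays within [0, 1] in floating point
print(np.all(np.diff(theta) >= 0), theta.min() >= 0.0, theta.max() <= 1.0)
```

In exact arithmetic the output is strictly between 0 and 1; in floating point the extreme inputs round to exactly 0 and 1.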

Why Squared Loss Comes from Gaussian Noise

To understand why squared loss is wrong for classification, it helps to see where it comes from. Assume each target is the linear score plus unit-variance Gaussian noise, so that:

$$p(y_i | x_i) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y_i - w^T x_i)^2}$$

The log-likelihood over the full dataset:

$$\ln L(w | D) = -\frac{1}{2}\sum_i (y_i - w^T x_i)^2 + \text{const}$$

Maximising the log-likelihood is therefore the same as minimising $\sum_i (y_i - w^T x_i)^2$, which is least squares. So squared loss is the right choice when the noise is Gaussian, which is a reasonable assumption for regression. In binary classification the labels take only two values, so they cannot be described as a linear score plus Gaussian noise, and squared loss corresponds to the wrong model.
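
To make the equivalence concrete, the sketch below generates synthetic regression data with unit-variance Gaussian noise (the data, seed, and true weights are assumptions of this example) and checks that the negative log-likelihood and half the sum of squared errors differ only by a constant that does not depend on $w$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = w_true^T x + unit-variance Gaussian noise (assumed setup)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(size=50)

def neg_log_likelihood(w):
    # -ln L(w | D) for the unit-variance Gaussian model
    residuals = y - X @ w
    return 0.5 * np.sum(residuals ** 2) + 0.5 * len(y) * np.log(2 * np.pi)

def sum_squared_error(w):
    return np.sum((y - X @ w) ** 2)

# Least-squares solution
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gap between the two objectives is the same constant for any w
w_other = rng.normal(size=3)
gap_ls = neg_log_likelihood(w_ls) - 0.5 * sum_squared_error(w_ls)
gap_other = neg_log_likelihood(w_other) - 0.5 * sum_squared_error(w_other)
print(gap_ls, gap_other)

# So the least-squares weights also give the lower negative log-likelihood
print(neg_log_likelihood(w_ls) <= neg_log_likelihood(w_other))
```

Because the two objectives differ by a constant, they share the same minimiser.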

Deriving Cross-Entropy

For logistic regression with labels $y_n \in \{+1, -1\}$, the probability of the observed label is:

$$P(y_n | x_n) = \theta(y_n \cdot w^T x_n)$$

This works because $\theta(-s) = 1 - \theta(s)$, so the formula handles both classes with one expression.

Log-likelihood:

$$\ln L(w | D) = \sum_n \ln \theta(y_n w^T x_n)$$

Normalise and flip sign to get a loss to minimise:

$$E_{\text{in}}(w) = -\frac{1}{N} \sum_n \ln \theta(y_n w^T x_n)$$

This is the cross-entropy loss, the correct loss for logistic regression. Since $-\ln \theta(s) = \ln(1 + e^{-s})$, it can equivalently be written as $E_{\text{in}}(w) = \frac{1}{N} \sum_n \ln(1 + e^{-y_n w^T x_n})$.
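
The single-expression trick can be verified on a handful of numbers. In the sketch below the scores and labels are arbitrary values chosen for illustration; it compares $\theta(y s)$ against the explicit two-case probability, and the loss written with the sigmoid against the form $\ln(1 + e^{-y s})$.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Arbitrary example scores s_n = w^T x_n and labels y_n (assumed values)
scores = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
labels = np.array([1, -1, 1, -1, 1])

# One expression for both classes: P(y | x) = theta(y * s)
p_label = sigmoid(labels * scores)

# Explicit two-case version: theta(s) for y = +1, 1 - theta(s) for y = -1
p_plus = sigmoid(scores)
p_explicit = np.where(labels == 1, p_plus, 1.0 - p_plus)
print(np.allclose(p_label, p_explicit))

# Cross-entropy written two equivalent ways
loss_via_sigmoid = -np.mean(np.log(p_label))
loss_via_log1p = np.mean(np.log1p(np.exp(-labels * scores)))
print(loss_via_sigmoid, loss_via_log1p)
```

The second form avoids taking the logarithm of a probability that may round to zero, which is one reason it is often preferred in code.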

Gradient Descent

$$\nabla E_{\text{in}} = -\frac{1}{N} \sum_n \frac{y_n x_n}{1 + e^{y_n w^T x_n}}$$

Update:

$$w^{(t+1)} = w^{(t)} - \eta \nabla E_{\text{in}}$$
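
A common sanity check for a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does that for the gradient formula above on random data (the data, seed, and step size `eta = 0.1` are assumptions of this example), and then confirms that a single update lowers the loss.

```python
import numpy as np

def loss(w, X, y):
    # E_in(w) = (1/N) * sum ln(1 + exp(-y_n w^T x_n))
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad(w, X, y):
    # Analytic gradient: -(1/N) * sum y_n x_n / (1 + exp(y_n w^T x_n))
    margins = y * (X @ w)
    return -np.mean((y[:, None] * X) / (1.0 + np.exp(margins))[:, None], axis=0)

def numerical_grad(w, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.choice([-1.0, 1.0], size=30)
w = rng.normal(size=3)

max_diff = np.max(np.abs(grad(w, X, y) - numerical_grad(w, X, y)))
print("max abs difference:", max_diff)

# One gradient descent update with a small step size
eta = 0.1
loss_before = loss(w, X, y)
w_next = w - eta * grad(w, X, y)
loss_after = loss(w_next, X, y)
print(loss_before, loss_after)
```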

Newton’s Method for Faster Convergence

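Instead of a fixed step size, use the Hessian (the matrix of second derivatives) to take a better-sized step:

$$w^{(t+1)} = w^{(t)} - \left[\nabla^2 E_{\text{in}}(w^{(t)})\right]^{-1} \nabla E_{\text{in}}(w^{(t)})$$

Here $\nabla^2 E_{\text{in}}$ is the Hessian matrix, so the division in the one-dimensional Newton update becomes multiplication by the inverse Hessian. This is Newton-Raphson applied to $\nabla E_{\text{in}} = 0$. Near the minimum, convergence is quadratic, which is faster than gradient descent, but each step is more expensive because it requires forming the Hessian and solving a linear system.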
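
The sketch below is one way to try the Newton update for the cross-entropy loss. It assumes the Hessian $\frac{1}{N} \sum_n \theta(m_n)(1 - \theta(m_n))\, x_n x_n^T$ with $m_n = y_n w^T x_n$, which follows from differentiating the gradient once more. The data are synthetic with overlapping classes (an assumption made so that a finite minimiser exists), and `np.linalg.solve` is used rather than forming the inverse explicitly.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad(w, X, y):
    # -(1/N) * sum y_n x_n / (1 + exp(m_n)), using 1 / (1 + exp(m)) = sigmoid(-m)
    margins = y * (X @ w)
    return -np.mean((y[:, None] * X) * sigmoid(-margins)[:, None], axis=0)

def hessian(w, X, y):
    # (1/N) * sum theta(m_n) * (1 - theta(m_n)) * x_n x_n^T
    p = sigmoid(y * (X @ w))
    weights = p * (1.0 - p)
    return (X * weights[:, None]).T @ X / len(y)

# Synthetic data with noisy labels so the classes overlap (assumed setup)
rng = np.random.default_rng(0)
N = 200
X = np.hstack([rng.normal(size=(N, 2)), np.ones((N, 1))])
w_gen = np.array([1.0, -1.0, 0.0])
y = np.where(rng.random(N) < sigmoid(X @ w_gen), 1.0, -1.0)

# Ten Newton steps from zero
w_newton = np.zeros(3)
for _ in range(10):
    w_newton = w_newton - np.linalg.solve(hessian(w_newton, X, y), grad(w_newton, X, y))

# Ten gradient descent steps from zero, for comparison
w_gd = np.zeros(3)
eta = 0.5
for _ in range(10):
    w_gd = w_gd - eta * grad(w_gd, X, y)

newton_grad_norm = np.linalg.norm(grad(w_newton, X, y))
gd_grad_norm = np.linalg.norm(grad(w_gd, X, y))
print("Newton gradient norm:", newton_grad_norm)
print("GD gradient norm:    ", gd_grad_norm)
```

This sketch takes full Newton steps with no line search or damping. That tends to work on well-conditioned problems like this one, but it is not a general-purpose solver.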

Discriminative vs Generative

Discriminative models (logistic regression, SVM) directly estimate $p(y|x)$ or the decision boundary itself. They learn the boundary between classes without modelling how the data was generated.

Generative models (Naive Bayes, discriminant analysis) model the joint distribution $p(x, y)$ and use Bayes’ rule:

$$p(y|x) = \frac{p(x|y) \cdot p(y)}{p(x)}$$

Generative models can work with less data (the class-conditional model and the prior $p(y)$ add structure) but make stronger assumptions about the data distribution.
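
As a small illustration of the generative route, the sketch below applies Bayes' rule to a one-dimensional problem with Gaussian class-conditional densities. The means, shared standard deviation, and priors are invented for this example. With a shared variance, the resulting posterior happens to be a sigmoid of a linear function of $x$, which the last lines check numerically.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Assumed generative model: p(x | y) Gaussian with shared std, plus class priors p(y)
mean_pos, mean_neg, std = 1.0, -1.0, 1.0
prior_pos, prior_neg = 0.3, 0.7

def posterior_pos(x):
    # Bayes' rule: p(y=+1 | x) = p(x | +1) p(+1) / p(x)
    joint_pos = gaussian_pdf(x, mean_pos, std) * prior_pos
    joint_neg = gaussian_pdf(x, mean_neg, std) * prior_neg
    return joint_pos / (joint_pos + joint_neg)

x = np.linspace(-3, 3, 7)
post = posterior_pos(x)
print(post)

# With shared variance the log-odds are linear in x: a * x + b
a = (mean_pos - mean_neg) / std ** 2
b = (mean_neg ** 2 - mean_pos ** 2) / (2 * std ** 2) + np.log(prior_pos / prior_neg)
post_sigmoid = 1.0 / (1.0 + np.exp(-(a * x + b)))
print(np.allclose(post, post_sigmoid))
```

A discriminative model would fit `a` and `b` directly from labelled data, without ever estimating the class-conditional densities.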

What You Can Do Now

The code below implements logistic regression from scratch (sigmoid, cross-entropy loss, and gradient descent) and trains it on a small binary classification problem.

import numpy as np

np.random.seed(1)

# Two well-separated Gaussian blobs centred at (2, 2) and (-2, -2), labels in {+1, -1}
X_pos = np.random.randn(20, 2) + np.array([2, 2])
X_neg = np.random.randn(20, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*20 + [-1]*20)

# Augment with bias
X_aug = np.hstack([X, np.ones((len(X), 1))])

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def cross_entropy_loss(w, X, y):
    # E_in = (1/N) * sum log(1 + exp(-y * w^T x))
    return np.mean(np.log(1 + np.exp(-y * (X @ w))))

def gradient(w, X, y):
    s = y * (X @ w)
    return -np.mean((y[:, None] * X) / (1 + np.exp(s))[:, None], axis=0)

w = np.zeros(X_aug.shape[1])
eta = 0.5

print(f"{'Step':>5}  {'Loss':>10}")
for step in range(200):
    w = w - eta * gradient(w, X_aug, y)
    if step % 40 == 0 or step == 199:
        print(f"{step+1:>5}  {cross_entropy_loss(w, X_aug, y):>10.6f}")

preds = np.sign(X_aug @ w)
print(f"\nTraining accuracy: {np.mean(preds == y):.0%}")

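Increase the learning rate eta to see the loss oscillate, or reduce it for slower but stable convergence. Replace the gradient step with Newton’s method (multiply the gradient by the inverse Hessian) to see it reach the same accuracy in far fewer steps. When the two classes are perfectly separable, the loss has no finite minimiser and the Hessian becomes ill-conditioned as the weights grow, so a small regularisation term may be needed to keep the Newton step stable.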
