Logistic Regression: Sigmoid, Cross-Entropy, and Why Squared Loss Fails

Three Classifiers Compared

All three start from the same linear score $s = w^T x$, but treat it differently:

Model	Output
Perceptron	$h(x) = \text{sign}(s)$
Linear Regression	$h(x) = s$
Logistic Regression	$h(x) = \theta(s)$
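To make the comparison concrete, the sketch below computes all three outputs from one shared score. The weight vector `w` and input `x` are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical weights and input, chosen only for illustration.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 1.0])

s = w @ x  # shared linear score s = w^T x

perceptron_out = np.sign(s)          # hard label in {-1, +1}
linear_out = s                       # the raw score itself
logistic_out = 1 / (1 + np.exp(-s))  # sigmoid of the score, in (0, 1)

print(perceptron_out, linear_out, logistic_out)
```

Only the last step differs between the models: the perceptron thresholds the score, linear regression reports it, and logistic regression maps it to a value between 0 and 1.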

The Sigmoid

\theta(s) = \frac{e^s}{1 + e^s}

The sigmoid squashes any real number to $(0, 1)$, producing a probability. When $s = 0$, $\theta = 0.5$. As $s \to \infty$, $\theta \to 1$; as $s \to -\infty$, $\theta \to 0$.
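These limits can be checked numerically. The sketch below is one possible overflow-safe way to evaluate the sigmoid; the name `theta` mirrors the notation here and is not taken from any library.

```python
import numpy as np

def theta(s):
    """Sigmoid e^s / (1 + e^s), evaluated without overflow."""
    s = np.asarray(s, dtype=float)
    # exp(-|s|) is always in (0, 1], so neither branch can overflow.
    e = np.exp(-np.abs(s))
    return np.where(s >= 0, 1 / (1 + e), e / (1 + e))

print(theta(0.0))             # the midpoint
print(theta([-50.0, 50.0]))   # very close to 0 and to 1
print(theta(-1000.0))         # extreme input, still no overflow
```

For $s < 0$ the function uses $e^s / (1 + e^s)$ directly, and for $s \ge 0$ the algebraically equal form $1 / (1 + e^{-s})$, so the exponent passed to `np.exp` is never positive.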

Why Squared Loss Comes from Gaussian Noise

To understand why squared loss is wrong for classification, it helps to see where it comes from. Assume each observation $y_i$ is the linear score $w^T x_i$ plus Gaussian noise with unit variance:

p(y_i | x_i) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y_i - w^T x_i)^2}

The log-likelihood over the full dataset:

\ln L(w | D) = -\frac{1}{2}\sum_i (y_i - w^T x_i)^2 + \text{const}

Maximising the log-likelihood is therefore the same as minimising $\sum_i (y_i - w^T x_i)^2$, which is least squares. So squared loss is the right choice when the noise is Gaussian, a reasonable assumption for regression. In binary classification the labels take only two values, so they cannot be a linear score plus Gaussian noise, and squared loss corresponds to the wrong probabilistic model.
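The equivalence can be checked numerically. The sketch below uses synthetic data (the values in `w_true` are hypothetical) and confirms that the least-squares solution is a stationary point of the Gaussian negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = w_true^T x + unit-variance Gaussian noise.
N = 200
X = np.hstack([rng.normal(size=(N, 2)), np.ones((N, 1))])
w_true = np.array([1.5, -0.7, 0.3])  # hypothetical ground truth
y = X @ w_true + rng.normal(size=N)

def neg_log_likelihood(w):
    # -ln L(w | D) for unit-variance Gaussian noise, constant included.
    r = y - X @ w
    return 0.5 * np.sum(r**2) + 0.5 * N * np.log(2 * np.pi)

# Ordinary least-squares solution.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient of the negative log-likelihood is -X^T (y - Xw).
grad_at_ls = -X.T @ (y - X @ w_ls)
print("gradient norm at w_ls:", np.linalg.norm(grad_at_ls))

# Moving away from w_ls raises the negative log-likelihood.
for _ in range(3):
    w_other = w_ls + 0.1 * rng.normal(size=3)
    print(neg_log_likelihood(w_other) > neg_log_likelihood(w_ls))
```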

Deriving Cross-Entropy

For logistic regression with labels $y_n \in \{+1, -1\}$, the probability of the correct label is:

P(y_n | x_n) = \theta(y_n \cdot w^T x_n)

This works because $\theta(-s) = 1 - \theta(s)$, so the formula handles both classes with one expression.
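A quick numerical check of that symmetry, using made-up values for `w` and `x`: the probabilities the single expression assigns to the two labels sum to one.

```python
import numpy as np

def theta(s):
    return 1 / (1 + np.exp(-s))

w = np.array([0.8, -0.4])  # hypothetical weights
x = np.array([1.0, 3.0])   # hypothetical input

p_pos = theta(+1 * (w @ x))  # P(y = +1 | x)
p_neg = theta(-1 * (w @ x))  # P(y = -1 | x)

print(p_pos, p_neg, p_pos + p_neg)
```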

Log-likelihood:

\ln L(w | D) = \sum_n \ln \theta(y_n w^T x_n)

Normalise and flip sign to get a loss to minimise:

E_{\text{in}}(w) = -\frac{1}{N} \sum_n \ln \theta(y_n w^T x_n)

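This is the cross-entropy loss, the correct loss for logistic regression. Since $\theta(s) = 1 / (1 + e^{-s})$, it can equivalently be written as $E_{\text{in}}(w) = \frac{1}{N} \sum_n \ln(1 + e^{-y_n w^T x_n})$, which is the form the code at the end of this article evaluates.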
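The name "cross-entropy" is more often seen with labels in $\{0, 1\}$. The sketch below, on random stand-in scores and labels, suggests that the $\pm 1$ form above, the common 0/1 form, and the $\ln(1 + e^{-ys})$ form all give the same number.

```python
import numpy as np

def theta(s):
    return 1 / (1 + np.exp(-s))

rng = np.random.default_rng(0)
scores = rng.normal(size=8)          # stand-ins for w^T x_n
y_pm = rng.choice([-1, 1], size=8)   # labels in {+1, -1}
t01 = (y_pm + 1) // 2                # the same labels recoded to {1, 0}

# Form used in this article: -(1/N) sum ln theta(y_n s_n)
e_in = -np.mean(np.log(theta(y_pm * scores)))

# Common 0/1 form with p_n = theta(s_n) = P(y = +1 | x_n)
p = theta(scores)
bce = -np.mean(t01 * np.log(p) + (1 - t01) * np.log(1 - p))

# log(1 + exp(-y s)) form; logaddexp(0, z) computes log(1 + exp(z)) stably
log_form = np.mean(np.logaddexp(0.0, -y_pm * scores))

print(e_in, bce, log_form)
```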

Gradient Descent

\nabla E_{\text{in}} = -\frac{1}{N} \sum_n \frac{y_n x_n}{1 + e^{y_n w^T x_n}}

Update:

w^{(t+1)} = w^{(t)} - \eta \nabla E_{\text{in}}
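Before relying on a hand-derived gradient, it is worth comparing it against finite differences. The sketch below does this on synthetic data; `numeric_grad` is a helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))           # synthetic inputs
y = rng.choice([-1.0, 1.0], size=30)   # synthetic labels in {+1, -1}
w = rng.normal(size=3)                 # arbitrary point at which to check

def e_in(w):
    return np.mean(np.log(1 + np.exp(-y * (X @ w))))

def grad_e_in(w):
    # Analytic gradient: -(1/N) sum y_n x_n / (1 + exp(y_n w^T x_n))
    s = y * (X @ w)
    return -np.mean((y / (1 + np.exp(s)))[:, None] * X, axis=0)

def numeric_grad(f, w, h=1e-6):
    # Central differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (f(w + step) - f(w - step)) / (2 * h)
    return g

print(grad_e_in(w))
print(numeric_grad(e_in, w))
```

The two printed vectors should agree to several decimal places. A mismatch usually points to a sign error or a missing factor of $1/N$.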

Newton’s Method for Faster Convergence

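Instead of a fixed step size, use the Hessian (the matrix of second derivatives) to take a better-sized step:

w^{(t+1)} = w^{(t)} - \left[\nabla^2 E_{\text{in}}(w^{(t)})\right]^{-1} \nabla E_{\text{in}}(w^{(t)})

Here $\nabla^2 E_{\text{in}}$ is the Hessian matrix. Since $w$ is a vector, "dividing" by the Hessian means multiplying by its inverse. This is Newton-Raphson applied to $\nabla E_{\text{in}} = 0$. Near the minimum, convergence is quadratic, faster than gradient descent, but each step is more expensive because it requires forming the Hessian and solving a linear system.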
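A minimal sketch of the Newton update for this loss, written with a linear solve rather than an explicit inverse. It assumes synthetic, overlapping classes so that a finite minimiser exists, and uses the Hessian $\frac{1}{N} \sum_n \theta(s_n)\,\theta(-s_n)\, x_n x_n^T$ with $s_n = y_n w^T x_n$, which follows from differentiating the gradient once more.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overlapping synthetic classes, so the loss has a finite minimiser.
X = np.vstack([rng.normal(size=(50, 2)) + 1.0,
               rng.normal(size=(50, 2)) - 1.0])
X = np.hstack([X, np.ones((100, 1))])  # bias column
y = np.array([1.0] * 50 + [-1.0] * 50)

def theta(s):
    return 1 / (1 + np.exp(-s))

def loss(w):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w):
    s = y * (X @ w)
    return -np.mean((y * theta(-s))[:, None] * X, axis=0)

def hessian(w):
    # (1/N) sum theta(s_n) * theta(-s_n) * x_n x_n^T, using y_n^2 = 1
    s = y * (X @ w)
    weights = theta(s) * theta(-s)
    return (X * weights[:, None]).T @ X / len(y)

w = np.zeros(3)
for step in range(10):
    # Solve H d = g instead of forming the inverse Hessian.
    d = np.linalg.solve(hessian(w), gradient(w))
    w = w - d
    print(step + 1, loss(w), np.linalg.norm(gradient(w)))
```

The printed gradient norm is the quantity to watch: once the iterate is close to the minimum, it should shrink much faster per step than under gradient descent. This sketch takes full Newton steps with no line search, which is an assumption that holds up on well-behaved data like this but is not safe in general.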

Discriminative vs Generative

Discriminative models (logistic regression, SVM) directly estimate $p(y|x)$. They learn the boundary between classes without modelling how the data was generated.

Generative models (Naive Bayes, discriminant analysis) model the joint distribution $p(x, y)$ and use Bayes’ rule:

p(y|x) = \frac{p(x|y) \cdot p(y)}{p(x)}

Generative models can often work with less data, because the assumed form of $p(x|y)$ and the prior $p(y)$ add structure, but those stronger assumptions about the data distribution hurt when they do not match the data.
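As an illustration of the generative route, the sketch below fits a one-dimensional Gaussian $p(x|y)$ per class with a shared variance (a modelling assumption, in the spirit of linear discriminant analysis) and applies Bayes' rule. All data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: class +1 centred at +1, class -1 centred at -1.
x_pos = rng.normal(loc=1.0, scale=1.0, size=200)
x_neg = rng.normal(loc=-1.0, scale=1.0, size=100)
n = len(x_pos) + len(x_neg)

# Generative step: estimate p(y) and a Gaussian p(x | y) per class.
prior_pos = len(x_pos) / n
prior_neg = 1.0 - prior_pos
mu_pos, mu_neg = x_pos.mean(), x_neg.mean()
# Shared variance pooled over both classes (a modelling assumption).
var = (np.sum((x_pos - mu_pos) ** 2) + np.sum((x_neg - mu_neg) ** 2)) / n

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def posterior_pos(x):
    # Bayes' rule: p(+1 | x) = p(x | +1) p(+1) / p(x)
    joint_pos = gaussian_pdf(x, mu_pos, var) * prior_pos
    joint_neg = gaussian_pdf(x, mu_neg, var) * prior_neg
    return joint_pos / (joint_pos + joint_neg)

# With a shared variance the posterior is a sigmoid of a linear score a*x + b.
a = (mu_pos - mu_neg) / var
b = (mu_neg**2 - mu_pos**2) / (2 * var) + np.log(prior_pos / prior_neg)

for x in (-2.0, 0.0, 2.0):
    print(x, posterior_pos(x), 1 / (1 + np.exp(-(a * x + b))))
```

The two printed probabilities match for every `x`: under these assumptions the generative model ends up with the same functional form as logistic regression, but reaches its parameters by estimating class means, a variance, and priors rather than by minimising cross-entropy.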

What You Can Do Now

The code below implements logistic regression from scratch (sigmoid, cross-entropy loss, and gradient descent) and trains it on a small binary classification problem.

import numpy as np

np.random.seed(1)

# Linearly separable binary data, labels in {+1, -1}
X_pos = np.random.randn(20, 2) + np.array([2, 2])
X_neg = np.random.randn(20, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*20 + [-1]*20)

# Augment with bias
X_aug = np.hstack([X, np.ones((len(X), 1))])

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

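def cross_entropy_loss(w, X, y):
    # E_in = (1/N) * sum log(1 + exp(-y * w^T x)); logaddexp(0, z) computes
    # log(1 + exp(z)) without overflowing for large z.
    return np.mean(np.logaddexp(0, -y * (X @ w)))

def gradient(w, X, y):
    # grad E_in = -(1/N) * sum y_n x_n / (1 + exp(y_n w^T x_n)),
    # where 1 / (1 + exp(s)) is the same as sigmoid(-s).
    s = y * (X @ w)
    return -np.mean((y * sigmoid(-s))[:, None] * X, axis=0)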

w = np.zeros(X_aug.shape[1])
eta = 0.5

print(f"{'Step':>5}  {'Loss':>10}")
for step in range(200):
    w = w - eta * gradient(w, X_aug, y)
    if step % 40 == 0 or step == 199:
        print(f"{step+1:>5}  {cross_entropy_loss(w, X_aug, y):>10.6f}")

preds = np.sign(X_aug @ w)
print(f"\nTraining accuracy: {np.mean(preds == y):.0%}")

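Vary the learning rate eta to see its effect: smaller values give slower but stable convergence, while overly large values can make the loss oscillate. Replace the gradient step with Newton’s method (multiply the gradient by the inverse Hessian) to observe its much faster convergence, which needs far fewer steps to reach the same accuracy. If the data is linearly separable, as intended here, the loss has no finite minimiser and the Hessian becomes nearly singular as the weights grow, so overlapping classes make a cleaner test case for Newton’s method.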