Scaling and Normalisation: When It Helps and When It Hurts

Scaling is often treated as a standard preprocessing step that belongs in every pipeline: apply it and move on. The problem with that approach is that scaling applied to the wrong data does not just fail to help. It changes how the model reads the data. A skewed feature that gets normalised is still skewed, but it now sits on a tidy-looking range that hides the problem. This is a risky step, and it should be treated as one.

Why Some Models Need Scaling

The models that require scaling share a common property: they compute distances, gradients, or variance across features. When features exist on different scales, the model treats the differences in magnitude as meaningful signal, when in most cases they are just an artefact of measurement units.

A practical example: a dataset with age (range 18 to 80) and annual income (range 20,000 to 200,000). In a K-nearest neighbours model, the Euclidean distance between two points is dominated by income because its raw values are orders of magnitude larger. Age contributes almost nothing. The model has effectively discarded a feature, not because it was uninformative, but because it happened to be measured on a smaller scale.
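As a minimal sketch of this effect, the snippet below computes Euclidean distances between three made-up people, first on raw values and then after min-max scaling with the ranges quoted above. The individual values and the helper names `scale_to_range` and `distance` are illustrative assumptions, not taken from a real dataset.

```python
import numpy as np

# Hypothetical people: columns are [age, annual income]
people = np.array([
    [25, 50_000],   # person 0 (the query point)
    [60, 50_500],   # person 1: very different age, similar income
    [26, 53_000],   # person 2: almost the same age, slightly higher income
], dtype=float)

# Ranges quoted in the text: age 18 to 80, income 20,000 to 200,000
lows = np.array([18, 20_000], dtype=float)
highs = np.array([80, 200_000], dtype=float)

def scale_to_range(x, lows, highs):
    """Min-max scale each column to [0, 1] using fixed, known ranges."""
    return (x - lows) / (highs - lows)

def distance(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))

raw_d01 = distance(people[0], people[1])
raw_d02 = distance(people[0], people[2])

scaled = scale_to_range(people, lows, highs)
scaled_d01 = distance(scaled[0], scaled[1])
scaled_d02 = distance(scaled[0], scaled[2])

print(f"raw:    d(0,1) = {raw_d01:.1f}, d(0,2) = {raw_d02:.1f}")
print(f"scaled: d(0,1) = {scaled_d01:.3f}, d(0,2) = {scaled_d02:.3f}")
```

On raw values, person 1 is the nearest neighbour of person 0 despite a 35-year age gap, because the income difference decides everything. After scaling, person 2 becomes the nearest neighbour.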

The same issue appears in gradient-based optimisation. When features vary widely in scale, the loss surface becomes elongated: gradient descent oscillates across the steep direction while making slow progress along the shallow one. Scaling brings the features closer to the same range and makes the loss surface more symmetric, which speeds up convergence and reduces instability.
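One rough way to quantify how elongated the loss surface is: for a least-squares loss, the Hessian is proportional to X^T X, and its condition number (largest eigenvalue over smallest) measures how stretched the surface is. The sketch below uses synthetic, independent features on the scales from the example above; the function name `hessian_condition_number` is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic, independent features on the scales used in the earlier example
age = rng.uniform(18, 80, size=n)
income = rng.uniform(20_000, 200_000, size=n)
X_raw = np.column_stack([age, income])

# Standardise each column: subtract the mean, divide by the standard deviation
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def hessian_condition_number(X):
    """Condition number of X^T X / n, which is the least-squares Hessian
    up to a constant factor."""
    hessian = X.T @ X / len(X)
    return float(np.linalg.cond(hessian))

cond_raw = hessian_condition_number(X_raw)
cond_scaled = hessian_condition_number(X_scaled)

print(f"condition number, raw features:    {cond_raw:,.0f}")
print(f"condition number, scaled features: {cond_scaled:.2f}")
```

A condition number close to 1 corresponds to a nearly round bowl, where a single learning rate works in every direction. A very large one means the learning rate has to be small enough for the steepest direction, which makes progress in the shallow direction slow.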

Models that benefit from scaling include: distance-based and kernel methods (KNN, K-means, SVM), models trained by gradient descent (logistic regression, neural networks, and linear regression when it is fitted iteratively), dimensionality reduction (PCA), and regularised models (Lasso, Ridge). Regularisation penalises large coefficients, and the size of a coefficient depends on the units of its feature, so if features are on different scales the penalty is applied unevenly.
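The uneven penalty can be seen by fitting the same feature in two different units. The sketch below is an illustration on synthetic data: the feature, its units (metres versus kilometres), and the choice of `alpha=1.0` are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Hypothetical feature expressed in metres, and the same feature in kilometres
x_m = rng.normal(loc=0.0, scale=1.0, size=(n, 1))
x_km = x_m / 1000.0
y = 3.0 * x_m[:, 0] + rng.normal(scale=0.1, size=n)

# Unregularised fits: the change of units is absorbed by the coefficient
ols_m = LinearRegression().fit(x_m, y)
ols_km = LinearRegression().fit(x_km, y)

# Regularised fits: the same penalty strength hits the two versions differently
ridge_m = Ridge(alpha=1.0).fit(x_m, y)
ridge_km = Ridge(alpha=1.0).fit(x_km, y)

# Express every slope per metre so the four fits are comparable
print("OLS slope per metre:  ", ols_m.coef_[0], ols_km.coef_[0] / 1000.0)
print("Ridge slope per metre:", ridge_m.coef_[0], ridge_km.coef_[0] / 1000.0)
```

The unregularised slopes agree once converted to the same units. The ridge slopes do not: the kilometre version needs a coefficient a thousand times larger to express the same relationship, so the penalty shrinks it far more heavily.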

Tree-based models do not require scaling. Decision trees, random forests, XGBoost, and CatBoost split on thresholds, not distances. Multiplying a feature by a positive constant moves the threshold by the same factor, so the optimal split still separates the same observations.
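A quick way to check this is to fit the same tree on a feature matrix and on a rescaled copy, then compare predictions. The data below is synthetic and the rescaling factor is arbitrary; this is a sketch, not a proof.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic two-feature classification problem
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_new = rng.normal(size=(100, 2))

# Multiply the first feature by 1000 and leave the second unchanged
factors = np.array([1000.0, 1.0])

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_rescaled = DecisionTreeClassifier(random_state=0).fit(X * factors, y)

pred_raw = tree_raw.predict(X_new)
pred_rescaled = tree_rescaled.predict(X_new * factors)

same_predictions = bool(np.array_equal(pred_raw, pred_rescaled))
print("Identical predictions:", same_predictions)
```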

The Three Types of Scaling

Normalisation (also called min-max scaling) maps each value to a range of 0 to 1. It is a linear transformation, so it preserves the shape of the distribution exactly. A value of 0.8 means the observation sits 80% of the way from the observed minimum to the observed maximum, which is not the same as being at the 80th percentile. The weakness is that it is sensitive to outliers: a single extreme value stretches the range and compresses everything else toward one end.
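The formula behind min-max scaling is (x - min) / (max - min). The sketch below applies it by hand to a small made-up income column, with and without one extreme value, to show the compression; the numbers and the function name `min_max` are illustrative assumptions.

```python
import numpy as np

def min_max(x):
    """Min-max scale a 1-D array to [0, 1] using its own minimum and maximum."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Five ordinary incomes (made-up values)
incomes = np.array([20_000, 30_000, 40_000, 50_000, 60_000])
scaled_clean = min_max(incomes)

# The same incomes plus one extreme value
incomes_with_outlier = np.append(incomes, 1_000_000)
scaled_with_outlier = min_max(incomes_with_outlier)

print("without outlier:", np.round(scaled_clean, 3))
print("with outlier:   ", np.round(scaled_with_outlier, 3))
```

Without the extreme value the five incomes spread evenly across 0 to 1. With it, all five land below 0.05 and the extreme value alone sits at 1.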

Standardisation (z-score scaling) subtracts the mean and divides by the standard deviation, producing a feature with a mean of 0 and a standard deviation of 1. It does not bound values to a fixed range, so a single extreme value does not squash everything else as severely as it does under normalisation. The mean and standard deviation are themselves still pulled by outliers, though. It is the more common choice when normality is a reasonable assumption.

Robust scaling uses the median and the interquartile range instead of the mean and standard deviation. This makes it resistant to outliers by design. It is the appropriate choice when the distribution is skewed or contains values that should not anchor the scaling parameters.
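To see how much a single extreme value moves the fitted parameters, the sketch below fits `StandardScaler` and `RobustScaler` on a clean column and on the same column with one extreme value appended. The data and the helper name `fitted_centre_and_scale` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Clean column: 1, 2, ..., 100. Dirty column: the same plus one extreme value.
clean = np.arange(1, 101, dtype=float).reshape(-1, 1)
dirty = np.vstack([clean, [[10_000.0]]])

def fitted_centre_and_scale(scaler, data):
    """Fit a scaler on one column and return the centre and scale it learned."""
    scaler.fit(data)
    # StandardScaler stores its centre in mean_, RobustScaler in center_
    centre = scaler.mean_[0] if hasattr(scaler, "mean_") else scaler.center_[0]
    return float(centre), float(scaler.scale_[0])

std_clean = fitted_centre_and_scale(StandardScaler(), clean)
std_dirty = fitted_centre_and_scale(StandardScaler(), dirty)
rob_clean = fitted_centre_and_scale(RobustScaler(), clean)
rob_dirty = fitted_centre_and_scale(RobustScaler(), dirty)

print("StandardScaler (mean, std):  ", std_clean, "->", std_dirty)
print("RobustScaler   (median, IQR):", rob_clean, "->", rob_dirty)
```

The mean and standard deviation shift substantially when the extreme value is added, while the median and interquartile range barely move.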

What Can Go Wrong

The most consequential mistake is fitting the scaler on the entire dataset, test set included, before splitting. The scaler learns the mean, standard deviation, or range from all available data, which means the test set has influenced the scaling parameters. This is data leakage: the test score no longer reflects performance on unseen data, because at deployment new data will be scaled with parameters it never contributed to. The correct approach is to fit the scaler on the training set only and apply it to the test set using the training parameters.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (target) are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit and transform on train
X_test_scaled  = scaler.transform(X_test)       # transform only on test
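The same rule applies inside cross-validation, where each fold needs its own scaler fit. One common way to get this, sketched here on synthetic data, is to put the scaler and the model in a scikit-learn pipeline so that the scaler is refit on the training portion of every fold. The dataset and the choice of logistic regression are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real feature matrix and target
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# The pipeline fits the scaler on the training part of each fold only,
# then applies those training parameters to the held-out part
model = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy across folds: {scores.mean():.3f}")
```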

The second mistake is scaling highly skewed data. A right-skewed feature with a long upper tail, when normalised, has most of its values compressed into a narrow band near zero and a few large values spread across the rest of the range. The scaling preserves the skewed shape but does not resolve it. The model then sees a numerically bounded version of the same distorted distribution. The correct sequence is to handle the skewness first, then scale if the model requires it.
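A minimal sketch of that sequence, assuming a log transform is a suitable way to handle the skew (it is one common option for positive, right-skewed features, not the only one). The synthetic income column is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic right-skewed feature
income = pd.Series(rng.lognormal(mean=10.5, sigma=1.2, size=1000), name="income")

# Step 1: handle the skewness (log1p is defined for values greater than -1)
log_income = np.log1p(income)

# Step 2: scale the transformed feature
scaled = StandardScaler().fit_transform(log_income.to_numpy().reshape(-1, 1)).ravel()

skew_before = float(income.skew())
skew_after = float(pd.Series(scaled).skew())

print(f"skew before transform: {skew_before:.2f}")
print(f"skew after log + scale: {skew_after:.2f}")
```

In a real pipeline the transform, like the scaler, should be chosen by looking at the training data only.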

Categorical data should never be scaled. Label-encoded categories have integer values, but those integers are arbitrary identifiers, not magnitudes. Scaling them treats the codes as quantities and reinforces an ordering that does not exist.
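One way to keep categorical columns away from the scaler is to route columns by type, for example with scikit-learn's `ColumnTransformer`. The sketch below scales the numeric columns and one-hot encodes the categorical one; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with two numeric columns and one categorical column
df = pd.DataFrame({
    "age": [25, 40, 33, 58],
    "income": [32_000, 85_000, 54_000, 120_000],
    "city": ["Leeds", "Paris", "Leeds", "Oslo"],
})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "income"]),                  # scaled
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),     # encoded, not scaled
    ],
    sparse_threshold=0,  # always return a dense array
)

X = np.asarray(preprocess.fit_transform(df))
print(X.shape)           # 2 scaled numeric columns + 3 one-hot city columns
print(np.round(X, 2))
```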

A subtler issue: scaling changes the units of a feature, and outliers themselves distort the scaling parameters. A threshold defined in the original units no longer applies after scaling, and an extreme value inflates the standard deviation (or stretches the min-max range), so its own scaled value can look less extreme than it really is. Outlier detection should happen before scaling, not after.
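The sketch below illustrates why detection belongs before scaling. It flags outliers on raw values with the 1.5 x IQR rule, then checks the same column with a z-score threshold of 3 applied after standardisation. Both rules and the made-up values are assumptions chosen for the demonstration.

```python
import numpy as np

# Made-up column: seven ordinary values and one obvious extreme
values = np.array([48, 52, 50, 47, 53, 51, 49, 500], dtype=float)

# Detection on raw values with the 1.5 x IQR rule
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Detection after standardisation with a |z| > 3 rule
z = (values - values.mean()) / values.std()
flagged_by_z = np.abs(z) > 3

print("flagged on raw values (IQR rule):", values[is_outlier])
print("largest |z| after standardising: ", round(float(np.abs(z).max()), 2))
print("flagged by |z| > 3:              ", values[flagged_by_z])
```

The extreme value inflates the standard deviation so much that its own z-score stays at about 2.6, below the threshold, while the IQR rule on raw values flags it.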

A Note on Prior Assumptions

My earlier assumption was that scaling was a safe default, something that could not hurt. What changed was noticing that scaled skewed features were producing strange model behaviour: features that should have been important were underweighted, and the coefficients in linear models were unstable. Tracing back the issue revealed that the scaling was acting on distributions that had not been understood yet. Scaling preserves whatever structure is already there and hands it to the model on a tidier range. If the structure is wrong, the scaling preserves the wrong structure.

The practical rule I now apply: only scale when you are confident the column represents a clean, base signal. If a column has not been examined for skewness, outliers, or multimodality, it is not ready to be scaled.

What You Can Do Now

The following block compares all three scaler types on a right-skewed column and prints the resulting statistics so you can observe the effect directly:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10.5, sigma=1.2, size=1000), name="income")

scalers = {
    "MinMax":   MinMaxScaler(),
    "Standard": StandardScaler(),
    "Robust":   RobustScaler(),
}

print(f"Original  — skew: {income.skew():.2f}, mean: {income.mean():>10.2f}, std: {income.std():>10.2f}")

for name, scaler in scalers.items():
    scaled = pd.Series(scaler.fit_transform(income.values.reshape(-1, 1)).flatten())
    print(f"{name:<9} — skew: {scaled.skew():.2f}, mean: {scaled.mean():>10.4f}, std: {scaled.std():>10.4f}")

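Run this on a skewed column from your own dataset and observe that the skewness value is identical across all three scalers, and identical to the unscaled original. All three are linear transformations, and a linear transformation cannot change the shape of a distribution. Scaling does not fix skewness. That is the point.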