Outlier Detection: A Map of the Methods

An outlier is an observation that does not fit the pattern of the rest of the data. What makes outlier detection difficult is that this definition is circular until you specify what “pattern” means. Every detection method embeds a different answer to that question, and choosing the wrong method produces a list of flagged observations that has more to do with the method’s assumptions than with anything genuinely unusual in the data.

There are two tiers of methods: explanatory ones that are fast, interpretable, and work well when the distribution is reasonably well understood, and model-based ones designed for complex, high-dimensional, or non-Gaussian distributions.

The Explanatory Methods

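The Z-score is the simplest. It measures how many standard deviations an observation sits from the mean, and a common threshold is 3. The assumption is that the data is approximately normally distributed. Even when it is, the Z-score has a circularity problem: the mean and standard deviation are themselves distorted by the outliers being sought, so a large outlier inflates the standard deviation and can hide itself or other outliers (masking). In small samples the effect is severe. With the sample standard deviation, the largest possible Z-score among n observations is (n - 1)/√n, so with 10 observations nothing can exceed about 2.85. Skewed data adds a second problem: a long tail produces large Z-scores for values that are ordinary for that distribution.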
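The way an outlier can distort its own Z-score is easy to reproduce. The following is a minimal sketch rather than a recommended detector; the helper name z_scores and the ten sample values are invented for the illustration, and it assumes the sample standard deviation (ddof=1):

```python
import numpy as np

def z_scores(values):
    """Standard Z-scores using the sample standard deviation (ddof=1)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# Hypothetical sample: nine ordinary readings and one gross error
sample = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 100])

z = z_scores(sample)
flagged = sample[np.abs(z) > 3]

print(np.round(z, 2))  # the value 100 scores about 2.84
print(flagged)         # empty: nothing crosses the threshold of 3
```

The value 100 pulls the mean up and inflates the standard deviation enough to keep its own score below 3, so the rule flags nothing.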

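The modified Z-score addresses this by replacing the mean with the median and the standard deviation with the median absolute deviation (MAD), which is the median of the absolute deviations from the median. Both estimates are robust to outliers: nearly half the observations can be extreme before either breaks down. The score is usually multiplied by 0.6745 so that it is comparable to a standard Z-score under normality, with 3.5 as a common threshold. It still measures distance symmetrically around the median, so it does not correct for skewness, but it is a better default than the standard Z-score for most real data.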
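A minimal sketch of the modified Z-score follows. The function name and sample are invented for the illustration, and the constant 0.6745 and the cutoff of 3.5 are the conventional choices rather than something fixed by the method:

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-scores based on the median and the MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # Happens when at least half the values equal the median
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    # 0.6745 makes the score comparable to a Z-score for normal data
    return 0.6745 * (values - median) / mad

# Hypothetical sample: nine ordinary readings and one gross error
sample = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 100])

m = modified_z_scores(sample)
flagged = sample[np.abs(m) > 3.5]
print(flagged)  # [100]
```

On this sample the median is 10 and the MAD is 1, so the gross error scores about 60 while every ordinary reading stays below 1.5.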

The IQR method (Tukey’s fences) defines the interquartile range as the difference between the 75th and 25th percentiles (Q3 and Q1). Observations below Q1 minus 1.5 times the IQR, or above Q3 plus 1.5 times the IQR, are flagged. The method does not require normality, and because the quartiles ignore the tails, the fences are barely influenced by the extreme values they are designed to detect, provided fewer than a quarter of the observations are extreme on either side. It is not fully distribution-agnostic, though: the 1.5 multiplier flags about 0.7% of a normal sample and noticeably more of a heavy-tailed or skewed one. An adjusted variant scales the fence multipliers based on skewness, which helps when the distribution is substantially asymmetric.

Percentile-based flagging is the bluntest approach: flag the bottom N% and top N% of observations. This is not a detection method in any statistical sense; it is a policy decision about what proportion of the data to treat as extreme. It is appropriate when a fixed review budget exists, such as reviewing the top 1% of transactions for fraud, but it will always produce flags regardless of whether the flagged values are genuinely unusual.
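To see that percentile flagging always produces flags, the sketch below applies a 1%/99% rule to synthetic data that contains no injected outliers at all. The data and the variable names are assumptions made for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Clean synthetic data: nothing unusual has been added
clean = rng.normal(loc=50, scale=10, size=1000)

lower, upper = np.percentile(clean, [1, 99])
flagged = clean[(clean < lower) | (clean > upper)]

print(len(flagged))  # 20: ten below the 1st percentile, ten above the 99th
```

The count is fixed by the percentiles and the sample size, not by anything in the data.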

Winsorization does not detect outliers at all. It caps extreme values at a specified percentile rather than removing them, for example replacing every value above the 99th percentile with the 99th percentile value (and, if applied to both tails, every value below the 1st percentile with the 1st percentile value). This is useful when you want to reduce the influence of extreme values on a model without discarding the observations entirely.
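Capping at percentiles can be sketched with NumPy alone. The function name winsorize, its default percentiles, and the sample values are choices made for this illustration:

```python
import numpy as np

def winsorize(values, lower_pct=1, upper_pct=99):
    """Cap values at the given percentiles instead of dropping them."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lower, upper)

# Hypothetical column: the integers 1..99 plus one extreme value
values = np.concatenate([np.arange(1.0, 100.0), [1000.0]])
capped = winsorize(values)

print(len(values), len(capped))      # same number of observations
print(values.max(), capped.max())    # the extreme value is pulled in
print(values.mean(), capped.mean())  # and the mean moves with it
```

Every row is still present afterwards; only the values in the two tails change.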

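Statistical tests for outliers, such as Grubbs’ test and Dixon’s Q test, assume a specific distribution (usually normal) and are designed for small samples. They test whether the most extreme observation is inconsistent with the assumed distribution. The Shapiro-Wilk test is sometimes listed alongside them, but it is a test of normality rather than of outliers; its role here is to check the assumption the outlier tests rely on. These tests are rarely used in large-scale machine learning pipelines but are appropriate for small, well-characterised datasets where the distributional assumption holds.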
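As an illustration of how such a test works, here is a sketch of the two-sided Grubbs test for a single outlier, using the standard critical value derived from the t distribution. The function name, its return format, and the sample are assumptions for the example, and the test is only meaningful if the rest of the data is approximately normal:

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs test for one outlier in an approximately normal sample."""
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 3:
        raise ValueError("Grubbs' test needs at least 3 observations")
    deviations = np.abs(x - x.mean())
    idx = int(np.argmax(deviations))       # most extreme observation
    g = deviations[idx] / x.std(ddof=1)    # test statistic
    # Critical value from the t distribution with n - 2 degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return idx, g, g_crit, bool(g > g_crit)

# Hypothetical sample: nine ordinary readings and one gross error
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
idx, g, g_crit, is_outlier = grubbs_test(sample)

print(idx, round(g, 2), round(g_crit, 2), is_outlier)
# G is about 2.84 against a critical value of about 2.29
```

The statistic is the same quantity as the largest absolute Z-score; what the test adds is a critical value that depends on the sample size instead of a fixed threshold of 3.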

Model-Based Detection

When the distribution is complex, multivariate, or not well characterised, model-based methods are more appropriate.

Isolation Forest works by randomly partitioning the feature space. Outliers, being unusual, require fewer splits to isolate than normal observations. The anomaly score is inversely related to the average path length across many trees. It is fast, scales to high dimensions, and makes no distributional assumptions. It is often a good first model-based choice.
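The scores behind the flags can be inspected directly. The sketch below uses scikit-learn's IsolationForest on synthetic two-dimensional data with one point placed far from the rest; the data and parameter values are assumptions for the illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 ordinary points around the origin plus one distant point (last row)
X = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])

forest = IsolationForest(n_estimators=200, random_state=0).fit(X)

# score_samples: the lower the score, the shorter the average path
# needed to isolate the point, and the more anomalous it is
scores = forest.score_samples(X)

most_anomalous = int(np.argmin(scores))
print(most_anomalous, len(X) - 1)  # the distant point should rank first
```

Ranking by score avoids committing to a contamination rate up front, which is helpful when the share of outliers is unknown.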

Local Outlier Factor (LOF) compares the local density of an observation to the densities of its neighbours. An observation in a low-density region surrounded by high-density neighbours receives a high outlier score. This makes LOF effective at detecting outliers that are context-dependent: a value that would be normal in one cluster may be anomalous near another. The drawback is that it is slow on large datasets and sensitive to the choice of the number of neighbours.
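The context dependence can be illustrated with two synthetic clusters of very different density. In this sketch, which uses scikit-learn's LocalOutlierFactor, two extra points sit at the same offset from their nearest cluster centre; the cluster parameters and point positions are assumptions for the example:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
tight = rng.normal(loc=0.0, scale=0.1, size=(100, 2))   # dense cluster
loose = rng.normal(loc=10.0, scale=2.0, size=(100, 2))  # sparse cluster

near_tight = [1.0, 1.0]   # offset (1, 1) from the dense cluster's centre
in_loose = [11.0, 11.0]   # same offset from the sparse cluster's centre
X = np.vstack([tight, loose, [near_tight, in_loose]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_    # larger = more anomalous

print("near dense cluster: ", labels[-2], round(float(scores[-2]), 1))
print("in sparse cluster:  ", labels[-1], round(float(scores[-1]), 1))
```

The same offset is far outside the dense cluster but well inside the sparse one, so only the first point should be flagged. Changing n_neighbors changes the scores, which is the sensitivity mentioned above.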

Mahalanobis distance generalises the Z-score to multiple dimensions by accounting for correlations between features. An observation that is unusual in the joint distribution of several features, even if normal on each individual feature, will have a high Mahalanobis distance. It assumes a multivariate normal distribution, so it shares the Z-score’s sensitivity to skewed data. When the mean and covariance are estimated from the same contaminated data, it also shares the Z-score’s circularity.
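A point that is ordinary on each feature but unusual jointly can be constructed with two strongly correlated features. The sketch below computes squared Mahalanobis distances with NumPy and compares them with a chi-square cutoff, which is the usual reference under multivariate normality. The data, the 0.999 quantile, and the variable names are assumptions for the illustration:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
cov = [[1.0, 0.95], [0.95, 1.0]]  # two strongly correlated features
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)
# Add one point that breaks the correlation but is moderate on each axis
X = np.vstack([X, [[1.5, -1.5]]])

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
# Squared Mahalanobis distance of every row
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Under multivariate normality d2 is roughly chi-square with p degrees of freedom
threshold = chi2.ppf(0.999, df=X.shape[1])
marginal_z = np.abs(diff / X.std(axis=0, ddof=1))

print("per-feature |z| of added point:", marginal_z[-1].round(2))
print("flagged by Mahalanobis:", bool(d2[-1] > threshold))
```

Neither per-feature Z-score comes close to 3, yet the joint distance is far beyond the cutoff because the point lies across the direction in which the data barely varies.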

DBSCAN is a density-based clustering algorithm that naturally identifies outliers as points that do not belong to any cluster. Points in sparse regions that cannot be assigned to a cluster are labelled as noise. The parameters (epsilon, the neighbourhood radius, and min_samples, the number of points required within that radius for a point to count as a core point) require tuning.
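In scikit-learn, DBSCAN marks noise points with the label -1. The sketch below uses two synthetic blobs and one stray point; the eps and min_samples values suit this toy data only and are not general recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
stray = [[2.5, 10.0]]  # far from both blobs (last row)
X = np.vstack([blob_a, blob_b, stray])

# eps and min_samples are tuned by eye for this synthetic example
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

noise_idx = np.flatnonzero(labels == -1)     # rows not assigned to any cluster
n_clusters = len(set(labels.tolist()) - {-1})

print("clusters found:", n_clusters)
print("noise rows:", noise_idx.tolist())
```

A few points on the fringe of a blob may also come back as noise, which shows how directly the result depends on eps.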

One-class SVM trains a boundary around the normal data and classifies anything outside that boundary as anomalous. It works well in high dimensions but is computationally expensive and sensitive to the choice of kernel and regularisation parameters.
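A minimal sketch with scikit-learn's OneClassSVM follows. It assumes the training data is clean, synthetic, and already on a comparable scale; the values chosen for nu and gamma are illustrative, not tuned:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))  # assumed to contain only normal data

# nu is an upper bound on the fraction of training points left outside
# the boundary; gamma controls how tightly the RBF boundary follows the data
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.0, 0.0],   # centre of the training data
                  [8.0, 8.0]])  # far outside it
pred = model.predict(X_new)     # 1 = inside the boundary, -1 = outside
print(pred)
```

Fitting on one set and predicting on another reflects how the method is typically used: the boundary is learned from data believed to be normal and then applied to new observations.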

Elliptic Envelope fits a multivariate Gaussian to the data and flags observations with a Mahalanobis distance above a threshold. It is more robust than a standard Mahalanobis distance because the centre and covariance are estimated with a robust estimator, so the outliers have little influence on the fit. It still assumes that the normal data forms a single, approximately elliptical cluster.
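The sketch below applies scikit-learn's EllipticEnvelope to two correlated synthetic features with one added point that breaks the correlation. The data and the contamination value are assumptions for the illustration:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
cov = [[1.0, 0.95], [0.95, 1.0]]
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)
# One added point that is moderate on each axis but breaks the correlation
X = np.vstack([X, [[1.5, -1.5]]])

# contamination sets the share of rows that will be flagged
envelope = EllipticEnvelope(contamination=0.01, random_state=0)
labels = envelope.fit_predict(X)     # -1 = outlier, 1 = inlier
robust_d2 = envelope.mahalanobis(X)  # squared robust Mahalanobis distances

print("added point flagged:", labels[-1] == -1)
print("row with largest distance:", int(np.argmax(robust_d2)))
```

As with Isolation Forest, the contamination parameter fixes how many rows are flagged, so the ranking by distance is often more informative than the labels alone.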

How to Choose

The decision follows from the data, not from preference for a particular method. If the column is univariate and approximately normal, the modified Z-score or IQR method is sufficient and interpretable. If the data is multivariate and the outliers are expected to be unusual in the joint distribution of several features, Mahalanobis distance or Isolation Forest is appropriate. If the data has clusters with varying densities and outliers are expected to be locally unusual rather than globally extreme, LOF is the right tool.

In practice, starting with a simple method and comparing its output to a model-based method reveals whether the detected outliers are consistent. If the two methods flag completely different observations, that is a signal worth investigating before deciding which set to act on.

What You Can Do Now

The following block applies both the IQR method and Isolation Forest to the same column, so you can compare which observations each method flags:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Normal observations with a few injected outliers
data = np.concatenate([rng.normal(loc=50, scale=10, size=200), [150, 160, -30]])
df = pd.DataFrame({"value": data})

# IQR method
Q1, Q3 = df["value"].quantile(0.25), df["value"].quantile(0.75)
IQR = Q3 - Q1
iqr_outliers = df[(df["value"] < Q1 - 1.5 * IQR) | (df["value"] > Q3 + 1.5 * IQR)]

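# Isolation Forest
# contamination=0.05 fixes the share of rows flagged at roughly 5%
iso = IsolationForest(contamination=0.05, random_state=42)
# fit_predict returns labels, not scores: -1 for outliers, 1 for inliers
df["iso_label"] = iso.fit_predict(df[["value"]])
iso_outliers = df[df["iso_label"] == -1]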

print(f"IQR flagged:              {len(iqr_outliers)} observations")
print(f"Isolation Forest flagged: {len(iso_outliers)} observations")
print(f"\nIQR indices:  {sorted(iqr_outliers.index.tolist())}")
print(f"IsoFor indices: {sorted(iso_outliers.index.tolist())}")

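Run this on a column you suspect has outliers. Agreement between the two methods is good evidence that the flagged observations are genuinely unusual, though it is not proof. Note that contamination=0.05 makes Isolation Forest flag roughly 5% of the rows whatever the data looks like, so some extra flags from it are expected. If the methods disagree substantially beyond that, examine the disagreements directly before deciding how to handle them.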