Skewness: What the Shape of Your Data Is Telling You

Skewness is one of those statistics that is easy to compute and easy to misread. Most introductions describe it as a measure of asymmetry and move on. What they skip is the more useful part: skewness is a symptom, not a problem in itself, and different root causes require different fixes. Applying a log transform to every skewed column is about as precise as taking the same medication for every kind of pain.

What Skewness Actually Measures

A symmetric, unimodal distribution has its mean, median, and mode at the same point. Skewness measures how far the distribution departs from that state, and in which direction.

In a right-skewed (positively skewed) distribution, the tail extends toward the high end. The mean is pulled toward those high values, so it sits to the right of the median, which in turn sits to the right of the mode. Income is the canonical example: most people earn near the mode, the median is somewhat higher, and a small number of very high earners pull the mean up further still. The mean overstates the typical value. For any analysis of center, the median is the more honest statistic.

In a left-skewed (negatively skewed) distribution, the tail extends toward the low end. The mean is pulled downward, sitting below the median, which sits below the mode. Age at retirement in a specific industry might show this shape: most people retire around a common age, but a subset exits early due to health or redundancy, dragging the mean down.

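Zero skewness means the two tails balance each other. A symmetric distribution always has zero skewness, its mean equals its median, and the mean is a reliable measure of center. The converse is not guaranteed: an asymmetric distribution can still score zero if a long thin tail on one side offsets a short fat tail on the other, so a histogram is worth checking alongside the number. Many statistical procedures, including the usual inference for linear regression, assume normally distributed, and therefore symmetric, residuals.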
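
For reference, the number behind all of this is the third standardised moment. The sketch below computes it by hand and compares it with two library routines. The simulated sample and the variable names are illustrative assumptions, not part of any particular workflow.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative right-skewed sample (assumed data, not from a real dataset)
rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# Moment-based (Fisher-Pearson) skewness: g1 = m3 / m2**1.5
deviations = x - x.mean()
m2 = np.mean(deviations ** 2)   # second central moment
m3 = np.mean(deviations ** 3)   # third central moment
g1 = m3 / m2 ** 1.5

# scipy.stats.skew returns g1 with its default bias=True
scipy_g1 = stats.skew(x)

# pandas applies a small-sample correction: G1 = g1 * sqrt(n(n-1)) / (n-2)
n = len(x)
adjusted = g1 * np.sqrt(n * (n - 1)) / (n - 2)
pandas_G1 = pd.Series(x).skew()

print(f"manual g1:   {g1:.4f}   scipy:  {scipy_g1:.4f}")
print(f"adjusted G1: {adjusted:.4f}   pandas: {pandas_G1:.4f}")
```

The two conventions differ noticeably only for small samples, but it helps to know which one a given tool reports before comparing numbers across libraries.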

Why the Root Cause Matters

Skewness has several distinct origins, and treating them the same way produces poor models. The common ones are worth naming precisely.

The most straightforward cause is a multiplicative scale. When values grow multiplicatively rather than additively, the spacing between them is not constant. A sequence like 10, 15, 25, 50, 100 is right-skewed not because of anything unusual about the data-generating process but because the underlying scale is multiplicative. A log transformation corrects this by converting multiplicative distances into additive ones. Box-Cox generalises the log by estimating a power parameter from the data, but it requires strictly positive values. Yeo-Johnson extends the same idea to zero and negative values.
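
A minimal sketch of that effect, using the five-value sequence from the paragraph above. The variable names are illustrative, and five points are far too few for a serious skewness estimate; the point is only the direction of the change.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 15, 25, 50, 100], dtype=float)
logged = np.log(values)

raw_skew = values.skew()
log_skew = logged.skew()
print(f"Skewness on the raw scale: {raw_skew:.3f}")
print(f"Skewness on the log scale: {log_skew:.3f}")

# Ratios between neighbours on the raw scale become
# plain differences between neighbours on the log scale.
ratios = values / values.shift(1)
log_steps = logged.diff()
print(pd.DataFrame({"ratio": ratios, "log_step": log_steps}))
```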

A second cause is a density problem at some threshold: a handful of observations sit far from the bulk because they come from a different kind of event. Consider a dataset of customer purchases that includes a 19-year-old with a 5,000-unit luxury item. The value is real, but it represents a different kind of event than the rest of the distribution. A log transform will compress the scale but will not resolve the fact that this observation belongs to a different segment. Here the information is better captured through Weight of Evidence encoding or target encoding on binned values, which represent what the value means for the outcome rather than preserving its magnitude.
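
One way to sketch target encoding for a numeric column is to bin it and replace each bin with a smoothed outcome rate. Everything below is an assumption made for illustration: the simulated data, the column names (amount, outcome, amount_te), the ten-bin split, and the smoothing weight m.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a skewed purchase amount and a binary outcome
rng = np.random.default_rng(0)
n = 2000
amount = rng.lognormal(mean=4.0, sigma=1.0, size=n)
p = 1 / (1 + np.exp(-(np.log(amount) - 4.0)))  # assumed link between amount and outcome
outcome = rng.binomial(1, p)
demo = pd.DataFrame({"amount": amount, "outcome": outcome})

# Bin the skewed column into deciles, then summarise the outcome per bin
demo["amount_bin"] = pd.qcut(demo["amount"], q=10, labels=False)
global_rate = demo["outcome"].mean()
bin_stats = demo.groupby("amount_bin")["outcome"].agg(["sum", "count"])

# Smooth each bin's rate toward the global rate; m is an assumed prior weight
m = 20
bin_stats["encoded"] = (bin_stats["sum"] + m * global_rate) / (bin_stats["count"] + m)

# Replace the magnitude with what the bin means for the outcome
demo["amount_te"] = demo["amount_bin"].map(bin_stats["encoded"])
print(bin_stats["encoded"].round(3))
```

In a real pipeline the bin statistics would be fitted on training folds only, since computing them on the rows being encoded leaks the target into the feature.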

Boundary constraints produce a different skew pattern. When values are bounded at zero but can extend far in the positive direction, such as claim amounts in insurance, you get a spike at the boundary and a right-skewed continuous portion. A log transform cannot handle the zeros, and removing them loses the information they carry. Tweedie distributions model this directly. Zero-inflated models treat the zero mass and the positive mass as separate components, which reflects the actual data-generating process.
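
The two-component view can be sketched without any modelling library: treat "was there a claim at all" and "how large was it" as separate quantities. The simulated amounts and the variable names are assumptions for illustration, not a full zero-inflated model.

```python
import numpy as np

# Hypothetical claim amounts: 30% exact zeros plus a right-skewed positive part
rng = np.random.default_rng(3)
amounts = np.concatenate([np.zeros(300), rng.exponential(scale=500, size=700)])

# Component 1: the probability that a claim occurs at all
is_positive = amounts > 0
p_positive = is_positive.mean()

# Component 2: the size of a claim, given that one occurred
mean_positive = amounts[is_positive].mean()

# The overall mean factorises into the two components
expected_amount = p_positive * mean_positive
print(f"P(claim > 0):       {p_positive:.3f}")
print(f"Mean given a claim: {mean_positive:.1f}")
print(f"Product:            {expected_amount:.1f}  (overall mean: {amounts.mean():.1f})")

# A plain log is safe on the positive component alone
log_positive = np.log(amounts[is_positive])
```

Each component can then get its own model, for example a classifier for the zero indicator and a regression on the logged positive amounts.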

Survivorship bias produces skew through sampling, not through the variable itself. A dataset of startup valuations only contains companies that survived long enough to be valued. The failures, which would anchor the left of the distribution, are absent. No transformation addresses this; the skew is a signal that the dataset is incomplete. The correct response is to acknowledge the limitation and to be precise about the population the model can actually describe.
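
A small simulation makes the mechanism visible. It assumes a symmetric underlying quantity and a hard survival threshold, both invented for illustration; real selection is rarely this clean.

```python
import numpy as np
import pandas as pd

# Hypothetical full population: a symmetric score for every company, failures included
rng = np.random.default_rng(11)
score_all = rng.normal(loc=0.0, scale=1.0, size=5000)

# Assumed survival rule: only companies above the threshold are ever observed
survived = score_all > 0.0
score_observed = score_all[survived]

full_skew = pd.Series(score_all).skew()
observed_skew = pd.Series(score_observed).skew()
print(f"Skewness, full population: {full_skew:.3f}")
print(f"Skewness, survivors only:  {observed_skew:.3f}")
```

The variable itself is symmetric; the skew appears only because the left half of the population never made it into the sample.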

The final cause is a mixture of two populations occupying the same column. A column measuring session length might combine casual visitors and power users. Neither group is individually skewed, but the combined distribution appears bimodal or skewed because two distinct distributions are superimposed. A log transform on this produces a log of a mixture, not a transformation of a single clean distribution. Gaussian mixture models decompose the column into its component distributions, which can then be modelled separately or used as a feature encoding.
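
As a rough sketch of that decomposition, the example below superimposes two symmetric groups and fits a two-component mixture with scikit-learn's GaussianMixture. The group sizes, means, and the choice of two components are assumptions; on real data the number of components would need checking, for instance by comparing BIC across fits.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Hypothetical session lengths in minutes: two symmetric groups in one column
rng = np.random.default_rng(5)
casual = rng.normal(loc=2.0, scale=0.5, size=800)
power = rng.normal(loc=10.0, scale=2.0, size=200)
session_minutes = np.concatenate([casual, power])

combined_skew = pd.Series(session_minutes).skew()
print(f"Skewness of the combined column: {combined_skew:.3f}")

# Fit a two-component Gaussian mixture to the single column
X = session_minutes.reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)
component_means = np.sort(gmm.means_.ravel())
print(f"Recovered component means: {component_means.round(2)}")

# Posterior probability of the higher-mean component, usable as a feature
high = int(np.argmax(gmm.means_.ravel()))
p_power_user = gmm.predict_proba(X)[:, high]
```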

How to Respond to Skewness

The right response depends on the diagnosis. For a straightforward multiplicative scale, a log transform is sufficient and appropriate. A common choice in practice is np.log1p, which computes log(1 + x), rather than np.log, so that zeros map to zero instead of negative infinity.
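
A quick check of how np.log1p treats zeros, and how np.expm1 undoes it. The sample values are arbitrary.

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 1000.0])

# log1p(x) = log(1 + x): defined at zero, where a plain log is not
shifted = np.log1p(x)
print(shifted)

# expm1 is the exact inverse, useful for mapping predictions back to the raw scale
restored = np.expm1(shifted)
print(restored)
```

Note that log1p only approximates a plain log when values are large compared with 1, so for columns measured in small fractions the two transforms behave quite differently.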

For most numerical features, a useful default is to create two representations: the transformed magnitude and a rank or quartile encoding. The magnitude captures scale, and the rank encoding captures relative position within the distribution. Together they give the model access to both kinds of information without forcing it to rely on the transformed scale alone.

import numpy as np
import pandas as pd

# Simulated right-skewed income data
rng = np.random.default_rng(42)
income = rng.lognormal(mean=10.5, sigma=1.2, size=1000)

df = pd.DataFrame({"income": income})

# Check skewness
print(f"Skewness: {df['income'].skew():.3f}")
print(f"Mean:     {df['income'].mean():,.0f}")
print(f"Median:   {df['income'].median():,.0f}")

# Log transform for multiplicative scale
df["income_log"] = np.log1p(df["income"])
print(f"\nAfter log transform, skewness: {df['income_log'].skew():.3f}")

# Quartile rank encoding for relative importance
df["income_quartile"] = pd.qcut(df["income"], q=4, labels=[1, 2, 3, 4])

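For a density or threshold problem, target encoding assigns each value range a summary of the outcome variable within that range, which captures what the value means for prediction rather than its raw magnitude.

When the right power for a multiplicative-scale column is unclear, Box-Cox and Yeo-Johnson estimate it from the data: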

# Box-Cox and Yeo-Johnson for finding the optimal transformation
from sklearn.preprocessing import PowerTransformer

pt_boxcox = PowerTransformer(method="box-cox")    # requires strictly positive values
pt_yj     = PowerTransformer(method="yeo-johnson") # handles zero and negative values

income_bc = pt_boxcox.fit_transform(df[["income"]])
income_yj = pt_yj.fit_transform(df[["income"]])

print(f"Box-Cox skewness:     {pd.Series(income_bc.flatten()).skew():.3f}")
print(f"Yeo-Johnson skewness: {pd.Series(income_yj.flatten()).skew():.3f}")

For a zero-boundary constraint, checking whether the zero mass is substantial is the first diagnostic step:

# Identifying a boundary constraint pattern
claim_amounts = np.concatenate([
    np.zeros(300),                                   # zero-claim events
    rng.exponential(scale=500, size=700)             # positive claims
])

zero_fraction = (claim_amounts == 0).mean()
positive_skew  = pd.Series(claim_amounts[claim_amounts > 0]).skew()

print(f"Zero fraction: {zero_fraction:.1%}")
print(f"Skewness of positive portion: {positive_skew:.3f}")
# If zero_fraction > ~10%, a zero-inflated or Tweedie model is appropriate
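
If the zero mass is substantial, one option is a Tweedie GLM, which models the zeros and the positive amounts in a single distribution. The sketch below uses scikit-learn's TweedieRegressor on simulated data. The single risk feature, the 30% claim rate, and power=1.5 are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

# Hypothetical policies: one risk score, 30% file a claim, severity grows with risk
rng = np.random.default_rng(1)
n = 2000
risk = rng.normal(size=n)
has_claim = rng.random(n) < 0.3
severity = rng.gamma(shape=2.0, scale=np.exp(5.0 + 0.5 * risk) / 2.0)
claims = np.where(has_claim, severity, 0.0)

# A power between 1 and 2 gives a compound Poisson-gamma distribution,
# which allows exact zeros alongside a continuous positive part.
model = TweedieRegressor(power=1.5, link="log", alpha=0.0, max_iter=1000)
model.fit(risk.reshape(-1, 1), claims)

pred = model.predict(risk.reshape(-1, 1))
print(f"Zero fraction in data:   {(claims == 0).mean():.1%}")
print(f"Fitted risk coefficient: {model.coef_[0]:.3f}")
print(f"Mean predicted claim:    {pred.mean():.1f}")
```

The log link keeps predictions positive, and the zeros stay in the training data instead of being dropped or shifted.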

A Note on Prior Assumptions

My earlier assumption was that skewness was a preprocessing checkbox: detect it, apply log, proceed. That assumption produced models that were numerically cleaner but not more accurate, because the transformation addressed the symptom rather than the structure.

The shift came from examining residuals and feature importances after transformation and noticing that compressed, log-scaled features were losing discriminative power for the cases that mattered most. A right-skewed feature transformed by log is compressed in the upper range, which can make it harder for a model to distinguish the 90th percentile from the 95th. For certain problems, that distinction is exactly what the model is supposed to learn.

What You Can Do Now

Take a numerical column from a dataset you are currently working with and run this diagnostic block before deciding on any transformation:

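import numpy as np
import pandas as pd

def diagnose_skewness(series: pd.Series, name: str = "column") -> None:
    """Print summary statistics and a first-pass reading of a column's skew."""
    s = series.dropna()
    if len(s) < 3 or s.nunique() < 2:
        print(f"--- {name} --- too few distinct values; skewness is undefined")
        return

    skewness = s.skew()
    zero_frac = (s == 0).mean()

    # Series.mode() is uninformative for continuous data, where every value is
    # unique, so approximate the mode by the midpoint of the fullest histogram bin.
    counts, edges = np.histogram(s, bins="auto")
    peak = counts.argmax()
    mode_approx = (edges[peak] + edges[peak + 1]) / 2

    print(f"--- {name} ---")
    print(f"  Skewness:      {skewness:.3f}")
    print(f"  Mean:          {s.mean():.4f}")
    print(f"  Median:        {s.median():.4f}")
    print(f"  Mode (approx): {mode_approx:.4f}")
    print(f"  Zero fraction: {zero_frac:.1%}")
    print(f"  Min / Max:     {s.min():.4f} / {s.max():.4f}")

    # The 0.5 and 10% cutoffs are rules of thumb, not formal tests.
    if abs(skewness) < 0.5:
        print("  -> Approximately symmetric. Mean is a reliable center.")
    elif skewness >= 0.5:
        print("  -> Right-skewed. Median is a more reliable center.")
        if zero_frac > 0.1:
            print("  -> High zero fraction: consider zero-inflated or Tweedie model.")
        elif s.min() > 0:
            print("  -> All positive: Box-Cox or log transform is applicable.")
        else:
            print("  -> Contains non-positive values: consider Yeo-Johnson.")
    else:
        print("  -> Left-skewed. Median is a more reliable center.")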

# Usage
rng = np.random.default_rng(0)
diagnose_skewness(pd.Series(rng.lognormal(3, 1.5, 1000)), name="simulated_income")

Run this on each skewed column and read the output before reaching for a transformation. The diagnostic will not tell you which root cause applies, but it will rule out the ones that do not fit and prompt the right follow-up questions.