Categorical Data and Cardinality: Features, Targets, and the Encoding Problem

Categories are the context for numbers. A transaction amount of 500 means something different depending on whether the merchant category is grocery or luxury retail. Cardinality, the number of unique values a categorical column can take, determines how much of that context is usable and how much of it becomes noise.

Most discussions of cardinality treat it as a single problem with a single solution: encode and move on. The more precise framing is that high cardinality in a feature column (X) and high cardinality in a target column (Y) are structurally different problems with different remedies.

High Cardinality in Features

A zip code column with 40,000 unique values is the canonical high-cardinality feature. One-hot encoding it produces 40,000 binary columns. Most are zero for most rows. The matrix is sparse, the model is learning from almost no signal per category, and the dimensionality has grown to a size that makes training slow and generalisation poor. This is the curse of dimensionality expressed through a single categorical column.

The problem is compounded by the fact that each encoding of a high-cardinality category implicitly makes a claim: that this category is meaningfully different from all others. For rare categories that appear in only a handful of rows, that claim is unsupported by the data.

Frequency encoding converts each category to its count or proportion in the training set. A zip code that appears 5,000 times becomes 5000, or 0.5 if normalised against a training set of 10,000 rows. This collapses the dimensionality to a single column but loses all information about what the category means beyond how common it is: two unrelated zip codes with the same count receive the same value. It is appropriate when frequency itself is predictive.
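A minimal sketch of frequency encoding, assuming the frequencies are learned on a training column and then applied to another column. The function name frequency_encode, its parameters, and the toy zip codes are illustrative, not from a particular library:

```python
import pandas as pd

def frequency_encode(train: pd.Series, other: pd.Series, normalise: bool = True) -> pd.Series:
    """Map each category in `other` to its count or proportion in `train`."""
    freq = train.value_counts(normalize=normalise)
    # Categories never seen in training map to NaN; treat them as frequency 0.
    return other.map(freq).fillna(0)

# Hypothetical toy data
train_zip = pd.Series(["10001", "10001", "10001", "94105", "60601"])
test_zip = pd.Series(["10001", "94105", "99999"])

encoded = frequency_encode(train_zip, test_zip)
print(encoded.tolist())
```

Here "94105" and "60601" both appear once in training, so they would be indistinguishable after encoding. That is the information loss described above.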

Feature hashing maps categories to a fixed number of bins using a hash function. Unlike age binning (where 10–15 forms an ordered range), hashed bins have no inherent order or meaning. The dimensionality is bounded, but collisions are possible and the representation is not interpretable.
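The bucketing step can be sketched as follows. This is a simplified single-column variant that assigns each category a bin index; the helper hash_bucket and the choice of MD5 are assumptions made for illustration. A stable hash from hashlib is used because Python's built-in hash() for strings is salted per process and would give different bins between runs:

```python
import hashlib
import pandas as pd

def hash_bucket(value: str, n_bins: int = 1024) -> int:
    """Deterministically map a category string to one of n_bins buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_bins

# Hypothetical toy data
zips = pd.Series(["10001", "94105", "60601", "10001"])
hashed = zips.map(lambda z: hash_bucket(z, n_bins=8))
print(hashed.tolist())
```

The same zip code always lands in the same bin, and with only 8 bins distinct zip codes will eventually collide. In practice the bin index is usually expanded into an n_bins-wide sparse vector; scikit-learn's FeatureHasher is one implementation of that fuller version.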

Target encoding replaces each category with the mean of the target variable for rows in that category. A zip code with a fraud rate of 12% becomes 0.12. This is powerful when the category is genuinely predictive of the outcome, but it is dangerous without smoothing. A zip code with a single fraudulent transaction will have a target-encoded value of 1.0, which the model will learn as a strong signal. That signal is noise from a single data point. Smoothing blends the category-level mean with the global mean, weighted by sample size, so rare categories are pulled toward the prior:

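import pandas as pd

def smooth_target_encode(df, col, target, alpha=10):
    """Smoothed target encoding; alpha is the pseudo-count given to the global mean."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    # Weighted blend: categories with a small count are pulled toward global_mean
    smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
    # Learn the mapping on training rows only; encoding a row with its own target leaks the label
    return df[col].map(smoothed)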
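To make the effect of smoothing concrete, here is a small worked example on made-up data: 99 transactions in zip "A" with one fraud, and a single fraudulent transaction in zip "B". The column names and alpha=10 are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data: zip "B" is a singleton that happens to be fraudulent
df = pd.DataFrame({
    "zip": ["A"] * 99 + ["B"],
    "fraud": [1] + [0] * 98 + [1],
})

alpha = 10
global_mean = df["fraud"].mean()  # 2 frauds in 100 rows
stats = df.groupby("zip")["fraud"].agg(["mean", "count"])

raw = stats["mean"]
smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)

print(raw["B"])       # 1.0: the unsmoothed encoding treats one row as certainty
print(smoothed["B"])  # (1 * 1.0 + 10 * 0.02) / 11, roughly 0.109
```

The singleton still sits above the global rate of 0.02, but it no longer looks like a guaranteed fraud signal. A larger alpha pulls it closer to the prior.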

Grouping collapses rare categories into an “other” bucket. This is the simplest approach and often the most defensible: 500 small cities that appear once each in the training set carry no individual signal, but grouped together they may.
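One way to sketch grouping, assuming a simple count threshold decided on the training data. The function name group_rare, the min_count threshold, and the city names are hypothetical:

```python
import pandas as pd

def group_rare(train: pd.Series, other: pd.Series,
               min_count: int = 5, label: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times in `train` with `label`."""
    counts = train.value_counts()
    keep = set(counts[counts >= min_count].index)
    # Values not in `keep` (rare or unseen in training) are replaced by the label.
    return other.where(other.isin(keep), label)

# Hypothetical toy data: two common cities and three singletons
cities = pd.Series(["London"] * 6 + ["Leeds"] * 5 + ["Ely", "Wells", "Ripon"])
grouped = group_rare(cities, cities, min_count=5)
print(grouped.value_counts().to_dict())
```

Because the kept set comes from the training column, categories that first appear at prediction time also fall into the "other" bucket rather than producing an unseen level.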

How algorithms handle high-cardinality features differs. Linear models produce unstable coefficients because each category gets its own coefficient estimated from limited data. Tree-based models cope better structurally, but they can still overfit when a split isolates a handful of rare categories whose target values happen to look extreme. Neural networks expect dense, continuous inputs; sparse one-hot representations require embedding layers to convert them into dense representations first.

Low Cardinality in Features

Low-cardinality features (gender, education level, a product category with five options) are easy to encode but carry their own challenges. One-hot encoding is usually the right default. The more interesting opportunity with low-cardinality features is combination: combining age group, gender, and education level creates a compound category that is richer than any individual feature. Provided each combination still has enough rows behind it, this produces dense, informative groups that a model can learn from directly.
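A compound category can be sketched by concatenating the individual columns. The column names and values below are made up for illustration. The number of possible combinations is the product of the individual cardinalities, so it is worth checking how many combinations actually occur:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-40", "26-40"],
    "gender": ["F", "M", "F", "F"],
    "education": ["BSc", "BSc", "MSc", "MSc"],
})

# Build the compound category by joining the three columns
df["segment"] = df["age_group"] + "_" + df["gender"] + "_" + df["education"]

n_segments = df["segment"].nunique()
# Upper bound on the compound cardinality: product of the individual cardinalities
max_possible = (
    df["age_group"].nunique() * df["gender"].nunique() * df["education"].nunique()
)
print(n_segments, max_possible)
```

If the observed number of segments approaches the upper bound while most segments have only a few rows, the combination has recreated the high-cardinality problem from the previous section.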

The risk with low cardinality in a target context (such as fraud detection, where the outcome is binary but imbalanced) is class imbalance, not dimensionality. A dataset where 99% of rows are non-fraudulent has a majority class that dominates training. Models that maximise overall accuracy will simply predict the majority class for everything. The relevant metrics are precision and recall, not accuracy, and training should use stratified cross-validation to preserve the class ratio across folds.
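The evaluation setup described above might look like the following sketch, using scikit-learn's StratifiedKFold and cross_validate on synthetic data. The data-generating rule, the fold count, and the choice of logistic regression are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
# Synthetic, imbalanced labels: positives are rare and depend on the first feature
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 2.0).astype(int)

# Stratified folds keep the positive rate roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

model = LogisticRegression(max_iter=1000)
scores = cross_validate(model, X, y, cv=cv, scoring=["precision", "recall", "accuracy"])

print("positive rate per fold:", np.round(fold_rates, 4))
print("precision:", scores["test_precision"].mean())
print("recall:   ", scores["test_recall"].mean())
print("accuracy: ", scores["test_accuracy"].mean())
```

Comparing the three printed scores shows why accuracy alone is misleading here: a model can score very high on accuracy while missing a large share of the rare positives.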

For this setting, logistic regression with a regularisation penalty is a reliable baseline. XGBoost or LightGBM, tuned with their regularisation parameters, often performs better on precision and recall. Random forest, which already bags its trees by construction, tends to be robust to noisy data with many categorical features.

High Cardinality in Targets

A target column with 50,000 unique product SKUs is a fundamentally different problem from a high-cardinality feature. The model must output one of 50,000 classes for each prediction. A uniform random guess is correct once in 50,000 tries, or 0.002% of the time. With limited data per class, the signal per class is thin.

The author’s framing here is useful: in features you move from high cardinality to low cardinality to reduce noise, while in targets you may need to move from low to high cardinality (from category to specific product) to achieve the prediction goal. These are opposite directions.

The practical remedies are: hierarchical classification (first predict the category, then predict the specific item within that category), converting the target to a learned embedding (a dense vector that represents the product, which the model outputs as a regression target and maps back to the nearest product), or binning the target into ordered groups (if predicting number of days until an event, converting to week 1, month 1, quarter 1 is often sufficient and makes the task tractable).

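# Hierarchical approach: predict category first, then item within category
# Assumes NumPy arrays X_train, X_test, y_train_category and y_train_item already exist
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1: predict category
cat_model = LogisticRegression(max_iter=1000)
cat_model.fit(X_train, y_train_category)
predicted_category = cat_model.predict(X_test)

# Stage 2: within each predicted category, predict the item using only that category's training rows
predicted_item = np.empty(len(X_test), dtype=object)
for cat in np.unique(predicted_category):
    in_train, in_test = y_train_category == cat, predicted_category == cat
    items = y_train_item[in_train]
    if len(np.unique(items)) == 1:  # only one item in this category: nothing to learn
        predicted_item[in_test] = items[0]
        continue
    item_model = LogisticRegression(max_iter=1000).fit(X_train[in_train], items)
    predicted_item[in_test] = item_model.predict(X_test[in_test])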
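The third remedy, binning the target into ordered groups, can be sketched with pandas.cut. The bin edges (7, 30, and 90 days) and the label names are one possible reading of "week, month, quarter" and are assumptions, as is the toy data:

```python
import pandas as pd

# Hypothetical target: number of days until an event
days_until_event = pd.Series([2, 6, 15, 29, 45, 88])

# Right-closed intervals: (0, 7], (7, 30], (30, 90]
bins = [0, 7, 30, 90]
labels = ["within_week", "within_month", "within_quarter"]
binned = pd.cut(days_until_event, bins=bins, labels=labels)

print(binned.tolist())
print(binned.nunique(), "classes instead of", days_until_event.nunique())
```

Values outside the outermost edges become missing with this call, so the edges should cover the full range of the target or be extended with an open-ended final bin.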

What You Can Do Now

Take a categorical column from a dataset you are working with and run this cardinality check before deciding on an encoding strategy:

import pandas as pd

def cardinality_report(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df.select_dtypes(include=["object", "category"]).columns:
        counts = df[col].value_counts()  # computed once; excludes missing values
        if counts.empty:  # column is entirely missing values
            continue
        n_unique = len(counts)
        top_freq = counts.iloc[0] / counts.sum()
        rare_count = (counts == 1).sum()
        rows.append({
            "column": col,
            "unique": n_unique,
            "top_freq_%": round(top_freq * 100, 1),
            "singleton_categories": rare_count,
        })
    return pd.DataFrame(rows).sort_values("unique", ascending=False)

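The singleton_categories column tells you how many categories appear exactly once. That number is a rough indicator of how exposed a target encoding would be to single-row noise and of whether grouping rare values into “other” is warranted. Categories that appear only two or three times raise the same concern, so treat the singleton count as a lower bound on the number of rare categories.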