Cross Validation: LOOCV, K-Fold, and When to Use Each

This post is based on coursework from my university statistics module. The notes and structure here reflect what I found worth retaining from that material, extended with practical implementation detail.

Cross validation is a numerical technique for estimating how well a predictive model will generalise to data it has not seen. It does not tell you why a model works or prove anything theoretically. What it gives you is a measured approximation of the gap between training performance and real-world performance, computed from the data you already have.

The core mechanism is always the same: split the available data into a training portion and a validation portion, train the model on one, evaluate it on the other, and repeat the process in different configurations. The prediction scores from each evaluation are aggregated, usually averaged, to produce a single estimate of generalisation performance.
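To make the mechanism concrete, here is a minimal sketch of the split, train, evaluate, aggregate loop written by hand. The dataset, the choice of 4 folds, and the logistic regression model are all assumptions picked for illustration, not something prescribed by the course material.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Shuffle the row indices once, then cut them into 4 disjoint validation folds
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))
folds = np.array_split(indices, 4)

fold_scores = []
for i, val_idx in enumerate(folds):
    # Training portion = every fold except the one currently held out
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Evaluate only on observations the model did not see during fitting
    fold_scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

# Aggregate the per-fold scores into a single estimate
cv_estimate = float(np.mean(fold_scores))
print(f"Per-fold accuracy: {np.round(fold_scores, 3)}")
print(f"CV estimate:       {cv_estimate:.3f}")
```

Library helpers such as cross_val_score wrap this same loop; writing it out once makes it easier to see what they do for you.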

Validation Set, Test Set, and Why Cross Validation Matters

A validation set is used during model development to select between models or tune hyperparameters. A test set is held out entirely and touched only once, at the end, to report final performance. The two serve different purposes and should never be the same data.
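One way to keep the two roles separate is sketched below: hold out a test set first, run cross validation only on the remainder to choose between candidates, then touch the test set once. The two candidate models, the 80/20 split, and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Step 1: set the test set aside before any model selection happens
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: compare candidates using cross validation on the development data only
candidates = {
    "C=0.1": LogisticRegression(C=0.1, max_iter=1000),
    "C=10": LogisticRegression(C=10.0, max_iter=1000),
}
cv_means = {
    name: cross_val_score(m, X_dev, y_dev, cv=5).mean()
    for name, m in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Step 3: refit the chosen model on all development data, score the test set once
final_model = candidates[best_name].fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)
print(f"CV means: {cv_means}")
print(f"Chosen: {best_name}, test accuracy: {test_accuracy:.3f}")
```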

In practice, cross validation is worth doing whenever you can afford the extra model fits. A single train-validation split is a gamble: depending on which observations land in which partition, the performance estimate can be optimistic or pessimistic. Cross validation averages over many such splits, producing a more stable estimate. When training data is small or the feature space is large, the gap between training and validation performance tends to be larger, and cross validation gives you a principled way to measure that gap rather than guess at it.
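The "gamble" can be seen directly by repeating a single split with different seeds and looking at how far the scores spread. This is a rough sketch on assumed synthetic data; the 20 seeds and the 80/20 proportion are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Small synthetic dataset, assumed purely for illustration
X, y = make_classification(n_samples=150, n_features=10, random_state=1)

# Twenty different single train-validation splits of the same data
single_split_scores = []
for seed in range(20):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    single_split_scores.append(model.score(X_val, y_val))

# One 10-fold cross validation run on the same data for comparison
cv_mean = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10).mean()

print(f"Single-split accuracy range: "
      f"{min(single_split_scores):.3f} to {max(single_split_scores):.3f}")
print(f"10-fold CV mean:             {cv_mean:.3f}")
```

The width of the single-split range is the uncertainty you accept when you report one split and stop there.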

Leave-P-Out and LOOCV

Leave-P-Out Cross Validation (LPOCV) designates P observations as the validation set and trains on the rest. This is repeated across every possible combination of P observations from the dataset. The result is an exhaustive, deterministic process: every observation is eventually used for both training and validation. Because every combination is evaluated, running LPOCV twice on the same dataset produces the same result every time.
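The number of combinations grows quickly, which is why LPOCV is rarely run with P greater than 1 or 2. The sketch below uses scikit-learn's LeavePOut on a toy array of 6 observations (an assumption chosen so the splits can be enumerated) and checks the count against the binomial coefficient.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# Toy data: 6 observations with 2 features each, assumed for illustration
X_small = np.arange(12).reshape(6, 2)

lpo = LeavePOut(p=2)
splits = list(lpo.split(X_small))

# One split per combination of 2 observations out of 6
print(f"Splits generated: {len(splits)}, C(6, 2) = {comb(6, 2)}")

# Enumerating again gives exactly the same splits in the same order
splits_again = list(lpo.split(X_small))
same = all(
    np.array_equal(a[1], b[1]) for a, b in zip(splits, splits_again)
)
print(f"Identical on second run: {same}")

# The same calculation for a dataset of 200 observations
print(f"C(200, 2) = {comb(200, 2)} model fits")
```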

Leave-One-Out Cross Validation (LOOCV) is the special case where P equals 1. At each iteration, a single observation is held out for validation and the model is trained on all remaining observations. For a dataset of N observations, LOOCV fits N models. Each model is trained on N-1 observations, so it is close to what you would get from the full dataset, and the performance estimate has low bias. The downside is usually described as high variance in the estimate: any two training sets share N-2 observations, so the fitted models are nearly identical and their errors are highly correlated, and averaging correlated results reduces variance much less than averaging independent ones.

LOOCV is appropriate when the dataset is small enough that losing any observations to a held-out fold would meaningfully impoverish the training data. It is not a good default for larger datasets: it needs N model fits, so the cost grows with N, and the higher variance of the estimate means those extra fits often buy a less informative result than 10-fold gives at a fraction of the cost.

K-Fold Cross Validation

K-Fold divides the dataset into K equally sized subsets, called folds. One fold serves as the validation set while the remaining K-1 folds form the training set. This repeats K times, with a different fold acting as validation each time. The model is fitted K times in total, and the evaluation metric is averaged across all K runs.

K is a hyperparameter. Common values in practice are 5 and 10. The choice involves a bias-variance tradeoff in the same direction as LOOCV: larger K means each training set contains more observations and is closer to the full dataset (lower bias), but the training sets overlap more across folds, increasing the correlation between evaluation scores and raising variance. Smaller K reduces overlap and variance but introduces more pessimistic bias because each model trains on less data.
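The two quantities behind this tradeoff are simple arithmetic, sketched below. The helper kfold_training_stats is a hypothetical function written for this post, and it assumes N divides evenly by K.

```python
def kfold_training_stats(n_samples, k):
    """Return (fraction of data used for training, overlap between two training sets).

    Hypothetical helper for illustration; assumes n_samples is divisible by k.
    """
    val_size = n_samples // k
    train_size = n_samples - val_size
    # Two different training sets each leave out one fold, so they share
    # everything except those two folds
    shared = n_samples - 2 * val_size
    return train_size / n_samples, shared / train_size


for k in (2, 5, 10, 200):  # k=200 on 200 samples is LOOCV
    train_frac, overlap = kfold_training_stats(200, k)
    print(f"K={k:>3}: trains on {train_frac:.1%} of data, "
          f"training sets overlap by {overlap:.1%}")
```

Training fraction drives the bias side and overlap drives the correlation between fold results, so both climb together as K grows.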

The value K=10 has become the standard default because, empirically, it tends to yield low bias and modest variance across a wide range of dataset sizes and model types. K=5 is a reasonable choice when computational cost is a constraint. K=2 is rarely useful: each model trains on only half the data, so the estimate is pessimistically biased.

Unlike LOOCV, K-fold is normally stochastic, because observations are assigned to folds at random. (In scikit-learn this requires shuffle=True; with the default shuffle=False, KFold cuts the data into consecutive blocks, which is repeatable but sensitive to how the rows are ordered.) Running the same shuffled K-fold procedure twice on the same dataset can produce different fold assignments and therefore slightly different performance estimates. For a stable reported result, repeated K-fold (running the full procedure multiple times with different random seeds and averaging all results) is the more rigorous approach.
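Repeated K-fold can be sketched with scikit-learn's RepeatedKFold. The dataset, the 5 repeats, and the logistic regression model are assumptions for the example; the point is only to show how much the 10-fold mean moves from one random partition to the next.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# 10 folds, repeated 5 times with a fresh random partition each time
rkf = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(model, X, y, cv=rkf, scoring="accuracy")

# Scores come back in split order: 10 folds of repeat 1, then repeat 2, ...
per_repeat_means = scores.reshape(5, 10).mean(axis=1)

print(f"Mean of each repeat: {per_repeat_means.round(4)}")
print(f"Overall mean:        {scores.mean():.4f}")
```

The spread of the per-repeat means is the part of the uncertainty that comes purely from fold assignment.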

from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np

# 200 samples, 10 features — small enough that LOOCV is not absurd computationally
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# --- 10-Fold CV ---
# shuffle=True randomises fold assignment; random_state makes it reproducible
kf = KFold(n_splits=10, shuffle=True, random_state=42)
kf_scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
# kf_scores is an array of 10 accuracy values, one per fold
print(f"10-Fold CV — mean: {kf_scores.mean():.4f}, std: {kf_scores.std():.4f}")
# std is the spread of accuracy across the 10 folds; a small value means the folds scored similarly

# --- LOOCV ---
# No shuffle or random_state needed — LOOCV is deterministic by definition
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X, y, cv=loo, scoring="accuracy")
# loo_scores has 200 values, each 0.0 or 1.0 (wrong or correct on that one held-out sample)
# std will always look large here — this is expected, not a sign of instability
print(f"LOOCV     — mean: {loo_scores.mean():.4f}, std: {loo_scores.std():.4f}")

# Run K-fold a second time with a different random_state to show that it is stochastic
kf2 = KFold(n_splits=10, shuffle=True, random_state=99)
kf_scores2 = cross_val_score(model, X, y, cv=kf2, scoring="accuracy")
print(f"10-Fold CV (seed 99) — mean: {kf_scores2.mean():.4f}")
# Mean will be close but not identical — confirms K-fold is stochastic
# LOOCV mean would be exactly the same if re-run — confirms it is deterministic

Other Variants

Stratified K-fold preserves the class distribution in each fold, so a fold from an imbalanced dataset will have approximately the same class ratio as the full dataset. For classification problems with imbalanced classes, this is the correct default. Standard K-fold can produce folds where the minority class is severely underrepresented, which makes the evaluation unreliable.

Monte Carlo cross validation (also called shuffle split) draws random training and validation splits repeatedly without the constraint that every observation must appear in a validation set exactly once. The split proportion is fixed, but because splits are random and independent there is no guarantee of exhaustive coverage. It is flexible and useful when you want many evaluation rounds at a fixed computation budget.
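The lack of exhaustive coverage is easy to see by counting how often each observation lands in a validation set. The sketch below uses scikit-learn's ShuffleSplit on a toy array; the 20 observations, 8 rounds, and 25% validation size are assumptions chosen to keep the output readable.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 20 observations, assumed for illustration
X_demo = np.arange(40).reshape(20, 2)

# 8 independent random splits, each holding out 25% (5 observations)
ss = ShuffleSplit(n_splits=8, test_size=0.25, random_state=0)

val_counts = np.zeros(len(X_demo), dtype=int)
for train_idx, val_idx in ss.split(X_demo):
    val_counts[val_idx] += 1

# In K-fold every count would be exactly 1; here the counts are uneven
print(f"Times each observation was validated: {val_counts}")
print(f"Never validated: {(val_counts == 0).sum()} observations")
```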

Time series cross validation is structurally different from the others. Standard cross validation shuffles observations randomly, which for temporal data means a model can be trained on future observations and validated on past ones. That produces optimistic and unrealistic estimates. Time series cross validation preserves chronological order: the training set always precedes the validation set in time. Each fold extends the training window forward and validates on the next time window. This mimics the actual deployment scenario where you predict the future from the past.

from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# --- Stratified K-fold ---
# Uses the same X, y and model from the block above
# shuffle=True shuffles the samples within each class before they are dealt into folds
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
skf_scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
# Each fold will have approximately the same class ratio as the full dataset
print(f"Stratified 10-Fold — mean: {skf_scores.mean():.4f}, std: {skf_scores.std():.4f}")

# --- Time Series Split ---
# n_splits=5 produces 5 forward-chaining folds
# Training window grows with each fold — fold 1 is smallest, fold 5 is largest
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(
        f"Fold {fold+1}: "
        f"train [{train_idx[0]}..{train_idx[-1]}] size={len(train_idx)}, "
        f"val   [{val_idx[0]}..{val_idx[-1]}]  size={len(val_idx)}"
    )
    # val indices always come after all train indices — no future leakage

10-Fold vs LOOCV

Property	LOOCV	10-Fold
Models fitted	N	10
Bias	Low	Low to moderate
Variance of estimate	High	Modest
Deterministic	Yes	No (stochastic)
Suitable dataset size	Small only	Any
Default choice	No	Yes

LOOCV’s low bias is offset by its high variance and computational cost. For the vast majority of problems, 10-fold cross validation is the better choice. LOOCV is reserved for situations where N is small enough that spending N model fits is acceptable and you cannot afford to leave out more than one observation per fold.

Where to Start: Beginner, Intermediate, Advanced

If you are starting out: use 10-fold cross validation for everything. If your classification target is imbalanced, use stratified 10-fold. These two cover most problems correctly.

If you are comfortable with the basics: adjust K based on dataset size. For large datasets, smaller K (5) is faster with acceptable accuracy. For small datasets, larger K or LOOCV is more appropriate. Apply stratified splitting automatically for any classification task.

If you work on specialised problems: use time series cross validation for any temporal data without exception. Consider nested cross validation when you need to tune hyperparameters and report generalisation performance independently. Outer folds estimate generalisation and inner folds tune the model. The inner folds are carved out of each outer training set, so the outer validation fold is never seen during tuning.
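Nested cross validation can be sketched by passing a GridSearchCV object to cross_val_score. The parameter grid, the fold counts, and the synthetic dataset are assumptions for illustration; treat this as one possible arrangement rather than a recipe.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Inner loop tunes C; outer loop estimates generalisation of the whole procedure
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    cv=inner_cv,
    scoring="accuracy",
)

# For each outer fold, the grid search runs only on that fold's training data,
# then the tuned model is scored on the outer validation fold
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print(f"Outer-fold accuracy: {nested_scores.round(4)}")
print(f"Nested CV estimate:  {nested_scores.mean():.4f}")
```

The cost is the product of the two loops: here 5 outer folds times 4 grid values times 3 inner folds, plus one refit per outer fold.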

What You Can Do Now

The block below is self-contained. Paste it and run it as-is. Here is what each part does before you read the code:

A fake dataset is created with 500 rows, 10 feature columns, and a heavily imbalanced target (90% class 0, 10% class 1). This simulates a fraud or anomaly detection scenario.
The dataset is inspected (shape and class counts) so you can see the imbalance before any splitting happens.
A Random Forest model is defined. It is not trained yet. cross_val_score will train and evaluate it separately on each fold.
The split objects (kf_splitter, skf_splitter) are created as lists of fold pairs. Each pair is (train_indices, val_indices). They are printed so you can see exactly which rows go where.
The per-fold class 1 fraction is printed for both standard and stratified splitting. This is where the difference becomes visible: standard folds will have uneven class fractions, stratified folds will all be close to 10%.
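The three CV scores are compared. The standard and stratified means will usually be close, but the standard per-fold scores tend to be noisier because the number of class 1 samples in each validation fold varies. The repeated run averages 50 fold scores instead of 10, so its mean depends less on any one random partition and is the most stable estimate of the three. The printed standard deviation is the spread across individual folds, which repetition does not necessarily shrink.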

Replace X and y with your own data to use this as a diagnostic on any classification problem.

from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

# Imbalanced dataset: 90% class 0, 10% class 1
X, y = make_classification(
    n_samples=500, n_features=10,
    weights=[0.9, 0.1],
    random_state=42
)

# --- Inspect the dataset before running any CV ---
print(f"Dataset shape:       {X.shape}")                          # (500, 10)
print(f"Class 0 count:       {(y == 0).sum()} ({(y==0).mean():.0%})")  # ~90%
print(f"Class 1 count:       {(y == 1).sum()} ({(y==1).mean():.0%})")  # ~10%

# --- Inspect the model before fitting ---
model = RandomForestClassifier(n_estimators=100, random_state=42)
# n_estimators=100: 100 decision trees in the ensemble
# random_state=42: makes tree construction reproducible (affects feature/split sampling)
# No max_depth set — trees grow until all leaves are pure (may overfit, fine for CV demo)
# cross_val_score will refit this model from scratch on each fold's training set
print(f"Model: {model}")  # prints all default hyperparameters — useful to know what you are actually running

# --- Inspect per-fold class distribution to see what each CV method sees ---

# .split() returns a generator of (train_indices, val_indices) tuples, one per fold
# convert to list so we can inspect it, index into it, and reuse it
kf_splitter  = list(KFold(n_splits=10, shuffle=True, random_state=42).split(X, y))
skf_splitter = list(StratifiedKFold(n_splits=10, shuffle=True, random_state=42).split(X, y))

# each element is a tuple: (train_idx array, val_idx array)
print(f"Number of folds:          {len(kf_splitter)}")           # 10
print(f"Type of one element:      {type(kf_splitter[0])}")       # tuple
print(f"Train indices (fold 1):   {kf_splitter[0][0][:5]} ...")  # first 5 train indices
print(f"Val   indices (fold 1):   {kf_splitter[0][1][:5]} ...")  # first 5 val indices
print(f"Train size (fold 1):      {len(kf_splitter[0][0])}")     # ~450
print(f"Val   size (fold 1):      {len(kf_splitter[0][1])}")     # ~50

print("\n--- Standard K-Fold: class 1 fraction per fold ---")
for fold, (train_idx, val_idx) in enumerate(kf_splitter):
    val_class1_frac = y[val_idx].mean()
    print(f"  Fold {fold+1}: val size={len(val_idx)}, class 1 fraction={val_class1_frac:.2%}")
# Fractions will vary fold to fold — some folds may have very few class 1 samples

print("\n--- Stratified K-Fold: class 1 fraction per fold ---")
for fold, (train_idx, val_idx) in enumerate(skf_splitter):
    val_class1_frac = y[val_idx].mean()
    print(f"  Fold {fold+1}: val size={len(val_idx)}, class 1 fraction={val_class1_frac:.2%}")
# Fractions will be consistent (~10%) across all folds — this is the point of stratification

# --- Compare CV scores ---
# pass the splitter list directly to cv= so the same fold assignments are reused
kf_scores  = cross_val_score(model, X, y, cv=kf_splitter)
skf_scores = cross_val_score(model, X, y, cv=skf_splitter)

rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
repeated_scores = cross_val_score(model, X, y, cv=rskf)

print(f"\nStandard   10-fold: {kf_scores.mean():.4f} ± {kf_scores.std():.4f}")
print(f"Stratified 10-fold: {skf_scores.mean():.4f} ± {skf_scores.std():.4f}")
print(f"Repeated   10-fold: {repeated_scores.mean():.4f} ± {repeated_scores.std():.4f}")
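# Standard and stratified means are usually close; standard per-fold scores tend to be
# noisier because the minority-class count in each validation fold varies.
# The repeated mean averages 50 fold scores, so it depends less on any one partition.
# Its std is still the spread across folds and need not be smaller than a single run's.
# Accuracy is lenient here: always predicting class 0 would already score about 0.90.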