Contents
  1. Algorithms as Random Variables
  2. Bagging
  3. Random Forests
  4. Bias-Variance Decomposition
  5. AdaBoost
  6. Bagging vs Boosting
← All posts

Ensemble Learning: Bagging, Random Forests, and AdaBoost

Bagging reduces variance by averaging over bootstrapped models. Random forests decorrelate the trees further by subsampling features. AdaBoost reduces bias sequentially by upweighting misclassified examples.

Algorithms as Random Variables

Treat each trained model as a random variable xix_i with variance σ2\sigma^2. The average of nn independent models has variance:

Var(xˉ)=Var ⁣(1nxi)=σ2n\text{Var}(\bar{x}) = \text{Var}\!\left(\frac{1}{n}\sum x_i\right) = \frac{\sigma^2}{n}

Averaging reduces variance by nn. This is the core motivation for ensemble methods.

When models are identically distributed but correlated with correlation ρ\rho:

Var(xˉ)=ρσ2+1ρnσ2\text{Var}(\bar{x}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2

As nn \to \infty, the second term vanishes but ρσ2\rho\sigma^2 remains. Reducing correlation between models (ρ\rho \downarrow) further reduces variance.

Bagging

Bagging (Bootstrap Aggregating) draws BB bootstrap samples (sampling with replacement) from the training data, fits a model on each, and averages predictions.

  • Drives down ρ\rho between models.
  • Increases MM (number of models), reducing variance.
  • Trade-off: bootstrap samples are correlated with each other, so bias increases slightly as the models become increasingly similar.

Decision trees have high variance and low bias, making them good candidates for bagging.

Random Forests

Random forests extend bagging by also subsampling features at each split. This further decorrelates the trees.

Algorithm (for each tree b=1,,Bb = 1, \ldots, B):

  1. Draw a bootstrap sample of size NN from the data (with replacement).
  2. Grow a tree recursively until minimum node size, at each split:
    • Select mm variables at random from pp total variables.
    • Pick the best variable and split point among the mm candidates.
    • Split the node to children.
  3. For classification: majority vote across all trees. For regression: average predictions.

As forest size increases, overfitting can increase because decision boundaries from individual trees may overlap more. Overlapping margin areas, combined with noise, lead to more errors.

Key hyperparameter: mm, the number of features sampled per split. Smaller mm reduces correlation further but may hurt individual tree accuracy.

Bias-Variance Decomposition

E[Eerr]=E[(g^p(x)gˉ(x))2]+(gˉ(x)f(x))2+σ2\mathbb{E}[E_\text{err}] = \mathbb{E}\left[(\hat{g}^p(x) - \bar{g}(x))^2\right] + (\bar{g}(x) - f(x))^2 + \sigma^2

The three terms are variance, bias squared, and irreducible noise. Bagging addresses variance. Boosting addresses bias.

AdaBoost

AdaBoost builds an ensemble sequentially. Each weak learner corrects the errors of the previous one by upweighting misclassified examples.

Algorithm:

  1. Initialise weights wi=1/Nw_i = 1/N for all NN training points.
  2. For each iteration m=1,,Mm = 1, \ldots, M:
    • Choose the weak classifier hmh_m that minimises weighted error εm\varepsilon_m.
    • Compute classifier weight: αm=12ln ⁣(1εmεm)\alpha_m = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_m}{\varepsilon_m}\right)
    • Update sample weights: increase weight for misclassified points, decrease for correct, then renormalise.
  3. Final classifier: H(x)=sign ⁣[m=1Mαmhm(x)]H(x) = \text{sign}\!\left[\sum_{m=1}^{M} \alpha_m h_m(x)\right]

For regression, use averaging instead of the sign function.

A weak learner is any classifier that performs better than random guessing (accuracy >0.5> 0.5). Do not confuse with lazy learners (e.g. k-NN), which are a different concept.

Bagging vs Boosting

PropertyBaggingBoosting
Trees trainedIn parallelSequentially
TargetsVariance reductionBias reduction
Sample weightingUniform (bootstrap)Adaptive (error-based)
ExampleRandom ForestAdaBoost, Gradient Boosting
Feature samplingYes (Random Forest)No (standard AdaBoost)
← All posts