Ensemble Learning: Bagging, Random Forests, and AdaBoost

Algorithms as Random Variables

Treat each trained model as a random variable $x_i$ with variance $\sigma^2$. The average of $n$ independent models has variance:

\text{Var}(\bar{x}) = \text{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{\sigma^2}{n}

Averaging reduces variance by a factor of $n$. This is the core motivation for ensemble methods.

When models are identically distributed but correlated with pairwise correlation $\rho$:

\text{Var}(\bar{x}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2

As $n \to \infty$, the second term vanishes but $\rho\sigma^2$ remains, so adding models alone cannot push the variance below this floor. Reducing the correlation between models ($\rho \downarrow$) lowers the floor and further reduces variance.
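The correlated-variance formula can be checked numerically. The sketch below is illustrative only: the function names are made up for this example, and it assumes every pair of models has the same correlation $\rho$. It evaluates the closed form and compares it with the direct computation $\text{Var}(\bar{x}) = \frac{1}{n^2}\sum_{i,j}\text{Cov}(x_i, x_j)$.

```python
def variance_of_mean(sigma2: float, rho: float, n: int) -> float:
    """Closed form: rho * sigma^2 + (1 - rho) / n * sigma^2."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n


def variance_of_mean_from_covariance(sigma2: float, rho: float, n: int) -> float:
    """Direct form: sum every entry of the n x n covariance matrix, divide by n^2."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Diagonal entries are variances, off-diagonal entries are covariances.
            total += sigma2 if i == j else rho * sigma2
    return total / (n * n)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        independent = variance_of_mean(1.0, 0.0, n)
        correlated = variance_of_mean(1.0, 0.3, n)
        print(f"n={n:5d}  rho=0: {independent:.4f}  rho=0.3: {correlated:.4f}")
```

With $\rho = 0$ the variance keeps shrinking as $\sigma^2/n$, while with $\rho = 0.3$ it flattens out near $0.3\,\sigma^2$.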

Bagging

Bagging (Bootstrap Aggregating) draws $B$ bootstrap samples (sampling with replacement) from the training data, fits a model on each, and averages predictions.

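Drives down $\rho$ between models, because each model is fit on a different resample of the data.
Increases $B$ (number of models), which shrinks the $\frac{1-\rho}{n}\sigma^2$ term of the variance (with $n = B$).
Trade-off: bootstrap samples overlap heavily, so the models remain correlated ($\rho > 0$) and the variance cannot fall below $\rho\sigma^2$. Each model also sees only about 63% of the distinct training points, which can increase its bias slightly.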

Decision trees have high variance and low bias, making them good candidates for bagging.
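A minimal sketch of bagging for 1-D regression, using only the standard library. The base learner here is a one-split regression tree (a stump), chosen only to keep the example short; the names `bootstrap_sample`, `fit_stump`, and `bagged_model` are hypothetical and not taken from any library.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]        # (x, y) pair
Model = Callable[[float], float]   # maps x to a prediction


def bootstrap_sample(data: Sequence[Point], rng: random.Random) -> List[Point]:
    """Draw len(data) points uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(data: Sequence[Point]) -> Model:
    """Fit a one-split regression tree by minimising squared error."""
    best = None  # (sse, threshold, left_mean, right_mean)
    for threshold in sorted({x for x, _ in data}):
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        if not right:
            continue  # threshold at the maximum x gives no real split
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:
        # All x values identical: fall back to a constant prediction.
        const = mean(y for _, y in data)
        return lambda x: const
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm


def bagged_model(data: Sequence[Point], fit: Callable[[Sequence[Point]], Model],
                 B: int, seed: int = 0) -> Model:
    """Fit B models on B bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(data, rng)) for _ in range(B)]
    return lambda x: mean(model(x) for model in models)


if __name__ == "__main__":
    data = [(float(i), 0.0 if i < 5 else 1.0) for i in range(10)]
    model = bagged_model(data, fit_stump, B=25)
    print(model(0.0), model(9.0))
```

Each stump places its threshold slightly differently depending on which points its bootstrap sample contains, and the average smooths those differences out.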

Random Forests

Random forests extend bagging by also subsampling features at each split. This further decorrelates the trees.

Algorithm (for each tree $b = 1, \ldots, B$):

Draw a bootstrap sample of size $N$ from the data (with replacement).
Grow a tree recursively until the minimum node size is reached. At each split:
- Select $m$ variables at random from the $p$ total variables.
- Pick the best variable and split point among the $m$ candidates.
- Split the node into two child nodes.
For classification: majority vote across all trees. For regression: average predictions.
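The steps above can be sketched as a small classifier in plain Python. This is an illustration under stated assumptions, not a reference implementation: splits are scored with Gini impurity, a node is only split if that strictly lowers the impurity, labels are integers, and all function names are invented for this example.

```python
import random
from collections import Counter
from typing import List, Sequence

Row = Sequence[float]


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X: List[Row], y: List[int], features: Sequence[int]):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(y)  # a split must beat the unsplit impurity
    for f in features:
        # Dropping the largest value keeps both children non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(X: List[Row], y: List[int], m: int, min_size: int, rng: random.Random):
    """A leaf is a class label; an internal node is (feature, threshold, left, right)."""
    majority = Counter(y).most_common(1)[0][0]
    if len(y) <= min_size or len(set(y)) == 1:
        return majority
    features = rng.sample(range(len(X[0])), m)  # m of the p variables, redrawn per split
    split = best_split(X, y, features)
    if split is None:
        return majority
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, min_size, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, min_size, rng))


def predict_tree(tree, row: Row) -> int:
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def fit_forest(X: List[Row], y: List[int], B: int, m: int,
               min_size: int = 1, seed: int = 0) -> list:
    """Grow B trees, each on its own bootstrap sample of size N."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample indices with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, min_size, rng))
    return forest


def predict_forest(forest: list, row: Row) -> int:
    """Majority vote across all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

The only difference from plain bagging of trees is the `rng.sample` call inside `grow_tree`: every split sees a fresh random subset of $m$ features.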

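Adding more trees does not by itself cause overfitting: the forest's generalisation error levels off as $B$ grows, because averaging more trees only reduces variance. Overfitting comes instead from the individual trees, for example very deep trees fit to noisy data, and a larger $B$ mainly costs extra computation.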

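Key hyperparameter: $m$, the number of features sampled per split. Smaller $m$ reduces correlation further but may hurt individual tree accuracy. Common starting points are $m \approx \sqrt{p}$ for classification and $m \approx p/3$ for regression, tuned by validation.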

Bias-Variance Decomposition

\mathbb{E}[E_\text{err}] = \mathbb{E}\left[(\hat{g}^p(x) - \bar{g}(x))^2\right] + (\bar{g}(x) - f(x))^2 + \sigma^2

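The three terms are variance, bias squared, and irreducible noise. Here $\hat{g}^p(x)$ is the prediction of a model fit on one training set, $\bar{g}(x)$ is the average prediction over training sets, $f(x)$ is the true target function, and the expectation is taken over training sets. Bagging mainly addresses variance. Boosting mainly addresses bias.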
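As a small worked example, the variance and bias terms can be computed at a single input $x$ from the predictions of several models fit on different training sets. The sketch below assumes the true value $f(x)$ is known, which only holds in a simulation, and leaves out the noise term $\sigma^2$ because predictions are compared with $f(x)$ rather than with noisy observations. The function name is hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def bias_variance_at_point(predictions: Sequence[float], f_x: float) -> Tuple[float, float]:
    """Return (variance, bias_squared) of the predictions at one input x.

    predictions: outputs at x of models fit on different training sets.
    f_x: the true target value f(x).
    """
    g_bar = mean(predictions)                                # average model at x
    variance = mean((g - g_bar) ** 2 for g in predictions)   # spread around g_bar
    bias_squared = (g_bar - f_x) ** 2                        # offset of g_bar from f(x)
    return variance, bias_squared


if __name__ == "__main__":
    preds = [1.0, 2.0, 3.0]
    variance, bias_squared = bias_variance_at_point(preds, f_x=1.0)
    mse = mean((g - 1.0) ** 2 for g in preds)
    # The mean squared error against f(x) splits exactly into the two terms.
    print(variance, bias_squared, mse)
```

For predictions 1, 2, 3 and $f(x) = 1$, the variance is $2/3$, the squared bias is $1$, and their sum equals the mean squared error $5/3$.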

AdaBoost

AdaBoost builds an ensemble sequentially. After each round, the examples the current weak learner misclassified are upweighted, so the next weak learner concentrates on the errors made so far.

Algorithm:

Initialise weights $w_i = 1/N$ for all $N$ training points.
For each iteration $m = 1, \ldots, M$:
- Choose the weak classifier $h_m$ that minimises the weighted error $\varepsilon_m = \sum_i w_i \, \mathbb{1}[h_m(x_i) \neq y_i]$.
- Compute classifier weight: $\alpha_m = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_m}{\varepsilon_m}\right)$
- Update sample weights: $w_i \leftarrow w_i \exp(-\alpha_m y_i h_m(x_i))$ with labels $y_i \in \{-1, +1\}$, which increases the weight of misclassified points and decreases it for correct ones, then renormalise so the weights sum to 1.
Final classifier: $H(x) = \text{sign}\!\left[\sum_{m=1}^{M} \alpha_m h_m(x)\right]$

For regression, use averaging instead of the sign function.

A weak learner is any classifier that performs better than random guessing (for binary classification, accuracy $> 0.5$, i.e. weighted error $\varepsilon_m < 0.5$, which makes $\alpha_m > 0$). Do not confuse with lazy learners (e.g. k-NN), which are a different concept.
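The AdaBoost loop can be sketched with decision stumps as the weak learners. This is a minimal illustration under these assumptions: labels are $+1$ or $-1$, the weak learner is found by exhaustive search over features, thresholds, and polarities, and the weighted error is clamped away from 0 and 1 so the logarithm stays finite. The names `best_stump`, `adaboost`, and `predict` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, row: Sequence[float]) -> int:
    """Predict `polarity` when row[f] <= threshold, otherwise `-polarity`."""
    f, t, polarity = stump
    return polarity if row[f] <= t else -polarity


def best_stump(X: List[Sequence[float]], y: List[int], w: List[float]) -> Tuple[Stump, float]:
    """Stump with the lowest weighted error epsilon_m."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                stump = (f, t, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X: List[Sequence[float]], y: List[int], M: int) -> List[Tuple[float, Stump]]:
    """Run M boosting rounds and return a list of (alpha_m, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # uniform initial weights
    ensemble = []
    for _ in range(M):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # y * h(x) is +1 for correct points (weight shrinks) and -1 for errors (weight grows).
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]  # renormalise to sum to 1
    return ensemble


def predict(ensemble: List[Tuple[float, Stump]], row: Sequence[float]) -> int:
    """Sign of the alpha-weighted vote."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
    y = [1, 1, -1, -1, 1, 1]  # no single stump can fit this pattern
    for rounds in (1, 3):
        model = adaboost(X, y, rounds)
        print(rounds, [predict(model, row) for row in X])
```

On this toy data a single stump misclassifies at least two points, while three boosted stumps classify all six correctly, which illustrates how the weighted vote reduces bias.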

Bagging vs Boosting

Property	Bagging	Boosting
Trees trained	In parallel	Sequentially
Targets	Variance reduction	Bias reduction
Sample weighting	Uniform (bootstrap)	Adaptive (error-based)
Example	Random Forest	AdaBoost, Gradient Boosting
Feature sampling	Yes (Random Forest)	No (standard AdaBoost)