Ensemble Learning: Bagging, Random Forests, and AdaBoost
Bagging reduces variance by averaging over bootstrapped models. Random forests decorrelate the trees further by subsampling features. AdaBoost reduces bias sequentially by upweighting misclassified examples.
Algorithms as Random Variables
Treat each trained model as a random variable with variance . The average of independent models has variance:
Averaging reduces variance by . This is the core motivation for ensemble methods.
When models are identically distributed but correlated with correlation :
As , the second term vanishes but remains. Reducing correlation between models () further reduces variance.
Bagging
Bagging (Bootstrap Aggregating) draws bootstrap samples (sampling with replacement) from the training data, fits a model on each, and averages predictions.
- Drives down between models.
- Increases (number of models), reducing variance.
- Trade-off: bootstrap samples are correlated with each other, so bias increases slightly as the models become increasingly similar.
Decision trees have high variance and low bias, making them good candidates for bagging.
Random Forests
Random forests extend bagging by also subsampling features at each split. This further decorrelates the trees.
Algorithm (for each tree ):
- Draw a bootstrap sample of size from the data (with replacement).
- Grow a tree recursively until minimum node size, at each split:
- Select variables at random from total variables.
- Pick the best variable and split point among the candidates.
- Split the node to children.
- For classification: majority vote across all trees. For regression: average predictions.
As forest size increases, overfitting can increase because decision boundaries from individual trees may overlap more. Overlapping margin areas, combined with noise, lead to more errors.
Key hyperparameter: , the number of features sampled per split. Smaller reduces correlation further but may hurt individual tree accuracy.
Bias-Variance Decomposition
The three terms are variance, bias squared, and irreducible noise. Bagging addresses variance. Boosting addresses bias.
AdaBoost
AdaBoost builds an ensemble sequentially. Each weak learner corrects the errors of the previous one by upweighting misclassified examples.
Algorithm:
- Initialise weights for all training points.
- For each iteration :
- Choose the weak classifier that minimises weighted error .
- Compute classifier weight:
- Update sample weights: increase weight for misclassified points, decrease for correct, then renormalise.
- Final classifier:
For regression, use averaging instead of the sign function.
A weak learner is any classifier that performs better than random guessing (accuracy ). Do not confuse with lazy learners (e.g. k-NN), which are a different concept.
Bagging vs Boosting
| Property | Bagging | Boosting |
|---|---|---|
| Trees trained | In parallel | Sequentially |
| Targets | Variance reduction | Bias reduction |
| Sample weighting | Uniform (bootstrap) | Adaptive (error-based) |
| Example | Random Forest | AdaBoost, Gradient Boosting |
| Feature sampling | Yes (Random Forest) | No (standard AdaBoost) |