Random forest is a statistical ensemble algorithm commonly used in machine learning for classification and regression tasks. It builds a collection of decision trees and combines their outputs to produce a single prediction, which reduces variance compared with an individual tree and often improves accuracy.
Basic idea
During training, many trees are grown, each from a different random sample of the original data. At each split, a tree may choose only among a random subset of the available variables, which makes the trees diverse and less correlated with one another. For a new case, every tree gives a prediction, and the forest aggregates them (by majority vote for classification or by averaging for regression) to assign a label or value.
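The aggregation step can be illustrated in a few lines. This is a minimal sketch using only the Python standard library; the function names and the per-tree outputs are made up for illustration and do not come from any particular package.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the label predicted by the most trees (majority vote)."""
    # most_common(1) returns [(label, count)] for the most frequent label;
    # on a tie, the label that was seen first wins.
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the average of the per-tree numeric predictions."""
    return mean(tree_outputs)


# Hypothetical outputs of five trees for one new case.
print(aggregate_classification(["cat", "dog", "cat", "cat", "dog"]))  # cat
print(aggregate_regression([2.0, 3.0, 4.0, 3.0, 3.0]))  # 3.0
```

Some implementations average the class probabilities predicted by each tree rather than counting hard votes, so results near a tie can differ between packages.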
How it is constructed
- Draw a bootstrap sample (random sample with replacement) from the training set.
- Grow a decision tree on this sample. At each split, choose the best split among a random subset of predictors rather than all predictors.
- Repeat the two steps above to produce a large number (tens to thousands) of trees.
- Aggregate tree predictions: use majority vote for classification or mean prediction for regression.
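The construction steps above can be sketched as follows. This is an illustrative outline rather than a reference implementation: it assumes NumPy and scikit-learn are available, uses scikit-learn's `DecisionTreeClassifier` for the individual trees, and the helper names `fit_forest` and `predict_forest` as well as the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample: n_samples row indices drawn with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        # max_features="sqrt" makes the tree consider a random subset of
        # predictors at each split instead of all of them.
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Majority vote over trees; assumes labels are non-negative integers."""
    # votes has shape (n_trees, n_cases).
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    return np.array([np.bincount(column).argmax() for column in votes.T])


# Synthetic two-class data, invented for this example.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees = fit_forest(X[:300], y[:300])
accuracy = (predict_forest(trees, X[300:]) == y[300:]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

In practice a library implementation such as scikit-learn's `RandomForestClassifier` performs the same loop more efficiently and in parallel; the explicit version is only meant to make each step visible.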
Typical features and diagnostics
- Out-of-bag (OOB) error: because each tree is trained on a bootstrap sample, about 37% of the observations on average are left out of that tree's sample. Predicting each observation using only the trees that did not see it gives an estimate of prediction error without a separate hold-out set.
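- Variable importance: measures derived from the forest, such as the mean decrease in impurity or the drop in accuracy when a predictor's values are permuted, indicate which predictors contribute most to predictive performance.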
- Handles high-dimensional inputs and many correlated features reasonably well, thanks to feature subsampling at splits.
- Can accommodate missing values and mixed types (categorical and numerical) in many implementations.
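As an illustration of the first two diagnostics, the sketch below reads the OOB estimate and the variable importances from a fitted forest. It assumes scikit-learn, where `oob_score=True` and the `oob_score_` and `feature_importances_` attributes expose them; other packages use different names. The data set is synthetic and chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 predictors, of which only the first 3 carry signal
# (shuffle=False keeps the informative columns first).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# OOB accuracy: each observation is scored only by trees that did not see it.
print(f"OOB accuracy: {forest.oob_score_:.3f}")
print(f"OOB error:    {1 - forest.oob_score_:.3f}")

# Impurity-based importances, one per predictor, normalized to sum to 1.
ranked = sorted(
    enumerate(forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for index, importance in ranked[:3]:
    print(f"feature {index}: {importance:.3f}")
```

Impurity-based importances can favor predictors with many distinct values, so permutation-based importance is often checked alongside them.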
Advantages
- Resistant to overfitting in many practical problems: averaging across trees reduces variance, and adding more trees does not by itself cause overfitting.
- Works well with large numbers of predictors and complex interactions without heavy feature engineering.
- Provides internal estimates of error and variable importance, simplifying model assessment.
Limitations and cautions
- Less interpretable than a single decision tree; the ensemble does not provide a simple global model.
- Can be biased toward the majority class when classes are highly imbalanced; special care (resampling, class weights) may be needed.
- May require substantial memory and compute time when forests are very large or data are enormous.
- Not ideal for extrapolation beyond the range of the training data in regression problems.
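The extrapolation limitation is easy to demonstrate. In the sketch below (assuming NumPy and scikit-learn, with an invented linear relationship), a regression forest is trained on inputs between 0 and 10 and then asked about an input of 20. Because each tree predicts an average of training targets, the forest cannot return a value above the largest target it has seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on y = 2x for x in [0, 10].
x_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * x_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)

inside = forest.predict([[5.0]])[0]    # within the training range
outside = forest.predict([[20.0]])[0]  # far outside it; the true value is 40

print(f"x = 5:  predicted {inside:.2f}")
print(f"x = 20: predicted {outside:.2f} (largest training target: {y_train.max():.2f})")
```

A model with an explicit functional form, such as linear regression, would continue the trend here, which is why forests are sometimes combined with or replaced by such models when extrapolation matters.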
Applications
Random forests are used across many domains, including bioinformatics, remote sensing, and finance, as well as other areas where reliable predictive performance and handling of many input variables are important. They are a standard off-the-shelf method when a balance between accuracy and ease of use is desired.
Practical notes
- Key tuning choices include the number of trees, the number of variables considered at each split, and tree depth.
- Many software packages provide efficient implementations and diagnostics; users should monitor OOB error and variable importance to guide modeling decisions.
- Although random forests often perform well with default settings, model validation with cross-validation or a separate test set is still recommended.
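One way to combine the tuning choices above with cross-validation is a small grid search. The sketch assumes scikit-learn, where the three choices correspond to the `n_estimators`, `max_features`, and `max_depth` parameters; the candidate values and the synthetic data are illustrative only, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# The three tuning choices, each with two illustrative candidate values.
param_grid = {
    "n_estimators": [50, 200],      # number of trees
    "max_features": ["sqrt", 0.5],  # variables considered at each split
    "max_depth": [None, 5],         # tree depth (None grows trees fully)
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print("best settings:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
```

The cross-validated score of the selected settings is slightly optimistic because the same folds were used to choose them, so a final check on a separate test set remains useful.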