Written in March 2022.
To pay homage to my ecology background, it is only right that I begin discussing machine learning methods with random forest.
Random forest is one of my favorite algorithms - it's effective and elegant. In real life, a forest is made up of many trees. The same goes for a random forest - it is an ensemble of decision trees trained independently of each other, whose predictions are ultimately combined by majority vote. To answer the question the title of this post poses, we must start by investigating:
What’s random about random forests?
There are two elements of randomness in a random forest:
1. random sampling of training data points when building trees
2. random subsets of features considered
During training, each tree is fit on a sample of the training data drawn with replacement (bootstrapping). Because sampling is with replacement, some instances appear multiple times in a given tree's sample, while only about 2/3 of the unique instances end up being used by that tree; the remaining ~1/3 are left out (these are the tree's "out-of-bag" instances). Since each tree learns from a different bootstrapped sample, each tree learns a different part of the training data. On top of that, only a random subset of features is considered when splitting each node. At test time, the random forest aggregates the predictions of its trees: a majority vote for classification, or an average for regression. This is why you often hear the term "bagging" - it is simply a combination of two processes: bootstrapping and aggregating.
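If it helps to see those two sources of randomness concretely, here is a minimal sketch of bagging done by hand with scikit-learn decision trees. The toy dataset, the number of trees, and the max_features="sqrt" setting are my own illustrative choices, not anything prescribed by the algorithm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.default_rng(0)
n_trees, n_samples = 25, X.shape[0]
trees = []

for i in range(n_trees):
    # Randomness #1 - bootstrap: draw row indices with replacement,
    # so each tree sees only ~2/3 of the unique training instances
    idx = rng.integers(0, n_samples, size=n_samples)
    # Randomness #2 - a random subset of features is considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregating: majority vote across trees (regression would average instead)
votes = np.stack([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # binary majority vote
print("Accuracy of the hand-rolled forest on its training data:", (y_pred == y).mean())
```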
Why is randomness powerful then?
The random selection of features considered at each split keeps the correlation between the trees in the ensemble low while maintaining the strength of each individual tree. According to Leo Breiman (2001), the author of the random forest paper, this random feature selection also helps the resulting forest be relatively robust to outliers and noise. Moreover, because every tree is trained independently of the others, the algorithm can be parallelized to reduce training time.
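As a practical aside, both of these points map onto hyperparameters if you happen to use scikit-learn: max_features controls how many features each split may consider, and n_jobs exploits the trees' independence to build them in parallel. The dataset below is again a toy stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random subset of features considered at each split
    n_jobs=-1,            # trees are independent, so build them on all CPU cores
    random_state=0,
)
forest.fit(X, y)
```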
Random sampling of data points also allows us to generate ongoing estimates of the generalization error, strength, and correlation. Since every tree within the ensemble has seen a different slice of the data, the roughly 1/3 of instances a tree never saw (its out-of-bag instances) act as a built-in validation set, so we can genuinely test whether the forest generalizes to unseen data.
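In scikit-learn, this out-of-bag estimate is exposed through the oob_score option (again shown here with a hypothetical toy dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True scores each instance using only the trees that
# did NOT see it in their bootstrap sample.
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X, y)
print("Out-of-bag accuracy estimate:", forest.oob_score_)
```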
It is perhaps obvious at this point, but Breiman also pointed out that the accuracy of random forests depends on two things: the strength of the individual trees and a measure of the dependence between them.
It's worth noting that while random forest can help reduce the problem of multicollinearity among features, the variable importance of correlated features will be diluted. It is still important to use your domain knowledge, if you have it, to inform which features to use or remove. If you lack domain expertise, it is worth running different random forest experiments to test different combinations of features and see for yourself the differences in model performance and variable importance values.
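A quick way to see this dilution for yourself is to duplicate an informative feature with a slightly noisy copy and look at the impurity-based importances; the synthetic data below is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
X = np.column_stack([
    signal,                              # informative feature
    signal + 0.05 * rng.normal(size=n),  # nearly identical, highly correlated copy
    rng.normal(size=n),                  # pure noise
])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
# Expect the signal's importance to be split between the first two
# (correlated) columns, so each looks less important than it really is.
print(forest.feature_importances_)
```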
Increasing number of trees (alone) does not result in overfitting
This is perhaps a shocking and confusing statement. Many of you know that as model complexity increases, the tendency to overfit also increases. In the context of random forests, however, the number of trees is not the problem, because each tree is built independently of the others - every tree starts from scratch. If you need math to convince you, Leo Breiman (2001) proved this using the Strong Law of Large Numbers: as trees are added, the generalization error converges to a limiting value. Simply put, with all other hyperparameters held constant, beyond a sufficiently large number of trees the error rate plateaus and does not increase any further. Following up on Breiman's paper, Probst and Boulesteix (2018) also empirically concluded that we should not tune the number of trees, since more trees do not degrade performance. In other words, you should pick the number of trees based mainly on computational cost.
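For the mathematically curious, here is Breiman's convergence result (Theorem 1.2 in the 2001 paper), lightly transcribed into display math; PE* is the forest's generalization error, h(X, Θ) is a single randomized tree, and P_Θ is taken over the randomness used to grow that tree:

```latex
% Breiman (2001), Theorem 1.2: as the number of trees grows, the
% generalization error of the forest converges almost surely.
PE^{*} \xrightarrow{\ \text{a.s.}\ }
P_{X,Y}\!\left(
  P_{\Theta}\big(h(X,\Theta) = Y\big)
  - \max_{j \neq Y} P_{\Theta}\big(h(X,\Theta) = j\big)
  < 0
\right)
```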
However, Probst and Boulesteix (2018) also found that, empirically, most of the performance gain comes from the first 100 or so trees. Oftentimes we see the testing performance decrease as we increase the number of trees, even when everything else stays constant. When you see this, take it as a sign of distributional differences between the training data and the testing data. If training and testing data do share the same distribution, then perhaps your trees are too "fully grown" (too deep), and it is time to reduce the depth of the trees. Generally, you need more trees when you have:
a smaller sample size for each tree during training
tighter constraints on tree depth
more variables (since they lead to less correlated trees)
in order to reach convergence of the error rate (see the sketch below).
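As a rough way to watch this convergence happen, the following sketch tracks the out-of-bag error of a scikit-learn forest as trees are added; the dataset and the tree counts are arbitrary choices of mine:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

# warm_start=True keeps the already-fitted trees and only grows the new ones,
# so we can watch the out-of-bag error as the forest gets bigger.
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
for n_trees in [25, 50, 100, 200, 400, 800]:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    print(f"{n_trees:4d} trees -> OOB error = {1 - forest.oob_score_:.4f}")
# Expected pattern: the error drops quickly, then plateaus; adding trees
# beyond the plateau neither helps nor hurts much.
```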
Conclusion
The name "random forest" gives away the reason these models are so powerful: the randomness embedded within the algorithm. Don't blame the number of trees alone when your model overfits; check the tree depth, the sample size for each tree, and the number of variables instead.
References
Breiman, Leo (2001), Random Forests, https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
Probst, Philipp and Boulesteix, Anne-Laure (2018), To Tune or Not to Tune the Number of Trees in Random Forest, http://www.jmlr.org/papers/volume18/17-269/17-269.pdf