Optimizing hyperparameters for machine learning models is a key step in making accurate predictions. Hyperparameters define characteristics of the model that can impact model accuracy and computational efficiency. They are typically set prior to fitting the model to the data. In contrast, parameters are values estimated during the training process that allow the model to fit the data. Hyperparameters are often optimized through trial and error; multiple models are fit with a variety of hyperparameter values, and their performance is compared.
For random forest algorithms, one can adjust a variety of key attributes that define model structure. A comprehensive list can be found in the scikit-learn documentation for the random forest classifier (RandomForestClassifier). The following five hyperparameters are commonly adjusted:
n_estimators
Random forest models are ensembles of decision trees, and n_estimators sets the number of decision trees in the forest. Additional decision trees typically improve model accuracy because predictions are based on a larger number of “votes” from diverse trees. However, large numbers of trees are computationally expensive.
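As a minimal sketch of what the number of trees controls, the example below fits a small forest on synthetic data (the dataset and the value of 25 trees are arbitrary choices for illustration). It uses a regressor, where the “votes” are combined by averaging the predictions of the individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# One fitted decision tree is stored per requested estimator.
n_trees = len(forest.estimators_)

# For a regressor, the forest prediction is the mean of the tree predictions.
tree_predictions = np.stack([tree.predict(X) for tree in forest.estimators_])
mean_of_trees = tree_predictions.mean(axis=0)
forest_predictions = forest.predict(X)

print(n_trees)
print(np.allclose(mean_of_trees, forest_predictions))
```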
max_features
At each split, a random forest considers only a random subset of the features when searching for the best split. max_features sets the size of that subset. Larger max_features values can improve model performance because trees have a larger selection of features from which to choose the best split, but they can also make the trees less diverse and induce overfitting. As with the other hyperparameters, the goal is to identify a value that balances overfitting and underfitting. Common choices include:
‘auto’: in older scikit-learn releases, all features for a regressor and the square root of the number of features for a classifier (this option was deprecated in version 1.1 and later removed; None now requests all features),
‘sqrt’: square root of the total number of features,
‘log2’: base two logarithm of the total number of features.
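To make these choices concrete, the sketch below computes how many features would be considered at each split for a given setting. The helper `features_per_split` is hypothetical and not part of scikit-learn; it assumes the result is truncated to an integer with a minimum of one, which is my understanding of how scikit-learn rounds these values.

```python
import math


def features_per_split(n_features, max_features):
    """Hypothetical helper: number of features considered at each split."""
    if max_features is None:
        # No restriction: every feature is a candidate at every split.
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported max_features: {max_features!r}")


for setting in (None, "sqrt", "log2"):
    print(setting, features_per_split(64, setting))
```

With 64 features, this gives 64 candidates per split for None, 8 for ‘sqrt’, and 6 for ‘log2’.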
max_depth
Each tree in the random forest model makes multiple splits to isolate homogeneous groups of outcomes. max_depth caps how many levels of splits each tree may have. Allowing deeper trees enables them to explain more variation in the data, but trees with many splits may overfit the data. A range of depth values should be evaluated, including None, where nodes are split until all the leaves are pure or contain fewer samples than min_samples_split.
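The effect of the depth limit can be checked directly on fitted trees. The sketch below, on synthetic data with an arbitrary limit of 3, compares the depths reached with and without a cap.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

shallow = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0).fit(X, y)
unlimited = RandomForestRegressor(n_estimators=10, max_depth=None, random_state=0).fit(X, y)

# get_depth() reports the depth actually reached by each fitted tree.
shallow_depths = [tree.get_depth() for tree in shallow.estimators_]
unlimited_depths = [tree.get_depth() for tree in unlimited.estimators_]

print(max(shallow_depths))
print(max(unlimited_depths))
```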
min_samples_split
This hyperparameter sets the minimum number of samples a node must contain before it can be split. Values that are too large may cause underfitting, as the trees won’t be able to split enough times to achieve node purity. Candidate values should be chosen relative to the number of records in the training dataset.
min_samples_leaf
Much like stopping the growth of trees once a node has too few samples to split, we can set the minimum number of samples required in each leaf. With values that are too large, the trees may not be able to split enough to capture sufficient variation in the data. Optimal values for this hyperparameter depend on the size of the training set.
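Both sample-count limits restrict how far trees can grow. As a rough illustration, the sketch below counts the total number of leaves in a small forest under the defaults and under each constraint. The thresholds of 40 and 20 and the helper `total_leaves` are arbitrary choices for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)


def total_leaves(**kwargs):
    """Fit a small forest and return the number of leaves summed over its trees."""
    forest = RandomForestRegressor(n_estimators=10, random_state=0, **kwargs).fit(X, y)
    return sum(tree.get_n_leaves() for tree in forest.estimators_)


leaves_default = total_leaves()  # defaults: min_samples_split=2, min_samples_leaf=1
leaves_split = total_leaves(min_samples_split=40)
leaves_leaf = total_leaves(min_samples_leaf=20)

print(leaves_default, leaves_split, leaves_leaf)
```

Fewer leaves means coarser trees, which is the mechanism behind the underfitting risk described above.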
Cross-validation is often used to determine the optimal values for hyperparameters; we want to identify a model structure that performs best on records it has not been trained on. A variety of hyperparameter values should be considered. For example, suppose several candidate values are chosen for each of the five hyperparameters above.
Evaluating every possible combination of the candidate values would require testing 2,250 candidate models. With 5-fold cross-validation, that means fitting 11,250 models, a large computational expense. To reduce the computation time, a common technique is to randomly sample a specified number of the possible candidate models. This is possible using scikit-learn’s “RandomizedSearchCV” class. First, set up a dictionary of the candidate hyperparameter values.
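The original candidate values are not reproduced here, so the dictionary below is a hypothetical stand-in: the specific numbers are assumptions, chosen only so that the grid has 5 × 3 × 6 × 5 × 5 = 2,250 combinations and the arithmetic above can be checked.

```python
import math

# Hypothetical candidate values (assumed for illustration, not the article's originals).
param_distributions = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [5, 10, 15, 20, 25, None],
    "min_samples_split": [2, 5, 10, 15, 20],
    "min_samples_leaf": [1, 2, 5, 10, 15],
}

# Number of distinct combinations is the product of the list lengths.
n_combinations = math.prod(len(values) for values in param_distributions.values())

# Each combination is fit once per cross-validation fold.
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations)  # 2250
print(n_fits)  # 11250
```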
Next, define the model type, in this case a random forest regressor. RandomizedSearchCV takes the model object, the candidate hyperparameters (param_distributions), the number of random candidate models to evaluate (n_iter), and the number of folds for the cross-validation (cv).
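A minimal sketch of this setup might look like the following. The shortened candidate dictionary, the choice of 100 iterations, and the fixed random seeds are assumptions made for illustration; nothing is fit at this stage.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Shortened hypothetical candidate values (assumed for illustration).
param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [5, 15, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 15],
}

model = RandomForestRegressor(random_state=0)

search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    n_iter=100,      # number of randomly sampled candidate models
    cv=5,            # number of cross-validation folds
    random_state=0,  # makes the random sampling reproducible
)

print(search.n_iter, search.cv)
```

With these settings, 100 candidates and 5 folds give 500 model fits, plus one final refit of the best candidate on the full training data (the default refit behavior).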
Finally, fit the RandomizedSearchCV object to the data frames containing features and labels and print the best hyperparameter values found among the sampled candidates. Depending on the hyperparameter values chosen, the number of iterations, and the number of cross-validation folds, this step can take a few minutes.
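An end-to-end sketch of the fitting step is shown below. It is deliberately scaled down so it runs in seconds: the data are synthetic NumPy arrays rather than data frames, and the candidate values, 5 iterations, and 3 folds are all assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic features and labels standing in for the article's data frames.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Small hypothetical grid (assumed for illustration) to keep the run fast.
param_distributions = {
    "n_estimators": [10, 20],
    "max_features": ["sqrt", None],
    "max_depth": [3, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# Best sampled combination and its mean cross-validated score
# (R^2 by default for a regressor).
best_params = search.best_params_
n_candidates = len(search.cv_results_["params"])

print(best_params)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all of the training data with the best sampled hyperparameters, and `search.cv_results_` records the score of every candidate that was tried.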
Random forest models typically perform well with default hyperparameter values. However, to achieve maximum accuracy, optimization techniques can be worthwhile. There are additional hyperparameters available to tune that can improve model accuracy and computational efficiency; this article touches on five that are commonly optimized. To learn more about tuning random forest models, see scikit-learn’s documentation.