In this article, we are going to see how to handle overfitting in a Random Forest model built with Sklearn in Python.
What is overfitting?
Overfitting is a common phenomenon you should look out for any time you are training a machine learning model. Overfitting happens when a model learns the patterns as well as the noise of the data on which it is trained. Specifically, the model picks up on patterns that are specific to the observations in the training data but do not generalize to other observations. As a result, the model makes great predictions on the data it was trained on but poor predictions on data it did not see during training.
Why is overfitting a problem?
Overfitting is a problem because machine learning models are generally trained with the intention of making predictions on unseen data. A model that overfits its training dataset cannot make good predictions on new data it did not see during training, which defeats that purpose.
How do you check whether your model is overfitting to the training data?
In order to check whether your model is overfitting to the training data, you should make sure to split your dataset into a training dataset that is used to train your model and a test dataset that is not touched at all during model training. This way you will have a dataset available that the model did not see at all during training that you can use to assess whether your model is overfitting.
You should generally allocate around 70% of your data to the training dataset and 30% of your data to the test dataset. Only after you train your model on the training dataset and optimize any hyperparameters you plan to optimize should you touch the test dataset. At that point, you can use your model to make predictions on both the test data and the training data and then compare the performance metrics on the two.
If your model is overfitting to the training data, you will notice that the performance metrics on the training data are much better than the performance metrics on the test data.
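As a quick illustration of this check, here is a minimal sketch. The dataset, split ratio, and model below are assumptions chosen only for demonstration; any estimator works the same way.

Python3

# Minimal sketch: split the data roughly 70/30, then compare train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)   # dummy dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print('Training Accuracy :', train_acc * 100)
print('Test Accuracy :', test_acc * 100)
# A training accuracy much higher than the test accuracy indicates overfitting.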
How do you prevent overfitting in random forests in Python Sklearn?
Hyperparameter tuning is the usual answer whenever we want to boost the performance of a model without changing the available dataset. But before exploring which hyperparameters can help, let's understand how the random forest model works.
A random forest model is an ensemble of multiple decision trees, and by combining the results of the individual trees it achieves much better accuracy than a single tree. Based on this simple description of the model, there are several hyperparameters we can set when creating an instance of the random forest model that help us reduce overfitting.
- max_depth: This controls how deep each decision tree in the forest is allowed to grow, i.e. the maximum number of levels per tree.
- n_estimators: This controls the number of decision trees in the forest. Together with max_depth, this parameter goes a long way toward controlling overfitting.
- criterion: While a tree is being trained, the data is repeatedly split into parts; this parameter controls how the quality of each split is measured (for example 'gini' or 'entropy').
- min_samples_leaf: This determines the minimum number of samples required to be at a leaf node.
- min_samples_split: This determines the minimum number of samples required to split an internal node.
- max_leaf_nodes: This determines the maximum number of leaf nodes per tree.
There are more parameters we can tune to reduce overfitting, but the ones mentioned above are the most effective for this purpose most of the time.
Note:-
A random forest model can also be created without specifying any of these hyperparameters, because each of them has a default value; we only need to set them explicitly when we want to control the model's behaviour.
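For instance, a quick way to inspect those defaults is get_params(); this is only a sketch, and the exact default values depend on your scikit-learn version.

Python3

from sklearn.ensemble import RandomForestClassifier

# With no arguments, every hyperparameter takes its default value.
model = RandomForestClassifier()
print(model.get_params()['n_estimators'])  # 100 in recent scikit-learn versions
print(model.get_params()['max_depth'])     # None, i.e. trees grow until leaves are pure

# The same parameters can be set explicitly to constrain the trees.
model = RandomForestClassifier(n_estimators=30, max_depth=2)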
Now let us explore these hyperparameters a bit using a dataset.
Importing Libraries
Python libraries simplify data handling and related operations to a great extent.
Python3
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
We will create a dummy dataset for a classification task using sklearn.
Python3
X, y = datasets.make_classification()

# 80% of the data for training, 20% held out for validation
X_train, X_val, Y_train, Y_val = train_test_split(X, y, test_size=0.2, random_state=2022)
print(X_train.shape, X_val.shape)
Output:
(80, 20) (20, 20)
Let’s train a RandomForestClassifier on this dataset without setting any hyperparameters explicitly.
Python3
model = RandomForestClassifier()
model.fit(X_train, Y_train)
print('Training Accuracy : ', metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ', metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 100.0
Validation Accuracy : 75.0
Here we can see that the training accuracy is 100% but the validation accuracy is only 75%, which means that the model is overfitting to the training data. To solve this problem, let's first use the max_depth parameter.
Python3
model = RandomForestClassifier(max_depth=2, random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ', metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ', metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 95.0
Validation Accuracy : 75.0
From a gap of 25%, we have come down to a gap of 20% by tuning the value of just one hyperparameter. Similarly, let's use n_estimators.
Python3
model = RandomForestClassifier(n_estimators=30, random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ', metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ', metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 100.0
Validation Accuracy : 85.0
Again, by tuning another hyperparameter, we have reduced the overfitting even further. We can also combine several of these parameters.
Python3
model = RandomForestClassifier(max_depth=2,
                               n_estimators=30,
                               min_samples_split=3,
                               max_leaf_nodes=5,
                               random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ', metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ', metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 95.0
Validation Accuracy : 80.0
As shown above, we can set multiple parameters at once to reduce overfitting.
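Rather than trying combinations by hand, one common option (not used in the steps above; the grid values below are only illustrative) is to search over a small parameter grid with cross-validation using GridSearchCV. The sketch below continues with the data and imports from the earlier code.

Python3

from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are examples, not recommendations.
param_grid = {
    'max_depth': [2, 3, 5],
    'n_estimators': [30, 50, 100],
    'min_samples_split': [2, 3, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=22), param_grid, cv=5)
search.fit(X_train, Y_train)
print('Best parameters :', search.best_params_)
print('Validation Accuracy :', metrics.accuracy_score(Y_val, search.predict(X_val)) * 100)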
Conclusion
Hyperparameter tuning is all about achieving better performance with the same amount of data. In this article, we have seen how we can improve the performance of a RandomForestClassifier while addressing the problem of overfitting.