In machine learning, prediction accuracy is a key factor in building solid, reliable models. Ensemble learning is a supervised machine-learning technique that combines multiple models into a single, more powerful and robust model. The idea is that by combining the strengths of several models, the ensemble generalizes better and is less likely to overfit the data. It can be used for both classification and regression tasks.
Ensemble learning techniques can be grouped into three main categories:
- Bagging (Bootstrap Aggregating)
- Boosting
- Stacking (Stacked Generalization)
Bagging is a supervised machine-learning technique that can be used for both regression and classification tasks. In this article, we will discuss the bagging classifier.
Bagging Classifier
Bagging (or Bootstrap Aggregating) is a type of ensemble learning in which multiple base models are trained independently and in parallel on different subsets of the training data. Each subset is generated using bootstrap sampling, in which data points are picked at random with replacement. In a bagging classifier, the final prediction is made by aggregating the predictions of all the base models using majority voting. For regression, the final prediction is the average of the base models' predictions, and that variant is known as bagging regression.
Bagging helps improve accuracy and reduce overfitting, especially in models that have high variance.
How Does a Bagging Classifier Work?
The basic steps of how a bagging classifier works are as follows:
- Bootstrap Sampling: 'n' subsets of the original training data are sampled at random with replacement. This step ensures that the base models are trained on diverse subsets of the data, since some samples may appear multiple times in a given subset while others are omitted. It reduces the risk of overfitting and improves the accuracy of the model.
Let's break it down step by step:
Original training dataset: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Resampled training set 1: [2, 3, 3, 5, 6, 1, 8, 10, 9, 1]
Resampled training set 2: [1, 1, 5, 6, 3, 8, 9, 10, 2, 7]
Resampled training set 3: [1, 5, 8, 9, 2, 10, 9, 7, 5, 4]
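Resampling like this can be sketched with NumPy. The exact subsets drawn depend on the random seed, so the output will differ from the hand-worked sets above, but each resampled set has the same size as the original and may contain repeated elements:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Draw three bootstrap samples: same size as the original, with replacement
for i in range(3):
    indices = rng.integers(0, len(data), size=len(data))
    print(f"Resampled training set {i + 1}:", data[indices])
```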
- Base Model Training: In bagging, multiple base models are used. After bootstrap sampling, each base model is independently trained on a different bootstrapped subset of the data using a specific learning algorithm, such as decision trees, support vector machines, or neural networks. These models are often called "weak learners" because they may not be highly accurate on their own. Since each base model is trained independently on its own subset of the data, the base models can be trained in parallel, which makes the method computationally efficient and less time-consuming.
- Aggregation: Once all the base models are trained, each is used to make predictions on unseen data. In the bagging classifier, the predicted class label for a given instance is chosen by majority voting: the class that receives the most votes across the base models is the ensemble's prediction.
- Out-of-Bag (OOB) Evaluation: During bootstrapping, some samples are excluded from the training subset of a particular base model. These "out-of-bag" samples can be used to estimate the model's performance without the need for separate cross-validation.
- Final Prediction: After aggregating the predictions from all the base models, Bagging produces a final prediction for each instance.
Algorithm for the Bagging classifier:
Classifier generation:
    Let N be the size of the training set.
    For each of t iterations:
        Sample N instances with replacement from the original training set.
        Apply the learning algorithm to the sample.
        Store the resulting classifier.

Classification:
    For each of the t classifiers:
        Predict the class of the instance using the classifier.
    Return the class that was predicted most often.
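The "class that was predicted most often" step can be illustrated on its own with a small, made-up prediction matrix (rows are base classifiers, columns are instances); `np.bincount(...).argmax()` on each column returns the winning class:

```python
import numpy as np

# Hypothetical predictions from 5 base classifiers for 4 instances
predictions = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [1, 1, 2, 0],
    [0, 2, 2, 1],
    [0, 1, 2, 1],
])

# For each instance (column), return the class predicted most often
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                               axis=0, arr=predictions)
print(majority)  # → [0 1 2 1]
```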
Python implementation of the Bagging classifier algorithm:
BaggingClassifier
Define the BaggingClassifier class with base_classifier and n_estimators as input parameters for the constructor.
- Initialize the class attributes base_classifier, n_estimators, and an empty list classifiers to store the trained classifiers.
- Define the fit method to train the bagging classifiers:
- For each iteration from 0 to n_estimators - 1:
- Perform bootstrap sampling with replacement by randomly selecting len(X) indices from the range of len(X).
- Create new subsets X_sampled and y_sampled using the selected indices.
- Create a new instance of base_classifier for this iteration.
- Train the classifier on the sampled data X_sampled and y_sampled.
- Append the trained classifier to the list classifiers.
- Return the list of trained classifiers.
- Define the predict method to make predictions using the ensemble of classifiers:
- For each classifier in the classifiers list, use the trained classifier to predict the classes of the input data X.
- Aggregate the predictions using majority voting to get the final predictions.
- Return the final predictions.
Implementations
Python3
import numpy as np


class BaggingClassifier:
    def __init__(self, base_classifier, n_estimators):
        self.base_classifier = base_classifier
        self.n_estimators = n_estimators
        self.classifiers = []

    def fit(self, X, y):
        for _ in range(self.n_estimators):
            # Bootstrap sampling with replacement
            indices = np.random.choice(len(X), len(X), replace=True)
            X_sampled = X[indices]
            y_sampled = y[indices]

            # Create a new base classifier and train it on the sampled data
            classifier = self.base_classifier.__class__()
            classifier.fit(X_sampled, y_sampled)

            # Store the trained classifier in the list of classifiers
            self.classifiers.append(classifier)
        return self.classifiers

    def predict(self, X):
        # Make predictions using all the base classifiers
        predictions = [classifier.predict(X) for classifier in self.classifiers]

        # Aggregate predictions using majority voting
        majority_votes = np.apply_along_axis(
            lambda x: np.bincount(x).argmax(), axis=0, arr=np.array(predictions))
        return majority_votes
Example
Python3
# Import the necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
digit = load_digits()
X, y = digit.data, digit.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create the base classifier
dc = DecisionTreeClassifier()
model = BaggingClassifier(base_classifier=dc, n_estimators=10)
classifiers = model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 0.9472222222222222
Let's check the accuracy of each individual base classifier:
Python3
for i, clf in enumerate(classifiers):
    y_pred = clf.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:" + str(i + 1), ':', accuracy)
Output:
Accuracy:1 : 0.8833333333333333
Accuracy:2 : 0.8361111111111111
Accuracy:3 : 0.85
Accuracy:4 : 0.85
Accuracy:5 : 0.8388888888888889
Accuracy:6 : 0.8388888888888889
Accuracy:7 : 0.8472222222222222
Accuracy:8 : 0.8222222222222222
Accuracy:9 : 0.8527777777777777
Accuracy:10 : 0.8111111111111111
Python3
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
digit = load_digits()
X, y = digit.data, digit.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create the base classifier
base_classifier = DecisionTreeClassifier()

# Number of base models (iterations)
n_estimators = 10

# Create the Bagging classifier
# (the parameter is named `estimator` in scikit-learn >= 1.2;
# older versions call it `base_estimator`)
bagging_classifier = BaggingClassifier(estimator=base_classifier,
                                       n_estimators=n_estimators)

# Train the Bagging classifier
bagging_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 0.9361111111111111
Advantages of Bagging Classifier
The advantages of using a Bagging Classifier are as follows:
- Improved Predictive Performance: Bagging Classifier often outperforms single classifiers by reducing overfitting and increasing predictive accuracy. By combining multiple base models, it can better generalize to unseen data.
- Robustness: Bagging reduces the impact of outliers and noise in the data by aggregating predictions from multiple models. This enhances the overall stability and robustness of the model.
- Reduced Variance: Since each base model is trained on different subsets of the data, the aggregated model’s variance is significantly reduced compared to an individual model.
- Parallelization: Bagging allows for parallel processing, as each base model can be trained independently. This makes it computationally efficient, especially for large datasets.
- Flexibility: Bagging Classifier is a versatile technique that can be applied to a wide range of machine learning algorithms, including decision trees, random forests, and support vector machines.
Applications of Bagging Classifier
Bagging Classifier can be applied in various real-world tasks:
- Fraud Detection: Bagging Classifier can be used to detect fraudulent transactions by aggregating predictions from multiple fraud detection models.
- Spam filtering: Bagging classifier can be used to filter spam emails by aggregating predictions from multiple spam filters trained on different subsets of the spam emails.
- Credit scoring: Bagging classifier can be used to improve the accuracy of credit scoring models by combining the predictions of multiple models trained on different subsets of the credit data.
- Image Classification: Bagging classifier can be used to improve the accuracy of image classification tasks by combining the predictions of multiple classifiers trained on different subsets of the training images.
- Natural language processing: In NLP tasks, the bagging classifier can combine predictions from multiple language models to achieve better text classification results.
Conclusion
The Bagging Classifier, as an ensemble learning technique, offers a powerful way to improve predictive performance and model robustness. By drawing on the collective wisdom of many base models, it reduces overfitting, improves generalization, and delivers solid predictions across a wide range of applications.