An ensemble is a group of elements viewed as a whole rather than individually. An ensemble method trains multiple models and combines their predictions to solve a single problem. Combining models in this way tends to improve the robustness and generalizability of the final predictor. In this article, we will discuss several ensemble methods along with their implementation in Python. For this, we choose a dataset from the UCI repository.
Basic ensemble methods
1. Averaging method: It is mainly used for regression problems. The method consists of building multiple models independently and returning the average of the predictions of all the models. In general, the combined output tends to be better than an individual output because averaging reduces the variance of the prediction.
In the below example, three regression models (linear regression, xgboost, and random forest) are trained and their predictions are averaged. The final prediction output is pred_final.
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (all columns except the target)
train = df.drop(columns="target")

# splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# training all the models on the training dataset
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)

# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)

# final prediction after averaging the predictions of all 3 models
pred_final = (pred_1 + pred_2 + pred_3) / 3.0

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
Output:
4560
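The variance-reduction claim made for the averaging method can be checked with a small simulation. The sketch below is not tied to the dataset used in this article: it assumes three hypothetical "models" whose errors are independent, zero-mean, and of equal scale, which is an idealization. In that setting the variance of the average is, in theory, one third of the variance of a single model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the true value is 10, and each of three "models"
# predicts it with independent, zero-mean noise (standard deviation 2).
true_value = 10.0
n_trials = 10_000
preds = true_value + rng.normal(0.0, 2.0, size=(n_trials, 3))

# Variance of one model's prediction versus the average of all three.
var_single = preds[:, 0].var()
var_average = preds.mean(axis=1).var()

print(f"single model variance:   {var_single:.3f}")
print(f"averaged model variance: {var_average:.3f}")
```

Real models trained on the same data make correlated errors, so the reduction seen in practice is usually smaller than in this idealized case.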
2. Max voting: It is mainly used for classification problems. The method consists of building multiple models independently and getting their individual output called ‘vote’. The class with maximum votes is returned as output.
In the example below, three classification models (logistic regression, XGBoost, and random forest) are combined using scikit-learn's VotingClassifier. The combined model is trained, and the class with the most votes is returned as output. The final prediction output is pred_final. Note that this is a classification task, not a regression task, so its evaluation score is not comparable with the errors reported for the other ensemble methods.
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# importing voting classifier
from sklearn.ensemble import VotingClassifier

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["Weekday"]

# getting train data from the dataframe (all columns except the target)
train = df.drop(columns="Weekday")

# splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing all the model objects with default parameters
model_1 = LogisticRegression()
model_2 = XGBClassifier()
model_3 = RandomForestClassifier()

# making the final model using voting classifier
final_model = VotingClassifier(
    estimators=[('lr', model_1), ('xgb', model_2), ('rf', model_3)],
    voting='hard')

# training all the models on the train dataset
final_model.fit(X_train, y_train)

# predicting the output (class labels) on the test dataset
pred_final = final_model.predict(X_test)

# hard voting returns class labels rather than probabilities, so the
# prediction is scored with accuracy; log loss would need probabilities
print(accuracy_score(y_test, pred_final))
Output:
The script prints the accuracy on the validation split, a value between 0 and 1 that depends on the dataset and on the random split.
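To make the voting step itself concrete, the following sketch counts votes by hand with NumPy. The function name `majority_vote` and the small vote table are made up for illustration, and the sketch assumes class labels are non-negative integers. Ties are resolved in favour of the smallest label.

```python
import numpy as np

def majority_vote(votes):
    """Return the most frequent label for each sample.

    `votes` has shape (n_models, n_samples) and holds non-negative
    integer class labels. Ties go to the smallest label.
    """
    votes = np.asarray(votes)
    n_classes = votes.max() + 1
    # counts[c, j] = number of models that voted class c for sample j
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

# Hypothetical votes of three models on four samples.
votes = np.array([
    [0, 1, 2, 1],  # model 1
    [0, 2, 2, 0],  # model 2
    [1, 1, 2, 2],  # model 3
])
print(majority_vote(votes))  # prints [0 1 2 0]
```

The last sample receives one vote for each class, so the tie rule decides the outcome there.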
Let’s have a look at some more advanced ensemble methods.
Advanced ensemble methods
Ensemble methods are extensively used in classical machine learning. Random forest and the bagging meta-estimator are examples of algorithms that use bagging, while GBM, XGBM, and AdaBoost are examples of algorithms that use boosting.
Ensemble methods are worth considering whenever you develop a machine learning model: they are used extensively in competitions and research papers.
1. Stacking: It is an ensemble method that combines multiple models (classification or regression) via a meta-model (meta-classifier or meta-regressor). The base models are trained on the complete training dataset, and the meta-model is then trained on the predictions returned by the base models, which serve as its input features. The base models in stacking are typically of different types. The meta-model learns how best to combine the base-model predictions.
Algorithm:
- Split the train dataset into n parts.
- A base model (say, linear regression) is fitted on n-1 parts and predictions are made for the nth part. This is done for each of the n parts of the train set.
- The base model is then fitted on the whole train dataset.
- This model is used to predict the test dataset.
- Steps 2 to 4 are repeated for another base model, which results in another set of predictions for the train and test datasets.
- The predictions on the train dataset are used as features to build the new model.
- This final model is used to make the predictions on the test dataset.
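The steps above can be sketched by hand with scikit-learn's `KFold`. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_regression`, not the dataset used in this article), and the two base models and the fold count of 4 are arbitrary choices.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the article's dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

base_models = [LinearRegression(),
               DecisionTreeRegressor(max_depth=4, random_state=0)]
kfold = KFold(n_splits=4, shuffle=True, random_state=0)

# One column of stacking features per base model.
s_train = np.zeros((X_train.shape[0], len(base_models)))
s_test = np.zeros((X_test.shape[0], len(base_models)))

for j, model in enumerate(base_models):
    # Steps 1-2: fit on n-1 parts, predict the held-out part.
    for fit_idx, hold_idx in kfold.split(X_train):
        fold_model = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        s_train[hold_idx, j] = fold_model.predict(X_train[hold_idx])
    # Steps 3-4: refit on the whole train set and predict the test set.
    s_test[:, j] = clone(model).fit(X_train, y_train).predict(X_test)

# Last two steps: the second-level model is trained on the train-set
# predictions and applied to the test-set predictions.
meta_model = LinearRegression().fit(s_train, y_train)
stack_mse = mean_squared_error(y_test, meta_model.predict(s_test))
print(stack_mse)
```

Because every row of `s_train` is predicted by a model that never saw that row, the second-level model is trained on predictions that resemble what the base models produce on unseen data.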
Stacking differs from the basic ensemble methods in that it has first-level and second-level models. The stacking features are first computed by training all the first-level models on the dataset. A second-level model is then trained on the train stacking features, and this model predicts the final output from the test stacking features.
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression

# importing stacking lib
from vecstack import stacking

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (all columns except the target)
train = df.drop(columns="target")

# splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# putting all base model objects in one list
all_models = [model_1, model_2, model_3]

# computing the stack features (argument order: X_train, y_train, X_test)
s_train, s_test = stacking(all_models, X_train, y_train, X_test,
                           regression=True, n_folds=4)

# initializing the second-level model as a fresh object,
# separate from the base models
final_model = LinearRegression()

# fitting the second-level model with stack features
final_model = final_model.fit(s_train, y_train)

# predicting the final output from the test stack features
pred_final = final_model.predict(s_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
Output:
4510
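If an extra dependency is not wanted, scikit-learn ships its own stacking estimator, `StackingRegressor`, which performs the out-of-fold step internally. The sketch below runs on synthetic data from `make_regression` (an assumption, not the dataset used in this article) with two arbitrarily chosen base models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the article's dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Base models are given as (name, estimator) pairs; the final estimator
# is trained on their cross-validated predictions (4 folds here).
stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=4,
)
stack.fit(X_train, y_train)

sk_stack_mse = mean_squared_error(y_test, stack.predict(X_test))
print(sk_stack_mse)
```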
2. Blending: It is similar to the stacking method explained above, but rather than using the whole training dataset to generate the base-model predictions for the second-level model, a separate validation (holdout) dataset is set aside for that purpose.
Algorithm:
- Split the training dataset into train, validation and test datasets.
- Fit all the base models using the train dataset.
- Make predictions on the validation and test datasets.
- These predictions are used as features (meta-features) to build a second-level model.
- This model is used to make the final predictions on the test dataset from its meta-features.
Python3
# importing utility modules
import pandas as pd
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression

# importing train test split
from sklearn.model_selection import train_test_split

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (all columns except the target)
train = df.drop(columns="target")

# ratios for the train, validation and test split
train_ratio = 0.70
validation_ratio = 0.20
test_ratio = 0.10

# performing train test split
x_train, x_test, y_train, y_test = train_test_split(
    train, target, test_size=1 - train_ratio)

# performing test validation split
x_val, x_test, y_val, y_test = train_test_split(
    x_test, y_test, test_size=test_ratio / (test_ratio + validation_ratio))

# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# training first model and predicting on validation and test data
model_1.fit(x_train, y_train)
val_pred_1 = model_1.predict(x_val)
test_pred_1 = model_1.predict(x_test)

# converting to dataframe; the row index is kept so that concat aligns rows
val_pred_1 = pd.DataFrame(val_pred_1, index=x_val.index, columns=["pred_1"])
test_pred_1 = pd.DataFrame(test_pred_1, index=x_test.index, columns=["pred_1"])

# training second model
model_2.fit(x_train, y_train)
val_pred_2 = model_2.predict(x_val)
test_pred_2 = model_2.predict(x_test)

# converting to dataframe
val_pred_2 = pd.DataFrame(val_pred_2, index=x_val.index, columns=["pred_2"])
test_pred_2 = pd.DataFrame(test_pred_2, index=x_test.index, columns=["pred_2"])

# training third model (predictions must come from model_3, not model_1)
model_3.fit(x_train, y_train)
val_pred_3 = model_3.predict(x_val)
test_pred_3 = model_3.predict(x_test)

# converting to dataframe
val_pred_3 = pd.DataFrame(val_pred_3, index=x_val.index, columns=["pred_3"])
test_pred_3 = pd.DataFrame(test_pred_3, index=x_test.index, columns=["pred_3"])

# concatenating validation dataset along with all the predicted
# validation data (meta features)
df_val = pd.concat([x_val, val_pred_1, val_pred_2, val_pred_3], axis=1)
df_test = pd.concat([x_test, test_pred_1, test_pred_2, test_pred_3], axis=1)

# making the final model using the meta features
final_model = LinearRegression()
final_model.fit(df_val, y_val)

# getting the final output
pred_final = final_model.predict(df_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
Output:
4790
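A more compact variant of blending feeds only the base-model predictions to the second-level model, without the original feature columns. The sketch below illustrates that variant under stated assumptions: the data is synthetic (from `make_regression`), the split sizes are given as row counts to get a 70/20/10 split of 400 rows, and the two base models are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the article's dataset).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)

# 280 rows to fit the base models, 80 for validation, 40 for testing.
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X, y, test_size=120, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=40, random_state=0)

# Fit every base model on the fitting split only.
base_models = [LinearRegression(),
               RandomForestRegressor(n_estimators=50, random_state=0)]
for model in base_models:
    model.fit(X_fit, y_fit)

# Meta-features: one column of predictions per base model.
meta_val = np.column_stack([m.predict(X_val) for m in base_models])
meta_test = np.column_stack([m.predict(X_test) for m in base_models])

# The second-level model learns from validation predictions only.
meta_model = LinearRegression().fit(meta_val, y_val)
blend_mse = mean_squared_error(y_test, meta_model.predict(meta_test))
print(blend_mse)
```

Whether to include the original features alongside the predictions is a design choice; leaving them out keeps the second-level model small and makes its coefficients readable as weights on the base models.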
3. Bagging: It is also known as bootstrap aggregating. Base models are trained on bags so that each one sees a fair sample of the whole dataset. A bag is a sample drawn from the dataset with replacement, so that the bag has the same size as the whole dataset. The final output is formed by combining the outputs of all the base models.
Algorithm:
- Create multiple datasets from the train dataset by selecting observations with replacement.
- Run a base model on each of the created datasets independently.
- Combine the predictions of all the base models to reach the final output.
Bagging normally uses only one type of base model (the XGBoost regressor is used in the code below).
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
import xgboost as xgb

# importing bagging module
from sklearn.ensemble import BaggingRegressor

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (all columns except the target)
train = df.drop(columns="target")

# splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing the bagging model using XGBoost as base model with default
# parameters; the base model is passed positionally because its keyword is
# `base_estimator` in older scikit-learn releases and `estimator` in newer ones
model = BaggingRegressor(xgb.XGBRegressor())

# training model
model.fit(X_train, y_train)

# predicting the output on the test dataset
pred_final = model.predict(X_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
Output:
4666
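The three bagging steps listed above can also be written out by hand, which shows what a bag actually is. The sketch below is an illustration on synthetic data (from `make_regression`, an assumption) with decision trees as the base model and 25 bags, both arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the article's dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_bags = 25
n_rows = X_train.shape[0]

bag_preds = []
for _ in range(n_bags):
    # Step 1: draw row indices with replacement; the bag has as many
    # rows as the train set, so some rows repeat and others are left out.
    idx = rng.integers(0, n_rows, size=n_rows)
    # Step 2: fit an independent base model on this bag.
    tree = DecisionTreeRegressor(random_state=0).fit(X_train[idx], y_train[idx])
    bag_preds.append(tree.predict(X_test))

# Step 3: combine the base models by averaging their predictions.
bagged_pred = np.mean(bag_preds, axis=0)
bagged_mse = mean_squared_error(y_test, bagged_pred)

# A single tree fitted on the full train set, for comparison.
single_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
single_mse = mean_squared_error(y_test, single_tree.predict(X_test))

print("single tree MSE: ", single_mse)
print("bagged trees MSE:", bagged_mse)
```

High-variance base models such as unpruned trees typically benefit the most from this procedure, although the size of the gain depends on the data.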
4. Boosting: Boosting is a sequential method that aims to prevent a wrong base model from affecting the final output. Instead of combining independently trained base models, the method builds each new model in a way that depends on the previous one: a new model tries to correct the errors made by its predecessor. Each of these models is called a weak learner. The final model (the strong learner) is formed by taking the weighted mean of all the weak learners.
Algorithm:
- Initialize all data points with the same weight.
- Take a subset of the train dataset.
- Train a base model on that subset.
- Use this model to make predictions on the whole dataset.
- Calculate errors using the predicted values and the actual values.
- Assign higher weights to the incorrectly predicted data points.
- Build another model and make predictions with it in such a way that the errors made by the previous model are mitigated or corrected.
- Similarly, create multiple models, each successive model correcting the errors of the previous model.
- The final model (strong learner) is the weighted mean of all the previous models (weak learners).
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import GradientBoostingRegressor

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (all columns except the target)
train = df.drop(columns="target")

# splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing the boosting module with default parameters
model = GradientBoostingRegressor()

# training the model on the train dataset
model.fit(X_train, y_train)

# predicting the output on the test dataset
pred_final = model.predict(X_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
Output:
4789
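The weighting steps in the boosting algorithm above describe an AdaBoost-style procedure, whereas GradientBoostingRegressor fits each new model to the residual errors of the current ensemble instead of reweighting the data points. The reweighting idea can be sketched by hand for binary classification. The sketch below makes several assumptions: the data is synthetic (from `make_classification`), the weak learners are depth-1 decision trees, each round trains on the full weighted train set rather than on a subset, and 20 rounds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); labels are mapped to -1/+1.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
y = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

n_rounds = 20
# Every data point starts with the same weight.
weights = np.full(len(y_train), 1.0 / len(y_train))
learners, alphas = [], []

for _ in range(n_rounds):
    # Train a weak learner (a decision stump) on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_train, y_train, sample_weight=weights)
    pred = stump.predict(X_train)

    # Weighted error of this learner, clipped to avoid division by zero.
    err = np.clip(weights[pred != y_train].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote weight

    # Raise the weight of misclassified points and lower the rest.
    weights *= np.exp(-alpha * y_train * pred)
    weights /= weights.sum()

    learners.append(stump)
    alphas.append(alpha)

# The strong learner is the sign of the weighted sum of weak learners.
scores = sum(a * m.predict(X_test) for a, m in zip(alphas, learners))
boosted_pred = np.where(scores >= 0, 1, -1)
accuracy = (boosted_pred == y_test).mean()
print(accuracy)
```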
Note: scikit-learn provides several modules and classes for ensemble methods. Please note that the score obtained with one method in this article does not suggest that it is superior to another. The article aims to give a brief introduction to ensemble methods, not to compare them. The programmer must use a method that suits the data.