Class imbalance is common in real-world datasets. For example, a credit card fraud dataset will often contain far more records of non-fraudulent activity than of fraudulent activity. If predictive accuracy for the minority classes matters, training on imbalanced classes can hurt model performance: the model is often much more likely to predict the majority class simply because it has seen far more examples of it. We’ll explore this phenomenon and demonstrate common techniques for addressing class imbalance in Python, including random oversampling, the synthetic minority over-sampling technique (SMOTE), and random undersampling.
The barplot below illustrates a typical class imbalance in a training dataset. In this application, say we want to predict all classes equally well. For background, we have 64 features that help predict where the fish samples came from (Gulf, West, or East). When we run a random forest classifier on this training dataset, we get an overall accuracy of 75% on the test set (which has balanced classes). Let’s look at the confusion matrix.
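As a rough sketch of this baseline step, the snippet below trains a random forest on an imbalanced training set and prints the confusion matrix for a balanced test set. The data is a synthetic stand-in, not the fish dataset: the class counts, the Gaussian blobs, and the helper `make_split` are all assumptions made for illustration, so the numbers it prints will not match the 75% reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

classes = ["Gulf", "West", "East"]
rng = np.random.default_rng(0)


def make_split(counts):
    """Hypothetical data: one slightly shifted 64-feature Gaussian blob per class."""
    X = np.vstack(
        [rng.normal(loc=0.3 * i, size=(n, 64)) for i, n in enumerate(counts)]
    )
    y = np.repeat(classes, counts)
    return X, y


train_X, train_y = make_split([60, 25, 15])  # imbalanced training set (made-up counts)
test_X, test_y = make_split([20, 20, 20])    # balanced test set

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(train_X, train_y)
pred = model.predict(test_X)

print("overall accuracy:", accuracy_score(test_y, pred))

# Rows are true classes, columns are predicted classes, in the order of `classes`.
cm = confusion_matrix(test_y, pred, labels=classes)
print(cm)
```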
With imbalanced classes, the class-specific accuracy is highly variable; not surprisingly, it is highest for the majority class, Gulf. We can also see the model erroneously predicted Gulf origin for many East samples. To address this issue, there are a few techniques we can apply.
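Class-specific accuracy can be read directly off a confusion matrix: for each true class, divide the correctly predicted count (the diagonal entry) by the row total. Here is a minimal sketch of that arithmetic. The matrix values are invented for illustration and are not the results discussed above, and `per_class_accuracy` is a hypothetical helper name.

```python
# Hypothetical confusion matrix (rows = true class, columns = predicted class).
# These counts are invented and are NOT the article's results.
labels = ["Gulf", "West", "East"]
cm = [
    [18, 1, 1],   # true Gulf
    [3, 15, 2],   # true West
    [7, 2, 11],   # true East: many samples mistaken for Gulf
]


def per_class_accuracy(cm, labels):
    """Fraction of each true class that was predicted correctly."""
    return {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}


class_acc = per_class_accuracy(cm, labels)
overall_acc = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

for label, acc in class_acc.items():
    print(f"{label}: {acc:.2f}")
print(f"overall: {overall_acc:.2f}")
```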
Random Oversampling
One method is to randomly resample the minority classes (West and East) in our training dataset until each matches the size of the largest class, essentially copying random minority records. We’ll use the “imbalanced-learn” package, which has some useful functions for this purpose. We’re going to randomly resample our training dataset, with the response and features labeled “train_y” and “train_X” respectively.
We now have an equal number of samples per class. After training the model on this new dataset, we get 76% accuracy, a marginal improvement. The class-specific accuracy for the East group increased a bit, but accuracy for the West class decreased.
This dataset is very small, so oversampling likely induced some overfitting. Luckily there are other options.
Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE oversamples the minority classes without simply duplicating random records. The technique, described in more detail here, uses the k-nearest neighbors algorithm to find similar samples within a minority class, and creates synthetic data points at a random position along the line between a given data point and one of its nearest neighbors.
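To make the interpolation idea concrete, here is a simplified, standard-library-only sketch of how a single synthetic point could be generated. It illustrates the core idea only and is not imbalanced-learn's implementation; the function name `smote_sample` and the toy two-feature data are assumptions.

```python
import math
import random


def smote_sample(minority, k=5, rng=random):
    """Create one synthetic point from a list of minority-class feature vectors."""
    base = rng.choice(minority)
    # Find the k nearest neighbours of the base point within the minority class.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Pick a random position along the segment from base to neighbour.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Toy minority class with two features (made-up data).
rng = random.Random(0)
minority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(10)]
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(5)]

for point in synthetic:
    print([round(v, 3) for v in point])
```

Because each synthetic point lies between two existing minority samples, it stays inside the region the minority class already occupies while still differing from every original record.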
The advantage is that we can synthesize some variation in the resampled data to reduce overfitting. We can apply this technique using imbalanced-learn again to resample our training set.
Following SMOTE resampling, the model reached 79% accuracy, a small improvement over the 76% accuracy from random oversampling.
The confusion matrix shows that while Gulf-specific accuracy decreased by 3% relative to the predictions made on unbalanced data, the East class accuracy increased by 13%.
Random Undersampling
Rather than resampling minority records, we can randomly undersample the majority class to create a balanced dataset.
Each class now has a sample size equal to that of the smallest class.
After training the model, we get 83% accuracy on the test set, an increase of 8 percentage points over the original model!
Class-specific accuracy increased by 31% for the East group, and was marginally reduced for the remaining classes. Interestingly, we built a more accurate model by reducing the amount of data we trained it on. This speaks to the critical nature of having balanced classes.
There are additional methods for addressing class imbalance; these three are simply a few common approaches. The code for this exercise can be found here.