Class imbalance is common in real-world datasets. For example, a credit card fraud dataset will typically contain far more records of legitimate activity than of fraudulent activity. In many applications, training a model on imbalanced classes can undermine the model's usefulness when predictive accuracy on the minority classes is what matters: having seen far more examples of the majority class, the model learns to predict that class by default. We'll explore this phenomenon and demonstrate common techniques for addressing class imbalance, including random oversampling, random undersampling, and the synthetic minority over-sampling technique (SMOTE), in Python.
The bar plot below illustrates a typical class imbalance in a training dataset. In this application, suppose we want to predict all classes equally well. For background: we have 64 features that offer predictive power for determining where the fish samples came from (Gulf, West, or East).

Random Oversampling
One approach is to randomly resample the minority classes (West and East) in our training dataset until each matches the largest class-specific sample size, essentially copying random minority records. We'll use the imbalanced-learn package, which provides utilities for this purpose, to randomly resample our training dataset, with the response and features labeled train_y and train_X respectively.
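The sketch below is a minimal illustration, assuming imbalanced-learn is installed. Since the fish dataset isn't reproduced here, a synthetic 64-feature, three-class dataset from scikit-learn's make_classification stands in for train_X and train_y; the sample counts, class weights, and random_state values are illustrative choices, not part of the original exercise.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Stand-in for the fish dataset: 64 features and 3 imbalanced classes
# (labels 0, 1, 2 playing the roles of Gulf, West, and East).
train_X, train_y = make_classification(
    n_samples=1000,
    n_features=64,
    n_informative=10,
    n_classes=3,
    weights=[0.7, 0.2, 0.1],
    random_state=42,
)
print("Before:", Counter(train_y))

# Randomly duplicate minority-class records until every class
# matches the majority class count.
ros = RandomOverSampler(random_state=42)
train_X_ros, train_y_ros = ros.fit_resample(train_X, train_y)
print("After: ", Counter(train_y_ros))
```

Note that resampling should be applied only to the training split, never to the validation or test data, so that evaluation still reflects the true class distribution.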



Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE is a way to oversample our minority classes in a manner that does not simply duplicate random records. The technique, described in more detail here, uses the k-nearest-neighbors algorithm: for a given minority-class sample, it finds that sample's nearest neighbors within the same class and creates a synthetic data point at a random position along the line segment between the sample and one of those neighbors.
The advantage is that the synthetic points introduce some variation into the resampled data, which can help reduce overfitting. We can apply this technique using imbalanced-learn again to resample our training set.
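A minimal sketch, continuing with the synthetic train_X and train_y from the previous example (k_neighbors=5 is imbalanced-learn's default, shown explicitly here only for clarity):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Interpolate between each minority sample and a randomly chosen
# one of its k nearest minority-class neighbors.
sm = SMOTE(k_neighbors=5, random_state=42)
train_X_sm, train_y_sm = sm.fit_resample(train_X, train_y)
print(Counter(train_y_sm))
```

Because SMOTE interpolates in feature space, it assumes numeric features; categorical columns need encoding (or a variant such as SMOTENC) first.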


Random Undersampling
Rather than resampling minority records, we can randomly undersample the majority class (Gulf) to create a balanced dataset.
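Again a minimal sketch, reusing the synthetic train_X and train_y from the first example:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class records until every class matches
# the smallest class count.
rus = RandomUnderSampler(random_state=42)
train_X_rus, train_y_rus = rus.fit_resample(train_X, train_y)
print(Counter(train_y_rus))
```

The trade-off is that undersampling discards majority-class records, which can hurt performance when the overall dataset is small.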



These three techniques are just a few common approaches; there are additional methods for addressing class imbalance. The code for this exercise can be found here.
