Imbalanced-Learn module in Python

27 July 2024


Imbalanced-Learn is a Python module for balancing datasets that are highly skewed towards some classes. It does so by resampling the classes that are over-represented or under-represented. The greater the imbalance ratio, the more a model's output is biased towards the class with the higher number of examples. The following dependencies need to be installed to use imbalanced-learn:

scipy(>=0.19.1)
numpy(>=1.13.3)
scikit-learn(>=0.23)
joblib(>=0.11)
keras 2 (optional)
tensorflow (optional)

To install imbalanced-learn, run:

pip install imbalanced-learn

The resampling of data is done in 2 parts:

Estimator: It implements a fit method which is derived from scikit-learn. The data is a 2D array of shape (n_samples, n_features) and the targets are a 1D array of shape (n_samples,).

estimator = obj.fit(data, targets)

Resampler: The fit_resample method resamples the data and targets and returns them as a tuple of two arrays, data_resampled and targets_resampled.

data_resampled, targets_resampled = obj.fit_resample(data, targets)

The Imbalanced Learn module has different algorithms for oversampling and undersampling:

We will use the make_classification function from scikit-learn, which generates a synthetic classification dataset and returns:

x: a matrix of n_samples*n_features and
y: an array of integer labels.


Python3

# import required modules
from sklearn.datasets import make_classification
  
# define dataset
x, y = make_classification(n_samples=10000, 
                           weights=[0.99], 
                           flip_y=0)
print('x:\n', x)
print('y:\n', y)

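Printing the raw arrays does not make the imbalance visible, so it can help to count the labels per class. The sketch below does this with collections.Counter. The random_state argument and the imbalance_ratio variable are additions for repeatability and illustration and are not part of the original example.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Same dataset definition as above; random_state is an added assumption
# so that repeated runs generate the same samples.
x, y = make_classification(n_samples=10000,
                           weights=[0.99],
                           flip_y=0,
                           random_state=42)

counts = Counter(y.tolist())
print('shapes:', x.shape, y.shape)
print('samples per class:', counts)

# Ratio of majority to minority samples (illustrative helper value).
imbalance_ratio = counts[0] / counts[1]
print('imbalance ratio:', imbalance_ratio)
```

With weights=[0.99] and no label noise, class 0 should hold roughly 99% of the samples.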

Below are some programs which show how to apply oversampling and undersampling to the dataset:

Oversampling

Random Over Sampler: It is a naive method that balances the classes by randomly picking samples of the under-represented classes, with replacement, and adding the duplicates to the dataset.

Syntax:

from imblearn.over_sampling import RandomOverSampler

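Parameters(optional): sampling_strategy='auto', random_state=None. Older releases also accepted return_indices and ratio; both have since been removed, and sampling_strategy takes the place of ratio.

Implementation:
oversample = RandomOverSampler(sampling_strategy='minority')
x_over, y_over = oversample.fit_resample(x, y)

Return Type: a tuple (x_over, y_over), where x_over is a matrix of shape (n_samples_new, n_features) and y_over is an array of n_samples_new labels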

Example:

Python3

# import required modules
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
  
# define dataset
x, y = make_classification(n_samples=10000, 
                           weights=[0.99], 
                           flip_y=0)
  
oversample = RandomOverSampler(sampling_strategy='minority')
x_over, y_over = oversample.fit_resample(x, y)
  
# print the features and the labels
print('x_over:\n', x_over)
print('y_over:\n', y_over)

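To make the idea behind random oversampling concrete, here is a hand-rolled NumPy sketch for a two-class problem. It is not the library's implementation; the helper name random_oversample_minority and the toy data are hypothetical and only illustrate duplicating minority rows by sampling with replacement.

```python
import numpy as np


def random_oversample_minority(x, y, seed=0):
    """Hypothetical helper: duplicate minority rows until both classes match."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()

    # Row indices of the existing minority samples.
    minority_idx = np.flatnonzero(y == minority)

    # Draw existing minority rows WITH replacement, so rows may repeat.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

    x_over = np.vstack([x, x[extra_idx]])
    y_over = np.concatenate([y, y[extra_idx]])
    return x_over, y_over


# Toy data (made-up values): 8 samples of class 0 and 2 samples of class 1.
x = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

x_over, y_over = random_oversample_minority(x, y)
print(np.bincount(y_over))
```

No new feature values are created: every added row is an exact copy of an existing minority row, which is why the method is considered naive.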

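SMOTE, ADASYN: Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic sampling (ADASYN) are two oversampling methods. Rather than duplicating existing rows, both generate new minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. ADASYN additionally takes the local class distribution into account and generates more synthetic samples around minority points that have more majority-class neighbours, i.e. the points that are harder to learn.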

Syntax:

from imblearn.over_sampling import SMOTE, ADASYN

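Parameters(optional): sampling_strategy='auto', random_state=None, k_neighbors=5 (SMOTE) or n_neighbors=5 (ADASYN), n_jobs=None. All parameters are keyword-only.

Implementation:
smote = SMOTE(sampling_strategy='minority')
x_smote, y_smote = smote.fit_resample(x, y)

Return Type: a tuple (x_smote, y_smote), where x_smote is a matrix of shape (n_samples_new, n_features) and y_smote is an array of n_samples_new labels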

Example:

Python3

# import required modules
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
  
# define dataset
x, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
smote = SMOTE()
x_smote, y_smote = smote.fit_resample(x, y)
  
# print the features and the labels
print('x_smote:\n', x_smote)
print('y_smote:\n', y_smote)

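The interpolation step behind SMOTE-style oversampling can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the code used by imbalanced-learn: the helper name synthesize, the toy points, and k=2 are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in two dimensions.
minority = np.array([[1.0, 1.0],
                     [2.0, 1.5],
                     [1.5, 3.0],
                     [3.0, 2.5]])


def synthesize(points, k=2, rng=rng):
    """Hypothetical helper: build one synthetic point by interpolation."""
    # Pick a random minority sample as the starting point.
    base = points[rng.integers(len(points))]

    # Distances from the base point to every minority point.
    dist = np.linalg.norm(points - base, axis=1)

    # Indices of the k nearest neighbours, skipping the base point itself
    # (it sits at position 0 of the sorted order with distance 0).
    neighbours = np.argsort(dist)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]

    # Move a random fraction of the way from the base towards the neighbour.
    gap = rng.random()
    return base + gap * (neighbour - base)


new_point = synthesize(minority)
print('synthetic sample:', new_point)
```

Because the new point lies on the line segment between two existing minority samples, it stays inside the region already occupied by the minority class instead of being an exact duplicate.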

Undersampling

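Edited Nearest Neighbours: This algorithm removes samples whose class label disagrees with the labels of their nearest neighbours. With kind_sel='all', every neighbour must share the sample's label for it to be kept; with kind_sel='mode', a majority of the neighbours is enough.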

Syntax:

from imblearn.under_sampling import EditedNearestNeighbours

Parameters(optional): sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=None. Older releases also accepted return_indices, random_state and ratio; these have since been removed.

Implementation:
en = EditedNearestNeighbours()
x_en, y_en = en.fit_resample(x, y)

Return Type: a tuple (x_en, y_en), where x_en is a matrix of shape (n_samples_new, n_features) and y_en is an array of n_samples_new labels

Example:

Python3

# import required modules
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours
  
# define dataset
x, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
en = EditedNearestNeighbours()
x_en, y_en = en.fit_resample(x, y)
  
# print the features and the labels
print('x_en:\n', x_en)
print('y_en:\n', y_en)


Random Under Sampler: It randomly selects a subset of samples from the over-represented classes, with or without replacement, and discards the rest.

Syntax:

from imblearn.under_sampling import RandomUnderSampler

Parameters(optional): sampling_strategy='auto', random_state=None, replacement=False. Older releases also accepted return_indices and ratio; both have since been removed.

Implementation:
undersample = RandomUnderSampler()
x_under, y_under = undersample.fit_resample(x, y)

Return Type: a tuple (x_under, y_under), where x_under is a matrix of shape (n_samples_new, n_features) and y_under is an array of n_samples_new labels

Example:

Python3

# import required modules
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
  
# define dataset
x, y = make_classification(n_samples=10000, 
                           weights=[0.99], 
                           flip_y=0)
undersample = RandomUnderSampler()
x_under, y_under = undersample.fit_resample(x, y)
  
# print the features and the labels
print('x_under:\n', x_under)
print('y_under:\n', y_under)

