Simple linear regression aims to find the best-fit line that describes the linear relationship between the input variables (denoted by X) and the target variable (denoted by y). It has a limitation: real-world datasets often contain outliers, which bias the fitted model. Robust regression was introduced to overcome this limitation. In this article, we will learn about some state-of-the-art machine learning models that are robust to outliers.
One of the most widely used algorithms for robust regression is Random Sample Consensus (RANSAC). It is an iterative, non-deterministic method for estimating the parameters of a machine learning model from a set of observed data that contains outliers. Because the outliers are given no influence on the estimated values, RANSAC can also be interpreted as an outlier detection method. It produces a reasonable result only with a certain probability, and this probability increases as more iterations are allowed.
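The core loop can be sketched in a few lines. This is a minimal illustrative implementation; the function name, sample size, residual threshold, and iteration count are arbitrary choices for the sketch, not values from any library:

```python
import numpy as np

def ransac_line(x, y, n_iters=100, sample_size=2, threshold=1.0, seed=None):
    """Fit y = a*x + b by RANSAC: repeatedly fit a line to a random
    minimal sample and keep the fit with the largest consensus set."""
    rng = np.random.default_rng(seed)
    best_params, best_inliers = None, 0
    for _ in range(n_iters):
        idx = rng.choice(len(x), size=sample_size, replace=False)
        a, b = np.polyfit(x[idx], y[idx], deg=1)   # candidate line
        residuals = np.abs(y - (a * x + b))
        n_inliers = int((residuals < threshold).sum())
        if n_inliers > best_inliers:               # keep the best consensus
            best_inliers = n_inliers
            best_params = (a, b)
    return best_params

# Line y = 2x + 1 with a few gross outliers planted in it
x = np.arange(20, dtype=float)
y = 2 * x + 1
y[3], y[7], y[15] = 100.0, -50.0, 80.0

a, b = ransac_line(x, y, seed=0)
print(round(a, 2), round(b, 2))   # recovers the slope and intercept of the clean line
```

Because candidate lines through outlier points collect very few inliers, the consensus step discards them and the returned parameters come from a clean sample.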
Importing Libraries and Dataset
Here we will import a dataset and use it with some of the robust linear regression models. Python libraries make it easy for us to handle the data and perform typical and complex tasks with a single line of code.
- Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
- Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
- Matplotlib/Seaborn – This library is used to draw visualizations.
- Sklearn – This module contains multiple libraries with pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.
Python3

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, RANSACRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

import warnings
warnings.filterwarnings('ignore')
```
The sklearn datasets module contains some sample datasets for testing ideas and for illustration.
Python3

```python
# Load the Boston Housing dataset for training
# (load_boston was removed in scikit-learn 1.2;
# this example requires an earlier version)
boston = datasets.load_boston()

# Name the columns present in the dataset
df = pd.DataFrame(boston.data)
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
              'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Set the target as column "MEDV"
df['MEDV'] = boston.target

# Select the average number of rooms per dwelling (RM)
# as the single feature
X = df['RM'].to_numpy().reshape(-1, 1)
y = df['MEDV'].to_numpy().reshape(-1, 1)
```
RANSAC Regressor
In this model, the data is first separated into inliers and outliers, and the model is then trained on the inlier data only. Training the model this way helps it learn the underlying pattern instead of the noise.
Python3

```python
# Create a model
# (in scikit-learn >= 1.2 use estimator= and loss='absolute_error')
model = RANSACRegressor(base_estimator=LinearRegression(),
                        min_samples=50,
                        max_trials=100,
                        loss='absolute_loss',
                        random_state=42,
                        residual_threshold=10)

# Fit the model
model.fit(X, y)
```
Output:
RANSACRegressor(base_estimator=LinearRegression(), loss='absolute_loss',
                min_samples=50, random_state=42, residual_threshold=10)
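After fitting, the estimator exposes which samples it treated as inliers through its `inlier_mask_` attribute, and the final model fitted on those inliers through `estimator_`. Since `load_boston` is unavailable in recent scikit-learn releases, here is a self-contained sketch on synthetic data; the data and threshold are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2 + rng.normal(scale=0.5, size=50)
y[:5] = 200.0                      # plant five gross outliers

model = RANSACRegressor(LinearRegression(),
                        residual_threshold=5,
                        random_state=42)
model.fit(X, y)

print(model.inlier_mask_.sum())    # number of samples kept as inliers
print(model.estimator_.coef_)      # slope recovered from the inliers only
```

The five planted outliers fall outside the residual threshold, so the final line is fitted on the 45 clean points and its slope stays close to the true value of 3.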
Now let’s check the mean absolute error of the model.
Python3

```python
from sklearn.metrics import mean_absolute_error

y_pred = model.predict(X)
print(mean_absolute_error(y, y_pred))
```
Output:
4.475672221331006
Theil Sen Regressor
This model is similar in spirit to a random forest, in which we train multiple decision trees and aggregate their results to reduce overfitting. Here, multiple least-squares models are trained on subsets of the training data, and the coefficients of those models are then combined by taking their (spatial) median. This median step over the coefficients is exactly what makes the model robust to outliers.
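The classical one-dimensional Theil–Sen estimate makes the median step concrete: the slope is the median of the slopes over all pairs of points, which a minority of outliers cannot drag far. This is a minimal sketch of that idea, not scikit-learn's implementation (which works on random subsets and generalizes to multiple features):

```python
import numpy as np
from itertools import combinations

def theil_sen_slope(x, y):
    """Median of all pairwise slopes: robust to a minority of outliers."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)]
    return float(np.median(slopes))

x = np.arange(10, dtype=float)
y = 2 * x + 1
y[4] = 500.0                      # one wild outlier
print(theil_sen_slope(x, y))      # stays at the true slope of 2
```

Only the 9 pairs involving the outlier produce distorted slopes; the remaining 36 pairs all give the true slope, so the median is unaffected.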
Python3

```python
from sklearn.linear_model import TheilSenRegressor

# Create a model
model = TheilSenRegressor(random_state=42)

# Fit the model
model.fit(X, y)
```
Output:
TheilSenRegressor(max_subpopulation=10000, random_state=42)
Now let’s check the mean absolute error of the model.
Python3

```python
from sklearn.metrics import mean_absolute_error

y_pred = model.predict(X)
print(mean_absolute_error(y, y_pred))
```
Output:
4.442032221450043
Huber Regressor
In this model, the loss is quadratic for small residuals and linear for large ones, so data points that are inliers are effectively given higher preference during optimization. This again helps the model learn the underlying pattern rather than the noise present in the data.
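The robustness comes from the Huber loss, which is quadratic for residuals below a threshold epsilon and linear beyond it, so large residuals contribute far less than they would under squared error. A minimal illustration of the loss itself (1.35 is scikit-learn's default epsilon for HuberRegressor):

```python
import numpy as np

def huber_loss(residual, epsilon=1.35):
    """Quadratic near zero, linear in the tails."""
    r = np.abs(residual)
    return np.where(r <= epsilon,
                    0.5 * r ** 2,                 # inlier: squared-error region
                    epsilon * (r - 0.5 * epsilon))  # outlier: linear region

print(huber_loss(np.array([0.5, 10.0])))   # small vs large residual
print(0.5 * np.array([0.5, 10.0]) ** 2)    # squared error, for comparison
```

For the residual of 10, the squared error is 50 while the Huber loss is only about 12.6, so a single outlier cannot dominate the fit.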
Python3

```python
from sklearn.linear_model import HuberRegressor

# Create a model
model = HuberRegressor()

# Fit the model
model.fit(X, y)
```
Now let’s check the mean absolute error of the model.
Python3

```python
from sklearn.metrics import mean_absolute_error

y_pred = model.predict(X)
print(mean_absolute_error(y, y_pred))
```
Output:
4.437123637682936
If the dataset you are using contains a lot of outliers, try training the robust regression models mentioned above and choose the best among them using a validation dataset.
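One way to carry out that comparison, sketched on synthetic data; the dataset, contamination level, and hold-out split here are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import (LinearRegression, RANSACRegressor,
                                  TheilSenRegressor, HuberRegressor)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic line y = 3x + 2 with 10% gross outliers mixed in
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=1.0, size=200)
y[:20] += 50.0

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

models = {
    'OLS': LinearRegression(),
    'RANSAC': RANSACRegressor(random_state=0),
    'TheilSen': TheilSenRegressor(random_state=0),
    'Huber': HuberRegressor(),
}

# Fit each model and score it on the held-out validation split
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_val, model.predict(X_val))
    print(f'{name}: {results[name]:.3f}')
```

On contaminated data like this, the ordinary least-squares fit is pulled toward the outliers, so the robust estimators typically achieve a lower validation error.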