How to use Pandas filter with IQR?

27 July 2024

The IQR, or interquartile range, is a statistical measure of the variability in a dataset. Put simply, it gives the width of the range that holds the middle 50% of the data. It is calculated as the difference between the third quartile and the first quartile of the dataset.

IQR = Q3 - Q1

Here, Q3 is the 75th percentile (the middle value between the median and the largest value in the dataset) and Q1 is the 25th percentile (the middle value between the median and the smallest value in the dataset). Q2 denotes the 50th percentile, i.e., the median of the dataset. For more information about IQR, please read https://www.geeksforgeeks.org/interquartile-range-iqr/.
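As a quick illustration of the formula, the sketch below computes Q1, Q2, Q3 and the IQR with np.quantile(). The array `values` is a hypothetical sample made up for this example. Note that np.quantile() interpolates linearly between data points by default, so other quartile conventions may give slightly different numbers.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = np.quantile(values, 0.25)  # 25th percentile
q2 = np.quantile(values, 0.50)  # 50th percentile (median)
q3 = np.quantile(values, 0.75)  # 75th percentile
iqr = q3 - q1                   # IQR = Q3 - Q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
# Q1: 3.0 Q2: 5.0 Q3: 7.0 IQR: 4.0
```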

In this article, we will learn how to filter a dataset using Pandas with the help of the IQR.

The IQR is commonly used to filter outliers from a dataset. Outliers are extreme values that lie far from the regular observations and may arise from variability in measurement or from experimental error. We often want to identify these outliers and filter them out to reduce errors. Here, we will show an example that detects outliers and filters them out using Pandas in the Python programming language.
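To make the idea concrete, here is a minimal sketch that flags outliers in a small Series using the common 1.5 × IQR rule, the same multiplier used by the function later in this article. The Series `s` and the names `lower_fence` and `upper_fence` are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical measurements; 95 is deliberately extreme
s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Values outside these two fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (s < lower_fence) | (s > upper_fence)
outliers = s[is_outlier]
print(outliers)
```

Only the value 95 (at index 5) falls outside the fences here. The multiplier 1.5 is a convention rather than a fixed law, so a larger value can be used when only the most extreme points should be flagged.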

Let’s begin by importing the libraries that we will need to identify and filter the outliers.

Python

# Importing important libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
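# The 'seaborn' style name is not available in recent Matplotlib releases,
# so apply the seaborn look through seaborn itself
sns.set_theme()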

Now, we will read the dataset in which we want to detect and filter outliers. The dataset can be downloaded from https://tinyurl.com/gfgdata. We load it with the read_csv() function from the Pandas library:

Python

# Reading the dataset
data = pd.read_csv('Dataset.csv')
print("The shape of the dataframe is: ", data.shape)

Output:

 The shape of the dataframe is:  (20, 4)
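If the file is not at hand, the same steps can be tried on an in-memory CSV. The sketch below is only an illustration: the rows in `csv_text` are made-up values, and only the column names are taken from this article. It also checks the column types, since the quartile calculations that follow assume numeric columns.

```python
import io
import pandas as pd

# Made-up rows for illustration; only the column names come from the article
csv_text = """Height (in cm),Width (in cm),Area (in cm2)
10,20,200
12,25,300
11,22,242
"""

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print("The shape of the dataframe is: ", sample.shape)
print(sample.dtypes)  # all three columns should be numeric
```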

Printing the dataset

We can print the dataset to have a look at the data.

Python

print(data)

Our dataset looks like this:

We can view summary statistics for this dataset using the data.describe() method:

Python

data.describe()

Output:

It can be observed that for features such as ‘Height’, ‘Width’, and ‘Area’, the maximum value differs greatly from the 75% value, which suggests that some observations act as outliers in the dataset. Similarly, the minimum value in these columns differs greatly from the 25% value, which also signals the presence of outliers.
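The comparison described above can also be made numerically. The following sketch measures how far the minimum and maximum lie from the quartiles, in units of the IQR, using the table returned by describe(). The DataFrame `df` and the column names `min_gap` and `max_gap` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical data; 250 is deliberately extreme
df = pd.DataFrame({"Height (in cm)": [150, 152, 155, 158, 160, 162, 250]})

# Transpose so that each feature becomes one row of statistics
stats = df.describe().T
stats["IQR"] = stats["75%"] - stats["25%"]

# Distance of min and max from the nearest quartile, in IQR units
stats["min_gap"] = (stats["25%"] - stats["min"]) / stats["IQR"]
stats["max_gap"] = (stats["max"] - stats["75%"]) / stats["IQR"]

print(stats[["min", "25%", "75%", "max", "min_gap", "max_gap"]])
```

A gap larger than 1.5 means that the extreme value lies outside the usual 1.5 × IQR bound on that side. In this made-up data only the maximum does.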

This can be verified with a box plot of the above features. Here we plot the box plot for the Height column; box plots for the other features can be drawn in the same manner.

Python

plt.figure(figsize=(6,4))
# Pass the column by keyword so the plot orientation is explicit
sns.boxplot(x=data['Height (in cm)'])
plt.show()

Output:

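We can observe the presence of outliers in the box plot: they are the points drawn individually beyond the whiskers, which by default extend up to 1.5 × IQR below the first quartile and above the third quartile.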
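The numbers behind such a plot can be inspected without drawing anything. The sketch below uses Matplotlib's cbook.boxplot_stats() helper, which computes box plot statistics for an array. The array `heights` is hypothetical data, not the article's dataset.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical heights; 250 is deliberately extreme
heights = np.array([150, 152, 155, 158, 160, 162, 250])

# One dictionary of statistics is returned per input series
stats = cbook.boxplot_stats(heights, whis=1.5)[0]

print("Q1:", stats["q1"], "Q3:", stats["q3"])
print("Whiskers:", stats["whislo"], "to", stats["whishi"])
print("Points beyond the whiskers:", stats["fliers"])
```

The whiskers stop at the most extreme data points that still lie within 1.5 × IQR of the quartiles, and everything beyond them is reported under the "fliers" key.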

To find and filter such outliers, we will create a custom function that removes them. In the function, we first compute the IQR as the difference between the third and first quartile values. We then compute lower_range = Q1 - 1.5 × IQR and upper_range = Q3 + 1.5 × IQR and keep only the observations that lie strictly between these two bounds, which drops everything outside them. It can be written as:

Python

# Removing the outliers
def removeOutliers(data, col):
    # Quartiles and IQR of the given column
    Q3 = np.quantile(data[col], 0.75)
    Q1 = np.quantile(data[col], 0.25)
    IQR = Q3 - Q1

    print("IQR value for column %s is: %s" % (col, IQR))

    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR

    # Keep only the rows whose value lies strictly inside the bounds
    mask = (data[col] > lower_range) & (data[col] < upper_range)
    return data.loc[mask]


# Filter column by column, each time starting from the rows kept so far
filtered_data = data
for col in data.columns:
    filtered_data = removeOutliers(filtered_data, col)

# Assigning filtered data back to our original variable
data = filtered_data
print("Shape of data after outlier removal is: ", data.shape)

Output:

IQR value for column Height (in cm) is: 9.5
IQR value for column Width (in cm) is: 16.75
IQR value for column Area (in cm2) is: 706.0
Shape of data after outlier removal is:  (18, 3)
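As an alternative sketch, the bounds for every column can be computed in one step with DataFrame.quantile(). The helper name `iqr_mask`, its parameter `k`, and the DataFrame `df` are assumptions made for this example and do not come from the article. This variant computes all bounds from the unfiltered data, whereas the loop above recomputes the quartiles on the rows that survived the previous columns, so the two approaches can keep different rows.

```python
import pandas as pd

def iqr_mask(df, k=1.5):
    """Return True for rows whose values lie inside the IQR bounds in every column."""
    q1 = df.quantile(0.25)  # one Q1 per column
    q3 = df.quantile(0.75)  # one Q3 per column
    iqr = q3 - q1
    # The comparison aligns each bound with the column of the same name
    inside = (df > q1 - k * iqr) & (df < q3 + k * iqr)
    return inside.all(axis=1)

# Hypothetical all-numeric data with one extreme value per column
df = pd.DataFrame({
    "Height (in cm)": [150, 152, 155, 158, 160, 162, 250],
    "Width (in cm)": [40, 42, 41, 43, 5, 44, 42],
})

clean = df[iqr_mask(df)]
print(clean)
print("Shape after outlier removal:", clean.shape)
```

Rows 4 and 6 are dropped in this made-up data, because each of them holds an extreme value in one of the two columns.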

Printing the data afterward, we can see that the two extreme observations that were acting as outliers have been removed.

Python

print(data)

Output:

We can observe that the rows with index numbers 7 and 15 have been removed from the original dataset.
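The removed rows can also be found programmatically by comparing indexes, provided a copy of the unfiltered data is kept. The sketch below assumes hypothetical frames named `original` and `filtered`, and a simple threshold stands in for the IQR filter to keep the example short.

```python
import pandas as pd

# Hypothetical data; the row at index 3 holds an extreme value
original = pd.DataFrame({"Height (in cm)": [150, 152, 155, 250, 158]})

# Stand-in for the IQR filtering step
filtered = original[original["Height (in cm)"] < 200]

# Index labels present before filtering but missing afterward
removed_idx = original.index.difference(filtered.index)
print("Removed rows:", removed_idx.tolist())
print(original.loc[removed_idx])

# Optionally renumber the remaining rows from 0
filtered = filtered.reset_index(drop=True)
```

Resetting the index is optional. Leaving it untouched keeps the original row labels, which is what makes the gaps left by the removed rows visible in the printed data.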
