How To Calculate Mahalanobis Distance in Python

27 July 2024

Mahalanobis distance measures how far a point lies from the center of a multivariate distribution, taking into account the variances of the variables and the correlations between them. It is commonly used in statistical analyses that involve several variables at once, for example to detect multivariate outliers.
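For reference, the textbook definition can be written as below. The symbols are not from the original article and are introduced here only for illustration: x is a single observation, \mu is the mean vector of the data, and \Sigma is its covariance matrix. Note that the calculateMahalanobis function used later in this article returns the squared quantity D^2, not D.

```latex
D(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)},
\qquad
D^2(x) = (x - \mu)^{\top} \Sigma^{-1} (x - \mu)
```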

The following libraries need to be installed and imported to calculate the Mahalanobis distance in Python:

numpy
pandas
scipy

Command to install all of the above packages:

pip3 install numpy pandas scipy

Step 1: The first step is to import all the libraries installed above.

Python3

# Importing libraries 
  
import numpy as np 
import pandas as pd  
from scipy import stats

Step 2: Creating a dataset. Consider a dataset of 10 cars of different brands. The data has five columns:

Price
Distance
Emission generated
Performance
Mileage

Python3

# data  
data = { 'Price': [100000, 800000, 650000, 700000, 
                   860000, 730000, 400000, 870000, 
                   780000, 400000], 
         'Distance': [16000, 60000, 300000, 10000, 
                      252000, 350000, 260000, 510000, 
                      2000, 5000], 
         'Emission': [300, 400, 1230, 300, 400, 104, 
                      632, 221, 142, 267], 
         'Performance': [60, 88, 90, 87, 83, 81, 72,  
                         91, 90, 93], 
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99,  
                     97, 99] 
           } 
  
# Creating dataset 
df = pd.DataFrame(data,columns=['Price', 'Distance', 
                                'Emission','Performance', 
                                'Mileage']) 

Step 3: Determining the Mahalanobis distance for each observation.

Python3

# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats

# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):

    # center the observations on the column means of data
    y_mu = y - data.mean()
    # estimate the covariance matrix unless one was supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # the diagonal holds one squared distance per row
    return mahal.diagonal()

# create new column in dataframe that contains
# Mahalanobis distance for each row
df['calculateMahalanobis'] = calculateMahalanobis(y=df, data=df[['Price', 'Distance',
                                                                 'Emission','Performance',
                                                                 'Mileage']])
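The function above builds a full n x n matrix and then keeps only its diagonal. As a possible alternative, the sketch below computes the same per-row quantity directly with np.einsum and cross-checks it against scipy.spatial.distance.mahalanobis, which returns the unsquared distance. The helper name squared_mahalanobis is hypothetical and not part of the original article; the data is the same car dataset.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis

# Same car data as in the article.
df = pd.DataFrame({
    'Price': [100000, 800000, 650000, 700000, 860000,
              730000, 400000, 870000, 780000, 400000],
    'Distance': [16000, 60000, 300000, 10000, 252000,
                 350000, 260000, 510000, 2000, 5000],
    'Emission': [300, 400, 1230, 300, 400, 104, 632, 221, 142, 267],
    'Performance': [60, 88, 90, 87, 83, 81, 72, 91, 90, 93],
    'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 97, 99],
})

def squared_mahalanobis(data):
    """Hypothetical helper: squared distance of every row from the column means."""
    x = data.to_numpy(dtype=float)
    centered = x - x.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(x, rowvar=False))
    # Row-wise quadratic form; avoids building the full n x n matrix.
    return np.einsum('ij,jk,ik->i', centered, inv_cov, centered)

d2 = squared_mahalanobis(df)

# Cross-check against SciPy, which returns the unsquared distance.
values = df.to_numpy(dtype=float)
mean = values.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(values, rowvar=False))
d_scipy = np.array([mahalanobis(row, mean, inv_cov) for row in values])
print(np.allclose(d2, d_scipy ** 2))
```

For ten rows the difference in cost is negligible; the einsum form mainly matters when the number of observations is large.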

Combining all steps:

Example:

Python3

# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats

# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):

    # center the observations on the column means of data
    y_mu = y - data.mean()
    # estimate the covariance matrix unless one was supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # the diagonal holds one squared distance per row
    return mahal.diagonal()
  
# data 
data = { 'Price': [100000, 800000, 650000, 700000,  
                   860000, 730000, 400000, 870000, 
                   780000, 400000], 
         'Distance': [16000, 60000, 300000, 10000,  
                      252000, 350000, 260000, 510000,  
                      2000, 5000], 
         'Emission': [300, 400, 1230, 300, 400, 104, 
                      632, 221, 142, 267], 
         'Performance': [60, 88, 90, 87, 83, 81, 72,  
                         91, 90, 93], 
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99,  
                     97, 99] 
           } 
  
# Creating dataset 
df = pd.DataFrame(data,columns=['Price', 'Distance', 
                                'Emission','Performance',  
                                'Mileage']) 
  
# Creating a new column in the dataframe that holds 
# the Mahalanobis distance for each row 
df['calculateMahalanobis'] = calculateMahalanobis(y=df, data=df[[ 
  'Price', 'Distance', 'Emission','Performance', 'Mileage']]) 
  
# Display the dataframe 
print(df) 

Computing the p-value for every Mahalanobis distance

Now let us compute the p-value for the Mahalanobis distance of each observation in the dataset. In the DataFrame printed by the previous example, some of the distances are noticeably larger than others. To decide whether any of them is statistically significant, we need to find their p-values. If the data are approximately multivariate normal, the squared Mahalanobis distance (which is what calculateMahalanobis returns) approximately follows a Chi-Square distribution with k degrees of freedom, where k = number of variables. So, in this case, we'll use 5 degrees of freedom.
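As a quick illustration of this step in isolation, the sketch below converts a few squared distances into p-values. The distance values are made up for illustration and are not results from the car dataset. It assumes the degrees of freedom equal the number of variables (k = 5), which is the usual large-sample approximation, and it uses chi2.sf, SciPy's survival function, which is equivalent to 1 - chi2.cdf.

```python
import numpy as np
from scipy.stats import chi2

# Made-up squared Mahalanobis distances, for illustration only.
d2_example = np.array([0.0, 2.5, 5.0, 11.0])
k = 5  # number of variables, used as the degrees of freedom

# chi2.sf is the survival function, i.e. 1 - cdf.
p_values = chi2.sf(d2_example, df=k)
print(p_values)
```

Larger squared distances give smaller p-values, so the observations farthest from the center are the ones most likely to be flagged.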

Example:

Python3

# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import chi2

# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):

    # center the observations on the column means of data
    y_mu = y - data.mean()
    # estimate the covariance matrix unless one was supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # the diagonal holds one squared distance per row
    return mahal.diagonal()
  
# data 
data = { 'Price': [100000, 800000, 650000, 700000, 
                   860000, 730000, 400000, 870000, 
                   780000, 400000], 
         'Distance': [16000, 60000, 300000, 10000,  
                      252000, 350000, 260000, 510000, 
                      2000, 5000], 
         'Emission': [300, 400, 1230, 300, 400, 104, 
                      632, 221, 142, 267], 
         'Performance': [60, 88, 90, 87, 83, 81, 72, 
                         91, 90, 93], 
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 
                     97, 99] 
           } 
  
# Creating dataset 
df = pd.DataFrame(data,columns=['Price', 'Distance', 
                                'Emission','Performance', 
                                'Mileage']) 
  
# Creating a new column in the dataframe that holds 
# the Mahalanobis distance for each row 
df['Mahalanobis'] = calculateMahalanobis(y=df, data=df[[ 
  'Price', 'Distance', 'Emission','Performance', 'Mileage']]) 
  
# calculate p-value for each Mahalanobis distance
# (degrees of freedom = number of variables = 5)
df['p'] = 1 - chi2.cdf(df['Mahalanobis'], 5)

# display the dataframe
print(df)

Interpretation:

Generally, an observation with a p-value below 0.001 is considered an outlier. In this example, no observation is flagged as an outlier, since all the p-values are greater than 0.001.
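The 0.001 rule can also be expressed as a cutoff on the squared distance itself. The sketch below is one way to do that, assuming 5 degrees of freedom (one per variable); the helper name flag_outliers and the sample distances are hypothetical and chosen only to show one value on each side of the cutoff.

```python
import numpy as np
from scipy.stats import chi2

def flag_outliers(d2, n_variables, alpha=0.001):
    """Hypothetical helper: True where the chi-square p-value is below alpha."""
    p = chi2.sf(np.asarray(d2, dtype=float), df=n_variables)
    return p < alpha

# Critical value: squared distances above this would be flagged.
critical = chi2.ppf(1 - 0.001, df=5)

# Made-up squared distances, for illustration only.
flags = flag_outliers([1.2, 4.8, 7.9, 25.0], n_variables=5)
print(critical, flags)
```

Comparing against the critical value and comparing the p-value against alpha are two views of the same test, so either form can be used.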
