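Mahalanobis distance measures how far a point lies from the centre of a distribution (or from another point) in multivariate space, taking into account the variances of the variables and the correlations between them. It is used in multivariate statistical analysis, for example to detect outliers among observations described by several variables.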
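For reference, the usual textbook definition can be written as below. The symbols are introduced here only for illustration: x is an observation, \mu is the mean vector of the variables and \Sigma is their covariance matrix.

```latex
D_M(x) = \sqrt{(x - \mu)^{\mathsf{T}} \, \Sigma^{-1} \, (x - \mu)}
```

When \Sigma is the identity matrix, this reduces to the ordinary Euclidean distance between x and \mu. Note that the function used later in this article returns the squared value D_M(x)^2, without the square root.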
The user needs to install and import the following libraries for calculating Mahalanobis Distance in Python:
- numpy
- pandas
- scipy
Command to install all the above packages:
pip3 install numpy pandas scipy
Step 1: The first step is to import all the libraries installed above.
Python3
# Importing libraries
import numpy as np
import pandas as pd
from scipy import stats
Step 2: Creating a dataset. Consider data on 10 cars of different brands. The dataset has five columns:
- Price
- Distance
- Emission generated
- Performance
- Mileage
Python3
# data
data = {
    'Price': [100000, 800000, 650000, 700000, 860000,
              730000, 400000, 870000, 780000, 400000],
    'Distance': [16000, 60000, 300000, 10000, 252000,
                 350000, 260000, 510000, 2000, 5000],
    'Emission': [300, 400, 1230, 300, 400, 104, 632, 221, 142, 267],
    'Performance': [60, 88, 90, 87, 83, 81, 72, 91, 90, 93],
    'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 97, 99]
}

# Creating dataset
df = pd.DataFrame(data, columns=['Price', 'Distance', 'Emission',
                                 'Performance', 'Mileage'])
Step 3: Determining the Mahalanobis distance for each observation.
Python3
# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):
    # Centre the observations on the column means of the reference data
    y_mu = y - data.mean()
    # Estimate the covariance matrix if none was supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # The diagonal holds one squared distance per row
    return mahal.diagonal()

# create new column in dataframe that contains
# Mahalanobis distance for each row
columns = ['Price', 'Distance', 'Emission', 'Performance', 'Mileage']
df['calculateMahalanobis'] = calculateMahalanobis(y=df[columns],
                                                  data=df[columns])
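The function above builds a full n x n matrix and keeps only its diagonal. As a minimal sketch of an alternative, the row-wise quadratic form can be computed directly with np.einsum. The helper name squared_mahalanobis_rows and the variable names below are hypothetical and are not part of the original code. The last lines cross-check the first row against scipy.spatial.distance.mahalanobis, which returns the unsquared distance.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis


def squared_mahalanobis_rows(data: pd.DataFrame) -> np.ndarray:
    """Squared Mahalanobis distance of every row from the column means.

    Hypothetical helper for illustration; it avoids the n x n intermediate.
    """
    values = data.to_numpy(dtype=float)
    centered = values - values.mean(axis=0)
    # Sample covariance (rows are observations), then its inverse
    inv_cov = np.linalg.inv(np.cov(values, rowvar=False))
    # For each row i: sum over j, k of centered[i, j] * inv_cov[j, k] * centered[i, k]
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)


cars = pd.DataFrame({
    'Price': [100000, 800000, 650000, 700000, 860000,
              730000, 400000, 870000, 780000, 400000],
    'Distance': [16000, 60000, 300000, 10000, 252000,
                 350000, 260000, 510000, 2000, 5000],
    'Emission': [300, 400, 1230, 300, 400, 104, 632, 221, 142, 267],
    'Performance': [60, 88, 90, 87, 83, 81, 72, 91, 90, 93],
    'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 97, 99],
})

d2 = squared_mahalanobis_rows(cars)

# Cross-check the first row with SciPy (which returns the unsquared distance)
values = cars.to_numpy(dtype=float)
inv_cov = np.linalg.inv(np.cov(values, rowvar=False))
d_first = mahalanobis(values[0], values.mean(axis=0), inv_cov)

print(d2)
print(d_first ** 2)  # should agree with d2[0] up to rounding
```

With a sample covariance estimated from the same data, the squared distances always sum to (n - 1) * k, here 9 * 5 = 45, which makes a convenient sanity check.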
Combining all steps:
Example:
Python3
# Importing libraries
import numpy as np
import pandas as pd

# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):
    # Centre the observations on the column means of the reference data
    y_mu = y - data.mean()
    # Estimate the covariance matrix if none was supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # The diagonal holds one squared distance per row
    return mahal.diagonal()

# data
data = {
    'Price': [100000, 800000, 650000, 700000, 860000,
              730000, 400000, 870000, 780000, 400000],
    'Distance': [16000, 60000, 300000, 10000, 252000,
                 350000, 260000, 510000, 2000, 5000],
    'Emission': [300, 400, 1230, 300, 400, 104, 632, 221, 142, 267],
    'Performance': [60, 88, 90, 87, 83, 81, 72, 91, 90, 93],
    'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 97, 99]
}

# Creating dataset
columns = ['Price', 'Distance', 'Emission', 'Performance', 'Mileage']
df = pd.DataFrame(data, columns=columns)

# Creating a new column in the dataframe that holds
# the Mahalanobis distance for each row
df['calculateMahalanobis'] = calculateMahalanobis(y=df[columns],
                                                  data=df[columns])

# Display the dataframe
print(df)
Output:
Computing the p-value for every Mahalanobis distance
Now let us compute the p-value for the Mahalanobis distance of each observation in the dataset. When the code above is run, some of the Mahalanobis distances come out noticeably larger than others. To judge whether any of them is statistically significant, we need their p-values. If the data are approximately multivariate normal, the squared Mahalanobis distance (which is what calculateMahalanobis returns) approximately follows a Chi-Square distribution with k degrees of freedom, where k = number of variables. The p-value of each distance is therefore the upper-tail probability of that Chi-Square distribution. In this case there are five variables, so we use 5 degrees of freedom.
Example:
Python3
# Importing libraries
import numpy as np
import pandas as pd
from scipy.stats import chi2

# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):
    # Centre the observations on the column means of the reference data
    y_mu = y - data.mean()
    # Estimate the covariance matrix if none was supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # The diagonal holds one squared distance per row
    return mahal.diagonal()

# data
data = {
    'Price': [100000, 800000, 650000, 700000, 860000,
              730000, 400000, 870000, 780000, 400000],
    'Distance': [16000, 60000, 300000, 10000, 252000,
                 350000, 260000, 510000, 2000, 5000],
    'Emission': [300, 400, 1230, 300, 400, 104, 632, 221, 142, 267],
    'Performance': [60, 88, 90, 87, 83, 81, 72, 91, 90, 93],
    'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 97, 99]
}

# Creating dataset
columns = ['Price', 'Distance', 'Emission', 'Performance', 'Mileage']
df = pd.DataFrame(data, columns=columns)

# Creating a new column in the dataframe that holds
# the Mahalanobis distance for each row
df['Mahalanobis'] = calculateMahalanobis(y=df[columns], data=df[columns])

# Calculate the p-value for each Mahalanobis distance
# (degrees of freedom = number of variables = 5)
df['p'] = 1 - chi2.cdf(df['Mahalanobis'], len(columns))

# Display the dataframe
print(df)
Output:
Interpretation:
Generally, an observation with a p-value less than 0.001 is considered an outlier. In this example, there are no outliers, as all the p-values are greater than 0.001.
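The 0.001 rule can equivalently be expressed as a cutoff on the squared distance itself. The sketch below is one way to do that, assuming the squared distances follow a Chi-Square distribution. The helper flag_outliers and the example values are hypothetical, and dof=5 in the usage lines assumes five variables, so substitute whatever degrees of freedom you used when computing the p-values.

```python
import numpy as np
from scipy.stats import chi2


def flag_outliers(squared_distances, dof, alpha=0.001):
    """Return (cutoff, mask) where mask is True for distances above the cutoff.

    Hypothetical helper: `dof` is the Chi-Square degrees of freedom and
    `alpha` is the p-value threshold below which a point counts as an outlier.
    """
    d2 = np.asarray(squared_distances, dtype=float)
    # Squared distance whose upper-tail probability equals alpha
    cutoff = chi2.ppf(1 - alpha, dof)
    return cutoff, d2 > cutoff


# Made-up squared distances, purely for illustration (not from the car data)
example_d2 = [1.0, 4.2, 25.0]
cutoff, is_outlier = flag_outliers(example_d2, dof=5)

print(cutoff)      # Chi-Square cutoff for alpha = 0.001
print(is_outlier)  # True where the squared distance exceeds the cutoff
```

Comparing against a single cutoff gives the same flags as comparing each p-value with 0.001, and it makes the threshold easy to report alongside the distances.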