In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense, correlation is any statistical association, although it commonly refers to the degree to which a pair of variables are related linearly.
Known examples of dependent phenomena include the correlation between the height of parents and their children and the correlation between the price of a good and the quantity that consumers are willing to buy, as represented by the so-called demand curve. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.
For example, an electric utility may produce less energy on a warm day based on the correlation between electricity demand and weather. In this example, there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling.
However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causality). Formally, random variables are dependent if they do not satisfy a mathematical property of probabilistic independence. In informal language, correlation is synonymous with dependence.
Essentially, correlation is a measure of how two or more variables relate to each other. There are several correlation coefficients. The most common of these is Pearson’s correlation coefficient, which is sensitive only to a linear relationship between two variables (which may be present even when one variable is a non-linear function of the other).
Other correlation coefficients, such as Spearman’s rank correlation, have been developed to be more robust than Pearson’s, i.e. more sensitive to non-linear relationships. Mutual information can also be applied to measure the dependence between two variables. There are also cases where the correlation has a value of 0, yet there is clearly some kind of relationship between the variables.
Correlations are scored from -1 to 1 and indicate whether there is a strong linear relationship — either in a positive or negative direction. However, there are many non-linear relationships that this type of score simply will not detect. In addition, the correlation is only defined for the numerical columns. So, we leave out all the categorical columns.
The same problem arises if you transform the categorical columns, because they are not ordinal, and if we do OneHotEncoding we will end up with an array with many different values (with high cardinality). The symmetry of correlation means that the correlation is the same whether we calculate the correlation of A and B or the correlation of B and A. However, relationships in the real world are rarely symmetrical. More often, relationships are asymmetrical.
A quick example: a column with 2 unique values (True or False, for example) will never be able to perfectly predict another column with 100 unique values. But the opposite could be true. Clearly, asymmetry is important because it is very common in the real world.
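As a minimal illustration of this asymmetry (the column names and the threshold are made up for the example), consider a numeric column with 100 distinct values and a boolean column derived from it:

import pandas as pd

# Hypothetical example: "value" has 100 unique values, "is_high" only 2
df = pd.DataFrame({"value": range(100)})
df["is_high"] = df["value"] >= 50

# Knowing "value" determines "is_high" exactly (it is just value >= 50),
# but knowing "is_high" only narrows "value" down to one of 50 candidates.
print(df.groupby("is_high")["value"].nunique())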
Have you ever asked:
- Is there a score that tells us if there is any relationship between two columns — no matter if the relationship is linear, non-linear, Gaussian, or some other type of relationship?
- Of course, the score should be asymmetrical because I want to detect all the strange relationships between two variables.
- The score should be 0 if there is no relationship and 1 if there is a perfect relationship.
- And the score should help answer the question "Are there correlations between the columns?" via a matrix, just like a correlation matrix, so that you can then make a scatter plot of the two columns and check whether there really is a strong relationship.
- And like the icing on the cake, the score should be able to handle both categorical and numerical columns by default.
In short, an asymmetric and data-type-agnostic score for predictive relationships between two columns, ranging from 0 to 1. Well, there is the Predictive Power Score library, and it can be found at the following link: Predictive Power Score
So, let’s work with the library in this notebook!
Note: you can download the notebook here.
First, we need to install it:
In [1]:
!pip3 install ppscore
Calculating the Predictive Power Score
First of all, there is no single way to calculate the Predictive Power Score. In fact, there are many possible ways to calculate a score that meets the requirements mentioned above. So, let’s rather think of the Predictive Power Score as a framework for a family of scores. Let’s say we have two columns and we want to calculate the PPS of x predicting y. In this case, we treat y as our target variable and x as our (only) feature.
We can now calculate a cross-validated Decision Tree and compute an appropriate evaluation metric (a rough sketch of this idea follows the list below).
- When the target is numeric, we can use a Regression Decision Tree and calculate the Mean Absolute Error (MAE).
- When the target is categorical, we can use a Classification Decision Tree and calculate the weighted F1 score.
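For the numeric case, such a cross-validated score could be computed roughly like this with scikit-learn. This is only a sketch of the idea described above, not the ppscore library’s actual implementation; the data and model settings are assumptions for illustration:

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with a non-linear relationship between x and y
df = pd.DataFrame({"x": np.random.uniform(-2, 2, 1000)})
df["y"] = df["x"] ** 2

# Cross-validated MAE of a Decision Tree that predicts y from x alone
mae = -cross_val_score(
    DecisionTreeRegressor(),
    df[["x"]],
    df["y"],
    cv=4,
    scoring="neg_mean_absolute_error",
).mean()
print(mae)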
You can also use other scores like ROC etc., but let’s leave those doubts aside for a second because we have another problem: most evaluation metrics do not make sense if you do not compare them to a baseline. A score of 0.9 doesn’t matter much if scores of 0.95 are easily achievable, and it would matter a lot if you are the first person to achieve a score higher than 0.7. Therefore, we need to “normalize” our evaluation score. And how do you normalize a score? You define a lower and an upper limit and put the score in perspective.
So what should the upper and lower limits be? Let’s start with the upper limit because this is usually easier: a perfect F1 is 1, and a perfect MAE is 0.
But what about the lower limit? Actually, we cannot answer this in absolute terms. The lower limit depends on the evaluation metric and the data set. It is the value achieved by a “naive” predictor.
But what is a naive model? For a classification problem, always predicting the most common class is quite naive. For a regression problem, always predicting the median value is quite naive.
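Putting those two limits together, one way to normalize a regression score that is consistent with this description (not necessarily the exact formula the ppscore library uses) looks like this:

# 1 means a perfect model, 0 means no better than the naive baseline
# (scores worse than the baseline are clipped to 0).
def normalized_score(model_mae, naive_mae):
    return max(0.0, 1.0 - model_mae / naive_mae)

# Example: a naive MAE of 10 and a model MAE of 2 give a score of 0.8
print(normalized_score(model_mae=2.0, naive_mae=10.0))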
Predictive Power Score vs. Correlation
To get a better idea of the Predictive Power Score and how it differs from correlation, let’s put the two head to head. We will compute the correlations between x and y (and vice versa), and then do the same with PPS, using data generated from the following equation: y = x² + error, where x is drawn uniformly from [-2, 2] and error is drawn uniformly from [-0.5, 0.5].
In [2]:
import pandas as pd
import numpy as np
import ppscore as pps
We’ve imported pandas, NumPy, and ppscore
In [3]:
df = pd.DataFrame()
We’ve now created an empty Pandas DataFrame
In [4]:
df
Out[4]:
According to the formula above, we need to create the values of the feature x, ranging from -2 to +2, as a uniform distribution with NumPy. We’ll create 10,000 samples and assign these values to a new column of the empty DataFrame called x.
Following the same formula, we will also need to create a new column called error, with values from -0.5 to 0.5 drawn from a uniform distribution with the same number of samples. We will again do this with NumPy.
In [5]:
df["x"] = np.random.uniform(-2, 2, 10000)
In [7]:
df.head()
In [8]:
df["error"] = np.random.uniform(-0.5, 0.5, 10000)
In [9]:
df.head()
Out[9]:
Your data will be different because it is randomly generated.
Great! We have the first half of the formula recreated. Now we need to create y according to the formula.
In [10]:
df["y"] = df["x"] * df["x"] + df["error"]
In [11]:
df.head()
Out[11]:
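With x, error, and y in place, we can already sketch the comparison we set out to make. The exact numbers will differ because the data is random; note that pps.score returns a dictionary in which the "ppscore" entry holds the actual score:

# Pearson correlation is symmetric and close to 0 for this quadratic relationship
print(df["x"].corr(df["y"]))
print(df["y"].corr(df["x"]))

# PPS is asymmetric: x should predict y quite well, but not the other way around
print(pps.score(df, "x", "y")["ppscore"])
print(pps.score(df, "y", "x")["ppscore"])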