In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense, correlation is any statistical association, although it commonly refers to the degree to which a pair of variables are related linearly.
Known examples of dependent phenomena include the correlation between the height of parents and their children and the correlation between the price of a good and the quantity that consumers are willing to buy, as represented by the so-called demand curve. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.
For example, an electric utility may produce less energy on a warm day based on the correlation between electricity demand and weather. In this example, there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling.
However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causality). Formally, random variables are dependent if they do not satisfy a mathematical property of probabilistic independence. In informal language, correlation is synonymous with dependence.
Essentially, correlation is a measure of how two or more variables relate to each other. There are several correlation coefficients. The most common of these is Pearson’s correlation coefficient, which is sensitive only to a linear relationship between two variables (which may be present even when one variable is a non-linear function of the other).
Other correlation coefficients, such as Spearman’s rank correlation, have been developed to be more robust than Pearson’s, i.e. more sensitive to non-linear relationships. Mutual information can also be applied to measure the dependence between two variables. There are also cases where the correlation has a value of 0, yet there is clearly some kind of relationship between the variables.
Correlations are scored from -1 to 1 and indicate whether there is a strong linear relationship — either in a positive or negative direction. However, there are many non-linear relationships that this type of score simply will not detect. In addition, the correlation is only defined for the numerical columns. So, we leave out all the categorical columns.
The same problem arises if you transform the categorical columns, because they are not ordinal, and if we do OneHotEncoding we will end up with an array with many different values (with high cardinality). The symmetry of correlation means that the correlation is the same whether we calculate the correlation of A and B or the correlation of B and A. However, relationships in the real world are rarely symmetrical. More often, relationships are asymmetrical.
A quick example: a column with 2 unique values (True or False, for example) will never be able to perfectly predict another column with 100 unique values. But the opposite could be true. Clearly, asymmetry is important because it is very common in the real world.
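As a minimal illustration of this asymmetry (the column names and the threshold are made up for the example), consider a numeric column with 100 distinct values and a boolean column derived from it:

import pandas as pd

# Hypothetical example: "value" has 100 unique values, "is_high" only 2
df = pd.DataFrame({"value": range(100)})
df["is_high"] = df["value"] >= 50

# Knowing "value" determines "is_high" exactly (it is just value >= 50),
# but knowing "is_high" only narrows "value" down to one of 50 candidates.
print(df.groupby("is_high")["value"].nunique())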
Have you ever asked:
- Is there a score that tells us if there is any relationship between two columns — no matter if the relationship is linear, non-linear, Gaussian, or some other type of relationship?
- Of course, the score should be asymmetrical because I want to detect all the strange relationships between two variables.
- The score should be 0 if there is no relationship and 1 if there is a perfect relationship.
- And the score should help answer the question "Are there correlations between the columns?" via a matrix, just like a correlation matrix, so that you can then make a scatter plot of the two columns and check whether there really is a strong relationship.
- And like the icing on the cake, the score should be able to handle both categorical and numerical columns by default.
In short, an asymmetric and data-type-agnostic score for predictive relationships between two columns, ranging from 0 to 1. Well, there is the Predictive Power Score library, and it can be found at the following link: Predictive Power Score
So, let’s work with the library in this notebook!
Note: you can download the notebook here.
First, we need to install it:
In [1]:
!pip3 install ppscore
Calculating the Predictive Power Score
First of all, there is no single way to calculate the Predictive Power Score. In fact, there are many possible ways to calculate a score that meets the requirements mentioned above. So, let’s rather think of the Predictive Power Score as a framework for a family of scores. Let’s say we have two columns and we want to calculate the PPS of x predicting y. In this case, we treat y as our target variable and x as our (only) feature.
We can now calculate a cross-validated Decision Tree and compute an appropriate evaluation metric (a rough sketch of this idea follows the list below).
- When the target is numeric, we can use a Regression Decision Tree and calculate the Mean Absolute Error (MAE).
- When the target is categorical, we can use a Classification Decision Tree and calculate the weighted F1 score.
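For the numeric case, such a cross-validated score could be computed roughly like this with scikit-learn. This is only a sketch of the idea described above, not the ppscore library’s actual implementation; the data and model settings are assumptions for illustration:

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with a non-linear relationship between x and y
df = pd.DataFrame({"x": np.random.uniform(-2, 2, 1000)})
df["y"] = df["x"] ** 2

# Cross-validated MAE of a Decision Tree that predicts y from x alone
mae = -cross_val_score(
    DecisionTreeRegressor(),
    df[["x"]],
    df["y"],
    cv=4,
    scoring="neg_mean_absolute_error",
).mean()
print(mae)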
You can also use other scores like ROC etc., but let’s leave those doubts aside for a second because we have another problem: most evaluation metrics do not make sense if you do not compare them to a baseline. A score of 0.9 doesn’t matter much if scores of 0.95 are easily achievable, and it would matter a lot if you are the first person to achieve a score higher than 0.7. Therefore, we need to “normalize” our evaluation score. And how do you normalize a score? You define a lower and an upper limit and put the score in perspective.
So what should the upper and lower limits be? Let’s start with the upper limit because this is usually easier: a perfect F1 is 1, and a perfect MAE is 0.
But what about the lower limit? Actually, we cannot answer this in absolute terms. The lower limit depends on the evaluation metric and the data set. It is the value achieved by a “naive” predictor.
But what is a naive model? For a classification problem, always predicting the most common class is quite naive. For a regression problem, always predicting the median value is quite naive.
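Putting those two limits together, one way to normalize a regression score that is consistent with this description (not necessarily the exact formula the ppscore library uses) looks like this:

# 1 means a perfect model, 0 means no better than the naive baseline
# (scores worse than the baseline are clipped to 0).
def normalized_score(model_mae, naive_mae):
    return max(0.0, 1.0 - model_mae / naive_mae)

# Example: a naive MAE of 10 and a model MAE of 2 give a score of 0.8
print(normalized_score(model_mae=2.0, naive_mae=10.0))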
Predictive Power Score vs. Correlation
To get a better idea of the Predictive Power Score and how it differs from correlation, let’s put the two head to head. We will compute the correlations between x and y (and vice versa), and then do the same with PPS, using data generated from the following equation: y = x² + error, where x is drawn uniformly from [-2, 2] and error is drawn uniformly from [-0.5, 0.5].
In [2]:
import pandas as pd
import numpy as np
import ppscore as pps
We’ve imported pandas, NumPy, and ppscore
In [3]:
df = pd.DataFrame()
We’ve now created an empty Pandas DataFrame
In [4]:
df
Out[4]:
According to the formula above, we need to create the values of the feature x, ranging from -2 to +2, as a uniform distribution with NumPy. We’ll create 10,000 samples and assign these values to a new column of the empty DataFrame called x.
Following the same formula, we will also need to create a new column called error, with values from -0.5 to 0.5 drawn from a uniform distribution with the same number of samples. We will again do this with NumPy.
In [5]:
df["x"] = np.random.uniform(-2, 2, 10000)
In [7]:
df.head()
In [8]:
df["error"] = np.random.uniform(-0.5, 0.5, 10000)
In [9]:
df.head()
Out[9]:
Your data will be different because it is randomly generated.
Great! We have the first half of the formula recreated. Now we need to create y according to the formula.
In [10]:
df["y"] = df["x"] * df["x"] + df["error"]
In [11]:
df.head()
Out[11]:
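With x, error, and y in place, we can already sketch the comparison we set out to make. The exact numbers will differ because the data is random; note that pps.score returns a dictionary in which the "ppscore" entry holds the actual score:

# Pearson correlation is symmetric and close to 0 for this quadratic relationship
print(df["x"].corr(df["y"]))
print(df["y"].corr(df["x"]))

# PPS is asymmetric: x should predict y quite well, but not the other way around
print(pps.score(df, "x", "y")["ppscore"])
print(pps.score(df, "y", "x")["ppscore"])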