Thursday, December 26, 2024

Y Scrambling for Model Validation

Y-Scrambling is a method for testing whether a model’s predictions reflect a real relationship between features and target rather than a chance correlation. It is commonly used to validate multiple linear regression QSPR (quantitative structure-property relationship) models. It goes by several names: Y-Scrambling, Y-Randomization, Y-Permutation, etc. The procedure is simple to carry out, and we’ll go through it in detail.

Steps for Y-Scrambling:

The intuition behind Y-Scrambling is simple. First, train your model on the original data and note its performance metric. Next, shuffle the target column so that the correct feature-target pairs are replaced with incorrect ones, train the model on this data, and note its performance metric again. Then re-shuffle the target column and repeat. We expect the model to perform well on the original data and poorly on the shuffled data. If that is not the case and the metric doesn’t vary much, the original score could have arisen by chance and the model’s predictions can’t be trusted. The step-wise process is as follows:

  1. Train the model on the original feature-target pairs.
  2. Note the performance metric.
  3. Repeat for a fixed number of iterations:
    • Shuffle the target column.
    • Train the model on the new feature-target pairs.
    • Note the performance metric.
  4. Compare the metric for the original pairs with the metrics for the shuffled ones.
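
The steps above can be sketched as one reusable function. This is a minimal illustration under stated assumptions, not the tutorial's own implementation: the helper names `r2_in_sample` and `y_scramble` are made up here, the data is synthetic, and ordinary least squares is fitted with NumPy's `lstsq` as a stand-in for a regression model.

```python
import numpy as np

def r2_in_sample(X, y):
    """Fit least squares with an intercept and return R^2 on the same data."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def y_scramble(X, y, n_iter=100, seed=0):
    """Steps 1-3: score the original pairs, then n_iter shuffled targets."""
    rng = np.random.default_rng(seed)
    original = r2_in_sample(X, y)
    # rng.permutation returns a shuffled copy, so y itself is never modified
    shuffled = [r2_in_sample(X, rng.permutation(y)) for _ in range(n_iter)]
    return original, shuffled

# Synthetic data (assumption): 200 samples, 3 features, a known linear signal
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

original_r2, shuffled_r2 = y_scramble(X, y)
print(f"original R^2:     {original_r2:.3f}")
print(f"max shuffled R^2: {max(shuffled_r2):.3f}")
```

On data with a real signal, the original score should sit well above every shuffled score; step 4 is the comparison of those two printed numbers.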

Implementing Y-Scrambling:

For this tutorial, I’ll use the Boston house pricing dataset from sklearn’s datasets module. The loader returns a dictionary-like object with the features under the data key and the targets under the target key. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below needs an older version. Let’s start by importing the data:

import numpy as np
from sklearn.datasets import load_boston

data = load_boston()
X = data.data
Y = data.target

Now that we have the features and target, let’s execute the first two steps of Y-Scrambling, i.e. training the model and noting the performance metric.

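from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Step 1: train on the original feature-target pairs
reg = LinearRegression()
reg.fit(X, Y)

# Step 2: note the metric (R^2 on the training data itself)
ypred = reg.predict(X)
original_r2 = r2_score(Y, ypred)
print(original_r2)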

The original_r2 came out to be 0.74064. With this, we’ve completed the first two steps. Now we’ll proceed to the next step, i.e. shuffling the target array, training the model, and storing the performance metric in a loop. These steps have to be repeated for a certain number of iterations, which I took as 100 for this tutorial. One thing to note is that 

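from tqdm.notebook import trange  # progress bar for Jupyter notebooks

shuffled_r2 = []
for i in trange(100):
    # Shuffle a copy so the original Y stays intact
    Y_shuffled = np.random.permutation(Y)

    reg = LinearRegression()
    reg.fit(X, Y_shuffled)

    ypred = reg.predict(X)
    shuffled_r2.append(r2_score(Y_shuffled, ypred))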
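
NumPy offers two ways to shuffle the target, and they differ in one respect that matters for this loop: `np.random.shuffle` reorders the array it is given in place, while `np.random.permutation` returns a shuffled copy. The small sketch below, on a made-up five-element array, shows the difference.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # toy target (assumption)

# permutation() returns a shuffled copy and leaves y untouched
y_shuffled = np.random.permutation(y)
unchanged_after_permutation = np.array_equal(y, [10.0, 20.0, 30.0, 40.0, 50.0])

# shuffle() reorders its argument in place and returns None
y_in_place = y.copy()
result = np.random.shuffle(y_in_place)

# Both keep exactly the same values; only the order (the pairing) changes
same_values = (np.array_equal(np.sort(y_shuffled), y)
               and np.array_equal(np.sort(y_in_place), y))
print(unchanged_after_permutation, result is None, same_values)
```

Shuffling in place is fine for the loop itself, since every pass only needs some random ordering, but it means the original feature-target pairing is gone once the loop has run.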

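If you print shuffled_r2, you’ll see that the model performed very poorly on the shuffled targets. The first 20 values are shown below; all are far below the original R² of 0.74064, which suggests the original fit was not a chance result: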

>>> shuffled_r2[:20]
[0.015336761335013271,
 0.0176654793204013,
 0.01740534118134418,
 0.02319807700450416,
 0.018487786525668626,
 0.02251746334707183,
 0.03766952947632973,
 0.01854475963361435,
 0.03570134149232318,
 0.022607830815118635,
 0.016603896471999002,
 0.0386838401376941,
 0.024355424374905343,
 0.04058673452547956,
 0.014581835385169217,
 0.03193842111822809,
 0.03366492627548756,
 0.02274120932669821,
 0.04335824299249236,
 0.02665799106621214]
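
The shuffled scores above are small but not zero, and that is expected. For ordinary least squares with an intercept, scored on the same data it was fitted on, the R² averaged over all permutations of the target is p / (n - 1), where p is the number of features and n the number of samples. The sketch below works this out for the Boston data (506 samples, 13 features); the function name `expected_null_r2` and the larger feature counts are illustrative assumptions, not part of the original tutorial.

```python
def expected_null_r2(n_samples: int, n_features: int) -> float:
    """Average in-sample R^2 of OLS (with intercept) when the target is shuffled."""
    return n_features / (n_samples - 1)

# Boston housing data: 506 samples, 13 features
boston_baseline = expected_null_r2(506, 13)
print(f"Boston chance-level R^2: {boston_baseline:.4f}")

# Hypothetical descriptor counts, to show how the baseline grows
for n_features in (13, 50, 250):
    print(n_features, round(expected_null_r2(506, n_features), 4))
```

The listed values (roughly 0.015 to 0.043) scatter around this baseline. The same formula suggests why the check matters for models with many descriptors relative to samples: the chance-level score rises with the feature count.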

Code:

import numpy as np
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# LOADING THE DATA
data = load_boston()
X = data.data
Y = data.target

# TRAINING OVER ORIGINAL TARGET
reg = LinearRegression()
reg.fit(X, Y)

ypred = reg.predict(X)
original_r2 = r2_score(Y, ypred)
print(original_r2)

# TRAINING OVER SHUFFLED TARGET
shuffled_r2 = []

for i in range(100):
    # Shuffle a copy so the original Y stays intact
    Y_shuffled = np.random.permutation(Y)

    reg = LinearRegression()
    reg.fit(X, Y_shuffled)

    ypred = reg.predict(X)
    shuffled_r2.append(r2_score(Y_shuffled, ypred))

print(shuffled_r2[:20])
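
The listing stops after step 3. Step 4, comparing the original metric with the shuffled ones, could be sketched as an empirical p-value: the share of shuffled runs that score at least as well as the original, with a +1 correction so the estimate is never exactly zero. The function name `empirical_p_value` is an assumption made for this sketch, and the inputs are the values reported earlier in this article (the 20 shuffled scores rounded to four decimals), not a fresh run.

```python
def empirical_p_value(original_score, shuffled_scores):
    """Share of shuffled runs scoring at least as well as the original (+1 corrected)."""
    at_least_as_good = sum(s >= original_score for s in shuffled_scores)
    return (at_least_as_good + 1) / (len(shuffled_scores) + 1)

original_r2 = 0.74064  # reported for the original targets
shuffled_head = [      # first 20 shuffled scores from the article, rounded
    0.0153, 0.0177, 0.0174, 0.0232, 0.0185, 0.0225, 0.0377, 0.0185, 0.0357, 0.0226,
    0.0166, 0.0387, 0.0244, 0.0406, 0.0146, 0.0319, 0.0337, 0.0227, 0.0434, 0.0267,
]

p_value = empirical_p_value(original_r2, shuffled_head)
gap = original_r2 - max(shuffled_head)

print(f"best shuffled R^2: {max(shuffled_head):.4f}")
print(f"gap to original:   {gap:.4f}")
print(f"empirical p-value: {p_value:.4f}")
```

With no shuffled score reaching the original, the p-value is the smallest one the number of iterations allows, so running more iterations is what tightens the estimate.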


Dominic Rubhabha-Wardslaus