RMSE: Root Mean Square Error (RMSE) measures how well a regression line fits the data points. It can also be interpreted as the standard deviation of the residuals, that is, of the differences between the observed values and the values predicted by the line.
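The definition above can be written as a formula. The notation is an assumption made here for illustration: y_i is an observed value, \hat{y}_i is the value the regression line predicts for the same x, and N is the number of data points.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}
```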
Consider the given data points: (1, 1), (2, 2), (2, 3), (3, 6).
Let us break the above data points into 1-d lists.
Input:
x = [1, 2, 2, 3]
y = [1, 2, 3, 6]
Code: Regression Graph
Python
import math
import matplotlib.pyplot as plt

# data points from the input above
x = [1, 2, 2, 3]
y = [1, 2, 3, 6]

# plotting the points (joined in the order given)
plt.plot(x, y)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('Regression Graph')
# function to show the plot
plt.show()
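Output:
(Plot titled 'Regression Graph' showing the four data points, with x on the horizontal axis and y on the vertical axis.)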
Code: Mean Calculation
Python
# In the next step we find the equation of the best-fit line.
# We use the slope-intercept form of a line: y = mx + c
# where m is the slope, i.e. (change in y) / (change in x),
# and c is a constant giving the point where the line intercepts the y-axis.
# The slope m can be formulated as below:
'''
m = Σ (xi - Xmean)(yi - Ymean) / Σ (xi - Xmean)^2
where both sums run over i = 1 to n
'''
# calculate Xmean and Ymean
ct = len(x)
sum_x = 0
sum_y = 0
for i in x:
    sum_x = sum_x + i
x_mean = sum_x / ct
print('Value of X mean', x_mean)
for i in y:
    sum_y = sum_y + i
y_mean = sum_y / ct
print('value of Y mean', y_mean)
# we now have the values of x_mean and y_mean
Output :
Value of X mean 2.0
value of Y mean 3.0
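The slope formula quoted in the comments can be evaluated directly on the four data points. The following is a minimal sketch that reuses the x and y lists from the input; the names numerator and denominator are introduced here only for illustration.

```python
# Data points from the example
x = [1, 2, 2, 3]
y = [1, 2, 3, 6]

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Numerator of the slope formula: sum of (xi - x_mean) * (yi - y_mean)
numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
# Denominator of the slope formula: sum of (xi - x_mean) ** 2
denominator = sum((xi - x_mean) ** 2 for xi in x)

m = numerator / denominator
print('Slope', m)  # prints: Slope 2.5
```

The numerator works out to 5 and the denominator to 2, which gives the slope of 2.5 used in the rest of the walkthrough.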
Code: Line Equation
Python
# Below is the process of finding the line equation in mathematical terms.
# Applying the slope formula to our data gives m = 5 / 2 = 2.5
# calculate c to find the equation
m = 2.5
c = y_mean - m * x_mean
print('Intercept', c)
Output:
Intercept -2.0
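With the slope and the intercept known, the predicted value for each x follows from y = mx + c. A short sketch, assuming the values m = 2.5 and c = -2.0 found above:

```python
# x values from the example, plus the slope and intercept found above
x = [1, 2, 2, 3]
m = 2.5
c = -2.0

# Evaluate the regression line at every x
y_pred = [m * xi + c for xi in x]
print('Predicted values', y_pred)  # prints: Predicted values [0.5, 3.0, 3.0, 5.5]
```

These are the predictions that the error metrics compare against the observed y values.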
Code: Mean Squared Error
Python
# The equation of our regression line comes out as below:
# y_pred = 2.5x - 2.0
# we call the line y_pred
from sklearn.metrics import mean_squared_error

# y_pred for our existing data points is as below
y = [1, 2, 3, 6]
y_pred = [0.5, 3, 3, 5.5]
Python
# root mean square error calculated with the sklearn package
rmse1 = math.sqrt(mean_squared_error(y, y_pred))
print('Root mean square error', rmse1)
# another way to find RMSE is to pass squared=False to mean_squared_error
# (note: depending on the installed scikit-learn version, this argument may be
# deprecated or removed in favor of root_mean_squared_error)
rmse2 = mean_squared_error(y, y_pred, squared=False)
print('Root mean square error', rmse2)
Output:
Root mean square error 0.6123724356957945
Root mean square error 0.6123724356957945
Code : RMSE Calculation
Python
# Let's check how the root mean square error is calculated mathematically.
# Let's introduce a term called residuals.
# A residual is the signed vertical distance of a data point from the regression line
# (the residuals are drawn as red lines in the accompanying graph).
# The residuals and the root mean square error are calculated as below.
# We have 4 data points.
'''
ri = yi - y_pred_i, where y_pred_i = m*xi + c
so ri = yi - (m*xi + c)
e.g. for x = 1 the observed value of y is 1;
we want to evaluate what our model has predicted for x = 1
'''
# (1, 1): x = 1, y = 1
# r1 = 1 - (2.5 * 1 - 2.0) = 0.5
r1 = 1 - (2.5 * 1 - 2.0)
# (2, 2): x = 2, y = 2
# r2 = 2 - (2.5 * 2 - 2.0) = -1
r2 = 2 - (2.5 * 2 - 2.0)
# (2, 3): x = 2, y = 3
# r3 = 3 - (2.5 * 2 - 2.0) = 0
r3 = 3 - (2.5 * 2 - 2.0)
# (3, 6): x = 3, y = 6
# r4 = 6 - (2.5 * 3 - 2.0) = 0.5
r4 = 6 - (2.5 * 3 - 2.0)
# from the above calculation we have the values of the residuals
residuals = [0.5, -1, 0, 0.5]
# now calculate the root mean square error
# N = 4 data points
N = 4
rmse = math.sqrt((r1 ** 2 + r2 ** 2 + r3 ** 2 + r4 ** 2) / N)
print('Root Mean square error using maths', rmse)
# the RMSE calculated by hand is the same as the one calculated with sklearn
Output:
Root Mean square error using maths 0.6123724356957945
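The opening remark that RMSE can be read as the standard deviation of the residuals can be checked on these numbers. The sketch below assumes the population standard deviation (dividing by N, not N - 1) and relies on the fact that the residuals of this least-squares line average to zero; the variable names are illustrative.

```python
import math
import statistics

# Residuals of the four data points, as computed above
residuals = [0.5, -1, 0, 0.5]

# RMSE: square root of the mean of the squared residuals
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# The residuals average to zero, so their population standard
# deviation reduces to the same expression as the RMSE
mean_residual = sum(residuals) / len(residuals)
std_residuals = statistics.pstdev(residuals)

print('Mean of residuals', mean_residual)
print('RMSE', rmse)
print('Population standard deviation of residuals', std_residuals)
```

If the residuals did not average to zero (for example, for a line that was not fitted by least squares), the two quantities would differ.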
R-squared Error or Coefficient of Determination
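R2 answers the question below.
What proportion of the variation in y is explained by the regression on x? For a least-squares line like the one here, it lies between 0 and 1 and is often quoted as a percentage. Despite being called an "error" in this section, a higher value means a better fit.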
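In symbols, the quantity can be sketched as follows. The names SE_line (squared error of the data around the regression line) and SE_mean (squared error of the data around the mean of y) match the variables used in the code for this section; \bar{y} for the mean of y is notation assumed here.

```latex
R^2 = 1 - \frac{SE_{\text{line}}}{SE_{\text{mean}}}
    = 1 - \frac{\sum_{i=1}^{n}\left(y_i - (m x_i + c)\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

For the four data points used here, SE_line = 1.5 and SE_mean = 14, so R^2 = 1 - 1.5/14, which is approximately 0.893.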
Code: R-Squared Error
Python
# SE_line = (y1 - (m*x1 + c))**2 + (y2 - (m*x2 + c))**2 + ... + (yn - (m*xn + c))**2
# SE_line = (1 - (2.5*1 + (-2)))**2 + (2 - (2.5*2 + (-2)))**2
#         + (3 - (2.5*2 + (-2)))**2 + (6 - (2.5*3 + (-2)))**2
val1 = (1 - (2.5 * 1 + (-2))) ** 2
val2 = (2 - (2.5 * 2 + (-2))) ** 2
val3 = (3 - (2.5 * 2 + (-2))) ** 2
val4 = (6 - (2.5 * 3 + (-2))) ** 2
SE_line = val1 + val2 + val3 + val4
print('val', val1, val2, val3, val4)
# next, calculate the total variation in y around its mean value
# variation in y is calculated as
# y_var = (y1 - ymean)**2 + (y2 - ymean)**2 + ... + (yn - ymean)**2
y = [1, 2, 3, 6]
y_var = (1 - 3) ** 2 + (2 - 3) ** 2 + (3 - 3) ** 2 + (6 - 3) ** 2
SE_mean = y_var
# by calculating y_var we are calculating the squared distance
# between the y data points and the mean value of y
# so the answer to our question, the share of the total variation
# of y explained with respect to x, is given as below:
r_squared = 1 - (SE_line / SE_mean)
# SE_line / SE_mean --> tells us what fraction of the variation
# in y is not described by the regression line
# 1 - (SE_line / SE_mean) --> gives us the fraction of the
# variation in y that is described by the regression line
print('Rsquared error', r_squared)
Output:
val 0.25 1.0 0.0 0.25
Rsquared error 0.8928571428571429
Code: R-Squared Error with sklearn
Python
from sklearn.metrics import r2_score

# the r2 value calculated by sklearn matches
# the value we calculated mathematically above
# calculate r2 using sklearn
print(r2_score(y, y_pred))
Output:
0.8928571428571429
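Both metrics can also be wrapped in small reusable functions that need only the standard library. This is a sketch under the definitions used in this walkthrough; the function names rmse_value and r_squared_value are hypothetical helpers, not part of sklearn.

```python
import math


def rmse_value(y_true, y_pred):
    """Root of the mean squared difference between observed and predicted values."""
    squared_errors = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared_value(y_true, y_pred):
    """1 - SE_line / SE_mean, following the definition used above."""
    y_mean = sum(y_true) / len(y_true)
    se_line = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    se_mean = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1 - se_line / se_mean


# Observed and predicted values from the example
y = [1, 2, 3, 6]
y_pred = [0.5, 3, 3, 5.5]

print('RMSE', rmse_value(y, y_pred))
print('R-squared', r_squared_value(y, y_pred))
```

On the example data these should agree with the sklearn results shown above, up to floating-point rounding.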