In linear or multiple regression, it is not enough to simply fit the model to the dataset, because the fit alone may not give the desired result. To apply linear or multiple regression effectively, we need to check some assumptions on the dataset. When these assumptions hold, the regression is more reliable and gives better accuracy.
Assumptions of Regression
Regression analysis requires some assumptions to be followed by the dataset. These assumptions are:
- Observations are independent of each other. No observation should be correlated with another observation.
- The data (more precisely, the residuals of the model) are normally distributed.
- The relationship between the independent variable and the mean of the dependent variable is linear.
- The data is homoscedastic, which means the variance of the residuals is the same for each value of the independent variable.
To perform a good linear regression analysis, we also need to check whether these assumptions are violated:
- If the data contain non-linear trends, a linear regression will not fit them properly, resulting in high residuals or a high error rate.
- To check for normality, draw a Q-Q plot of the residuals.
- The presence of correlation between observations is known as autocorrelation. We can check for it with an autocorrelation plot.
- Homoscedasticity can be assessed with plots such as the Scale-Location plot and the Residuals vs Fitted plot.
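As a numeric complement to the autocorrelation check above, the Durbin-Watson statistic (which statsmodels also reports in its OLS summary) describes first-order autocorrelation of the residuals: values near 2 suggest none, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The following is a minimal pure-Python sketch; the function name and the residual lists are invented for illustration and are not taken from the Real-Estate dataset.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, invented for illustration only
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips at every step
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # long runs of the same sign

dw_alternating = durbin_watson(alternating)  # above 2: negative autocorrelation
dw_trending = durbin_watson(trending)        # below 2: positive autocorrelation
print(dw_alternating, dw_trending)
```

In practice the residuals would come from the fitted model, ordered as the observations were collected.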
Regression Diagnostic Plots
The plots above can be used to validate and test these assumptions, and together they form the regression diagnostic. This diagnostic can be used to check whether the assumptions hold. Before we discuss the diagnostic plots one by one, let’s discuss some important terms:
- Outliers: Outliers are points that are distinct and deviate from the bulk of the dataset. In general, outliers have high residual values, meaning that the difference between the observed and predicted value is large.
- Leverage Points: A leverage point is defined as an observation that has a value of x that is far away from the mean of x.
- Influential Points: An influential observation is defined as an observation that has a large influence on the fit of the model. One method to find influential points is to compare the fit of the model with and without each observation.
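The terms above can be made concrete with a small sketch for a single predictor, using only the standard library. It computes the leverage of each point from its distance to the mean of x, and measures influence by refitting the line without each observation in turn. The helper names and the data are hypothetical, chosen so that the last point sits far from the other x values.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for a single predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leverages(x):
    """Leverage of each point: 1/n + (x_i - mean(x))^2 / Sxx."""
    n, x_bar = len(x), mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


def slope_shift_without(x, y, i):
    """Change in the fitted slope when observation i is left out."""
    _, full_slope = fit_line(x, y)
    _, reduced_slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return reduced_slope - full_slope


# Hypothetical data, invented for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]

h = leverages(x)
shifts = [slope_shift_without(x, y, i) for i in range(len(x))]
for i, (h_i, shift) in enumerate(zip(h, shifts)):
    print(f"obs {i}: leverage={h_i:.3f}, slope shift if removed={shift:+.3f}")
```

With this data the last observation has both the highest leverage and the largest effect on the slope, so it is a leverage point that is also influential. A high-leverage point that follows the trend of the other points would change the fit far less.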
Below are the plots used in the regression diagnostic:
- Residual vs fitted plot: The residual can be calculated as:
Residual = Observed value - Predicted value
This plot is used to check for linearity and homoscedasticity. If the model meets the condition of a linear relationship, the residuals should follow a horizontal line without much deviation. If the model meets the condition for homoscedasticity, the points should be equally spread around the y=0 line.
- Q-Q plot: This plot is used to check for normality of the residuals. If the residuals are normally distributed, the scatter points will lie along the 45-degree dashed line.
- Scale-Location plot: It is a plot of the square root of the absolute standardized residuals vs the predicted (fitted) values. This plot is used for checking the homoscedasticity of residuals. Equally spread residuals across the horizontal line indicate the homoscedasticity of residuals.
- Residual vs Leverage plot/ Cook’s distance plot: The fourth plot is the Cook’s distance plot, which is used to measure the influence of the different points. The Cook’s distance statistic for every observation measures the extent of change in model estimates when that particular observation is omitted. The Cook’s distance plot shows the Cook’s distance of each observation, whereas the Residual vs Leverage plot shows the standardized residuals against the leverage of the points.
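The quantities drawn in these plots can be computed by hand for a single predictor, as the sketch below does with only the standard library. It assumes the usual textbook definitions: internally studentized residuals e_i / (s * sqrt(1 - h_i)) and Cook's distance (r_i^2 / p) * h_i / (1 - h_i), with p the number of fitted parameters. The function name, dictionary keys and data are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import mean


def diagnostics(x, y):
    """Per-observation quantities behind the four diagnostic plots (one predictor)."""
    n, p = len(x), 2  # p = number of fitted parameters (intercept and slope)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]  # observed - predicted
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Residual standard error, then internally studentized residuals
    s = sqrt(sum(e ** 2 for e in resid) / (n - p))
    std_resid = [e / (s * sqrt(1 - h)) for e, h in zip(resid, lev)]

    return {
        "fitted": fitted,
        "residuals": resid,
        "std_residuals": std_resid,
        "scale_location": [sqrt(abs(r)) for r in std_resid],
        "leverage": lev,
        "cooks_distance": [r ** 2 / p * h / (1 - h) for r, h in zip(std_resid, lev)],
    }


# Hypothetical data, invented for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]

d = diagnostics(x, y)
for i, cook in enumerate(d["cooks_distance"]):
    print(f"obs {i}: leverage={d['leverage'][i]:.3f}, Cook's distance={cook:.3f}")
```

Here `fitted` and `residuals` feed the Residual vs fitted plot, `std_residuals` the Q-Q plot, `scale_location` the Scale-Location plot, and `leverage` with `std_residuals` the Residual vs Leverage plot.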
Implementation
In this implementation, we will plot the different diagnostic plots. For that, we use the Real-Estate dataset and apply Ordinary Least Squares (OLS) regression. We then plot the regression diagnostic plots and the Cook's distance plot.
Python3
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load Real Estate data
data = pd.read_csv('/content/Real estate.csv')
print(data.head())

# Fit an OLS regression model
model = smf.ols(formula='Y ~ X3 + X2', data=data)
results = model.fit()
print(results.summary())

# Get the quantities needed for the diagnostic plots
residuals = results.resid
fitted_value = results.fittedvalues
# Residuals scaled by the estimated error standard deviation
stand_resids = results.resid_pearson
influence = results.get_influence()
leverage = influence.hat_matrix_diag

# Plot the diagnostic plots with the seaborn look
sns.set_theme()
plt.rcParams["figure.figsize"] = (20, 15)
fig, ax = plt.subplots(nrows=2, ncols=2)

# Residual vs Fitted Plot
sns.scatterplot(x=fitted_value, y=residuals, ax=ax[0, 0])
ax[0, 0].axhline(y=0, color='grey', linestyle='dashed')
ax[0, 0].set_xlabel('Fitted Values')
ax[0, 0].set_ylabel('Residuals')
ax[0, 0].set_title('Residuals vs Fitted')

# Normal Q-Q plot
sm.qqplot(residuals, fit=True, line='45', ax=ax[0, 1], c='#4C72B0')
ax[0, 1].set_title('Normal Q-Q')

# Scale-Location Plot: square root of the absolute standardized residuals
sns.scatterplot(x=fitted_value, y=np.sqrt(np.abs(stand_resids)), ax=ax[1, 0])
ax[1, 0].set_xlabel('Fitted values')
ax[1, 0].set_ylabel('Sqrt(|standardized residuals|)')
ax[1, 0].set_title('Scale-Location Plot')

# Residual vs Leverage Plot
sns.scatterplot(x=leverage, y=stand_resids, ax=ax[1, 1])
ax[1, 1].axhline(y=0, color='grey', linestyle='dashed')
ax[1, 1].set_xlabel('Leverage')
ax[1, 1].set_ylabel('Standardized residuals')
ax[1, 1].set_title('Residuals vs Leverage Plot')

plt.tight_layout()
plt.show()

# Plot Cook's distance plot
sm.graphics.influence_plot(results, criterion="cooks")
plt.show()
------------
# data
   No        X1    X2         X3  X4        X5         X6     Y
0   1  2012.917  32.0   84.87882  10  24.98298  121.54024  37.9
1   2  2012.917  19.5  306.59470   9  24.98034  121.53951  42.2
2   3  2013.583  13.3  561.98450   5  24.98746  121.54391  47.3
3   4  2013.500  13.3  561.98450   5  24.98746  121.54391  54.8
4   5  2012.833   5.0  390.56840   5  24.97937  121.54245  43.1
------------
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.491
Model:                            OLS   Adj. R-squared:                  0.489
Method:                 Least Squares   F-statistic:                     198.3
Date:                Thu, 19 Aug 2021   Prob (F-statistic):           5.07e-61
Time:                        17:56:17   Log-Likelihood:                -1527.9
No. Observations:                 414   AIC:                             3062.
Df Residuals:                     411   BIC:                             3074.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     49.8856      0.968     51.547      0.000      47.983      51.788
X3            -0.0072      0.000    -18.997      0.000      -0.008      -0.006
X2            -0.2310      0.042     -5.496      0.000      -0.314      -0.148
==============================================================================
Omnibus:                      161.397   Durbin-Watson:                   2.130
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1297.792
Skew:                           1.443   Prob(JB):                    1.54e-282
Kurtosis:                      11.180   Cond. No.                     3.37e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.37e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
--------------