Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x).
Why Polynomial Regression?
- There are some relationships that a researcher will hypothesize are curvilinear. Such cases call for a polynomial term in the model.
- Inspection of residuals. If we try to fit a linear model to curved data, a scatter plot of the residuals (Y-axis) against the predictor (X-axis) shows a systematic pattern, for example many positive residuals in the middle and negative ones at the ends. In such a situation a linear model is not appropriate.
- An assumption in the usual multiple linear regression analysis is that all the independent variables are mutually independent. In the polynomial regression model this assumption is not satisfied, since x, x^2, x^3, and so on are strongly correlated, as the short check below illustrates.
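As a quick, hedged illustration of that last point (the values here are invented, and the snippet is only a sanity check, not part of the main example), a feature and its square are strongly correlated:

Python3
import numpy as np

# Hypothetical feature values, purely for illustration
x = np.arange(1, 11)

# x and x^2 are far from independent: their correlation is close to 1
print(np.corrcoef(x, x ** 2)[0, 1])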
How does a Polynomial Regression work?
If we observe closely, we will realize that to evolve from linear regression to polynomial regression we simply add higher-order terms of the independent features to the feature space. This is sometimes loosely described as feature engineering, though it is only one specific instance of it.
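As a minimal sketch of this idea (the array and variable names below are invented for illustration), evolving from a linear to a cubic model simply means stacking higher-order copies of the feature next to it:

Python3
import numpy as np

# A single feature column (toy values)
x = np.array([1.0, 2.0, 3.0]).reshape(-1, 1)

# Expanded feature space [x, x^2, x^3]: a linear model fit
# on these three columns is a cubic model in x
X_expanded = np.hstack([x, x ** 2, x ** 3])
print(X_expanded)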
Applications of Polynomial Regression
The reason behind the wide range of use cases for polynomial regression is that much real-world data is non-linear in nature, so fitting a non-linear model, i.e., a curvilinear regression line, often gives far better results than standard linear regression. Some of the use cases of Polynomial Regression are stated below:
- The growth rate of tissues
- Progression of disease epidemics
- Distribution of carbon isotopes in lake sediments
The basic goal of regression analysis is to model the expected value of a dependent variable y in terms of the value of an independent variable x. In simple linear regression, we used the following equation –
y = a + bx + e
Here y is the dependent variable, a is the y-intercept, b is the slope, and e is the error term. In many cases this linear model will not work. For example, if we analyze the output of a chemical synthesis in terms of the temperature at which the synthesis takes place, we need a quadratic model:
y = a + b1x + b2x^2 + e
Here y is the dependent variable, x is the independent variable, a is the y-intercept, and e is the error term. In general, we can model the relationship with an nth-degree polynomial:
y = a + b1x + b2x^2 + ... + bnx^n
Since the regression function is linear in the unknown coefficients, these models are linear from the standpoint of estimation. Hence the coefficients, and with them the fitted response values y, can be computed with the least-squares technique.
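For instance, here is a minimal least-squares sketch on invented data, using NumPy's polyfit to estimate the coefficients of a quadratic (the numbers are made up for illustration):

Python3
import numpy as np

# Invented data that follows a rough quadratic trend plus noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 4.8, 9.5, 17.2, 26.0])

# Least-squares fit of a degree-2 polynomial; returns [b2, b1, a]
coeffs = np.polyfit(x, y, deg=2)

# Fitted response values y at each x
y_hat = np.polyval(coeffs, x)
print(coeffs)
print(y_hat)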
Polynomial Regression in Python
The dataset used for this analysis is a small CSV file (data.csv) of temperature and pressure readings. Import the important libraries and the dataset we are using to perform Polynomial Regression.
Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.
- Pandas – This library loads the data into a DataFrame (a 2D tabular format) and has multiple functions to perform analysis tasks in one go.
- Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
- Matplotlib/Seaborn – These libraries are used to draw visualizations.
- Sklearn – This module contains multiple libraries with pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.
Python3
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
datas = pd.read_csv('data.csv')
datas
Output: the contents of the dataframe (temperature and pressure readings) are displayed.
Our feature variable X will contain the second column of the dataset (index 1), and the target variable y will contain the third column (index 2).
Python3
X = datas.iloc[:, 1:2].values
y = datas.iloc[:, 2].values
Now let’s fit a linear regression model on the data at hand.
Python3
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression

lin = LinearRegression()
lin.fit(X, y)
Now fit the Polynomial Regression model: expand X into polynomial features, then fit a linear model on the expanded features and y.
Python3
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures

# Expand X into the columns [1, x, x^2, x^3, x^4]
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

# Fit an ordinary linear model on the expanded features
lin2 = LinearRegression()
lin2.fit(X_poly, y)
In this step, we visualize the Linear Regression results using a scatter plot.
Python3
# Visualising the Linear Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin.predict(X), color='red')
plt.title('Linear Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
Output: a scatter plot of the data with the fitted straight line (title: 'Linear Regression', x-axis: Temperature, y-axis: Pressure).
Visualize the Polynomial Regression results using a scatter plot.
Python3
# Visualising the Polynomial Regression results
plt.scatter(X, y, color='blue')

# Use transform (not fit_transform) since poly is already fitted
plt.plot(X, lin2.predict(poly.transform(X)), color='red')
plt.title('Polynomial Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
Output: a scatter plot of the data with the fitted polynomial curve (title: 'Polynomial Regression', x-axis: Temperature, y-axis: Pressure).
Predict new results with both Linear and Polynomial Regression. Note that the input must be a NumPy 2D array.
Python3
# Predicting a new result with Linear Regression
# after converting the predict variable to a 2D array
pred = 110.0
predarray = np.array([[pred]])
lin.predict(predarray)
Output:
array([0.20675333])
Python3
# Predicting a new result with Polynomial Regression
# after converting the predict variable to a 2D array
pred2 = 110.0
pred2array = np.array([[pred2]])

# Use transform (not fit_transform) since poly is already fitted
lin2.predict(poly.transform(pred2array))
Output:
array([0.43295877])
Overfitting vs. Underfitting
One problem we face when working with polynomial regression is overfitting: as we increase the degree of the polynomial to achieve better and better training performance, the model overfits the data and fails to perform well on new data points.
For this reason, when using polynomial regression we often penalize the magnitude of the model's weights to regularize against overfitting. Regularization techniques such as Lasso regression and Ridge regression are used whenever the model may overfit the data at hand.
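A minimal sketch of this idea is shown below; the data is synthetic and the degree and alpha values are arbitrary choices for illustration, not tuned for the dataset above. PolynomialFeatures is chained with Ridge in a scikit-learn pipeline so that the higher-order weights are penalized:

Python3
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# Synthetic data standing in for X and y
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 30).reshape(-1, 1)
y_demo = 0.5 * X_demo.ravel() ** 2 + rng.normal(0, 2, 30)

# alpha controls the strength of the L2 penalty on the weights
model = make_pipeline(PolynomialFeatures(degree=8), Ridge(alpha=1.0))
model.fit(X_demo, y_demo)
print(model.predict(np.array([[5.0]])))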
Bias vs. Variance Tradeoff
The bias-variance tradeoff generalizes the reasoning used to avoid both overfitting and underfitting. In particular, it helps us select an appropriate degree for the polynomial we are fitting: when increasing the degree beyond a certain level causes the gap between the training and validation metrics to start widening, the model has begun to overfit and a lower degree should be preferred.
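One way to see this in practice, sketched below on synthetic data (all values are invented for illustration), is to compare the training and validation error over a range of degrees and pick the degree at which the validation error stops improving:

Python3
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic curvilinear data for demonstration only
rng = np.random.default_rng(42)
X_demo = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y_demo = np.sin(2 * X_demo.ravel()) + rng.normal(0, 0.2, 60)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# Fit one polynomial model per degree and report both errors
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr_err = mean_squared_error(y_tr, model.predict(X_tr))
    va_err = mean_squared_error(y_va, model.predict(X_va))
    print(f"degree={degree}  train MSE={tr_err:.4f}  validation MSE={va_err:.4f}")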
Advantages of using Polynomial Regression
- A broad range of functions can be fitted with it.
- Polynomials can fit a wide range of curvatures.
- Polynomials often provide a good approximation of the relationship between the dependent and independent variables.
Disadvantages of using Polynomial Regression
- Polynomial models are very sensitive to outliers.
- The presence of one or two outliers in the data can seriously affect the results of a nonlinear analysis.
- In addition, there are unfortunately fewer model-validation tools for detecting outliers in nonlinear regression than there are for linear regression.