In this article, let’s learn about multiple linear regression using scikit-learn in the Python programming language.
Regression is a statistical method for determining the relationship between features and an outcome variable. In machine learning, it is used for predictive modeling, where an algorithm is employed to forecast continuous outcomes. Multiple linear regression, often known simply as multiple regression, is a statistical method that predicts the result of a response variable by combining several explanatory variables. It extends simple linear regression (ordinary least squares), in which just one explanatory variable is used.
Mathematical Intuition:
To improve prediction, more independent variables are combined. The linear relationship between the dependent variable and the independent variables is:
y = b0 + b1x1 + b2x2 + … + bnxn
Here,
- y is the dependent variable.
- x1, x2, x3, … are independent variables.
- b0 is the intercept of the line.
- b1, b2, … are coefficients.
A simple linear regression line has the form:
y = mx + c
For a concrete example, take a small advertising dataset:
feature 1: TV
feature 2: radio
feature 3: Newspaper
output variable: sales
The independent variables are feature 1, feature 2, and feature 3, and the dependent variable is sales. The equation for this problem is:
y = b0 + b1x1 + b2x2 + b3x3
where x1, x2, and x3 are the feature variables.
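To make the arithmetic concrete, here is a minimal sketch that evaluates this equation directly. All numbers below are made up purely for illustration:
Python3
# toy evaluation of y = b0 + b1x1 + b2x2 + b3x3
# (intercept, coefficients and feature values are made-up numbers)
b0 = 2.5                          # intercept
b1, b2, b3 = 0.04, 0.19, 0.002    # coefficients
x1, x2, x3 = 230.1, 37.8, 69.2    # TV, radio, newspaper spend

y = b0 + b1 * x1 + b2 * x2 + b3 * x3
print(y)  # predicted sales for these inputs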
In this example, we use scikit-learn to perform linear regression. Since we have multiple feature variables and a single outcome variable, this is multiple linear regression. Let's see how to do this step by step.
Stepwise Implementation
Step 1: Import the necessary packages
The necessary packages, such as pandas, NumPy, and scikit-learn, are imported.
Python3
# importing modules and packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import preprocessing
Step 2: Import the CSV file:
The CSV file is imported using the pd.read_csv() method. The 'No' column is dropped, since an index is already present. The df.head() method retrieves the first five rows of the dataframe, and the df.columns attribute returns the column names. The columns whose names start with 'X' are the independent features of the dataset; the column 'Y house price of unit area' is the dependent variable. As there is more than one independent (explanatory) variable, this is multiple linear regression.
Python3
# importing data
df = pd.read_csv('Real estate.csv')
df.drop('No', inplace=True, axis=1)

print(df.head())
print(df.columns)
Output:
X1 transaction date X2 house age … X6 longitude Y house price of unit area
0 2012.917 32.0 … 121.54024 37.9
1 2012.917 19.5 … 121.53951 42.2
2 2013.583 13.3 … 121.54391 47.3
3 2013.500 13.3 … 121.54391 54.8
4 2012.833 5.0 … 121.54245 43.1
[5 rows x 7 columns]
Index(['X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')
Step 3: Create a scatterplot to visualize the data:
A scatterplot is created to visualize the relationship between the independent variable 'X4 number of convenience stores' and the dependent variable 'Y house price of unit area'.
Python3
# plotting a scatterplot
sns.scatterplot(x='X4 number of convenience stores',
                y='Y house price of unit area', data=df)
Output:

(scatterplot of 'X4 number of convenience stores' against 'Y house price of unit area')
Step 4: Create feature variables:
To model the data, we create feature variables: X holds the independent variables and y holds the dependent variable. X and y are printed to inspect the data.
Python3
# creating feature variables
X = df.drop('Y house price of unit area', axis=1)
y = df['Y house price of unit area']

print(X)
print(y)
Output:
X1 transaction date X2 house age … X5 latitude X6 longitude
0 2012.917 32.0 … 24.98298 121.54024
1 2012.917 19.5 … 24.98034 121.53951
2 2013.583 13.3 … 24.98746 121.54391
3 2013.500 13.3 … 24.98746 121.54391
4 2012.833 5.0 … 24.97937 121.54245
.. … … … … …
409 2013.000 13.7 … 24.94155 121.50381
410 2012.667 5.6 … 24.97433 121.54310
411 2013.250 18.8 … 24.97923 121.53986
412 2013.000 8.1 … 24.96674 121.54067
413 2013.500 6.5 … 24.97433 121.54310
[414 rows x 6 columns]
0 37.9
1 42.2
2 47.3
3 54.8
4 43.1
…
409 15.4
410 50.0
411 40.6
412 52.5
413 63.9
Name: Y house price of unit area, Length: 414, dtype: float64
Step 5: Split data into train and test sets:
Here, the train_test_split() method is used to create the train and test sets; the feature variables are passed to the method. The test size is set to 0.3, which means 30% of the data goes into the test set and the train set contains the remaining 70%. A random state is set so the split is reproducible.
Python3
# creating train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)
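As a quick sanity check (not one of the original steps), the shapes of the resulting sets can be printed; with 414 rows and a 0.3 test size, roughly 289 rows should land in the train set and 125 in the test set:
Python3
# verifying the 70/30 split
print(X_train.shape, X_test.shape)  # approximately (289, 6) and (125, 6)
print(y_train.shape, y_test.shape)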
Step 6: Create a linear regression model
A linear regression model is created with the LinearRegression() class, which is imported from the sklearn.linear_model module.
Python3
# creating a regression model
model = LinearRegression()
Step 7: Fit the model with training data.
After creating the model, we fit it to the training data; this is where the model learns the relationship between the features and the target. The fit() method is used to fit the data.
Python3
# fitting the model
model.fit(X_train, y_train)
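Once fitted, the learned parameters correspond to the intercept b0 and coefficients b1, b2, … from the equation above, and can be inspected directly; the exact values depend on the data and the split:
Python3
# inspecting the learned intercept (b0) and coefficients (b1, b2, ...)
print('intercept:', model.intercept_)
for name, coef in zip(X.columns, model.coef_):
    print(name, ':', coef)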
Step 8: Make predictions on the test data set.
Here, the model.predict() method is used to make predictions on the X_test data. The test data is unseen, so the model has no knowledge of its statistics.
Python3
# making predictions
predictions = model.predict(X_test)
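A quick way to eyeball the result (a small sketch, not one of the original steps) is to place the predictions next to the actual prices:
Python3
# comparing predicted and actual prices side by side
results = pd.DataFrame({'actual': y_test, 'predicted': predictions})
print(results.head())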
Step 9: Evaluate the model with metrics.
The multiple linear regression model is evaluated with the mean_squared_error and mean_absolute_error metrics; comparing these errors with the mean of the target variable shows how well the model predicts. The lower the error, the better the model performs.
- Mean absolute error (MAE): the mean of the absolute values of the residuals, MAE = (1/n) Σ |y − ŷ|.
- Mean squared error (MSE): the mean of the squared residuals, MSE = (1/n) Σ (y − ŷ)².
Here, y is the actual value and ŷ (y hat) is the predicted value, with the sum taken over all n test samples.
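These formulas can be verified by hand with NumPy; the sketch below uses the same predictions and should match the sklearn metrics computed next:
Python3
# computing MAE and MSE manually
# (should agree with mean_absolute_error / mean_squared_error below)
residuals = y_test - predictions
print('manual MAE :', np.mean(np.abs(residuals)))
print('manual MSE :', np.mean(residuals ** 2))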
Python3
# model evaluation
print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))
Output:
mean_squared_error :  46.21179783493418
mean_absolute_error :  5.392293684756571
If you want a metric that is robust to outliers, MAE is the preferable choice; if you want outliers to be penalized heavily in your loss function, MSE/RMSE is the way to go. RMSE is always greater than or equal to MAE, and the two are equal only when all the errors have the same magnitude.
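A tiny illustration of that sensitivity, using made-up residuals: adding a single large error barely moves MAE but inflates MSE sharply:
Python3
# made-up residuals illustrating outlier sensitivity
errors = np.array([1.0, -2.0, 1.5, -1.0])
with_outlier = np.append(errors, 20.0)  # one large error added

for e in (errors, with_outlier):
    print('MAE:', np.mean(np.abs(e)), ' MSE:', np.mean(e ** 2))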
Code:
Here is the full code, combining all the above steps.
Python3
# importing modules and packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import preprocessing

# importing data
df = pd.read_csv('Real estate.csv')
df.drop('No', inplace=True, axis=1)

print(df.head())
print(df.columns)

# plotting a scatterplot
sns.scatterplot(x='X4 number of convenience stores',
                y='Y house price of unit area', data=df)

# creating feature variables
X = df.drop('Y house price of unit area', axis=1)
y = df['Y house price of unit area']

print(X)
print(y)

# creating train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# creating a regression model
model = LinearRegression()

# fitting the model
model.fit(X_train, y_train)

# making predictions
predictions = model.predict(X_test)

# model evaluation
print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))
Output:
X1 transaction date X2 house age … X6 longitude Y house price of unit area
0 2012.917 32.0 … 121.54024 37.9
1 2012.917 19.5 … 121.53951 42.2
2 2013.583 13.3 … 121.54391 47.3
3 2013.500 13.3 … 121.54391 54.8
4 2012.833 5.0 … 121.54245 43.1
[5 rows x 7 columns]
Index(['X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')
X1 transaction date X2 house age … X5 latitude X6 longitude
0 2012.917 32.0 … 24.98298 121.54024
1 2012.917 19.5 … 24.98034 121.53951
2 2013.583 13.3 … 24.98746 121.54391
3 2013.500 13.3 … 24.98746 121.54391
4 2012.833 5.0 … 24.97937 121.54245
.. … … … … …
409 2013.000 13.7 … 24.94155 121.50381
410 2012.667 5.6 … 24.97433 121.54310
411 2013.250 18.8 … 24.97923 121.53986
412 2013.000 8.1 … 24.96674 121.54067
413 2013.500 6.5 … 24.97433 121.54310
[414 rows x 6 columns]
0 37.9
1 42.2
2 47.3
3 54.8
4 43.1
…
409 15.4
410 50.0
411 40.6
412 52.5
413 63.9
Name: Y house price of unit area, Length: 414, dtype: float64
mean_squared_error : 46.21179783493418
mean_absolute_error : 5.392293684756571