In this article, we will learn how to build a fuel efficiency prediction model using the TensorFlow API. The dataset we will be using contains features like the distance the engine has traveled, the number of cylinders in the car, and other relevant features.
Importing Libraries
- Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
- Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
- Matplotlib – This library is used to draw visualizations.
- Sklearn – This module contains multiple libraries having pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.
- Seaborn – This library builds on Matplotlib and provides high-level functions for drawing statistical visualizations such as heatmaps.
- Tensorflow – This is an open-source library that is used for Machine Learning and Artificial intelligence and provides a range of functions to achieve complex functionalities with single lines of code.
Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

import tensorflow as tf
from tensorflow import keras
from keras import layers

import warnings
warnings.filterwarnings('ignore')
The dataset can be downloaded from here.
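If the CSV file is not available locally, a very similar copy of this data ships with seaborn. This alternative is an assumption about your setup (it needs internet access on first use), and the seaborn copy uses slightly different column names (for example model_year and name instead of model year and car name):
Python3
# Alternative source (assumption: internet access on first call).
# Seaborn bundles the UCI Auto MPG data, but column names differ
# slightly and 'horsepower' arrives already parsed as numeric.
df_alt = sb.load_dataset('mpg')
df_alt.head()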
Python3
df = pd.read_csv('auto-mpg.csv')
df.head()
Output:
Let’s check the shape of the data.
Python3
df.shape
Output:
(398, 9)
Now, check the datatypes of the columns.
Python3
df.info()
Output:
Here we can observe one discrepancy: the horsepower column is stored as the object datatype, whereas it should be numeric.
Python3
df.describe()
Output:
Exploratory Data Analysis
As noted in the df.info() output above, we will first deal with the horsepower column and then move on to the analysis.
Python3
df['horsepower'].unique()
Output:
Here we can observe that the missing values have been replaced by the string '?'; because of this, the column was read in as the object datatype.
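Before dropping anything, it is worth knowing the alternative: the '?' entries can be coerced to NaN and imputed instead. This is only a sketch of that option; the article itself simply removes the affected rows in the next snippet.
Python3
# Alternative (not used below): coerce non-numeric entries like '?'
# to NaN, then impute them with the column median.
hp = pd.to_numeric(df['horsepower'], errors='coerce')
print(hp.isna().sum())            # the '?' entries show up as NaN
# df['horsepower'] = hp.fillna(hp.median())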
Python3
print(df.shape)
df = df[df['horsepower'] != '?']
print(df.shape)
Output:
(398, 9)
(392, 9)
So, there were 6 such rows with a question mark.
Python3
df['horsepower'] = df['horsepower'].astype(int)
df.isnull().sum()
Output:
mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
origin          0
car name        0
dtype: int64
Python3
df.nunique()
Output:
mpg             127
cylinders         5
displacement     81
horsepower       93
weight          346
acceleration     95
model year       13
origin            3
car name        301
dtype: int64
Python3
plt.subplots(figsize=(15, 5))
for i, col in enumerate(['cylinders', 'origin']):
    plt.subplot(1, 2, i + 1)
    # Average mpg within each category; selecting 'mpg' before .mean()
    # avoids errors from non-numeric columns like 'car name'.
    x = df.groupby(col)['mpg'].mean()
    x.plot.bar()
    plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Output:
Here we can observe that the mpg values are highest for origin 3 (which corresponds to Japan in this dataset).
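To read the exact values behind the bars, the grouped means can also be printed directly (a quick check, not part of the original code):
Python3
# Average mpg per origin; origin 3 has the highest mean.
print(df.groupby('origin')['mpg'].mean())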
Python3
plt.figure(figsize=(8, 8))
# numeric_only=True skips the string 'car name' column, which would
# otherwise raise an error in recent versions of pandas.
sb.heatmap(df.corr(numeric_only=True) > 0.9, annot=True, cbar=False)
plt.show()
Output:
Removing the displacement feature will eliminate this problem of high collinearity.
Python3
df.drop('displacement', axis=1, inplace=True)
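As a sanity check, we can scan for any remaining highly correlated pair programmatically. This is a minimal sketch assuming a recent pandas version (for the numeric_only flag); after dropping displacement it should print nothing.
Python3
# Report any remaining feature pair with |correlation| > 0.9.
corr = df.corr(numeric_only=True)
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > 0.9:
            print(a, b, round(corr.loc[a, b], 3))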
Data Input Pipeline
Python3
from sklearn.model_selection import train_test_split

features = df.drop(['mpg', 'car name'], axis=1)
target = df['mpg'].values

X_train, X_val, Y_train, Y_val = train_test_split(features, target,
                                                  test_size=0.2,
                                                  random_state=22)
X_train.shape, X_val.shape
Output:
((313, 6), (79, 6))
Python3
AUTO = tf.data.experimental.AUTOTUNE

train_ds = (
    tf.data.Dataset
    .from_tensor_slices((X_train, Y_train))
    .batch(32)
    .prefetch(AUTO)
)

val_ds = (
    tf.data.Dataset
    .from_tensor_slices((X_val, Y_val))
    .batch(32)
    .prefetch(AUTO)
)
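It can be reassuring to pull one batch from the pipeline and confirm the shapes before building the model (a quick sanity check, not part of the original article):
Python3
# Take a single batch and verify that features are (batch, 6)
# and targets are (batch,).
for x_batch, y_batch in train_ds.take(1):
    print(x_batch.shape, y_batch.shape)   # e.g. (32, 6) (32,)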
Model Architecture
We will implement a model using the Sequential API of Keras which will contain the following parts:
- We will have two fully connected layers.
- We have included BatchNormalization layers for stable and fast training, and a Dropout layer before the final layer to reduce the chance of overfitting.
- The final layer is the output layer.
Python3
model = keras.Sequential([
    layers.Dense(256, activation='relu', input_shape=[6]),
    layers.BatchNormalization(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1, activation='relu')
])
While compiling a model we provide these three essential parameters:
- optimizer – This is the method that helps to optimize the cost function by using gradient descent.
- loss – The loss function by which we monitor whether the model is improving with training or not.
- metrics – Additional measures used to evaluate the model's predictions on the training and validation data.
Python3
model.compile(
    loss='mae',
    optimizer='adam',
    metrics=['mape']
)
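For intuition about these choices, MAE and MAPE are simple enough to compute by hand; the numbers below are purely illustrative:
Python3
# Hand computation of the compiled loss (MAE) and metric (MAPE)
# on dummy values, to make the definitions concrete.
y_true = np.array([15.0, 30.0, 22.0])
y_pred = np.array([14.0, 33.0, 20.0])

mae = np.mean(np.abs(y_true - y_pred))                   # 2.0
mape = 100 * np.mean(np.abs(y_true - y_pred) / y_true)   # ~8.59
print(mae, mape)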
Let’s print the summary of the model’s architecture:
Python3
model.summary()
Output:
Model Training
Now we will train our model using the training and validation pipeline.
Python3
history = model.fit(train_ds, epochs=50, validation_data=val_ds)
Output:
Epoch 45/50
10/10 [==============================] - 0s 14ms/step - loss: 2.8792 - mape: 12.5425 - val_loss: 5.3991 - val_mape: 28.6586
Epoch 46/50
10/10 [==============================] - 0s 8ms/step - loss: 2.9184 - mape: 12.7887 - val_loss: 4.1896 - val_mape: 21.4064
Epoch 47/50
10/10 [==============================] - 0s 9ms/step - loss: 2.8153 - mape: 12.3451 - val_loss: 4.3392 - val_mape: 22.3319
Epoch 48/50
10/10 [==============================] - 0s 9ms/step - loss: 2.7146 - mape: 11.7684 - val_loss: 3.6178 - val_mape: 17.7676
Epoch 49/50
10/10 [==============================] - 0s 10ms/step - loss: 2.7631 - mape: 12.1744 - val_loss: 6.4673 - val_mape: 33.2410
Epoch 50/50
10/10 [==============================] - 0s 10ms/step - loss: 2.6819 - mape: 11.8024 - val_loss: 6.0304 - val_mape: 31.6198
Python3
history_df = pd.DataFrame(history.history)
history_df.head()
Output:
Python3
history_df.loc[:, ['loss', 'val_loss']].plot()
history_df.loc[:, ['mape', 'val_mape']].plot()
plt.show()
Output:
The training error has gone down smoothly, but the validation error fluctuates from epoch to epoch, suggesting the model's performance on unseen data is less stable.
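One common way to tame a fluctuating validation loss, sketched here as a possible refinement rather than part of the run above, is to add an EarlyStopping callback that halts training once val_loss stops improving and restores the best weights:
Python3
# Possible refinement (not part of the training run shown above):
# stop once val_loss has not improved for 10 epochs and roll back
# to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)
# history = model.fit(train_ds, epochs=50,
#                     validation_data=val_ds,
#                     callbacks=[early_stop])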