Optimizers are algorithms or methods used to change or tune the attributes of a neural network, such as layer weights and the learning rate, in order to reduce the loss and in turn improve the model. In this article, I am going to talk about the Adam optimizer and its implementation in TensorFlow.
Before starting the discussion, let's briefly go over momentum and RMSprop.
Momentum Optimizer
The momentum optimizer is an extension of the standard gradient descent algorithm. Plain gradient descent tends to oscillate: it takes large steps back and forth in one direction while making only slow progress in the direction that actually matters, which slows the algorithm down. Momentum dampens these oscillations, which speeds up the convergence of our method. It also lets us choose a higher learning rate, because the steps taken in the oscillating (y) direction stay small.
The following is the formula for momentum:

v = beta * v + (1 - beta) * g
w = w - eta * v

where w is the weight, v is the moving average of past gradients (the velocity), beta is the momentum factor, g is the gradient value and eta is the learning rate.
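To make the update concrete, here is a minimal sketch of a single momentum step written with NumPy. It is only an illustration of the formula above, not the TensorFlow implementation; the helper name momentum_step and the toy quadratic loss are made up for this example.

Python3

import numpy as np

def momentum_step(w, v, grad, beta=0.9, eta=0.01):
    # Exponentially weighted average of past gradients (the "velocity")
    v = beta * v + (1 - beta) * grad
    # Step the weights in the direction of the smoothed gradient
    w = w - eta * v
    return w, v

# One step on a toy quadratic loss L(w) = w^2, whose gradient is 2w
w, v = np.array([5.0]), np.zeros(1)
w, v = momentum_step(w, v, grad=2 * w)
print(w, v)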
RMSprop Optimizer
The RMSprop optimizer is comparable to gradient descent with momentum: it also limits the oscillations that move vertically. As a result, we can increase the learning rate and our algorithm can take bigger horizontal strides and converge more quickly. Where RMSprop differs from gradient descent is in how the gradients are used in the update, as shown in the following formulae:

s = beta * s + (1 - beta) * g^2
w = w - eta * g / (sqrt(s) + epsilon)

where w is the weight, s is the moving average of the squared gradients, beta is the decay factor, g is the gradient value and eta is the learning rate. Epsilon is a very small value that we use to avoid division by zero.
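As with momentum, a rough single-step sketch in NumPy may help make the formula concrete. Again, this is illustrative only and not how tf.keras implements RMSprop.

Python3

import numpy as np

def rmsprop_step(w, s, grad, beta=0.9, eta=0.01, epsilon=1e-7):
    # Running average of the squared gradients
    s = beta * s + (1 - beta) * grad ** 2
    # Scale each weight's step by the size of its recent gradients
    w = w - eta * grad / (np.sqrt(s) + epsilon)
    return w, s

# One step on the same toy quadratic loss L(w) = w^2
w, s = np.array([5.0]), np.zeros(1)
w, s = rmsprop_step(w, s, grad=2 * w)
print(w, s)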
Now that we have an understanding of momentum and RMSprop optimization algorithms, let’s take a closer look at how the Adam algorithm works.
Adam Optimizer
Adam (Adaptive Moment Estimation) is an adaptive optimization algorithm created specifically for training deep neural networks. It can be viewed as a fusion of momentum-based stochastic gradient descent and RMSprop: it scales the learning rate using squared gradients, similar to RMSprop, and it leverages momentum by using the gradient’s moving average rather than the gradient itself, similar to SGD with momentum.
To estimate the moments, Adam uses exponential moving averages computed on the gradients evaluated on the current mini-batch. Mathematically, this can be written as:

m = beta_1 * m + (1 - beta_1) * g
v = beta_2 * v + (1 - beta_2) * g^2

where m and v are the moving averages of the gradient and the squared gradient, and g is the gradient value. beta_1 and beta_2 are hyper-parameters whose good default values, as suggested in the paper, are 0.9 and 0.999 respectively.
Because m and v are initialized to zero, they are biased toward zero during the first steps, so their expected values do not yet match the expected gradient and squared gradient. Adam corrects this bias by dividing each moving average by 1 minus its decay rate raised to the step count t:

m_hat = m / (1 - beta_1^t)
v_hat = v / (1 - beta_2^t)
Using these bias-corrected estimates, Adam updates the weights with a formula that is quite similar to the one we use in RMSprop:

w = w - eta * m_hat / (sqrt(v_hat) + epsilon)

where w is the weight, eta is the learning rate and epsilon is a very small value, usually 10^-8, which we use to avoid division by zero.
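Putting the pieces together, the following NumPy sketch performs one Adam update following the formulas above. It is meant to illustrate the math; in practice you would rely on tf.keras.optimizers.Adam instead.

Python3

import numpy as np

def adam_step(w, m, v, grad, t, eta=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    # Moving averages of the gradient and of the squared gradient
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * grad ** 2
    # Bias correction (t is the 1-based step count)
    m_hat = m / (1 - beta_1 ** t)
    v_hat = v / (1 - beta_2 ** t)
    # Parameter update, just like RMSprop but with the smoothed gradient m_hat
    w = w - eta * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v

# First step on the toy loss L(w) = w^2, with m and v initialized to zero
w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, m, v, grad=2 * w, t=1)
print(w)

Note that on the very first step the update size is roughly eta regardless of the gradient's scale, which is one practical consequence of the bias correction.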
Adam Optimizer in TensorFlow
You can pass the string value "adam" to the optimizer argument of the model.compile() function, like:
model.compile(optimizer="adam")
This passes an Adam optimizer object to the model with default values for the betas and the learning rate. If you want to configure these values yourself, you can use the Adam class provided in tf.keras.optimizers (a usage example follows the parameter list below).
It has the following syntax:
Adam(learning_rate, beta_1, beta_2, epsilon, amsgrad, name)
The following is the description of the parameters given above:
- learning_rate: The learning rate to use in the algorithm. It defaults to a value of 0.001.
- beta_1: The value for the exponential decay rate for the 1st-moment estimates. It has a default value of 0.9.
- beta_2: The value for the exponential decay rate for the 2nd-moment estimates. It has a default value of 0.999.
- epsilon: A small constant for numerical stability. It defaults to 1e-7.
- amsgrad: It is a boolean that specifies whether to apply the AMSGrad variant of this algorithm from the paper “On the Convergence of Adam and beyond”. It has a default value of False.
- name: An optional name for the operations created when applying gradients. It defaults to "Adam".
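If you want to override the defaults, you can construct the optimizer yourself and hand it to model.compile(). The snippet below is a minimal, self-contained illustration; the tiny demo_model and the learning rate of 0.0005 are arbitrary choices for this example, not recommendations.

Python3

import tensorflow as tf

# Any Keras model will do; this small one exists only to demonstrate compile()
inputs = tf.keras.Input(shape=(10,))
outputs = tf.keras.layers.Dense(2, activation='softmax')(inputs)
demo_model = tf.keras.Model(inputs, outputs)

# Adam with explicitly configured hyper-parameters
opt = tf.keras.optimizers.Adam(learning_rate=0.0005,  # smaller than the 0.001 default
                               beta_1=0.9,
                               beta_2=0.999,
                               epsilon=1e-7,
                               amsgrad=False)

demo_model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
print(demo_model.optimizer.get_config())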
Let us go through an example in Tensorflow to better understand the usage of Adam optimizer.
Python3
import tensorflow as tf
Now let’s create the model. For this purpose, I am using a very simple Neural Network with 2 Dense layers. The following piece of code defines the architecture of the model:
Python3
def createModel(input_shape):
    # Input layer with the given shape
    X_input = tf.keras.layers.Input(input_shape)
    # Hidden Dense layer with 10 units and ReLU activation
    X = tf.keras.layers.Dense(10, activation='relu')(X_input)
    # Output Dense layer with 2 units and softmax activation
    X_output = tf.keras.layers.Dense(2, activation='softmax')(X)
    model = tf.keras.Model(inputs=X_input, outputs=X_output)
    return model
We can simply create the model now.
Python3
model = createModel((10, 10))
The model summary is as follows:
Python3
print(model.summary())
Output:
Now, let’s print out the weights of the model before training.
Python3
print('Initial Layer Weights')
print()
for i in range(1, len(model.layers)):
    print('Weight for Layer ' + str(i) + ': ')
    print(model.layers[i].get_weights()[0])
    print()
Output:
Let’s get some dummy data to pass on to the model.
Python3
tf.random.set_seed(5)
X = tf.random.normal((2, 10, 10))
Y = tf.random.normal((2, 10, 2))
Now it’s time to compile the model. I have used the 'adam' optimizer, 'categorical_crossentropy' loss, and 'accuracy' metrics.
Python3
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
The optimizer has the following configurations.
Python3
print(model.optimizer.get_config())
Output:
{'name': 'Adam', 'learning_rate': 0.001, 'decay': 0.0, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False}
Now let’s fit the dataset to the model.
Python3
model.fit(X, Y)
Output:
1/1 [==============================] - 2s 2s/step - loss: -0.2437 - accuracy: 0.6500
<keras.callbacks.History at 0x7eff2b3868d0>
The model has now been trained. Let’s check the weights after training.
Python3
print('Final Layer Weights')
print()
for i in range(1, len(model.layers)):
    print('Weight for Layer ' + str(i) + ': ')
    print(model.layers[i].get_weights()[0])
    print()
Output: