Optimizers are techniques or algorithms that reduce the loss (error) by tuning a model's weights and other parameters, thereby minimizing the loss function and helping the model reach better accuracy faster.
Optimizers in TensorFlow
Optimizer is the base class in TensorFlow; it is initialized with the model's parameters, but no tensor is passed to it. The basic optimizer provided by TensorFlow is:
tf.train.Optimizer - TensorFlow version 1.x
tf.compat.v1.train.Optimizer - TensorFlow version 2.x
This class is never used directly; instead, one of its subclasses is instantiated.
Gradient Descent Algorithm
Before covering the individual optimizers, let's first look at the algorithm on which the others are built, i.e. gradient descent. Gradient descent links the weights to the loss function: since a gradient is a measure of change, the algorithm uses partial derivatives to determine how each weight should change to reduce the loss (for example, add 0.7 or subtract 0.27). An obstacle arises on large, multi-dimensional datasets, where it can get stuck in a local minimum instead of the global minimum.
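As a minimal, framework-free sketch (the toy loss and numbers below are illustrative only), one step of gradient descent subtracts the learning rate times the partial derivative from the weight:

# One-variable gradient descent on the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
w = 5.0            # initial weight
lr = 0.1           # learning rate
for _ in range(50):
    grad = 2 * (w - 3.0)   # partial derivative dL/dw
    w -= lr * grad         # step against the gradient to reduce the loss
print(round(w, 4))         # approaches the minimum at w = 3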
Syntax:

tf.compat.v1.train.GradientDescentOptimizer(learning_rate, use_locking, name='GradientDescent')

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float.
- use_locking: use locks for the update operations if True.
- name: optional name for the operation.
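A minimal usage sketch, assuming TensorFlow 2.x with eager execution disabled through the v1 compatibility API (the toy loss is illustrative):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

w = tf.compat.v1.Variable(5.0, name="w")
loss = tf.square(w - 2.0)                       # toy loss with minimum at w = 2

opt = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    for _ in range(50):
        sess.run(train_op)
    print(sess.run(w))                          # approaches 2.0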
TensorFlow Keras Optimizer Classes
TensorFlow's Keras API predominantly supports the following 9 optimizer classes, built on top of the base class (Optimizer); a brief usage sketch follows the list.
- Gradient Descent
- SGD
- AdaGrad
- RMSprop
- Adadelta
- Adam
- AdaMax
- NAdam
- FTRL
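Each of these classes lives under tf.keras.optimizers and is most commonly passed to model.compile(), either as a configured instance or by its string name. A minimal sketch (the model architecture and loss are assumptions for illustration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Instance form: hyperparameters are set explicitly.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

# String form: the optimizer's default hyperparameters are used.
# model.compile(optimizer="rmsprop", loss="mse")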
SGD Optimizer (Stochastic Gradient Descent)
Stochastic Gradient Descent (SGD) performs a parameter update for every single training example, which avoids the redundant computation that full-batch gradient descent does on huge datasets. The trade-off is that these frequent updates have high variance, which causes the objective function to fluctuate heavily.
Syntax:

tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', **kwargs)

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float. Default value is 0.01.
- momentum: accelerates gradient descent in the appropriate direction. Float. Default value is 0.0.
- nesterov: whether or not to apply Nesterov momentum. Boolean. Default value is False.
- name: optional name for the operation.
- **kwargs: keyword arguments.
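A minimal sketch of a single manual SGD step on one variable, which makes the update rule visible (the toy loss is an assumption for illustration):

import tensorflow as tf

sgd = tf.keras.optimizers.SGD(learning_rate=0.1)   # momentum=0.9, nesterov=True would add Nesterov momentum
w = tf.Variable(5.0)

with tf.GradientTape() as tape:
    loss = tf.square(w - 3.0)          # toy loss; dL/dw = 2 * (w - 3) = 4 at w = 5
grads = tape.gradient(loss, [w])
sgd.apply_gradients(zip(grads, [w]))   # w <- 5 - 0.1 * 4 = 4.6
print(w.numpy())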
Advantages:
- Requires less memory.
- Updates the model parameters frequently.
- If momentum is used, it helps to reduce noise.
Disadvantages:
- High Variance
- Computationally Expensive
AdaGrad Optimizer
AdaGrad stands for Adaptive Gradient Algorithm. The AdaGrad optimizer adapts the learning rate to individual features, i.e. some weights in the dataset may effectively have different learning rates than others.
Syntax:

tf.keras.optimizers.Adagrad(learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-07, name="Adagrad", **kwargs)

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float. Default value is 0.001.
- initial_accumulator_value: starting value for the per-parameter accumulators. Floating point, must be non-negative. Default value is 0.1.
- epsilon: small value used to maintain numerical stability. Floating point. Default value is 1e-07.
- name: optional name for the operation.
- **kwargs: keyword arguments.
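A minimal sketch of the per-parameter behaviour: two variables receive gradients of very different magnitudes, and each one's step is scaled by its own accumulated squared gradients (the toy loss is an assumption for illustration):

import tensorflow as tf

opt = tf.keras.optimizers.Adagrad(learning_rate=0.1)
a = tf.Variable(1.0)   # receives large gradients
b = tf.Variable(1.0)   # receives small gradients

for _ in range(3):
    with tf.GradientTape() as tape:
        loss = 10.0 * a + 0.1 * b      # d(loss)/da = 10, d(loss)/db = 0.1
    grads = tape.gradient(loss, [a, b])
    opt.apply_gradients(zip(grads, [a, b]))

print(a.numpy(), b.numpy())            # each variable was scaled by its own gradient history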
Advantages:
- Best suited for sparse datasets.
- The learning rate updates over the iterations.
Disadvantages:
- The learning rate keeps shrinking as squared gradients accumulate, which can stall training in deep neural networks.
- May result in dead neuron problem
RMSprop Optimizer
RMSprop stands for Root Mean Square Propagation. Instead of letting gradients accumulate indefinitely, RMSprop accumulates them only over a fixed, recent window. It can be considered an updated version of AdaGrad with a few improvements. RMSprop uses plain momentum rather than Nesterov momentum.
Syntax:

tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False, name='RMSprop', **kwargs)

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float. Default value is 0.001.
- rho: discounting factor for the gradient history. Default value is 0.9.
- momentum: accelerates RMSprop in the appropriate direction. Float. Default value is 0.0.
- epsilon: small value used to maintain numerical stability. Floating point. Default value is 1e-07.
- centered: if True, gradients are normalized by their estimated variance; this may help training but is computationally more expensive. Boolean. Default value is False.
- name: optional name for the operation.
- **kwargs: keyword arguments.
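A minimal sketch of configuring RMSprop and handing it to model.compile() (the model and loss are assumptions for illustration; centered=True would normalize gradients at extra cost):

import tensorflow as tf

opt = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.9, centered=False)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])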
Advantages:
- The learning rate is automatically adjusted.
- A distinct learning rate for every parameter
Disadvantage: Slow learning
Adadelta Optimizer
Adaptive Delta (Adadelta) is an extension of AdaGrad (similar to the RMSprop optimizer); however, Adadelta replaces the fixed learning rate with an exponential moving average of squared deltas (the differences between the current and updated weights). It also tries to eliminate the decaying-learning-rate problem.
Syntax:

tf.keras.optimizers.Adadelta(learning_rate=0.001, rho=0.95, epsilon=1e-07, name='Adadelta', **kwargs)

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float. Default value is 0.001.
- rho: decay rate. Tensor or floating point. Default value is 0.95.
- epsilon: small value used to maintain numerical stability. Floating point. Default value is 1e-07.
- name: optional name for the operation.
- **kwargs: keyword arguments.
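A minimal sketch relying on Adadelta's defaults, which is where it differs from optimizers that need a carefully chosen learning rate (the toy loss is an assumption for illustration):

import tensorflow as tf

opt = tf.keras.optimizers.Adadelta()       # defaults: learning_rate=0.001, rho=0.95, epsilon=1e-07

w = tf.Variable([1.0, -2.0])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.square(w))     # toy loss with minimum at w = [0, 0]
grads = tape.gradient(loss, [w])
opt.apply_gradients(zip(grads, [w]))       # step sizes come from the running average of squared deltas
print(w.numpy())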
Advantage: Setting of default learning rate is not required.
Disadvantage: Computationally expensive
Adam Optimizer
Adaptive Moment Estimation (Adam) is among the most widely used optimization techniques today. It computes an adaptive learning rate for each parameter and combines the advantages of both RMSprop and momentum, i.e. it stores a decaying average of past gradients and of past squared gradients.
Syntax:

tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam', **kwargs)

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float. Default value is 0.001.
- beta_1: exponential decay rate for the 1st moment estimates. Constant float tensor or float. Default value is 0.9.
- beta_2: exponential decay rate for the 2nd moment estimates. Constant float tensor or float. Default value is 0.999.
- epsilon: small value used to maintain numerical stability. Floating point. Default value is 1e-07.
- amsgrad: whether or not to use the AMSGrad variant. Default value is False.
- name: optional name for the operation.
- **kwargs: keyword arguments.
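A minimal sketch of Adam with the defaults listed above, wired into a small binary classifier (the architecture and loss are assumptions for illustration):

import tensorflow as tf

opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                               epsilon=1e-07, amsgrad=False)   # amsgrad=True switches to the AMSGrad variant

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])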
Advantages:
- Easy Implementation
- Requires less memory
- Computationally efficient
Disadvantages:
- Can have weight decay problem
- Sometimes may not converge to an optimal solution
AdaMax Optimizer
AdaMax is a variant of the Adam optimizer built on an adaptive approximation of low-order moments, based on the infinity norm. For models with embeddings, AdaMax is sometimes considered better than Adam.
Syntax:

tf.keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Adamax', **kwargs)

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float. Default value is 0.001.
- beta_1: exponential decay rate for the 1st moment estimates. Constant float tensor or float. Default value is 0.9.
- beta_2: exponential decay rate for the exponentially weighted infinity norm. Constant float tensor or float. Default value is 0.999.
- epsilon: small value used to maintain numerical stability. Floating point. Default value is 1e-07.
- name: optional name for the operation.
- **kwargs: keyword arguments.
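A minimal sketch pairing Adamax with an embedding-based text classifier, the kind of model where it is sometimes preferred over Adam (the vocabulary size, sequence length, and architecture are assumptions for illustration):

import tensorflow as tf

opt = tf.keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,), dtype="int32"),                # sequences of 100 token ids
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),  # 10,000-word vocabulary
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=opt, loss="binary_crossentropy")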
Advantages:
- The infinity norm makes the algorithm stable.
- Requires less hyperparameter tuning.
Disadvantage: Generalization Issue
NAdam Optimizer
NAdam is short for Nesterov and Adam optimizer. NAdam uses Nesterov momentum for its gradient updates, rather than the vanilla momentum used by Adam.
Syntax:

tf.keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Nadam', **kwargs)

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float. Default value is 0.001.
- beta_1: exponential decay rate for the 1st moment estimates. Constant float tensor or float. Default value is 0.9.
- beta_2: exponential decay rate for the 2nd moment estimates. Constant float tensor or float. Default value is 0.999.
- epsilon: small value used to maintain numerical stability. Floating point. Default value is 1e-07.
- name: optional name for the operation.
- **kwargs: keyword arguments.
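A minimal sketch showing that Nadam is a drop-in replacement for Adam, with the same constructor arguments (the toy loss is an assumption for illustration):

import tensorflow as tf

opt = tf.keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

w = tf.Variable(2.0)
with tf.GradientTape() as tape:
    loss = tf.square(w - 5.0)              # toy loss with minimum at w = 5
grads = tape.gradient(loss, [w])
opt.apply_gradients(zip(grads, [w]))       # the update uses Nesterov momentum internally
print(w.numpy())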
Advantages:
- Gives better results for gradients with high curvature or noisy gradients.
- Learns faster
Disadvantage: Sometimes may not converge to an optimal solution
FTRL Optimizer
Follow The Regularized Leader (FTRL) is an optimization algorithm best suited for shallow models with large, sparse feature spaces. It supports both shrinkage-type L2 regularization (an L2 penalty added to the loss function) and online L2 regularization.
Syntax:

tf.keras.optimizers.Ftrl(learning_rate=0.001, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, name='Ftrl', l2_shrinkage_regularization_strength=0.0, beta=0.0, **kwargs)

Parameters:

- learning_rate: rate at which the algorithm updates the parameters. Tensor or float. Default value is 0.001.
- learning_rate_power: controls how the learning rate drops during training. Float, must be less than or equal to 0. Default value is -0.5.
- initial_accumulator_value: initial value for the accumulators. Must be greater than or equal to 0. Default value is 0.1.
- l1_regularization_strength: L1 regularization penalty. Float, only values greater than or equal to 0 are allowed. Default value is 0.0.
- l2_regularization_strength: stabilization (online L2) penalty. Float, only values greater than or equal to 0 are allowed. Default value is 0.0.
- name: optional name for the operation.
- l2_shrinkage_regularization_strength: magnitude (shrinkage-type L2) penalty. Float, only values greater than or equal to 0 are allowed. Default value is 0.0.
- beta: beta value from the FTRL paper. Float. Default value is 0.0.
- **kwargs: keyword arguments.
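A minimal sketch of FTRL on a wide, sparse linear model, the setting it is best suited for (the feature dimension and regularization strengths are illustrative assumptions, not recommendations):

import tensorflow as tf

opt = tf.keras.optimizers.Ftrl(
    learning_rate=0.001,
    learning_rate_power=-0.5,
    l1_regularization_strength=0.01,
    l2_regularization_strength=0.01,
)

# A single dense unit over a large, sparse feature vector approximates a regularized linear model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10000,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=opt, loss="binary_crossentropy")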
Advantage: Can minimize the loss function more effectively.
Disadvantages:
- Cannot achieve adequate stability if the regularization strength is too small.
- If the regularization strength is too large, the solution ends up far from the optimum.