When building a neural network, we initialize its layers with some initial weights, which are then optimized as training progresses. The way these weights are initialized affects how quickly the model converges to a good solution and whether it runs into vanishing or exploding gradients. In this article, we look at how to initialize weights effectively using the PyTorch machine learning framework.
Why initialize weights?
Initializing the weights of a neural network is a vital step in the training process, as the choice of initialization has a direct impact on how quickly the network converges and how well it ultimately performs. If all of the weights are initialized to the same value, every neuron in a layer receives identical gradient updates and learns identical features, so the model converges to the same suboptimal solution regardless of the optimization algorithm used.
Weights initialized to large values can lead to exploding gradients or, with saturating activations such as sigmoid or tanh, to vanishing gradients; in either case the model converges slowly or not at all. Weights initialized to small random values tend to train more efficiently, because the signal passing through the network neither dies out nor blows up and the optimizer can make useful updates right from the start. Different initialization methods therefore suit different problems and model architectures, as the small experiment below illustrates.
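To see the effect of scale, the following sketch pushes a batch of random inputs through a stack of linear-plus-tanh layers and reports the standard deviation of the final activations. The depth, width, and standard deviations used here are arbitrary choices for illustration, not recommended settings.

Python3

import torch

# Measure how the activations behave after a stack of
# linear + tanh layers whose weights are drawn from a
# normal distribution with the given standard deviation
def activation_std_after_stack(weight_std, depth=10, width=256):
    torch.manual_seed(0)
    x = torch.randn(64, width)
    for _ in range(depth):
        layer = torch.nn.Linear(width, width)
        torch.nn.init.normal_(layer.weight, mean=0.0, std=weight_std)
        torch.nn.init.zeros_(layer.bias)
        x = torch.tanh(layer(x))
    return x.std().item()

# A tiny std makes the signal shrink towards zero (vanishing),
# a large std saturates tanh (gradients vanish at the extremes),
# while a moderate std keeps the activations in a healthy range
for std in (0.001, 0.1, 1.0):
    print(std, activation_std_after_stack(std))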
Using the nn.init Module for Weight Initialization
The PyTorch nn.init module is the conventional way to initialize weights in a neural network. It provides a variety of weight initialization methods, such as:
- Uniform initialization
- Xavier initialization
- Kaiming initialization
- Zeros initialization
- Ones initialization
- Normal initialization
An example implementation of each method is provided below.
Uniform Initialization
Initializing the weights from a uniform distribution keeps them within a fixed, finite range and spreads them evenly across it, which helps avoid the ‘vanishing gradient’ problem. If the range is too wide, however, the large weights can still cause ‘exploding gradients’. In PyTorch, torch.nn.init.uniform_ samples from [0, 1) by default; custom bounds can be supplied through its a and b arguments, as in the example below.
Python3
import torch

# Initializing a linear layer with 2
# independent features and 3 dependent features
linear_layer = torch.nn.Linear(2, 3)

# Initializing the weights with a uniform distribution;
# uniform_ samples from [0, 1) by default, so explicit
# bounds are passed here
torch.nn.init.uniform_(linear_layer.weight, a=-0.5, b=0.5)

# Displaying the initialized weights
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[-0.1768, -0.4942],
        [ 0.0756, -0.0967],
        [-0.3923,  0.3283]], requires_grad=True)
Xavier Initialization
Xavier (also called Glorot) initialization helps keep gradients from vanishing or exploding, as it scales the weights so that the variance of each layer’s outputs matches the variance of its inputs. In PyTorch, xavier_uniform_ draws the weights from U(-a, a) with a = gain × sqrt(6 / (fan_in + fan_out)).
Python3
import torch

# Initializing a linear layer with
# 2 independent features and 3 dependent features
linear_layer = torch.nn.Linear(2, 3)

# Initializing the weights with the Xavier initialization method
torch.nn.init.xavier_uniform_(linear_layer.weight)

# Displaying the initialized weights
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[ 0.4442, -0.3890],
        [-0.2876, -0.3379],
        [-0.5261,  0.5227]], requires_grad=True)
Kaiming Initialization
Kaiming (He) initialization serves the same purpose for ReLU-like activations: it scales the weights so that the variance of the outputs matches the variance of the inputs while accounting for the nonlinearity of the activation function. With mode="fan_in" and nonlinearity="relu", kaiming_uniform_ draws the weights from U(-bound, bound) with bound = sqrt(6 / fan_in).
Python3
import torch

# Initializing a linear layer with
# 2 independent features and 3 dependent features
linear_layer = torch.nn.Linear(2, 3)

# Initializing the weights with the Kaiming initialization method
torch.nn.init.kaiming_uniform_(linear_layer.weight, a=0,
                               mode="fan_in", nonlinearity="relu")

# Displaying the initialized weights
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[ 0.0582,  0.4701],
        [ 0.4982,  0.5452],
        [-0.0384,  0.5999]], requires_grad=True)
Zeros and Ones Initialization
Initializing all of the weights to zero breaks learning: every neuron in a layer computes the same output and receives the same gradient update, so the neurons never learn distinct features. With zero weights the signal, and therefore the gradient, flowing through the network also dies out, which is the ‘vanishing gradient’ problem in its most extreme form.
Python3
import torch

# Initializing a linear layer with
# 2 independent features and 3 dependent features
linear_layer = torch.nn.Linear(2, 3)

# Initializing the weights with the
# zeros initialization method
torch.nn.init.zeros_(linear_layer.weight)

# Displaying the initialized weights
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[0., 0.],
        [0., 0.],
        [0., 0.]], requires_grad=True)
Initializing all of the weights to one suffers from the same symmetry problem: every weight in a layer receives the same update, so the neurons stay identical. In addition, because each unit sums many inputs with weight 1, activations grow with the width and depth of the network, which can trigger the ‘exploding gradient’ problem.
Python3
import torch

# Initializing a linear layer with
# 2 independent features and 3 dependent features
linear_layer = torch.nn.Linear(2, 3)

# Initializing the weights with the
# ones initialization method
torch.nn.init.ones_(linear_layer.weight)

# Displaying the initialized weights
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[1., 1.],
        [1., 1.],
        [1., 1.]], requires_grad=True)
Normal Initialization
Initializing the weights from a normal distribution centered at zero keeps most of the weights small and symmetric around the mean, which helps avoid the ‘exploding gradient’ problem as long as the standard deviation is chosen sensibly; unlike the uniform distribution, the normal distribution is unbounded, so too large a standard deviation can still produce overly large weights. It must also be noted that initialization alone does not determine a network’s performance: the learning rate, the optimization algorithm, and the other hyperparameters play an equally important role.
Python3
import torch

# Initializing a linear layer with
# 2 independent features and 3 dependent features
linear_layer = torch.nn.Linear(2, 3)

# Initializing the weights with the
# normal initialization method
torch.nn.init.normal_(linear_layer.weight, mean=0, std=1)

# Displaying the initialized weights
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[-0.1759,  0.5192],
        [-0.5621, -0.3871],
        [-0.6071,  0.3538]], requires_grad=True)
Applying a Custom Function for Weight Initialization
An alternative is to write a custom function that initializes the weights and apply it to the layer with the apply() method, which calls the function on the module and every one of its submodules.
Python3
import torch

# User defined function to initialize the weights
def custom_weights(m):
    torch.nn.init.uniform_(m.weight, -0.5, 0.5)

# Initializing a linear layer with
# 2 independent features and 3 dependent features
linear_layer = torch.nn.Linear(2, 3)

# Applying the user defined function to the layer
linear_layer.apply(custom_weights)

# Displaying the initialized weights
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[ 0.4341, -0.3424],
        [ 0.2095,  0.1782],
        [-0.4244,  0.1719]], requires_grad=True)
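The same pattern scales up to an entire model: apply() visits every submodule, so a type check is usually added to initialize only the layers that actually have weights. A minimal sketch follows; the helper name init_weights and the small Sequential model are hypothetical, chosen only for illustration.

Python3

import torch

# Initialize only the Linear layers of a model; apply() also
# visits modules such as ReLU, hence the isinstance check
def init_weights(m):
    if isinstance(m, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        torch.nn.init.zeros_(m.bias)

# A small example model
model = torch.nn.Sequential(
    torch.nn.Linear(2, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 3),
)

# Recursively applies init_weights to every submodule
model.apply(init_weights)

# Displaying the weights of the first Linear layer
print(model[0].weight)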
Using a User-Defined Layer Class for Weight Initialization
Another approach is to define a custom layer class that inherits from torch.nn.Module and to initialize the weights inside its overridden constructor, so that every instance of the layer is created with the desired initialization.
Python3
import torch

# User defined Layer
class MyLayer(torch.nn.Module):

    # Overriding the constructor
    def __init__(self, independent, dependent):
        # Calling the super-class' constructor
        super(MyLayer, self).__init__()
        self.linear = torch.nn.Linear(independent, dependent)
        torch.nn.init.uniform_(self.linear.weight, -0.5, 0.5)

    def forward(self, x):
        return self.linear(x)

# Initializing a linear layer with
# 2 independent features and 3 dependent features
linear_layer = MyLayer(2, 3)

# Displaying the initialized weights
print(linear_layer.linear.weight)
Output:
Parameter containing:
tensor([[-0.1566,  0.2461],
        [-0.3361, -0.0551],
        [ 0.4607,  0.3077]], requires_grad=True)
In conclusion, initializing the weights of a neural network model is an important step in the training process, as it can have a significant impact on the model’s performance. PyTorch provides several built-in initialization methods, including uniform, normal, Xavier, Kaiming, ones, and zeros. Each of these methods has its own advantages and disadvantages, and the choice of method will depend on the specific problem and model architecture being used. It is important to choose an initialization method that is suitable for the problem at hand, as it can help prevent vanishing or exploding gradient problems and improve the convergence speed and final accuracy of the model.