Understanding and Optimizing Neural Network Hyperparameters Series
In the other parts of this series I explained the main hyperparameters used to train a neural network, and how they contribute to the network's success. However, optimising these values can be tedious. Once the right combination of parameters is found, you can expect the network to produce impressive results, assuming there is enough meaningful data. The learning rate and the neuron architecture are certainly not the only parameters, though. I will start by going over a few others, and then explain how you can optimise them all for the best results.
Depending on the tools you're using or the exact form of neural network you're implementing, the relevant parameters can vary. For example, in some cases you may feed the entire dataset through the network in one continuous pass. In other cases, to save memory, you may split your samples into batches and apply one batch to the network per iteration. To keep things understandable, I will describe the parameters in terms of the full dataset, without this separation.
Often, applying all the data once will not be enough to train a network. The solution is simple: feed it all through again. This is no different from the fact that you may need to re-read this article a few times to fully understand it. The number of times the dataset is fully iterated through is the number of epochs, where one epoch is one full pass over the whole dataset (each sample being fed forward and back-propagated). The size of the dataset usually affects the number of epochs required to train the network successfully: small datasets tend to need many more passes than large ones. For example, if we were getting a network to learn the basic OR function,
where
0,0 = 0
1,0 = 1
1,1 = 1
0,1 = 1
We only have four very simple, linearly separable rules. So, for the weights to adjust enough to represent the problem, the same samples must be fed forward many times; you could expect this small dataset to need around 2,000 epochs. With huge datasets that contain thousands of samples (instead of just four), however, you usually only need to train for somewhere between one and thirty epochs.
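To make the epoch loop concrete, here is a minimal sketch in Python/NumPy. It assumes a single sigmoid neuron trained with plain gradient descent and a squared cost; none of these details (nor the learning rate of 0.5) come from the article, they are simply reasonable choices for illustration.

```python
import numpy as np

# OR truth table: four samples, two inputs each.
X = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=2)   # weights initialised randomly between -1 and 1
b = 0.0                          # bias
lr = 0.5                         # learning rate (illustrative value)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):                    # one epoch = one full pass over all four samples
    for x, target in zip(X, t):
        y = sigmoid(np.dot(w, x) + b)        # feed forward
        grad = (y - target) * y * (1 - y)    # derivative of the squared cost w.r.t. the pre-activation
        w -= lr * grad * x                   # back-propagate and update
        b -= lr * grad

# After 2,000 epochs the outputs should sit close to the targets 0, 1, 1, 1.
print([round(float(sigmoid(np.dot(w, x) + b)), 2) for x in X])
```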
Another parameter is the weight initialisation. Weights are the essential values that make up a neural network; during back propagation they are adjusted, and they numerically represent what has been learnt. There are different ways you can initialise weights. A popular choice is to start all weights at random values within a certain range, usually between 0 and 1, or between -1 and 1. Although this suits many problems, a range that is not matched to the problem may not reach the same level of success as quickly as one that is.
Here is an example of a neural network trained with weights initialised between -1 and 1 (blue) and then between -0.4 and 0.4 (green). It is clear that for this problem it was worth optimising the initialisation range, as the narrower range resulted in more efficient and successful training.
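A minimal sketch of how such ranges might be set up in Python/NumPy; the helper name init_weights and the 2-input, 3-neuron layer shape are purely illustrative and not taken from the article.

```python
import numpy as np

def init_weights(n_in, n_out, low, high, seed=0):
    """Initialise an (n_in x n_out) weight matrix uniformly within [low, high]."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=(n_in, n_out))

# The two ranges compared in the plot above, for a hypothetical 2 -> 3 layer.
w_wide   = init_weights(2, 3, -1.0, 1.0)    # blue curve: weights in [-1, 1]
w_narrow = init_weights(2, 3, -0.4, 0.4)    # green curve: weights in [-0.4, 0.4]
```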
There are also many different ways in which neurons can process their input, and the method used is essentially another hyperparameter. These are the activation functions, which are intended to squash or normalise neuron values that may otherwise be volatile. They are simply different functions, such as Tanh, Step, Sigmoid, Identity and many more. Most commonly, Sigmoid is used, which squashes any value to a proportional size between 0 and 1. Because the outputs are squashed into this range, the target values they are compared against at the output layer must lie in the same range. There are also different functions that perform this comparison, and depending on the situation you may decide to use different ones. They are known as loss functions or cost functions, such as the squared cost
\(L = \frac{1}{2}\sum_{i} (y_{i} - t_{i})^{2}\)
Or the cross-entropy cost
\(L = -\sum_{i} t_{i} \cdot \log(y_{i})\)
where \(y_{i}\) is the output of the \(i\)th output neuron and \(t_{i}\) is the target it is compared against.
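The two costs above, together with the sigmoid activation, translate directly into code. Here is a short Python/NumPy sketch; the small eps term guarding against log(0) is my addition, not part of the formula above.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def squared_cost(y, t):
    """L = 1/2 * sum_i (y_i - t_i)^2"""
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_cost(y, t, eps=1e-12):
    """L = -sum_i t_i * log(y_i); eps avoids log(0)."""
    return -np.sum(t * np.log(y + eps))

y = sigmoid(np.array([2.0, -1.0, 0.5]))   # outputs of three output neurons
t = np.array([1.0, 0.0, 1.0])             # their targets
print(squared_cost(y, t), cross_entropy_cost(y, t))
```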
All these hyperparameters need suitable values for the network to succeed. Most simply, you can of course optimise the values manually: using your knowledge of the data and your understanding of each parameter, you may reach a satisfactory level of success. This is essentially trial and error and is known as hand-tuning. There is also a method known as grid search, which trains the network with every combination of a predefined set of candidate values and lets you choose the best performer. However, this can take ages: there are many different hyperparameters, and trying several values of each results in a huge number of combinations to evaluate.
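A minimal grid-search sketch in Python follows; train_and_evaluate is a hypothetical placeholder standing in for whatever training-and-validation routine you would actually use.

```python
from itertools import product

def train_and_evaluate(learning_rate, epochs, init_range):
    # Hypothetical placeholder: in practice this would train the network with
    # the given hyperparameters and return its validation error. The dummy
    # formula below only exists to keep the sketch runnable.
    return abs(learning_rate - 0.1) + abs(epochs - 100) / 1000 + init_range

learning_rates = [0.01, 0.1, 0.5]
epoch_counts   = [10, 100, 1000]
init_ranges    = [0.4, 1.0]          # half-width of the uniform weight-init range

# Grid search: every combination is tried, so the cost grows multiplicatively
# with each extra hyperparameter and each extra candidate value.
best = None
for lr, ep, init in product(learning_rates, epoch_counts, init_ranges):
    error = train_and_evaluate(lr, ep, init)
    if best is None or error < best[0]:
        best = (error, lr, ep, init)

print("best (error, lr, epochs, init_range):", best)
```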
You can usually speed this up with a random search: the same idea as grid search, but sampling combinations at random. You will often find a good configuration sooner, as it acts as a sparser version of what a grid would produce. The Bayesian method, as you may expect, is quite similar to hand-tuning but more algorithmic: its estimates of which values to try next are based on previously tried values and the errors they produced. Although that is slightly outside the scope of this series, it is worth knowing how much of the process can be automated.
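And a corresponding random-search sketch, reusing the hypothetical train_and_evaluate placeholder from the grid-search example; the sampling ranges are illustrative only.

```python
import random

random.seed(0)

# Random search: sample hyperparameter combinations instead of enumerating a grid.
n_trials = 20
best = None
for _ in range(n_trials):
    lr   = 10 ** random.uniform(-3, 0)        # learning rate sampled on a log scale
    ep   = random.randint(10, 1000)           # number of epochs
    init = random.uniform(0.1, 1.0)           # half-width of the weight-init range
    error = train_and_evaluate(lr, ep, init)  # same placeholder as in the grid-search sketch
    if best is None or error < best[0]:
        best = (error, lr, ep, init)

print("best (error, lr, epochs, init_range):", best)
```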
All these moving parts have bodies of in-depth research behind them that I have only scratched the surface of. Still, it's important that you have seen this general overview of the different pieces that make up a neural network and how they contribute to its success.
©ODSC2017