A, B, Cs… of Deep Learning Hyperparameters


This post was originally published by at Towards Data Science

Deep learning is in the news because of its accuracy and the control it gives us over our models. Frameworks such as TensorFlow, Keras, Caffe, and many others have simplified the programming work for deep learning. We no longer have to worry about backpropagation steps, weight updates, and so on; we just have to tune hyperparameters.

However, people in deep learning know it is not a ‘just’ thing. Along with its high prediction accuracy, deep learning is bound to come with many hyperparameters, which practitioners set according to the task or kind of problem they are modeling.

AI is a boon to the world but can be a curse to you if you do not use it wisely. And to use it wisely, you need to understand it precisely.

So to “understand them precisely” we need to answer three Ws for each hyperparameter: when, where, and in what context. Before that, let us list them for a smooth start.

We are going to discuss the three Ws of the following hyperparameters:

  1. Learning Rate
  2. Number of hidden units and Number of hidden layers
  3. β (Gradient Descent with momentum)
  4. β1, β2, and ϵ (Adam optimizer)
  5. Learning rate decay
  6. Mini-batch size (Mini-Batch Gradient descent)

When you write gradient descent in its most fundamental form and update the parameters, this is where the learning rate appears:

Steps of Gradient Descent (Published by Author)

The learning rate decides how big a jump you take in each iteration of gradient descent. Generally, the learning rate is between 0.001 and 1, and it can vary as you progress towards the minimum of the function.
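To make the role of the learning rate concrete, here is a minimal sketch of the update rule on a toy problem. The cost function J(w) = (w - 3)^2, the function names, and the three learning rates are assumptions chosen for illustration; they do not come from the original figure.

```python
def gradient_descent(grad, w0, learning_rate, num_iterations):
    """Plain gradient descent: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(num_iterations):
        w = w - learning_rate * grad(w)
    return w


def toy_grad(w):
    # Gradient of the assumed toy cost J(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)


# Same starting point and iteration budget, three different jump sizes.
w_small = gradient_descent(toy_grad, w0=0.0, learning_rate=0.01, num_iterations=50)
w_good = gradient_descent(toy_grad, w0=0.0, learning_rate=0.1, num_iterations=50)
w_large = gradient_descent(toy_grad, w0=0.0, learning_rate=1.1, num_iterations=50)
```

On this toy cost, the smallest rate is still far from the minimum after 50 iterations, the middle one lands very close to w = 3, and the largest one overshoots further on every step and diverges.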

Another important pair of hyperparameters is the number of hidden layers and the number of hidden units in each layer. Together they decide how complex a function the model can fit to the given data points. The more hidden units and hidden layers there are, the more complex the function outlined by the model, and hence the higher the chance of overfitting.

Frequently, people use many hidden layers and hidden units to build a deep neural network and apply techniques such as L2 or dropout regularization to reduce the risk of overfitting the data.

There is no prescribed way of determining the correct or optimal number of layers. You have to start with some minimum number and increase it until you reach a desirable predictive accuracy. That is why applied machine learning is a highly iterative process.

Another factor is your hardware: the CPU or GPU you have limits the number of hidden layers and units you can work with. Because the process is highly iterative, you want the results of each iteration promptly, and for that you need substantial computational power.
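One rough way to see how these two choices drive model complexity and compute cost is to count trainable parameters. The sketch below assumes a plain fully connected network; the helper name `count_parameters` and the layer sizes are hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the units per layer, input first and output last,
    e.g. [4, 8, 1] means 4 inputs, one hidden layer of 8 units, 1 output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


shallow = count_parameters([4, 8, 1])      # one hidden layer of 8 units
deeper = count_parameters([4, 8, 8, 1])    # one more hidden layer
wider = count_parameters([4, 64, 64, 1])   # same depth, more units per layer
```

Adding a layer or widening the existing ones both increase the count, and widening grows it fastest here because the weights between two hidden layers scale with the product of their sizes.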

What is a Weighted Average?

Suppose we recorded the temperature over 200 days of summer (the scatter points in yellow). The additional curves show the weighted average of the temperature for different weights (β).

Temperature distribution of 200 days of Delhi, India (Published by Author)

Note: we have initialized v[0] = temp[0] to ensure that the weighted average temperature stays well within the range of the actual data.
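The curves described above can be sketched as an exponentially weighted average. The recurrence v[t] = β · v[t-1] + (1 - β) · temp[t] is the usual form and is assumed here, since the text does not spell it out; the temperatures are made-up values, not the Delhi data.

```python
def exponentially_weighted_average(values, beta):
    """Return v where v[0] = values[0] and v[t] = beta * v[t-1] + (1 - beta) * values[t]."""
    v = [values[0]]  # initialization from the note: v[0] = temp[0]
    for x in values[1:]:
        v.append(beta * v[-1] + (1.0 - beta) * x)
    return v


# Made-up daily temperatures in degrees Celsius, for illustration only.
temps = [30.0, 32.0, 31.0, 35.0, 40.0, 38.0, 36.0]

smooth_heavy = exponentially_weighted_average(temps, beta=0.9)  # smoother, slower to react
smooth_light = exponentially_weighted_average(temps, beta=0.5)  # follows the data more closely
```

A larger β gives more weight to the past, so the curve is smoother but lags behind sudden changes in temperature.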

How can we use this?

Suppose we have the following type of cost function and our gradient descent is working well with it.

Batch Gradient Descent on a Cost function (Published by Author)

But when we have a lot of data to train on, gradient descent spends a large amount of time oscillating before it reaches the minimum of the cost function. When we apply a weighted average to the gradients, it averages out the values in both directions. Consequently, the vertical components cancel each other out and we gain more momentum in the horizontal direction.

Gradient Descent with a Momentum on a Cost function (Published by Author)

This is called gradient descent with momentum. Generally, β ranges from 0.9 to 0.99 and we use a log scale to find the optimal value for this hyperparameter.
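A minimal sketch of this idea follows, assuming the weighted-average form of the velocity, v = β · v + (1 - β) · gradient. The elongated toy cost, the starting point, the learning rate, and the helper names are all assumptions for illustration. The `sample_beta` helper shows one common way to search β on a log scale, by sampling 1 - β between 0.01 and 0.1.

```python
import random


def momentum_step(w, v, grad, learning_rate, beta):
    """One update of gradient descent with momentum (weighted average of gradients)."""
    v = [beta * v_i + (1.0 - beta) * g_i for v_i, g_i in zip(v, grad)]
    w = [w_i - learning_rate * v_i for w_i, v_i in zip(w, v)]
    return w, v


def toy_grad(w):
    # Gradient of the assumed cost J(x, y) = 0.5 * (x^2 + 25 * y^2):
    # shallow along x, steep along y, so plain gradient descent zigzags in y.
    x, y = w
    return [x, 25.0 * y]


def sample_beta(rng):
    """Sample beta so that 1 - beta is uniform on a log scale between 0.01 and 0.1."""
    r = rng.uniform(-2.0, -1.0)
    return 1.0 - 10.0 ** r


w, v = [10.0, 1.0], [0.0, 0.0]
for _ in range(500):
    w, v = momentum_step(w, v, toy_grad(w), learning_rate=0.07, beta=0.9)

# Five candidate values of beta in [0.9, 0.99] for a hyperparameter search.
candidate_betas = [sample_beta(random.Random(seed)) for seed in range(5)]
```

Sampling on a log scale matters because the behaviour changes much more between β = 0.98 and β = 0.99 than between β = 0.90 and β = 0.91.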

Batch Normalization

It is observed that when we normalize the input data, training becomes faster. So why not normalize the input to every hidden layer as well? This technique is called batch normalization. Usually, we normalize the values before passing them into the activation function.

Consider a model trained on a set of cat pictures: there is a chance that it will not predict correctly on a new picture.

This can happen when all the cats in the training dataset are black and the picture in the test dataset is of a white cat. Mathematically, the data distributions of the test and training datasets are different.

Normalization is a technique by which we obtain the same data distribution each time we provide input. The general steps of normalization are:

Steps for normalizing input data (Published by Author)

These steps give the data zero mean and unit standard deviation. This particular distribution may not always work. There may be situations where we need a distribution with a different mean and spread. Hence, we need to be able to adjust the distribution parameters, i.e. the mean and standard deviation.

Additional steps for adjusting the central tendency for normalized data (Published by Author)

Step 4 allows us to adjust the distribution parameters. Also, in step 3 we have added an epsilon term to ensure that the denominator is never equal to zero.
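The four steps can be sketched as follows. The step numbering is inferred from the text, since the original figures are not reproduced here. The parameter names `gamma` and `beta` for the scale and shift are the conventional ones and are an assumption; this `beta` is unrelated to the momentum β.

```python
import math


def normalize(values, gamma=1.0, beta=0.0, epsilon=1e-8):
    """Normalize values to zero mean and unit deviation, then rescale and shift."""
    n = len(values)
    mean = sum(values) / n                                  # step 1: mean
    variance = sum((x - mean) ** 2 for x in values) / n     # step 2: variance
    # step 3: normalize; epsilon keeps the denominator away from zero
    normalized = [(x - mean) / math.sqrt(variance + epsilon) for x in values]
    # step 4: adjust the distribution with a scale (gamma) and a shift (beta)
    return [gamma * z + beta for z in normalized]


activations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values

standard = normalize(activations)                       # mean ~0, deviation ~1
adjusted = normalize(activations, gamma=2.0, beta=5.0)  # mean ~5, deviation ~2
constant = normalize([3.0, 3.0, 3.0])                   # zero variance, no division by zero
```

In batch normalization, the scale and shift are learned during training along with the weights, so the network itself chooses the mean and spread that work best for each layer.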

Process of one iteration of gradient descent:

Process of Batch-gradient Descent (Published by Author)

When the training set is very large, say around 5 million examples, the time required to update the parameters once will be large. As a remedy, we divide the large dataset into smaller datasets and update the parameters after each pass over one of these smaller sets.

Process of Mini-batch-gradient Descent (Published by Author)

Here, we have divided the training set into three smaller sets, but there are some conventions for doing so. People recommend making the batch size a power of two, e.g. 64, 128, or 1024 examples per mini-batch. This somewhat optimizes memory allocation and, in turn, the performance of the model.

The completion of one cycle through the whole dataset is called one epoch. With batch gradient descent (simple gradient descent) we update the parameters once per epoch, whereas with mini-batch gradient descent the parameters are updated multiple times during an epoch.
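The difference in the number of updates per epoch can be sketched as below. The dataset of 1,000 placeholder examples and the helper name `make_mini_batches` are assumptions for illustration.

```python
def make_mini_batches(examples, batch_size):
    """Split examples into consecutive mini-batches; the last one may be smaller."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


examples = list(range(1000))  # stand-in for 1,000 training examples

mini_batches = make_mini_batches(examples, batch_size=64)

updates_per_epoch_batch = 1                       # batch gradient descent: one update per epoch
updates_per_epoch_mini_batch = len(mini_batches)  # one update per mini-batch
```

In practice the examples are usually shuffled before being split, so that each epoch sees different mini-batches.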

A large learning rate saves time, but there is a chance that we never reach a minimum. Conversely, with a small learning rate, learning is slow. So why not vary the learning rate while training the model?

As training approaches convergence (the minimum of the cost function), the learning rate can be reduced for better results. A very general form of the learning rate, in this case, may be shown as

Implementing Learning rate decay (Published by Author)

In the example, we can observe how the learning rate varies. It now becomes important to choose the “decay_rate” wisely, so it can be regarded as another hyperparameter.
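One common decay schedule is sketched below. The exact formula in the original figure is not reproduced here, so the form learning_rate = initial_rate / (1 + decay_rate × epoch) is an assumption, as are the starting values.

```python
def decayed_learning_rate(initial_rate, decay_rate, epoch):
    """Inverse-time decay: the rate shrinks as the epoch number grows."""
    return initial_rate / (1.0 + decay_rate * epoch)


# Assumed starting values, for illustration only.
schedule = [decayed_learning_rate(initial_rate=0.2, decay_rate=1.0, epoch=e) for e in range(5)]
```

A larger `decay_rate` makes the learning rate fall faster, which is why it has to be tuned like any other hyperparameter.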

It is very important to understand the details of every parameter and hyperparameter when you train a model with the dream that it will bring some change to the world and society. Even a small thing can spark a large innovation.

With that, I hope this article has added something to your understanding. Please share your thoughts because

“Criticism is an indirect form of self-boasting.”– Emmet Fox
