What AlexNet brought to the world of Deep Learning


This post was originally published by Richmond Alake at Towards Data Science

Rectified Linear Units (ReLU)

Before AlexNet, the standard activation functions for neurons in a neural network were tanh and sigmoid. These saturating non-linearities were the go-to choice for modelling neuron activations within CNNs.

AlexNet instead used Rectified Linear Units, ReLU for short. ReLU was introduced in a 2010 paper by Vinod Nair and Geoffrey E. Hinton.

ReLU can be described as a transfer function applied to the output of the preceding convolution layer. It leaves positive values unchanged and clamps negative values to zero.

The benefit of using ReLU is that it accelerates training: because ReLU does not saturate for positive inputs, gradient descent converges faster than with saturating non-linearities such as tanh or sigmoid. The AlexNet paper reports a four-layer CNN reaching a 25% training error on CIFAR-10 six times faster with ReLUs than with tanh units.

Another benefit of the ReLU layer is that it introduces non-linearity into the network. Without a non-linearity between them, successive convolutions are associative and collapse into a single linear operation; ReLU prevents this.
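
The behaviour described above can be sketched in a few lines of NumPy. This is only an illustration: the helper names (relu, relu_grad, tanh_grad) and the tiny weight matrices W1 and W2 are assumptions made up for the example, not anything taken from the AlexNet paper. It shows the clamping of negative values, the gradient staying at 1 for positive inputs while the tanh gradient shrinks, and how two linear steps collapse into one unless a non-linearity sits between them.

```python
import numpy as np

def relu(x):
    """Keep positive values unchanged and clamp negative values to zero."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Gradient of ReLU: 1 where the input is positive, 0 elsewhere."""
    return (x > 0).astype(x.dtype)

def tanh_grad(x):
    """Gradient of tanh, which shrinks towards 0 for large |x| (saturation)."""
    return 1.0 - np.tanh(x) ** 2

feature_map = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
activated = relu(feature_map)

# Hypothetical 2-layer linear stack, used only to illustrate the collapse.
W1 = np.array([[1.0, -2.0], [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 1.0])

linear_stack = W2 @ (W1 @ x)    # two linear steps applied in sequence
collapsed = (W2 @ W1) @ x       # identical result from one merged matrix
with_relu = W2 @ relu(W1 @ x)   # ReLU in between breaks the equivalence
```

For an input of 3.0 the tanh gradient is already below 0.01, whereas the ReLU gradient is still exactly 1, which is one way to see why gradient descent tends to progress faster with ReLU.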


Multiple GPUs

In the original research paper that introduced the AlexNet architecture, the model was trained on two GTX 580 GPUs, each with 3GB of memory.

GPU parallelization and distributed training are techniques that are very much in use today.

According to the paper, the model was split across the two GPUs: half of the model’s neurons were held on one GPU and the other half in the memory of the second. The GPUs communicated with each other directly, without going through the host machine. Communication between the GPUs is restricted to certain layers; only specific layers read feature maps from both GPUs.

For example, the third layer of the AlexNet network takes its inputs from all of the second layer’s feature maps, on both GPUs, whereas the fourth layer takes its inputs only from the third-layer feature maps that reside on the same GPU. This will be better illustrated later in this article.
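
The difference between a layer that reads feature maps from both GPUs and a layer restricted to its own GPU can be sketched on a single machine. The example below is a simplified assumption, not the paper's implementation: it uses four hypothetical channels, treats each layer as a plain channel-mixing matrix (like a 1 by 1 convolution with no activation), and models "same GPU only" as a block-diagonal weight matrix.

```python
import numpy as np

channels = 4            # hypothetical; the first half "lives" on GPU 0
half = channels // 2

# Feature maps from the previous layer: (channels, flattened positions).
prev = np.arange(24, dtype=float).reshape(channels, 6)

# Cross-GPU layer: every output channel reads every input channel.
w_full = np.ones((channels, channels))

# Same-GPU layer: zero out the blocks that would cross between GPUs.
w_local = w_full.copy()
w_local[:half, half:] = 0.0
w_local[half:, :half] = 0.0

def mix(weights, maps):
    """Channel mixing only, standing in for a convolution layer."""
    return weights @ maps

# Change only the feature maps held by the second GPU.
perturbed = prev.copy()
perturbed[half:] += 1.0

# How much do the outputs on the first GPU change in each case?
local_change = mix(w_local, perturbed)[:half] - mix(w_local, prev)[:half]
full_change = mix(w_full, perturbed)[:half] - mix(w_full, prev)[:half]
```

With the block-diagonal weights, the first GPU's outputs are unaffected by changes on the second GPU, so no data needs to be exchanged for that layer; with the full weights they do change, which is where inter-GPU communication is required.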

Local Response Normalisation

Normalization takes a set of data points and places them on a comparable basis or scale (this is an overly simplistic description).

Batch Normalization (BN) within CNNs is a technique that standardizes and normalizes inputs by transforming a batch of input data to have a mean of zero and a standard deviation of one.
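
As a rough sketch of the standardization step only, the snippet below normalizes each feature of a small batch to zero mean and unit standard deviation. The function name batch_norm, the eps value, and the sample batch are assumptions for illustration; the learnable scale and shift parameters that batch normalization layers normally include are left out.

```python
import numpy as np

def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) across the batch dimension (rows)."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    # eps avoids division by zero when a feature has no variance.
    return (batch - mean) / np.sqrt(var + eps)

# Two features on very different scales, three examples in the batch.
batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
normalized = batch_norm(batch)
```

After normalization both columns sit on the same scale, even though the raw values differ by two orders of magnitude.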

Many are familiar with batch normalization, but the AlexNet architecture used a different method of normalization within the network: Local Response Normalization (LRN).

LRN is a technique that normalizes the activation of each neuron by the activations of its neighboring neurons. Neighboring neurons are neurons in adjacent feature maps that share the same spatial position. Each activation is divided by a term that grows with the sum of the squared activations of its neighbors, so neurons with high activations stand out relative to their neighbors; this mimics the lateral inhibition observed in neurobiology.

LRN is not widely used in modern CNN architectures, as there are other, more effective methods of normalization. However, LRN implementations can still be found in some standard machine learning libraries and frameworks, so feel free to experiment.
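
A minimal NumPy sketch of LRN as formulated in the AlexNet paper: each activation is divided by (k + alpha * sum of squared activations over n adjacent feature maps at the same spatial position) raised to the power beta. The default values k=2, n=5, alpha=1e-4 and beta=0.75 are the ones reported in the paper; the function name and the (channels, height, width) layout are assumptions of this example, and the loop favours readability over speed.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Apply LRN across channels; `a` has shape (channels, height, width)."""
    num_channels = a.shape[0]
    squared = a ** 2
    out = np.empty_like(a)
    for i in range(num_channels):
        # Neighbouring feature maps: up to n // 2 on either side of map i,
        # clipped at the first and last channel.
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        scale = (k + alpha * squared[lo:hi + 1].sum(axis=0)) ** beta
        out[i] = a[i] / scale
    return out

activations = np.ones((8, 2, 2))
normalized = local_response_norm(activations)
```

Channels at the edge of the stack have fewer neighbours, so their denominator is slightly smaller than that of channels in the middle.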

Overlapping Pooling

Pooling layers in CNNs summarize the information within a set of pixels or values in a feature map and project it onto a smaller grid, while preserving the general information from the original set of pixels.

The illustration below provides an example of pooling, more specifically max pooling. Max pooling is a variant of sub-sampling in which the output is the maximum value among the pixels that fall within the receptive field of the pooling window.

Max Pooling Illustration by Justin Francis at O’Reilly

The paper that introduced the AlexNet CNN architecture used a different pooling methodology: overlapping pooling. In traditional pooling, the stride from the center of one pooling window to the next is set so that values from one pooling window do not appear in a subsequent pooling window.

In contrast, overlapping pooling uses a stride that is smaller than the dimension of the pooling window (the paper uses a 3 by 3 window with a stride of 2). This means that the outputs of adjacent pooling windows contain information from pixels/values that have been pooled more than once. It’s hard to see the benefits of this, but according to the findings of the paper, overlapping pooling makes the model slightly less prone to overfitting during training.
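
The two pooling schemes can be compared with a small, unoptimized max pooling function. The function name max_pool2d and the 5 by 5 feature map are assumptions chosen for illustration. A 2 by 2 window with stride 2 gives non-overlapping pooling, while a 3 by 3 window with stride 2 (the configuration used in the AlexNet paper) gives overlapping pooling.

```python
import numpy as np

def max_pool2d(feature_map, window, stride):
    """Max pooling over a 2D feature map without padding."""
    height, width = feature_map.shape
    out_h = (height - window) // stride + 1
    out_w = (width - window) // stride + 1
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            out[r, c] = feature_map[top:top + window, left:left + window].max()
    return out

feature_map = np.arange(25.0).reshape(5, 5)

# Stride equals window size: each value belongs to at most one window.
non_overlapping = max_pool2d(feature_map, window=2, stride=2)

# Stride smaller than window size: row 2 and column 2 fall in two windows.
overlapping = max_pool2d(feature_map, window=3, stride=2)
```

Both calls produce a 2 by 2 output here, but in the overlapping case the middle row and column of the input contribute to more than one pooling window.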

Data Augmentation

Another standard method of reducing the chances of overfitting a network is through data augmentation. By artificially augmenting the dataset, you increase the number of training examples, which in turn increases the amount of data the network is exposed to during the training phase.

Image augmentation usually takes the form of transformations such as translation, scaling, cropping, and flipping.

The images used to train the network in the original AlexNet paper were artificially augmented during the training phase. The augmentation techniques used were random cropping, horizontal flipping, and alteration of pixel intensities within images.

Images within the training set were randomly cropped from their original 256 by 256 dimensions to obtain new 224 by 224 images.
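
The cropping step can be sketched as follows. This is an illustrative assumption rather than the paper's code: the function name random_crop_and_flip, the random test image, and the 50% flip probability are made up for the example. The horizontal flip is included because the paper also reports using horizontal reflections of the cropped patches; the pixel intensity alteration is not shown.

```python
import numpy as np

def random_crop_and_flip(image, crop_size=224, rng=None):
    """Return a random crop_size x crop_size patch, flipped half of the time."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    # Pick the top-left corner so the patch stays inside the image.
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # reverse the width axis: horizontal flip
    return patch

rng = np.random.default_rng(seed=0)
# Stand-in for a 256 x 256 RGB training image.
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
patches = [random_crop_and_flip(image, rng=rng) for _ in range(4)]
```

Each call yields a different 224 by 224 view of the same source image, so the network rarely sees exactly the same input twice.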

Why does augmentation work?

It turns out that randomly applying augmentation to training sets can significantly reduce the potential of a network to overfit during training.

The augmented images are simply derived from the content of the original training images, so why does augmentation work so well?

Simply put, data augmentation increases the variation in your dataset without the need for sourcing new data. The network’s ability to generalize well to unseen data also increases.

Let’s take a very literal example: the images in the ‘production’ environment might not be perfect; some might be tilted, blurred, or contain only parts of the essential features. Therefore, training a network on a dataset that includes a wider variation of training data will enable the trained network to classify images in a production environment more successfully.


Dropout

Dropout is a term many deep learning practitioners are familiar with. It is a technique used to reduce a model’s potential to overfit.

The dropout technique works by attaching a probability to the activation of neurons within the layers of a CNN. This probability determines a neuron’s chance of being active during a given feed-forward step and of taking part in the corresponding backpropagation step. In AlexNet, each affected neuron is dropped with a probability of 0.5.

Dropout is useful because it reduces each neuron’s dependence on neighboring neurons; as a result, each neuron learns more robust features.

In the AlexNet architecture, the dropout technique was utilized within the first two fully connected layers.
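
A minimal sketch of the dropout mechanism, assuming the scheme described in the AlexNet paper: during training each neuron's output is set to zero with probability 0.5, and at test time all neurons are used but their outputs are multiplied by 0.5. The function name dropout and its parameters are assumptions for illustration; many modern libraries instead rescale during training ("inverted" dropout).

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Zero out activations at random in training; rescale at test time."""
    if not training:
        # Use every neuron, scaled by the probability of being kept.
        return activations * (1.0 - drop_prob)
    rng = np.random.default_rng() if rng is None else rng
    # True where a neuron is kept for this forward pass. Reusing the same
    # mask in the backward pass means dropped neurons receive no gradient.
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask

rng = np.random.default_rng(seed=0)
hidden = np.ones(10_000)
train_out = dropout(hidden, training=True, rng=rng)
test_out = dropout(hidden, training=False)
```

On average about half of the training-time outputs are zero, which matches the 0.5 scaling applied at test time.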

One of the disadvantages of dropout is that it increases the time it takes for a network to converge; the paper reports that it roughly doubles the number of iterations required.

However, the advantages of dropout far outweigh its disadvantages.
