*This post was originally published by Andre Ye at Towards Data Science*

Immediately, the authors of the Batch Normalization paper begin by initializing variables.

*Tip*: Machine learning papers are notorious for creating dozens of variables and expecting the reader to know what they mean when they are referenced later. Take a highlighter and highlight where a variable is ‘initialized’ and where it is used henceforth. This will make reading much easier.

The authors present the formula for Stochastic Gradient Descent (SGD), initializing several variables, such as the parameters Θ and the total number *N* of training examples in the set.

Glossary:

“arg min f(x)” refers to the arguments, or inputs, which minimize the function that follows, in this case f(x).

In English, the statement reads, “the parameters of the network Θ are equal to the values at which [the average of all values produced by a function *l*, which takes in an individual training point and the current parameters] is minimized.” This is a mathematically rigorous definition of the goal of a neural network. For the sake of rigor, equations are often written in a way that looks more complex than the idea behind them. It’s helpful to write out what an equation means in a human language.
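For reference, the statement described above can be written out as follows. This is a reconstruction that follows the paper's notation as I understand it (*N* training examples *x*[i], loss *l*, parameters Θ), so treat the exact symbols as an assumption rather than a quotation:

$$
\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta)
$$

Reading it right to left matches the English sentence: evaluate the loss on each training point, average the results, then pick the parameters that make that average as small as possible.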

The authors note that with SGD, training proceeds in steps, introducing an additional variable *m* which represents the size of each “mini-batch”. By using mini-batches instead of one example at a time, the gradient of the loss over a mini-batch becomes a better estimate of the gradient over the entire training set.

The highlighted statement reads, “the average of [the change in the loss function, which takes in the current training example and the parameters, with respect to the parameters] for all values *i* in the training mini-batch”. This is the mini-batch estimate of the ‘gradient’, which describes the local slope of the loss with respect to the parameters and so indicates how the parameters should be updated. A fraction in front of a sigma (here 1/*m* followed by Σ) is almost always just a formal way of writing an average.
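To make the “average of per-example gradients” idea concrete, here is a minimal sketch in plain Python. The one-parameter model, the squared-error loss, and the helper names `loss_grad` and `minibatch_gradient` are all illustrative assumptions, not anything taken from the paper.

```python
import random

def loss_grad(x, y, theta):
    """Gradient of the squared error 0.5 * (theta * x - y)**2 with respect to theta."""
    return (theta * x - y) * x

def minibatch_gradient(batch, theta):
    """Average the per-example gradients over a mini-batch of (x, y) pairs."""
    m = len(batch)
    return sum(loss_grad(x, y, theta) for x, y in batch) / m

random.seed(0)
# Toy data generated from y = 3x, so the ideal parameter is theta = 3.
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(1000))]

theta = 0.0
full_grad = minibatch_gradient(data, theta)        # gradient over the whole set
single_grad = minibatch_gradient(data[:1], theta)  # "mini-batch" of size m = 1
batch_grad = minibatch_gradient(data[:64], theta)  # mini-batch of size m = 64

print(full_grad, single_grad, batch_grad)
```

On most runs the 64-example estimate lands closer to the full-set gradient than the single-example one, which is the point the authors make about mini-batches.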

Glossary:

∂ is universally used to represent a partial derivative: the rate of change of a function of two or more variables with respect to one of them, with the others held fixed. A derivative can be thought of, in a simple context, as “the effect that a small change in one variable has on another variable.”

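While SGD works well for a variety of reasons, the authors write that a change in the distribution of one layer’s inputs causes difficulties in the following layers because they need to adapt to the changing distribution. A change in the input distribution of a learning system is known as covariate shift; the authors call the version that happens between layers inside a network internal covariate shift.

While covariate shift has traditionally been handled by domain adaptation, the authors believe that the notion can be extended from the whole learning system to its parts, such as a sub-network or a layer.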

The first highlighted statement reads “the loss is equal to an arbitrary transformation on [a parameter and an arbitrary transformation on (another parameter and an input *u*)]”. In this statement, the authors are setting up a hypothetical network to support their ideas. The authors simplify the initial statement by replacing the inner component with *x*, which represents the input coming from the previous function. Θ[1] and Θ[2] are the parameters that are learned to minimize the loss *l*. These premises are identical to those of a complete neural network, simply built at a smaller scale.

The highlighted equation demonstrates the mechanics of gradient descent, which computes partial derivatives to obtain the gradient, with the step size determined by a learning rate. The resulting change, which may be positive or negative, is subtracted from the parameter Θ[2] and is intended to steer the parameters in a direction that minimizes the loss *F2*.
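Written out, the hypothetical network and its update step look roughly like this. The symbols follow the paper's notation as I understand it (*F1* and *F2* for the two transformations, α for the learning rate, *m* for the mini-batch size), so treat this as a reconstruction under those assumptions rather than a quotation:

$$
\ell = F_2(F_1(u, \Theta_1), \Theta_2), \qquad x = F_1(u, \Theta_1) \;\Rightarrow\; \ell = F_2(x, \Theta_2)
$$

$$
\Theta_2 \leftarrow \Theta_2 - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial F_2(x_i, \Theta_2)}{\partial \Theta_2}
$$

The second line is the same “fraction plus sigma” average seen earlier, scaled by the learning rate and subtracted from the current parameters.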

To establish that the comparison between a real neural network and the hypothesized one is legitimate, the authors write that this gradient descent step is the same as the step for a stand-alone network *F2* with the input *x*. It is therefore advantageous to keep *x*’s distribution fixed, because Θ[2] then does not need to readjust to compensate for a change in the distribution of *x*.

Note:

This is a common theme in machine learning papers. Because machine learning deals with systems that involve so many more variables and so much more complexity than other fields, its papers will often follow a three-step process for demonstrating how a thesis works:

1. Create a hypothetical and simple system.

2. Establish that it behaves identically to a real neural network.

3. Draw conclusions by performing operations on the simple system.

Of course, in more modern papers one will see a section entirely devoted to displaying accuracies and how the method works on various common benchmark datasets like ImageNet, with comparisons to other methods.

In addition, the authors write, a fixed distribution of inputs to a sub-network would also benefit the layers outside it. They bring up the standard equation *z = g(Wu + b)*, with *W* representing the weights and *b* the bias, and write *x = Wu + b* for the input to *g*. *g(x)* is defined to be the sigmoid function. The authors point out that as *x*’s distance from 0 increases, the derivative of *g* (its gradient) tends ever closer to 0.

Because the sigmoid flattens out for extreme values of *x*, a shift in the distribution that pushes *x* toward those regions yields a smaller gradient, and therefore less information for learning.
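The saturation is easy to check numerically. The sketch below uses only the standard definition of the sigmoid and its derivative, *g'(x) = g(x)(1 − g(x))*; the function names are my own.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, g'(x) = g(x) * (1 - g(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and shrinks as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  g(x) = {sigmoid(x):.6f}  g'(x) = {sigmoid_grad(x):.6f}")
```

The derivative is largest at *x* = 0, where it equals 0.25, and falls off quickly in both directions, so inputs that drift away from 0 pass back very little gradient.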

Glossary:

g’(x) is another notation for the derivative of g(x).

Hence, the authors conclude, useful information carried by the gradient will slowly vanish as it propagates back through the network toward the earlier layers, because changes in distributions cause cumulative decay in information. This is also known as the vanishing gradient problem. (Its opposite, the exploding gradient problem, is when massive gradients cause weights to fluctuate wildly, causing instability in learning.)

As a first attempt at addressing this issue, they consider a layer that adds a learnable bias *b* to the input *u*, then normalizes the result by subtracting its mean. If the gradient step ignores how the normalization depends on the bias (the bias’s gradient is calculated and applied independently), then updating *b* and the corresponding change in the normalization together yield no change in the layer’s output.

Glossary:

E[x] is often used to represent the mean of x, where “E” stands for “expected value”. The paper gives it a formal definition as a summation later on. ∝ means “proportional to”: the delta (change) in b is proportional to the negative gradient of the loss, as in the standard formula for gradient descent.

This can be shown algebraically. Since *x̂* is equal to *x* − *E*[*x*], and *x* is equal to *u* + *b*, the output is *u* + *b* − *E*[*u* + *b*]. After the update, *b* becomes *b* + Δ*b*, and the output becomes *u* + (*b* + Δ*b*) − *E*[*u* + (*b* + Δ*b*)]. The two Δ*b* terms cancel each other out, so the output is equal to what it was before the update. Hence, *b* will grow indefinitely because of the faulty gradient while the loss remains fixed.
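The cancellation can also be checked with a few numbers. In this minimal sketch the input values, the bias, and the size of the update are arbitrary choices of mine, and `center` is a hypothetical helper that performs only the mean subtraction discussed here.

```python
def center(values):
    """Subtract the mean from every value: x_hat = x - E[x]."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

u = [0.5, -1.0, 2.0, 3.5]  # arbitrary layer inputs
b = 0.7                    # current bias
delta_b = 10.0             # a large update to the bias

before = center([ui + b for ui in u])           # u + b - E[u + b]
after = center([ui + b + delta_b for ui in u])  # same, with b + delta_b

print(before)
print(after)
```

Up to floating-point rounding, the two printed lists are the same: however large the update to *b* is, the centered output does not move, so the loss cannot improve.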

Tip:

Often, papers will set up various statements and suddenly combine them. How the authors arrive at a conclusion may be puzzling; try underlining the relevant equations and seeing how they fit together. Above all, make sure you understand what each equation means.

With these considerations, the authors slightly adjust their batch normalization formula to normalize each scalar feature independently, so that each has zero mean and unit variance. With the unnecessary bias removed, the layer rescales every input feature to that common zero-mean, unit-variance scale.
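A minimal sketch of per-feature normalization over a mini-batch is shown below. It assumes the batch is a list of equal-length rows, and it adds a small constant `eps` inside the square root for numerical stability. The function name is my own, and the learnable scale and shift that the full batch normalization layer applies afterwards are deliberately left out.

```python
import math

def batch_normalize(batch, eps=1e-5):
    """Normalize each feature (column) of a mini-batch to zero mean and unit variance."""
    m = len(batch)
    num_features = len(batch[0])
    normalized = [[0.0] * num_features for _ in range(m)]
    for k in range(num_features):
        column = [row[k] for row in batch]
        mean = sum(column) / m
        var = sum((v - mean) ** 2 for v in column) / m  # mini-batch variance
        for i in range(m):
            normalized[i][k] = (column[i] - mean) / math.sqrt(var + eps)
    return normalized

# Two features on very different scales end up on the same scale.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
for row in batch_normalize(batch):
    print(row)
```

Each column is handled on its own, which is what “normalize each scalar feature independently” means in practice: no feature's statistics influence another's.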

There’s plenty more within the Batch Normalization paper to be read and understood. Be aware, however, that one of the conclusions these authors come to has since been called into question: specifically, that reducing internal covariate shift is the reason why batch normalization works so well.
