Gradient Descent, the Learning Rate, and the importance of Feature Scaling


This post was originally published by Daniel Godoy at Towards Data Science

The content of this post is a partial reproduction of a chapter from the book “Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”.

Every time we train a deep learning model, or any neural network for that matter, we’re using gradient descent (with backpropagation). We use it to minimize a loss by updating the parameters/weights of the model.

The parameter update depends on two values: a gradient and a learning rate. The learning rate gives you control of how big (or small) the updates are going to be. A bigger learning rate means bigger updates and, hopefully, a model that learns faster.

But there is a catch, as always… if the learning rate is too big, the model will not learn anything. This leads us to two fundamental questions:

  • How big is “too big”?
  • Is there anything I can do to use a bigger learning rate?

Unfortunately, there is no clear-cut answer for the first question. It will always depend on many factors.

But there is an answer for the second one: feature scaling! How does that work? Well, that’s why I’ve written this post: to show you, in detail, how gradient descent, the learning rate, and feature scaling are connected.

In this post we will:

  • define a model and generate a synthetic dataset
  • randomly initialize the parameters
  • explore the loss surface and visualize the gradients
  • understand the effects of using different learning rates
  • understand the effects of feature scaling

Simple Linear Regression

In this model, we use a feature (x) to try to predict the value of a label (y). There are three elements in our model:

  • parameter b, the bias (or intercept), which tells us the expected average value of y when x is zero
  • parameter w, the weight (or slope), which tells us how much y increases, on average, if we increase x by one unit
  • and that last term (why does it always have to be a Greek letter?), epsilon, which is there to account for the inherent noise, that is, the error we cannot get rid of
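
Putting the three elements together, the model can be written as:

y = b + w * x + epsilon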

First, let’s generate our feature (x): we use NumPy’s rand method to randomly generate 100 (N) points between 0 and 1.

Then, we plug our feature (x) and our parameters b and w into our equation to compute our labels (y). But we need to add some Gaussian noise (epsilon) as well; otherwise, our synthetic dataset would be a perfectly straight line.

We can generate noise using NumPy’s randn method, which draws samples from a normal distribution (of mean 0 and variance 1), and then multiply it by a factor to adjust the level of noise. Since I don’t want to add too much noise, I picked 0.1 as my factor.

Synthetic Dataset
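
A minimal sketch of this data generation (the true values b = 1 and w = 2, and the seed, are assumptions consistent with the numbers that appear later in this post):

```python
import numpy as np

true_b = 1
true_w = 2
N = 100

# Data generation
np.random.seed(42)
x = np.random.rand(N, 1)                # feature, in the range [0, 1)
epsilon = 0.1 * np.random.randn(N, 1)   # Gaussian noise, scaled by the 0.1 factor
y = true_b + true_w * x + epsilon       # labels
```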

Train-Validation-Test Split

  1. The split should always be the first thing you do — no preprocessing, no transformations; nothing happens before the split — that’s why we do this immediately after the synthetic data generation
  2. In this post we will use only the training set — so I did not bother to create a test set, but I performed a split nonetheless to highlight point #1 🙂

Train-validation split
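
A sketch of how such a split could be performed (the 80/20 ratio is an assumption):

```python
# Shuffles the indices before splitting
idx = np.arange(N)
np.random.shuffle(idx)

# Uses the first 80% of the shuffled indices for training, the rest for validation
split = int(N * 0.8)
train_idx, val_idx = idx[:split], idx[split:]

x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
```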

Synthetic data

OK, given that we’ll never know the true values of the parameters, we need to set initial values for them. How do we choose them? It turns out that a random guess is as good as any other.

So, we can randomly initialize the parameters/weights (we have only two, b and w).

Random start point
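
A minimal sketch of the random initialization (seed 42 is an assumption, but it reproduces the values below):

```python
# Random initialization of the two parameters
np.random.seed(42)
b = np.random.randn(1)   # ~ 0.4967
w = np.random.randn(1)   # ~ -0.1383
```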

Our randomly initialized parameters are: b = 0.49 and w = -0.13. Are these parameters any good?

Obviously not… but, exactly how bad are they? That’s what the loss is for. Our goal will be to minimize it.

Making predictions and computing the loss
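
A sketch of this step, assuming the usual mean squared error (MSE) as the loss:

```python
# Computes the model's predictions for the training set
yhat = b + w * x_train

# Computes the mean squared error (MSE) loss
error = yhat - y_train
loss = (error ** 2).mean()
```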

We have just computed the loss (2.74) corresponding to our randomly initialized parameters (b = 0.49 and w = -0.13). Now, what if we did the same for ALL possible values of b and w? Well, not all possible values, but all combinations of evenly spaced values in a given range?

We could vary b between -2 and 4, while varying w between -1 and 5, for instance, each range containing 101 evenly spaced points. If we compute the losses corresponding to each different combination of the parameters b and w inside these ranges, the result would be a grid of losses, a matrix of shape (101, 101).
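
One way to build such a grid of losses, reusing the training set from the previous snippets:

```python
# Evenly spaced values for each parameter
b_range = np.linspace(-2, 4, 101)
w_range = np.linspace(-1, 5, 101)

# Every combination of b and w
bs, ws = np.meshgrid(b_range, w_range)

# Predictions and losses for all combinations at once, via broadcasting
all_predictions = bs.reshape(1, -1) + ws.reshape(1, -1) * x_train.reshape(-1, 1)
all_errors = all_predictions - y_train.reshape(-1, 1)
all_losses = (all_errors ** 2).mean(axis=0).reshape(bs.shape)  # shape (101, 101)
```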

These losses are our loss surface, which can be visualized in a 3D plot, where the vertical axis (z) represents the loss values. If we connect the combinations of b and w that yield the same loss value, we’ll get an ellipse. Then, we can draw this ellipse in the original b x w plane (in blue, for a loss value of 3). This is, in a nutshell, what a contour plot does. From now on, we’ll always use the contour plot, instead of the corresponding 3D version.

The plots below show us the loss surface for the suggested ranges of parameters, using our training set to compute the loss for each combination of b and w.

Loss surface

In the center of the plot, where parameters (b, w) have values close to (1, 2), the loss is at its minimum value. This is the point we’re trying to reach using gradient descent.

In the bottom, slightly to the left, there is the random start point, corresponding to our randomly initialized parameters (b = 0.49 and w = -0.13).

This is one of the nice things about tackling a simple problem like a linear regression with a single feature: we have only two parameters, and thus we can compute and visualize the loss surface.

Cross-Sections

Let’s start by making b = 0.52 (the value from our evenly spaced range that is closest to our initial random value for b, 0.4967) — we cut a cross-section vertically (the red dashed line) on our loss surface (left plot), and we get the resulting plot on the right:

Vertical cross-section — parameter b is fixed

What does this cross-section tell us? It tells us that, if we keep b constant (at 0.52), the loss, seen from the perspective of parameter w, can be minimized if w gets increased (up to some value between 2 and 3).

Sure, different values of b produce different cross-section loss curves for w. And those curves will depend on the shape of the loss surface (more on that later, in the “Learning Rate” section).

OK, so far, so good… what about the other cross-section? Let’s cut it horizontally now, making w = -0.16 (the value from our evenly spaced range that is closest to our initial random value for w, -0.1382). The resulting plot is on the right:

Horizontal cross-section — parameter w is fixed

Now, if we keep w constant (at -0.16), the loss, seen from the perspective of parameter b, can be minimized if b gets increased (up to some value close to 2).

In general, the purpose of a cross-section like this is to show the effect on the loss of changing a single parameter, while keeping everything else constant. This is, in a nutshell, a gradient 🙂

What effect do these increases have on the loss? Let’s check it out:

Computing (approximate) gradients, geometrically

On the left plot, increasing w by 0.12 yields a loss reduction of 0.21. The geometrically computed and roughly approximate gradient is given by the ratio between the two values: -1.79. How does this result compare to the actual value of the gradient (-1.83)? It is actually not bad for a crude approximation… Could it be better? Sure, if we make the increase in w smaller and smaller (like 0.01, instead of 0.12), we’ll get better and better approximations… in the limit, as the increase approaches zero, we’ll arrive at the precise value of the gradient. Well, that’s the definition of a derivative!

The same reasoning goes for the plot on the right: increasing b by the same 0.12 yields a bigger loss reduction of 0.35. Bigger loss reduction, bigger ratio, bigger gradient — and bigger error, too, since the geometric approximation (-2.90) is farther away from the actual value (-3.04).
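
For reference, the actual gradients mentioned above come from differentiating the MSE loss with respect to each parameter; a minimal sketch, using the variables from the previous snippets:

```python
# Gradients of the MSE loss with respect to b and w
b_grad = 2 * error.mean()               # d(MSE)/db
w_grad = 2 * (x_train * error).mean()   # d(MSE)/dw
```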

There is still another (hyper-)parameter to consider: the learning rate, denoted by the Greek letter eta (that looks like the letter n), which is the multiplicative factor that we need to apply to the gradient for the parameter update.

Updating parameters b and w
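
In code, the update is a one-liner per parameter (the learning rate value here is just a placeholder):

```python
lr = 0.2  # learning rate (eta)

# Gradient descent parameter update
b = b - lr * b_grad
w = w - lr * w_grad
```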

We can also interpret this a bit differently: each parameter is going to have its value updated by a constant value eta (the learning rate), but this constant is going to be weighted by how much that parameter contributes to minimizing the loss (its gradient).

Honestly, I believe this way of thinking about the parameter update makes more sense: first, you decide on a learning rate that specifies your step size, while the gradients tell you the relative impact (on the loss) of taking a step for each parameter. Then you take a given number of steps that’s proportional to that relative impact: more impact, more steps.

“How to choose a learning rate?”

Unfortunately, that is a topic on its own and beyond the scope of this post.

Learning Rate

Maybe you’ve seen the famous graph below (from Stanford’s CS231n class) that shows how a learning rate that is too big or too small affects the loss during training.

Source

Most people will see it (or have seen it) at some point in time. This is pretty much general knowledge, but I think it needs to be thoroughly explained and visually demonstrated to be truly understood. So, let’s start!

I will tell you a little story (trying to build an analogy here, please bear with me!): imagine you are coming back from hiking in the mountains and you want to get back home as quickly as possible. At some point in your path, you can either choose to go ahead or to make a right turn.

The path ahead is almost flat, while the path to your right is kinda steep. The steepness is the gradient. If you take a single step one way or the other, it will lead to different outcomes (you’ll descend more if you take one step to the right instead of going ahead).

But, here is the thing: you know that the path to your right is getting you home faster, so you don’t take just one step, but multiple steps in that direction: the steeper the path, the more steps you take! Remember, “more impact, more steps”! You just cannot resist the urge to take that many steps; your behavior seems to be completely determined by the landscape. This analogy is getting weird, I know…

But, you still have one choice: you can adjust the size of your step. You can choose to take steps of any size, from tiny steps to long strides. That’s your learning rate.

OK, let’s see where this little story brought us so far… that’s how you’ll move, in a nutshell:

updated location = previous location + step size * number of steps

Now, compare it to what we did with the parameters:

updated value = previous value – learning rate * gradient

You got the point, right? I hope so because the analogy completely falls apart now… at this point, after moving in one direction (say, the right turn we talked about), you’d have to stop and move in the other direction (for just a fraction of a step, because the path was almost flat, remember?). And so on and so forth… Well, I don’t think anyone has ever returned from hiking in such an orthogonal zigzag path…

Anyway, let’s explore further the only choice you have: the size of your step, I mean, the learning rate.

Small Learning Rate

How does this reasoning apply to our model? From computing our (geometric) gradients, we know we need to take a given number of steps: 1.79 (parameter w) and 2.90 (parameter b), respectively. Let’s set our step size to 0.2 (small-ish). It means we move 0.36 for w and 0.58 for b.

IMPORTANT: in real life, a learning rate of 0.2 is usually considered BIG — but in our very simple linear regression example, it still qualifies as small-ish.

Where does this movement lead us? As you can see in the plots below (as shown by the new dots to the right of the original ones), in both cases, the movement took us closer to the minimum — more so on the right because the curve is steeper.

Using a small-ish learning rate

Big Learning Rate

Using a BIG learning rate

Even though everything is still OK on the left plot, the right plot shows us a completely different picture: we ended up on the other side of the curve. That is not good… you’d be going back and forth, alternately hitting both sides of the curve.

“Well, even so, I may still reach the minimum, why is it so bad?”

In our simple example, yes, you’d eventually reach the minimum because the curve is nice and round.

But, in real problems, the “curve” has some really weird shape that allows for bizarre outcomes, such as going back and forth without ever approaching the minimum.

In our analogy, you moved so fast that you fell down and hit the other side of the valley, then kept going down like a ping-pong. Hard to believe, I know, but you definitely don’t want that…

Very Big Learning Rate

Using a REALLY BIG learning rate

Ok, that is bad… on the right plot, not only did we end up on the other side of the curve again, but we actually climbed up. This means our loss increased, instead of decreasing! How is that even possible? You’re moving so fast downhill that you end up climbing it back up?! Unfortunately, the analogy cannot help us anymore. We need to think about this particular case in a different way…

First, notice that everything is fine on the left plot. The enormous learning rate did not cause any issues because the left curve is less steep than the one on the right. In other words, the curve on the left can take bigger learning rates than the curve on the right.

What can we learn from it?

Too big, for a learning rate, is a relative concept: it depends on how steep the curve is or, in other words, it depends on how big the gradient is.

We do have many curves, many gradients: one for each parameter. But we only have one single learning rate to choose (sorry, that’s the way it is!).

It means that the size of the learning rate is limited by the steepest curve. All other curves must follow suit, meaning, they’d be using a sub-optimal learning rate, given their shapes.

The reasonable conclusion is: it is best if all the curves are equally steep, so the learning rate is closer to optimal for all of them!

“Bad” Feature

  • I multiplied our feature (x) by 10, so it is in the range [0, 10] now, and renamed it bad_x
  • but since I do not want the labels (y) to change, I divided the original true_w parameter by 10 and renamed it bad_w — this way, both bad_w * bad_x and true_w * x yield the same results (as sketched below)

Generating “bad” dataset
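
A sketch of the two bullet points above, reusing the original feature x and parameters:

```python
# Multiplies the feature by 10, so its range becomes [0, 10)
bad_x = x * 10

# Divides the true weight by 10, so bad_w * bad_x == true_w * x and y stays the same
bad_w = true_w / 10
```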

Then I performed the same split as before for both the original and the bad datasets, and plotted the training sets side by side, as you can see below:

Train-validation split for the “bad” dataset

Same data, different scales for feature x

The only difference between the two plots is the scale of feature x. Its range was [0, 1], now it is [0, 10]. The label y hasn’t changed, and I did not touch true_b.

Does this simple scaling have any meaningful impact on our gradient descent? Well, if it didn’t, I wouldn’t be asking, right? Let’s compute a new loss surface and compare it to the one we had before:

Loss surface — before and after scaling feature x

Look at the contour values of the plot above: the dark blue line was 3.0, and now it is 50.0! For the same range of parameter values, loss values are much bigger.

Let’s look at the cross-sections before and after we multiplied feature x by 10:

Comparing cross-sections: before and after

What happened here? The red curve got much steeper (bigger gradient), and thus we must use a smaller learning rate to safely descend along it.

More importantly, the difference in steepness between the red and the black curves increased.

This is exactly what WE NEED TO AVOID!

Do you remember why?

Because the size of the learning rate is limited by the steepest curve!

How can we fix it? Well, we ruined it by scaling it 10x bigger… perhaps we can make it better if we scale it in a different way.

Scaling / Standardizing / Normalizing

How do we scale the feature so that the curves end up similarly steep? Standardization, as implemented by Scikit-Learn’s StandardScaler, achieves exactly that. First, it computes the mean and the standard deviation of a given feature (x) using the training set (N points):

Mean and Standard Deviation, as computed in the StandardScaler
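
In plain form (note that the biased standard deviation, dividing by N, is used):

mean = (1 / N) * sum(x_i)
std = sqrt((1 / N) * sum((x_i - mean)^2))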

Then it uses both values to scale the feature:

Standardization
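
In plain form:

scaled_x = (x - mean) / std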

If we were to recompute the mean and the standard deviation of the scaled feature, we would get 0 and 1, respectively. This preprocessing step is commonly referred to as normalization, although, technically, it should always be referred to as standardization.
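
A minimal sketch using Scikit-Learn’s StandardScaler, fitting it on the training set only (to respect point #1 from the split section) and then applying it to both sets:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fits the scaler using the TRAINING set only
scaler.fit(x_train)

# Applies the same transformation to training and validation sets
scaled_x_train = scaler.transform(x_train)
scaled_x_val = scaler.transform(x_val)
```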
