*This post was originally published by Daniel Godoy at Towards Data Science*

The content of this post is a partial reproduction of a chapter from the book:

“Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”.

What do **gradient descent**, the **learning rate**, and **feature scaling** have in common? Let’s see…

Every time we train a deep learning model, or any neural network for that matter, we’re using **gradient descent** (with backpropagation). We use it to *minimize a loss* by *updating the parameters/weights* of the model.

The parameter update depends on two values: a gradient and a **learning rate**. The learning rate gives you control of how big (or small) the updates are going to be. A **bigger learning rate** means **bigger updates** and, hopefully, a model that **learns faster**.

But there is a catch, as always… if the learning rate is **too big**, the model will **not learn anything**. This leads us to *two fundamental questions*:

- **How big is “too big”?**
- **Is there anything I can do to use a bigger learning rate?**

Unfortunately, there is no clear-cut answer for the first question. It will always depend on many factors.

But **there is** an answer for the **second** one: **feature scaling**! How does that work? Well, that’s why I’ve written this post: to show you, in detail, how gradient descent, the learning rate, and feature scaling are connected.

In this post we will:

- define a **model** and generate a **synthetic dataset**
- randomly **initialize** the parameters
- explore the **loss surface** and **visualize the gradients**
- understand the **effects** of using different **learning rates**
- understand the **effects** of **feature scaling**

The model must be **simple** and **familiar**, so you can focus on the **inner workings** of gradient descent. So, I will stick with a model as simple as it can be: a **linear regression with a single feature x**!

*y = b + w x + epsilon*

Simple Linear Regression

In this model, we use a **feature** (*x*) to try to predict the value of a **label** (*y*). There are three elements in our model:

- parameter *b*, the *bias* (or *intercept*), which tells us the expected average value of *y* when *x* is zero
- parameter *w*, the *weight* (or *slope*), which tells us how much *y* increases, on average, if we increase *x* by one unit
- and that **last term** (why does it *always* have to be a Greek letter?), *epsilon*, which is there to account for the inherent **noise**, that is, the **error** we cannot get rid of

We know our model already. In order to generate **synthetic data** for it, we need to pick values for its **parameters**. I chose **b = 1** and **w = 2**.

First, let’s generate our **feature** (*x*): we use NumPy’s **rand** method to randomly generate 100 (*N*) points between 0 and 1.

Then, we plug our **feature** (*x*) and our **parameters b and w** into our **equation** to compute our **labels** (*y*). But we need to add some **Gaussian noise** (*epsilon*) as well; otherwise, our synthetic dataset would be a perfectly straight line.

We can generate noise using NumPy’s **randn** method, which draws samples from a normal distribution (of mean 0 and variance 1), and then multiplying it by a **factor** to adjust for the **level of noise**. Since I don’t want to add too much noise, I picked 0.1 as my factor.

Synthetic Dataset
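The data generation described above can be sketched in a few lines of NumPy. The variable names and the random seed are assumptions made for this sketch (the seed is there only so the numbers are reproducible); the parameter values, the number of points, and the noise factor come from the text.

```python
import numpy as np

# Values stated in the text
true_b = 1
true_w = 2
N = 100

# Seed chosen only for reproducibility (an assumption of this sketch)
np.random.seed(42)

# Feature: N points drawn uniformly from [0, 1)
x = np.random.rand(N, 1)

# Gaussian noise (mean 0, variance 1) scaled down by a factor of 0.1
epsilon = 0.1 * np.random.randn(N, 1)

# Labels: plug feature and parameters into the model equation
y = true_b + true_w * x + epsilon
```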

### Train-Validation-Test Split

It is beyond the scope of this post to explain the reasoning behind the **train-validation-test split**, but there are two points I’d like to make:

1. The split should **always** be the **first thing** you do: no preprocessing, no transformations; **nothing happens before the split**. That’s why we do this **immediately after the synthetic data generation**.
2. In this post we will use **only the training set**, so I did not bother to create a **test set**, but I performed a split nonetheless to **highlight point #1** 🙂

Train-validation split
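A minimal sketch of such a split, using shuffled indices. The 80/20 proportion, the seed, and the variable names are assumptions of this sketch; the text does not state them.

```python
import numpy as np

np.random.seed(42)  # assumed seed, for reproducibility only
N = 100
x = np.random.rand(N, 1)
y = 1 + 2 * x + 0.1 * np.random.randn(N, 1)

# Shuffle the indices so the split is random
idx = np.arange(N)
np.random.shuffle(idx)

# Assumed proportion: first 80% for training, the rest for validation
n_train = int(N * 0.8)
train_idx, val_idx = idx[:n_train], idx[n_train:]

x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
```

Nothing is computed from the data before this point, which is the whole idea of splitting first.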

Synthetic data

In our example, we already **know** the **true values** of the parameters, but this will obviously never happen in real life: if we *knew* the true values, why even bother to train a model to find them?!

OK, given that **we’ll never know** the **true values** of the parameters, we need to set **initial values** for them. How do we choose them? It turns out that a **random guess** is as good as any other.

So, we can **randomly initialize** the **parameters/weights** (we have only two, *b* and *w*).

Random start point
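One way to draw such a random starting point is from a standard normal distribution. The seed below is an assumption of this sketch; with NumPy’s legacy global generator it produces values that round to the ones used throughout this post.

```python
import numpy as np

# Assumed seed; with NumPy's legacy global generator it yields
# values that round to the starting point quoted in this post
np.random.seed(42)

b = np.random.randn(1)  # approximately 0.4967
w = np.random.randn(1)  # approximately -0.1383

print(b, w)
```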

Our randomly initialized parameters are: **b = 0.49** and **w = -0.13**. Are these parameters any good?

Obviously not… but, exactly *how bad* are they? That’s what the **loss** is for. Our goal will be to **minimize** it.

After choosing a random starting point for our parameters, we use them to *make predictions*, *compute the corresponding errors*, and *aggregate these errors* into a **loss**. Since this is a linear regression, we’re using **Mean Squared Error (MSE)** as our loss. The code below performs these steps:

Making predictions and computing the loss
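The three steps (predict, compute errors, aggregate into MSE) can be sketched as follows. The training data here is a stand-in generated inside the snippet under an assumed seed, so the resulting loss will not match the 2.74 quoted in the text exactly.

```python
import numpy as np

# Stand-in training set (assumed seed and size; not the post's exact split)
np.random.seed(42)
x_train = np.random.rand(80, 1)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(80, 1)

# Randomly initialized parameters, rounded as in the text
b, w = 0.49, -0.13

# Step 1: make predictions with the current parameters
yhat = b + w * x_train

# Step 2: compute the error of each point
error = yhat - y_train

# Step 3: aggregate the squared errors into a single number (MSE)
loss = (error ** 2).mean()
print(loss)
```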

We have just computed the **loss** (2.74) corresponding to our **randomly initialized parameters** (*b* = 0.49 and *w* = -0.13). Now, what if we did the same for **ALL** possible values of *b* and *w*? Well, not *all* possible values, but *all combinations of evenly spaced values in a given range*?

We could vary *b* between **-2** and **4**, while varying *w* between **-1** and **5**, for instance, each range containing 101 evenly spaced points. If we compute the **losses** corresponding to **each different combination of the parameters** *b* and *w* inside these ranges, the result would be a **grid of losses**, a matrix of shape (101, 101).

These losses are our **loss surface**, which can be visualized in a 3D plot, where the vertical axis (*z*) represents the loss values. If we **connect** the combinations of *b* and *w* that yield the **same loss value**, we’ll get an **ellipse**. Then, we can draw this ellipse in the original *b x w* plane (in blue, for a loss value of 3). This is, in a nutshell, what a **contour plot** does. From now on, we’ll always use the contour plot, instead of the corresponding 3D version.

The plots below show us the loss surface for the suggested **ranges of parameters**, using our training set to compute the loss for each combination of *b* and *w*.
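A sketch of how the grid of losses could be computed with NumPy broadcasting. The ranges and the number of points come from the text; the stand-in training data, the seed, and the variable names are assumptions of this sketch.

```python
import numpy as np

# Stand-in training set (assumed seed and size)
np.random.seed(42)
x_train = np.random.rand(80, 1)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(80, 1)

# 101 evenly spaced values for each parameter, as in the text
b_range = np.linspace(-2, 4, 101)
w_range = np.linspace(-1, 5, 101)

# All combinations of b and w; both grids have shape (101, 101)
bs, ws = np.meshgrid(b_range, w_range)

# Predictions for every data point and every combination: (80, 101, 101)
all_predictions = bs[np.newaxis, :, :] + ws[np.newaxis, :, :] * x_train.reshape(-1, 1, 1)
all_errors = all_predictions - y_train.reshape(-1, 1, 1)

# Average the squared errors over the data points: (101, 101)
all_losses = (all_errors ** 2).mean(axis=0)

# Grid point with the lowest loss
i, j = np.unravel_index(all_losses.argmin(), all_losses.shape)
best_b, best_w = bs[i, j], ws[i, j]
print(all_losses.shape, best_b, best_w)
```

The `all_losses` matrix is what a contour plot (for instance, Matplotlib’s `contour`) would take as its height values.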

Loss surface

In the center of the plot, where parameters (*b, w*) have values close to (1, 2), the loss is at its **minimum** value. This is the point we’re trying to reach using gradient descent.

In the bottom, slightly to the left, there is the **random start** point, corresponding to our randomly initialized parameters (*b* = 0.49 and *w* = -0.13).

This is one of the nice things about tackling a simple problem like a linear regression with a single feature: we have only **two parameters**, and thus **we can compute and visualize the loss surface**.

### Cross-Sections

Another nice thing is that we can cut a **cross-section** in the loss surface to check what the **loss** looks like **if the other parameter were held constant**.

Let’s start by making **b = 0.52** (the value from our evenly spaced range that is closest to our initial random value for *b*, 0.4967): we cut a cross-section *vertically* (the red dashed line) on our loss surface (left plot), and we get the resulting plot on the right:

Vertical cross-section — parameter **b** is fixed

What does this cross-section tell us? It tells us that, **if we keep b constant** (at 0.52), the **loss**, seen from the **perspective of parameter w**, can be minimized if **w gets increased** (up to some value between 2 and 3).

Sure, **different values of b** produce **different cross-section loss curves for w**. And those curves will depend on the *shape of the loss surface* (more on that later, in the “**Learning Rate**” section).

OK, so far, so good… what about the *other* cross-section? Let’s cut it horizontally now, making **w = -0.16** (the value from our evenly spaced range that is closest to our initial random value for *w*, -0.1382). The resulting plot is on the right:

Horizontal cross-section — parameter **w** is fixed

Now, **if we keep w constant** (at -0.16), the **loss**, seen from the **perspective of parameter b**, can be minimized **if b gets increased** (up to some value close to 2).

In general, the purpose of this cross-section is to get **the effect on the loss** of **changing a single parameter**, while keeping **everything else constant**. This is, in a nutshell, a **gradient** 🙂
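Both cross-sections can be sketched by fixing one parameter at its closest grid value and sweeping the other. The stand-in data, the seed, and the helper name `mse` are assumptions of this sketch.

```python
import numpy as np

# Stand-in training set (assumed seed and size)
np.random.seed(42)
x_train = np.random.rand(80, 1)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(80, 1)

b_range = np.linspace(-2, 4, 101)
w_range = np.linspace(-1, 5, 101)

def mse(b, w):
    """Mean squared error of the model b + w * x on the training set."""
    return ((b + w * x_train - y_train) ** 2).mean()

# Grid values closest to the random start (0.52 and -0.16 in the text)
b_fixed = b_range[np.abs(b_range - 0.4967).argmin()]
w_fixed = w_range[np.abs(w_range - (-0.1382)).argmin()]

# Vertical cross-section: b is fixed, loss as a function of w
loss_vs_w = np.array([mse(b_fixed, w) for w in w_range])

# Horizontal cross-section: w is fixed, loss as a function of b
loss_vs_b = np.array([mse(b, w_fixed) for b in b_range])

# Where each one-dimensional curve bottoms out
w_at_min = w_range[loss_vs_w.argmin()]
b_at_min = b_range[loss_vs_b.argmin()]
print(b_fixed, w_fixed, w_at_min, b_at_min)
```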

From the previous section, we already know that to *minimize the loss*, both *b* and *w* need to be **increased**. So, keeping the spirit of using gradients, let’s **increase each parameter a little bit** (always keeping the other one fixed!). By the way, in this example, a *little bit* equals 0.12 (for convenience’s sake, so it results in a nicer plot).

What effect do these increases have on the loss? Let’s check it out:

Computing (approximate) gradients, geometrically

On the left plot, **increasing w by 0.12** yields a **loss reduction of 0.21**. The geometrically computed and roughly approximate gradient is given by the ratio between the two values: **-1.79**. How does this result compare to the *actual* value of the gradient (-1.83)? It is actually not bad for a crude approximation… Could it be better? Sure, **if we make the increase in w smaller and smaller** (like 0.01, instead of 0.12), we’ll get **better and better** approximations… in the limit, as the **increase approaches zero**, we’ll arrive at the **precise value of the gradient**. Well, that’s the definition of a derivative!

The same reasoning goes for the plot on the right: **increasing b by the same 0.12** yields a **bigger loss reduction of 0.35**. Bigger loss reduction, bigger ratio, bigger gradient, and bigger error, too, since the geometric approximation (-2.90) is farther away from the actual value (-3.04).
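The “increase a little bit and take the ratio” idea can be compared against the exact gradients in code. The exact expressions below are the standard partial derivatives of the MSE for this model; the stand-in data, the seed, and the function names are assumptions of this sketch, so the printed numbers will only be close to the ones quoted above.

```python
import numpy as np

# Stand-in training set (assumed seed and size)
np.random.seed(42)
x_train = np.random.rand(80, 1)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(80, 1)

def mse(b, w):
    return ((b + w * x_train - y_train) ** 2).mean()

def exact_gradients(b, w):
    """Partial derivatives of the MSE with respect to b and w."""
    error = b + w * x_train - y_train
    return 2 * error.mean(), 2 * (x_train * error).mean()

def approximate_gradients(b, w, delta):
    """Change in loss divided by the increase, one parameter at a time."""
    base = mse(b, w)
    grad_b = (mse(b + delta, w) - base) / delta
    grad_w = (mse(b, w + delta) - base) / delta
    return grad_b, grad_w

b, w = 0.52, -0.16
exact = exact_gradients(b, w)
crude = approximate_gradients(b, w, 0.12)  # the "little bit" used in the text
fine = approximate_gradients(b, w, 0.01)   # a smaller increase

print("exact:", exact)
print("delta=0.12:", crude)
print("delta=0.01:", fine)
```

Shrinking the increase brings both approximations closer to the exact values, which is the limit argument made above.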

Finally, we **use the gradients to update** the parameters. Since we are trying to **minimize our losses**, we **reverse the sign** of the gradient for the update.

There is still another (hyper-)parameter to consider: the **learning rate**, denoted by the *Greek letter* **eta** (that looks like the letter **n**), which is the **multiplicative factor** that we need to apply to the gradient for the parameter update.

Updating parameters **b** and **w**
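In code, a single update is one line per parameter. The gradient values below are the ones quoted earlier in the post; the learning rate of 0.1 is a hypothetical value picked only for this illustration.

```python
# Current parameters and gradients, as quoted in the text
b, w = 0.49, -0.13
b_grad, w_grad = -3.04, -1.83

# Hypothetical learning rate (eta), chosen only for illustration
lr = 0.1

# Reverse the sign of the gradient, scale it by the learning rate
b_new = b - lr * b_grad
w_new = w - lr * w_grad

print(b_new, w_new)
```

Since both gradients are negative, both parameters increase, which matches what the cross-sections suggested.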

We can also interpret this a bit differently: **each parameter** is going to have its value **updated by a constant value eta** (the learning rate), but this constant is going to be **weighted by how much that parameter contributes to minimizing the loss** (its gradient).

Honestly, I believe this way of thinking about the parameter update makes more sense: first, you decide on a *learning rate* that specifies your **step size**, while the *gradients* tell you the **relative impact** (on the loss) of taking a step for each parameter. Then you take a given **number of steps** that’s **proportional** to that **relative impact: more impact, more steps**.

“How to **choose** a learning rate?”

Unfortunately, that is a topic on its own and beyond the scope of this post.

### Learning Rate

The **learning rate** is the most important hyper-parameter: there is a gigantic amount of material on how to choose a learning rate, how to modify the learning rate during the training, and how the wrong learning rate can completely ruin the model training.

Maybe you’ve seen this famous graph below (from Stanford’s CS231n class) that shows how a learning rate that is **too big** or **too small** affects the **loss** during training.

Most people will see it (or have seen it) at some point in time. This is pretty much general knowledge, but I think it needs to be **thoroughly explained and visually demonstrated** to be *truly* understood. So, let’s start!

I will tell you a little story (trying to build an analogy here, please bear with me!): imagine you are coming back from hiking in the mountains and you want to get back home as quickly as possible. At some point in your path, you can either choose to *go ahead* or to *make a right turn*.

The path ahead is almost flat, while the path to your *right* is kinda steep. The **steepness** is the **gradient**. If you take a single step one way or the other, it will lead to different outcomes (you’ll descend more if you take one step to the right instead of going ahead).

But, here is the thing: you know that the path to your *right* is getting you home **faster**, so you don’t take just one step, but **multiple steps** in that direction: **the steeper the path, the more steps you take**! Remember, “*more impact, more steps*”! You just cannot resist the urge to take that many steps; your behavior seems to be completely determined by the landscape. This analogy is getting weird, I know…

But, you still have **one choice**: you **can adjust the size of your step**. You can choose to take steps of any size, from tiny steps to long strides. That’s your **learning rate**.

OK, let’s see where this little story brought us so far… that’s how you’ll move, in a nutshell:

**updated location = previous location + step size * number of steps**

Now, compare it to what we did with the parameters:

**updated value = previous value – learning rate * gradient**

You got the point, right? I hope so because the analogy completely falls apart now… at this point, after moving in one direction (say, the *right turn* we talked about), you’d have to stop and move in the other direction (for just a fraction of a step, because the path was almost *flat*, remember?). And so on and so forth… Well, I don’t think anyone has ever returned from hiking in such an orthogonal zigzag path…

Anyway, let’s explore further the **only choice** you have: the size of your step, I mean, the **learning rate**.

### Small Learning Rate

It makes sense to start with *baby steps*, right? This means using a **small learning rate**. Small learning rates are **safe(r)**, as expected. If you were to take tiny steps while returning home from your hiking, you’d be more likely to arrive there safe and sound — but it would take a **lot of time**. The same holds true for training models: small learning rates will likely get you to (some) minimum point, **eventually**. Unfortunately, time is money, especially when you’re paying for GPU time in the cloud… so, there is an *incentive* to try **bigger learning rates**.

How does this reasoning apply to our model? From computing our (geometric) gradients, we know we need to take a **given number of steps**: **1.79** (parameter *w*) and **2.90** (parameter *b*), respectively. Let’s set our **step size to 0.2** (small-ish). It means we **move 0.36 for w** and **0.58 for b**.

IMPORTANT: in real life, a learning rate of 0.2 is usually considered BIG — but in our very simple linear regression example, it still qualifies as small-ish.

Where does this movement lead us? As you can see in the plots below (as shown by the **new dots** to the right of the original ones), in both cases, the movement took us closer to the minimum — more so on the right because the curve is **steeper**.

Using a small-ish learning rate

### Big Learning Rate

What would have happened if we had used a **big** learning rate instead, say, a **step size of 0.8**? As we can see in the plots below, we start to, literally, **run into trouble**…

Using a BIG learning rate

Even though everything is still OK on the left plot, the right plot shows us a completely different picture: **we ended up on the other side of the curve**. That is **not** good… you’d be going **back and forth**, alternately hitting both sides of the curve.

“Well, even so, I may **still** reach the minimum, why is it so bad?”

In our simple example, yes, you’d eventually reach the minimum because the **curve is nice and round**.

But, in real problems, the “curve” has some really **weird shape** that allows for **bizarre outcomes**, such as going back and forth **without ever approaching the minimum**.

In our analogy, you **moved so fast** that you **fell down** and hit the **other side of the valley**, then kept going down like a **ping-pong** ball. Hard to believe, I know, but you definitely don’t want that…

### Very Big Learning Rate

Wait, it may get **worse** than that… let’s use a **really big learning rate**, say, a **step size of 1.1**!

Using a REALLY BIG learning rate

OK, that **is** bad… on the right plot, not only did we end up on the *other side of the curve* again, but we actually **climbed up**. This means **our loss increased**, instead of decreasing! How is that even possible? *You’re moving so fast downhill that you end up climbing it back up*?! Unfortunately, the analogy cannot help us anymore. We need to think about this particular case in a different way…

First, notice that everything is *fine* on the left plot. The *enormous learning rate* **did not cause any issues** because the left curve is **less steep** than the one on the right. In other words, the curve on the left **can take bigger learning rates** than the curve on the right.

What can we learn from it?

**Too big**, for a **learning rate**, is a relative concept: it depends on **how steep** the curve is or, in other words, it depends on **how big the gradient is**.

We do have many curves, **many gradients**: one for each parameter. But we only have **one single learning rate** to choose (sorry, that’s the way it is!).

It means that the **size of the learning rate is limited by the steepest curve**. All other curves must follow suit, meaning, they’d be using a sub-optimal learning rate, given their shapes.

The reasonable conclusion is: it is **best** if all the **curves are equally steep**, so the **learning rate** is closer to optimal for all of them!
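The three scenarios can be replayed numerically by taking one step along each cross-section and checking what happens to the loss. This is a sketch under assumptions: it uses stand-in data with an assumed seed and the exact gradients rather than the geometric approximations, so the individual numbers differ slightly from the plots.

```python
import numpy as np

# Stand-in training set (assumed seed and size)
np.random.seed(42)
x_train = np.random.rand(80, 1)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(80, 1)

def mse(b, w):
    return ((b + w * x_train - y_train) ** 2).mean()

def gradients(b, w):
    error = b + w * x_train - y_train
    return 2 * error.mean(), 2 * (x_train * error).mean()

# Starting point used for the cross-sections in the text
b0, w0 = 0.52, -0.16
grad_b, grad_w = gradients(b0, w0)
start_loss = mse(b0, w0)

results = {}
for lr in (0.2, 0.8, 1.1):
    # Move one parameter at a time, keeping the other one fixed
    loss_after_b_step = mse(b0 - lr * grad_b, w0)
    loss_after_w_step = mse(b0, w0 - lr * grad_w)
    results[lr] = (loss_after_b_step, loss_after_w_step)
    print(f"lr={lr}: start={start_loss:.3f}, "
          f"after b step={loss_after_b_step:.3f}, "
          f"after w step={loss_after_w_step:.3f}")
```

With the largest learning rate, the step along *b* (the steeper curve) ends with a higher loss than it started with, while the step along *w* still reduces the loss.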

### “Bad” Feature

How do we achieve *equally* *steep* curves? I’m on it! First, let’s take a look at a *slightly* modified example, which I am calling the “bad” dataset:

- I **multiplied our feature (x) by 10**, so it is in the range [0, 10] now, and renamed it *bad_x*
- but since I **do not want the labels (y) to change**, I **divided the original true_w parameter by 10** and renamed it *bad_w*; this way, both *bad_w \* bad_x* and *w \* x* yield the same results

Generating “bad” dataset
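A sketch of how the “bad” dataset could be built from the original one. The names *bad_x*, *bad_w*, *true_w*, and *true_b* come from the text; the seed and the name `bad_y` are assumptions of this sketch.

```python
import numpy as np

true_b, true_w, N = 1, 2, 100

np.random.seed(42)  # assumed seed, for reproducibility only
x = np.random.rand(N, 1)
epsilon = 0.1 * np.random.randn(N, 1)
y = true_b + true_w * x + epsilon

# Scale the feature up by 10 and the weight down by 10
bad_x = x * 10
bad_w = true_w / 10

# Same bias, same noise: the labels come out unchanged
bad_y = true_b + bad_w * bad_x + epsilon

print(np.allclose(y, bad_y))
```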

Then I performed the same split as before for both the *original* and the *bad* datasets and plotted the training sets side by side, as you can see below:

Train-validation split for the “bad” dataset

Same data, different scales for feature **x**

The **only** difference between the two plots is the **scale of feature x**. Its range was [0, 1], now it is [0, 10]. The label *y* hasn’t changed, and I did not touch **true_b**.

Does this simple **scaling** have any meaningful impact on our gradient descent? Well, if it hadn’t, I wouldn’t be asking it, right? Let’s compute a new **loss surface** and compare to the one we had before:

Loss surface — before and after scaling feature **x**

Look at the **contour values** of the plot above: the *dark blue* line was **3.0**, and now it is **50.0**! For the same range of parameter values, **loss values are much bigger**.

Let’s look at the *cross-sections* before and after we multiplied feature *x* by 10:

Comparing cross-sections: before and after

What happened here? The **red curve** got much **steeper** (bigger gradient), and thus we must use a **smaller learning rate** to safely descend along it.

More importantly, the **difference** in **steepness** between the red and the black curves **increased**.

This is exactly what **WE NEED TO AVOID!**

Do you remember why?

Because the **size of the learning rate is limited by the steepest curve**!
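The change in steepness can be checked by computing the gradients at the same parameter values, once with the original feature and once with the feature multiplied by 10. The stand-in data, the seed, and the function name are assumptions of this sketch.

```python
import numpy as np

# Stand-in training set (assumed seed and size)
np.random.seed(42)
x = np.random.rand(80, 1)
y = 1 + 2 * x + 0.1 * np.random.randn(80, 1)
bad_x = 10 * x  # same points, feature scaled 10x; labels untouched

def gradients(b, w, feature, labels):
    """Partial derivatives of the MSE with respect to b and w."""
    error = b + w * feature - labels
    return 2 * error.mean(), 2 * (feature * error).mean()

# Same parameter values for both datasets
b0, w0 = 0.52, -0.16
grad_b, grad_w = gradients(b0, w0, x, y)
bad_grad_b, bad_grad_w = gradients(b0, w0, bad_x, y)

# How lopsided the two gradients are, before and after scaling
ratio = abs(grad_w / grad_b)
bad_ratio = abs(bad_grad_w / bad_grad_b)

print("original:", grad_b, grad_w, ratio)
print("bad:     ", bad_grad_b, bad_grad_w, bad_ratio)
```

After scaling, the gradient for *w* is much larger in magnitude, and the gap between the two gradients widens.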

How can we fix it? Well, we *ruined* it by **scaling it 10x bigger**… perhaps we can make it better if we **scale it in a different way**.

### Scaling / Standardizing / Normalizing

Different how? There is this *beautiful* thing called the **StandardScaler**, which transforms a **feature** in such a way that it ends up with **zero mean** and **unit standard deviation**.

How does it achieve that? First, it computes the *mean* and the *standard deviation* of a given **feature (x)** using the training set (*N* points):

Mean and Standard Deviation, as computed in the StandardScaler

Then it uses both values to **scale** the feature:

Standardization

If we were to recompute the mean and the standard deviation of the scaled feature, we would get 0 and 1, respectively. This preprocessing step is commonly referred to as *normalization*, although, technically, it should always be referred to as *standardization*.
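The same two steps can be written by hand in NumPy. This is a sketch that mirrors what Scikit-Learn’s StandardScaler computes (it uses the population standard deviation, that is, dividing by *N*), not a call into that library; the stand-in data, the seed, and the variable names are assumptions.

```python
import numpy as np

# Stand-in "bad" feature, already split (assumed seed and sizes)
np.random.seed(42)
bad_x_train = 10 * np.random.rand(80, 1)
bad_x_val = 10 * np.random.rand(20, 1)

# Statistics come from the training set only
mean = bad_x_train.mean(axis=0)
std = bad_x_train.std(axis=0)  # population std (ddof=0)

# The same training statistics are applied to every set
scaled_x_train = (bad_x_train - mean) / std
scaled_x_val = (bad_x_val - mean) / std

print(scaled_x_train.mean(), scaled_x_train.std())
```

The validation set is scaled with the training set’s mean and standard deviation, in keeping with the rule that nothing is computed before the split.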
