*This post was originally published by Saurabh Singh at Towards Data Science*

This is the first article in a series where we will understand the “under the hood” workings of various ML algorithms, using their base math equations.

With so many optimized implementations out there, we sometimes focus too much on the library and the abstraction it provides, and too little on the underlying calculations that go into the model. Understanding these calculations can often be the difference between a good and a great model.

In this series, I focus on implementing the algorithms by hand to understand the math behind them, which will hopefully help us train and deploy better models.

Note — This series assumes you know the basics of Machine Learning and why we need it. If not, do give this article a read to get up to speed on why and how we utilize ML.

Linear Regression forms the backbone of Machine Learning and is based on the simple concept of curve fitting –

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints.

Essentially, the “model” produces a linear equation relating the input features (X) to the target variable (Y).

Consider the following data with two input variables, **X1** and **X2**, and one target variable, **Y**.

*Sample data with 5 rows*

Linear regression will try to find the optimum values for *w1*, *w2* and *b* such that for every row of data —

*w1·X1 + w2·X2 + b ≈ Y*

Here, *w1* and *w2* are the coefficients of the input variables, while *b* is the bias term. This is known as the “best fit line” for the data, and the algorithm iteratively tries to find this best fit line, using the following steps —
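As a minimal sketch in Python (variable names are illustrative), the prediction is just a weighted sum of the inputs plus the bias term:

```python
def predict(x1, x2, w1, w2, b):
    # Linear model: y_hat = w1*x1 + w2*x2 + b
    return w1 * x1 + w2 * x2 + b
```

For example, `predict(1.0, 2.0, 0.5, 0.5, 1.0)` returns `2.5`.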

1. Assign random values to parameters *w1*, *w2* and *b*.

2. Pick one instance in the data and calculate ŷ.

3. Calculate the loss — How far off was our output from the actual output?

4. Calculate the gradient for *w1*, *w2* and *b* — How should we change the weights to move closer to the actual output?

5. Update *w1*, *w2* and *b*.

6. Repeat steps 2–5 until convergence.

The following set of images conveys the steps for a single variable —

0. Start with a set of ’n’ data points

1. Assign random values to the parameters and plot the assumed curve

2. Using one instance of data, calculate ŷ

3. Calculate the loss

4 & 5. Calculate the gradients and update the parameters

6. Repeat steps 2–5 with another instance of data

## 1. Assign random values to *w1*, *w2* and *b*

Let’s start with our assumptions —

## 2. Pick one instance from the data and calculate ŷ

Let’s start with the first row of our data

Putting in our assumed values for the parameters, we calculate an estimated output

Our goal is to update the parameters such that our estimated output (ŷ) is equal to the actual output (y).
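With hypothetical starting values (the article’s actual random parameters and data row live in its figures, so the numbers below are made up), the calculation looks like:

```python
# Hypothetical random starting parameters
w1, w2, b = 0.5, 0.5, 1.0

# One hypothetical data row (x1, x2, y)
x1, x2, y = 1.0, 2.0, 5.0

# Estimated output for this row
y_hat = w1 * x1 + w2 * x2 + b
print(y_hat)  # 2.5, while the actual output y is 5.0
```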

## 3. Calculate the loss — How far off was the calculated output from the actual output?

This is where things get interesting.

We calculate how far our assumption has led us from the actual value and update our parameters to move closer to the actual output.

In order to calculate and update our parameter assumptions using gradients, we need to calculate the loss using a function that is differentiable.

We will use the Squared Error as our loss function. It measures the squared difference between what our assumptions lead us to (ŷ) and the actual output (y). Squaring the error has a useful property: it stays small for minute errors, but explodes when the model’s assumptions are way off from the actual value.
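As a sketch, the squared-error loss for a single instance:

```python
def squared_error(y_hat, y):
    # Small for small errors; grows quadratically when y_hat is far from y
    return (y_hat - y) ** 2
```

An error of 0.1 gives a loss of 0.01, while an error of 10 gives a loss of 100.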

Why this value of the loss matters to the algorithm will become clear in the next step.

Let’s calculate the loss from our assumptions —

## 4. Calculate gradients

This is the most important step of the algorithm, as this is where we iteratively learn from and improve our assumptions to get closer to the actual output.

We start with writing our loss as a function of the model parameters —

To determine how we should change our parameters to get closer to the actual output, we calculate the gradient (partial derivative) of the loss with respect to each of the coefficients and the bias term —

This gives us the gradients, which measure the impact each parameter had on predicting the output, essentially telling us how much we need to change each of our parameters to get closer to the actual output.
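Differentiating the loss L = (ŷ − y)² by the chain rule gives 2(ŷ − y)·x1, 2(ŷ − y)·x2 and 2(ŷ − y) for *w1*, *w2* and *b* respectively. A sketch:

```python
def gradients(x1, x2, y, w1, w2, b):
    # Partial derivatives of L = (y_hat - y)**2 for one instance
    y_hat = w1 * x1 + w2 * x2 + b
    error = y_hat - y  # common chain-rule factor
    return 2 * error * x1, 2 * error * x2, 2 * error  # dL/dw1, dL/dw2, dL/db
```

A negative gradient means the parameter should increase to reduce the loss, and vice versa.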

## 5. Update *w1*, *w2* and *b*

Wow! Looks like our assumptions were off by a huge margin!

This does not look right. Let’s scale our updates using a scaling variable — the **Learning Rate (η)**. The learning rate ensures that our weights don’t change by a huge amount (and start oscillating) every time we make an update.

Considering η = 0.01, our updates look more reasonable —
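Each parameter is moved a small step against its gradient, scaled by η. A sketch with a made-up gradient value:

```python
eta = 0.01  # learning rate

def update(param, grad, eta):
    # Gradient-descent update: step against the gradient, scaled by eta
    return param - eta * grad

w1_new = update(0.5, -5.0, eta)  # 0.5 - 0.01 * (-5.0) = 0.55
```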

That looks more reasonable.

This completes one iteration of the algorithm. Now we repeat these steps until **convergence**, *i.e.*, until our weights stop changing significantly and/or our loss is close to 0.

Let’s do another iteration with the other rows of data, using our updated parameters.

Continuing the same process for a few more iterations, randomly sampling a row from the first four rows of data (we preserve the last row for validating our model), we get the following parameter values —
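Putting all the steps together: the sketch below runs the full loop on made-up data (the article’s actual table lives in an image, so these rows are generated from y = 2·x1 + 3·x2 + 1 purely for illustration), training on the first four rows and validating on the last:

```python
import random

# Made-up data standing in for the article's table; each row is (x1, x2, y),
# generated from y = 2*x1 + 3*x2 + 1 so an exact fit exists.
data = [(1.0, 1.0, 6.0), (2.0, 1.0, 8.0), (1.0, 2.0, 9.0),
        (3.0, 2.0, 13.0), (2.0, 3.0, 14.0)]
train, (xv1, xv2, yv) = data[:4], data[-1]

random.seed(0)
w1, w2, b = random.random(), random.random(), random.random()  # step 1
eta = 0.01

for _ in range(20_000):
    x1, x2, y = random.choice(train)  # step 2: pick one instance
    y_hat = w1 * x1 + w2 * x2 + b     # ... and calculate y_hat
    error = y_hat - y                 # step 3: loss = error ** 2
    w1 -= eta * 2 * error * x1        # steps 4 & 5: gradient + update
    w2 -= eta * 2 * error * x2
    b -= eta * 2 * error

print(round(w1, 2), round(w2, 2), round(b, 2))  # 2.0 3.0 1.0
val_loss = (w1 * xv1 + w2 * xv2 + b - yv) ** 2  # loss on the held-out row
print(val_loss)                                 # very close to 0
```

Because this data is exactly consistent with a linear model, every per-instance gradient vanishes at the optimum, so plain SGD with a fixed learning rate settles there.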

Using these, let’s see what the error on our validation sample (last row) looks like.

Our loss is almost 0. We can say that the model has found the optimal parameters for this data.

And that’s it. At its core, this is what Linear Regression does.
