What is Linear Regression?


This post was originally published at Towards Data Science

In this article, you will learn —

  1. What is a Linear Regression?
  2. How does it find the relation between input features and targets?
  3. How does it predict?
  4. How to evaluate the predictions?
  5. How to implement it in code?

Linear regression tries to find the best possible linear relationship between the input features and the target variable (y).

That’s it! This is what Linear Regression does. Pretty simple right?😃

In machine learning jargon, the above can be stated as: “It is a supervised machine learning algorithm that fits the data by modelling the target variable (dependent variable) as a linear combination of the input features (independent variables).”


  1. The target variable is also known as the dependent variable or label.
  2. Input features are also known as independent variables.

(Left) Image by Sathwick (Right) Image by Sathwick

For now, when you think of linear regression, think of fitting a line such that the distance between the data points and the line is minimal. As shown above, the red line fits the data better than the blue lines.

The linear relation between the input features and the output in 2D is simply a line.

Yes! The linear regression tries to find out the best linear relationship between the input and output.

y = θx + b  # Linear Equation

The goal of linear regression is to find the values of θ and b that best represent the given data.

We will learn more about it in a detailed manner later in this article.

OK! It’s time to dig deeper into the Linear Regression.

For now, let’s think that we have already obtained the proper linear relation between input features and labels (explained later). Now we focus on how a linear regression model would predict the values of an instance with the obtained relationship.

Linear Regression (the data is not real; it was created for this example)

From the data in the above image, linear regression would obtain the relation as the line y = 0.5*x + 1. (Don’t worry if you do not know how to find this relation; the methods for finding it are discussed in detail later.)

y = Earning per year

x = Experience

1 is the intercept or bias term and 0.5 is the feature weight of Experience

So, if the model is given a new data point for a person with 8 years of experience (the orange point), it computes 0.5*8 + 1 = 5, which corresponds to around $50k per year if y is read in units of $10k.
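The prediction above is just the line equation evaluated at x = 8. A minimal sketch follows; the function name `predict_earnings` is made up for illustration, and reading the result as $50k assumes that y is plotted in units of $10k.

```python
def predict_earnings(experience_years, weight=0.5, bias=1.0):
    """Evaluate the fitted line y = weight * x + bias for one input."""
    return weight * experience_years + bias


# A person with 8 years of experience (the orange point).
prediction = predict_earnings(8)
print(prediction)  # 5.0
```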

The general form of a model’s prediction

$$\hat{y} = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta \cdot x$$

ŷ is the predicted value.

x contains all the feature values of a particular point except the corresponding target value.

θ is the model’s parameter vector, containing the bias term θ0 and the feature weights θ1 to θn. (n = number of features)

You can think of feature weights as the coefficients of the features in the linear equation.

x is the feature vector of an instance (a data point in the dataset), containing x0 to xn, with x0 always equal to 1.

θ · x is the dot product of the vectors θ and x, which is, of course, equal to θ0x0 + θ1x1 + θ2x2 + … + θnxn.

θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn ).

Here θ, x are column vectors as you will see in a moment.

The model substitutes the feature values of a new instance x^(i) into the prediction equation and returns the resulting value as the prediction.
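As a small sketch of that substitution, here is the dot product θ · x in NumPy, reusing the earlier example line y = 0.5*x + 1. The variable names are chosen for illustration only.

```python
import numpy as np

# Column vectors: theta holds the bias term and one feature weight,
# x_new holds x0 = 1 followed by the feature value (8 years of experience).
theta = np.array([[1.0], [0.5]])
x_new = np.array([[1.0], [8.0]])

# theta . x = theta0 * x0 + theta1 * x1
y_hat = (theta.T @ x_new).item()
print(y_hat)  # 5.0
```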

Remember that we are using a linear regression which assumes that there is only a linear type of relation between the input and output(target variable). It is not suitable for problems where there is no linear relation in the data.

This is why the model’s prediction equation has every input feature raised to the power of one (which is linear), and this explains why we do not have terms like (x0)² or (x2)³ in it.


If these notations do not make sense yet, keep reading: a later section explains them with an example. Then come back and revisit this section, which should be easier the second time.

Bias Term

In Linear Regression, we add a bias term so that the fitted line is not forced through the origin. (Without this term there would be no intercept, and the model would assume that the best-fit line passes through the origin, which is not the case for every dataset.)
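To see what the bias term buys us, the sketch below fits the same noise-free line twice, once without and once with a column of ones. It uses NumPy's least-squares solver `np.linalg.lstsq` as a stand-in for the fitting methods described later, and the data (a line with intercept 5 and slope 3) is an assumption made for the example.

```python
import numpy as np

x = np.linspace(0, 2, 50).reshape(-1, 1)
y = 5 + 3 * x  # noise-free line with intercept 5 and slope 3

# Without a bias column, the fitted line is forced through the origin.
theta_no_bias, *_ = np.linalg.lstsq(x, y, rcond=None)

# With a bias column of ones (x0 = 1), the intercept can be learned.
X_b = np.c_[np.ones((50, 1)), x]
theta_bias, *_ = np.linalg.lstsq(X_b, y, rcond=None)

mse_no_bias = np.mean((x @ theta_no_bias - y) ** 2)
mse_bias = np.mean((X_b @ theta_bias - y) ** 2)
print("no bias:", theta_no_bias.ravel(), "MSE:", mse_no_bias)
print("with bias:", theta_bias.ravel(), "MSE:", mse_bias)
```

The fit without the bias column has to compensate for the missing intercept with a steeper slope, so its error stays well above zero.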

We have learnt how a linear regression model predicts, but how can we evaluate the predictions and measure how accurate they are?

The cost function will do that job for us.

Cost Function

It evaluates the model’s predictions and tells us how accurate they are. The lower the value of the cost function, the more accurate the model’s predictions. There are many cost functions to choose from, but we will use the Mean Squared Error (MSE) cost function.

The MSE function calculates the average of the squared difference between the prediction and the actual value (y).

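$$\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta \cdot x^{(i)} - y^{(i)} \right)^2$$

Let’s break down every variable in the equation.

m is the number of instances (samples) in the dataset.

x^(i) is the feature vector of the ith instance, and θ · x^(i) is the model’s prediction for it.

y^(i) is the actual target value of the ith instance.

Remember that x0 (the bias input) of any instance i is 1.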
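The MSE described above takes only a few lines of NumPy. This is a minimal sketch; the function name `mse` and the three-sample toy data are made up for illustration.

```python
import numpy as np


def mse(theta, X_b, y):
    """Average of the squared differences between predictions and targets."""
    errors = X_b @ theta - y
    return float(np.mean(errors ** 2))


# Three toy instances; the first column is the bias input x0 = 1.
X_b = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]])
y = np.array([[2.0], [3.0], [5.0]])
theta = np.array([[1.0], [0.5]])

# Predictions are 2, 3 and 4, so the errors are 0, 0 and -1.
print(mse(theta, X_b, y))
```

Only the third prediction is off (by 1), so the result is 1/3.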

Understanding Notations with an example

If you have trouble understanding what the variables represent, let’s try to understand them with an example.

Let’s consider a dataset of house prices with the features size, rooms and family (the ‘bias’ column represents the bias term, which is added to each sample; it is not a feature). It has five samples. Here, using the attributes size, number of rooms and family members, we have to predict the price of a house.

Features = size, rooms, family = independent variables

Target (or) label = price = dependent variable

m = 5 (the number of samples)
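To connect the notation to arrays, here is a sketch of how such a dataset could be laid out in NumPy. The numbers are invented purely for illustration (they are not the article's table); only the shape matters: five samples, three features, plus a bias column of ones.

```python
import numpy as np

# Made-up values for illustration only. Columns: size, rooms, family.
features = np.array([
    [1200.0, 3.0, 4.0],
    [850.0, 2.0, 2.0],
    [1500.0, 4.0, 5.0],
    [950.0, 2.0, 3.0],
    [1100.0, 3.0, 3.0],
])
price = np.array([[250.0], [180.0], [320.0], [200.0], [235.0]])  # target y

m, n = features.shape                 # m = 5 samples, n = 3 features
X = np.c_[np.ones((m, 1)), features]  # prepend the bias column x0 = 1

x_2 = X[1]      # feature vector of the second instance (including x0)
y_2 = price[1]  # its target value
print(m, n, X.shape)
```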

Finding the correct relation means finding accurate feature weights (θj) for each of the features, because only the right feature weights give a linear equation that can be used for prediction.

There are two ways to find the right relationship in the data.

  1. Normal Equation
  2. Gradient Descent

The Normal Equation

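$$\hat{\theta} = (X^T X)^{-1} X^T y$$

Here θ̂ is the value of θ that minimises the cost function, X is the matrix of feature vectors (one row per instance, including the bias column), and y is the vector of target values.

This is a closed-form equation that gives the optimised values directly, without any iterative steps.

By evaluating the above equation, you will get the optimised values of the feature weights.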

Now let’s see how to do this in code.

First, let’s create some data that has a linear relationship.

Preparing data
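A minimal sketch of such data, following the relationship y = 5 + 3*X + Gaussian noise that the article uses. The sample count, the range of X and the random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the example is repeatable

# 100 samples of one feature, uniformly spread over [0, 2).
X = 2 * rng.random((100, 1))

# Linear relationship with intercept 5 and slope 3, plus Gaussian noise.
y = 5 + 3 * X + rng.standard_normal((100, 1))

print(X.shape, y.shape)
```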

Now we have the data, so let’s see how to find out the values of feature weights(θ0, … θn) with the help of the normal equation.

Linear Regression with normal equation
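The normal equation can be sketched in NumPy as follows. The data is regenerated here so the block runs on its own; the seed and sample count are assumptions, so the exact numbers will differ from the article's original run.

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 5 + 3 * X + rng.standard_normal((100, 1))

# Add the bias input x0 = 1 to every instance.
X_b = np.c_[np.ones((100, 1)), X]

# Normal equation: theta = (X^T X)^-1 X^T y
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_best.ravel())  # roughly [5, 3], up to the noise
```

Explicitly inverting the matrix mirrors the formula, which is why it is used here; a least-squares solver such as `np.linalg.lstsq` is usually the more numerically robust choice.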

As you can see, the estimated feature weights are very close to the actual values (y = 5 + 3*X + Gaussian noise). The noise in the data prevents the model from recovering the exact values, but the estimates are close enough.

Disadvantages of Normal Equation

  1. It is computationally expensive if you have a large number of features.
  2. If there are any redundant features in your data, then the matrix inversion in the normal equation is not possible. In that case, the inverse can be replaced by the pseudo-inverse.
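The second point can be sketched with a deliberately redundant feature. Below, x2 is an exact multiple of x1, so the design matrix does not have full column rank and the plain inverse is unreliable, while `np.linalg.pinv` still returns a least-squares solution. The data and the explicit `rcond` cutoff are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = 2 * rng.random((100, 1))
x2 = 2 * x1  # redundant feature: an exact multiple of x1
y = 5 + 3 * x1 + rng.standard_normal((100, 1))

X_b = np.c_[np.ones((100, 1)), x1, x2]

# Three columns but only two independent directions.
rank = np.linalg.matrix_rank(X_b)
print(rank, "of", X_b.shape[1], "columns are independent")

# The pseudo-inverse picks the minimum-norm least-squares solution.
# rcond is set explicitly so tiny singular values are treated as zero.
theta = np.linalg.pinv(X_b, rcond=1e-10) @ y
print(theta.ravel())
```

The individual weights of x1 and x2 are not unique here; only their combined effect (weight of x1 plus twice the weight of x2) is determined by the data.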

Gradient Descent

Instead of performing the expensive matrix operations of the normal equation, which can slow things down when we have a large number of features, we can use gradient descent, which handles a vast number of features well and in less time.


Let’s say you are on the top of a mountain and you want to descend as fast as possible, but it’s cloudy and you cannot see the way down. One solution you might think of is to consider all the possible directions around you and move in the direction that takes you down the most.

This same idea can be applied to the linear regression, but here the problem is to minimise the cost function.

If you do not know what a gradient is, remember that the gradient of a function points in the direction in which the function’s value increases fastest.

Applying gradient descent to find the optimised feature weights

First, the feature weights have to be initialised with random values.

Then the gradient of the function is calculated for each model parameter(feature weight) θj. In other words, it estimates how much the cost function will change if we change θj a little bit (which is known as partial derivative).

The vectorised form that calculates the partial derivatives for all feature weights θj at once is

$$\nabla_\theta \mathrm{MSE}(\theta) = \frac{2}{m} X^T (X\theta - y)$$
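One way to convince yourself that the vectorised gradient is right is to compare it with a numerical estimate. The sketch below assumes the MSE cost, for which the gradient is (2/m) Xᵀ(Xθ − y), and checks it against central finite differences; the helper names and toy data are made up for illustration.

```python
import numpy as np


def mse(theta, X_b, y):
    return float(np.mean((X_b @ theta - y) ** 2))


def mse_gradient(theta, X_b, y):
    """Vectorised gradient of the MSE with respect to theta."""
    m = len(X_b)
    return (2 / m) * X_b.T @ (X_b @ theta - y)


rng = np.random.default_rng(1)
X_b = np.c_[np.ones((20, 1)), rng.random((20, 1))]
y = 5 + 3 * X_b[:, [1]]
theta = rng.standard_normal((2, 1))

# Numerical partial derivative for each theta_j: nudge it a little in
# both directions and see how much the cost changes.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (mse(theta + step, X_b, y) - mse(theta - step, X_b, y)) / (2 * eps)

analytic = mse_gradient(theta, X_b, y)
print(analytic.ravel(), numeric.ravel())
```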

The gradient of a function gives the direction that increases that function the fastest, but we want to minimise it, so we move in the opposite direction.

The step we have to take to optimise the feature weights θ is

$$\theta^{(\text{next step})} = \theta - \eta \, \nabla_\theta \mathrm{MSE}(\theta)$$

where η is the learning rate.

After taking steps in the direction that minimises the cost function, we will get the optimised values for the feature weights at which the value of the cost function would be minimum.

Learning rate

(left) When the learning rate is small (right) When the learning rate is large.

Instead of changing the value of feature weights by an immense amount in the right direction, which may overshoot the actual minimum, we will take multiple small steps in the direction which minimises the cost function. The learning rate determines the size of the step.

We have to be careful in choosing the learning rate. If the learning rate is too low, it will take a long time to converge to the right solution; if it is too high, the steps may overshoot the minimum and the algorithm may fail to converge.
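The effect of the learning rate is easy to see in a small experiment. The sketch below runs 50 gradient descent steps on a noise-free line with three different learning rates; the specific rates (0.01, 0.1, 0.6), the data and the helper name `run_gd` are assumptions picked for the illustration.

```python
import numpy as np


def run_gd(eta, n_iterations=50):
    """Run gradient descent on y = 5 + 3x and return the final MSE."""
    x = np.linspace(0, 2, 50).reshape(-1, 1)
    y = 5 + 3 * x
    X_b = np.c_[np.ones((50, 1)), x]
    theta = np.zeros((2, 1))
    for _ in range(n_iterations):
        gradients = (2 / 50) * X_b.T @ (X_b @ theta - y)
        theta = theta - eta * gradients
    return float(np.mean((X_b @ theta - y) ** 2))


costs = {eta: run_gd(eta) for eta in (0.01, 0.1, 0.6)}
for eta, cost in costs.items():
    print(f"learning rate {eta}: final MSE {cost:.4g}")
```

On this data the smallest rate is still far from the minimum after 50 steps, the middle one gets much closer, and the largest one overshoots on every step so its cost grows instead of shrinking.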

The problem of local minimum

You are using gradient descent to minimise the cost function, but what if you get stuck in a local minimum?

Fortunately, with the Mean Squared Error cost function we do not have that problem, because for linear regression it is a convex function, i.e. it has only a global minimum and is free from local minima.

Mean Squared Error Function (Image by Sathwick)

Implementing Gradient Descent

Linear Regression with gradient descent
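A minimal batch gradient descent sketch for the same kind of data is shown below. The learning rate (0.1) and iteration count (1000) are assumptions chosen conservatively so the weights settle; they are not the article's original settings, and the data is regenerated with an assumed seed.

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 5 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add the bias input x0 = 1
m = len(X_b)

eta = 0.1            # learning rate (assumed)
n_iterations = 1000  # number of steps (assumed)

theta = rng.standard_normal((2, 1))  # random initialisation

for _ in range(n_iterations):
    # Gradient of the MSE cost, then one step in the opposite direction.
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta.ravel())  # roughly [5, 3], up to the noise
```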

That’s interesting: we got the same result as the normal equation from the gradient descent approach, and in just 50 iterations.

OK! That’s fine. But how do we know at which iteration we will get the optimised feature weights?

As we approach the minimum of the cost function, the partial derivatives in the gradient get close to zero. Multiplied by the learning rate, which is less than one, they make each step tinier still, so at the end of all the iterations we end up pretty close to the optimal feature weights (provided you have selected an appropriate learning rate that is not too large).
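Because the gradient shrinks towards zero near the minimum, one common way to decide when to stop is to watch its size. The sketch below stops once the gradient norm falls under a tolerance; this stopping rule, the tolerance value and the function name `gradient_descent` are assumptions added for illustration, not something the article's code is known to do.

```python
import numpy as np


def gradient_descent(X_b, y, eta=0.1, tolerance=1e-6, max_iterations=10_000):
    """Stop early once the gradient is so small that steps barely move theta."""
    m = len(X_b)
    theta = np.zeros((X_b.shape[1], 1))
    for iteration in range(1, max_iterations + 1):
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
        if np.linalg.norm(gradients) < tolerance:
            break
        theta = theta - eta * gradients
    return theta, iteration


x = np.linspace(0, 2, 50).reshape(-1, 1)
X_b = np.c_[np.ones((50, 1)), x]
y = 5 + 3 * x  # noise-free, so the optimum is exactly [5, 3]

theta, iterations_used = gradient_descent(X_b, y)
print(theta.ravel(), "after", iterations_used, "iterations")
```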

Disadvantages of Gradient Descent

  1. Gradient descent can get stuck in a local minimum if you use another cost function that is not convex.
  2. You should find the appropriate value for the learning rate.

Implementing Linear Regression in Scikit-Learn

Linear Regression in sklearn

Scikit-Learn provides a LinearRegression class to perform Linear Regression
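A short sketch of the `LinearRegression` class on the same kind of data follows. The data is regenerated with an assumed seed, so the numbers will not match the article's original output exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 5 + 3 * X + rng.standard_normal((100, 1))

# LinearRegression adds the bias term itself (fit_intercept=True by default),
# so we pass X without a column of ones.
lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(lin_reg.intercept_, lin_reg.coef_)  # bias term and feature weight

# Predict for two new instances.
X_new = np.array([[0.0], [2.0]])
print(lin_reg.predict(X_new))
```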

As you can see the values we got from the normal equation, gradient descent, sklearn are nearly the same.

Well, if you have read this far and everything makes sense, pat yourself on the back! You have learned all the underlying concepts of linear regression.

What about data with higher dimensions (multiple features)?

Linear Regression in 3D(Image by Sathwick)

Until now, we have seen linear regression for two-dimensional data. In two dimensions, linear regression fits a line; in three dimensions the line is replaced by a plane, and for data with more than three dimensions we use a hyperplane. The remaining steps are the same.

Simply put, as the number of input features increases, the number of feature weights to calculate increases too.
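To illustrate, here is a sketch with three input features, which means four weights to find (the bias term plus one per feature). The "true" weights and the synthetic data are assumptions made up for the example, and `np.linalg.lstsq` is used as the solver.

```python
import numpy as np

rng = np.random.default_rng(7)

# 200 instances with 3 features each.
X = rng.random((200, 3))
X_b = np.c_[np.ones((200, 1)), X]  # add the bias input x0 = 1

# Assumed ground truth: bias 4.0 and weights 1.5, -2.0, 0.5, plus small noise.
true_theta = np.array([[4.0], [1.5], [-2.0], [0.5]])
y = X_b @ true_theta + 0.1 * rng.standard_normal((200, 1))

# Same idea as before, just with more weights to estimate.
theta_best, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_best.ravel())
```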

I will let you research more about the data in higher dimensions on your own, but the underlying concepts remain the same.

Well done!👏 You have made it. Now you probably should have a better idea about linear regression.


Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition

Linear Regression is a widely used technique for regression problems. It can be used to predict a dependent variable from independent variables. It searches the training data for a relationship between the independent variables and the dependent variable. In the case of two-dimensional data, this relationship is merely a line equation. It uses this line equation (the relation found from the data during training) to predict values for data it has not seen before. This is the whole idea behind linear regression.

Make sure that you understand the gradient descent part. Gradient descent is one of the most widely used optimisation techniques; it is used pretty much everywhere in Machine Learning and even in Deep Learning.

Thanks for Reading!
