Building a Neural Network with Logistic Regression

This post was originally published by Mithavachana sm at Medium [AI]

Logistic regression can be viewed as a basic neural network, and in this post we will discuss how to implement it as one. I assume you already know some basics of neural networks. Before we implement logistic regression this way, we need to define a few building blocks of a neural network, so let's see what they are.

First, the forward pass: it takes the inputs and weights, applies a non-linear function to produce an output, and then uses a cost function to calculate the loss of the predicted output compared to the actual output.

Second, the backward pass: it minimizes the loss function using the gradient descent algorithm, computing gradients to adjust the weights toward values that produce the correct output. This is called optimization.

These two steps are the fundamental building blocks of a neural network. This is only an overview; we will look at each in detail as we progress.

Logistic regression is a binary classifier.

For example: given an image of a dog or a cat, we have to identify whether a cat is present. If a cat is present we label the image 1, and if not we label it 0; it's as simple as that. But how do we represent the image? We humans can look at a picture and recognize a cat at once, but a computer needs the data in a form it can work with. A picture is represented as pixels, and each pixel has a numeric value. Let's have a look.

[Image: a cat photo in RGB]

Above is the cat image, and below are the corresponding pixel values; the picture itself has 3 colour channels.

[Image: the three RGB channels of the picture]

R = red, G = green, B = blue. Here we have a 7×5 image with 3 channels, that is 7×5×3 pixels.

The real cat image has far more than 7×5×3 pixels, but for this example we work with the 7×5×3 version shown in the second image. We cannot feed the image into a neural network as it is; we have to unroll all the pixel values into a single vector, for example 7×5×3 = 105 pixels.
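As a small illustration (a minimal sketch in NumPy, with random numbers standing in for the real pixel values), unrolling a 7×5×3 image into a column vector looks like this:

```python
import numpy as np

# A toy 7x5 RGB image; random pixel values stand in for the real picture
image = np.random.randint(0, 256, size=(7, 5, 3))

# Unroll (flatten) all pixel values into a single nx x 1 column vector
x = image.reshape(-1, 1)   # shape (105, 1), since 7*5*3 = 105

print(x.shape)             # (105, 1)
```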

We then represent the picture as a feature vector of shape nx×1, where nx is the number of pixels (105), and it is a column vector. This is a single image, but in practice we will have thousands of images, so we stack all these image vectors side by side as columns, like below.

We need this input feature vector to produce the output: cat or not.

Here x(1) represents a single image vector, i.e. a feature vector that is an nx×1 column vector.

[Image: matrix X of shape nx×m holding m training examples]

We stack all the input feature vectors x(1), x(2), x(3), …, x(m) side by side, as shown above: each column is one training image's feature vector, and the whole matrix of columns is denoted by a capital X.
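A quick sketch of that stacking step (assuming m = 4 toy images, just to check the shapes):

```python
import numpy as np

nx, m = 105, 4                                         # 105 pixels per image, 4 training images
images = [np.random.rand(7, 5, 3) for _ in range(m)]   # stand-in images

# Each image becomes one nx x 1 column; hstack puts the columns side by side
X = np.hstack([img.reshape(-1, 1) for img in images])

print(X.shape)   # (105, 4) -> one column per training example
```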

Now let's look at the output of the network.

Given an input vector x, we want the predicted output ŷ = P(y = 1 | x),

i.e. the probability that y equals 1 given the input x.

The output starts as a linear combination of the inputs x with the weights w, plus a bias b:

z = wᵀx + b

This linear output z is continuous in nature, which is not what we want here: in logistic regression we are predicting 0 or 1, so we apply the sigmoid function to squash z into a value between 0 and 1, where sigmoid(z) = 1 / (1 + e^(-z)).

So we write ŷ = a = sigmoid(z), which produces a value between 0 and 1. For more background on logistic regression you can refer to online materials. The image below shows the sigmoid function.

[Image: linear output vs. sigmoid output]
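Written in NumPy, a minimal sigmoid sketch looks like this:

```python
import numpy as np

def sigmoid(z):
    """Squash a linear output z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```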

Next we need a cost function, which is built from the loss function (the loss function measures the error on a single training example, while the cost function covers all training examples). The purpose of the cost function is to measure how far the predictions are from the actual labels across the whole training set, so that we can find the right weights w and bias b for our model. We define the cost function first, and once we can measure the loss we can tune the weights.

For a single training example we define the loss as follows:

L(ŷ, y) = -( y log ŷ + (1 - y) log(1 - ŷ) )

where y is the actual output and ŷ is the predicted output. The minus sign turns the (negative) log probabilities into a positive quantity, so that minimizing the loss corresponds to maximizing the probability of the correct label.

If y = 1, the loss becomes L(ŷ, y) = -log ŷ. To make -log ŷ as small as possible, ŷ has to become large, i.e. close to 1.

If y = 0, the loss becomes L(ŷ, y) = -log(1 - ŷ). To make -log(1 - ŷ) as small as possible, ŷ has to become small, i.e. close to 0.

For all training examples together we use the cost function to calculate the overall loss (cost):

J(w, b) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i))

Expanding the loss: J(w, b) = -(1/m) Σ_{i=1..m} [ y(i) log ŷ(i) + (1 - y(i)) log(1 - ŷ(i)) ]

Here the negative sign again lets us maximize the probabilities by minimizing the cost, w and b are the parameters of the model, and the sum over i = 1 to m adds up the per-example losses so we can take the average loss over all training examples.
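Here is a small sketch of the loss and cost in NumPy (assuming `a` holds the predicted probabilities ŷ(i) and `y` the actual labels, both as arrays of length m):

```python
import numpy as np

def loss(a, y):
    # Cross-entropy loss for prediction a and label y
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def cost(a, y):
    # Average the per-example losses over all m training examples
    m = y.shape[0]
    return np.sum(loss(a, y)) / m

y = np.array([1, 0, 1])        # actual labels
a = np.array([0.9, 0.2, 0.7])  # predicted probabilities
print(cost(a, y))              # small number when predictions are good
```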

Now that we have defined our cost, next we will see how to find the best parameters w and b that reduce the cost over all images.

We have a cost function that computes the loss for our images; now we have to reduce that loss so the predicted output matches the actual output. The approach we use here is the gradient descent algorithm, which finds the optimal weight and bias parameters for our prediction. Let's have a look at how it works in the image below.

We will use the convex loss function below to understand gradient descent.

[Image: a convex cost function]

Above is the cost (loss) surface: the horizontal directions represent the weights and bias, and the height represents the cost. When we initialize the weights randomly and compute the cost, we land somewhere on this surface. Gradient descent then takes steps in the downhill direction, toward the point where the loss is minimum, the global minimum. It keeps stepping along the steepest descent direction until it finds that global minimum.

Let's see below.

[Image: updating the weights on a convex loss function]

The job is to find the minimum loss, which lies at the bottom of the curve you can see above. Each time we take a step toward that minimum we compute the derivative of the loss function. We will not go deep into derivatives here (you can find material online); for our purposes a derivative is simply the slope, and we are looking for the point where the slope is horizontal, i.e. where the loss is lowest, as the figure above shows.

When we initialize the weights, w may land, say, on the high side of the curve where the slope is large and positive. In that case we move w to the left by subtracting a fraction of the derivative from its current value. That fraction is controlled by the learning rate, which decides how large a step to take in a given direction. In other words, when the slope is positive we step in the opposite (negative) direction.

When w lands where the slope is negative, subtracting the (negative) derivative, scaled by the learning rate, effectively adds to w, so we step in the positive direction.

Our ultimate goal is to reach the global minimum; gradient descent updates the weights on every iteration, reducing the loss until it finds that minimum.

The overall objective is to find the optimal w and b parameters; since we cannot change the inputs,

w and b are the only parameters we can use to optimize the loss.
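To make the idea concrete, here is a tiny sketch of gradient descent on a one-dimensional convex function, f(w) = (w - 3)^2; this toy function is just for illustration, not our actual cost:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3
def df(w):
    return 2 * (w - 3)        # derivative (slope) of f at w

w = 10.0                      # arbitrary starting point on the curve
alpha = 0.1                   # learning rate: how big a step to take

for step in range(100):
    w = w - alpha * df(w)     # move opposite to the slope

print(w)                      # approximately 3.0, the global minimum
```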

Forward pass

The computation of a neural network comprises two passes: the forward pass, where we compute the cost function at the output, and the backward pass, where we calculate gradients of that cost function.

Let's say we have the function j(a, b, c) = 3(a + bc).

Let's see the computational graph for the forward pass below.

The whole idea of the forward pass, or forward propagation, is to compute the function j, which plays the role of the loss at the output. In this example we use a simple function to show what a forward-pass computation graph looks like: first we multiply b and c to get an intermediate node w = bc, then we add w to a to get x = a + w, and finally we multiply x by 3 to get the output j = 3x (here w and x are just intermediate nodes of the graph, not the weights and inputs from before). That is the forward pass.
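In code, the forward pass through this toy graph simply computes each node in order (the values of a, b, c below are arbitrary):

```python
# Forward pass through the computation graph of j = 3 * (a + b*c)
a, b, c = 5.0, 3.0, 2.0

w = b * c        # intermediate node: w = bc
x = a + w        # intermediate node: x = a + w
j = 3 * x        # output node

print(w, x, j)   # 6.0 11.0 33.0
```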

Backward pass

In the backward pass we find derivatives of the output j with respect to each input, which is how we will eventually find the weights and bias. We will not go into the mathematics of derivatives now; informally, a derivative is the slope of the function, telling us how the output changes when we nudge an input. By repeatedly updating the weights using these slopes we can push the loss toward its minimum.

You can see in the image above how simple backpropagation looks: we have to find the derivative of the output function j with respect to every input, and to do that we employ the chain rule of derivatives.

First we calculate the derivative of the output with respect to x, dj/dx, which tells us how much j changes if we change x. Then we find how x changes with w, dx/dw, and how x changes with a, dx/da, and finally how w changes with b and c, dw/db and dw/dc. Chaining these together gives dj/da = dj/dx · dx/da, dj/db = dj/dx · dx/dw · dw/db, and dj/dc = dj/dx · dx/dw · dw/dc.

This is what simple backward propagation looks like, and for it you only need the chain rule of derivatives. In our actual model, backpropagation gives us the gradients of the loss (or cost) with respect to the weights of the inputs and the bias in exactly the same way.
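Continuing the same toy example, a sketch of the backward pass applies the chain rule node by node:

```python
# Backward pass for j = 3 * (a + b*c), reusing the forward values above
a, b, c = 5.0, 3.0, 2.0
w = b * c
x = a + w

dj_dx = 3.0                     # j = 3x
dx_dw = 1.0                     # x = a + w
dx_da = 1.0
dw_db = c                       # w = b*c
dw_dc = b

# Chain rule: multiply local derivatives along the path back to each input
dj_da = dj_dx * dx_da           # 3
dj_db = dj_dx * dx_dw * dw_db   # 3*c = 6
dj_dc = dj_dx * dx_dw * dw_dc   # 3*b = 9

print(dj_da, dj_db, dj_dc)      # 3.0 6.0 9.0
```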

Now let's discuss how to compute the derivatives needed to implement gradient descent for logistic regression, using a computation graph. First, recall the logistic regression equations:

z = wᵀx + b — the linear output for input x

ŷ = a = sigmoid(z) — the logistic sigmoid output of the linear part

L(a, y) = -( y log(a) + (1 - y) log(1 - a) ) — the loss for a single training example

The image below shows the computation graph for a single training example.

[Image: computing derivatives for logistic regression]

Working backwards through this graph gives the gradients for a single training example; in particular dz = a - y, dw1 = x1·dz, dw2 = x2·dz, and db = dz.
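For one training example with two features x1 and x2 (a sketch with made-up numbers, just to show the forward and backward steps together):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up single example with two features and its label
x1, x2, y = 1.0, 2.0, 1
w1, w2, b = 0.1, -0.2, 0.0   # current parameters

z = w1 * x1 + w2 * x2 + b    # forward: linear part
a = sigmoid(z)               # forward: prediction y-hat

dz = a - y                   # backward: dL/dz
dw1 = x1 * dz                # dL/dw1
dw2 = x2 * dz                # dL/dw2
db = dz                      # dL/db

print(dw1, dw2, db)
```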

Let's now discuss m training examples.

First, the cost function over all training examples:

J(w, b) = (1/m) Σ_{i=1..m} L(a(i), y(i))

where the sum over i = 1 to m adds up the losses, the 1/m factor averages them, and a(i), y(i) are the prediction and label for the i-th training example.

a(i) = ŷ(i) = sigmoid(z(i)) = sigmoid(wᵀx(i) + b)

With these we can compute the sigmoid outputs and the cost J.

Let's have a look at how to compute the derivatives of the total cost function over all training examples.

dJ(w, b)/dw1 = (1/m) Σ_{i=1..m} dL(a(i), y(i))/dw1

where dw1(i) denotes the derivative computed on the single training example (x(i), y(i)).

In the same way we compute the derivatives for w2 and b:

dJ(w, b)/dw2 = (1/m) Σ_{i=1..m} dL(a(i), y(i))/dw2

dJ(w, b)/db = (1/m) Σ_{i=1..m} dL(a(i), y(i))/db

Now we know how to find the gradients dw1, dw2, and db over all training examples.

Let's see how to implement this logic. In a loop over the training examples, for each i we compute:

z(i) = wᵀx(i) + b

a(i) = sigmoid(z(i))

and we accumulate the cost over all training examples: J += -[ y(i) log a(i) + (1 - y(i)) log(1 - a(i)) ].

We also compute dz(i) = a(i) - y(i), the derivative of the loss with respect to z(i), for each training example, and accumulate:

dw1 += x1(i) · dz(i)

dw2 += x2(i) · dz(i)

db += dz(i)

The accumulators above collect the gradient contributions of every input feature over all training examples.

The overall cost is J/m, the average loss, where m is the number of training examples.

The final gradients, averaged over all m training examples, are:

dw1/m, dw2/m, db/m

We then update the weights as follows: w1 := w1 - alpha·dw1, w2 := w2 - alpha·dw2,

b := b - alpha·db, where alpha is the learning rate.

That is how to implement one step of gradient descent for logistic regression over all training examples.

For the full training set we take the average loss (the 1/m factor), whereas a single training example does not need averaging.
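Putting all of the above together, here is a sketch of one gradient descent iteration over m examples with two features, written with explicit loops exactly as described; the data is random, just to make it runnable:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 8                                    # number of training examples
X = np.random.rand(2, m)                 # two features per example (columns)
Y = np.random.randint(0, 2, size=m)      # binary labels
w1, w2, b, alpha = 0.0, 0.0, 0.0, 0.1    # parameters and learning rate

J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):
    z = w1 * X[0, i] + w2 * X[1, i] + b          # forward pass
    a = sigmoid(z)
    J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
    dz = a - Y[i]                                # backward pass
    dw1 += X[0, i] * dz
    dw2 += X[1, i] * dz
    db += dz

J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m   # average over examples

w1 -= alpha * dw1                                # parameter update
w2 -= alpha * dw2
b  -= alpha * db
print(J, w1, w2, b)
```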

Vectorization means getting rid of explicit for loops in code. When we train our model on a large training set, those loops iterate example by example and make training very slow, so to avoid that we use vectorization.

Let’s see how

When you compute z = wᵀx + b, both w and x are large nx×1 vectors, and a naive implementation would use a for loop over every element of the vectors. We want to avoid that.

Instead we can use the NumPy library to do this with a single call:

z = np.dot(w, x) + b, which is equal to wᵀx + b.

Across examples, we also stack the input vectors horizontally into a single matrix.

Let's see below how to do this for logistic regression.

[Image: vectorized logistic regression]

As shown above, with the NumPy library we can stack the examples horizontally and rely on broadcasting, which is cleaner and far less time-consuming than for loops.
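For comparison, here is a sketch of the same gradient step with no explicit loop over examples, using NumPy matrix operations and broadcasting (again with random stand-in data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nx, m = 105, 1000                      # 105 features, 1000 training examples
X = np.random.rand(nx, m)              # each column is one example
Y = np.random.randint(0, 2, size=(1, m))
w = np.zeros((nx, 1))
b = 0.0
alpha = 0.1

Z = np.dot(w.T, X) + b                 # (1, m): all linear outputs at once
A = sigmoid(Z)                         # (1, m): all predictions at once
J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

dZ = A - Y                             # (1, m)
dw = np.dot(X, dZ.T) / m               # (nx, 1): gradients for every weight
db = np.sum(dZ) / m

w -= alpha * dw                        # vectorized parameter update
b -= alpha * db
print(J)
```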

Now we understand how to implement logistic regression as a basic neural network, with all the necessary components. We have discussed it only briefly here, and you can explore more using online sources. The main components are the forward computation, which calculates the cost, and the backward computation, where gradient descent finds the right weights and bias and reduces the loss.

Happy learning!!!!!


