Gradient Descent Implementation in Python

Palash Prashant Thakur
5 min read · Jan 18, 2021

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.

Source: Wikipedia

Generalized Explanation

Optimization algorithms are broadly classified into two categories:

  1. Discrete Optimization: In discrete optimization, the variable to be updated is itself discrete, i.e. variables take on values from a discrete set, often a subset of the integers. Such problems are typically solved with techniques such as dynamic programming.
  2. Continuous Optimization: In continuous optimization, models contain variables that can take on any real value. Such problems are typically solved with gradient-based methods such as gradient descent.

Here we are going to focus on how to implement gradient descent using Python.

Let's move forward with an example of a very simple linear predictor. Consider a straight line Y = w*x + b, where x is the feature, w is the weight, b is the bias term, and Y is the output variable. The more generalized form of the equation with n features can be written as Y = w_0*x_0 + w_1*x_1 + w_2*x_2 + … + w_n*x_n, with feature vector x = [x_0, x_1, x_2, …, x_n], where x_0 is taken to be 1, and weight vector w = [w_0, w_1, w_2, …, w_n].

We will consider only one input variable for simplicity. The line is given by Y = 2*x - 1, where x can be any real number, so the true weight vector is w = [2, -1], acting on the feature vector [x, 1].

Step 1: Creating the Dataset

We will create a dataset of 1000 samples with values of x between 0 and 19. For this task we are going to use the numpy library. You can import numpy as follows:
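
import numpy as np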

Next we will define the true value of w, which is [2, -1]. Then for each random value of x we will compute the corresponding value of y. The random values of x are generated using np.random.randint(20, size=1).
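
Putting this together, the dataset can be generated like this (the same code appears again in the full listing at the end):

points = []
w_true = [2, -1]
for i in range(1000):
    temp = np.random.randint(20, size=1)
    x = np.array([temp[0], 1])   # feature vector [x, 1]
    y = np.dot(w_true, x)        # label y = 2*x - 1
    points.append((x, y))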

The data looks something like this: each entry of points pairs a feature vector [x, 1] with its label y, for example (array([13, 1]), 25), since 2*13 - 1 = 25.

Step 2: Now we need to initialize some value of the w vector which will be used for the initial prediction. You can choose any starting value of w; here I am choosing w to be the zero vector. With this initial value of w we will make predictions. Let's call our prediction function h(x). Since the prediction is made with an arbitrary starting value of w, there will be an error, which is measured by a loss function L(w).
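
A minimal sketch of this step is shown below; note that in the full listing at the end the prediction np.dot(w, x) is written inline inside the loss rather than through a separate h function:

w = np.zeros(2)          # initial weight vector [0, 0]

def h(x, w):
    return np.dot(w, x)  # prediction for one feature vector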

The problem is a continuous optimization problem, hence the loss function is taken to be the MSE (Mean Squared Error). For N samples, the MSE is given by:

MSE = (1/N) * Σ_i (h(x_i) - y_i)^2

For our problem, with prediction h(x) = w·x, the MSE is given as:

L(w) = (1/N) * Σ_i (w·x_i - y_i)^2

To implement this, we will define the loss function in Python:
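
def F(w):
    # mean squared error over the whole dataset
    return sum((np.dot(w, x) - y)**2 for x, y in points) / len(points)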

Step 3: Now the optimization comes into the picture. We will start with the initial weight vector w and compute the loss function over the entire dataset. Next we will compute the gradient of the loss function with respect to each weight value. The derivative of the above loss function is:

∇L(w) = (2/N) * Σ_i (w·x_i - y_i) * x_i

The function can be implemented in Python as:
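
def dF(w):
    # gradient of the MSE loss with respect to w
    return sum(2 * (np.dot(w, x) - y) * x for x, y in points) / len(points)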

Step 4: Now it's time to update the weights w so as to minimize the loss function. The update is done using the update rule:

w = w - alpha * ∇L(w)

Here alpha is the learning rate, which controls how far the weights move in the direction of the negative gradient at each step.
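
In code, a single update step looks like this, assuming gradient holds the value returned by dF(w):

w = w - alpha * gradient   # step in the direction of the negative gradient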

Step 5: Finally we will create a gradient descent function that ties all of the above together. We will update the weights 1000 times using the update rule. The gradient descent function can be implemented as follows:
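
def GradientDescent(F, dF):
    w = np.zeros(2)      # initial weight vector [0, 0]
    alpha = 0.01         # learning rate
    for i in range(1000):
        value = F(w)
        gradient = dF(w)
        w = w - alpha * gradient
        print(f'Loss after iteration {i} is: {value} and w: {w}')

Initializing w as the zero vector np.zeros(2), rather than the scalar 0, ensures that F(w) returns a scalar loss from the very first iteration.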

The Entire Code is Given Below:

# Gradient Descent Algorithm
import numpy as np

# Step 1: create the dataset
points = []
w_true = [2, -1]
for i in range(1000):
    temp = np.random.randint(20, size=1)
    x = np.array([temp[0], 1])   # feature vector [x, 1]
    y = np.dot(w_true, x)        # label y = 2*x - 1
    points.append((x, y))

# Step 2: loss function (MSE over the whole dataset)
def F(w):
    return sum((np.dot(w, x) - y)**2 for x, y in points) / len(points)

# Step 3: gradient of the loss with respect to w
def dF(w):
    return sum(2 * (np.dot(w, x) - y) * x for x, y in points) / len(points)

# Steps 4 and 5: repeatedly update the weights
def GradientDescent(F, dF):
    w = np.zeros(2)      # initial weight vector [0, 0]
    alpha = 0.01         # learning rate
    for i in range(1000):
        value = F(w)
        gradient = dF(w)
        w = w - alpha * gradient
        print(f'Loss after iteration {i} is: {value} and w: {w}')

GradientDescent(F, dF)

Output: You can observe in the output that the loss function approaches zero and the weight vector w converges to values very close to the true value [2, -1].

This algorithm is also known as vanilla (batch) gradient descent. Since it computes the gradient over the entire dataset at every update, it is very slow for very large datasets and may take a long time to converge. To overcome this problem we use “Stochastic Gradient Descent”, which I will discuss in the next story.

Since this is my first story, I heartily welcome any suggestions. Thank you so much. 😄
