Continuing from the previous post, I'll explain an algorithm to minimize the cost function of a linear regression problem.
You can find a sample implementation of this algorithm at https://github.com/rkaippully/mach-learn/blob/master/linearRegression.m. You need GNU Octave to run this program.
You might have noticed that the cost function J(θ) is a function of θ. So if we plot J(θ) against θ, we'll get a figure like this:

[plot of J(θ) against θ]

Here we see that there is a value of θ for which J(θ) is minimum. At that point, the slope of the tangent to the cost function is zero. In other words, the derivative of the cost function is zero at that point.

Let us say we start with a random value of θ and compute the derivative of J(θ) at this value. The cost function J(θ) decreases fastest if we move from θ in the direction of the negative of the derivative. After taking a small step in that direction, we get a new value of θ. Then we repeat the process: compute the derivative and take another small step. Eventually, we reach a point where the derivative is zero (or very close to zero), and taking further steps no longer changes the value of the cost function.

This algorithm is known as gradient descent.
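The loop described above fits in a few lines of code. The post's own implementation is in Octave (linked below); purely as an illustration, here is a minimal Python sketch on a made-up one-variable cost J(θ) = (θ − 3)², whose minimum sits at θ = 3:

```python
# Toy cost function J(theta) = (theta - 3)^2 and its derivative.
def cost(theta):
    return (theta - 3.0) ** 2

def derivative(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # start from an arbitrary value
alpha = 0.1   # learning rate: size of each step

# Repeatedly take a small step in the direction of the negative derivative.
for _ in range(100):
    theta = theta - alpha * derivative(theta)

# theta has now moved very close to 3, where the derivative is zero
# and the cost no longer decreases.
print(theta)
```

Each iteration shrinks the distance to the minimum, so the steps naturally get smaller as the slope flattens out.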
Using formal notation, we repeatedly update θ such that:

θ := θ − α · dJ(θ)/dθ

Here α is called the learning rate. It controls how big a step we take while moving towards the optimal value of θ and is usually set to a small number.
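The choice of α matters: too small and convergence is slow, too large and the steps overshoot the minimum so the cost grows instead of shrinking. A quick illustration on a made-up quadratic cost J(θ) = (θ − 3)², with numbers invented for the example:

```python
def step(theta, alpha):
    # One gradient descent update for J(theta) = (theta - 3)^2,
    # whose derivative is 2 * (theta - 3).
    return theta - alpha * 2.0 * (theta - 3.0)

def run(alpha, iters=50):
    theta = 0.0
    for _ in range(iters):
        theta = step(theta, alpha)
    return theta

print(abs(run(0.1) - 3.0))   # small alpha: theta ends up very close to the minimum
print(abs(run(1.5) - 3.0))   # alpha too large: each step overshoots and theta blows up
```

With α = 0.1 the error shrinks by a constant factor every step; with α = 1.5 each update lands on the far side of the minimum, further away than before, so the iterates diverge.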
We have to use partial derivatives when we have more than one feature. The gradient descent step in that case will be:

θ_j := θ_j − α · ∂J(θ)/∂θ_j   (for every j, with all θ_j updated simultaneously)
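In the multi-feature case the key detail is that every component of θ is updated from the same old θ, not one after another. The post's actual implementation is the Octave program linked above; as a sketch only, here is the same idea in Python/NumPy for linear regression with a squared-error cost, on a small made-up data set (generated from θ = [1, 2, 1] so we know the answer):

```python
import numpy as np

# Made-up training data: 4 examples, 2 features, plus a leading column
# of ones for the intercept term theta_0.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
y = np.array([5.0, 5.0, 8.0, 12.0])   # y = 1 + 2*x1 + 1*x2 for each row
m = len(y)

theta = np.zeros(3)   # start from theta = 0
alpha = 0.05          # learning rate

for _ in range(5000):
    predictions = X @ theta
    # Vector of partial derivatives of the squared-error cost,
    # one component per theta_j.
    gradient = (X.T @ (predictions - y)) / m
    # All components of theta are updated together (simultaneous update).
    theta = theta - alpha * gradient

print(theta)   # close to [1, 2, 1], the coefficients used to build y
```

Computing the whole gradient vector at once, instead of looping over j, is exactly what makes the update simultaneous.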
Here is a plot showing the linear fit to the training data set:

[plot: training data with the fitted regression line]

Here is a plot showing how the cost function approaches zero as we perform more and more iterations of gradient descent:

[plot: J(θ) against the number of iterations]




