In the last post I mentioned the different types of machine learning algorithms. In this post I'll explain one type of supervised learning algorithm called regression.
Let me start with an example. I have some data about the Mozilla Firefox project showing how the code base grew over time. This data was taken from Ohloh. The graph below shows how the number of lines of code grew from June 2008 to November 2010: the X axis shows the number of months since June 2008, and the Y axis shows the number of lines of code in millions.
Let us try to use this data to predict how the code size will grow over time, beyond November 2010. This is an example of supervised learning because we are given a set of features (in this case only one feature - the number of months since June 2008) and a labelled training set (in this case the label is the number of lines of code).
As you can see from the graph, the size of the code base increases over time roughly linearly - i.e. we can fit a straight line that closely matches the data points. If we find the best-fitting line, we can extrapolate it to predict the code base size for values of X beyond the graph. The machine learning algorithm's job is to find the line that best fits the given training set. This type of supervised learning problem is called regression because the label to be predicted is a continuous value. In this specific case we are using linear regression because we are attempting to fit a straight line to the data.
To solve this problem, I'll establish some terminology first. I define m to be the number of entries in the training set; there are 30 data points in this example, so m = 30. In this example we have a single feature in the data - the number of months since June 2008. We can collect all m values of this feature into a vector; let us name that vector X:

X = [x(1), x(2), ..., x(m)]

where x(i) represents the value of the feature for the ith training example - the number of months since June 2008.
Similarly, I am going to represent the labels with a vector Y:

Y = [y(1), y(2), ..., y(m)]

where y(i) represents the label for the ith training example - the number of lines of code.
Let us say the best fitting line is represented by the equation:

hθ(x) = θ0 + θ1x

where θ0 and θ1 are constants whose values the machine learning algorithm will determine. This function hθ(x) is called the hypothesis function.
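As a quick sketch, the hypothesis function hθ(x) = θ0 + θ1x is just a line. In Python it looks like this; the θ values below are made up for illustration, since the fitted values come from the learning algorithm later:

```python
def hypothesis(theta0, theta1, x):
    """Predicted lines of code (in millions) after x months."""
    return theta0 + theta1 * x

# Hypothetical parameters: 4.5 million lines at month 0,
# growing by 0.08 million lines per month.
print(hypothesis(4.5, 0.08, 10))  # prediction at month 10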
In order to fit this line to the data, the distance between the line and the data points in the training set should be as small as possible. We want to find θ0 and θ1 such that the difference between y(i) and θ0 + θ1x(i) is minimal. In other words, the objective of the machine learning algorithm is to find θ0 and θ1 that minimize:

J(θ0, θ1) = (1/2m) · Σ from i=1 to m of (θ0 + θ1x(i) - y(i))²

J(θ) is called the cost function, and the objective of the machine learning algorithm is to minimize it. As you can see, we are minimizing the mean squared error to find the best fit. I have added a factor of 1/2 to simplify some of the upcoming calculations; it does not affect the answer - since we are minimizing the expression, we will get the same values of θ0 and θ1 even if we multiply the whole expression by 1/2.
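The cost function can be sketched directly in code. The data points below are illustrative, not the actual Firefox figures:

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) with the extra 1/2 factor."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [0, 1, 2, 3]          # months since June 2008 (illustrative)
ys = [4.5, 4.6, 4.7, 4.8]  # lines of code in millions (illustrative)

print(cost(4.5, 0.1, xs, ys))  # near zero: this line passes through every point
print(cost(4.5, 0.2, xs, ys))  # larger: a worse fit costs more
```

A θ that fits the points exactly drives the cost to (essentially) zero, which is why minimizing J gives the best-fitting line.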
Before we see how to minimize the cost function, let us generalize the formula. In this example we had only one feature in the training set, but most machine learning problems have multiple features, so I'll update the terminology to handle that scenario. Let us say n is the number of features. Then X becomes an m×n matrix where each row represents a training example and each column represents a feature. The hypothesis function becomes:

hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn

where x0 = 1 and xi is the value of the ith feature.
We can represent this concisely using matrix notation:

hθ(x) = θᵀx

where θ = [θ0, θ1, ..., θn] and x(i) = [x0(i), x1(i), ..., xn(i)]. Note that the subscripts in the feature vector denote the features and the superscripts denote the training example number, and x0(i) = 1 for all values of i.
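For concreteness, here is one way to assemble such a matrix with the constant x0 = 1 column, using numpy. The second feature (contributor count) is invented purely for illustration:

```python
import numpy as np

months = np.array([0.0, 1.0, 2.0])           # feature 1: months since June 2008
contributors = np.array([50.0, 52.0, 55.0])  # feature 2: hypothetical extra feature

# Prepend the constant x0 = 1 column so that theta0 acts as the intercept.
X = np.column_stack([np.ones(len(months)), months, contributors])
print(X.shape)  # (3, 3): m = 3 examples, one column per feature plus the ones column
```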
With the matrix notation, the machine learning objective becomes finding the θ that minimizes:

J(θ) = (1/2m) · (Xθ - Y)ᵀ(Xθ - Y)

where X here includes the constant x0 = 1 column.
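The vectorized cost can be sketched in numpy-style Python (again with illustrative numbers rather than the real Firefox data):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])      # ones column, then months
y = np.array([4.5, 4.6, 4.7])   # lines of code in millions (illustrative)
theta = np.array([4.5, 0.1])    # candidate parameters

m = len(y)
residual = X @ theta - y        # h_theta(x) - y for every training example at once
J = residual @ residual / (2 * m)
print(J)  # near zero, since this theta fits the points exactly
```

One matrix multiply replaces the per-example loop, which is what makes this form attractive on hardware that parallelizes matrix operations.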
Modern CPUs can parallelize matrix operations, so this vectorized form can be implemented efficiently in tools like GNU Octave.
Hopefully this post gave you an idea of how linear regression works. In the next post I'll describe a method to minimize the cost function.