In the last post I mentioned the different types of machine learning algorithms. In this post I'll explain one type of supervised learning algorithm called regression.
Let me start with an example. I have some data about the Mozilla Firefox project showing how the code base grew over time. This data was taken from Ohloh. The graph below shows how the number of lines of code grew from June 2008 to November 2010: the X axis shows the number of months since June 2008, and the Y axis shows the number of lines of code in millions.
Let us try to use this data to predict how the code size will grow over time, beyond November 2010. This is an example of supervised learning because we are given a set of features (in this case only one feature - the number of months since June 2008) and a labelled training set (in this case the label is the number of lines of code).
As you can see from the graph, the size of the code base increases over time roughly linearly - i.e. we can fit a straight line that closely matches the data points. If we find the best-fitting line, we can extrapolate it to predict the code base size for values of X beyond the graph. The machine learning algorithm's job is to find the line that best fits the given training set. This type of supervised learning problem is called regression because the label to be predicted is a continuous value. In this specific case we are using linear regression because we are attempting to fit a straight line to the data.
To solve this problem, I'll establish some terminology first. I define m to be the number of entries in the training set; there are 30 data points in this example, so m = 30. In this example we have a single feature in the data - the number of months since June 2008. We can collect all m values of this feature into a vector; let us name that vector X:

X = [x(1), x(2), ..., x(m)]

where x(i) represents the value of the feature for the ith training example - the number of months since June 2008.
Similarly, I am going to represent the labels with a vector Y:

Y = [y(1), y(2), ..., y(m)]

where y(i) represents the label for the ith training example - the number of lines of code.
Let us say the best fitting line is represented by the equation:

hθ(x) = θ0 + θ1x

where θ0 and θ1 are constants whose values the machine learning algorithm will determine. This function hθ(x) is called the hypothesis function.
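As a quick sketch, the hypothesis function hθ(x) = θ0 + θ1x is just a line. In Python it looks like this; the θ values below are made up for illustration, since the fitted values come from the learning algorithm later:

```python
def hypothesis(theta0, theta1, x):
    """Predicted lines of code (in millions) after x months."""
    return theta0 + theta1 * x

# Hypothetical parameters: 4.5 million lines at month 0,
# growing by 0.08 million lines per month.
print(hypothesis(4.5, 0.08, 10))  # prediction at month 10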
In order to fit this line to the data, the distance between the line and the data points in the training set should be as small as possible. We want to find θ0 and θ1 such that the difference between y(i) and θ0 + θ1x(i) is minimal. In other words, the objective of the machine learning algorithm is to find θ0 and θ1 that minimize:

J(θ0, θ1) = (1/2m) · Σ from i=1 to m of (θ0 + θ1x(i) - y(i))²

J(θ) is called the cost function, and the objective of the machine learning algorithm is to minimize it. As you can see, we are minimizing the mean squared error to find the best fit. I have added a factor of 1/2 to simplify some of the upcoming calculations; it does not affect the answer - since we are minimizing the expression, we will get the same values of θ0 and θ1 even if we multiply the whole expression by 1/2.
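The cost function can be sketched directly in code. The data points below are illustrative, not the actual Firefox figures:

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) with the extra 1/2 factor."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [0, 1, 2, 3]          # months since June 2008 (illustrative)
ys = [4.5, 4.6, 4.7, 4.8]  # lines of code in millions (illustrative)

print(cost(4.5, 0.1, xs, ys))  # near zero: this line passes through every point
print(cost(4.5, 0.2, xs, ys))  # larger: a worse fit costs more
```

A θ that fits the points exactly drives the cost to (essentially) zero, which is why minimizing J gives the best-fitting line.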
Before we see how to minimize the cost function, let us generalize the formula. In this example we had only one feature in the training set, but most machine learning problems have multiple features, so I'll update the terminology to handle that scenario. Let us say n is the number of features. Then X becomes an m×n matrix where each row represents a training example and each column represents a feature. The hypothesis function becomes:

hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn

where x0 = 1 and xi is the value of the ith feature.
We can represent this concisely using matrix notation:

hθ(x) = θᵀx

where θ = [θ0, θ1, ..., θn] and x(i) = [x0(i), x1(i), ..., xn(i)]. Note that the subscripts in the feature vector denote the features and the superscripts denote the training example number, and x0(i) = 1 for all values of i.
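For concreteness, here is one way to assemble such a matrix with the constant x0 = 1 column, using numpy. The second feature (contributor count) is invented purely for illustration:

```python
import numpy as np

months = np.array([0.0, 1.0, 2.0])           # feature 1: months since June 2008
contributors = np.array([50.0, 52.0, 55.0])  # feature 2: hypothetical extra feature

# Prepend the constant x0 = 1 column so that theta0 acts as the intercept.
X = np.column_stack([np.ones(len(months)), months, contributors])
print(X.shape)  # (3, 3): m = 3 examples, one column per feature plus the ones column
```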
With the matrix notation, the machine learning objective becomes finding the θ that minimizes:

J(θ) = (1/2m) · (Xθ - Y)ᵀ(Xθ - Y)

where X here includes the constant x0 = 1 column.
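The vectorized cost can be sketched in numpy-style Python (again with illustrative numbers rather than the real Firefox data):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])      # ones column, then months
y = np.array([4.5, 4.6, 4.7])   # lines of code in millions (illustrative)
theta = np.array([4.5, 0.1])    # candidate parameters

m = len(y)
residual = X @ theta - y        # h_theta(x) - y for every training example at once
J = residual @ residual / (2 * m)
print(J)  # near zero, since this theta fits the points exactly
```

One matrix multiply replaces the per-example loop, which is what makes this form attractive on hardware that parallelizes matrix operations.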
Modern CPUs can parallelize matrix operations, so this vectorized form can be implemented efficiently in tools like GNU Octave.
Hopefully this post gave you an idea of how linear regression works. In the next post I'll describe a method to minimize the cost function.