Thursday, November 29, 2012

Machine Learning - Part 4: Gradient Descent

Continuing from the previous post, I'll explain an algorithm to minimize the cost function of a linear regression problem.

You might have noticed that the cost function - J(θ) - is a function of θ. So if we plot J(θ) against θ, we'll get a figure like this:



Here we see that there is a value of θ for which J(θ) is minimum. At that point, the slope of the tangent of the cost function is zero. In other words, the derivative of the cost function is zero at that point.

Let us say we start with a random value of θ and compute the derivative of J(θ) at that value. The cost function J(θ) decreases fastest if we move θ in the direction of the negative derivative. After taking a small step in that direction we get a new value of θ, and we can repeat the process: compute the derivative, take another small step. Eventually we reach a point where the derivative is zero (or very close to zero) and further steps no longer change the value of the cost function.

This algorithm is known as gradient descent.

Using formal notation, we repeatedly update θ until convergence:

θ := θ − α · dJ(θ)/dθ

Here α is called the learning rate. It denotes how big a step we take while moving towards the optimal value of θ and is usually set to a small number. If α is too small, convergence is slow; if it is too large, the steps can overshoot the minimum and the algorithm may fail to converge.
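To see why α is usually set to a small number, here is a quick sketch in plain Python (my own illustration, not from the post), using the simple cost J(θ) = θ², whose derivative is 2θ:

```python
def gradient_descent(theta, alpha, steps):
    """Repeatedly step against the derivative of J(theta) = theta^2."""
    for _ in range(steps):
        theta = theta - alpha * (2 * theta)  # derivative of theta^2 is 2*theta
    return theta

# A small learning rate walks steadily down to the minimum at theta = 0,
small_alpha = gradient_descent(10.0, alpha=0.1, steps=100)
# while a learning rate that is too large overshoots further on every step.
large_alpha = gradient_descent(10.0, alpha=1.1, steps=100)
```

With alpha = 0.1 each update shrinks θ by a factor of 0.8, so it approaches zero; with alpha = 1.1 each update multiplies θ by −1.2, so the values grow without bound.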

We have to use partial derivatives when we have more than one feature. The gradient descent step in that case updates every θj simultaneously:

θj := θj − α · ∂J(θ)/∂θj

You can find a sample implementation of this algorithm at https://github.com/rkaippully/mach-learn/blob/master/linearRegression.m. You need GNU Octave to run this program.
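If you do not have Octave handy, the same algorithm can be sketched in plain Python (this is my own illustrative version, not the linked implementation):

```python
def fit_line(xs, ys, alpha=0.01, iterations=5000):
    """Fit h(x) = t0 + t1*x by batch gradient descent on the
    cost J = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    t0, t1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors for the current parameters.
        errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to t0 and t1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        t0 -= alpha * grad0
        t1 -= alpha * grad1
    return t0, t1

# Noise-free data on the line y = 1 + 2x; the fit should recover it closely.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
t0, t1 = fit_line(xs, ys)
```

Because the data lies exactly on a line, the recovered t0 and t1 converge to 1 and 2; on noisy data they would instead converge to the least-squares fit.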

Here is a plot showing the linear fit to the training data set:



Here is a plot showing how the cost function decreases towards its minimum as we run more and more iterations of gradient descent:



Monday, November 19, 2012

Machine Learning - Part 3: Introduction to Linear Regression

In the last post I described the different types of machine learning algorithms. In this post I'll explain one type of supervised learning algorithm called regression.

Let me start with an example. I have some data about the Mozilla Firefox project showing how the code base grew over time. This data was taken from Ohloh. The graph below shows how the number of lines of code grew from June 2008 to November 2010. On the X axis we have the number of months since June 2008; on the Y axis, the number of lines of code in millions.


Let us try to use this data to predict how the code size will grow over time, beyond November 2010. This is an example of supervised learning because we are given a set of features (in this case only one feature - the number of months since June 2008) and a labelled training set (in this case the label is the number of lines of code).

As you can see from the above graph, the size of the code base increases over time somewhat linearly - i.e. we can fit a straight line to this data that closely matches the data points. If we find the best-fitting line, we can extrapolate it to predict the code base size for values of X not shown in the graph. The machine learning algorithm's job is to find the best-fitting line for the given training set. This type of supervised learning problem is called regression because the label to be predicted is a more or less continuous function. In this specific case we are using linear regression because we are attempting to fit a straight line to the data.

To solve this problem, I'll establish some terminology first. I define m to be the number of entries in the training set. I have 30 data points in this example, so m = 30. In this example we have a single feature in the data - the number of months since June 2008. We can collect all m values of this feature into a vector; let us name that vector X. In other words, X is the column vector:

X = (x(1), x(2), ..., x(m))

where x(i) represents the value of the feature for the ith training example - the number of months since June 2008.

Similarly, I am going to represent the labels with a vector Y:

Y = (y(1), y(2), ..., y(m))

where y(i) represents the label for the ith training example - the number of lines of code.

Let us say the best fitting line is represented by the equation:

hθ(x) = θ0 + θ1x

where θ0 and θ1 are constants whose values the machine learning algorithm will determine. This function hθ(x) is called the hypothesis function.

In order to fit this line to the data, the distance between the line and the data points in the training set should be as small as possible. We want to find θ0 and θ1 such that the difference between y(i) and θ0 + θ1x(i) is minimal. In other words, the objective of the machine learning algorithm is to find θ0 and θ1 that minimize:

J(θ) = (1/(2m)) · Σ (hθ(x(i)) − y(i))², the sum running over i = 1, ..., m

J(θ) is called the cost function and the objective of the machine learning algorithm is to minimize the cost function.

As you can see, we are minimizing the mean squared error to find the best fit. I have added a factor of 1/2 to simplify some of the later calculations. This does not affect the result: multiplying the whole expression by 1/2 does not change the values of θ0 and θ1 that minimize it.
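We can check that the 1/2 factor does not change the minimizing θ with a small Python sketch (illustrative; the data and candidate parameter values are made up):

```python
def cost(t0, t1, xs, ys, half=True):
    """Squared-error cost of the line h(x) = t0 + t1*x; optionally halved."""
    m = len(xs)
    total = sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m) if half else total / m

xs = [0, 1, 2]
ys = [1, 3, 5]  # exactly on the line y = 1 + 2x
candidates = [(0.0, 2.0), (1.0, 2.0), (1.0, 1.5)]
best_halved = min(candidates, key=lambda t: cost(*t, xs, ys, half=True))
best_plain = min(candidates, key=lambda t: cost(*t, xs, ys, half=False))
# Both versions pick the same parameters: halving scales every candidate's
# cost by the same factor, so the minimizer is unchanged.
```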

Before we see how to minimize the cost function, let us generalize the formula. In this example we had only one feature in the training set, but most machine learning problems have multiple features. I'll update the terminology to handle that scenario. Let us say n is the number of features. Then X will be an m×(n+1) matrix where each row represents a training example and each column represents a feature (the extra column holds the constant x0 = 1 introduced below). The hypothesis function will be:

hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn where x0 = 1 and xi is the value of the ith feature. 

We can represent this concisely using matrix notation:

hθ(x) = θᵀx

where θ is the vector (θ0, θ1, ..., θn) and x is the feature vector (x0, x1, ..., xn).

Note that the subscripts in the feature vector denote the features and superscripts denote the training example number. And x0(i) = 1 for all values of i.

With the matrix notation, the machine learning objective becomes: find θ that minimizes

J(θ) = (1/(2m)) · (Xθ − Y)ᵀ(Xθ − Y)

where X includes the x0 = 1 column.

Modern CPUs can parallelize matrix operations, so this version can be implemented more efficiently in tools like GNU Octave.
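As a sanity check, the matrix form gives exactly the same value as the summation form. Here is a plain-Python sketch of both (a real implementation would use Octave's or NumPy's built-in matrix operations; the numbers below are made up):

```python
def cost_sum(theta, X, Y):
    """J(theta) via the summation form: (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(X)
    return sum((sum(t * x for t, x in zip(theta, row)) - y) ** 2
               for row, y in zip(X, Y)) / (2 * m)

def cost_matrix(theta, X, Y):
    """J(theta) via the matrix form: (1/2m) * (X*theta - Y)'(X*theta - Y)."""
    m = len(X)
    # Residual vector X*theta - Y, one entry per training example.
    residual = [sum(t * x for t, x in zip(theta, row)) - y
                for row, y in zip(X, Y)]
    return sum(r * r for r in residual) / (2 * m)

# Two features plus the constant x0 = 1 column, so X is m x (n+1).
X = [[1, 2, 3], [1, 4, 5], [1, 6, 1]]
Y = [10.0, 20.0, 15.0]
theta = [1.0, 2.0, 0.5]
```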

Hopefully, this post gave you an idea about linear regression. In the next post I'll describe a method to minimize the cost function.

Tuesday, November 13, 2012

Machine Learning - Part 2: Classification of Algorithms

In this part of the series I'll explain the classification of machine learning algorithms.

 

Supervised Learning

In the previous post I mentioned that a machine learning algorithm gets better at performing tasks as it learns from experience. Let us take an example: e-mail spam detection. Suppose we have collected a large number of e-mails from users, and they have marked each one as spam or not spam. We want an algorithm that learns from these e-mails to detect spam automatically.

As a first step, we extract "features" from the e-mails. These could be the words in the content of the e-mail, the from address, the to address, other headers, and so on - anything that will help in detecting spam.
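As a sketch, a feature extractor might turn each e-mail into a dictionary of such features. The e-mail fields and feature names below are my own illustrative choices, not a fixed standard:

```python
def extract_features(email):
    """Turn a raw e-mail (headers + body) into a feature dictionary."""
    features = {}
    # Header-based features.
    features["from"] = email["from"]
    features["to"] = email["to"]
    # One boolean feature per word appearing in the body.
    for word in email["body"].lower().split():
        features["contains:" + word] = True
    return features

msg = {"from": "winner@example.com", "to": "me@example.com",
       "body": "You WON a free prize"}
feats = extract_features(msg)
```

A learning algorithm then works only with these feature dictionaries and the spam/not-spam labels, never with the raw e-mails.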

So, the inputs to our machine learning algorithm are this set of features and a label - the spam or not-spam flag - assigned to each e-mail. The machine learning algorithm learns from these inputs and comes up with a function that can take the features of an e-mail and predict whether it is spam. We can use this function on future e-mails to detect spam.

This kind of machine learning problem is called supervised learning. We provide a training data set containing both inputs and expected outputs (labels). The supervised learning algorithm learns from the training data and produces a function. The function should predict the correct output value for any valid input object.

 

Unsupervised Learning

Unsupervised learning is a process to find hidden structure in unlabelled data.

Consider DNA sequence analysis of humans. Unsupervised learning algorithms can find degree of similarity of genetic structures between individuals and group them into population structures. 

We do not provide labelled data in this case; the training set consists of only the genetic data. The job of the machine learning algorithm is to automatically find patterns in the features and partition the data into groups.
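One concrete example of such an algorithm is k-means clustering. The post does not prescribe a particular method, but a minimal 1-D sketch in Python (with made-up data points) looks like this:

```python
def kmeans_1d(points, k=2, iterations=20):
    """Group unlabelled 1-D points into k clusters by alternating between
    assigning each point to its nearest center and re-centering."""
    pts = sorted(points)
    # Spread the initial center guesses across the sorted data.
    centers = [pts[i * len(pts) // k] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups, around 1.0 and around 10.0; no labels are given.
points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.8]
centers, clusters = kmeans_1d(points)
```

The algorithm discovers the two groups purely from the structure of the data, which is exactly the unsupervised setting described above.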

 

Semi-supervised Learning

In some cases it may be relatively easy to get unlabelled data compared to labelled data, so we may end up with a lot of unlabelled data and only some labelled data. Combining the two can considerably improve learning accuracy.

In the next part of this series I'll explain more about supervised learning.

Saturday, November 3, 2012

Machine Learning - Part 1: Overview


I plan to start a tutorial series on machine learning. It is a topic I am interested in, and I hope it will help others too.

The series will cover introductory topics; no machine learning background is assumed, though some basic familiarity with linear algebra will help. I'll also cover some programming examples using GNU Octave. If you are not familiar with it, a good tutorial is available here.

Definition

So, what is machine learning? It is a broad area, and it is something of a challenge to come up with a good definition. It is easier to go through an example to understand the kind of problems machine learning solves, and then come up with a definition.

Consider the game of checkers. It is a board game with relatively simple rules. There were many attempts - as early as the 1950s - to develop algorithms to make computers play checkers.

So how can we teach a computer to play such a game? One simplistic approach is to record every single board position and the corresponding best move the computer should play. However, this soon proves impractical: there are too many board positions, and it may not be possible to store all of them in memory. An improved version would be to record "patterns" - group similar board positions into one pattern and have the program know the best strategy for each pattern. However, even this approach breaks down in more complex games: there are just too many patterns, a human may not be able to list all of them correctly, and in any case it would take a huge effort.

Arthur Lee Samuel was one of the pioneers of artificial intelligence, and he found a way to tackle this problem. Instead of explicitly programming the computer to handle every single move in the game, he allowed it to "learn" by itself. He let the program play thousands of games against itself and discover the hidden patterns/features of the game. The program got better and better and eventually reached a respectable enough skill level to challenge human players. This approach is closer to how humans learn such games: instead of remembering every single board position, we recognize recurring patterns in the game and strategies to handle them. Arthur Samuel defined machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed".

A more formal definition was given by Tom M. Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E". Thus the machine learning program becomes better and better in performing tasks as it learns from experiences.

To put it differently: say we want to perform some task using a computer. A machine learning program will come up with an algorithm to perform that task. The initial algorithm will not perform very well, but as we feed more "experiences" (whatever that means - we'll see more on this later) to the machine learning program, it tunes the output algorithm to achieve better and better performance.

Some applications of machine learning are:
  • Medical diagnosis
  • Fraud detection
  • Speech and handwriting recognition
  • Recommender systems
  • Spam detection
  • Autonomous driving
I will discuss some of these applications in detail in future posts.

Hopefully, this post gave you a high level overview of machine learning. I'll explain classification of machine learning algorithms in the next post.