In this post, I’ll be going over the theory behind a machine learning system called the Support Vector Machine (also known as ‘SVM’). This post assumes very little prior knowledge, and it won’t do some hand-wavy ‘solve the optimization function here’ assertion; we’ll go into detail about how this works.

There is some required knowledge; this post will assume knowledge of basic calculus (just derivatives!) and some very basic linear algebra. If anything is unclear, feel free to add it in the comments!

Now, let’s get started.

First, what is machine learning? Machine learning is the practice of teaching computers the ability to detect patterns and derive meaning from the input data. Sometimes, they are used to predict the outcome, given a previously unseen input. Other times, machine learning can be used to understand the input data and show patterns. There are plenty of practical applications of machine learning, from speech recognition, image processing, computational linguistics, stock market predictions, etc.

Terminology

Before we go further, there is some terminology we need to establish. In general, machine learning systems have two phases in them:

  1. Training phase
    • This is the phase where we ‘teach’ the machine how to understand the data. We generally call the data that is used in this as ‘training data’.
  2. Calculation phase
    • This is the phase where we try to make a prediction or show a pattern of the data. The data we use in this phase is known as ‘test data’.

For most experiments, the training phase needs to complete before the calculation phase begins.

Machine learning algorithms are also categorized by their learning mechanism. There are two types:

  1. Supervised learning
    • A supervised learning algorithm is one where the training data has the expected result. This way, you can train the system to try to make a prediction based on the training data.
  2. Unsupervised learning
    • Unsupervised learning algorithms are algorithms whose training data don’t have the expected results. 1

The SVM algorithm is an example of a supervised learning algorithm.

Lastly, SVMs are examples of classifiers. A classifier is an algorithm that, given a bunch of input about a specific data set, determines if it is of one type or another. For an example, an SVM can be built to classify the color of a pixel on an image (where the ‘classes’ are the different colors like red, blue, etc.). For this series, we’ll talk about SVMs as binary classifiers (classifiers that only have two classes).

Perceptrons

Before we dive into the details of support vector machines, we’ll explore a super type of support vector machines: perceptrons. In other words, SVMs are a type of perceptron (imagine ML algorithms as ‘food2’. You could say perceptrons are like ‘fruit’, as a type of food. SVMs are oranges as a type of fruit.)

Perceptrons are a generic type of machine learning algorithm that use training data to come up with a function F(…) to do binary classification. When you take test data x and apply F(…) to it (notated as F(x)), the result will be the expected result. Since SVMs are binary classifiers, the expected result is either one result (positive) or the other (negative). We can then say that all positive results are ones where F(x) >= 0, and all negative results are where F(x) < 0.

Let’s look at a simple example to clarify what I’m saying. Let’s have a ML problem where, given a grade score (0 - 100), we want to know if the grade is considered an A. Some training data for this learning algorithm would be:

Input (x) Expected Result (y)
90 true
95 true
15 false
85 false
89 false

Thus, the perceptron function F(x) should, given a grade, determine if the grade is an A or not.

Effectively, the function is:

F(x):
	return x - 90;

Trying this out on the data set:

  • F(90) = 0 (since F(x) >= 0 means it is true, this conforms with the training data)
  • F(89) = -1 (since F(x) < 0 means it is false, this conforms as well)
  • F(85) = -5 (since it’s negative, it is false, as expected)
  • F(95) = 5 (since it’s positive, it is true, as expected)

Now some of you might think, ‘well, it’s really simple to see the correct function’. For a simple one dimensional problem like this, it’s really easy to find the function. However, for real data sets with thousands of features and millions of training data entries, it’s a lot harder to visually inspect the problem for a solution.

The function F in a perceptron is defined by the following:

W * x + b = y

  • W is the weight vector that is calculated.
  • x is the input vector.
  • b is the threshold value (this is a constant that is calculated when learning).
  • y is the expected result.

The learning part of the perceptron is to figure out W and b. When testing, the caller inputs x and gets y as the expected result. Remember, W is a vector with a cardinality equal to the number of dimensions in the data set.

From the previous algorithm, W = [1] and b = -90.

Let’s look at another example. Let’s take the following 2 dimensional data:

Input (x) Expected Result (y)
(2,2) true (or +1, which is the convention to indicate positive results)
(3, -1) false (or -1, which is the convention to indicate negative results)
(4,0) false
(1,1) true
(2,1) true

After running an SVM through this training data, the weights were calculated as:

W (weights) Value
w1 -1
w2 1
B (threshold) Value
b 2

You can see this works against the dataset: w_1 * x_1 + w_2 * x_2 + b = y

-1 * 2 + 1 * 2 + 2 = 2 >= 0 (as expected).

You can try this with all the other training data points, and you’ll get the expected result.

Again, the ‘learning’ part of the perceptron is to solve for W and b. But going one step further, let’s set W * X + b = 0. w1x1 + w2x2 + b = 0 x2 = -(w1x1 + b)/w2;

As observed, this defines a line we can plot. Everything on one side of the line is considered positive, and everything on the other side is negative.

This line is a ‘hyperplane’ that defines the perceptron learning function. I know, it’s just a line, but when you expand the number of dimensions to thousands (or millions), then the ‘hyperplane’ is a lot more complex. This hyperplane is also known as the decision boundary (since this is the boundary as to whether or not the decision of the data is positive or negative).

Now, there are plenty of decision boundaries that can be drawn for the example — infinitely many lines separate the two classes. How do we determine what is the best decision boundary? We need a mathematical definition of the ‘best’ hyperplane/decision boundary. For SVMs, we define the best one as the hyperplane such that there is the maximum amount of distance between the points closest to the hyperplane and the hyperplane. The points that are closest to the decision boundary are special since they essentially ‘support’ the decision boundary (like a bridge). Since these are a set of points, they’re normally shown in the form of a vector. Hence the name, support vector machine.

Representing that criteria mathematically — and deriving the quadratic optimization problem that solves for the optimal hyperplane — is where the real work of an SVM lives. If you’d like to dig into the details, here is my source code for SVMs, including the optimization step.

1 An example of an unsupervised learning algorithm is the K-means clustering algorithm.