Implementing SVM from Scratch - in Python

July 29, 2017
HOME

Support Vector Machine is one of the most popular Machine Learning algorithms for classification in data mining. One of the prime advantages of SVM is that it works very good right out of the box. You can take the classifier in it’s generic form, without any explicit modifications, run it directly on your data and get good results. In addition to their low error rate, support vector machines are computationally inexpensive in contrast to other classification algorithms such as the K Nearest Neighbours.

Full code here

Support Vector Machine algorithm is a supervised learning algorithm, ie. it needs training data. You will have to feed the algorithm training data for it make predictions on the actual data.

How a Support Vector Machine works:

In my previous blog post, I had explained the theory behind SVMs and had implemented the algorithm with Python’s scikit learn. If you are not very familiar with the algorithm or its scikit-learn implementation, do check my previous post.

Support Vector Machines Seperating lines

Implementing a Support Vector Machine from scratch:

The implementation can be divided into the following:

  1. Handle Data: Clean the file, normalize the parameters, given numeric values to non-numeric attributes. Read data from the file and split the data for cross validation.
  2. Initialize : The heuristic for our SMO, steps sizes and multiples.
  3. Optimization: Run the SMO loop until we complete the convex optimization for the values of W and b.
  4. Stop the iteration: Stop the algorithm when the we have optimal values found.

Predict the Iris flower type :

I’ve used the classic “Iris” dataset from the UCI ML repository. We will be predicting the type of Iris based on many input parameters. The predict class is binary: “Iris-setosa” or “Iris-versicolor”.

The dataset will be divided into ‘test’ and ‘training’ samples for cross validation. The training set will be used to ‘teach’ the algorithm about the dataset, ie. to build a model. The test set will be used for evaluation of the results.

Handling Data:

I’ve modified the original data set and have changed the number of classes to 2. You can find the modified dataset here.

The original dataset has the data description and other related metadata. You can find the original dataset from the UCI ML repo here.

The first thing to do is to read the csv file. To deal with the csv data data, let’s import Pandas first. Pandas is a powerful library that gives Python R like syntax and functioning.

import pandas as pd

Now, loading the data file:

df = pd.read_csv(r".\data\iris.csv") #Reading from the data file

The first thing is to convert the non-numerical data elements into numerical formats. In this dataset, all the non-numerical elements are of Boolean type. This makes it easy to convert them to numbers. I’ve assigned the numbers ‘4’ and ‘2’ to positive and negative Boolean attributes respectively.

  df.replace('Iris-setosa', -1, inplace = True)
  df.replace('Iris-versicolor', 1, inplace = True)

In main.py:

    dataset = df.astype(float).values.tolist()
    #Shuffle the dataset
    random.shuffle(dataset) #import random for this

Next, we have split the data into test and train. In this case, I will be taking 25% of the dataset as the test set:

    #20% of the available data will be used for testing

    test_size = 0.2

    #The keys of the dict are the classes that the data is classified into

    training_set = {-1: [], 1:[]}
    test_set = {-1: [], 1: []}

Now, split the data into test and training; insert them into test and training dictionaries:

    #Split data into training and test for cross validation

    training_data = dataset[:-int(test_size \* len(dataset))]
    test_data = dataset[-int(test_size \* len(dataset)):]

    #Insert data into the training set

    for record in training_data:
			#Append the list in the dict will all the elements of the record except the class
            training_set[record[-1]].append(record[:-1])

    #Insert data into the test set

    for record in test_data:
			# Append the list in the dict will all the elements of the record except the class
            test_set[record[-1]].append(record[:-1])

Setting up the class:

Before we move forward, let’s create a class for the algorithm.

  class CustomSVM:
      def __init__(self):
          pass

We have the CustomSVM, that will be our main class for the algorithm. In the constructor, we do not have to initialize any value.

Defining Functions:

Now, let’s define the functions that go inside the CustomSVM class. For the simplicity of this tutorial, we will not delve into Advanced SVM topics such as kernels. For our simple implementation, we will only need two functions - fit and predict.

As their names suggest, the fit function will use the incoming data to model a SVM - essentially calculating the values for W (feature vector) and b (bias). If you are unfamiliar with the terminology or the theoretical fundamentals of SVMs, you can read about it here. The predict function will predict the classification for the incoming parameters, deriving it from the model we ‘fit’ from the training dataset.

First we have the fit function. We only need the pre-processed data set as a param for fit, that’s all we need to model a SVM at this point.

def fit(self, dataset):
  pass

Let’s define the predict function. The predict function take in the attributes, ie. the incoming values for which we need to make a prediction. The predict function will use the model that will by created by the fit function

def predict(self, attrs):
  pass

Let’s define the test function. The test function take in the a list of attributes, ie. the test data set for our cross validate. The test function will use the model that will by created by the fit function and use predict to predict the classes of test data. We will calculate the accuracy of the algorithm.

def test(self, attrs):
  pass

Implementing the fit function:

The fit function is the core function of our implementation. This is where we will train algorithm based on the training data set provided. This function will model the data by calculating values for W - feature vector and b - bias. In the fit function, we will be be trying to find the optimal values for W and b, essentially trying to approximate the solution for the equation: WT x + b = y. The predict function will use the calculated values of W and b to find the classification for any new input data points based on the input data point x. The fit part of the algorithm is where do our optimization. In SVMs, we are to perform convex optimizations.

A convex optimization problem is a problem where all of the constraints are convex functions, and the objective is a convex function if minimizing

In this tutorial, we are not going to delve into the depths of optimization problems and the theoretical aspects behind them. If you are interested in exploring convex optimizations further, here is a good resource for you to do so. You can also refer here if you want know more about the optimization problems and the geometry behind them.

Let’s get started out on our SMO optimization.

SVM Optimization through SMO - Sequential minimal optimization :

For our implementation, we are going to optimize using the SMO (Sequential minimal optimization) method. SMO is one of the older optimizations used in SVMs and are relatively easier to implement. Needless to say, there are much more complex and better performing optimizations out there. The Sequential Minimal Optimization (SMO) algorithm is derived by taking the idea of the decomposition method to its extreme and optimizing a minimal subset of just two points at each iteration.

Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVM).

In this article, we will consider a linear classifier for a binary classification problem with labels y (y ϵ [-1,1]) and features x. A SVM will compute a linear classifier (or a line) of the form: WT x + b = y

With f(x), we can predict y = 1 if f(x) ≥ 0 and y = -1 if f(x) < 0. And, without getting into too many theoretical details, this f(x) can be expressed by solving the dual problem as :

Lagarange Form

where αi (alpha i) is a Lagrange multiplier for solution and <x(i),x> called inner product of x(i) and x. The simplified SMO algorithm takes two α parameters, αi and αj, and optimizes them. To do this, we iterate over all αi, i = 1, . . . m.

The power of this technique resides in the fact that the optimization problem for two data points admits an analytical solution, eliminating the need to use an iterative quadratic programming optimizer as part of the algorithm.

At each step SMO chooses two elements αi and αj to jointly optimize, find the optimal values for those two parameters given that all the others are fixed, and updates the α vector accordingly. The choice of the two points is determined by a heuristic, while the optimization of the two multipliers is performed analytically. Despite needing more iterations to converge, each iteration uses so few operations that the algorithm exhibits an overall speed-up of some orders of magnitude.

Further, SMO was such an important development for SVMs mainly because of the limited use of computational resources required for the optimization. With SMO, we do not directly perform any matrix operations, and hence we do not need to store the kernel matrix in memory. This allows the SMO to run with limited memory, which is very useful for large data sets. If you are interested to know know more about SMO, here is a good resource that covers the theory a bit deeper.

Implementing SMO within our fit function

Now, let’s get on with our fit function.

def fit(self, dataset):
  self.dataset = dataset

  # Magnitude of W is the key, list of W and b is the value
  options = {}

First, we set the incoming data set as the data on which train the algorithms on. We also have the options variable initialized as an empty dictionary. The options variable will have the magnitude of W (||W||) as the key as a a list of W and b as the value.

For our choice of the two points we need determine it by a heuristic (it can also be random). For our case, we will use the heuristic as the min and max feature values, so we can start at the out-most ranges and converge inward.

all_feature_values = []

for yi, attrs in self.dataset.items():
  for attr in attrs:
    for f in attr:
      all_feature_values.append(f)

self.max_attr = max(data)
self.min_attr = min(data)
del all_feature_values

We can get rid of all_feature_values to save up some sweet memory space. Next, we need to think about the step sizes we are going to use whilst performing the optimization. We are essentially starting at the out-most ranges, so initially our steps sizes can to be big. This makes the first pass optimization faster and having small step sizes at the initial pass does not improve the optimization values. We sequentially go through smaller and smaller step sizes as we find the local minima for each pass. So, for each pass, we have smaller step sizes and at the same time, smaller range of values to search for in our quest for finding optimal values. With this optimization, we are approximating the optimal values for W.

For this case, we are going to have three step sizes. Each orders of magnitude smaller than the previous one. We also have to initialize the initial value (latest_optimum) of our vector W. For simplicity, we will just have one unique value in out vector ie. we will repeat the same value in all the elements of the vector. While, this is unarguably a very sub-optimal way of going about things, the SVM with this limitation still performs well - which is a testament to how good the algorithm is. We will set the default value of our latest_optimum as positive infinity, since we want any value derived from the first optimization step will replace it.

step_size = [self.max_attr * 0.1,self.max_attr * 0.01,self.max_attr * 0.005]
latest_optimum = float(inf)

Now that we have basic setup for finding W, let’s focus on b. We will define the range and the multiple for b. We can set it to35 (which is very expensive anyway, since we are more sensitive to the value of W than we are of b). The b_range will be out extended range from the min and max heuristic that we have chosen. b_multiple is essentially our step size for finding b.

b_range = 3
b_multiple = 5

We then define our 4 dimensional transform list (since we have 4 feature attributes), which will have the values of [1, 1, 1, 1], [1, 1, 1, -1], [1, 1,-1, 1], [1, 1,-1,-1], [1 ,-1, 1, 1], [1 ,-1, -1, 1], [1 ,-1, 1, -1], [1 ,-1, -1, -1], [-1, 1, 1, 1], [-1, 1, 1, -1], [-1, 1, -1, 1], [-1, 1, -1, -1], [-1,-1, 1, 1], [-1,-1, 1, -1], [-1,-1, -1, 1], [-1,-1, -1, -1]. Having this as a list will make it easier to perform the transformations.

	trans =  [[1, 1, 1, 1], [1, 1, 1, -1], [1, 1,-1, 1], [1, 1,-1,-1], [1 ,-1, 1, 1], [1 ,-1, -1, 1], [1 ,-1, 1, -1], [1 ,-1, -1, -1], [-1, 1, 1, 1], [-1, 1, 1, -1], [-1, 1, -1, 1], [-1, 1, -1, -1], [-1,-1, 1, 1], [-1,-1, 1, -1], [-1,-1, -1, 1], [-1,-1, -1, -1]]

That concludes the initial set up of algorithm. Next, we dive right into the optimization steps. We first initialize the W vector as a list of latest_optimum values. We also need a flag to break out of the optimization, once we feel that the values cannot be optimized further. The fact we are essentially solving for a convex optimization problem gives us the luxury of knowing when the optimization has been completed.

for step in step_size:
  W = np.array([latest_optimum,latest_optimum])
  optimization_flag = false

We can now jump right into the loop. We keeping optimizing until we know that we have hit the limit (ie. the optimization_flag is true). We can start the inner loop with the range of values that we are considering out optimal b to be in. This would be the range of all values between the maximum attribute value and the negative of that value. Within the inner loop, we will perform WT, which as mentioned before, is just multiplying the defined W will all the elements of the trans array. The inner loop will look like:

while not optimization_flag:
  for b in np.arange(-1*(self.max_attr* b_range ), self.max_attr * b_range, step * b_multiple):
    for transformation in trans:
      W_t = W * transformation

This is where we jump into the computationally expensive part. We go through the entire data set and perform out calculations to make sure it fits as well as possible. For every single data point in our training data, we check for the equation yi(WT.xi + b) >= 1, even if there is a single data point which does not conform to the above equation, we no longer continue. We can have a flag (found) to keep track of whether all of the data points conform to the aforementioned equation. If we happen to find a the values where we are able to satisfy the equation, we will then update our options dictionary. We calculate the magnitude of W using numpy’s linalg.norm. At this point, our inner loop looks like this:

  for step in step_size:
  W = np.array([latest_optimum,latest_optimum, latest_optimum, latest_optimum])
  optimization_flag = False

  while not optimization_flag:
    for b in np.arange(-1*(self.max_attr * b_range ), self.max_attr * b_range, step * b_multiple):
      for transformation in trans:
        W_t = W * transformation
        found = True
        for yi, attributes in self.dataset.items():
          for xi in attributes:
            if not (yi * (np.dot(xi, W_t)  +b )) >= 1:
              found = False
              break
        if found:
          options[np.linalg.norm(W_t)] = [W_t, b]

Now, let us define our optimized case. This is where the myriad of assumptions we have make come into play. To summerize, we assumed that W is gonna be in the form of W=[C,C] for simplicity, and then we starts with C at a value of infinity. Then we check all 4 possible transformation of W ([C,C] [C,-C] [-C,C] [-C,-C]) and keep only the viable ones. The viable ones are the ones that satisfy the condition yi(xi.w+b) >= 1. Then we lower the value of C by the step and repeat the same process. So we know that we have checked all the steps when the value of C goes below 0. For optimization flag to be true, we just need to check if w[0] is less than 0. This is because not every step i better than the previous, or more optimized. We only care about the final state. Since we are checking all 4 symmetrical cases for W for various values of C, there is no point in checking negative values since by having negative values for C would result in double checking cases we have already checked. So we can safely stop the optimization pass once we have the value of W[0] (or W[1] since they are the same value) goes below zero.

In case, we have not reached the local minima, we just move on to the next step.

    if W[0]<0:
      optimization_flag = True
      print("Optimized by a step: ", step)
    else:
      W = np.array(list(map(lambda w: w-step, W)))

After all of loops within the pass is done, we have to set the values for W and b for them to be used by the predict function. We choose the lowest magnitude value from the options as the most optimal one. We then update the latest_optimum for the next pass.

  norms = min([n for n in options])
  self.W = options[norms][0]
  self.b = options[norms][1]

  latest_optimum = options[norms][0][0] + step*2

So, in the end, our fit function will look like:

  def fit(self, dataset):
    self.dataset = dataset

    # Magnitude of W is the key, list of W and b is the value
    options = {}

    all_feature_values = []

    for yi, attrs in self.dataset.items():
      for attr in attrs:
        for f in attr:
          all_feature_values.append(f)

    self.max_attr = max(all_feature_values)
    self.min_attr = min(all_feature_values)
    del all_feature_values

    step_size = [self.max_attr * 0.1,self.max_attr * 0.01,self.max_attr * 0.005]
    latest_optimum = float(inf)

    b_range = 3
    b_multiple = 5


    trans =  [[1, 1, 1, 1], [1, 1, 1, -1], [1, 1,-1, 1], [1, 1,-1,-1], [1 ,-1, 1, 1], [1 ,-1, -1, 1], [1 ,-1, 1, -1], [1 ,-1, -1, -1], [-1, 1, 1, 1], [-1, 1, 1, -1], [-1, 1, -1, 1], [-1, 1, -1, -1], [-1,-1, 1, 1], [-1,-1, 1, -1], [-1,-1, -1, 1], [-1,-1, -1, -1]]

    for step in step_size:
      W = np.array([latest_optimum,latest_optimum])
      optimization_flag = False

      for step in step_size:
    			W = np.array([latest_optimum,latest_optimum, latest_optimum, latest_optimum])
    			optimization_flag = False

    			while not optimization_flag:
    				for b in np.arange(-1*(self.max_attr* b_range ), self.max_attr * b_range, step * b_multiple):
    					for transformation in trans:
    						W_t = W * transformation
    						found = True
    						for yi, attributes in self.dataset.items():
    							for xi in attributes:
    								if not (yi * (np.dot(xi, W_t)  +b )) >= 1:
    									found = False
    									break
    						if found:
    							options[np.linalg.norm(W_t)] = [W_t, b]
    				if W[0]<0:
    					optimization_flag = True
    				else:
    					W = np.array(list(map(lambda w: w-step, W)))

        norms = min([n for n in options])
        self.W = options[norms][0]
        self.b = options[norms][1]

        latest_optimum = options[norms][0][0] + step*2

Implementing the predict function:

Now that we have the fit function implemented, the predict is much easier to implement. Using the optimal separating hyperplane we find in the fit function, we can predict the class of new incoming data points.

As mentioned above, we have to solve for y in the equation: WT x + b = y. y is our class. We have the values of x provided in the attrs param in the function. W and b are already calculated in the fit function. So all we need to do in the predict function is to just substitute the values of W, x and b, then we will have our classification (y).

Breaking the above equation down:

WT x = W.x (dot product). To perform the dot product we can use numpy’s handy dot function. Before that, we would have to convert our attrs array into numpy array, which can be done by using the array function. The + b part of the equation is a simple addition, so it is very straight forward. So, the python implementation of the equation will look like:

(np.dot(np.array(attrs), self.W) + self.b)

By implementing the above, we will have something like this:

import numpy as np

def predict(self, attrs):
  #sign of the X(i).W + b defines the class
  classification = np.sign(np.dot(np.array(attrs), self.W) + self.b)

  return classification

Implementing the cross validation:

Now that we have the predict function implemented, Now is the time to check the accuracy of our algorithm. As mentioned early in the post, we will be using cross validation for the same. We will take the test_set, for which we already know the classes, and we shall use our predict function to predict the classes using our model.

For the cross validation, we can create a test function on our CustomSVM class:

  def test(self, test_set):
    self.accurate_predictions, self.total_predictions = 0, 0
    for group in test_set:
      for data in test_set[group]:
        predicted_class = self.predict(data)
        if predicted_class == group:
          self.accurate_predictions += 1
        self.total_predictions += 1
    self.accuracy = 100*(self.accurate_predictions/self.total_predictions)
    print("\nAcurracy :", str(self.accuracy) + "%")

For our iris dataset, we can run the cross validation: svm = CustomSVM()

  svm.fit(dataset = training_set)
  svm.test(test_set)

Which has the output:

Acurracy : 100.0%

Although, a 100% accuracy is usually too good to be true and in fact is generally detrimental due to over fitting; in our case, for the simple, easily separable classes of the data set - this makes sense.

That’s it for now; if you have any comments, please leave them below.