# Implementing DBSCAN Clustering from scratch in Python

September 13, 2017
HOME

DBSCAN (Density-based spatial clustering of applications with noise) is one of the most popular Machine Learning algorithms for clustering in data mining. One of the prime advantages of DBSCAN is that it’s ability to work with data with significant noise and outliers. Further, DBSCAN does not need to know how many clusters there are in the data set, unlike K-Means clustering.

In this article, I will implement the algorithm from scratch in python and visualize the results on a 2 dimensional data set that, when plotted, forms two concentric circles.

Full code here

DBSCAN is an unsupervised learning algorithm, ie. it needs no training data, it performs the computation on the actual dataset. This should be apparent from the fact that with DBSCAN, we are just trying to group similar data points into clusters, there is no prediction involved. Unlike previously covered K Means Clustering where we were only concerned with the distance metric between the data points, DBSCAN also looks into the spatial density of data points. Further, DBSCAN can effectively label outliers in the data set based on this density metric - data points which fall in low density areas can be segregated as outliers. This enables the algorithm to be un-distracted by noise.

## How DBSCAN works?

In my previous blog post, I had explained the theory behind DBSCAN and had implemented the algorithm with Python’s scikit learn. If you are not very familiar with the algorithm or its scikit-learn implementation, do check my previous post. ### Pseudocode for the DBSCAN algorithm:

1. Iterate though all the points in the data set - assume all of them to be outliers.
2. For any arbitrary point p, we now retrieve all points density-reachable from p.
3. If the number of points in the Ɛ neighbourhood is equal to or greater than the `minPoints`, p is a core point and a new cluster is formed.
4. Next, we go through each point in the newly formed cluster in BFS fashion and find the points in their respective Ɛ neighbourhood, ie. we are finding all indirectly density reachable points from `p`.
5. Once we have retrived all the indirectly density reachable points, we have now completed out cluster. We move on to the next point still marked as an outlier and repeat steps 1 through 4.
6. The algorithms stops once all the points in the data sets are visited.

## Implementing DBSCAN from scratch:

The implementation can be divided into the following:

1. Handle Data: Clean the file, normalize the parameters, given numeric values to non-numeric attributes. Read data from the file into a data frame.
2. Implement Function to find neighbours: Inclusing the decision on choosing the distance metric. For our example we will use Euclidean distance.
3. Fit: Model the DBSCAN around the data set.
4. Visualize: Visualizing the clusters and outliers.

### The data set for implementation

For this implementation, I have chose a simple and contrived 2 dimensional data set that is ideal for the use case of DBSCAN. I have generated the data set using `dataset` module from `sklearn`. The data set forms two concentric when plotted on a 2D plane.

The code that I used to generate the data set:

``````from sklearn import datasets

n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.08)
``````

I have saved a version of this into a csv file which can be found here.

### Handling Data:

For the implementation, I will use the saved data set of concentric circles instead of generating them for each run.

The first thing to do is to read the csv file. To deal with the csv data data, let’s import Pandas first. Pandas is a powerful library that gives Python R like syntax and functioning.

``````import pandas as pd
``````

``````df = pd.read_csv(r".\data\concentric_circles.csv") #Reading from the data file
``````

Now, lets convert the the data frame into a more generic data structure - a list of floating point numbers:

``````dataset = df.astype(float).values.tolist()
``````

We can also normalize the data using the `StandardScaler` from `sklearn.preprocessing`

``````from sklearn.preprocessing import StandardScaler

# normalize dataset
X = StandardScaler().fit_transform(dataset)
``````

So far, in `main.py`:

``````def main():

# Reading from the data file

dataset = df.astype(float).values.tolist()

# normalize dataset
X = StandardScaler().fit_transform(dataset)
``````

### Setting up the class

Before we move forward, let’s create a class for the algorithm.

``````  class CustomDBSCAN:
def __init__(self):
self.core = -1
self.border = -2
``````

We have the `CustomDBSCAN`, that will be our main class for the algorithm. In the constructor, we initialize the arbitrary values for core and border points.

### Defining Functions:

Now, let’s define the functions that go inside the `CustomDBSCAN` class. For the simplicity of this tutorial, we will not delve into Advanced DBSCAN topics such as the OPTIC algorithm, parameter estimation etc. For our simple implementation, we will only need two functions for the algorithm - `fit` and `neighbour_points`. In addition, we will implement a `visualize` function, to visualize out clusters and outliers.

As their names suggest, the `fit` function will use the incoming data to model DBSCAN - essentially clustering data based on the provided values of `Eps` and `minPoints`. The `fit` function will use the `neighbour_points` function to get all the spatially close points to any given point.

First we have the `neighbour_points` function. The `neighbour_points` function takes in the pre-processed data set, the index of the current point (`pointId`) and value of `epsilon`. The `neighbour_points` function will return the list of points that are in the Ɛ neighbourhood of any given point.

``````# Find all neighbour points at epsilon distance
def neighbour_points(self, data, pointId, epsilon):
pass
``````

Let’s define the `fit` function. We need the pre-processed data set, the chosen value for `Eps` and `MinPt` as params for our function.

``````# Fit the data into the DBSCAN model
def fit(self, data, Eps, MinPt):
pass
``````

Next, we have the `visualize` function. The `visualize` function takes in the pre-processed data set, the cluster list (returned by the `fit` function - explained in detail later) and the total number of clusters discovered by DBSCAN . We will be using `matplotlib` for plotting a scatter plot of the data.

``````# Visualize the clusters
def visualize(self, data, cluster, numberOfClusters):
pass
``````

### Implementing neighbour_points function:

As mentioned above, the `neighbour_points` function will return all the points in the Ɛ neighbourhood of any arbitrary point in the data set.

For our example, we will be using the Euclidean distance as the distance metric (through there are other options such as the Manhattan Distance, Minkowski Distance ). The Euclidean distance is straight line distance between two data points, that is, the distance between the points if they were represented in an n-dimensional Cartesian plane, more specifically, if they were present in the Euclidean space.

For calculating Euclidean distance, `np.linalg.norm` function from `numpy` will be used.

``````# Find all neighbour points at epsilon distance
def neighbour_points(self, data, pointId, epsilon):
points = []
for i in range(len(data)):
# Euclidian distance
if np.linalg.norm([a_i - b_i for a_i, b_i in zip(data[i], data[pointId])]) <= epsilon:
points.append(i)
return points
``````

### Implementing fit function:

The `fit` function will model the data set to DBSCAN, generating the clusters. The is the core of our implementation.

First, let’s a list that contains which cluster each point belongs to. Initially, as mentioned in the pseudocode, we will mark all the points as outliers. To denote cluster, we can use non-negative integers and we can mark outliers as cluster `0`.

``````    # initialize all points as outliers
point_label =  * len(data)
point_count = []
``````

Then, we can initialize the empty of list of core and border points:

``````    # initilize list for core/border points
core = []
border = []
``````

We have all the initialized variables ready, we can start with iterating through the data points. I will split the iteration into 3 passes for simplicity. In the first pass, we get all the neighbourhood points for each point. We will utilize the `neighbour_points` function here:

``````    # Find the neighbours of each individual point
for i in range(len(data)):
point_count.append(self.neighbour_points(data, i, Eps))
``````

In the second pass, we mark the points as core or border/outliers, any point which has more or equal to `MinPt` number of points in its Ɛ neighbourhood is a core point.

``````    # Find all the core points, border points and outliers
for i in range(len(point_count)):
if (len(point_count[i]) >= MinPt):
point_label[i] = self.core
core.append(i)
else:
border.append(i)

for i in border:
for j in point_count[i]:
if j in core:
point_label[i] = self.border
break
``````

In our final pass, we perform BFS on our data points and assign clusters to each point in our data set. We use the queue data structure to perform BFS.

``````    # Here we use a queue to find all the neighbourhood points of a core point and find the indirectly reachable points
# We are essentially performing Breadth First search of all points which are within Epsilon distance for each other
for i in range(len(point_label)):
q = queue.Queue()
if (point_label[i] == self.core):
point_label[i] = cluster
for x in point_count[i]:
if(point_label[x] == self.core):
q.put(x)
point_label[x] = cluster
elif(point_label[x] == self.border):
point_label[x] = cluster
while not q.empty():
neighbors = point_count[q.get()]
for y in neighbors:
if (point_label[y] == self.core):
point_label[y] = cluster
q.put(y)
if (point_label[y] == self.border):
point_label[y] = cluster
cluster += 1  # Move on to the next cluster
``````

This wraps up the fit function. At this stage we have all our points assigned to a cluster (or as an outlier with cluster `0`). We can return the list of cluster assignments and the total number of clusters found:

``````    return point_label, cluster
``````

Here’s the entire fit function:

``````# Fit the data into the DBSCAN model
def fit(self, data, Eps, MinPt):
# initialize all points as outliers
point_label =  * len(data)
point_count = []

# initilize list for core/border points
core = []
border = []

# Find the neighbours of each individual point
for i in range(len(data)):
point_count.append(self.neighbour_points(data, i, Eps))

# Find all the core points, border points and outliers
for i in range(len(point_count)):
if (len(point_count[i]) >= MinPt):
point_label[i] = self.core
core.append(i)
else:
border.append(i)

for i in border:
for j in point_count[i]:
if j in core:
point_label[i] = self.border
break

# Assign points to a cluster

cluster = 1

# Here we use a queue to find all the neighbourhood points of a core point and find the
# indirectly reachable points.
# We are essentially performing Breadth First search of all points which are within
# epsilon distance from each other
for i in range(len(point_label)):
q = queue.Queue()
if (point_label[i] == self.core):
point_label[i] = cluster
for x in point_count[i]:
if(point_label[x] == self.core):
q.put(x)
point_label[x] = cluster
elif(point_label[x] == self.border):
point_label[x] = cluster
while not q.empty():
neighbors = point_count[q.get()]
for y in neighbors:
if (point_label[y] == self.core):
point_label[y] = cluster
q.put(y)
if (point_label[y] == self.border):
point_label[y] = cluster
cluster += 1  # Move on to the next cluster

return point_label, cluster
``````

### Implementing visualize function:

Now that we have our main `fit` function implemented, let’s quickly write a utility function that visualizes the clusters. We will be using `matplotlib.pyplot` to plot a scatter plot of all our data points - color coded as per the cluster to which they belong to. In this contrived example, there are clearly two cluster and such as to be shown by the scatter plot.

``````# Visualize the clusters
def visualize(self, data, cluster, numberOfClusters):
N = len(data)
# Define colors, ideally better to have around 7-10 colors defined
colors = np.array(list(islice(cycle(['#FE4A49', '#2AB7CA']), 3)))

for i in range(numberOfClusters):
if (i == 0):
# Plot all outliers point as black
color = '#000000'
else:
color = colors[i % len(colors)]

x, y = [], []
for j in range(N):
if cluster[j] == i:
x.append(data[j, 0])
y.append(data[j, 1])
plt.scatter(x, y, c=color, alpha=1, marker='.')
plt.show()
``````

## Wrapping up

We can completed the implementation of the `CustomDBSCAN` class. Let’s wrap this implementation by finishing up our `main` function and looking at the visualized results for the data set. For this exmaple data set, I have found the `epsilon` and `MinPt` values to be optimal at 0.25 and 4 respectively.

``````def main():

# Reading from the data file

dataset = df.astype(float).values.tolist()

# normalize dataset
X = StandardScaler().fit_transform(dataset)

custom_DBSCAN = CustomDBSCAN()
point_labels, clusters = custom_DBSCAN.fit(X, 0.25, 4)

print(point_labels, clusters)

custom_DBSCAN.visualize(X, point_labels, clusters)
``````

The resultant plot: We have two clear cluster (red and blue), the algorithm splits the concentric circles into separate clusters. We can also see some outliers marked in black.

That’s it for now; if you have any comments, please leave them below.

Still have questions? Find me on Codementor