Density Based Spatial Clustering Of Applications With Noise (DBSCAN)

What is DBSCAN?

DBSCAN is an unsupervised, density-based clustering algorithm. Density-based means that the algorithm focuses on the distance between each point and its neighbors instead of the distance to a centroid, as K-Means does. One way to describe DBSCAN is:

given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). (Source: Wikipedia)
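As a rough illustration of that description, here is a minimal pure-Python sketch of the idea. The function names dbscan and region_query, the sample points, and the eps and min_pts values are my own choices for this example and do not come from any library. The brute-force neighbor search is only reasonable for tiny inputs.

```python
from math import dist  # Python 3.8+

NOISE = -1

def region_query(points, i, eps):
    """Indices of every point within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i, _ in enumerate(points):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still become a border point later
            continue
        cluster += 1  # points[i] is a core point, so start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # only core points keep expanding
    return labels

# Hypothetical sample data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The two dense groups get their own cluster ids, and the lone point at (50, 50) is marked as noise because nothing else lies within eps of it.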

Getting Started With Docker

What is Docker?

Docker is a tool for building and running containers. You may think of a container as a self-contained space on your computer. Why would you want to create a self-contained space on your computer? Say, for example, you’re writing some code that depends on the latest version of some library, and you want to share this code with a client. Without a container, you’d have to make sure your client had the same version of the library as you do. This can be a pain, especially if your client is not technical. With Docker, however, you can package up your code along with the software and libraries required to run it, and send it to the client as a single package.

Why Learn The Math Behind The Method

I love learning how and why machine learning works, when to use it, and whether there is a better method. Studying mathematics at university instilled in me the desire to see what’s under the hood when I am working on a new project or with a new method. As a result, I am reluctant to apply a method that I haven’t studied or whose derivation I don’t understand. One of my biggest fears is applying a method in a situation where it doesn’t make sense (e.g. reporting an R² value on a nonlinear regression model). I constantly meet people who say, “I don’t need to know how that works, there is already a package out there that does it for me,” or something similar. I couldn’t disagree more. I recently saw a blurb in a Georgia Tech edX course that addresses this sentiment, and I wanted to share its wisdom.

You may rightly ask, why bother with such details? Here are three reasons it’s worth looking more closely.

It’s helpful to have some deeper intuition for how one formalizes a mathematical problem and derives a computational solution, in case you ever encounter a problem that does not exactly fit what a canned library can do for you.

If you have ever used a statistical analysis package, it’s likely you have encountered “strange” numerical errors or warnings. Knowing how problems are derived can help you understand what might have gone wrong. We will see an example below.

Because data analysis is quickly evolving, it’s likely that new problems and new models will not exactly fit the template of existing models. Therefore, it’s possible you will need to derive a new model or know how to talk to someone who can derive one for you.

Python List Comprehension And Arrays

Iterating through a list

Python neophytes often make the mistake of using range(len(x)) when looping through an object. For example,

x = [1, 2, 3]  # example input
new_list = []
for i in range(len(x)):
  y = x[i] + 1  # look the element up by index, then add one
  new_list.append(y)

The code above accomplishes the goal of creating a new list where each element is one larger than in the previous list. However, by the beautiful magic powers of Python, we are able to accomplish this faster and in fewer lines. We can use the timeit module from Python’s standard library to see how fast our code runs. I used Python 3.6.4 for the following code. First we must come up with a problem statement. Let’s say we have a list of numbers and we want to know the percent change of each element from the previous one. Now we need to set some variables that we’ll want to reuse during this test.

from timeit import timeit
my_setup = """
import numpy as np
x = np.arange(1, 1001)  # 1000 values, starting at 1 so we never divide by zero
"""
n = 10000
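For readers new to timeit, here is a small sketch of how its arguments fit together. The toy statement and the names toy_setup, total_seconds, and per_run are made up for this illustration and are not part of the benchmarks in this post. The setup string runs once, the statement runs number times, and the return value is the total elapsed time in seconds.

```python
from timeit import timeit

# Hypothetical toy statement, unrelated to the benchmarks in this post.
toy_setup = "values = list(range(100))"
runs = 1000
total_seconds = timeit("sum(values)", setup=toy_setup, number=runs)

# timeit returns the total for all runs, so divide to get a per-run figure.
per_run = total_seconds / runs
print(f"total: {total_seconds:.6f} s, per run: {per_run:.9f} s")
```

So when a test below reports a time in seconds, that is the total across all n runs, not the time for a single pass.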

Now let’s dive into the speed tests. We’ll start with the naive approach.

print(timeit("""
lst = []
for i in range(1,len(x)):
    percent = (x[i] - x[i-1])/x[i-1]  # change relative to the previous element
    lst.append(percent)
""",
setup = my_setup, number = n)
)

The code above ran in approximately 4.5 seconds on my machine. Our first change will be cutting the code down from four lines to one. I mean, come on, typing is so exhausting! To do this we’ll use what is called a list comprehension. Basically, we can write the for loop inside of brackets and it’ll generate the list we want.

print(timeit("""
[(x[i] - x[i-1])/x[i-1] for i in range(1,len(x))]
""",
setup = my_setup, number = n)
)

This code ran in about 4.0 seconds, roughly 10% faster than the naive approach and much cleaner. Now the goal is to stop using range(len(x)). Python allows us to pluck out each element of an iterable object and use that instead of an index. For example, you could just write for i in lst, which iterates over the elements of lst themselves, so you don’t have to continuously refer back to the object (e.g. instead of ‘lst[i]’ you can just say ‘i’). A quick example is:

lst = ["Python", "is", "great!"]
for i in lst:
  print(i)

Which will print:

Python
is
great!
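Direct iteration also works over two sequences at once. The sketch below, with made-up sample data and variable names of my own, shows how zip combined with slicing pairs each element with the one after it, which is handy for anything that compares neighbors.

```python
readings = [100, 110, 99, 120]  # hypothetical sample data

# readings[:-1] drops the last element and readings[1:] drops the first,
# so zip lines up each value with the one that follows it.
pairs = list(zip(readings[:-1], readings[1:]))
print(pairs)  # [(100, 110), (110, 99), (99, 120)]

# Unpack each pair directly in the loop header; no index needed.
changes = [(new - old) / old for old, new in pairs]
print(changes)
```

Each tuple holds (previous, current), so the loop body never has to index back into the list.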

Many objects in Python are iterable, such as lists, dictionaries, NumPy arrays, and pandas DataFrames. Now let’s apply this to our code sample.

print(timeit("""
[(new-old)/old for old,new in zip(x[:-1], x[1:])]
""",
setup = my_setup, number = n)
)

This code runs in about 2.6 seconds. Excelsior! We cut the run time by roughly a third simply by removing range(len(x)) while still using a list comprehension. When possible, though, we should avoid looping at all. This is especially important when we are dealing with very large objects, such as an array with 1,000,000 elements or a DataFrame with 250,000 rows. Instead, we should perform what are called vectorized operations, where we act on the entire array at once.

print(timeit("""
(x[1:]-x[:-1])/x[:-1]
""",
setup = my_setup, number = n)
)

This code runs in about 0.05 seconds. That’s correct, almost two orders of magnitude faster than the original code. As the size of the object grows, the difference between the two methods grows. WARNING: When objects are small, list comprehensions may be the better choice. It’s always best to speed test your code and find out what works for your situation. For those who still want the index of the element while looping, look into the built-in enumerate function.
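As a quick sketch of that last point, enumerate hands back the index and the element together, so there is no need to fall back on range(len(x)). The list and variable names below are only for illustration.

```python
words = ["Python", "is", "great!"]

# enumerate yields (index, element) pairs; start=1 makes the count 1-based.
for position, word in enumerate(words, start=1):
    print(position, word)

# Without start, counting begins at 0, matching normal list indexing.
numbered = list(enumerate(words))
print(numbered)  # [(0, 'Python'), (1, 'is'), (2, 'great!')]
```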

Luke Wileczek
