Thursday, November 28, 2013

Today we talk about DBSCAN algorithm. It's a simple and effective density-based clustering algorithm that illustrates a number of important concepts. We first consider the notion of density. There are several methods to measure density that are distinct and different from measuring similarity. In the center based approach, density is estimated for a particular point by counting the number of points within a certain radius including that point. This method is simple but the density measured this way depends on radius. If the radius is large enough, then all points will have a density of m which is the number of points in a dataset. If the radius is too small, then all points will have a density of 1.
By this approach, points will either lie in the core of a density based cluster, or they will lie in the border between the neighborhoods of several core points or they may be outliers where they are neither part of a cluster nor in the neighborhood of two or more clusters. Thus points in this approach are classified as core points, border points and noise points.
The algorithm proceeds like this. Any two core points that are close enough i.e. within the radius of one another are put in the same cluster. Any border point that is close enough to a cluster is included within the same cluster as a core point. If there are ties between two or more clusters, they are resolved by choosing the one that's closest. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label the points as core, border and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make a cluster out of each group of connected core points that are within the specified distance of one another.
5. Assign each border point to one of the clusters of its associated core points.
For each point we find the points that are within the neighborhood of this point, so the complexity of this algorithm is O(m^2). However, there are data structures that can help reduce the complexity in retrieving the points adjacent to this center to O(mlogm). The space complexity is linear because we only persist a small amount of metadata for each of the points namely the cluster label and the classification of each point as the core, border or noise point.
The parameters for this algorithm are the radius and the minimum number of points in a cluster to distinguish the core from the border points.
This algorithm has trouble when the density of the clusters vary widely. It also has difficulty with the high-dimensional data but it's one of the simple ways of clustering.

No comments:

Post a Comment