Tuesday, May 11, 2021

DBScan vs K-means

 

Clustering algorithms:

A clustering algorithm identifies relationships in a dataset and groups the records into clusters. The model discovers these relationships on its own, so clustering is unsupervised and no predictable (target) column is required. Taking the example of chip buyers, we can see that the kids form a cluster separate from the other buyers; splitting further by age group yields smaller clusters. The clustering behavior can be tuned with parameters such as the maximum number of clusters or the amount of support required to form a cluster.
Data for clustering usually has a simple structure: one key column, one or more input columns, and optionally some predictable columns. Some tools also ship with a Cluster Viewer that displays the clusters in a diagram.

A partitioning clustering algorithm divides the data into k groups such that some criterion evaluating the clustering quality is optimized, where k is specified by the user. A hierarchical clustering algorithm instead generates a sequence of partitions of the records: starting with a partition in which each cluster consists of a single record, it merges the two closest clusters in each step until a single partition remains. K-means is a partitioning algorithm, while DBScan is density-based; both produce a flat partition of the records, but they differ in how clusters are defined, as described below.
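To make the distinction concrete, the sketch below runs one algorithm from each family on the same toy data, assuming scikit-learn and SciPy are available; the variable names and toy data are illustrative only. With the partitioning route the user fixes k up front, while the hierarchical route builds a merge tree and cuts it at the desired number of clusters.

```python
# A minimal sketch contrasting partitioning and hierarchical clustering
# on toy 2-D data; assumes scikit-learn and SciPy are installed.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0, scale=0.5, size=(20, 2)),
               rng.normal(loc=5, scale=0.5, size=(20, 2))])

# Partitioning: the user fixes k up front and the algorithm optimizes
# a quality criterion (here, the sum of squared errors).
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: start with one cluster per record, merge the two closest
# clusters at each step, then cut the merge tree to obtain k clusters.
merge_tree = linkage(X, method="single")
hier_labels = fcluster(merge_tree, t=2, criterion="maxclust")
```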

K-means is based on a distance or similarity measure, and its clusters minimize the sum of squared errors. The number of clusters k is pre-defined by the user. Clustering involves grouping data points together based on these measures: when we cluster by similarity, we compare each vector with the current cluster centers. The algorithm begins with a finite set of clusters and adjusts their centroids as data points are assigned to them. This kind of algorithm is preferred for sparse data where we can determine up front the number of partitions we want to make.
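As an illustration of this assign-and-recenter loop, here is a minimal k-means sketch in plain numpy. The function name and defaults are my own, and real implementations add refinements such as k-means++ initialization.

```python
# A minimal sketch of the k-means loop described above, assuming numpy,
# Euclidean distance, and a user-supplied k. Not optimized for speed.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k points chosen at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ended up empty.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer move the centroids
        centroids = new_centroids
    return labels, centroids
```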

DBScan is a simple and effective density-based clustering algorithm that illustrates several important concepts. Consider first the notion of density. There are several ways to measure density, and they are distinct from measuring similarity. In the center-based approach, the density at a particular point is estimated by counting the number of points within a given radius, including the point itself. This method is simple, but the resulting density depends on the radius: if the radius is large enough, every point has a density of m, where m is the number of points in the dataset; if the radius is too small, every point has a density of 1.
Under this approach, a point either lies in the interior of a density-based cluster, lies on the border of the neighborhoods of one or more core points, or is an outlier belonging neither to a cluster nor to the neighborhood of any core point. Points are therefore classified as core points, border points, and noise points.
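A small sketch of this density computation and classification, in plain numpy; the parameter names eps and min_pts and the label constants are assumptions for illustration, not part of any standard API.

```python
# Center-based density and the resulting core/border/noise labels.
# eps is the radius; min_pts is the density threshold for core points.
import numpy as np

CORE, BORDER, NOISE = 0, 1, 2

def classify_points(X, eps, min_pts):
    # Pairwise distances; density of a point = number of neighbors
    # within eps, counting the point itself.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    density = neighbors.sum(axis=1)
    labels = np.full(len(X), NOISE)
    is_core = density >= min_pts
    labels[is_core] = CORE
    # A border point is not core but lies in some core point's neighborhood.
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    labels[is_border] = BORDER
    return labels, neighbors, is_core
```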
The algorithm proceeds as follows. Any two core points that are close enough, i.e. within the radius of one another, are put in the same cluster. Any border point that is close enough to a core point is included in that core point's cluster. Ties between two or more clusters are resolved by choosing the closest one. Noise points are discarded.
So the steps can be enumerated as follows (a code sketch of these steps appears after the list):
1. Label the points as core, border, and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make a cluster out of each group of connected core points that are within the specified distance of one another.
5. Assign each border point to one of the clusters of its associated core points.
For each point, we find the points that lie within its neighborhood, so the complexity of this algorithm is O(m^2). However, spatial data structures such as kd-trees can reduce the overall cost of retrieving each point's neighbors to O(m log m). The space complexity is linear because we only persist a small amount of metadata per point, namely its cluster label and its classification as a core, border, or noise point.
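As an illustration of the faster neighbor retrieval, the snippet below uses scikit-learn's KDTree for the radius queries; the data here is random filler.

```python
# Radius queries through a kd-tree instead of scanning all m points.
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.default_rng(0).random((1000, 2))
tree = KDTree(X)
# For every point, the indices of its neighbors within radius eps.
neighborhoods = tree.query_radius(X, r=0.05)
```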
The parameters of this algorithm are the neighborhood radius and the minimum number of points that must fall within that radius for a point to count as a core point rather than a border point.
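For comparison, scikit-learn's DBSCAN exposes these same two knobs, named eps and min_samples there:

```python
# The same two parameters in scikit-learn's DBSCAN implementation.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((100, 2))
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)  # -1 marks noise
```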
This algorithm has trouble when the density of the clusters varies widely, and it also struggles with high-dimensional data, but it remains one of the simplest ways of clustering.

 
