Clustering algorithms:
A clustering algorithm identifies relationships in a dataset and groups the records into clusters. The model specified by the user identifies these relationships on its own, so no predictable column is required. Taking the example of chip buyers, we can see that the kids form a cluster separate from the others; splitting the clusters further into age groups yields smaller clusters. The clustering behavior can be tuned with parameters such as the maximum number of clusters or the amount of support required to form a cluster.
Data for clustering usually has a single key column, one or more input columns, and optionally some predictable columns. Some tools also ship with a Cluster Viewer that shows the clusters in a diagram.
A partitioning clustering algorithm divides the data into k groups such that some criterion evaluating the clustering quality is optimized, where k is specified by the user. A hierarchical clustering algorithm instead generates a sequence of partitions of the records: starting with a partition in which each cluster consists of a single record, it merges two clusters at each step until a single cluster containing all the records remains.
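To make the hierarchical idea concrete, here is a small sketch using SciPy's agglomerative linkage; the toy points and the single-linkage choice are illustrative assumptions rather than anything from the text.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# five made-up two-dimensional records
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 9.0]])

# Each row of Z records one merge step: starting from singleton clusters,
# the two closest clusters are merged until only one cluster remains.
Z = linkage(X, method="single")

# Cutting the merge sequence gives a flat partition, here into 3 clusters.
print(fcluster(Z, t=3, criterion="maxclust"))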
DBSCAN and k-means both produce flat partitions rather than hierarchies, but they differ in approach: k-means is distance-based, while DBSCAN is density-based.
K-means is based on a distance or similarity measure, and its clusters minimize the sum of squared errors. The number of clusters k is pre-defined by the user. Clustering groups data points together based on these measures: when we cluster on a similarity measure, we compare each data vector with the cluster centers. The algorithm begins with a finite set of initial clusters, and the centroids of these clusters are adjusted as data points are assigned to them. This kind of algorithm is preferred for sparse data and for cases where we can determine in advance the number of partitions we want to make.
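A minimal sketch of this procedure with NumPy is shown below; the function name, the random initialization, and the fixed iteration cap are illustrative choices rather than anything prescribed above.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # start from k randomly chosen data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign each point to its closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Each iteration first reassigns points to the nearest centroid and then moves each centroid to the mean of its assigned points, which is what drives the sum of squared errors down.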
DBSCAN is a simple and effective density-based clustering algorithm that illustrates several important concepts. First consider the notion of density, which is distinct from similarity and can be measured in several ways. In the center-based approach, the density at a particular point is estimated by counting the number of points within a certain radius, including that point itself. This method is simple, but the density measured this way depends on the radius: if the radius is large enough, every point has a density of m, the number of points in the dataset, and if the radius is too small, every point has a density of 1.
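As a small illustration, the center-based density can be computed in a couple of lines of NumPy; the name radius stands for the radius parameter discussed later, and the brute-force distance matrix is used only for clarity.

import numpy as np

def center_based_density(X, radius):
    # pairwise distances between all m points, then count, for each point,
    # how many points (including itself) fall within the given radius
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (dists <= radius).sum(axis=1)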
Under this approach, a point either lies in the interior of a density-based cluster, lies on the border within the neighborhood of one or more core points, or is an outlier that is neither part of a cluster nor in the neighborhood of any core point. Thus, points are classified as core points, border points, and noise points.
The algorithm proceeds like this. Any two core points that are close enough, i.e. within the radius of one another, are put in the same cluster. Any border point that is close enough to a core point is included in that core point's cluster; if a border point is close to two or more clusters, the tie is resolved by choosing the closest one. Noise points are discarded.
So the steps can be enumerated as follows (a short sketch of these steps appears after the list):
1. Label the points as core, border, and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance
of one another
4. Make a cluster out of each group of connected core points that are within
the specified distance of one another.
5. Assign each border point to one of the clusters of its associated core points.
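The sketch below follows these five steps directly in Python with NumPy; the parameter names eps and min_pts, the brute-force distance matrix, and the stack-based traversal of connected core points are assumptions made for illustration.

import numpy as np

def dbscan_sketch(X, eps, min_pts):
    m = len(X)
    # pairwise distances and eps-neighborhoods (O(m^2), as discussed below)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(m)]

    # Step 1: label points as core, border, or noise
    core = np.array([len(n) >= min_pts for n in neighbors])
    border = np.array([not core[i] and core[neighbors[i]].any() for i in range(m)])
    # noise = neither core nor border; these are never assigned (Step 2)

    # Steps 3-4: connect core points within eps of one another and make one
    # cluster out of each connected group (simple stack-based traversal)
    labels = np.full(m, -1)          # -1 marks noise / unassigned
    cluster_id = 0
    for i in range(m):
        if not core[i] or labels[i] != -1:
            continue
        stack = [i]
        labels[i] = cluster_id
        while stack:
            j = stack.pop()
            for k in neighbors[j]:
                if core[k] and labels[k] == -1:
                    labels[k] = cluster_id
                    stack.append(k)
        cluster_id += 1

    # Step 5: assign each border point to the cluster of its nearest core
    # neighbor, so ties between clusters go to the closest one
    for i in range(m):
        if border[i]:
            core_nbrs = [k for k in neighbors[i] if core[k]]
            nearest = min(core_nbrs, key=lambda k: dists[i, k])
            labels[i] = labels[nearest]
    return labels

Points still labelled -1 at the end are the noise points, which simply never receive a cluster.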
For each point, we must find the points that lie within its neighborhood, so the basic complexity of this algorithm is O(m^2). However, there are data structures, such as spatial indexes, that can reduce the cost of retrieving the points adjacent to a given center, bringing the complexity down to O(m log m). The space complexity is linear, because we only persist a small amount of metadata for each point, namely its cluster label and its classification as a core, border, or noise point.
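One common way to obtain the cheaper neighborhood queries is a spatial index such as a k-d tree; this is a sketch assuming SciPy is available, and the function name is made up for the example.

import numpy as np
from scipy.spatial import cKDTree

def neighborhoods(X, radius):
    tree = cKDTree(X)                  # build the spatial index once
    # for every point, return the indices of points within `radius`,
    # without materializing the full m-by-m distance matrix
    return tree.query_ball_point(X, r=radius)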
The parameters for this algorithm are the radius and the minimum number of points required within that radius, which is what distinguishes the core points from the border points.
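For reference, these two parameters correspond to eps and min_samples in scikit-learn's DBSCAN; the toy data below is made up purely to show the call.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [1.2, 1.1],
              [8.0, 8.0], [8.1, 7.9], [25.0, 25.0]])
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # points labelled -1 are the noise points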
This algorithm has trouble when the density of the clusters varies widely. It also has difficulty with high-dimensional data, but it remains one of the simplest ways of clustering.