Monday, December 2, 2013

We talked about cluster validation measures and saw some details about the measures for partitioning and hierarchical clustering. We continue our discussion with a way to verify the number of clusters. Take a dataset with ten well-formed clusters and apply K-means clustering, which also yields ten clusters. If we plot the SSE versus the number of clusters for this data, and also plot the average silhouette coefficient versus the number of clusters, both plots show a change in trend when the number of clusters is ten: the SSE plot shows a knee and the silhouette coefficient plot shows a peak.
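As a rough sketch of how such plots could be produced (not part of the original discussion), the following Python snippet uses scikit-learn and matplotlib on synthetic data; the dataset and parameter choices are assumptions for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with ten well-separated clusters (an assumption for illustration).
X, _ = make_blobs(n_samples=1000, centers=10, cluster_std=0.5, random_state=0)

ks = range(2, 21)
sse, sil = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                       # SSE of this clustering
    sil.append(silhouette_score(X, km.labels_))   # average silhouette coefficient

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(ks), sse, marker='o')
ax1.set_xlabel('number of clusters')
ax1.set_ylabel('SSE')
ax2.plot(list(ks), sil, marker='o')
ax2.set_xlabel('number of clusters')
ax2.set_ylabel('average silhouette coefficient')
plt.show()   # expect a knee in SSE and a peak in the silhouette near k = 10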
Another validation concern is that almost all clustering algorithms yield clusters, yet the given data may or may not contain any natural clusters. To work around this, we refine the validation to check that at least a few of the resulting clusters are well formed.
An alternate approach is to assess the clustering tendency, where we try to evaluate whether a dataset has clusters without actually clustering it. The most common approach is to use statistical tests for spatial randomness. However, choosing the correct model, estimating the parameters and evaluating the significance that the data is non-random can be time-consuming. Still, a few such tests are mentioned in this post.
Hopkins statistic - For this approach, we generate p points that are randomly distributed across the data space and also sample p actual data points. For both sets of points, we find the distance to the nearest neighbor in the original data set. If ui is this nearest-neighbor distance for the generated points and wi is the corresponding distance for the sampled points, the statistic is given by
H = Sum(wi) / (Sum(ui) + Sum(wi))
Notice that this has a form similar to precision.
If the generated points and the sampled data points have roughly the same nearest-neighbor distances, H comes out to be 0.5, indicating that the data looks much like randomly placed points. With this formula, H close to 0 indicates highly clustered data (the sampled points sit much closer to their neighbors than the generated points do), while H close to 1 indicates regularly distributed data.
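A minimal sketch of this computation with numpy and scipy; the function name, the sampling scheme over the bounding box and the default p are my own choices.

import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, p=50, seed=0):
    # H = Sum(wi) / (Sum(ui) + Sum(wi)) with the convention used above:
    # ui are nearest-neighbor distances of p uniformly generated points,
    # wi are nearest-neighbor distances of p sampled data points.
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    tree = cKDTree(X)

    # p points generated uniformly at random over the bounding box of the data
    random_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(p, dim))
    u, _ = tree.query(random_pts, k=1)

    # p actual data points; k=2 because each data point's nearest neighbor is itself
    sample = X[rng.choice(n, size=p, replace=False)]
    w_all, _ = tree.query(sample, k=2)
    w = w_all[:, 1]

    return w.sum() / (u.sum() + w.sum())

# H near 0.5 suggests random data, near 0 highly clustered data,
# and near 1 regularly spaced data (under this convention).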
When we have external information about the data, such as class labels for the data objects, we use two different kinds of approaches: the first set of techniques is based on entropy, purity and the F-measure, and the second set is based on something similar to the similarity measures for binary data, such as the Jaccard measure. The first kind measures the extent to which each cluster contains objects of a single class, while the second measures the extent to which two objects of the same class end up in the same cluster. They can also be named classification-oriented and similarity-oriented approaches.
Entropy is the degree to which each cluster consists of objects of a single class.
If mi is the number of objects in cluster i and mij is the number of objects of class j in cluster i, then the probability that a member of cluster i belongs to class j is pij = mij/mi, and the entropy of cluster i is ei = -Sum-over-j(pij.log-base-2(pij)). The total entropy is the weighted sum Sum-over-i((mi/m).ei), where m is the total number of objects. Purity, on the other hand, uses pi = max-over-j(pij): the purity of cluster i is pi and the overall purity is Sum-over-i((mi/m).pi).
F-measure combines precision and recall: F = 2*precision*recall/(precision + recall), where for cluster i and class j the precision is pij and the recall is mij/mj, with mj the number of objects in class j.
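As a sketch, all three measures can be computed from the cluster-by-class contingency matrix; the function and the particular way of aggregating the F-measure over classes are my own choices, and empty clusters or classes are assumed not to occur.

import numpy as np

def external_measures(m):
    # m[i, j] = number of objects of class j in cluster i
    m = np.asarray(m, dtype=float)
    mi = m.sum(axis=1, keepdims=True)     # objects per cluster
    mj = m.sum(axis=0, keepdims=True)     # objects per class
    total = m.sum()
    p = m / mi                            # pij = mij / mi

    with np.errstate(divide='ignore', invalid='ignore'):
        e_i = -np.sum(np.where(p > 0, p * np.log2(p), 0.0), axis=1)
    entropy = np.sum((mi.ravel() / total) * e_i)            # weighted over clusters

    purity = np.sum((mi.ravel() / total) * p.max(axis=1))

    precision = p                         # precision(i, j) = mij / mi
    recall = m / mj                       # recall(i, j)    = mij / mj
    with np.errstate(divide='ignore', invalid='ignore'):
        F = np.where(precision + recall > 0,
                     2 * precision * recall / (precision + recall), 0.0)
    f_measure = np.sum((mj.ravel() / total) * F.max(axis=0))   # best F per class, weighted by class size
    return entropy, purity, f_measure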

We mentioned that low (negative) values of the silhouette coefficient for a point indicate that it may be an outlier or poorly assigned. In addition, this measure can be used not just for individual points but for a cluster or for the clustering overall, since we simply take the average over the points involved.
We can also measure cluster validity via correlation. If we are given a similarity matrix for a dataset and the cluster labels from a cluster analysis of that data set, we can measure the goodness of the clustering by looking at the correlation between the similarity matrix and the ideal version of the matrix. An ideal matrix is one in which each point has a similarity of 1 to all points in its own cluster and 0 to all other points. If the rows and columns are ordered so that points in the same cluster are adjacent, the ideal matrix has non-zero values only in blocks along the diagonal, because those blocks correspond to pairs of points in the same cluster, i.e. intracluster similarity. Hence such a matrix is also called block diagonal.
High correlation between the ideal and the actual similarity matrices means that the points belonging to the same cluster are close together, while low correlation means this is not the case. Note that since both matrices are symmetric, only the entries above (or below) the diagonal need to be compared.
The shortcoming of this approach is that it works well when the clusters are globular but not when they are intertwined, as with many contiguity- or density-based clusters.
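A minimal sketch of this check, assuming a precomputed similarity matrix and an array of cluster labels (the function name is mine):

import numpy as np
from scipy.stats import pearsonr

def correlation_with_ideal(sim, labels):
    # sim: symmetric similarity matrix; labels: cluster label of each point
    labels = np.asarray(labels)
    ideal = (labels[:, None] == labels[None, :]).astype(float)  # 1 within a cluster, 0 otherwise
    iu = np.triu_indices_from(sim, k=1)    # both matrices are symmetric, so compare one triangle
    r, _ = pearsonr(sim[iu], ideal[iu])
    return r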
In order to visualize the similarity matrix, we can transform the distances into similarities with the formula:
s = 1 - (d - min_d)/(max_d - min_d)
This lets us see the separation between the clusters better. When we apply this to the results of clustering the same data set with K-means, DBSCAN and complete link, we see that the K-means similarity matrix shows the most uniform separation. All three show well-defined separation between the three clusters, but K-means displays it best.
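A small sketch of this visualization, with the rows and columns sorted by cluster label so the block structure becomes visible (the helper name is mine):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

def plot_sorted_similarity(X, labels):
    d = squareform(pdist(X))                        # pairwise Euclidean distances
    s = 1 - (d - d.min()) / (d.max() - d.min())     # s = 1 - (d - min_d)/(max_d - min_d)
    order = np.argsort(labels)                      # group points of the same cluster together
    plt.imshow(s[order][:, order], cmap='gray')
    plt.colorbar(label='similarity')
    plt.show()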
We also referred to the cophenetic correlation coefficient. This comes in useful for evaluating agglomerative hierarchical clustering.
The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. For example, if the smallest distance between the two clusters that are merged is 0.1, then all the points in one cluster have a cophenetic distance of 0.1 with respect to the points in the other cluster. Since different hierarchical clusterings of a set of points merge the clusters differently, the cophenetic distances between pairs of objects vary from one technique to another. In a cophenetic distance matrix, the entries are the cophenetic distances between each pair of objects; notice that this matrix has zero values along the diagonal. The cophenetic correlation coefficient is the correlation between the entries of this matrix and the original dissimilarity matrix.
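SciPy computes this directly; a brief sketch with placeholder data:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.RandomState(0).rand(50, 2)   # placeholder data; any numeric array of points works
d = pdist(X)                               # original dissimilarities (condensed form)
Z = linkage(d, method='single')            # e.g. single-link agglomerative clustering
c, coph_d = cophenet(Z, d)                 # c is the cophenetic correlation coefficient
print(c)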


Sunday, December 1, 2013

We talked about evaluating clusters in the previous post and briefly introduced the various categories of measures. We can now look into a few details. For partitional clustering, we have measures based on cohesion and separation computed from the proximity matrix. For hierarchical clustering, we have the cophenetic correlation coefficient. We can also look into entropy, purity and the Jaccard measure. In the partitional case, we take the proximity to be the squared Euclidean distance. For cohesion, we aggregate the distances between the points and their centroid. For separation, we take link distances, whether minimum, maximum or average. The relationship between cohesion and separation is written as
TSS = SSE + SSB
where TSS is the total sum of squares.
SSE is the sum of squared error
and SSB is the between-group sum of squares; the higher the total SSB, the more separated the clusters are.
Another thing to note here is that minimizing SSE (cohesion) automatically maximizes SSB (separation). This can be seen as follows. The total sum of squares is the sum of the squared distances of each point x from the overall mean c. The difference x - c can be written as (x - ci) + (ci - c), where ci is the centroid of the cluster containing x. Expanding the square, the cross terms sum to zero within each cluster, leaving exactly SSE (the squared distances of points from their own centroids) plus SSB (the squared distances of the centroids from the overall mean, weighted by cluster size). Since TSS is fixed for a given data set, decreasing one term must increase the other.
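A quick numerical check of the identity on made-up data (the dataset and the use of K-means labels are assumptions; any partition with centroids taken as cluster means would do):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

c = X.mean(axis=0)                                  # overall mean of the data
TSS = np.sum((X - c) ** 2)                          # total sum of squares

SSE = sum(np.sum((X[labels == i] - X[labels == i].mean(axis=0)) ** 2)
          for i in np.unique(labels))
SSB = sum(np.sum(labels == i) * np.sum((X[labels == i].mean(axis=0) - c) ** 2)
          for i in np.unique(labels))

print(TSS, SSE + SSB)   # the two agree up to floating-point error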
We have so far talked about using these measures for all of the clusters together. In fact, these measures can also be used for individual clusters. For example, a cluster with a high degree of cohesion may be considered better than a cluster with a lower value, and a cluster with a lower value may be a candidate for further partitioning into better clusters. Thus clusters can be ranked and processed based on these measures. Having talked about individual clusters, we can also rank objects within a cluster. An object that contributes more to cohesion is likely to be near the core of the cluster, while the others are likely to be near the edge.
Lastly, we talk about the silhouette coefficient, which combines both cohesion and separation. It is computed in three steps.
For the ith object, calculate its average distance to all other objects in its cluster and call it ai
For the ith object and any cluster not containing it, calculate the object's average distance to all the objects in that cluster. Take the minimum such value over those clusters and call it bi
For the ith object, the silhouette coefficient is given by (bi-ai)/max(ai, bi)
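A direct sketch of these three steps with numpy (it assumes every cluster has at least two points; scikit-learn's silhouette_score gives the same average value):

import numpy as np
from scipy.spatial.distance import cdist

def silhouette_coefficients(X, labels):
    labels = np.asarray(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                                   # exclude the point itself
        a = cdist(X[i:i+1], X[same]).mean()               # average distance within own cluster (ai)
        b = min(cdist(X[i:i+1], X[labels == k]).mean()    # smallest average distance to another cluster (bi)
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s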

Saturday, November 30, 2013

We can talk about some cluster evaluation or validation techniques. Each clustering algorithm can define its own type of validation. For example, we discussed that K-means can be evaluated based on SSE. Almost any clustering algorithm will find clusters in a dataset. Even if we take uniformly distributed data and ask the algorithms we discussed, such as DBSCAN and K-means, for three clusters, they will find three clusters even though the data has no natural structure. This prompts us to do cluster validation, which involves the following aspects:
1) Detecting whether there is non-random structure in the data
2) Detecting the clustering tendency of a set of data.
3) Determining the correct number of clusters
4) Comparing the results of cluster analysis with an external labeling
5) Comparing two sets of clusters to determine which is better.
Items 1 to 3 do not rely on external information, while item 4 does by definition. Items 3 to 5 can be applied to the entire clustering or to individual clusters.
If the validation is to be done with a numerical measure, a given measure may not apply to every kind of cluster, and visual inspection as an alternative is only practical for two- or three-dimensional data. Further, if we obtain a value for the measure, we still have to evaluate whether it indicates a good match. The goodness of a match can be judged by looking at the statistical distribution of the value, to see how likely such a value is to occur by chance.
The measures may be classified into one of following three categories:
1) Unsupervised, where there is no external information, e.g. SSE. Hence these are also called internal indices. There are two subcategories here: measures of cluster cohesion and measures of cluster separation.
2) Supervised, which measures the extent to which the clustering structure matches some external classification. Hence these are also called external indices. An example of this type of measure is entropy, which measures how well cluster labels match externally supplied class labels.
3) Relative, which compares different clusterings or clusters. These are not a separate category per se but a comparative use of the other measures, e.g. comparing two K-means clusterings by their SSE. A relative cluster evaluation measure can be either supervised or unsupervised.
These and some previous posts have been reviewed from the Clustering: Basic Concepts and Algorithms book.

bisecting k-means

A bisecting K-means algorithm overcomes some of the limitations with the choice of initial centroids and finding the minimum SSE. It is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on until K clusters have been produced.
The steps for the bisecting K-means algorithm include the following:
1. Initialize the list of clusters with one that contains all the points
2. repeat
3. Remove a cluster from the list of clusters
4. Perform several trial bisections of the chosen cluster
5. for i = 1 to number of trials do
6.    Bisect the selected cluster using basic K-means
7. end for
8. Select the two clusters from the bisection with the lowest total SSE
9. Add these two clusters to the list of clusters
10. until the list of clusters contains K clusters.
There are a number of ways to choose which cluster to split. We can choose the largest cluster at each split or the one with the largest SSE or both.
Further, we can refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm.
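A rough Python sketch of these steps, using scikit-learn's basic KMeans for each bisection; the helper names and the choice to always split the cluster with the largest SSE are my own.

import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    return np.sum((points - points.mean(axis=0)) ** 2)

def bisecting_kmeans(X, K, n_trials=10, seed=0):
    clusters = [np.arange(len(X))]                      # start with one cluster containing all points
    while len(clusters) < K:
        # choose the cluster with the largest SSE to split
        idx = max(range(len(clusters)), key=lambda i: sse(X[clusters[i]]))
        members = clusters.pop(idx)

        best = None
        for trial in range(n_trials):                   # several trial bisections
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + trial).fit(X[members])
            split = [members[km.labels_ == 0], members[km.labels_ == 1]]
            total = sse(X[split[0]]) + sse(X[split[1]])
            if best is None or total < best[0]:
                best = (total, split)                   # keep the bisection with the lowest total SSE
        clusters.extend(best[1])

    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

The resulting cluster centroids could then be supplied as the init argument of a final KMeans run for the refinement step mentioned above.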

Friday, November 29, 2013

A discussion on SSE as brought up in the previous post. SSE is the sum of squared error and it describes the scatter of a cluster. When the SSE is low, the clusters are well formed and cohesive, so SSE gives a measure of the quality of a clustering; a low value also means the clustering is well represented by its centroids. As the name suggests, SSE is computed by aggregating the squares of the distance of each point from its centroid, where the distance is the Euclidean distance between the two points. It can be shown that the centroid that minimizes the SSE of a cluster is its mean, which is valuable because it describes the optimum.
In K-means, for example, points are assigned to their nearest centroid, which minimizes the SSE for the given set of centroids, and the centroids are then recomputed so as to minimize the SSE again. Notice, however, that these steps only find a local minimum with respect to the SSE, which can lead to sub-optimal clustering. Also, random initialization of the centroids results in different SSEs in different runs, causing variations in the end result. Typically the run with the lowest SSE is chosen as the best result. Two points to keep in mind: 1) Poor initial centroids - even when the initial centroids are reasonably well spread out, we can still obtain a sub-optimal clustering with higher squared error, and 2) Limits of random initialization - one way to address the first problem is to perform multiple runs with varying initial points and choose the result with the lowest SSE, but this too may only find local minima.
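A small sketch of the multiple-runs idea with scikit-learn, where inertia_ is the SSE of a run (the dataset and the number of runs are arbitrary choices):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=1)

# ten runs with different randomly chosen initial centroids
runs = [KMeans(n_clusters=5, init='random', n_init=1, random_state=s).fit(X)
        for s in range(10)]
sses = [km.inertia_ for km in runs]
best = runs[int(np.argmin(sses))]            # keep the run with the lowest SSE
print(sorted(round(v, 1) for v in sses))     # runs typically differ, illustrating local minima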
Since we talked about the DBSCAN clustering method in the previous post, let's quickly revisit agglomerative hierarchical clustering as well. It is an important category of clustering algorithms. Hierarchical clustering can be agglomerative or divisive. Both are often displayed visually using a tree-like diagram called a dendrogram, which shows both the cluster and sub-cluster relationships and the order in which the clusters were merged or split. For sets of two-dimensional points, a hierarchical clustering can also be illustrated using a nested cluster diagram.
The algorithm proceeds like this:
1. Compute the proximity matrix, if necessary.
2. repeat
3. Merge the closest two clusters
4. Update the proximity matrix to reflect the proximity between new and original clusters.
5. until only one cluster remains.
Step 3. can be done using single link, complete link and group average methods.
The single link is the minimum distance between clusters as measured between two points - one from each cluster that are nearest. The complete link is the maximum distance between clusters as measured between two points - one from each cluster that are farthest. And the group average takes the average of pairwise distances of all pairs of points from the two different clusters.
Proximity can also be measured in terms of the distance between the centroids of the clusters. Ward's method also uses centroids, but it measures the proximity between two clusters in terms of the increase in the sum of squared error (SSE) that results from merging them. SSE gives an indication of cluster quality, and Ward's method attempts to minimize the SSE of points from their cluster centroids.
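These criteria map onto SciPy's linkage methods; a brief sketch on small made-up data (the dataset is an assumption):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)   # small example data
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)      # SciPy computes the proximities and merges the closest clusters
    plt.figure()
    dendrogram(Z)
    plt.title(method + ' linkage')
plt.show()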
Due to the use of the proximity matrix, the space complexity is O(m^2). Since the algorithm updates the proximity matrix over m-1 iterations, a straightforward implementation has time complexity O(m^3). With data structures such as a heap to keep the proximities ordered, the complexity can be reduced to O(m^2 log m).
This kind of clustering does not globally optimize an objective function. Instead, it uses the criteria described above to decide locally which clusters to merge (or split) at each step. These local decisions are easy to make, but the overall time and space requirements of the algorithm are high.