Sunday, December 1, 2013

We talked about evaluating clusters in the previous post, where we briefly introduced the various categories of measures. We can look into a few details now. For partitional clustering, we have measures of cohesion and separation that are computed from the proximity matrix. For hierarchical clustering, we have the cophenetic correlation coefficient. We could also look into entropy, purity and the Jaccard measure. In the partitional case, we take the proximity to be the squared Euclidean distance. For cohesion, we aggregate the distances between the points and the centroid of their cluster. For separation, we take the distances of the links between clusters, whether minimum, maximum or average. The relationship between cohesion and separation is written as
TSS = SSE + SSB
where TSS is the total sum of squares,
SSE is the sum of squared error,
and SSB is the between-group sum of squares. The higher the total SSB, the more separated the clusters are.
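The identity above can be checked numerically on a toy partition. The sketch below assumes the usual definitions: TSS is measured from the overall centroid c of all the points, SSE from each point to its own cluster centroid ci, and SSB is the squared distance of each ci from c weighted by the cluster size. The function names are illustrative and not from any library.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sum_of_squares(clusters):
    """Return (TSS, SSE, SSB) for a list of clusters, each a list of points."""
    all_points = [p for cluster in clusters for p in cluster]
    c = centroid(all_points)  # overall centroid of the data
    tss = sum(sq_dist(p, c) for p in all_points)
    sse = 0.0
    ssb = 0.0
    for cluster in clusters:
        ci = centroid(cluster)  # centroid of this cluster
        sse += sum(sq_dist(p, ci) for p in cluster)  # cohesion part
        ssb += len(cluster) * sq_dist(ci, c)  # separation part
    return tss, sse, ssb


# Two small clusters on a line, chosen so the arithmetic is easy to follow.
clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
tss, sse, ssb = sum_of_squares(clusters)
print(tss, sse, ssb)  # 104.0 4.0 100.0
```

Regrouping the same four points differently changes SSE and SSB, but their sum stays at the same TSS.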
Another thing to note here is that minimizing SSE (cohesion) automatically results in maximizing SSB (separation), because TSS depends only on the data and not on how the data is partitioned. The identity can be proved in this way. The total sum of squares is the sum of the squared distances of each point x from the overall centroid c of the data. The difference (x - c) can be written as the sum of (x - ci), the offset of the point from the centroid ci of its own cluster, and (ci - c), the offset of that centroid from the overall centroid. Expanding the square gives three terms. The cross term vanishes exactly, because the offsets (x - ci) within a cluster sum to zero, and the two remaining terms are SSE and SSB.
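The argument above can be written out as a short derivation. The notation is an assumption made for this sketch: K clusters C_i with centroids c_i and sizes |C_i|, and c as the centroid of all the points.

```latex
\begin{align*}
\mathrm{TSS}
  &= \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c \rVert^2
   = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert (x - c_i) + (c_i - c) \rVert^2 \\
  &= \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2
   + 2 \sum_{i=1}^{K} (c_i - c)^{\top} \sum_{x \in C_i} (x - c_i)
   + \sum_{i=1}^{K} \lvert C_i \rvert \, \lVert c_i - c \rVert^2 \\
  &= \mathrm{SSE} + 0 + \mathrm{SSB}
\end{align*}
```

The middle term is zero because c_i is the mean of C_i, so the offsets (x - c_i) over that cluster add up to the zero vector.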
We have so far talked about using these measures for all of the clusters taken together. In fact, these measures can also be used for individual clusters. For example, a cluster that has a high degree of cohesion may be considered better than a cluster that has a lower one. A cluster with low cohesion is also a candidate for further partitioning, which may result in better clusters. Thus clusters can be ranked and processed based on these measures. Having talked about individual clusters, we can also rank the objects within a cluster based on these measures. An object that contributes more to cohesion is likely to be near the core of the cluster, while an object that contributes less is likely to be near the edge.
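As a rough sketch of this ranking, the code below uses the per-cluster SSE as the cohesion measure (a lower SSE means a more cohesive cluster) and the squared distance to the centroid as each object's contribution. Both choices, and the function names, are assumptions made for illustration. Since SSE grows with cluster size, dividing by the number of points may be preferable when clusters differ a lot in size.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def rank_clusters_by_sse(clusters):
    """Return (cluster index, SSE) pairs, least cohesive cluster first."""
    scored = []
    for idx, cluster in enumerate(clusters):
        ci = centroid(cluster)
        scored.append((idx, sum(sq_dist(p, ci) for p in cluster)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def rank_objects_in_cluster(cluster):
    """Return the cluster's points ordered from the core to the edge."""
    ci = centroid(cluster)
    return sorted(cluster, key=lambda p: sq_dist(p, ci))


tight = [(0, 0), (1, 0), (0, 1), (1, 1)]
loose = [(10, 10), (14, 10), (10, 14), (14, 14), (12, 12)]

# The loose cluster has the higher SSE, so it is listed first
# as the more likely candidate for further partitioning.
print(rank_clusters_by_sse([tight, loose]))  # [(1, 32.0), (0, 2.0)]

# The point sitting on the centroid comes first (core of the cluster).
print(rank_objects_in_cluster(loose)[0])  # (12, 12)
```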
Lastly, we talk about the silhouette coefficient. This combines both cohesion and separation, and it is computed in three steps.
For the ith object, calculate its average distance to all the other objects in its own cluster and call it ai.
For the ith object and any cluster not containing the object, calculate the object's average distance to all the objects in that cluster. Take the minimum of these values over all such clusters and call it bi.
For the ith object, the silhouette coefficient is given by (bi - ai) / max(ai, bi).
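The three steps above can be sketched directly in code. This is a minimal illustration that assumes plain Euclidean distance, at least two clusters, and a return value of 0 for an object that is alone in its cluster or when both averages are zero. Those conventions and the function names are choices made here, not something stated in the post.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(i, k, clusters):
    """Silhouette coefficient of the i-th point of clusters[k]."""
    point = clusters[k][i]
    others_in_own = [q for j, q in enumerate(clusters[k]) if j != i]
    if not others_in_own:
        return 0.0  # assumed convention for a singleton cluster

    # Step 1: average distance to the other objects in the same cluster.
    a = sum(euclidean(point, q) for q in others_in_own) / len(others_in_own)

    # Step 2: average distance to each other cluster, keeping the minimum.
    b = min(
        sum(euclidean(point, q) for q in other) / len(other)
        for m, other in enumerate(clusters)
        if m != k
    )

    # Step 3: combine the two into a single score.
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


clusters = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
s = silhouette(0, 0, clusters)
print(round(s, 2))  # 0.8
```

Here the point (0, 0) is much closer to its own cluster than to the other one, so its coefficient is well above zero.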
