Monday, December 2, 2013

We talked about cluster validation measures and saw some details about the measures for partitioning and hierarchical clustering. We continue our discussion with a way to verify the number of clusters. Suppose we take a dataset with ten well-formed clusters and run K-means clustering, asking for ten clusters. If we then plot the SSE versus the number of clusters for this data, as well as the average silhouette coefficient versus the number of clusters, both plots show a change in trend when the number of clusters is ten: the SSE plot shows a knee and the silhouette coefficient plot shows a peak.
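As a rough sketch of how these two plots can be produced, the following uses scikit-learn on a synthetic dataset generated with make_blobs; the ten-center synthetic data and the parameter choices here are stand-ins for the dataset described above, not part of the original discussion:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 1000 points around ten well-separated centers.
X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

ks = range(2, 21)
sse, silhouette = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                          # SSE of this clustering
    silhouette.append(silhouette_score(X, km.labels_))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, sse, marker="o")
ax1.set(xlabel="number of clusters", ylabel="SSE")   # expect a knee near k = 10
ax2.plot(ks, silhouette, marker="o")
ax2.set(xlabel="number of clusters",
        ylabel="average silhouette")                 # expect a peak near k = 10
plt.show()
```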
Another validation concern is that almost all clustering algorithms yield clusters even though the given data may or may not have any. To work around this, we refine the validation to check that at least a few of the resulting clusters are well formed.
An alternate approach is to measure the clustering tendency, where we try to evaluate whether a dataset has clusters without actually clustering it. The most common approach is to use statistical tests for spatial randomness. However, choosing the correct model, estimating the parameters, and evaluating the significance that the data is non-random can be time-consuming. Still, a few are mentioned in this post.
Hopkins statistic - For this approach, we generate p points that are randomly distributed across the data space and also sample p actual data points. For both sets of points, we find the distance to the nearest neighbor in the original dataset. If ui is this nearest-neighbor distance for the generated points and wi is this nearest-neighbor distance for the sampled points, the statistic is given by
H = Sum(wi) / (Sum(ui) + Sum(wi))

Notice that this has a form similar to precision.
If the generated points and the sampled points have roughly the same nearest-neighbor distances, H comes out to be 0.5, indicating that the data is spatially random. For highly clustered data, the sampled points lie much closer to their neighbors than the generated points do, so H is close to 0, while H close to 1 indicates data that is regularly distributed across the space.
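Here is a minimal sketch of this computation, assuming numpy and scipy are available; the sample size p, the k-d tree for the nearest-neighbor queries, and sampling the random points from the bounding box of the data are implementation choices rather than part of the definition:

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, p=50, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    tree = cKDTree(X)

    # p points uniformly distributed over the bounding box of the data
    lo, hi = X.min(axis=0), X.max(axis=0)
    uniform = rng.uniform(lo, hi, size=(p, d))
    u, _ = tree.query(uniform, k=1)      # ui: distance to nearest data point

    # p actual data points; k=2 because the nearest neighbor of a data
    # point within the dataset is the point itself
    sample = X[rng.choice(n, size=p, replace=False)]
    w = tree.query(sample, k=2)[0][:, 1]  # wi: distance to nearest other point

    # H near 0.5: random; near 0: clustered; near 1: regularly spaced
    return w.sum() / (u.sum() + w.sum())

X = np.random.default_rng(0).uniform(size=(500, 2))  # spatially random data
print(hopkins_statistic(X, p=50, rng=0))             # expect a value near 0.5
```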
When we have external information about the data, such as class labels for the data objects, we use two different kinds of approaches: the first set of techniques is based on entropy, purity, and the F-measure, and the second set is based on measures similar to the similarity measures for binary data, such as the Jaccard measure. The first set measures the extent to which each cluster contains objects of a single class, while the second measures the extent to which two objects in the same class also belong to the same cluster, and vice versa. They can also be named the classification-oriented and similarity-oriented approaches, respectively.
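As a small illustration of the similarity-oriented idea, the sketch below counts pairs of objects by whether they share a cluster and whether they share a class, and computes the Jaccard measure from those counts; the toy assignments are made up for the example:

```python
from itertools import combinations

def pair_counts(clusters, classes):
    """Count object pairs by cluster and class agreement."""
    f11 = f10 = f01 = f00 = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            f11 += 1   # same cluster and same class
        elif same_cluster:
            f10 += 1   # same cluster, different class
        elif same_class:
            f01 += 1   # different cluster, same class
        else:
            f00 += 1   # different cluster and different class
    return f11, f10, f01, f00

clusters = [0, 0, 0, 1, 1, 1]   # toy clustering
classes  = [0, 0, 1, 1, 1, 1]   # toy class labels
f11, f10, f01, f00 = pair_counts(clusters, classes)
jaccard = f11 / (f11 + f10 + f01)   # unlike the Rand statistic, ignores f00
print(jaccard)
```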
Entropy is the degree to which each cluster consists of objects of a single class.
If m is the total number of objects, mi is the number of objects in cluster i, and mij is the number of objects of class j in cluster i, then the probability that a member of cluster i belongs to class j is pij = mij/mi. The entropy of cluster i is ei = -Sum-over-j(pij.log-base-2(pij)), and the total entropy is the size-weighted sum Sum-over-i((mi/m).ei). Purity, on the other hand, uses pi = max-over-j(pij) for each cluster, and the overall purity is Sum-over-i((mi/m).pi).
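A minimal sketch of these computations, assuming numpy; the cluster-by-class contingency matrix of counts below is an illustrative example, not real data:

```python
import numpy as np

# counts[i][j] = mij = number of objects of class j in cluster i
counts = np.array([[45,  5,  0],
                   [ 2, 38, 10],
                   [ 0,  4, 46]])

mi = counts.sum(axis=1)            # objects per cluster
m = counts.sum()                   # total number of objects
p = counts / mi[:, None]           # pij = mij / mi

# per-cluster entropy; 0 * log2(0) is treated as 0
plogp = np.where(counts > 0, p * np.log2(np.where(counts > 0, p, 1)), 0.0)
ei = -plogp.sum(axis=1)

entropy = (mi / m) @ ei                  # size-weighted total entropy
purity = counts.max(axis=1).sum() / m    # Sum-over-i((mi/m) * max_j pij)
print(entropy, purity)
```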
F-measure combines precision and recall: for cluster i and class j, precision(i, j) = pij = mij/mi and recall(i, j) = mij/mj, where mj is the number of objects of class j. Then F(i, j) = 2*precision*recall/(precision + recall).
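Continuing with the same illustrative contingency matrix, here is a sketch of the per-(cluster, class) precision, recall, and F-measure:

```python
import numpy as np

counts = np.array([[45,  5,  0],
                   [ 2, 38, 10],
                   [ 0,  4, 46]])   # counts[i][j] = mij, as before

precision = counts / counts.sum(axis=1, keepdims=True)  # mij / mi
recall = counts / counts.sum(axis=0, keepdims=True)     # mij / mj
with np.errstate(divide="ignore", invalid="ignore"):
    f = 2 * precision * recall / (precision + recall)
f = np.nan_to_num(f)   # pairs with precision = recall = 0 get F = 0
print(f.round(2))
```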
