Tuesday, December 3, 2013

We mentioned similarity-oriented measures for supervised clustering. We will now look at the Jaccard coefficient. We can view this approach to cluster validity as involving the comparison of two matrices - an ideal cluster similarity matrix and an ideal class similarity matrix defined with respect to the class labels. As before, we can take the correlation of these two matrices as the measure of cluster validity. Both matrices are m x m, where m is the number of data points, and both contain only binary values - an entry is 1 if the two objects belong to the same cluster or class respectively, and 0 otherwise. For all m(m - 1)/2 pairs of distinct objects, we compute the following:
f00 = number of pairs of objects that have a different class and a different cluster.
f01 = number of pairs of objects that have a different class and the same cluster.
f10 = number of pairs of objects that have the same class and a different cluster.
f11 = number of pairs of objects that have the same class and the same cluster.
These four counts define a two-way contingency table whose columns correspond to same or different cluster and whose rows correspond to same or different class.
Then we can use one of the two most frequently used similarity measures, as follows:
1) Rand statistic = (f00 + f11) / (f00 + f01 + f10 + f11)
and
2) Jaccard coefficient = f11 / (f01 + f10 + f11)
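The pair counts and both measures can be computed directly from two label lists, one giving each object's class and one giving its cluster. A minimal Python sketch (the function names are my own):

```python
from itertools import combinations

def pair_counts(classes, clusters):
    """Count the four pair types over all m(m-1)/2 distinct pairs of objects."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            f11 += 1          # same class, same cluster
        elif same_class:
            f10 += 1          # same class, different cluster
        elif same_cluster:
            f01 += 1          # different class, same cluster
        else:
            f00 += 1          # different class, different cluster
    return f00, f01, f10, f11

def rand_statistic(classes, clusters):
    f00, f01, f10, f11 = pair_counts(classes, clusters)
    return (f00 + f11) / (f00 + f01 + f10 + f11)

def jaccard_coefficient(classes, clusters):
    f00, f01, f10, f11 = pair_counts(classes, clusters)
    return f11 / (f01 + f10 + f11)
```

For example, with classes [0, 0, 1, 1] and clusters [0, 0, 0, 1], the counts are f00 = 2, f01 = 2, f10 = 1, f11 = 1, giving a Rand statistic of 0.5 and a Jaccard coefficient of 0.25. Note how the Jaccard coefficient ignores f00, so it is not inflated by the many pairs that simply agree on being "different".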
For hierarchical clustering, the key idea behind a measure is that at least one of the clusters should be relatively pure and include most of the objects of some class. To evaluate this, we use the F-measure, as defined earlier, for each cluster in the cluster hierarchy.
The overall F-measure for the hierarchical clustering is computed as the weighted average of the per-class F-measures, i.e. F = Sum over classes j of (mj/m) * max over clusters i of F(i, j), where the weights are based on class sizes, the maximum is taken over all clusters i at all levels of the hierarchy, mj is the number of objects in class j, and m is the total number of objects.
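This weighted average can be sketched as follows, assuming the hierarchy is given as a flat list of clusters drawn from all of its levels (each cluster a set of object indices), and using the standard precision/recall form of the per-cluster, per-class F-measure:

```python
def overall_f_measure(classes, hierarchy_clusters):
    """Weighted average of per-class best F-measures over a cluster hierarchy.

    classes: one class label per object.
    hierarchy_clusters: clusters from ALL levels of the hierarchy,
        each given as a set of object indices.
    """
    m = len(classes)
    class_members = {}
    for idx, c in enumerate(classes):
        class_members.setdefault(c, set()).add(idx)

    total = 0.0
    for members in class_members.values():
        mj = len(members)
        best = 0.0
        for cluster in hierarchy_clusters:
            mij = len(cluster & members)   # objects of this class in this cluster
            if mij == 0:
                continue
            precision = mij / len(cluster)
            recall = mij / mj
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)            # max over all clusters i
        total += (mj / m) * best           # weight by class size mj/m
    return total
```

If some level of the hierarchy contains a pure cluster for every class, the overall F-measure reaches 1, which is exactly the "at least one relatively pure cluster per class" idea above.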
Thus we have seen different cluster measures, whose values indicate the goodness of a clustering. For purity, 0 is bad and 1 is good; an entropy of 0 and an SSE of 0 are both good. We can use the absolute numbers directly if we only want to enforce a certain level of error. Otherwise, we can use statistical tests that assess how non-random the clustering structure is, i.e. how unlikely such a high or low value of the measure would be by chance.
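As a small illustration of the purity and entropy values mentioned above, the standard definitions (weighted over clusters by cluster size, which is what I assume the earlier notes used) can be computed as:

```python
from math import log2

def purity_and_entropy(classes, clusters):
    """Overall purity and entropy of a flat clustering.

    classes, clusters: one label per object. Both measures are
    weighted averages over clusters, weighted by cluster size.
    """
    m = len(classes)
    groups = {}
    for cls, clu in zip(classes, clusters):
        groups.setdefault(clu, []).append(cls)

    purity = entropy = 0.0
    for members in groups.values():
        mi = len(members)
        # class distribution p_ij within this cluster
        probs = [members.count(c) / mi for c in set(members)]
        purity += (mi / m) * max(probs)
        entropy += (mi / m) * -sum(p * log2(p) for p in probs)
    return purity, entropy
```

A perfect clustering gives purity 1 and entropy 0; clusters whose members are split evenly across classes drive purity down and entropy up, matching the "0 is bad / 1 is good" reading above.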
