Monday, December 2, 2013


We mentioned that lower values of the silhouette coefficient indicate that a point may be an outlier. This measure can also be applied beyond individual points: averaging the silhouette coefficients of all the points in a cluster gives a cluster-level score, and averaging over all the points gives an overall measure of clustering validity.
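As a minimal sketch of this averaging, assuming a toy two-cluster data set (all numbers below are illustrative), each point's silhouette coefficient can be computed as (b - a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster, and then averaged:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Illustrative toy data: two well-separated 2-D clusters
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),   # cluster 0
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]   # cluster 1
labels = [0, 0, 0, 1, 1, 1]

def silhouette(i):
    """Silhouette coefficient of point i: (b - a) / max(a, b)."""
    own = [points[j] for j in range(len(points))
           if labels[j] == labels[i] and j != i]
    other = [points[j] for j in range(len(points)) if labels[j] != labels[i]]
    a = sum(dist(points[i], p) for p in own) / len(own)      # mean intra-cluster distance
    # With only two clusters, the other cluster is automatically the nearest one
    b = sum(dist(points[i], p) for p in other) / len(other)
    return (b - a) / max(a, b)

per_point = [silhouette(i) for i in range(len(points))]  # point-level scores
overall = sum(per_point) / len(per_point)                # overall clustering validity
```

Because the two toy clusters are far apart, every per-point score, and hence the overall average, comes out close to 1.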
We can also measure cluster validity via correlation. Given a similarity matrix for a data set and the cluster labels from a cluster analysis of that data set, we can measure the goodness of the clustering by the correlation between the similarity matrix and its ideal version. An ideal matrix is one in which each point has a similarity of 1 to all points in its cluster and 0 to all other points. Consequently, the ideal matrix has non-zero entries only in blocks along the diagonal, since those blocks correspond to rows and columns belonging to the same cluster, i.e. intracluster similarity. Such a matrix is therefore called block diagonal.
A high correlation between the ideal and actual matrices means that points belonging to the same cluster are close together, while a low correlation means they are not. Note that since both matrices are symmetric, only the upper or lower triangle needs to be compared.
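A minimal sketch of this comparison, using an illustrative 4-point similarity matrix and label vector (none of these numbers come from the post): build the ideal block-diagonal matrix from the labels, then compute the Pearson correlation over the upper triangles only.

```python
from math import sqrt

# Illustrative cluster labels and similarity matrix (symmetric, 1s on the diagonal)
labels = [0, 0, 1, 1]
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
n = len(labels)

# Ideal block-diagonal matrix: similarity 1 within a cluster, 0 across clusters
ideal = [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
         for i in range(n)]

# Flatten only the upper triangle of each matrix (both are symmetric)
xs = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
ys = [ideal[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

r = pearson(xs, ys)  # high r -> same-cluster points really are similar
```

With these made-up values the within-cluster similarities are large and the cross-cluster ones small, so the correlation is close to 1.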
The shortcoming of this approach is that it works well when the clusters are globular but not when they are interspersed.
In order to visualize the similarity matrix, we can transform the distances into similarities with the formula:
s = 1 - (d - min_d)/(max_d - min_d)
This lets us see the separation between the clusters better. When we apply this to the clusterings of the same data set produced by K-means, DBSCAN, and complete link, we see that the K-means similarity matrix shows the most uniform separation. All three have well-defined separations between the three clusters, but K-means displays it best.
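Applied to a list of pairwise distances, the transform above can be sketched as follows (the distance values are illustrative):

```python
# Min-max transform of distances into similarities:
#   s = 1 - (d - min_d) / (max_d - min_d)
distances = [0.0, 0.3, 0.7, 1.4, 2.1]  # illustrative pairwise distances

min_d, max_d = min(distances), max(distances)
similarities = [1 - (d - min_d) / (max_d - min_d) for d in distances]
# The smallest distance maps to similarity 1, the largest to similarity 0
```

Every similarity lands in [0, 1], which is what makes a heat-map style visualization of the matrix readable.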
We also referred to the cophenetic correlation coefficient, which is useful for evaluating agglomerative hierarchical clustering.
The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. For example, if the smallest distance between two clusters that are merged is 0.1, then every point in one cluster has a cophenetic distance of 0.1 to every point in the other. Since different hierarchical clusterings of a set of points merge the clusters differently, the cophenetic distances between pairs of objects vary from one technique to another. The cophenetic distance matrix holds the cophenetic distance between each pair of objects; notice that its diagonal entries are zero. The cophenetic correlation coefficient is the correlation between the entries of this matrix and the original dissimilarity matrix.
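A minimal sketch of the whole computation, assuming single-link merging on a few illustrative 1-D points (the point values are made up): record the merge height at which each pair of points first shares a cluster, then correlate those heights with the original distances.

```python
from math import sqrt
from itertools import combinations

# Illustrative 1-D points; single-link agglomerative merging
points = [0.0, 0.4, 1.0, 3.0]
idx = range(len(points))
orig = {(i, j): abs(points[i] - points[j]) for i, j in combinations(idx, 2)}

clusters = [[i] for i in idx]  # start from singleton clusters
coph = {}                      # cophenetic distance for each pair of points

while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-link distance
    (a, b), d = min(
        (((x, y), min(orig[tuple(sorted((i, j)))]
                      for i in clusters[x] for j in clusters[y]))
         for x, y in combinations(range(len(clusters)), 2)),
        key=lambda t: t[1])
    # Every cross-cluster pair is first co-clustered at this merge height
    for i in clusters[a]:
        for j in clusters[b]:
            coph[tuple(sorted((i, j)))] = d
    clusters[a] += clusters[b]
    del clusters[b]

# Cophenetic correlation coefficient: Pearson r between the two distance sets
pairs = sorted(orig)
xs = [orig[p] for p in pairs]
ys = [coph[p] for p in pairs]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
ccc = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
       sqrt(sum((x - mx) ** 2 for x in xs) *
            sum((y - my) ** 2 for y in ys)))
```

A value of ccc near 1 means the hierarchy's merge heights faithfully reproduce the original pairwise distances.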

