This post covers some cluster evaluation, or validation, techniques. Each clustering algorithm can define its own type of validation; for example, we discussed that K-means can be evaluated based on SSE (the sum of squared errors). The difficulty is that almost any clustering algorithm will find clusters in a dataset, whether or not any really exist. Even if we take uniformly distributed data and apply the algorithms we discussed, such as DBSCAN and K-means, they will still find clusters (three, in the example). This is what prompts us to do cluster validation, which involves the following aspects:
1) Detecting whether there is non-random structure in the data
2) Detecting the clustering tendency of a set of data.
3) Determining the correct number of clusters
4) Comparing the results of cluster analysis with an external labeling
5) Comparing two sets of clusters to determine which is better.
Items 1 to 3 do not rely on external information. Items 3 to 5 can be applied to an entire clustering or to individual clusters.
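Clustering tendency (item 2) can be probed before running any clustering algorithm at all. One common choice is the Hopkins statistic. It is not named in this post and is swapped in here purely for illustration. The sketch below assumes the convention in which values near 0 suggest clustered data and values near 0.5 suggest random data. The function names, the sample size of 20 and the synthetic data are all assumptions.

```python
import math
import random


def nearest_distance(point, others):
    """Distance from `point` to the closest element of `others`."""
    return min(math.dist(point, other) for other in others)


def hopkins(data, sample_size, rng):
    """Hopkins statistic in the form sum(w) / (sum(u) + sum(w)).

    w: nearest-neighbour distances of sampled real points to the other real points
    u: nearest-neighbour distances of uniformly generated points to the real points
    """
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]

    # Distances for a random sample of the actual points.
    w = []
    for i in rng.sample(range(len(data)), sample_size):
        others = data[:i] + data[i + 1:]
        w.append(nearest_distance(data[i], others))

    # Distances for artificial points drawn uniformly from the bounding box.
    u = []
    for _ in range(sample_size):
        fake = tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
        u.append(nearest_distance(fake, data))

    return sum(w) / (sum(u) + sum(w))


rng = random.Random(0)

# Hypothetical data: one uniform set and one set with three tight clusters.
uniform_data = [(rng.random(), rng.random()) for _ in range(200)]
centers = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]
clustered_data = [
    (rng.gauss(cx, 0.01), rng.gauss(cy, 0.01))
    for cx, cy in centers
    for _ in range(60)
]

h_uniform = hopkins(uniform_data, 20, rng)
h_clustered = hopkins(clustered_data, 20, rng)
print(f"uniform:   {h_uniform:.2f}")
print(f"clustered: {h_clustered:.2f}")
```

With data like this we would expect the clustered set to score much closer to 0 than the uniform set, although the exact values depend on the random sample.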
A validation measure may also have limited applicability: it may not suit every kind of cluster, and some measures are only applicable to two- or three-dimensional data. Further, once we obtain a value for the measure, we still have to judge whether it indicates a good match. The goodness of a match can be assessed by looking at the statistical distribution of the value, to see how likely such a value is to occur by chance.
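One way to look at the statistical distribution of a measure is to compute it many times on random data and see where the observed value falls. The sketch below does this for SSE, using a small hand-written K-means. The data, the choice of k = 3, the number of restarts and the number of random trials are all assumptions made for illustration.

```python
import math
import random


def kmeans_sse(points, k, rng, max_iterations=25):
    """Run a basic K-means from random initial centroids and return the final SSE."""
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        updated = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centroids[j]
            for j, group in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


def best_sse(points, k, rng, restarts=10):
    """Lowest SSE over several random restarts, to reduce the effect of poor starts."""
    return min(kmeans_sse(points, k, rng) for _ in range(restarts))


rng = random.Random(1)
k = 3

# Hypothetical "observed" data with three well-separated groups.
centers = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]
observed = [
    (rng.gauss(cx, 0.03), rng.gauss(cy, 0.03))
    for cx, cy in centers
    for _ in range(30)
]
observed_sse = best_sse(observed, k, rng)

# Reference distribution: SSE of uniform random data over the same bounding box.
xs = [p[0] for p in observed]
ys = [p[1] for p in observed]
random_sses = []
for _ in range(20):
    random_data = [
        (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        for _ in range(len(observed))
    ]
    random_sses.append(best_sse(random_data, k, rng))

# Fraction of random data sets whose SSE is at least as low as the observed one.
fraction_as_low = sum(s <= observed_sse for s in random_sses) / len(random_sses)
print(f"observed SSE: {observed_sse:.3f}")
print(f"random SSE range: {min(random_sses):.3f} to {max(random_sses):.3f}")
print(f"fraction of random runs with SSE <= observed: {fraction_as_low:.2f}")
```

If the observed SSE sits far below everything in the random range, a value that low is unlikely to arise from structureless data, which is the sense in which the match is "good".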
The measures may be classified into one of the following three categories:
1) Unsupervised. These use no external information, e.g. SSE, and are therefore also called internal indices. There are two subcategories here: measures of cluster cohesion and measures of cluster separation.
2) Supervised. These measure the extent to which the clustering structure matches an external classification, and are therefore also called external indices. An example is entropy, which measures how well cluster labels match external labels.
3) Relative. These compare different clusterings or clusters. They are not a category per se but rather the use of a measure to compare, say, two clusterings. A relative cluster evaluation can work with either a supervised or an unsupervised measure.
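The three categories can be sketched on a toy example. The code below computes SSE (cohesion) and the between-cluster sum of squares, here called SSB (separation), as internal indices, and a size-weighted entropy as an external index, then uses them to compare two clusterings of the same points. The centroid-based definitions, the function names, the four data points and the class labels are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def group_by_label(items, labels):
    """Collect items into lists keyed by their label."""
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)
    return groups


def cohesion_and_separation(points, labels):
    """Internal indices: SSE (within clusters) and SSB (between clusters)."""
    overall = centroid(points)
    sse = ssb = 0.0
    for members in group_by_label(points, labels).values():
        c = centroid(members)
        sse += sum(math.dist(p, c) ** 2 for p in members)
        ssb += len(members) * math.dist(c, overall) ** 2
    return sse, ssb


def entropy(cluster_labels, class_labels):
    """External index: entropy of class labels per cluster, weighted by cluster size."""
    total = len(class_labels)
    result = 0.0
    for members in group_by_label(class_labels, cluster_labels).values():
        probs = [n / len(members) for n in Counter(members).values()]
        result += (len(members) / total) * -sum(p * math.log2(p) for p in probs)
    return result


points = [(1.0,), (2.0,), (4.0,), (5.0,)]
classes = ["a", "a", "b", "b"]   # hypothetical external labels
clustering_1 = [0, 0, 1, 1]      # groups neighbouring points together
clustering_2 = [0, 1, 0, 1]      # interleaves the points

# Relative evaluation: apply the same measures to both clusterings and compare.
for name, labels in [("clustering_1", clustering_1), ("clustering_2", clustering_2)]:
    sse, ssb = cohesion_and_separation(points, labels)
    print(f"{name}: SSE={sse:.2f} SSB={ssb:.2f} entropy={entropy(labels, classes):.2f}")
```

The first clustering has the lower SSE, the higher SSB and the lower entropy, so it looks better on both the internal and the external measures. Note that SSE + SSB is the same for both clusterings, since together they make up the total sum of squares of the data.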
These and some previous posts are based on a review of the book "Clustering: Basic Concepts and Algorithms".