In today's post, I want to talk about a nuance of K-means clustering. In K-means, each vector is assigned to the cluster whose centroid is nearest. There are also ways to form clusters without centroids, such as single-link and complete-link (hierarchical) methods, but centroid-based clustering is the simplest in that the computation at each step is limited.
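To make that assignment rule concrete, here is a minimal sketch of assigning each vector to its nearest centroid using Euclidean distance. The function and array names are my own illustration, not anything from a particular library.

```python
import numpy as np

def assign_to_nearest_centroid(points, centroids):
    """Return, for each point, the index of the nearest centroid (Euclidean distance)."""
    # distances[i, j] = distance from point i to centroid j
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)
```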
Note that the number of clusters k is specified before the program starts, and we don't change that expected outcome; that is, we don't return fewer clusters than requested. Even if all the data points visually belong to one group, the method still aims to partition the n data points into k clusters, each with its own mean, or centroid.
The mean of each cluster is recomputed after every round of assigning data points to clusters. We start with k clusters, say three, each with an initialized mean. At each step, every data point is assigned to the cluster with the nearest mean. In this scheme, clusters are not allowed to remain empty: if a cluster becomes empty because all its members have joined other clusters, it should take an outlier from an already populated cluster. This way the coherency of each cluster goes up.
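Below is a minimal sketch of that loop: assign each point to the nearest mean, recompute the means, and, if a cluster ends up empty, hand it an outlier from the populated clusters. This is my own illustration, with "outlier" read as the point farthest from its currently assigned centroid; the function names and the simple empty-cluster handling are assumptions, not the post's exact procedure.

```python
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means sketch that does not leave a cluster empty."""
    rng = np.random.default_rng(seed)
    # Initialize the k means by picking k distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Empty-cluster handling: give an empty cluster the point that lies
        # farthest from its currently assigned centroid (a simple treatment;
        # assumes at most one cluster empties out per round).
        for c in range(k):
            if not np.any(labels == c):
                farthest = np.argmax(distances[np.arange(len(points)), labels])
                labels[farthest] = c

        # Update step: recompute each mean from the members assigned to it.
        new_centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```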
If the number of clusters is large to begin with and the data set is small, this leads to a highly partitioned data set in which one or more clusters do not adequately represent the data, and the resulting clusters may need to be aggregated to see the overall distribution. However, this is not an undesirable outcome; it is the expected outcome for the number of partitions specified. The number of partitions was simply specified too high, and this will be reflected in the chi-square goodness of fit. The next step, then, should be to reduce the number of clusters.
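One way to see that k was set too high is to compare the fit over several candidate values. The sketch below reuses the kmeans() function above and uses the total within-cluster squared distance as a simple stand-in for a goodness-of-fit score; that substitution is my assumption, not the chi-square test the text refers to.

```python
import numpy as np

def total_within_cluster_distance(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((points - centroids[labels]) ** 2))

def report_fit(points, candidate_ks):
    """Print the fit for each candidate k, so an over-specified k can be reduced."""
    for k in candidate_ks:
        labels, centroids = kmeans(points, k)  # kmeans() from the sketch above
        print(k, total_within_cluster_distance(points, labels, centroids))
```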
If we specify only two clusters and all the data points are visually close to one predominant cluster, the other cluster still need not be kept empty. It can improve the predominant cluster by taking one of its outliers into the secondary cluster.
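As a usage example of the kmeans() sketch above: with k=2 on data that is visually one tight blob plus a single far-away point, the secondary cluster is not left empty; it ends up holding the outlier. The data below is made up purely for illustration.

```python
import numpy as np

# One tight blob around the origin plus a single far-away point.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), [[5.0, 5.0]]])

labels, centroids = kmeans(points, k=2)  # kmeans() from the sketch above
print(np.bincount(labels))  # one cluster holds the blob, the other the lone outlier
```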