We talked earlier about summarizing data points when clustering very large data sets. Summarization is common to many algorithms that scale clustering beyond main memory, and today we will discuss a simple summarization-based technique called BFR, which builds on the K-means clustering we discussed earlier.
Data points are read into memory, and here too the initial k cluster centroids are chosen by some simple approach. For example, we can take a sample, pick one random point, and then pick k-1 more points, each as far as possible from the points already chosen.
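As a rough sketch, that initialization might look like the following Python, assuming the sample already fits in memory; the helper names euclidean and pick_initial_centroids are mine, not part of BFR.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_initial_centroids(sample, k):
    """Pick one random sample point, then repeatedly take the point whose
    minimum distance to the centroids chosen so far is largest."""
    centroids = [random.choice(sample)]
    while len(centroids) < k:
        farthest = max(sample, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids
```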
We maintain three sets of data points, called the discard set, the compression set, and the retained set. The discard set holds points close enough to a centroid to be summarized and then discarded. The compression set holds groups of points that are close to one another but not close to any centroid; they are summarized but not yet assigned to a cluster. The retained set holds isolated points that are kept in memory as they are.
The discard set of each cluster is summarized by:
The number of points, N.
The vector SUM, whose ith component is the sum of the points' coordinates in the ith dimension.
The vector SUMSQ, whose ith component is the sum of the squares of the coordinates in the ith dimension.
These 2d + 1 values, where d is the number of dimensions, can represent any number of points.
The centroid in the ith dimension is calculated as SUM_i / N.
The variance in the ith dimension is calculated as SUMSQ_i / N - (SUM_i / N)^2.
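As a rough illustration of this bookkeeping, here is a small Python sketch of one such summary; the class name ClusterSummary and its method names are assumptions for illustration, not part of the algorithm's specification.

```python
class ClusterSummary:
    """The 2d+1 statistics (N, SUM, SUMSQ) for one summarized cluster."""

    def __init__(self, d):
        self.n = 0               # N: number of points summarized so far
        self.sum = [0.0] * d     # SUM_i: per-dimension sum of coordinates
        self.sumsq = [0.0] * d   # SUMSQ_i: per-dimension sum of squared coordinates

    def add(self, point):
        """Fold one point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        """Centroid coordinate in each dimension: SUM_i / N."""
        return [s / self.n for s in self.sum]

    def variance(self):
        """Variance in each dimension: SUMSQ_i / N - (SUM_i / N)^2."""
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]
```

Because only N, SUM, and SUMSQ are stored, two summaries can be combined by adding their fields component-wise, which is what makes the merging steps described later cheap.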
These summaries are used and updated as the data points of the very large data set are processed in rounds.
The data points are processed in the following manner:
Data points that are sufficiently close to a cluster centroid are added to that cluster and its discard set.
Cluster the remaining points together with the old retained set. The mini-clusters that form go to the compression set, and the outliers go to the retained set. Since these mini-clusters are not yet assigned to any of the k clusters, their points are not added to any discard set.
The summaries for these clusters are then updated to include the new points. Compression-set clusters that end up close to one another can also be merged. If this is the last round, merge every compression-set cluster and every retained point into the nearest of the k clusters.
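To tie these steps together, here is a minimal sketch of one round, reusing the euclidean helper and the ClusterSummary class from the sketches above. The function name process_chunk, the plain distance threshold standing in for "sufficiently close", and the cluster_fn callback (assumed to return the mini-clusters and the leftover outliers) are all assumptions for illustration; merging nearby compression-set clusters and the final last-round merge are omitted.

```python
def process_chunk(points, summaries, compression, retained, threshold, cluster_fn):
    """One BFR round over a chunk of points from the large data set.

    summaries   : one ClusterSummary per k-means cluster (the discard sets)
    compression : list of ClusterSummary objects for the compression set
    retained    : isolated points carried over from earlier rounds
    threshold   : distance below which a point counts as sufficiently close
    cluster_fn  : any in-memory clustering routine; assumed to return
                  (mini_clusters, outliers)
    """
    leftovers = []
    for p in points:
        # Step 1: a point near an existing centroid joins that cluster's discard set.
        dists = [euclidean(p, s.centroid()) for s in summaries]
        nearest = min(range(len(summaries)), key=lambda j: dists[j])
        if dists[nearest] < threshold:
            summaries[nearest].add(p)
        else:
            leftovers.append(p)

    # Step 2: cluster the remaining points together with the old retained set.
    mini_clusters, new_retained = cluster_fn(leftovers + retained)

    # Step 3: summarize each mini-cluster into the compression set;
    # the outliers become the new retained set.
    for members in mini_clusters:
        cs = ClusterSummary(len(members[0]))
        for p in members:
            cs.add(p)
        compression.append(cs)
    return compression, new_retained
```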