Saturday, October 26, 2013

We discussed SKWIC and Fuzzy SKWIC, in that order. This post leans toward implementation, starting from the basics. Actual implementation details may follow in a later post, but here we cover the fundamentals.
Given a set of tuples with a fixed set of attributes, we can implement different clustering algorithms. Let us pick something really simple, like classifying each tuple into one of three categories, say X, Y and Z. Suppose the tuples represent points with (x, y) co-ordinates. Plotted on a graph with x and y axes, these points lie in the familiar Cartesian plane. Visualizing the points on this graph helps us see the clusters X, Y and Z as separate from each other. It also lets us compute the distance between any two points as the square root of the sum of the squared differences of their x and y co-ordinates.
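As a quick illustration, here is a small Python sketch of that distance calculation (the function name is just a choice for this example):

import math

def distance(p, q):
    # square root of the sum of the squared coordinate differences
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

print(distance((0, 0), (3, 4)))  # prints 5.0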
One approach is to use a centroid, the mean of the points in a cluster, and represent the whole cluster with that single point. This works in our case since we only need to keep track of three points for the three categories we want our points in. This is called K-means, and in our case K = 3.
So the steps involved are something like this:
1) Initialize the clusters by picking three points as centroids: pick one point at random, then pick each of the other two as far away as possible from the points already chosen.
2) For each point, place it in the nearest cluster, that is, the cluster whose centroid is at the minimum distance, and recompute the centroid for that cluster.
3) After all the points are assigned, fix the centroids of these three clusters.
4) Re-assign all the points to their closest centroid (a Python sketch of these steps follows the list).
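To make the steps concrete, here is a rough Python sketch, reusing the distance() helper from above. The function names and the fixed iteration count are choices made just for this example, not part of the algorithm itself:

import random

def init_centroids(points, k=3):
    # Step 1: one random point, then each further centroid is the
    # point farthest from all centroids chosen so far.
    centroids = [random.choice(points)]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(distance(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids

def kmeans(points, k=3, iterations=10):
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        # Steps 2 and 4: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: fix each centroid as the mean of its cluster's points.
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters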
Sometimes the right K is not clear initially, and we may need to try different values of K, watching the change in the average distance to the centroid. As K increases, this average drops rapidly, but beyond the right K it changes very little.
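Here is a sketch of that trial loop, reusing kmeans() and distance() from the sketches above on a small made-up data set:

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 9), (2, 9)]

def average_distance(points, centroids):
    # mean distance from each point to its nearest centroid
    return sum(min(distance(p, c) for c in centroids)
               for p in points) / len(points)

for k in range(1, 6):
    centroids, _ = kmeans(points, k)
    print(k, average_distance(points, centroids))
# The average drops rapidly and then flattens; with this toy data the
# flattening starts around K = 3.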
This is K-means clustering. Next we will look at optimizing memory and CPU usage.
For now, we can assume initialization is cheap: pick the centroids at random and re-select if the picks turn out not to be far enough apart.
In general, when there are many tuples, we load them in batches that fit in memory. We then maintain the three classes of points and use summarization, so that each cluster is represented by a compact data structure and clustering can continue batch by batch.
For each cluster, the summarization info includes N, the number of points; a vector holding the sum of the points' co-ordinates in each dimension; and a vector holding the sum of the squares of the co-ordinates in each dimension.
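A minimal sketch of such a summary in Python (the class name and methods are invented for illustration):

class ClusterSummary:
    # keeps N, the per-dimension sums, and the per-dimension sums of
    # squares instead of the raw points
    def __init__(self, dims):
        self.n = 0
        self.sums = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sums[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        # SUM[i] / N recovers the centroid without the raw points
        return [s / self.n for s in self.sums]

    def variance(self):
        # SUMSQ[i] / N - (SUM[i] / N)^2 gives the per-dimension variance
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sums, self.sumsq)]

With this, each batch of points can be folded into the running summaries and then discarded.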
