Monday, April 1, 2013

BIRCH review

This is a review of "BIRCH: A New Data Clustering Algorithm and Its Applications".
BIRCH stands for "Balanced Iterative Reducing and Clustering using Hierarchies". It is an efficient, scalable data clustering method; the paper uses it to build an interactive pixel classification tool and to generate the initial codebook for image compression. Data clustering refers to the problem of dividing N data points into K groups so that points within a group are more similar to one another than to points in other groups. BIRCH works well with very large data sets because it first generates a compact summary that retains as much of the distribution information as possible, and then clusters that summary instead of the raw data. BIRCH also lets other clustering algorithms be applied to its summary.
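The compact summary at the heart of the paper is the Clustering Feature (CF): each subcluster is represented by the triple (N, LS, SS), i.e. the number of points, their linear sum, and their squared sum. The sketch below is my own minimal illustration (the class and variable names are mine, not the paper's) of why the triple suffices: the centroid and radius can be derived from it directly, and two triples merge by simple addition, with no need to revisit the raw points.

import numpy as np

class CF:
    """Clustering Feature: the (N, LS, SS) summary of one subcluster."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1               # N: number of points summarized
        self.ls = p.copy()       # LS: linear sum of the points
        self.ss = float(p @ p)   # SS: sum of squared norms

    def merge(self, other):
        # CF triples are additive, so subclusters merge without raw data
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # average distance of the points to the centroid, computable
        # from the triple alone: sqrt(SS/N - ||LS/N||^2)
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))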
BIRCH proposes a data structure called the CF-tree to store these summaries efficiently. It builds the tree as it encounters the data points, inserting each point according to one of several distance metrics, and then applies a clustering algorithm to the leaf nodes. Unlike partitioning methods, in particular the widely used k-means algorithm, BIRCH is a hierarchical clustering algorithm. In partitioning, we pick initial cluster centers, assign every data instance to the closest center, and recompute the new centers of the k clusters, repeating until the assignments stabilize.

A hierarchical clustering is one with multiple levels of clusterings, as in a dendrogram: level 1 has n clusters and level n has one cluster. Take n = 8 points as an example. First we compute the similarity between all possible pairs of data instances and merge the closest pair, which leaves k = 7 clusters at level 2. Merging the next closest pair gives k = 6 clusters at level 3, and again k = 5 at level 4. At level 5 (k = 4), the most similar pair among the clusters formed so far merges into a new cluster. At level 6 (k = 3), we compute the similarity between the new cluster and all remaining clusters and merge again. Level 7 leaves two clusters, and level 8 a single one. Each pass of partitioning clustering thus runs in O(n) time, while hierarchical clustering takes O(n² log n); it is for the richer, multi-level output that the latter is chosen. Both need to keep the data in memory.
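To make the level-by-level walk above concrete, here is a naive agglomerative run of my own (not code from the paper): start with n = 8 singleton clusters and repeatedly merge the pair with the closest centroids, printing the cluster count at each level of the dendrogram.

import itertools
import numpy as np

points = np.array([[0, 0], [0, 1], [1, 0], [5, 5],
                   [5, 6], [6, 5], [9, 0], [9, 1]], dtype=float)
clusters = [[i] for i in range(len(points))]  # level 1: n singleton clusters

level = 1
while len(clusters) > 1:
    print("level %d: %d clusters" % (level, len(clusters)))
    # merge the pair of clusters whose centroids are closest
    i, j = min(itertools.combinations(range(len(clusters)), 2),
               key=lambda ij: np.linalg.norm(points[clusters[ij[0]]].mean(axis=0)
                                             - points[clusters[ij[1]]].mean(axis=0)))
    clusters[i].extend(clusters[j])
    del clusters[j]
    level += 1
print("level %d: 1 cluster" % level)

The pairwise search here is deliberately naive (O(n³) overall); the O(n² log n) figure assumes the pairwise distances are kept sorted, for example in a priority queue.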
BIRCH does incremental and dynamic clustering of incoming objects with only one scan of the data, followed by the application of clustering algorithms to the leaf nodes.

BIRCH clusters in the agglomerative style, where the cluster centers are not decided up front. Only two parameters are fixed: the maximum number of cluster summaries and the threshold (maximum radius) for any single cluster. These two limits are chosen so that the cluster summaries stay in memory even for arbitrarily large data sets.
The algorithm reads the data points sequentially and proceeds in the following manner (a small sketch of the loop follows the three steps):
Step 1: Compute the distance between record r and each of the existing cluster centers. Let i be the index of the cluster whose center is closest to r.
Step 2: Recompute the radius of cluster i as if record r were inserted into it. If the cluster threshold is not exceeded, insert r and proceed to the next record. Otherwise, start a new cluster containing only record r.
Step 3: Check that step 2 has not exceeded the maximum number of cluster summaries. If it has, increase the threshold so that existing clusters can accommodate more records or be merged into fewer summaries.
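Putting the three steps together, here is a minimal sketch of the loop (my own naming and flat-list structure; the paper organizes the summaries in a CF-tree instead). Each summary keeps the count, linear sum, and squared sum of its points, so the step-2 radius test is O(1):

import numpy as np

def radius_if_added(n, ls, ss):
    # radius of a summary from its (count, linear sum, squared sum)
    centroid = ls / n
    return float(np.sqrt(max(ss / n - centroid @ centroid, 0.0)))

def insert_all(records, threshold, max_clusters):
    clusters = []  # each summary: {"n": count, "ls": vector, "ss": float}
    for r in records:
        r = np.asarray(r, dtype=float)
        if clusters:
            # step 1: index of the cluster whose center is closest to r
            i = min(range(len(clusters)),
                    key=lambda k: np.linalg.norm(
                        clusters[k]["ls"] / clusters[k]["n"] - r))
            c = clusters[i]
            # step 2: recompute the radius as if r were inserted
            n, ls, ss = c["n"] + 1, c["ls"] + r, c["ss"] + float(r @ r)
            if radius_if_added(n, ls, ss) <= threshold:
                c["n"], c["ls"], c["ss"] = n, ls, ss
                continue
        # threshold exceeded (or no clusters yet): new cluster with only r
        clusters.append({"n": 1, "ls": r.copy(), "ss": float(r @ r)})
        # step 3: too many summaries -> raise the threshold and merge
        while len(clusters) > max_clusters:
            threshold *= 2
            clusters = merge_summaries(clusters, threshold)
    return clusters

def merge_summaries(clusters, threshold):
    # greedily re-absorb summaries under the larger threshold; the
    # summaries are additive, so merging never touches raw records
    merged = []
    for c in clusters:
        for m in merged:
            n, ls, ss = m["n"] + c["n"], m["ls"] + c["ls"], m["ss"] + c["ss"]
            if radius_if_added(n, ls, ss) <= threshold:
                m["n"], m["ls"], m["ss"] = n, ls, ss
                break
        else:
            merged.append(c)
    return merged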
