BIRCH clustering works in an agglomerative style, where the cluster centers are not decided up front. Only the maximum
number of cluster summaries and the threshold for any cluster are decided in advance. These two parameters are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.
The algorithm reads each of the data points sequentially and proceeds in the following manner:
Step 1: Compute the distance between the record r and each of the cluster centers. Let i be the index of the cluster whose center is closest to r.
Step 2: For the i'th cluster, recompute the radius as if the record r were inserted into it. If the cluster threshold is not exceeded, insert r into that cluster and proceed to the next record. Otherwise, start a new cluster containing only the record r.
Step 3: Check that step 2 has not pushed the number of cluster summaries past the maximum. If it has, increase the threshold so that existing clusters can accommodate more records or can be merged into fewer cluster summaries.
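The three steps can be sketched in Python roughly as follows. This is only a minimal illustration under several assumptions that are not stated above: each cluster summary is kept as a count, a per-dimension linear sum, and a sum of squared norms; the radius is taken to be the root mean squared distance of the members from the centroid; and when there are too many summaries, the threshold is multiplied by a hypothetical growth factor and the existing summaries are re-absorbed into one another. The names Summary, absorb, birch_pass, and growth are made up for this sketch. The full BIRCH algorithm also organizes its summaries in a tree, which is left out here.

```python
import math

class Summary:
    """Cluster summary: count, per-dimension linear sum, sum of squared norms."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def of(cls, point):
        # A single record is a summary with one member.
        return cls(1, list(point), sum(x * x for x in point))

    def center(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Assumed definition: RMS distance of members from the centroid.
        r2 = self.ss / self.n - sum(c * c for c in self.center())
        return math.sqrt(max(r2, 0.0))

    def merged(self, other):
        # Summaries are additive, so merging never needs the raw records.
        return Summary(self.n + other.n,
                       [a + b for a, b in zip(self.ls, other.ls)],
                       self.ss + other.ss)

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def absorb(summaries, item, threshold):
    """Steps 1 and 2 for one item (a record or an existing summary)."""
    if summaries:
        # Step 1: find the cluster whose center is closest to the item.
        i = min(range(len(summaries)),
                key=lambda k: distance(item.center(), summaries[k].center()))
        # Step 2: recompute the radius as if the item were inserted.
        candidate = summaries[i].merged(item)
        if candidate.radius() <= threshold:
            summaries[i] = candidate
            return
    # Threshold exceeded (or no clusters yet): start a new cluster.
    summaries.append(item)

def birch_pass(points, threshold, max_summaries, growth=2.0):
    if threshold <= 0 or max_summaries < 1 or growth <= 1:
        raise ValueError("need threshold > 0, max_summaries >= 1, growth > 1")
    summaries = []
    for r in points:
        absorb(summaries, Summary.of(r), threshold)
        # Step 3: too many summaries, so loosen the threshold and merge.
        while len(summaries) > max_summaries:
            threshold *= growth
            old, summaries = summaries, []
            for s in old:
                absorb(summaries, s, threshold)
    return summaries, threshold
```

Calling birch_pass([(0, 0), (0.1, 0), (10, 10), (10.1, 10)], threshold=1.0, max_summaries=5) returns the list of summaries together with the final threshold, which may be larger than the one passed in if step 3 had to loosen it.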