In the previous post, we talked about finding tau and adjusting it so that the two components of the objective function have the same magnitude. What we will see next is that the cluster partition that minimizes J is the one that assigns each data sample to the cluster with the nearest center. In other words, a document xj is assigned to cluster i when its weighted aggregated cosine-based distance Dwcij is no greater than the weighted aggregated distance to any other cluster center. Ties between candidate centers are resolved arbitrarily.
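As a rough illustration (not the author's code), here is a minimal Python sketch of that assignment rule; the names dist_wc and assign_clusters are mine, and dist_wc is assumed to already hold the weighted aggregated cosine-based distances Dwcij, one row per cluster and one column per document.

```python
import numpy as np

def assign_clusters(dist_wc):
    """Assign each document to the cluster whose weighted aggregated
    cosine-based distance is smallest.

    dist_wc: array of shape (C, N) where dist_wc[i, j] = Dwcij.
    Returns an array of N cluster indices.  np.argmin breaks ties by
    taking the lowest cluster index, i.e. ties are resolved arbitrarily.
    """
    return np.argmin(dist_wc, axis=0)
```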
Since the objective function cannot be minimized directly with respect to the centers, we compute new cluster centroids and normalize them to unit length.
Notice that this results in two cases for the feature weight vik, which, as we may remember from the previous post, is the sum of a default value (1/n) and a bias term. The bias can be either positive or negative.
The first case is when vik = 0.
This means that the kth feature is completely irrelevant to the ith cluster. Hence, regardless of the value of cik, this feature will not contribute to the overall weighted distance computation. In this situation any arbitrary value can be chosen for cik, and it is generally set to 0.
The second case is when vik != 0.
In this case the center is recomputed: cik is set to the normalized aggregated sum of the document frequency vectors along the kth attribute direction. In other words, since the kth feature has some relevance to the ith cluster, we simply pick roughly the mid-point of the cluster's members along that dimension.
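A minimal sketch of this two-case center update, under my own assumptions about the data layout: X holds the unit-length document vectors as rows, labels holds the current partition, and V holds the feature weights vik. The function name update_centers and the zero tolerance eps are mine; the member mean followed by unit-length normalization follows the description above.

```python
import numpy as np

def update_centers(X, labels, V, n_clusters, eps=1e-12):
    """Recompute the cluster centers component-wise.

    Case 1: vik == 0 -> the kth feature is irrelevant to cluster i, so
            cik can be arbitrary and is conventionally set to 0.
    Case 2: vik != 0 -> cik is the aggregated (averaged) value of the
            member documents along dimension k; the resulting center is
            then normalized to unit length.
    """
    n_docs, n_terms = X.shape
    C = np.zeros((n_clusters, n_terms))
    for i in range(n_clusters):
        members = X[labels == i]
        if len(members) == 0:
            continue                        # empty cluster: leave its center at zero
        center = members.mean(axis=0)       # aggregated sum divided by cluster size
        center[np.abs(V[i]) < eps] = 0.0    # case 1: zero-weight features set to 0
        norm = np.linalg.norm(center)
        if norm > eps:
            C[i] = center / norm            # normalize to unit length
        else:
            C[i] = center
    return C
```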
Now, let us summarize the above steps for clustering a collection of N normalized document vectors defined over a vocabulary of n keywords (a rough code sketch of the complete loop follows the list):
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partitions X, using equal feature weights (1/n), where n is the number of keywords in the vocabulary.
REPEAT
Compute the cosine-based distance between document xj and the ith cluster center along each dimension, where the kth component is Dwcijk = 1/n - xjk * cik. Repeat this for every cluster i from 1 to C, every document j from 1 to N, and every dimension k from 1 to n.
Update the relevance weights vik as the default value plus the bias along this dimension, where the bias is computed after the distance has been aggregated across all dimensions.
Compute the weighted aggregated sum of the cosine-based distances along the individual dimensions (the dot product of the feature weights and the per-dimension cosine-based distances).
Update the cluster partitioning with the method discussed above
Update the tau
UNTIL centers stabilize
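Putting the listed steps together, here is a rough Python sketch of the overall loop, under my assumptions rather than a faithful reproduction of the algorithm: the per-dimension distance uses the 1/n - xjk * cik form from the step above, update_centers is the two-case update sketched earlier, and update_weights and update_tau are hypothetical placeholders standing in for the default-plus-bias weight update and the tau adjustment described in the previous post, whose exact formulas are not repeated here.

```python
import numpy as np

def cluster_documents(X, n_clusters, update_weights, update_tau,
                      max_iter=100, tol=1e-6, seed=0):
    """Skeleton of the clustering loop summarized above.

    X: (N, n) matrix of unit-length document vectors over n keywords.
    update_weights(D, V, tau) and update_tau(D, V, labels, tau) are
    hypothetical hooks for the formulas given in the previous post.
    """
    N, n = X.shape
    rng = np.random.default_rng(seed)
    C = X[rng.choice(N, n_clusters, replace=False)]   # centers = C randomly chosen documents
    V = np.full((n_clusters, n), 1.0 / n)             # equal feature weights 1/n
    tau = np.ones(n_clusters)

    for _ in range(max_iter):
        # Per-dimension cosine-based distance: D[i, j, k] = 1/n - xjk * cik
        D = 1.0 / n - X[None, :, :] * C[:, None, :]   # shape (C, N, n)

        V = update_weights(D, V, tau)                 # default value + bias per dimension
        dist_wc = np.einsum('ik,ijk->ij', V, D)       # weighted aggregated distance (dot product)
        labels = np.argmin(dist_wc, axis=0)           # nearest-center partition, ties arbitrary

        C_new = update_centers(X, labels, V, n_clusters)  # two-case update sketched earlier
        tau = update_tau(D, V, labels, tau)

        if np.linalg.norm(C_new - C) < tol:           # stop when centers stabilize
            C = C_new
            break
        C = C_new
    return labels, C, V
```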
The difference between some similar approaches in pattern recognition and the approach here is that the former assume the data follows a multivariate Gaussian distribution, whereas this approach treats the attributes as independent.