Thursday, October 31, 2013

We were discussing the implementation of Fuzzy SKWIC, where we were going through the steps within each iteration, which we repeat until the cluster centers stabilize. So far we had looked into finding the cosine distances, updating and adjusting the relevance weights, and computing and adjusting the weighted aggregated cosine distance. We will discuss the updates to the partition matrix, the cluster centers and tau next.

The partition matrix was initialized with membership weights in a C x N matrix, each with a value ranging from 0 to 1. The partition matrix has the additional constraints that the column-wise sum along clusters must equal 1 for every document, and the row-wise sum along documents must be greater than 0 and less than N for every cluster. To update this matrix in each iteration, we use the computed weighted aggregated cosine distances. Here we take the ratio of the weighted aggregated Dwcij and Dwckj for all the clusters. Note that we may feel like we have not computed the denominator until now, since it has not been mentioned so far. It follows the same computation as the numerator, but against a cluster k ranging from 1 to C instead of the specific ith cluster. Both i and k are cluster indices, and k runs from 1 to C for the same i in this computation. Therefore Dwcij and Dwckj are both known before this step to update the matrix. Once we calculate the ratios, we raise each of them to the power 1/(m-1), where m is the fuzzifier that controls the degree of fuzziness of the partition matrix values, sum them over all clusters k, and then take the inverse of that sum to get the new partition matrix value uij.
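To make the update concrete, here is a minimal sketch in Python with NumPy, assuming the weighted aggregated cosine distances are already available in a C x N array. The function name and the small epsilon guard against division by zero are my own additions, not from the paper.

import numpy as np

def update_partition_matrix(Dwc, m=2.0, eps=1e-12):
    # Dwc: C x N array, Dwc[i, j] = weighted aggregated cosine distance
    # of document j to cluster i; m is the fuzzifier (m > 1).
    C, N = Dwc.shape
    U = np.zeros((C, N))
    for i in range(C):
        for j in range(N):
            # ratio of cluster i's distance to every cluster k's distance,
            # raised to the power 1/(m-1), summed over k, then inverted
            ratios = (Dwc[i, j] + eps) / (Dwc[:, j] + eps)
            U[i, j] = 1.0 / np.sum(ratios ** (1.0 / (m - 1)))
    return U

Each column of the result sums to 1, which matches the constraint on the partition matrix.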
Next we look at updating the cluster centers. Here, if the feature relevance weight vik = 0, then the kth feature has no bearing on cluster i, so we don't update that component of the cluster center. If the feature weight is greater than zero, we update the cluster center component with the fuzzy-membership-weighted mean of the documents.
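A matching sketch of this step, under the same assumptions as above (X holds the documents as rows, U is the partition matrix just computed, and V holds the per-cluster feature relevance weights); zeroing out the components whose relevance weight is zero is just one way to express "do not update":

def update_cluster_centers(X, U, V, m=2.0, eps=1e-12):
    # X: N x K document-term matrix, U: C x N partition matrix,
    # V: C x K feature relevance weights per cluster.
    Um = U ** m                                          # fuzzified memberships
    centers = (Um @ X) / (Um.sum(axis=1, keepdims=True) + eps)
    centers[V <= 0] = 0.0    # components with zero relevance weight are left out
    return centers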
Finally, we update tau. Tau is the last to be updated in the iteration, and the only inputs that have changed since the last computation are the partition matrix and the newly computed Dwcij.
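The formula is not restated here, but in the SCAD/Fuzzy SKWIC family tau for each cluster is typically recomputed as a constant times the fuzzy-membership-weighted sum of the distances, divided by the sum of the squared feature weights. The sketch below assumes that form; K_tau is a placeholder tuning constant.

def update_tau(U, Dwc, V, m=2.0, K_tau=0.1, eps=1e-12):
    # Assumed form: tau_i = K_tau * sum_j u_ij^m * Dwc_ij / sum_k v_ik^2
    Um = U ** m
    weighted_dist = (Um * Dwc).sum(axis=1)        # per-cluster weighted distance
    weight_norm = (V ** 2).sum(axis=1) + eps      # per-cluster squared feature weights
    return K_tau * weighted_dist / weight_norm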
Thus we see that the data structures we need are just the partition matrix and a few per-iteration variables. The steps in the iteration are as discussed above.
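Putting it together, one possible shape of the iteration is sketched below. The helpers update_relevance_weights and weighted_aggregated_cosine stand in for the steps covered in the earlier posts and are not reproduced here, so treat this as an outline rather than a finished implementation.

def fuzzy_skwic(X, C, m=2.0, max_iter=100, tol=1e-4):
    N, K = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((C, N))
    U /= U.sum(axis=0, keepdims=True)       # columns sum to 1
    V = np.full((C, K), 1.0 / K)            # equal feature relevance to start
    centers = X[rng.choice(N, size=C, replace=False)]    # seed centers
    tau = np.full(C, 0.1)                   # placeholder initial tau
    for _ in range(max_iter):
        # steps from the earlier posts (placeholder helpers)
        V = update_relevance_weights(X, U, centers, tau)
        Dwc = weighted_aggregated_cosine(X, centers, V)
        # steps discussed above
        U = update_partition_matrix(Dwc, m)
        new_centers = update_cluster_centers(X, U, V, m)
        tau = update_tau(U, Dwc, V, m)
        if np.linalg.norm(new_centers - centers) < tol:   # centers stabilized
            break
        centers = new_centers
    return U, V, centers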
The authors of Fuzzy SKWIC mention a very interesting phenomenon called noise magnets. One cluster starts accumulating outliers and noise while the others remain more compact, with their constituents at similar distances from the center of the cluster. This happens because the outlier documents are neither similar to each other nor similar to any cluster. In addition, they get assigned the maximum distance of 1, so they have hardly any effect on the objective function no matter where they are placed, and they simply join the cluster whose seed happens to be nearest to them. Further, they bring down the feature relevance weights of this cluster due to extreme averaging, which makes its distance measure less discriminating. Consequently, it draws all the remaining outliers to itself.
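Using the membership sketch from above, it is easy to see the first part of this: a document at the maximum distance of 1 from every cluster receives a flat membership of 1/C, so its contribution is the same wherever it lands.

Dwc_outlier = np.full((4, 1), 1.0)            # 4 clusters, one outlier document
print(update_partition_matrix(Dwc_outlier))   # each membership comes out ~0.25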
