We alluded to several stages of text processing that have to happen before we can do clustering. We also compared agglomerative and K-means clustering. Recently, agglomerative processing has been made very fast and efficient, and since it represents a bottom-up merge strategy, it has some of the advantages we discussed. However, K-means is considered better for text purposes. For the rest of our implementation discussion, we will focus on K-means only.
We select C documents by random sampling as the initial cluster centers. We initialize the fuzzy partition matrix of size C x N, where each value in the matrix is less than 1, each row-wise sum is less than N, and each column-wise sum equals 1.
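The initialization above can be sketched as follows. This is a minimal illustration, not the implementation discussed earlier: the function names, the seed parameter, and the choice to normalize random positive values are all assumptions made for the example.

```python
import random

def pick_initial_documents(num_docs, num_clusters, seed=None):
    """Return the indices of C distinct documents chosen by random sampling."""
    rng = random.Random(seed)
    return rng.sample(range(num_docs), num_clusters)

def init_fuzzy_partition(num_clusters, num_docs, seed=None):
    """Return a C x N membership matrix in which every column sums to 1."""
    rng = random.Random(seed)
    # Draw a strictly positive random value for every (cluster, document) pair.
    raw = [[rng.random() + 1e-9 for _ in range(num_docs)]
           for _ in range(num_clusters)]
    # Normalize each column so one document's memberships add up to 1.
    col_sums = [sum(raw[i][j] for i in range(num_clusters))
                for j in range(num_docs)]
    return [[raw[i][j] / col_sums[j] for j in range(num_docs)]
            for i in range(num_clusters)]
```

With at least two clusters, every entry is strictly between 0 and 1, so each row sum automatically stays below N.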
We populate the document frequency vector xi for each of the N documents in the collection. Each vector has one component for each of the n terms.
To calculate the cosine similarity between document frequency vectors xi and xj, we take their dot product, aggregated over the n terms, and divide it by the product of their magnitudes.
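A minimal sketch of this calculation, assuming the vectors are plain lists of term frequencies of equal length. The function name and the choice to return 0.0 for an all-zero vector are assumptions for the example.

```python
import math

def cosine_similarity(x_i, x_j):
    """Dot product of two term vectors divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm_i = math.sqrt(sum(a * a for a in x_i))
    norm_j = math.sqrt(sum(b * b for b in x_j))
    if norm_i == 0.0 or norm_j == 0.0:
        # An empty document has no direction; treat it as similar to nothing.
        return 0.0
    return dot / (norm_i * norm_j)
```

If the vectors are normalized to unit length up front, the division disappears and the similarity reduces to the dot product alone.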
To calculate the dissimilarity between document frequency vector xj and the ith of the C cluster center vectors, ci, we aggregate the weighted distance over each of the n terms. The distance along an individual dimension k is the cosine-based distance 1/n - xjk.cik, where k varies from 1 to n.
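The weighted aggregation can be sketched as below. The per-term weights v_i and the function name are assumptions for the example, and the sketch assumes that the document and center vectors are already normalized to unit length, so that the sum of xjk.cik over k is their cosine similarity.

```python
def weighted_dissimilarity(x_j, c_i, v_i):
    """Sum over the n terms of v_ik * (1/n - x_jk * c_ik)."""
    n = len(x_j)
    return sum(v * (1.0 / n - x * c) for v, x, c in zip(v_i, x_j, c_i))
```

With every weight set to 1, this reduces to 1 minus the cosine similarity: identical unit vectors give 0 and orthogonal ones give 1.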
The rest of the processing then follows the algorithm we discussed earlier.
Note that N is typically much smaller than n, and that in the C x N matrix, C is typically far smaller than n.
This means that all our aggregations over the n terms have to be efficient. This could be achieved by reducing the number of iterations over n and possibly by some pre-computation.
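One way to cut the iterations over n, sketched here under the assumption that each document contains only a small fraction of the n terms, is to store a document as a dictionary of its non-zero terms and to pre-compute the sum of the per-term weights once. The names and the dictionary representation are illustrative, not taken from the algorithm discussed earlier.

```python
def sparse_dissimilarity(x_j_sparse, c_i, v_i, v_i_sum, n):
    """Weighted dissimilarity that visits only the non-zero terms of the document.

    Uses the identity
        sum_k v_ik * (1/n - x_jk * c_ik) = (sum_k v_ik) / n - sum_k v_ik * x_jk * c_ik
    where the second sum only needs the terms with x_jk != 0.
    x_j_sparse maps a term index k to the non-zero value x_jk.
    v_i_sum is sum(v_i), computed once per cluster rather than once per document.
    """
    overlap = sum(v_i[k] * x * c_i[k] for k, x in x_j_sparse.items())
    return v_i_sum / n - overlap
```

The weight sum changes only when the weights change, so it can be computed once per cluster and iteration and reused across all N documents.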
Moreover, the aggregations over C and n may need to be repeated for many components, such as the error to the clusters and the feature weights. Similarly, the cosine-based distance could be computed once and stored for lookup later, so that we don't have to perform the same operations again. This could speed up processing.
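A rough sketch of the compute-once, look-up-later idea: the per-term distances are stored in a table, and two different aggregations then read from it. The function names, the dense list-of-lists layout, and the second aggregation (a membership-weighted sum per term, of the kind a feature-weight update might read) are assumptions for illustration only.

```python
def per_term_distances(docs, centers):
    """table[i][j][k] = 1/n - x_jk * c_ik, computed once and then reused."""
    n = len(docs[0])
    return [[[1.0 / n - x_k * c_k for x_k, c_k in zip(x, c)] for x in docs]
            for c in centers]

def dissimilarities(table, weights):
    """D[i][j] = sum over k of v_ik * table[i][j][k]."""
    return [[sum(v * d for v, d in zip(v_i, row)) for row in rows]
            for rows, v_i in zip(table, weights)]

def per_term_aggregate(table, memberships, m=2.0):
    """S[i][k] = sum over j of (u_ij ** m) * table[i][j][k]."""
    result = []
    for rows, u_i in zip(table, memberships):
        n = len(rows[0])
        result.append([sum((u ** m) * row[k] for u, row in zip(u_i, rows))
                       for k in range(n)])
    return result
```

The table holds C x N x n values, so this trades memory for time; for a large vocabulary it may only be practical one cluster at a time or in a sparse form.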
The fuzzy labels are found by aggregating over each of the clusters. They are then aggregated over the N documents, with a proper choice of m, the degree to which the fuzzy membership is raised, to compute the feature weights. They are also aggregated to update the cluster centers.
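For illustration, the sketch below uses the textbook fuzzy c-means forms of these two updates, which may differ in detail from the algorithm discussed earlier. It assumes m greater than 1 and non-negative dissimilarities; the function names and the handling of a zero dissimilarity are assumptions for the example.

```python
def update_memberships(dissim, m=2.0):
    """dissim[i][j] is the dissimilarity of document j to cluster i.

    Returns a C x N matrix with u_ij = 1 / sum_l (D_ij / D_lj) ** (1 / (m - 1)).
    """
    num_clusters, num_docs = len(dissim), len(dissim[0])
    power = 1.0 / (m - 1.0)
    u = [[0.0] * num_docs for _ in range(num_clusters)]
    for j in range(num_docs):
        column = [dissim[i][j] for i in range(num_clusters)]
        zeros = [i for i, d in enumerate(column) if d <= 0.0]
        if zeros:
            # The document coincides with one or more centers: share the
            # full membership among them and leave the others at 0.
            for i in zeros:
                u[i][j] = 1.0 / len(zeros)
            continue
        for i in range(num_clusters):
            u[i][j] = 1.0 / sum((column[i] / d) ** power for d in column)
    return u

def update_centers(u, docs, m=2.0):
    """c_ik = sum_j (u_ij ** m) * x_jk / sum_j (u_ij ** m)."""
    num_terms = len(docs[0])
    centers = []
    for u_i in u:
        w = [u_ij ** m for u_ij in u_i]
        total = sum(w)
        if total == 0.0:
            # No document belongs to this cluster at all; keep a zero center.
            centers.append([0.0] * num_terms)
            continue
        centers.append([sum(w_j * x[k] for w_j, x in zip(w, docs)) / total
                        for k in range(num_terms)])
    return centers
```

Because each column of the membership matrix is computed from ratios within that column, the memberships of a document again sum to 1 across the C clusters.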