Monday, April 15, 2013

keyword clustering

Wartena 08 describes topic detection by clustering methods. The paper suggests that the topic of a whole text can be found with an approach based on probabilistic latent semantic analysis. Keywords are clustered using a statistical similarity measure, and the clusters this approach finds correspond to the conditional probability distributions that indicate the latent topics. Clustering uses a hierarchical top-down method in which the centers are recomputed at each step. The two elements with the largest distance between them are selected as seeds. Every other item is then assigned to the closer of the two seeds. Once all items are assigned, the center of each cluster is recomputed. The new centers serve as the new seeds, and the process repeats until the two centers converge. If a resulting cluster is larger than a threshold, the whole procedure is repeated on that cluster, which yields a binary tree of clusters. For two terms t and s, the paper considers four distance functions: the cosine similarity of the document distributions, the cosine similarity of the tf-idf vectors of the keywords, the Jensen-Shannon divergence between the document distributions, and the Jensen-Shannon divergence between the term distributions p_t and p_s.
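As a rough illustration of the two families of measures mentioned above, here is a minimal Python sketch of cosine similarity and Jensen-Shannon divergence. It is not the paper's code. It assumes each term is represented as an equal-length list of non-negative numbers (probabilities that sum to 1 for the divergence), and the choice of base-2 logarithms is my assumption, made so that the divergence falls between 0 and 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence of p and q measured against their average m."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Cosine similarity grows as two terms become more alike, so it has to be turned into a distance (for example, one minus the similarity) before it is used for clustering. The Jensen-Shannon divergence is already zero for identical distributions.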
With any of these distance functions, the initial cluster centers can be chosen from among the data points, and the centers are then recomputed at each step.
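The splitting procedure can be sketched in Python along the following lines. This is an illustration under stated assumptions, not the paper's implementation. The names `medoid`, `bisect`, `cluster_tree`, `max_size` and `max_iter` are hypothetical. Because only a distance function is available, the sketch assumes each recomputed center is the medoid, that is, the cluster member with the smallest total distance to the other members. It also assumes `dist(x, x)` is 0 and that items can be compared with `==`.

```python
from itertools import combinations


def medoid(items, dist):
    """Member with the smallest total distance to the rest (assumed notion of 'center')."""
    return min(items, key=lambda c: sum(dist(c, x) for x in items))


def bisect(items, dist, max_iter=100):
    """Split items in two, seeding with the pair that lies farthest apart."""
    seed_a, seed_b = max(combinations(items, 2), key=lambda pair: dist(*pair))
    for _ in range(max_iter):
        # Assign every item to the closer of the two current centers.
        left = [x for x in items if dist(x, seed_a) <= dist(x, seed_b)]
        right = [x for x in items if dist(x, seed_a) > dist(x, seed_b)]
        if not right:
            break  # degenerate case: the items cannot be separated
        new_a, new_b = medoid(left, dist), medoid(right, dist)
        if (new_a, new_b) == (seed_a, seed_b):
            break  # centers stopped moving
        seed_a, seed_b = new_a, new_b
    return left, right


def cluster_tree(items, dist, max_size=2):
    """Recursively split clusters larger than max_size; leaves are plain lists."""
    if len(items) <= max(max_size, 1):
        return items
    left, right = bisect(items, dist)
    if not right:
        return items
    return (cluster_tree(left, dist, max_size), cluster_tree(right, dist, max_size))


# Usage with one-dimensional points and absolute difference as the distance.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
tree = cluster_tree(points, lambda a, b: abs(a - b), max_size=3)
```

Any distance can be passed in as `dist`, for instance a Jensen-Shannon divergence over term distributions, as long as smaller values mean the items are more alike.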
