Saturday, January 25, 2014

Having looked at occurrence- and co-occurrence-based keyword selection, let's recap: both Zipf's law of word occurrences and the refined studies of Booth demonstrate that mid-frequency terms are closely related to the conceptual content of a document. It is therefore reasonable to hypothesize that terms whose frequencies lie close to the transition point (TP) can be used as index terms, where TP = (-1 + sqrt(8 * I + 1)) / 2 and I is the number of words with frequency 1. Alternatively, TP can be located as the first frequency that is not repeated when the vocabulary is sorted in non-increasing order of frequency.
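As a concrete illustration, here is a minimal Python sketch of both ways of locating TP. It assumes an already-tokenized document, and it reads the "first non-repeated frequency" rule as the smallest frequency held by exactly one word; that reading is our interpretation, not spelled out above.

```python
from collections import Counter
import math

def transition_point(tokens):
    """TP = (-1 + sqrt(8*I + 1)) / 2, where I is the number of
    words that occur exactly once (hapax legomena)."""
    freq = Counter(tokens)
    I = sum(1 for c in freq.values() if c == 1)
    return (-1 + math.sqrt(8 * I + 1)) / 2

def transition_point_alt(tokens):
    """Alternative rule: in a vocabulary sorted by non-increasing
    frequency, the first frequency that is not repeated, i.e. the
    smallest frequency held by exactly one word (our interpretation)."""
    freq_of_freq = Counter(Counter(tokens).values())
    unique = [f for f, n_words in freq_of_freq.items() if n_words == 1]
    return min(unique) if unique else None
```

Candidate index terms are then the terms whose frequency falls in a band around TP; how wide to make that band is a tuning choice the post leaves open.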
We will now describe a simple clustering-based method for co-occurrence-based keyword selection.
We start with a co-occurrence matrix that counts pairwise term co-occurrences. The matrix is N x M, where N is the total number of terms and M is the number of words selected as keywords using the TP criterion above, with M << N. We ignore the diagonal of this matrix (a term's co-occurrence with itself). The row-wise total for a term is the sum of its co-occurrences with all terms in the selected set.
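A sketch of how such a matrix might be built. The post does not fix a co-occurrence unit, so treating each sentence as the window is our assumption:

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, keywords):
    """Build the N x M co-occurrence matrix described above.

    sentences : list of token lists (one list per co-occurrence window;
                a sentence window is an assumption, not from the post)
    vocab     : list of all N terms
    keywords  : list of the M terms selected via TP (M << N)
    """
    row = {t: i for i, t in enumerate(vocab)}
    col = {t: j for j, t in enumerate(keywords)}
    C = np.zeros((len(vocab), len(keywords)), dtype=int)
    for sent in sentences:
        terms = set(sent)
        for t in terms:
            for k in terms:
                # skip self pairs, i.e. the ignored "diagonal"
                if t != k and t in row and k in col:
                    C[row[t], col[k]] += 1
    return C
```

The row-wise total for a term is then simply `C.sum(axis=1)`.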
If terms w1 and w2 have similar distributions of co-occurrence with the other terms, w1 and w2 are considered to belong to the same cluster. Similarity is measured with the Kullback-Leibler divergence, KLD(p || q) = Sum over all terms z of p(z) * log(p(z) / q(z)), where we take p and q to be the attribute-wise co-occurrence distributions of two terms, read off as rows of the matrix we populated. We then apply the standard k-means clustering method: the initial cluster centroids are chosen far from each other, each term is assigned to the nearest cluster, the centroids are recomputed, and the cycle is repeated until the cluster centers stabilize.
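The following sketch puts the two pieces together. The post leaves several details open, so the epsilon smoothing (KLD is undefined on zero counts), the direction of the divergence, and the farthest-first seeding used to pick centroids "far from each other" are all our assumptions:

```python
import numpy as np

def kld(p, q, eps=1e-12):
    """KLD(p||q) = sum_z p(z) * log(p(z)/q(z)).
    eps-smoothing avoids log(0); the smoothing choice is ours."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def cluster_terms(C, k, iters=100, seed=0):
    """k-means-style clustering of the rows of C using KLD as the
    dissimilarity. Assumes every row of C has a nonzero total
    (drop all-zero rows beforehand)."""
    rng = np.random.default_rng(seed)
    X = C / C.sum(axis=1, keepdims=True)  # rows -> co-occurrence distributions
    # farthest-first seeding: initial centroids chosen far from each other
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        d = [min(kld(x, c) for c in centroids) for x in X]
        centroids.append(X[int(np.argmax(d))])
    centroids = np.asarray(centroids)
    for _ in range(iters):
        # assign each term to the nearest centroid under KLD(term || centroid)
        labels = np.array([int(np.argmin([kld(x, c) for c in centroids]))
                           for x in X])
        # recompute centroids as the mean distribution of each cluster
        new = np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j)
                         else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # cluster centers have stabilized
            break
        centroids = new
    return labels
```

Note that KLD is not symmetric; a symmetrized variant (averaging both directions) is a common alternative if assignment order matters.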
The distribution of co-occurring terms in particular is given by the weighted average of the attribute-wise distributions:

P(t) = Sum-z ((n_z / n) * (n(t,z) / n_z))
     = (1/n) * Sum-z (n(t,z))

where n(t,z) is the co-occurrence count of term t with attribute term z, n_z = Sum-t (n(t,z)) is the total co-occurrence count of z, and n = Sum-z (n_z) is the grand total.
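Since the n_z weights cancel, this distribution is just each term's row total over the grand total of the matrix. On the matrix C from the sketch above:

```python
import numpy as np

def cooccurrence_distribution(C):
    """P(t) = (1/n) * Sum-z n(t,z): each row total of C divided
    by the grand total n, per the telescoping sum above."""
    return C.sum(axis=1) / C.sum()
```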

 
