Li and Yamanishi propose a stochastic topic model (STM) in which W denotes a set of words and K a set of topics. Each topic k in K has an associated word set, and the word distribution P(w|k) is zero whenever w is not included in k. The probability of a word w in W is then given by the finite mixture P(w) = sum over k of P(k)P(w|k). One such STM is defined for each block of text.

Word clusters are created from a large data corpus. Every word in the vocabulary is treated as a seed word, and for each seed word s, words w that co-occur with s are collected from the corpus and grouped into a cluster around s. A pair (s, w) is accepted when its co-occurrence score deltaSC exceeds a threshold gamma and P(w|s) > P(w) holds, i.e. when w occurs significantly more often in the context of s than it does overall. Because the (s, w) pairs are built once from the large corpus, the clusters do not have to be rebuilt for every candidate in a given document.

From the clustering results, any cluster (topic) whose seed word appears among the selected keywords is chosen. Two clusters ki and kj are merged when the seed word of ki is included in kj and vice versa, and the merge is performed only if their word sets differ.

Finally, topic segmentation is performed over the n candidates using a similarity measure: local minima of the similarity between consecutive candidates are located, say at position i, and the two candidates P1 and P2 adjacent to i, one on each side and most distant in similarity, are found. If both P1 - S(i) > theta and P2 - S(i) > theta, a segment boundary is placed at i.
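To make the mixture concrete, here is a minimal Python sketch of P(w) = sum over k of P(k)P(w|k), with P(w|k) forced to zero outside a topic's word set. The topics, priors, and word probabilities below are invented placeholders, not the authors' data.

```python
# Minimal sketch of the STM mixture P(w) = sum_k P(k) * P(w|k), where
# P(w|k) is zero for any word outside topic k's word set.
# The topics, priors, and probabilities below are illustrative only.

topics = {
    "sports":  {"word_set": {"game", "team", "score"},
                "prior": 0.4,
                "word_probs": {"game": 0.5, "team": 0.3, "score": 0.2}},
    "finance": {"word_set": {"stock", "market", "score"},
                "prior": 0.6,
                "word_probs": {"stock": 0.4, "market": 0.4, "score": 0.2}},
}

def p_word(w):
    """P(w) = sum over topics k of P(k) * P(w|k); P(w|k) = 0 if w is not in k."""
    total = 0.0
    for topic in topics.values():
        if w in topic["word_set"]:                 # zero contribution otherwise
            total += topic["prior"] * topic["word_probs"].get(w, 0.0)
    return total

print(p_word("score"))  # word shared by both topics: 0.4*0.2 + 0.6*0.2 = 0.2
print(p_word("game"))   # only in "sports": 0.4*0.5 = 0.2
```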
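The seed-word clustering and the reciprocal merge might be sketched roughly as follows. The `cooccurrence_score` function stands in for deltaSC, and all corpus statistics and names here are hypothetical stand-ins rather than the paper's actual estimation procedure.

```python
def build_clusters(vocab, cooccurrence_score, p_w_given_s, p_w, gamma):
    """For each seed word s, cluster the words w whose co-occurrence score
    with s exceeds gamma and for which P(w|s) > P(w).  Both functions are
    assumed to be estimated once from a large corpus."""
    clusters = {}
    for s in vocab:
        members = {w for w in vocab
                   if w != s
                   and cooccurrence_score(s, w) > gamma
                   and p_w_given_s(w, s) > p_w(w)}
        clusters[s] = members | {s}
    return clusters

def merge_reciprocal(clusters):
    """Merge clusters ki and kj when the seed word of ki is contained in kj's
    word set and vice versa, and only if the two word sets differ."""
    merged = dict(clusters)
    seeds = list(clusters)
    for a, si in enumerate(seeds):
        for sj in seeds[a + 1:]:
            if si in merged[sj] and sj in merged[si] and merged[si] != merged[sj]:
                union = merged[si] | merged[sj]
                merged[si] = union
                merged[sj] = union
    return merged

# Toy usage with made-up corpus statistics (purely illustrative).
pairs = {frozenset({"bank", "money"}), frozenset({"money", "cash"})}
vocab = ["bank", "money", "cash", "river"]
score = lambda s, w: 0.9 if frozenset({s, w}) in pairs else 0.1
p_ws = lambda w, s: 0.3 if frozenset({s, w}) in pairs else 0.01
p = lambda w: 0.05
print(merge_reciprocal(build_clusters(vocab, score, p_ws, p, gamma=0.5)))
```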
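One plausible reading of the segmentation rule (a boundary at a local minimum i whose neighbouring peaks both exceed S(i) by more than theta) is sketched below; the climb-to-the-nearest-peak interpretation and the similarity values are assumptions made for illustration.

```python
def segment_boundaries(sim, theta):
    """Place a boundary at each local minimum i of the similarity sequence
    when the peaks P1 (left) and P2 (right) reached by climbing away from i
    both exceed sim[i] by more than theta."""
    boundaries = []
    for i in range(1, len(sim) - 1):
        if not (sim[i] <= sim[i - 1] and sim[i] <= sim[i + 1]):
            continue                              # not a local minimum
        # climb left while the similarity keeps rising
        j = i
        while j > 0 and sim[j - 1] >= sim[j]:
            j -= 1
        p1 = sim[j]
        # climb right while the similarity keeps rising
        j = i
        while j < len(sim) - 1 and sim[j + 1] >= sim[j]:
            j += 1
        p2 = sim[j]
        if p1 - sim[i] > theta and p2 - sim[i] > theta:
            boundaries.append(i)
    return boundaries

# Similarities between consecutive candidates (made-up numbers)
sims = [0.80, 0.75, 0.30, 0.72, 0.78, 0.74, 0.70, 0.25, 0.65]
print(segment_boundaries(sims, theta=0.3))   # -> [2, 7]
```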