Tuesday, April 30, 2013

an easier approach to topic analysis

A quick look at how to implement Hang Li and Kenji Yamanishi's topic analysis and segmentation approach.
The steps involved are:
1) topic spotting
2) text segmentation
3) topic identification
The input is a given text and the output is a topic structure based on stochastic topic models (STMs).
In step 1, we select keywords from the text based on the Shannon information of each word:
I(w) = -N(w) log P(w), where N(w) denotes the frequency of w in the text t and P(w) the probability of occurrence of w as estimated from corpus data.
I(w) is therefore the amount of information carried by w in the text.
P(w) can be evaluated as follows, for example with an NLTK Naive Bayes classifier:

import nltk

# word_features(w) maps a word to a feature dict; words is a list of (word, label) pairs
featuresets = [(word_features(n), g) for (n, g) in words]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.classify(word_features(word))  # label for a given word
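Alternatively, as a rough sketch of my own (not code from the post or the paper), P(w) can be taken as a simple unigram relative frequency over a reference corpus and the keywords ranked directly by I(w); the Brown corpus and the top_k cutoff below are assumptions for illustration:

import math
from collections import Counter
from nltk.corpus import brown  # reference corpus for P(w); requires nltk.download('brown')

def select_keywords(text_tokens, top_k=10):
    # P(w): relative frequency of w in the reference corpus
    corpus_counts = Counter(w.lower() for w in brown.words())
    corpus_total = sum(corpus_counts.values())
    # N(w): frequency of w in the input text t
    text_counts = Counter(w.lower() for w in text_tokens)
    # I(w) = -N(w) log P(w); unseen words get a count of 1 as crude smoothing
    scores = {w: -n * math.log(corpus_counts.get(w, 1) / corpus_total)
              for w, n in text_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

The highest-scoring words become the keyword candidates used for topic spotting.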
Clustering helps with topic spotting: each selected keyword is initially treated as its own cluster, and clusters are then merged bottom-up (agglomeratively), as sketched below.
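A minimal sketch of that bottom-up merging, assuming a caller-supplied similarity function (the post doesn't specify one) and a fixed number of clusters:

def cluster_keywords(keywords, similarity, num_clusters=3):
    # each keyword starts as its own cluster
    clusters = [{w} for w in keywords]
    while len(clusters) > num_clusters:
        # find the most similar pair of clusters and merge them
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

For example, similarity(a, b) could count how often words from the two clusters co-occur in the same sentence of the input text.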
Text segmentation (step 2) is carried out independently of the clustering step.
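The post doesn't spell this step out; as a generic illustration only (a TextTiling-style heuristic rather than the STM-based method of the paper), segmentation can place a boundary wherever the lexical similarity between adjacent word windows drops below a threshold:

import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two word-count vectors
    num = sum(count * b[w] for w, count in a.items() if w in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def segment(tokens, window=50, threshold=0.1):
    # start a new segment wherever adjacent windows look lexically dissimilar
    boundaries = [0]
    for i in range(window, len(tokens) - window, window):
        left, right = Counter(tokens[i - window:i]), Counter(tokens[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return [tokens[s:e] for s, e in zip(boundaries, boundaries[1:] + [len(tokens)])]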
