I want to discuss how the divergence between P and Q varies with the sample sizes used for P and Q, even though the event space is the same for both. The sample for P is negligible in size compared to the one for Q, so beyond a certain size Q yields roughly the same KLD no matter how much larger it gets. This means that we just have to choose a collection that contains representations of the terms we are interested in.
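To make the stabilization concrete, here is a minimal sketch of KLD(P || Q) computed from token counts over a shared vocabulary. The add-one smoothing, the function name kl_divergence, and the toy token lists are my own assumptions for illustration. Growing Q is imitated by simply repeating a small background sample, which is only a stand-in for a genuinely larger collection.

```python
import math
from collections import Counter

def kl_divergence(p_tokens, q_tokens, smoothing=1.0):
    """KL(P || Q) over the union vocabulary; add-k smoothing is an assumption."""
    vocab = set(p_tokens) | set(q_tokens)
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total = len(p_tokens) + smoothing * len(vocab)
    q_total = len(q_tokens) + smoothing * len(vocab)
    kld = 0.0
    for term in vocab:
        p = (p_counts[term] + smoothing) / p_total
        q = (q_counts[term] + smoothing) / q_total
        kld += p * math.log(p / q)
    return kld

# Hypothetical toy data: a small document (P) and a background sample (Q)
# that contains every term of the document.
document = ["kernel", "kernel", "memory", "driver"]
background = ["kernel", "memory", "driver", "network",
              "disk", "kernel", "file", "network"]

# Repeating the background imitates a larger Q with the same term proportions.
for n in (1, 10, 100, 1000):
    print(n, kl_divergence(document, background * n))
```

In this toy setup the value settles as Q grows, because the smoothed estimate of Q approaches the term proportions of the background. Note that this only holds when the background actually contains the document's terms, which is the point about choosing the collection.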
Today I want to discuss a slightly different take on text categorization. This was presented in the paper "Applying the multiple cause mixture model to text categorization" by Sahami, Hearst and Saund. Here they talk about a novel approach using unsupervised learning that does not require a pre-labeled training data set. In other words, it discovers the categories from the samples themselves. I'm looking to see if we can consider such an approach for keyword extraction, where we would not have to come up with a pre-determined vocabulary set. Supervised algorithms require a training set of pre-labeled documents, and there have been studies on intelligently reducing the size of this training set. This approach, however, automatically induces the multiple-category structure underlying an unlabeled document corpus. In addition, many algorithms assign one label per document, or else treat the classification task as a sequence of binary decisions, while this approach treats documents as members of multiple categories. We will see why this is important in just a bit.
Documents are represented by term vectors that have binary values. That is, the number of occurrences of a word does not matter after a certain point. A topic is associated with a substantial number of words, i.e. a cluster of words is indicative of a topic. If there are several topics in the document, the overall topic will be the union of all the clusters of words. If a word has two or more meanings and appears in different clusters, that word would likely appear more often than any of the individual topics. Taken as two layers, one of clusters and one of word nodes, any activity in the cluster layer will cause activity in the nodes such that the latter reflects how often the word appears in the document. The idea is to tell the clusters apart until we have probabilities for these words.
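As a small illustration of the representation described above, here is a sketch of a binary term vector and of taking the union of word clusters for a document with several topics. The vocabulary, the clusters, and the helper names are hypothetical and are not taken from the paper.

```python
def binary_term_vector(tokens, vocabulary):
    """1 if the term occurs at least once in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def union_of_clusters(active_topics, clusters):
    """Words expected in a document whose topics are all in active_topics."""
    words = set()
    for topic in active_topics:
        words |= clusters[topic]
    return words

# Hypothetical vocabulary and topic clusters, for illustration only.
vocabulary = ["bank", "loan", "river", "water", "interest"]
clusters = {
    "finance": {"bank", "loan", "interest"},
    "geography": {"bank", "river", "water"},
}

tokens = ["bank", "loan", "loan", "bank", "interest"]
print(binary_term_vector(tokens, vocabulary))
print(sorted(union_of_clusters(["finance", "geography"], clusters)))
```

Here "bank" sits in both clusters, so it is expected whenever either topic is active, which is the sense in which an ambiguous word shows up more than any single topic does.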
The multiple cause mixture model (MCMM) tries to discover these hidden clusters by looking for patterns in high-dimensional data. It differs from other models by permitting clusters not only to compete for data points but also to cooperate in accounting for the observed data.
There can be many different functions that map the activities m in the cluster layer, through the weights c, to the predicted activities r of the word nodes. One example is the soft disjunction, which is represented by the equation $r_j = 1 - \prod_k (1 - m_k c_{jk})$, where $r_j$ and $c_{jk}$ are both between 0 and 1.
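The soft disjunction can be sketched in a few lines. In this sketch m is a list of cluster activities, c is a list of rows with c[j][k] the weight from cluster k to word j, and the returned list holds one r value per word. The function name and the example numbers are assumptions made for illustration.

```python
def soft_disjunction(m, c):
    """Compute r[j] = 1 - prod_k (1 - m[k] * c[j][k]) for every word j."""
    r = []
    for weights_j in c:
        product = 1.0
        for m_k, c_jk in zip(m, weights_j):
            product *= 1.0 - m_k * c_jk
        r.append(1.0 - product)
    return r

# Hypothetical weights: three words, two clusters.
c = [
    [0.9, 0.0],  # word 0 is explained only by cluster 0
    [0.0, 0.8],  # word 1 is explained only by cluster 1
    [0.5, 0.5],  # word 2 is shared by both clusters
]

print(soft_disjunction([1.0, 0.0], c))  # only cluster 0 active
print(soft_disjunction([1.0, 1.0], c))  # both clusters active
```

When both clusters are active, the shared word gets a higher value than either cluster would give it alone, which is how clusters cooperate in accounting for the data rather than only competing.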
 