Wednesday, November 20, 2013

In the previous post, I talked about MCMM, Zipf's law, and the assumptions the model makes.
In Zipf's law, a word's frequency is inversely proportional to its rank in the frequency table. In this model, we find a threshold for the number of times a word must occur before we mark it as present in the binary document vector. Zipf curves are drawn from each document's word frequencies to find this threshold per document. A longer document will have a larger threshold, which eliminates spurious word occurrences from its document vector. In support of the binary representation of terms in the document vector, other studies have shown that there is very little difference in results between binary and weighted terms, and that categorization can be achieved without term weights.
Another assumption is independence between the components of the input vector (i.e., the words). There are no hidden units in the input.
Because MCMM has a high computational cost, its viability was tested under different settings. For example, the model was run with reduced dimensionality, and Zipf's law was used to eliminate words with very few or very many occurrences across the documents in the corpus.
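As a minimal sketch of this preprocessing on a toy corpus (the min_df/max_df_frac cutoffs and the fixed per-document threshold are illustrative assumptions, not the paper's exact choices):

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df_frac=0.5):
    """Keep words that occur in neither too few nor too many documents (Zipf-style pruning)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)
    return {w for w, c in df.items() if min_df <= c <= max_df_frac * n_docs}

def binarize(doc, vocab, threshold=1):
    """Binary document vector: 1 if the word occurs at least `threshold` times.
    A longer document could use a larger threshold derived from its Zipf curve."""
    tf = Counter(w for w in doc if w in vocab)
    return {w: int(tf.get(w, 0) >= threshold) for w in sorted(vocab)}

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["apple", "cherry", "cherry"]]
vocab = prune_vocabulary(docs, min_df=2, max_df_frac=1.0)
print(binarize(docs[0], vocab))
```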

Tuesday, November 19, 2013

We will continue the discussion on the Multiple Cause Mixture Model (MCMM). We looked at the global objective function as the aggregation of all the local objective functions pertaining to individual documents. We saw that the maximization of the objective function is performed in two steps:
1) First, the cluster centroids c are held fixed and the activation values m_ik are adjusted to a local maximum.
2) Second, the cluster activations m_ik are held fixed and the centroid components c_jk are adjusted to a local maximum.
These steps are repeated until the objective function cannot be maximized further.
When the centers stabilize, one of the cluster centers is split into two  and the steps are repeated.
This increase in the number of cluster centers continues until the addition of a cluster center does not increase the objective function.
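A compact, runnable sketch of this alternating scheme on a toy binary term-document matrix, using the soft-disjunction mapping discussed below. The numerical-gradient ascent, the splitting rule, and the stopping tolerance are simplified assumptions of mine, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(m, c):
    # Soft disjunction: r[i, j] = 1 - prod_k (1 - m[i, k] * c[j, k])
    return 1.0 - np.prod(1.0 - m[:, None, :] * c[None, :, :], axis=2)

def objective(d, m, c):
    r = predict(m, c)
    p = d * r + (1 - d) * (1 - r)
    return np.log(np.clip(p, 1e-10, 1.0)).sum()

def ascend(d, m, c, update_m, steps=50, lr=0.05, eps=1e-4):
    """One half-step: climb the objective in m (or in c) while the other is held fixed."""
    for _ in range(steps):
        x = m if update_m else c
        grad = np.zeros_like(x)
        for idx in np.ndindex(x.shape):           # finite-difference gradient (toy sizes only)
            x[idx] += eps
            hi = objective(d, m, c)
            x[idx] -= 2 * eps
            lo = objective(d, m, c)
            x[idx] += eps
            grad[idx] = (hi - lo) / (2 * eps)
        x += lr * np.clip(grad, -10.0, 10.0)
        np.clip(x, 0.0, 1.0, out=x)
    return m, c

def train(d, max_clusters=4, tol=1e-3):
    n_docs, n_words = d.shape
    m = rng.uniform(0.2, 0.8, size=(n_docs, 1))   # start with a single cluster centroid
    c = rng.uniform(0.2, 0.8, size=(n_words, 1))
    best = -np.inf
    while True:
        for _ in range(10):                       # alternate the two half-steps until stable
            m, c = ascend(d, m, c, update_m=True)
            m, c = ascend(d, m, c, update_m=False)
        score = objective(d, m, c)
        if score - best < tol or c.shape[1] >= max_clusters:
            break                                 # adding a cluster no longer helps (or cap reached)
        best = score
        k = rng.integers(c.shape[1])              # split one centroid into two perturbed copies
        c = np.clip(np.hstack([c, c[:, [k]] + rng.normal(0, 0.05, (n_words, 1))]), 0, 1)
        m = np.hstack([m, m[:, [k]].copy()])
    return m, c

d = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
m, c = train(d)
print(np.round(m, 2), np.round(c, 2), sep="\n")
```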
Note that the first step, finding the activity vectors m_i, can be offloaded, as in supervised learning where a teacher provides these activity vectors and the algorithm focuses only on finding the cluster centers.
We can represent this solution as a diagram with the cluster centers in a layer above the layer representing words. Activity m in the cluster-layer topic nodes flows top-down to cause activity in the word nodes r_j, which represent predictions for how likely each word is to appear in the document. So measurements from the observed word occurrences d_j flow up and the predictions flow down during the iterations.
The assumptions made by the model are as follows:
First, all input data is binary; for text categorization, each document is therefore a binary vector of word occurrences.
Second, terms are selected based on Zipf's law, which states that a word's frequency is inversely proportional to its rank in the frequency table.
In the Multiple Cause Mixture Model, we mentioned that there can be more than one mapping function between the activities m and the weights c. One example was the soft disjunction function, under which the likelihood of any given word appearing in a document only increases with activity in multiple topic nodes capturing the document's topical content. Here the flow is top-down, from activity in the cluster layer to the word nodes. The inverse, bottom-up flow can also be described. Any prediction vector r can be compared against the binary vector d representing the words present in a document through the objective function Sum_j [ log( d_j * r_j + (1 - d_j) * (1 - r_j) ) ]. Maximizing this function gives us a way to find the vector of cluster activities m that best explains the document. Now, given a corpus of documents indexed by i, the global objective is the aggregation of the individual objective functions. When the global objective is maximized, we arrive at a set of weights c reflecting clusters of words that co-occur in documents.
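As an illustration of the bottom-up direction: given fixed weights c, the cluster activities m for a single observed document can be found by maximizing the per-document objective above. Here is a small sketch using a bounded optimizer from SciPy; the weight matrix and the document are made-up toy values:

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(m, d, c):
    # r_j = 1 - prod_k (1 - m_k * c_jk); objective = Sum_j log(d_j r_j + (1 - d_j)(1 - r_j))
    r = 1.0 - np.prod(1.0 - m[None, :] * c, axis=1)
    p = d * r + (1 - d) * (1 - r)
    return -np.sum(np.log(np.clip(p, 1e-10, 1.0)))

# Hypothetical weights c for 5 words and 2 topic clusters, plus one observed binary document d.
c = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]])
d = np.array([1.0, 1.0, 0.0, 0.0, 1.0])

res = minimize(neg_objective, x0=np.full(2, 0.5), args=(d, c),
               bounds=[(0.0, 1.0)] * 2, method="L-BFGS-B")
print(np.round(res.x, 2))   # inferred cluster activities m for this document
```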
Training begins by initializing a single cluster centroid at a random point. For this initial centroid, and at later stages when there are more centroids, the maximization proceeds in two steps. First, the cluster centroids are fixed and a local maximum is found over the activations for all data points. In the second step, the cluster activation values are fixed at the values found in the first step, and gradient ascent is performed over the cluster centers. Both steps are repeated until the function cannot be maximized further.

Monday, November 18, 2013


I want to discuss how the divergence between P and Q varies with the sample sizes behind P and Q, even though the event space is the same for both. The sample size for P is negligible compared to Q, so beyond a certain size, a larger Q yields roughly the same KLD. This means we just have to choose a background collection large enough to represent the terms we are interested in.
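A quick way to see this stabilization with synthetic data; the vocabulary size, sample sizes, and distributions below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size = 1000
background = rng.dirichlet(np.ones(vocab_size))   # synthetic "true" background word distribution

def empirical(dist, n_tokens, eps=1e-9):
    counts = rng.multinomial(n_tokens, dist) + eps
    return counts / counts.sum()

def kld(p, q):
    return float(np.sum(p * np.log(p / q)))

p = empirical(background, 200)                    # small, fixed sample P
for n in (1_000, 10_000, 100_000, 1_000_000):     # ever larger samples for Q
    print(n, round(kld(p, empirical(background, n)), 4))
```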
Today I want to discuss a slightly different take on text categorization. This was presented in the paper "Applying the multiple cause mixture model to text categorization" by Sahami, Hearst and Saund. They describe a novel unsupervised approach that does not require a pre-labeled training set; in other words, it discovers structure from the samples themselves. I'm looking to see whether we can consider such an approach for keyword extraction, where we don't have to come up with a pre-determined vocabulary set. Supervised algorithms require a training set of pre-labeled documents, and there have been studies on intelligently reducing the size of this training set. This approach, however, automatically induces the multiple-category structure underlying an unlabeled document corpus. In addition, many algorithms assign one label per document, or else treat the classification task as a sequence of binary decisions, whereas this approach treats documents as members of multiple categories. We will see why this is important in just a bit.
Documents are represented by term vectors with binary values; that is, the number of occurrences of a word does not matter beyond a certain point. A topic is associated with a substantial number of words, i.e., a cluster of words is indicative of a topic. If there are several topics in a document, the document's overall content is the union of those clusters of words. If a word has two or more meanings and appears in different clusters, it will contribute to more than one topic. Taken as layers, activity in the cluster layer causes activity in the word nodes, where the latter reflects how likely the word is to appear in the document. The idea is to tease apart the clusters until we have such probabilities for the words.
MCMM tries to discover these hidden clusters by looking for patterns in high-dimensional data. It differs from other models by permitting clusters not only to compete for data points but also to cooperate in accounting for the observed data.
There can be many different functions that map the activities m in the cluster layer, through the weights c, to predictions for a document. One example is the soft disjunction, represented by the equation r_j = 1 - PROD_k (1 - m_k * c_jk), where r_j and c_jk are both between 0 and 1.
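A tiny sketch of this mapping with made-up numbers, where the rows of c are words and the columns are clusters:

```python
import numpy as np

def soft_disjunction(m, c):
    """r_j = 1 - prod_k (1 - m_k * c_jk), with all of m_k, c_jk, r_j in [0, 1]."""
    return 1.0 - np.prod(1.0 - m[None, :] * c, axis=1)

m = np.array([0.9, 0.2])                        # activities of two topic clusters (made-up values)
c = np.array([[0.8, 0.1],                       # per-word weights c_jk for two words (made-up)
              [0.3, 0.7]])
print(np.round(soft_disjunction(m, c), 3))      # predicted likelihoods r_j of each word appearing
```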
Wartena mentioned in his paper that keyword selection was based on filtering out general terms, done by requiring that a keyword be sufficiently different from the background distribution. The technique is to use the Kullback-Leibler divergence to check whether a term is different enough, with a cutoff of the form KLD(p_tk', q) > 1. In the Kullback-Leibler measure, we normalized the distances against an empty document whose probability distribution consists entirely of epsilon values. We measured P(tk|q) as tf(tk, q) / Sum over all terms x in q of tf(tx, q). When we take the KLD for a term in a document containing just that term and weigh it against the KLD for the same term in the distribution Q, we find that some terms have higher ratios. We use this with a threshold to select the terms that are not too frequent and not overly general. The threshold lets us reduce the number of terms in Q, which may be quite large, to a few that have higher discriminating power. Choosing the terms this way lets us come up with representative keywords for use with any sample document.
From all the words in a sample document, each term can be weighted this way. This gives us a keyword selection technique. The number of terms to be extracted can be varied by adjusting the threshold on a sliding scale.
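A small sketch of this style of weighting, scoring each term of a sample document by its contribution to the KLD against a background collection Q and keeping the terms above a sliding threshold; the document, background text, and cutoff here are made-up toys:

```python
from collections import Counter
import math

def term_probs(tokens):
    tf = Counter(tokens)
    total = sum(tf.values())
    return {t: n / total for t, n in tf.items()}

def keyword_scores(doc_tokens, background_tokens, eps=1e-9):
    """Score each document term by its contribution to KLD(P_doc || Q_background)."""
    p = term_probs(doc_tokens)
    q = term_probs(background_tokens)
    return {t: p_t * math.log(p_t / q.get(t, eps)) for t, p_t in p.items()}

doc = "solar panels convert sunlight into electricity using photovoltaic cells".split()
background = ("the of and a to in is it for on into using with that this "
              "the of and a to in energy power cells").split()
scores = keyword_scores(doc, background)
threshold = 0.3      # illustrative cutoff; slide it to extract more or fewer terms
print(sorted(t for t, s in scores.items() if s > threshold))
```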
 By increasing the size of the distribution Q or choosing a category that is similar to the sample document, we could further improve the selection technique.
Thereafter, we could use the document categories and sample document clustering for simultaneous feature extraction and document categorization.

Sunday, November 17, 2013

Today I wrote a document instead of my post. But I did follow up on my suggestions to use KLD starting from an empty document and adding terms to differentiate from the background based on a threshold.

Saturday, November 16, 2013

In the previous post, we discussed using KLD for narrow-domain short texts. Here we will cover the feature selection techniques used in that discussion. The first technique is Document Frequency (DF), which assigns to each term t the value df_t, the number of texts in the collection in which t occurs. The second technique is term strength, where weights are assigned to each term t based on its occurrences in pairs of similar texts. The third is the transition point, which is found from the vocabulary frequencies of each text.
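As a side note, one common way to compute the transition point follows Booth's law, TP = (sqrt(8*I1 + 1) - 1) / 2, where I1 is the number of words occurring exactly once; terms whose frequency falls near TP are then taken as the most content-bearing. The formula and the toy text below are assumptions on my part, since the post does not spell them out:

```python
from collections import Counter
import math

def transition_point(tokens):
    """Transition point via Booth's law: TP = (sqrt(8 * I1 + 1) - 1) / 2,
    where I1 is the number of words occurring exactly once in the text."""
    freq = Counter(tokens)
    i1 = sum(1 for n in freq.values() if n == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2

def terms_near_tp(tokens, width=1.0):
    """Keep terms whose frequency lies within `width` of the transition point."""
    freq = Counter(tokens)
    tp = transition_point(tokens)
    return sorted(t for t, n in freq.items() if abs(n - tp) <= width)

text = ("the the the the the the the the cell cell cell membrane membrane membrane "
        "protein protein ion channel gradient energy transport signal receptor "
        "binding enzyme pathway").split()
print(transition_point(text), terms_near_tp(text))
```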
Apart from these, there was a mention of extracting the keywords based on KLD itself, i.e., finding the terms that contribute most to the distance between the two probability distributions P and Q. We had proposed two approaches. The first was to select a few terms at a time and choose the ones that were consistently above a threshold. The second was to select one term at a time and add it to an initially empty document so that the KLD distance stays above a threshold.
The clustering techniques used by the authors include the agglomerative clustering proposed by Johnson.
This is described by Borgatti as follows:
Let there be N items to be clustered and an N * N distance matrix to begin with:
Assign each item to its own cluster so that we have N clusters, each containing just one item. The distances between the clusters equal the distances between the items they contain.
The closest (most similar) pair of clusters is found and merged into a single cluster, so that we now have one less cluster.
The distances between the new cluster and each of the old clusters are computed.
The above two steps of merging and computing distances are repeated until we are left with a single cluster.
Computing the distance between the new cluster and the old clusters can be done in one of the following three ways:
single link - here the distance between one cluster and another is the shortest distance from any member of one cluster to any member of the other cluster. This method is also called the connectedness or minimum method.
complete link - here the distance between one cluster and another is the longest distance from any member of one cluster to any member of the other cluster. This method is also called the diameter or maximum method.
average link - here the distance between one cluster and another is the average distance from any member of one cluster to any member of the other cluster.
Notice that in the N * N matrix the distances along the diagonal, between an element and itself, are zero. As elements are combined into clusters, the cluster names become combinations of the names of the elements they contain.
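A minimal sketch of this procedure using SciPy's hierarchical clustering, where the distance matrix is made up and 'single', 'complete', and 'average' correspond to the three linkage rules above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Made-up symmetric 4 x 4 distance matrix with zeros on the diagonal, as described above.
dist = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])

condensed = squareform(dist)                      # condensed form expected by linkage()
for method in ("single", "complete", "average"):  # the three ways of measuring inter-cluster distance
    merges = linkage(condensed, method=method)    # each row records one merge of two clusters
    print(method, fcluster(merges, t=2, criterion="maxclust"))
```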