Thursday, November 21, 2013

I came across the book Algorithms for Clustering Data by Jain and Dubes. I want to start posting from that book, but first I want to jump to a section towards the end of the book that talks about graph theory. As we may know, a graph has a finite set of vertices (nodes), which can represent the objects being clustered. A set of edges records the interactions between pairs of vertices. The vertices and edges are related by a function f that maps edges into unordered pairs of vertices; since the pairs are unordered, we are discussing finite undirected graphs. Therefore a graph is a triple (V, E, f).
Some applications require that labels or weights be assigned to the vertices or to the edges.
As an example, the edges could be labeled by the distances between the patterns representing the objects. There are some common definitions as well. A subgraph is a subset of the vertices and edges of the original graph, containing no vertices or edges that are not in the original graph. A path in a graph is an alternating sequence of vertices and edges that contains no repeated edges and no repeated vertices, and for which edge ei is incident to vertices vi and vi+1 for i = 1 to n.
 A graph is connected if a path exists between any two vertices in the graph.
A component is a maximal connected subgraph. Maximal means it contains as many vertices as possible, so a component is not a proper subgraph of any other connected subgraph of the graph. A graph is complete if an edge is assigned to every possible pair of vertices. A complete graph on n vertices contains exactly n(n-1)/2 edges.
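To make the notion of a component concrete, here is a minimal sketch (my own, not from the book) that finds the connected components of an undirected graph stored as an adjacency list; the small graph at the end is a made-up example.

```python
from collections import deque

def connected_components(adjacency):
    """Return the components of an undirected graph as lists of vertices."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every vertex reachable from 'start';
        # that reachable set is exactly one (maximal) component.
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components

# Hypothetical graph with two components: {a, b, c} and {d, e}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
print(connected_components(graph))   # [['a', 'b', 'c'], ['d', 'e']]
```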
There are some interesting observations as well. A maximal complete subgraph of G is a complete subgraph of G that is not a proper subgraph of any other complete subgraph of G. In other words, H is a complete subgraph of G if an edge joins every pair of vertices in H, and it is maximal if no additional vertex of G can be added to H without destroying the completeness. A maximal complete subgraph is also called a clique.
A cycle is the same as a path except that vertices v1 and vn are the same vertex. A tree is a connected graph with no cycles; a tree on m vertices has m-1 edges. A spanning tree is a tree containing all the vertices in the graph. The weight of a tree is the sum of the weights of its edges, when the weights represent dissimilarities on a ratio scale. A minimal spanning tree (MST) is the spanning tree with the least weight among all spanning trees.
A number of algorithms are available for finding the minimum spanning tree. MSTs of complete graphs are important in cluster analysis.
Kruskal's algorithm is easy to try out by hand: rank the edges by weight and insert them in order of increasing dissimilarity, shortest first, skipping any edge that would create a cycle, until a spanning tree is formed.
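As a concrete illustration of this edge-by-edge procedure, here is a minimal Kruskal-style sketch (my own code, not from the book), using a union-find structure to skip edges that would close a cycle; the dissimilarities below are made up.

```python
def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm: edges = [(weight, u, v), ...] over vertices 0..n-1."""
    parent = list(range(n))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):     # consider edges shortest-first
        ru, rv = find(u), find(v)
        if ru != rv:                  # adding this edge does not create a cycle
            parent[ru] = rv
            tree.append((w, u, v))
        if len(tree) == n - 1:        # a spanning tree on n vertices has n-1 edges
            break
    return tree

# Hypothetical dissimilarities between 4 objects
edges = [(1.0, 0, 1), (2.5, 0, 2), (1.2, 1, 2), (3.0, 1, 3), (0.8, 2, 3)]
mst = minimum_spanning_tree(4, edges)
print(mst, "total weight:", sum(w for w, _, _ in mst))
```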
A threshold graph is one which has n labeled nodes and N unlabeled edges. A random graph is one in which the edges are inserted at random. A rank graph is a threshold graph in which the edges are labeled, i.e. ranked by order of insertion. The N edges can be chosen from the n(n-1)/2 possible edges without replacement and without regard to order, and multiplying the number of such choices by N! gives the number of rank graphs.
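A quick way to sanity-check these counts (a small sketch of my own, not from the book) is to compute them directly for small n and N:

```python
from math import comb, factorial

def graph_counts(n, N):
    """Counts of threshold graphs and rank graphs on n labeled nodes and N edges."""
    possible = n * (n - 1) // 2        # number of possible edges on n nodes
    threshold = comb(possible, N)      # choose which N edges are present
    rank = threshold * factorial(N)    # additionally rank (order) the N chosen edges
    return threshold, rank

print(graph_counts(5, 3))   # 10 possible edges -> (120, 720)
```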
In the unsupervised setting, MCMM was compared to K-means clustering on the same DS2 dataset. Three of the four clusters align very well with the labels given by Reuters, achieving very high precision scores, which is especially notable given that label information was unavailable during training. The algorithm stops after it has added an additional cluster center that does not increase the gain function, which is why the first three clusters are more clearly delineated than the last one. The fourth cluster found by MCMM is a general subtopic within the dataset, almost like a miscellaneous group.
When compared with the earlier supervised run, where we found clusters that did not match the Reuters labels, a closer inspection showed that those clusters were well formed and that some documents had originally been mislabeled.
Thus MCMM appears to be a practical machine learning method for text categorization. The ability to run essentially the same algorithm in both supervised and unsupervised mode helps evaluate existing labels. There is also the flexibility to include documents in more than one category, along with a ranked list of the degree of inclusion in each cluster.

Wednesday, November 20, 2013

In the experimental results for MCMM, we find cluster coherency. The datasets used by MCMM involved the Reuters document collection. Two small subsets of documents were used: the first consisted of 983 documents with 9 labels and 372 dimensions, and the second consisted of 240 articles with 3 labels and 360 dimensions (words). This dataset was chosen so that MCMM could first be run on pre-labeled data before running in an unsupervised manner. There was a 70-30% split in each data set for training and test purposes. For comparison, a naive Bayes classifier was also run. With MCMM, the cluster activities produce a ranking for category assignment instead of a binary decision, which allows us to examine the tradeoff between precision and recall: if we increase the threshold on activity, precision rises but recall falls. It was found that precision and recall were high for all but three labels, which seemed more of an exception than the norm; the documents in the first set of the Reuters collection carrying those three labels had multiple labels assigned to them. However, it was also found that the naive Bayes classifier performed better, and its performance was compared against a much larger Reuters collection.
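To illustrate how a ranked assignment lets us trade precision against recall, here is a small hedged sketch (my own, not from the paper) that sweeps a threshold over hypothetical cluster activities; the activations and ground-truth labels below are invented.

```python
def precision_recall(activities, relevant, threshold):
    """activities: {doc_id: activation}; relevant: doc_ids truly in the category."""
    assigned = {d for d, a in activities.items() if a >= threshold}
    if not assigned:
        return 1.0, 0.0                       # convention when nothing is assigned
    tp = len(assigned & relevant)
    return tp / len(assigned), tp / len(relevant)

# Hypothetical activations and ground truth
activities = {"d1": 0.9, "d2": 0.8, "d3": 0.6, "d4": 0.4, "d5": 0.2}
relevant = {"d1", "d2", "d4"}
for t in (0.1, 0.5, 0.85):
    p, r = precision_recall(activities, relevant, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold shrinks the assigned set, which tends to raise precision and lower recall.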
MCMM was expected to have better performance in the unsupervised mode.
In the previous post, I talked about MCMM, Zipf's law and the assumptions made.
In Zipf's law, we relate the number of occurrences r of a word to the number of words that occur r times. In this model, we find a threshold for the number of times a word must occur before we mark the word as present in the binary document vector. Zipf curves are drawn for each document's word frequencies to find this threshold per document. A longer document will have a larger threshold, which eliminates spurious word occurrences from the document vector. As support for the binary representation of the terms in the document vector, other studies have shown that there is very little difference in results between binary and weighted terms, and that categorization can be achieved without term weights.
Another assumption made is independence between the components of the input vector (i.e. the words). There are no hidden units in the input.
To test the viability of MCMM, which has a high computational cost, it was tried out with different cost-reducing measures. For example, the model was made to work with reduced dimensions: Zipf's law was used to eliminate words with very few or very many occurrences in the documents of the corpus.
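A minimal sketch of that kind of frequency-based pruning (my own, with made-up cutoffs rather than the thresholds the paper derives from Zipf curves):

```python
from collections import Counter

def prune_vocabulary(documents, min_docs=2, max_frac=0.5):
    """Keep words appearing in at least min_docs documents but in no more than
    max_frac of all documents; very rare and very common words are dropped."""
    doc_freq = Counter(word for doc in documents for word in set(doc.split()))
    n_docs = len(documents)
    return {w for w, df in doc_freq.items()
            if df >= min_docs and df / n_docs <= max_frac}

docs = ["the oil prices rise", "the oil exports fall",
        "the grain exports rise", "the grain prices steady"]
print(prune_vocabulary(docs))   # drops rare words and the ubiquitous "the"
```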

Tuesday, November 19, 2013

We will continue the discussion of the Multiple Cause Mixture Model (MCMM). We looked at the global objective function as the aggregation of all the local objective functions pertaining to individual documents. We saw that the maximization of the objective function is performed in two steps:
1) First the cluster centroids c are fixed and the activity values mi,k are found for a local maximum.
2) Second, the cluster activation values mi,k are fixed and the centroid values cj,k are found for a local maximum.
These steps are repeated until the objective function cannot be maximized further.
When the centers stabilize, one of the cluster centers is split into two  and the steps are repeated.
This increase in the number of cluster centers continues until the addition of a cluster center does not increase the objective function.
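The overall control flow might look like the following skeleton (a sketch of my reading of the procedure; the inner maximizations are passed in as functions rather than spelled out with the paper's exact update rules):

```python
import random

def train_mcmm(documents, maximize_activities, maximize_centroids, objective, tol=1e-6):
    """Alternate the two maximization steps, split a centroid once the centers
    stabilize, and stop when an added center no longer increases the objective."""
    centroids = [[random.random() for _ in documents[0]]]   # start with one random center
    best = float("-inf")
    while True:
        prev = float("-inf")
        while True:                                          # two-step coordinate ascent
            activities = maximize_activities(documents, centroids)   # step 1: fix c, fit m
            centroids = maximize_centroids(documents, activities)    # step 2: fix m, fit c
            current = objective(documents, activities, centroids)
            if current - prev < tol:                         # centers have stabilized
                break
            prev = current
        if current <= best + tol:                            # extra center gave no gain
            return centroids, activities
        best = current
        split = list(centroids[-1])                          # split a center into two
        centroids.append([c + 0.01 * random.random() for c in split])
```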
Note that the first step, finding the activity vectors mi, can be offloaded, as in the case of supervised learning where a teacher provides these activity vectors and the algorithm focuses on finding the cluster centers.
We can represent this solution in a diagram with the cluster centers in a separate layer above the layer representing words. Activity m in the cluster-layer topic nodes flows top-down to cause activity in the word nodes r, which represent the predictions for how likely the words are to appear in the document. So measurements from the observed document dj flow up and the predictions flow down during the iterations.
The assumptions made by the model are as follows:
First, all input data is binary; for text categorization this means each document is represented by a binary term vector.
Second, terms are selected based on Zipf's law, which relates the number of occurrences r of a word to the number of words that occur r times, so that very rare and very common words can be dropped.
In the multiple cause mixture model, we mentioned there can be more than one mapping function between the activities m and the weights c. One example is the soft disjunction function, under which the likelihood of any given word appearing in a document only increases with the presence of activity in multiple topic nodes capturing the document's topical content. Here the flow is from activity in the cluster layer to the word nodes. The inverse flow can also be described: any prediction vector r can be compared with the binary vector d representing the words present in document dj using the objective function Sum-j[ log(dj*rj + (1 - dj)*(1 - rj)) ]. By seeking higher values of this function, we have a way to find the vector of cluster activities m that optimizes it. Now, given a corpus of documents indexed by i, the global objective is the aggregation of the individual objective functions. When the global objective is maximized, we arrive at a set of weights c reflecting clusters of words that co-occur in documents.
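Under the binary representation, the local objective for a single document can be evaluated directly; here is a small sketch (my own notation, not code from the paper) of that log-likelihood:

```python
import math

def local_objective(d, r, eps=1e-12):
    """Sum over words j of log(d_j*r_j + (1 - d_j)*(1 - r_j)), where d is the binary
    word vector of the document and r the predicted word probabilities."""
    return sum(math.log(dj * rj + (1 - dj) * (1 - rj) + eps) for dj, rj in zip(d, r))

d = [1, 0, 1, 0]           # words present in the document (hypothetical)
r = [0.9, 0.2, 0.7, 0.1]   # predictions derived from the cluster activities
print(local_objective(d, r))
```

The global objective is then just the sum of these local values over all documents i in the corpus.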
Training begins by initializing a single cluster centroid at a random point. For this initial cluster centroid, and for later stages when there are more centroids, the maximization occurs in two steps. First the cluster centroids are fixed and a local maximum over the cluster activities is found for all data points. In the second step, the cluster activation values are fixed at the values found in the previous step and gradient ascent is performed over the cluster centers. Both steps are repeated until the function cannot be maximized further.

Monday, November 18, 2013


I want to discuss how the P versus Q divergence varies with the sample sizes of P and Q, even though the event space is the same for both. The sample size for P is negligible compared to that of Q, so beyond a certain size Q yields roughly the same KLD for any larger sample. This means that we just have to choose a collection that has representations involving the terms we are interested in.
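As a rough illustration of that stabilizing behavior (a sketch with synthetic word counts, not the actual collections), one can compute the KLD of a small-sample P against distributions Q estimated from increasingly large samples drawn over the same vocabulary:

```python
import math, random

def kld(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(P || Q) over a shared vocabulary."""
    return sum(p_w * math.log((p_w + eps) / (q.get(w, 0.0) + eps)) for w, p_w in p.items())

def word_distribution(tokens):
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return {w: c / len(tokens) for w, c in counts.items()}

random.seed(0)
vocab = ["oil", "trade", "grain", "price", "export"]          # toy vocabulary
weights = [0.4, 0.25, 0.15, 0.12, 0.08]                       # hypothetical true frequencies
p = word_distribution(random.choices(vocab, weights, k=200))  # small sample for P
for n in (1_000, 10_000, 100_000):                            # growing samples for Q
    q = word_distribution(random.choices(vocab, weights, k=n))
    print(n, round(kld(p, q), 4))                             # KLD settles as Q grows
```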
Today I want to discuss a slightly different take on text categorization. This was presented in the paper "Applying the multiple cause mixture model to text categorization" by Sahami, Hearst and Saund. There they describe a novel approach using unsupervised learning that does not require a pre-labeled training data set; in other words, it discovers category structure from the samples themselves. I'm looking to see if we can consider such an approach for keyword extraction, where we don't have to come up with a pre-determined vocabulary set. Supervised algorithms require a training set of pre-labeled documents, and there have been studies on intelligently reducing the size of this training set. However, this approach automatically induces the multiple category structure underlying an unlabeled document corpus. In addition, many algorithms assign one label per document, or else treat the classification task as a sequence of binary decisions, while this approach treats documents as members of multiple categories. We will see why this is important in just a bit.
Documents are represented by term vectors with binary values; that is, the number of occurrences of a word does not matter beyond a certain point. A topic is associated with a substantial number of words, i.e. a cluster of words is indicative of a topic. If there are several topics in a document, the overall word set will be the union of the clusters of words. If a word has two or more meanings and appears in different clusters, that word will likely be associated with more than one of the individual topics. Viewed as layers, activity in the cluster layer causes activity in the word nodes such that the latter reflect how likely each word is to appear in the document. The idea is to tell the clusters apart until we have probabilities for these words.
MCMM tries to discover these hidden clusters by looking for patterns in high-dimensional data. It differs from other models by permitting clusters not only to compete for data points but also to co-operate in accounting for the observed data.
There can be many different functions that map the activities m in the cluster layer, through the weights c, to the predictions r for the words in a document. One example is the soft disjunction, which is represented by the equation rj = 1 - PROD-k(1 - mk*cjk), where rj and cjk are both between 0 and 1.
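A small sketch of the soft disjunction itself (my own illustration of the formula above, with made-up numbers):

```python
def soft_disjunction(m, c_j):
    """Predicted probability r_j that word j appears, given cluster activities m and the
    word's weights c_j[k] for each cluster k: r_j = 1 - prod_k (1 - m_k * c_j[k])."""
    prob_absent = 1.0
    for m_k, c_jk in zip(m, c_j):
        prob_absent *= (1.0 - m_k * c_jk)
    return 1.0 - prob_absent

m = [0.8, 0.3]     # activities of two topic clusters (hypothetical)
c_j = [0.9, 0.5]   # how strongly each cluster predicts word j
print(soft_disjunction(m, c_j))   # ~0.76: activity in either cluster only raises r_j
```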