Sunday, November 24, 2013

I want to talk about using the K-means clustering algorithm to categorize documents based on the Kullback-Leibler (KL) distance. We have discussed finding the Kullback-Leibler distance between any two documents based on the probabilities of the terms from the vocabulary. Next we normalize the distance and use this KLD distance for clustering.
Today we talk about how to cluster these documents. Let us take an example with K-means clustering. We take a sample set of documents from three categories, say news, reviews and editorials. We initially set the centroids to be representatives of their clusters. These can be picked from any of the documents that represent their categories, and they should ideally be as far apart from each other in terms of their cluster memberships as possible. Then, for each of the documents, we find the KLD distance between the document and each centroid and assign the document to the nearest cluster. After a round of assignments, we update the centroids. We do this by finding the average of all normalized distances within the cluster and picking the new centroid as the member closest to that average. Once the centroids are updated, we assign the labels again and re-adjust the cluster centers. We repeat this process until the cluster centers stabilize.
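A minimal sketch of this loop in Python, assuming each document has already been reduced to a smoothed term-probability vector over a shared vocabulary. The names kl_distance and kmeans_cluster, the fixed iteration count, and the reading of "closest to the average" as the member nearest the cluster's mean distribution are all my own choices for illustration, not from any particular library:

import numpy as np

def kl_distance(p, q, eps=1e-12):
    # Kullback-Leibler divergence D(p || q), smoothed to avoid log(0) and division by zero
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def kmeans_cluster(docs, k, distance=kl_distance, iters=20):
    # docs: (n_docs, vocab_size) array of term-probability vectors, one row per document
    # pick k documents as the initial centroids (ideally representatives that are far apart)
    centroids = docs[np.random.choice(len(docs), k, replace=False)].copy()
    for _ in range(iters):
        # assignment step: each document goes to the cluster whose centroid is nearest
        dists = np.array([[distance(d, c) for c in centroids] for d in docs])
        labels = dists.argmin(axis=1)
        # update step: the new centroid is the member document closest to the cluster mean
        for j in range(k):
            members = docs[labels == j]
            if len(members):
                mean = members.mean(axis=0)
                centroids[j] = members[np.argmin([distance(m, mean) for m in members])]
    return labels, centroids

In practice one would also stop early once the assignments no longer change, which is what "until the cluster centers stabilize" amounts to.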
Clustering is independent of how the distance is calculated, so we can be as articulate about the distance as we want. In this case the distance was chosen to be KLD, but it could be cosine, Jensen-Shannon or another distance as well. These distance computations can also be materialized ahead of time if we know the vocabulary or features beforehand, say for things like keyword extraction; that is an optimization for known samples. In general, we plug a distance measure into each clustering run.
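Because the distance is just a function the clusterer calls, swapping it is a one-line change. A rough sketch of two alternatives, with Jensen-Shannon standing in for the "Jensen" distance mentioned above and reusing kl_distance from the previous sketch:

def jensen_shannon(p, q):
    # symmetric, bounded distance built from two KLD terms against the midpoint distribution
    m = 0.5 * (p + q)
    return 0.5 * kl_distance(p, m) + 0.5 * kl_distance(q, m)

def cosine_distance(p, q):
    # 1 - cosine similarity of the two term vectors
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# labels, centroids = kmeans_cluster(docs, k=3, distance=jensen_shannon)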
Similarly, we can vary the clustering algorithm itself among the choices we listed earlier; several were mentioned in previous posts. We can compare different clustering algorithms, such as K-means and agglomerative clustering, for the same distance. In the implementation, letting the clustering algorithm vary independently of the distance computation makes us more flexible and lets us compare the algorithms.
When we take sample data for the three categories, we will need to extract the vocabulary to use as features for the documents. This is possible with KLD as well. The documents can be selected to be representatives of their categories.
Initially, to test the clusterer, we could select a small sample set and pre-label it. Then we can look at the precision and recall. However, the sample sets we choose for unsupervised runs must be at least five to ten times the number of features.
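One way to score such a pre-labeled run is to map each cluster to the label it mostly contains and then compute per-category precision and recall. A sketch with hypothetical helper names:

from collections import Counter

def cluster_precision_recall(labels_true, labels_pred):
    # map each cluster id to the true label it mostly contains (its majority label)
    mapping = {}
    for c in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == c]
        mapping[c] = Counter(members).most_common(1)[0][0]
    # per-category precision and recall after relabeling each cluster
    scores = {}
    for cat in set(labels_true):
        tp = sum(1 for t, p in zip(labels_true, labels_pred) if mapping[p] == cat and t == cat)
        fp = sum(1 for t, p in zip(labels_true, labels_pred) if mapping[p] == cat and t != cat)
        fn = sum(1 for t, p in zip(labels_true, labels_pred) if mapping[p] != cat and t == cat)
        scores[cat] = (tp / (tp + fp) if tp + fp else 0.0,   # precision
                       tp / (tp + fn) if tp + fn else 0.0)   # recall
    return scores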

Saturday, November 23, 2013

The main advantage of the structural approach is that, in addition to classification, it provides a description of the pattern. It specifies how the given pattern can be constructed from primitives. This paradigm has been used in situations where the patterns have a definite structure which can be captured in terms of a set of rules. Here there is a correspondence between the structure of the pattern and the syntax of a language: the patterns belonging to a category are viewed as sentences belonging to a language. A large set of complex patterns can be described using a small number of primitives and grammatical rules. The grammar for each pattern class must be inferred from training samples.
Syntactic patterns have a great deal of intuitive appeal. However, implementation of this approach leads to many difficulties, such as segmenting noisy patterns to detect the primitives, and inferring grammars.

We talked about distributions. Let's look into this a little more closely. We mentioned that the Gaussian distribution leads to optimal rules. This is because the mathematical properties of this distribution are well understood, i.e. they are justified by the central limit theorem. Second, many applications find it easier to summarize data in the form of a mean vector and a covariance matrix. If the density function is unimodal, then this is simple and easy to apply. If X is a vector with d features, the density function f(x1, x2, ..., xd) is described in terms of the d x d covariance matrix, its determinant, mu (the d-vector of expected values) and x (a d-vector containing the variables x1, x2, ..., xd). The mean vector mu and the covariance matrix together define a Gaussian distribution. The parameters of a multivariate Gaussian density are estimated by sample means and sample covariances when fitting a Gaussian distribution to a set of patterns. The mean is defined through the expectation, mu_i = E(Xi). The population covariance between Xi and Xj is sigma_ij = E[(Xi - mu_i)(Xj - mu_j)]; the diagonal elements sigma_ii are the variances. This can also be written as sigma_ij = rho_ij * sigma_i * sigma_j, where rho_ij is the correlation coefficient between Xi and Xj and sigma_i, sigma_j are their standard deviations. If rho_ij is zero, then Xi and Xj are uncorrelated (linearly independent), and for jointly Gaussian variables this also implies statistical independence. Features are pictured as random variables and each pattern is a realization of these random variables.
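For patterns stored row-wise in a matrix, the sample estimates described above are one-liners with numpy. The data here is random and purely illustrative:

import numpy as np

X = np.random.randn(200, 3)            # 200 patterns, d = 3 features
mu = X.mean(axis=0)                    # sample mean vector, mu_i estimating E(Xi)
sigma = np.cov(X, rowvar=False)        # d x d sample covariance matrix
rho = np.corrcoef(X, rowvar=False)     # correlation coefficients rho_ij
# the diagonal of sigma holds the variances; rho_ij = sigma_ij / (sigma_i * sigma_j)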
 
In today's post we continue our discussion of the algorithms and the issues encountered. We list some of them here.
One-shot versus hierarchical tree classifier: In the former, the distinctions between all the classes are made in a single stage, which is especially helpful when the number of features is large. In the latter, the classification structure is a binary tree where the most obvious discriminations are made first and the subtler ones later. When the number of features per node is smaller than the total number of features, this is much faster.
Parametric versus non-parametric classification: Parametric techniques rely on knowing the forms of the class-conditional densities so that the decision rules can be written; if the class-conditional densities are Gaussian, for example, parametric techniques can be applied and the decision rules can be optimal or plug-in rules. However, non-parametric rules that are not based on assumed pattern class distributions can also be very helpful.
Dimensionality and sample size relationship: When the number of features for a vector is very large, the classifier designer has to choose a smaller feature set. This is often referred to as the curse of dimensionality and results in some loss of fidelity. It is overcome by choosing the number of training samples per class to be at least five to ten times the number of features.
Feature selection: Even among the set of possible features, the ones that are selected depend upon computational ease and cost considerations. Moreover, determining the optimal subset of m features from the d available features requires an exhaustive search over all possible subsets. Some intelligent search strategies to alleviate this have been proposed, and I've mentioned a few in my previous posts.
Error estimation: To know whether samples are correctly classified, we measure the classification error. We do this by splitting the available samples into a training set and a test set; the classifier is designed using the training set and then evaluated on the samples from the test set. Precision and recall are two metrics with which classifier performance is evaluated. Common splitting strategies are the hold-out method and the leave-one-out method, and the choice of split between training and test sets has been investigated.
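A minimal hold-out sketch along these lines; the classifier with fit and predict methods is a placeholder, not a specific library:

import random

def holdout_split(samples, labels, train_fraction=0.7, seed=0):
    # shuffle the indices once, then carve off a training set and a test set
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])

# X_train, y_train, X_test, y_test = holdout_split(samples, labels)
# classifier.fit(X_train, y_train); y_pred = classifier.predict(X_test)
# precision and recall are then computed per class from y_pred against y_test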
So far we have discussed the geometrical or statistical pattern recognition algorithms.
The second paradigm of pattern recognition algorithms is based on the structural or syntactic approach. When the number of features required to establish a decision boundary is very large, it is helpful to view such patterns as being composed of simple sub-patterns. A sub-pattern could itself be built from simpler parts with grammatical techniques.

Friday, November 22, 2013

The book I mentioned in my previous post also has a section on pattern recognition. Pattern recognition refers to the classification or description of objects or patterns. The patterns themselves can range from characters in an image of printed text to biological waveforms. Recognition involves identifying the patterns and assigning labels for categories. We start with a set of training patterns. The main difference between pattern recognition and cluster analysis is the role of pattern class labels: in pattern recognition, we use the labels to formulate decision rules; in cluster analysis, we use them only to verify the results. Pattern recognition requires this extrinsic information, whereas cluster analysis uses only the data.
There are two basic paradigms to classify a pattern into one of K different classes. The first is a geometric or statistical approach.  In this approach, a pattern is represented in terms of d features and the pattern features are as independent from one another as possible. Then given training patterns for each pattern class, the objective is to separate the patterns belonging to different classes.
In statistical pattern recognition, the features are assumed to have a probability density function that is conditioned on the pattern class. A pattern vector x belonging to class wj is a data point drawn from the conditional probability distribution P(x|wj), where j is one of the K different classes. Concepts from statistical decision theory and discriminant analysis are utilized to establish decision boundaries between the pattern classes. If the class-conditional densities are known, then Bayes decision theory gives the optimal decision rule. Since they are generally not known, a classifier is designed based on the nature of the information available. A classifier is trained with either supervised or unsupervised learning. In supervised learning, a label on each training pattern indicates the category to which it belongs; if the form of the class-conditional densities is known, we use parametric decision rules, otherwise non-parametric ones. In unsupervised learning, the training patterns are unlabeled and the density functions are estimated from the samples themselves; the categories may be known beforehand or they may be unknown.
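To make the parametric, supervised case concrete, here is a small sketch of a Bayes decision rule with Gaussian class-conditional densities whose parameters are estimated from labeled samples. The function names are illustrative, and scipy is assumed only for evaluating the Gaussian density:

import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classes(X, y):
    # estimate a mean vector, covariance matrix and prior P(c) for each class
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def bayes_classify(x, params):
    # assign x to the class with the largest posterior score p(x|c) * P(c)
    scores = {c: multivariate_normal.pdf(x, mean=m, cov=S) * prior
              for c, (m, S, prior) in params.items()}
    return max(scores, key=scores.get)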
When the number of pattern classes is unknown, cluster analysis tries to find natural groupings in the data. In the tree representing these dichotomies in statistical pattern recognition, the problems get more difficult as we traverse from top to bottom and from left to right in the diagram below.
Prior Information
    Complete -> Bayes decision theory
    Incomplete
        Supervised
            Parametric
                Optimal rules
                Plug-in rules
            Non-parametric
                Density estimation
                Geometric rules
        Unsupervised
            Categories known -> Mixture resolving
            Categories unknown -> Cluster analysis

Thursday, November 21, 2013

I came across the book Algorithms for Clustering Data by Jain and Dubes. I want to start posting from that book. But first I want to jump to a section towards the end of the book which talks about graph theory. As we may know, a graph has a finite set of vertices, and these can represent the objects being clustered. A set of edges records the interactions between pairs of vertices. The vertices and edges are related by a function f that maps edges into unordered pairs of vertices. We are discussing finite undirected graphs when we say unordered pairs of vertices. A graph is therefore a triple (V, E, f).
Some applications require that labels or weights be assigned to the vertices or to the edges.
As an example, the edges could be labeled by the distance between the patterns representing the objects. There are some common definitions as well. A subgraph is a subset of the vertices and edges of the original graph, containing no edges or vertices that are not in the original graph. A path in a graph is an alternating sequence of vertices and edges that contains no repeated edges and no repeated vertices, and for which edge ei is incident to vertices vi and vi+1 for i = 1 to n.
 A graph is connected if a path exists between any two vertices in the graph.
A component is a maximal connected piece of a graph, where maximal means containing as many vertices as possible; a component is therefore not a proper subgraph of any other connected subgraph. A graph is complete if an edge is assigned to every possible pair of vertices. A complete graph on n vertices contains exactly n(n-1)/2 edges.
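These definitions map directly onto code. A short sketch that finds the connected components of an undirected graph given as a vertex list and an edge list; the helper names are mine, not from the book:

def connected_components(vertices, edges):
    # search outward from each unvisited vertex; each search yields one component
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components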
There are some interesting observations as well. A maximal complete subgraph of G is a complete subgraph of G that is not a proper subgraph of any other complete subgraph of G. In other words, H is a maximal complete subgraph of G if an edge joins every pair of vertices in H and no additional vertex of G can be added to H without destroying the completeness. A maximal complete subgraph is also called a clique.
A cycle is the same as a path except that vertices v1 and vn are the same vertex. A tree is a connected graph with no cycles; a tree on m vertices has m-1 edges. A spanning tree is a tree containing all the vertices in the graph. The weight of a tree is the sum of the weights of its edges, when the weights represent dissimilarity on a ratio scale. A minimal spanning tree (MST) is the spanning tree with the least weight among all spanning trees.
A number of algorithms are available for finding the minimum spanning tree. MSTs of complete graphs are important in cluster analysis.
Such a tree is easy to build by hand with a greedy edge-insertion procedure (Kruskal's algorithm): rank the edges by their weights and insert them in order of increasing dissimilarity, shortest first, skipping any edge that would form a cycle, until a spanning tree is formed. (Prim's algorithm, by contrast, grows the tree from a single vertex, always adding the cheapest edge that connects the tree to a new vertex.)
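A by-hand version of that greedy edge insertion in Python, using a small union-find structure to reject edges that would close a cycle; the weighted edge-list representation is an assumption for the sketch:

def minimum_spanning_tree(vertices, weighted_edges):
    # weighted_edges: list of (weight, u, v) tuples; greedy edge insertion (Kruskal)
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving keeps the forest shallow
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(weighted_edges):  # shortest (least dissimilar) edges first
        ru, rv = find(u), find(v)
        if ru != rv:                        # skip edges that would form a cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree                             # n - 1 edges for a connected graph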
A threshold graph is one which has n labeled nodes and N unlabeled edges. A random graph is one in which the N edges are inserted at random. A rank graph is a threshold graph in which the edges are ranked (labeled). The N edges can be chosen from the n(n-1)/2 possible edges without replacement or ordering, and multiplying the number of such choices by the N! possible rankings gives the number of rank graphs.
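Those counts follow directly from the combinatorics; a small illustrative computation:

from math import comb, factorial

def num_threshold_graphs(n, N):
    # choose N of the n(n-1)/2 possible edges, unordered, without replacement
    return comb(n * (n - 1) // 2, N)

def num_rank_graphs(n, N):
    # each choice of edges can be ranked in N! different orders
    return num_threshold_graphs(n, N) * factorial(N)

# e.g. n = 5 nodes, N = 3 edges: comb(10, 3) = 120 threshold graphs, 120 * 3! = 720 rank graphs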
In the unsupervised category, MCMM was compared to K-means clustering on the same DS2 dataset. Three of the four clusters align very well with the labels given by Reuters, achieving very high precision scores, which is especially notable considering label information was unavailable during training. The algorithm stops after it has added an additional cluster center that does not increase the gain function; that is why the first three clusters are more clearly delineated than the last one. The fourth cluster found by MCMM is a general subtopic within the dataset, almost like a miscellaneous group.
When compared with the earlier supervised run, where we found clusters that did not match the Reuters labels, a closer inspection showed that those clusters were well formed and that some documents had originally been mislabeled.
Thus MCMM appears to be a practical machine learning method for text categorization. The ability to run essentially the same algorithm in both supervised and unsupervised mode helps evaluate existing labels. And there is flexibility to include documents in more than one category, along with a ranked list of the degree of inclusion in each cluster.

Wednesday, November 20, 2013

In the experimental results for MCMM, we find cluster coherency. The datasets used involved the Reuters document collection. Two small subsets of documents were used: the first consisted of 983 documents with 9 labels and 372 dimensions, and the second of 240 articles with 3 labels and 360 dimensions (words). These datasets were chosen so that MCMM could first be run on pre-labeled data before running in an unsupervised manner. There was a 70-30% split in each dataset for training and test purposes. For comparison, a naive Bayes classifier was also run. With MCMM, the cluster activities produce a ranking for category assignment instead of a binary decision, which allows us to examine the tradeoff between precision and recall: if we increase the threshold for activity, we get more precision but less recall (a small sketch of this thresholding follows below). It was found that precision and recall were high for all but three labels, which seemed more of an exception than the norm; the documents from the first set in the Reuters collection carrying those three labels had multiple labels assigned to them. However, it was also found that the naive Bayes classifier performed better, and its performance was measured against a much larger Reuters collection.
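A rough sketch of that threshold sweep for a single category, given per-document activity scores and the set of documents truly in the category; the names are illustrative:

def precision_recall_at_threshold(activities, relevant, threshold):
    # activities: dict of document id -> cluster activity score for one category
    # relevant: set of document ids truly belonging to that category
    predicted = {doc for doc, a in activities.items() if a >= threshold}
    tp = len(predicted & relevant)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# sweeping the threshold upward trades recall for precision:
# curve = [precision_recall_at_threshold(acts, rel, t) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]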
MCMM was expected to have better performance in the unsupervised mode.