Sunday, November 24, 2013

Johnson's hierarchical clustering scheme was first written up in the summer of 1965, while he was working at Bell Labs during a break from his PhD. It introduced cluster hierarchies and came with a program written in FORTRAN. Agglomerative hierarchical clustering places each object into its own cluster and gradually merges these atomic clusters into larger and larger clusters until only one remains. Divisive hierarchical clustering reverses this process by starting with all objects in one cluster and then subdividing it into smaller clusters. Hierarchical classification can be viewed as a special sequence of partitional classifications.
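As a rough illustration of the agglomerative flavor (not Johnson's original FORTRAN program), here is a minimal sketch using SciPy; the toy points and the average-linkage choice are just assumptions for the example.

    # A minimal sketch of agglomerative hierarchical clustering with SciPy.
    # The toy vectors and the "average" linkage choice are illustrative assumptions.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

    # Each point starts as its own cluster; linkage() repeatedly merges the
    # closest pair of clusters until only one cluster remains.
    merges = linkage(points, method="average", metric="euclidean")

    # Cutting the resulting hierarchy at a chosen level yields a flat partition.
    labels = fcluster(merges, t=2, criterion="maxclust")
    print(labels)   # e.g. [1 1 2 2]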
Serial clustering handles the patterns one by one, whereas simultaneous clustering works with the entire set of patterns at the same time. A clustering algorithm can be expressed mathematically either through graph theory or through matrix algebra. The former is used to describe notions such as connectedness and completeness, and the latter to describe criteria such as mean-square error.
For the purposes of clustering vectors based on KLD distances, we will use a simple partition scheme like the K-means we talked about. This is because we already know the source categories when we pick the sample documents. Besides, when we do keyword extraction, we will build a vocabulary from these documents. We may come across a large number of lemmas, and we planned on adding to an initially empty document only those terms that keep the KLD distance above a threshold, i.e. D(avg-pt || q) > 1.
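To make the thresholding concrete, here is a rough sketch; the term probabilities are made-up toy numbers, and scoring a term by its per-term contribution p * log(p / q) is an assumption about the intended scheme rather than a fixed recipe.

    # Threshold-based keyword selection using per-term KLD contributions.
    import math

    p = {"market": 0.12, "shares": 0.10, "the": 0.30, "editorial": 0.02}   # document model
    q = {"market": 0.01, "shares": 0.01, "the": 0.29, "editorial": 0.01}   # background model

    def kld_contribution(term):
        # A term's contribution to D(p || q), used here as a keyword score.
        return p[term] * math.log(p[term] / q[term], 2)

    threshold = 0.1   # would be tuned on the samples at hand
    keywords = [t for t in p if kld_contribution(t) > threshold]
    print(keywords)   # ['market', 'shares'] for these toy numbers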
In addition, we mentioned varying the threshold on a sliding scale to retrieve more and more of the terms as needed. We can set this based on the samples at hand so that it gives reasonable results. There is no single value of the threshold that is universally correct; however, larger values of the threshold are more likely to give discriminative keywords. This can be built into a visual tool that renders the keywords in a slightly bigger font than the rest of the text. This way the keywords attract attention in place alongside the rest of the text. Together with a control for the sliding scale, we can increase the number of terms that get the bigger font, making it easier for the reader to find the keywords and to see how their number changes. Building this visual tool should be fairly straightforward in that it renders the same text in different forms. The bigger font or magnifying-glass display for keywords can be enabled by setting the font and size markup to be different from the rest.
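A toy sketch of the rendering idea follows: emit the same text with keyword terms wrapped in a larger font. The HTML markup, the scale value, and the keyword list are illustrative assumptions, not a finished visual tool.

    def render(text, keywords, scale="150%"):
        # Wrap each keyword occurrence in a span with a bigger font size.
        words = []
        for w in text.split():
            if w.lower().strip(".,") in keywords:
                words.append('<span style="font-size:%s">%s</span>' % (scale, w))
            else:
                words.append(w)
        return " ".join(words)

    print(render("The market rallied as shares rose.", {"market", "shares"}))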
I want to talk about using the K-means clustering algorithm to categorize documents based on Kullback-Leibler (KLD) distance. We have discussed finding the Kullback-Leibler distance between any pair of documents based on the probabilities of the terms from the vocabulary. Next we normalize the distance and use this KLD distance for clustering.
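For reference, a minimal sketch of that distance computation might look as follows; the smoothing constant and the toy distributions are assumptions made only so the example runs.

    # KLD between two documents, given term probabilities over a shared vocabulary.
    import math

    def kl_divergence(p, q, vocabulary, eps=1e-9):
        # D(p || q) = sum over terms of p(t) * log(p(t) / q(t)); eps avoids log(0).
        total = 0.0
        for t in vocabulary:
            pt = p.get(t, 0.0) + eps
            qt = q.get(t, 0.0) + eps
            total += pt * math.log(pt / qt, 2)
        return total

    vocab = ["market", "shares", "review", "plot"]
    doc_a = {"market": 0.6, "shares": 0.4}
    doc_b = {"review": 0.5, "plot": 0.5}
    print(kl_divergence(doc_a, doc_b, vocab))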
Today we talk about how to cluster these documents. Let us take an example with K-means clustering. We take a sample set of documents from three categories, say news, reviews and editorials. We initially set the centroids to be representatives of their clusters. These we can pick from any of the documents that represent their categories, and they should ideally be as far apart from each other in terms of their cluster memberships as possible. Then for each of the documents, we find the KLD distance between the document and the centroids and assign it to the nearest cluster. After a round of assignments, we update the centroids. We do this by finding the average of all normalized distances within the cluster and picking the new centroid as the document closest to that average. Once the centroids are updated, we assign the labels again and re-adjust the cluster centers. We repeat this process until the cluster centers stabilize.
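A condensed sketch of this procedure is below, with the distance passed in as a function so KLD (or anything else) can be plugged in. Picking the new centroid as the member with the smallest total distance to the others is one reasonable reading of "closest to the average", and the fixed number of rounds stands in for the stabilization check; all names here are illustrative.

    def cluster(docs, centroids, distance, rounds=10):
        for _ in range(rounds):
            # Assignment step: each document goes to the nearest centroid.
            groups = {i: [] for i in range(len(centroids))}
            for d in docs:
                nearest = min(range(len(centroids)), key=lambda i: distance(d, centroids[i]))
                groups[nearest].append(d)
            # Update step: the new centroid is the member closest to the cluster average.
            for i, members in groups.items():
                if members:
                    centroids[i] = min(members,
                                       key=lambda m: sum(distance(m, o) for o in members))
        return groups, centroids

    # Toy usage with 1-D "documents" and a trivial distance, just to show the flow;
    # in our case docs would be term distributions and distance the normalized KLD.
    docs = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
    groups, centers = cluster(docs, centroids=[0.1, 5.0], distance=lambda a, b: abs(a - b))
    print(centers)   # settles on one representative document from each group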
Clustering is independent of how the distance is calculated, so we can be as particular about the distance as we want. In this case the strategy for distance was chosen as KLD, but it could be cosine, Jensen-Shannon or other distances as well. These distance computations can also be materialized ahead of time if we know the vocabulary or features beforehand, for tasks such as keyword extraction. That is an optimization for known samples. In general, we have a distance algorithm that we pair with each clustering.
Similarly, we can vary the clustering algorithms among the choices we listed earlier; several were mentioned in earlier posts. We can compare different clustering algorithms, such as K-means and agglomerative, on the same distances, as sketched below. In the implementation, letting the clustering algorithm vary independently from the distance computation makes us more flexible and lets us compare the algorithms.
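One way to realize this decoupling, sketched under some assumptions: build the pairwise distance matrix once (symmetrizing the KLD, since it is asymmetric, which is itself an assumption) and hand the same matrix to whichever clusterer we want, here SciPy's agglomerative routine. The toy documents and vocabulary are made up.

    import math
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def sym_kld(p, q, vocab, eps=1e-9):
        # Symmetrized KLD so that standard clusterers can consume the matrix.
        def kld(a, b):
            return sum((a.get(t, 0) + eps) *
                       math.log((a.get(t, 0) + eps) / (b.get(t, 0) + eps), 2)
                       for t in vocab)
        return 0.5 * (kld(p, q) + kld(q, p))

    docs = [{"market": 0.6, "shares": 0.4}, {"market": 0.5, "shares": 0.5},
            {"plot": 0.7, "review": 0.3}]
    vocab = ["market", "shares", "plot", "review"]

    n = len(docs)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            m[i, j] = m[j, i] = sym_kld(docs[i], docs[j], vocab)

    # The same matrix could feed a K-medoids run or, as here, agglomerative clustering.
    labels = fcluster(linkage(squareform(m), method="average"), t=2, criterion="maxclust")
    print(labels)   # e.g. [1 1 2]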
When we take the sample data for the three categories, we will need to extract the vocabulary to use as features for the documents. This is possible with the KLD as well. The documents can be selected to be representatives of their categories.
Initially, to test the clusterer, we could select the sample data to be a small set and pre-label it. Then we can look at the precision and recall. However, the sample sets we choose for unsupervised runs must be at least five to ten times the number of features.
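A small sketch of what that check could look like: the labels are toy values, and mapping each cluster to its majority true label before scoring is an assumption about how the evaluation would be set up.

    from collections import Counter

    true_labels      = ["news", "news", "reviews", "reviews", "editorials"]
    cluster_assigned = [0, 0, 1, 1, 1]

    # Map each cluster id to the majority category among its members.
    majority = {}
    for c in set(cluster_assigned):
        members = [t for t, a in zip(true_labels, cluster_assigned) if a == c]
        majority[c] = Counter(members).most_common(1)[0][0]

    predicted = [majority[c] for c in cluster_assigned]
    for cat in set(true_labels):
        tp = sum(p == cat and t == cat for p, t in zip(predicted, true_labels))
        fp = sum(p == cat and t != cat for p, t in zip(predicted, true_labels))
        fn = sum(p != cat and t == cat for p, t in zip(predicted, true_labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        print(cat, round(precision, 2), round(recall, 2))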

Saturday, November 23, 2013

The main advantage of the structural approach is that in addition to classification, it provides a description of the pattern. It specifies how the given pattern can be constructed from primitives. This paradigm has been used in situations where the patterns have a definite structure that can be captured by a set of rules. Here there is a correspondence between the structure of the pattern and the syntax of a language. The patterns belonging to a category are viewed as sentences belonging to a language. A large set of complex patterns can be described by using a small number of primitives and grammatical rules. The grammar for each pattern class must be inferred from training samples.
Syntactic patterns have a great deal of intuitive appeal. However, implementation of this approach leads to many difficulties, such as segmenting noisy patterns to detect the primitives, and inferring grammars.

We talked about distributions. Let's look into this a little more closely. We mentioned that the Gaussian distribution leads to optimal rules. This is because the mathematical properties of this distribution are well understood, and its use is often justified by the central limit theorem. Second, many applications find it easier to summarize data in the form of a mean vector and a covariance matrix. If the density function is unimodal, then this is simple and easy to apply. If X is a vector with d features, the density function f(x1, x2, ..., xd) is described in terms of the d x d covariance matrix, its determinant, mu (the d-vector of expected values), and x (a d-vector containing the variables x1, x2, ..., xd). The mean vector mu and the covariance matrix define a Gaussian distribution. The parameters of a multivariate Gaussian density are estimated by the sample means and sample covariances to fit a Gaussian distribution to a set of patterns. The mean is found as an expectation, i.e. mu-i = E(Xi). The population covariance between Xi and Xj is sigma-ij = E[(Xi - mu-i)(Xj - mu-j)]; the diagonal elements sigma-ii are the variances. The covariance can also be written as sigma-ij = ro-ij . sigma-i . sigma-j, where ro-ij is the correlation coefficient between Xi and Xj, and sigma-i, sigma-j are their standard deviations. If ro-ij turns out to be zero, then Xi and Xj are uncorrelated; for jointly Gaussian features, this also implies statistical independence. Features are pictured as random variables and each pattern is a realization of these random variables.
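Written out explicitly, the density and parameter definitions above take the standard multivariate Gaussian form:

    f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)

    \mu_i = E[X_i], \qquad \sigma_{ij} = E[(X_i-\mu_i)(X_j-\mu_j)] = \rho_{ij}\,\sigma_i\sigma_j, \qquad \sigma_{ii} = \sigma_i^2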
 
In today's post we continue our discussion of the algorithms and the issues encountered. We list some of them here.
One-shot versus hierarchical tree classifier: In the former, the distinctions between all the classes are made in one stage, which is especially helpful when the number of features is large. In the latter, the classification structure is a binary tree where the most obvious discriminations are made first and the subtle ones later. When the number of features per node is smaller than the total number of features, this is much faster.
Parametric versus non-parametric classification: Parametric classification relies on knowing the form of the class-conditional densities so that the decision rules can be written down. If the class-conditional densities are Gaussian, parametric techniques can be applied, and the decision rules can be optimal or plug-in rules. Non-parametric rules, which do not assume a form for the pattern class distributions, can also be very helpful when such assumptions cannot be made.
Dimensionality and sample size relationship: When the number of features for a vector is very large, the classifier designer has to choose a smaller feature set. This is often referred to as the curse of dimensionality and results in some loss of fidelity. It is mitigated by choosing the number of training samples per class to be at least five to ten times the number of features.
Feature selection: Even among the set of possible features, the ones that are selected depend upon computational ease and cost considerations. Moreover, determining the optimal subset of features of size m from the d available features requires an exhaustive search over all possible subsets. Some intelligent search strategies to alleviate this have been proposed, and I've mentioned a few in my previous posts.
Error estimation: In order for us to know whether the samples are correctly classified, we measure the classification errors. We do this by splitting the available samples into training sets and test sets. The classifier is designed using the training set and then evaluated on the samples from the test set. Precision and recall are two metrics with which the classifier performance is evaluated. Common splitting strategies are the hold-out method and the leave-one-out method, and the appropriate split between training and test sets has been investigated in the literature. A bare-bones sketch of a hold-out evaluation follows below.
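This is only an illustration of the hold-out idea: split the labeled samples, design the classifier on the training part, and estimate the error on the held-out part. The trivial nearest-mean rule on 1-D points and the toy samples are assumptions made just to keep the example small.

    samples = [(0.9, "a"), (1.1, "a"), (1.0, "a"), (3.0, "b"), (3.2, "b"), (2.9, "b")]
    train, test = samples[::2], samples[1::2]   # a simple 50/50 hold-out split

    # "Design" the classifier on the training set: one mean per class.
    means = {}
    for label in {"a", "b"}:
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)

    def classify(x):
        # Assign to the class whose training mean is nearest.
        return min(means, key=lambda label: abs(x - means[label]))

    # Evaluate on the held-out test set.
    errors = sum(classify(x) != y for x, y in test)
    print("error rate:", errors / len(test))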
So far we have discussed the geometrical or statistical pattern recognition algorithms.
The second paradigm of pattern recognition algorithms is based on the structural or syntactic approach. When the number of features required to establish a decision boundary is very large, it is helpful to view such patterns as being composed of simple sub-patterns. A sub-pattern could itself be built from simpler parts with grammatical techniques.

Friday, November 22, 2013

The book I mentioned in my previous post also has a section on pattern recognition. Pattern recognition refers to the classification or description of objects or patterns. The patterns themselves can range from characters in an image of printed text to biological waveforms. The recognition involves identifying the patterns and assigning labels for categories. We start with a set of training patterns. The main difference between pattern recognition and cluster analysis is the role of pattern class labels. In pattern recognition, we use the labels to formulate decision rules; in cluster analysis, we use them only to verify the results. Pattern recognition requires extrinsic information. In cluster analysis, we use only the data.
There are two basic paradigms to classify a pattern into one of K different classes. The first is a geometric or statistical approach.  In this approach, a pattern is represented in terms of d features and the pattern features are as independent from one another as possible. Then given training patterns for each pattern class, the objective is to separate the patterns belonging to different classes.
In statistical pattern recognition, the features are assumed to have a probability density function that is conditioned on the pattern class. A pattern vector x belonging to a class wj is a data point drawn from the conditional probability distribution P(x|wj), where j is one of the K different classes. Concepts from statistical decision theory and discriminant analysis are utilized to establish decision boundaries between the pattern classes. If the class-conditional densities are known, then Bayes decision theory gives the optimal decision rule. Since they are generally not known, a classifier is designed based on the nature of the information available. A classifier is built with either supervised or unsupervised learning. In supervised learning, the label on each training pattern represents the category to which the pattern belongs; if the form of the class-conditional densities is known, we use parametric decision rules, and otherwise non-parametric ones, with the density functions estimated from the training samples. In unsupervised learning, no labels are available, and the categories may be known beforehand or they may be unknown.
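For reference, the optimal rule mentioned here is the standard Bayes decision rule: assign a pattern to the class with the largest posterior probability,

    \text{assign } \mathbf{x} \text{ to } \omega_k \text{ where } k = \arg\max_j P(\omega_j \mid \mathbf{x}) = \arg\max_j \frac{p(\mathbf{x} \mid \omega_j)\,P(\omega_j)}{p(\mathbf{x})} = \arg\max_j p(\mathbf{x} \mid \omega_j)\,P(\omega_j)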
When the number of pattern classes is unknown, the goal is to find natural groupings in the data. In the tree representing these dichotomies in statistical pattern recognition, the problems get more difficult as we traverse from top to bottom and from left to right in the picture given below.
Prior Information
    Complete -> Bayes Decision Theory
    Incomplete
        Supervised learning
            Parametric -> optimal rules, plug-in rules
            Non-parametric -> density estimation, geometric rules
        Unsupervised learning
            Categories known -> mixture resolving
            Categories unknown -> cluster analysis

Thursday, November 21, 2013

I came across the book Algorithms for Clustering Data by Jain and Dubes. I want to start posting from that book, but first I want to jump to a section towards the end of the book that talks about graph theory. As we may know, a graph has a finite set of vertices, and these can represent the objects being clustered. A set of edges records the interactions between pairs of vertices. The vertices and edges are related by a function f that maps edges into unordered pairs of vertices. We discuss finite undirected graphs when we say unordered pairs of vertices. Therefore a graph is a triple (V, E, f).
Some applications require that labels or weights be assigned to the vertices or to the edges.
As an example, the edges could be labeled by the distance between the patterns representing the objects. There are some common definitions as well. A subgraph is a graph whose vertices and edges are all contained in the original graph. A path in a graph is an alternating sequence of vertices and edges that contains no repeated edges and no repeated vertices, and for which each edge ei is incident to vertices vi and vi+1.
 A graph is connected if a path exists between any two vertices in the graph.
A component is a maximal connected subgraph. Maximal means that as many vertices as possible are included, so a component is not a proper subgraph of any other connected subgraph of the original graph. A graph is complete if an edge joins every possible pair of vertices. A complete graph on n nodes contains exactly n(n-1)/2 edges.
There are some interesting observations as well. A maximal complete subgraph of G is a complete subgraph of G that is not a proper subgraph of any other complete subgraph of G. In other words, H is a maximal complete subgraph of G if every pair of vertices in H is joined by an edge and no additional vertices of G can be added to H without destroying the completeness. A maximal complete subgraph is also called a clique.
A cycle is the same as a path except that vertices v1 and vn are the same vertex. A tree is a connected graph with no cycles; a tree on m vertices has m-1 edges. A spanning tree is a tree containing all vertices of the graph. The weight of a tree is the sum of the weights of its edges, where the weights represent dissimilarities on a ratio scale. A minimal spanning tree (MST) is the one with the least weight among all spanning trees.
A number of algorithms are available for finding the minimum spanning tree. MSTs of complete graphs are important in cluster analysis.
The edge-ranking construction, Kruskal's algorithm, is easy to try out by hand: rank the edges by their weights and insert them in order of increasing dissimilarity, skipping any edge that would create a cycle, until a spanning tree is formed. Prim's algorithm instead grows the tree from a single vertex, repeatedly adding the cheapest edge that connects a tree vertex to a new vertex. A sketch of the edge-ranking version follows.
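This is a small sketch of the edge-ranking procedure under the assumption of made-up toy dissimilarities; it uses a simple union-find structure to reject cycle-forming edges.

    edges = [(1.0, "a", "b"), (2.5, "b", "c"), (2.0, "a", "c"),
             (4.0, "c", "d"), (3.0, "b", "d")]

    parent = {v: v for v in "abcd"}

    def find(v):                      # follow parent links to the set representative
        while parent[v] != v:
            v = parent[v]
        return v

    mst = []
    for w, u, v in sorted(edges):     # edges ranked by increasing dissimilarity
        ru, rv = find(u), find(v)
        if ru != rv:                  # the edge joins two different components, no cycle
            parent[ru] = rv
            mst.append((u, v, w))

    print(mst)   # a spanning tree with the least total weight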
A threshold graph is one that has n labeled nodes and N unlabeled edges. A random graph is one whose edges are inserted at random. A rank graph is a threshold graph in which the edges are labeled by their rank. The N edges can be chosen from the n(n-1)/2 possible edges without replacement and without regard to order, and each such choice can be ordered in N! ways, which gives the number of rank graphs.
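As a worked count, based on my reading of the counting argument above:

    \#\{\text{threshold graphs}\} = \binom{n(n-1)/2}{N}, \qquad \#\{\text{rank graphs}\} = \binom{n(n-1)/2}{N} \cdot N!

For example, with n = 4 nodes there are 6 possible edges; choosing N = 3 of them gives C(6,3) = 20 threshold graphs and 20 * 3! = 120 rank graphs.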
In the unsupervised category, MCMM was compared to K-means clustering on the same DS2 dataset. Three of the four clusters align very well with the labels given by Reuters, achieving very high precision scores, especially considering that label information was unavailable during training. The algorithm stops after it has added an additional cluster center that does not increase the gain function. That is why the first three clusters are more clearly delineated than the last one. The fourth cluster found by the MCMM is a general subtopic within the dataset, almost like a miscellaneous group.
When compared with the earlier supervised run, where we found clusters that did not match the Reuters labels, a closer inspection showed that the clusters were well formed and that some documents had originally been mislabeled.
Thus MCMM appears to be a practical machine learning method for text categorization. The ability to run essentially the same algorithm in both supervised and unsupervised mode helps evaluate existing labels. There is also the flexibility to include documents in more than one category, along with a ranked list of the degree of inclusion in each cluster.