Saturday, November 30, 2013

We can talk about some cluster evaluation or validation techniques. Each clustering algorithm can define its own type of validation. For example, we discussed that K-means can be evaluated based on SSE. Almost any clustering algorithm will find clusters in a dataset: even if we take uniformly distributed data and apply the algorithms we discussed, such as DBSCAN or K-means, they will still report clusters (three of them, in the textbook example). This prompts us to do cluster validation. Cluster validation involves the following aspects:
1) Detecting whether there is non-random structure in the data
2) Detecting the clustering tendency of a set of data.
3) Determining the correct number of clusters
4) Comparing the results of cluster analysis with an external labeling
5) Comparing two sets of clusters to determine which is better.
Aspects 1 to 3 do not rely on external data. Aspects 3 to 5 can be applied either to an entire clustering or to individual clusters.
If the validation is done with a measure, that measure may not apply to all clusters; some measures are only applicable to two- or three-dimensional data. Further, once we obtain a value for the measure, we still have to evaluate whether it indicates a good match. The goodness of a match can be judged by looking at the statistical distribution of the value to see how likely such a value is to occur by chance.
The measures may be classified into one of following three categories:
1) Unsupervised, where there is no external information, e.g. SSE. These are also called internal indices. There are two subcategories here: measures of cluster cohesion and measures of cluster separation.
2) Supervised, which measures the extent to which the clustering structure matches an external classification. These are also called external indices. An example of this type of measure is entropy, which measures how well cluster labels match externally supplied class labels.
3) Relative, which compares different clusterings or clusters. These are not a separate category per se but a comparison of, say, two clusterings. A relative cluster evaluation measure can use either a supervised or an unsupervised measure.
These and some previous posts have been reviewed from the chapter Clustering: Basic Concepts and Algorithms.

bisecting k-means

A bisecting K-means algorithm overcomes some of the limitations of basic K-means with the choice of initial centroids and finding the minimum SSE. It is based on the idea that to obtain K clusters, we split the set of all points into two clusters, select one of these clusters to split, and so on until K clusters have been produced.
The steps for the bisecting K-means algorithm include the following:
1. Initialize the list of clusters with one that contains all the points
2. repeat
3. Remove a cluster from the list of clusters
4. Perform several trial bisections of the chosen cluster
5. for i = 1 to number of trials do
6.    Bisect the selected cluster using basic K-means
7. end for
8. Select the two clusters from the bisection with the lowest total SSE
9. Add these two clusters to the list of clusters
10. until the list of clusters contains K clusters.
There are a number of ways to choose which cluster to split. We can choose the largest cluster at each split or the one with the largest SSE or both.
Further, we can refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm.
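To make these steps concrete, here is a minimal sketch of how such a procedure might look in Python, splitting the cluster with the largest SSE and using scikit-learn's KMeans for the trial bisections; the function name and parameters are illustrative rather than part of any particular library.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    # Start with a single cluster holding every point, then repeatedly bisect.
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # choose the cluster to split: here, the one with the largest SSE
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        best = None
        for trial in range(n_trials):                    # several trial bisections
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + trial).fit(X[target])
            if best is None or km.inertia_ < best[0]:    # keep the split with the lowest total SSE
                best = (km.inertia_, km.labels_)
        clusters.append(target[best[1] == 0])
        clusters.append(target[best[1] == 1])
    return clusters                                      # list of index arrays, one per cluster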

Friday, November 29, 2013

A discussion on SSE, as brought up in the previous post. SSE is the sum of squared error and it describes the scatter of a cluster. When the SSE is low, the clusters are well formed and cohesive, so SSE gives a measure of the quality of a clustering. Also, when the measure is low, the clusters are better represented by their centroids. As the name suggests, SSE is computed by aggregating the squared distance of each point from its centroid, where the distance is the Euclidean distance between the two points. It can be shown that the centroid that minimizes the SSE of a cluster is its mean, which is valuable because it describes the optimum. In K-means, for example, points are assigned to their nearest centroid, which minimizes the SSE for the given set of centroids, and the centroids are then recomputed so as to minimize the SSE again. Notice, however, that these steps of K-means only find a local minimum with respect to the SSE, and as we will see shortly, this can lead to sub-optimal clustering. Also, random initialization of the centroids results in a different SSE in each run, which causes variations in the end result; typically the run with the lowest SSE is chosen as the best result. Consider the following two issues: 1) poor initial centroids - even when the centroids are well distributed, we can obtain a sub-optimal clustering with higher squared error, and 2) limits of random initialization - one way to overcome the first issue is to perform multiple runs with varying initial points and then choose the result with the least SSE, but this too may only find local minima.
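As a quick illustration, the SSE for a given assignment can be computed directly from the points, their cluster labels and the centroids; a minimal numpy sketch:

import numpy as np

def sse(points, labels, centroids):
    # sum of squared Euclidean distances of each point to its assigned centroid
    return float(sum(np.sum((points[labels == i] - c) ** 2)
                     for i, c in enumerate(centroids)))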
Since we talked about the DBSCAN clustering method in the previous post, let's quickly revisit agglomerative hierarchical clustering as well. Hierarchical algorithms are an important category of clustering algorithms. Hierarchical clustering can be agglomerative or divisive. Both are often displayed visually using a tree-like diagram called a dendrogram, which displays both the cluster and sub-cluster relationships and the order in which the clusters were merged or split. For sets of two-dimensional points, a hierarchical clustering can also be illustrated using a nested cluster diagram.
The algorithm proceeds like this:
1. Compute the proximity matrix, if necessary.
2. repeat
3. Merge the closest two clusters
4. Update the proximity matrix to reflect the proximity between new and original clusters.
5. until only one cluster remains.
Step 3. can be done using single link, complete link and group average methods.
The single link is the minimum distance between clusters as measured between two points - one from each cluster that are nearest. The complete link is the maximum distance between clusters as measured between two points - one from each cluster that are farthest. And the group average takes the average of pairwise distances of all pairs of points from the two different clusters.
Proximity can also be measured in terms of the distances between the centroids of the clusters. Ward's method also uses centroids, but it measures the proximity between two clusters in terms of the increase in the sum of squared error (SSE) that results from merging them. SSE gives an indication of cluster quality, and Ward's method attempts to minimize the SSE of points from their cluster centroids.
Due to the use of the proximity matrix, the space complexity is O(m^2). Since the algorithm updates the proximity matrix over m-1 iterations, the straightforward time complexity is O(m^3). With data structures such as a heap to organize the proximities efficiently, this can be reduced to O(m^2 log m).
This kind of clustering does not globally optimize an objective function. Instead it uses the criteria described above to decide locally which clusters to merge (or split) at each step. Each local decision is easy to compute, but the overall time and space complexity of a run is high.
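A naive sketch of these steps in Python, working directly on a precomputed proximity matrix; the linkage argument selects single link, complete link or group average as described above (illustrative code, not any particular library's implementation):

def agglomerative(D, linkage="single"):
    # D is an m x m proximity (distance) matrix; returns the merge history.
    m = len(D)
    clusters = {i: [i] for i in range(m)}          # each point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        ids = list(clusters)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                pair = [D[i][j] for i in clusters[ids[a]] for j in clusters[ids[b]]]
                if linkage == "single":
                    d = min(pair)                   # closest pair of points
                elif linkage == "complete":
                    d = max(pair)                   # farthest pair of points
                else:
                    d = sum(pair) / len(pair)       # group average
                if best is None or d < best[0]:
                    best = (d, ids[a], ids[b])
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)  # merge the two closest clusters
        merges.append((a, b, d))
    return merges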

Thursday, November 28, 2013

Today we talk about the DBSCAN algorithm. It's a simple and effective density-based clustering algorithm that illustrates a number of important concepts. We first consider the notion of density. There are several methods to measure density, and they are distinct from measures of similarity. In the center-based approach, density is estimated for a particular point by counting the number of points within a certain radius, including that point. This method is simple, but the density measured this way depends on the radius. If the radius is large enough, then all points will have a density of m, the number of points in the dataset. If the radius is too small, then all points will have a density of 1.
By this approach, points will either lie in the core of a density-based cluster, or they will lie on the border between the neighborhoods of several core points, or they may be outliers that are neither part of a cluster nor in the neighborhood of one. Thus points in this approach are classified as core points, border points and noise points.
The algorithm proceeds like this: any two core points that are close enough, i.e. within the radius of one another, are put in the same cluster. Any border point that is close enough to a core point is included in the same cluster as that core point. If a border point is close to core points from two or more clusters, the tie is resolved by choosing the closest. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label the points as core, border and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make a cluster out of each group of connected core points that are within the specified distance of one another.
5. Assign each border point to one of the clusters of its associated core points.
For each point we find the points that are within its neighborhood, so the complexity of this algorithm is O(m^2). However, there are data structures that can help reduce the cost of retrieving the points adjacent to a given center, bringing the complexity down to O(m log m). The space complexity is linear because we only persist a small amount of metadata for each point, namely its cluster label and its classification as a core, border or noise point.
The parameters for this algorithm are the radius and the minimum number of points in a cluster to distinguish the core from the border points.
This algorithm has trouble when the densities of the clusters vary widely. It also has difficulty with high-dimensional data, but it remains one of the simplest ways of clustering.
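A compact sketch of the steps above, assuming Euclidean distance and numpy arrays; eps is the radius and min_pts the minimum number of neighbors that makes a point a core point:

import numpy as np

def dbscan_labels(X, eps, min_pts):
    # Label points core/border/noise and group connected core points (naive O(m^2)).
    m = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(m)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(m, -1)                    # -1 means noise or not yet assigned
    cluster_id = 0
    for i in range(m):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id                 # start a new cluster from this core point
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id     # border or core point joins the cluster
                    if core[q]:
                        stack.append(q)        # only core points expand the cluster further
        cluster_id += 1
    return labels, core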
In order for us to work with a plugin for an editor such as MS Word, as mentioned in the previous post, we need to visit the object model exposed by that application. The Word object model comprises the following five top-level objects:
Application object - represents the word application
Document object - represents the document and all of its content
Selection object - represents the area that is currently selected
Range object - represents a contiguous area in a document with a start and an end character position. It exists only while the document is open and is not visible to the user.
Bookmark object - represents a contiguous area in a document with a start and end position and is persisted.
As you may see, they closely represent the hierarchy seen in the user interface.
At the top of this hierarchy is the application object. The application object contains document, selection, bookmark and range objects. Both the document object and the selection object can contain bookmark and Range objects.
Word also has content control objects. These let us control the input and the presentation of text and other types of content, such as a date picker or a combo box. They help you do the following:
prevent users from editing or deleting parts of document and
bind parts of a document or template to data.
One of the content controls is the Rich text. A rich text control contains text or other items such as tables, pictures or other controls.
A rich text content control has a DefaultTextStyle  and a PlaceholderText member. The former gets the name of the character style to format the text and the latter gets or sets the text that is displayed in the rich text content control.
A document also has XMLNode controls to directly manipulate the markup and its hierarchy. This lets you define XML content and a schema that can be queried with XPath and displayed with content control objects. This is done by adding a custom XML part with the data from the XML and then binding the controls to that custom XML part. XMLNodes can have a SmartTag object that represents the tag associated with the node.
The Range object can be used to format text in a Word document.
object start = this.Content.Start;                 // or this.Sentences[0].Start
object end = this.Content.End;
Word.Range rng = this.Range(ref start, ref end);   // or this.Paragraphs[1].Range
rng.Font.Size = 14;
rng.Font.Name = "Arial";
rng.ParagraphFormat.Alignment = Word.WdParagraphAlignment.wdAlignParagraphCenter;  // or another WdParagraphAlignment value
object indentStyle = "Normal Indent";
rng.set_Style(ref indentStyle);
rng.Select();
// find and replace within the range
object missing = System.Type.Missing;
object findText = "find me";
object replaceWith = "Found";
object replaceAll = Word.WdReplace.wdReplaceAll;
rng.Find.ClearFormatting();
rng.Find.Text = "find me";
rng.Replacement.ClearFormatting();
rng.Replacement.Text = "Found";
rng.Find.Execute(ref findText, ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing, ref replaceWith,
    ref replaceAll, ref missing, ref missing, ref missing, ref missing);

Wednesday, November 27, 2013

In the post about Keyword extraction by Kullback-Leibler technique, we talked about having a sliding scale of threshold for selecting keywords. We also talked about a viewer for users to see the keywords as highlighted in place using fonts bigger than the others in the text. As the threshold on the sliding scale is increased, we talked about more keywords appearing on the viewer.
 In today's post, I'm not only going to cover how to render the text but also cover how to make it independent of the actual text.
We will use offsets or word counts to find relative positions of keywords from start.
With the relative positions, we are able to find the keywords dynamically. We use pagination for blocks of text, where each page can have a varying word count depending on what fits the page view, which is determined independently. We represent blocks of text with an abstraction we will refer to as pages, and we represent what the user sees with a page view that can change to accommodate text spanning one or more of the pages. Whenever that happens, re-pagination occurs and the page view is updated with the corresponding page. Thus there is a one-to-one relationship between a page and a page view. The page behind each page view is expected to change as the number of keywords in the text increases and the space they occupy grows. By decoupling the view from the page, we let both vary independently. The views also depend on the available space on the user's screen, the overall font size, and the stretching or skewing that users do via the window in which the text displays. The views could also change with changes to the font size of the text. On the other hand, the page changes whenever text is added to or removed from the overall document. The viewer could be built with tools to flip through the pages and to go to a specific location in the document via bookmarks. Decoupling the viewer altogether, in favor of programs that are dedicated editors, is possible by marking the keywords in the text with special markup that those editors recognize. It's also possible that those editors may support custom rendering in case they don't already support special treatment for keywords.
Whether a viewer is written as part of the program that selects the keywords, or as a plugin for existing editors, the viewer is concerned only with drawing the user's attention to keywords. The keyword selection need not happen on each document rendering; it can be done either beforehand or as part of the document life cycle. That is, the scope in which the keyword extraction works is determined by what's convenient to the tool. As an example, the text of each document could be sent to a centralized database where the documents are processed and keywords are selected. The viewers that reside on the devices used by the reader could then download the processed document whenever necessary. The uploading and downloading of the text needs to be done only when the contents change; otherwise the processed document can simply be cached. This eliminates the time cost of serving the document to the user.
The page and the page views are artifacts for the viewer and have no place in the processed text that we keep in such a proposed centralized repository. The repository and processing only sees unpaginated structured text.
There are some interesting properties of Pearson's hashing function that we brought up in the previous post. We will discuss some variations now:
1) When compared with a hashing function that updates the hash in each step as h[i] = h[i-1] + c[i] mod table_size, that additive function does not separate anagrams and demonstrates a significant deviation from uniformity.
2) That said, the non-uniform distribution of the additive hashing function above does not necessarily confer a large performance penalty.
3) If the range of the input characters can be limited, a smaller auxiliary table may be used, e.g. for alphanumeric input.
4) In some cases, the hash values may need to be larger than eight bits. To extend the hash to sixteen-bit values, we could do the following:
apply h to the string, called the result H1, 
add 1 modulo 256 to the first character of the string,
apply h to the modified string to get H2,
concatenate H1 and H2 to get a  16-bit index.
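A small self-contained sketch of this sixteen-bit extension in Python; the table T here is just a random permutation generated with a fixed seed for illustration:

import random

random.seed(0)
T = list(range(256))
random.shuffle(T)                      # the 256-entry table of randomish bytes

def h(data: bytes) -> int:
    # the original eight-bit Pearson hash
    acc = 0
    for b in data:
        acc = T[acc ^ b]
    return acc

def h16(s: str) -> int:
    data = bytearray(s.encode("utf-8"))
    h1 = h(data)                       # apply h to the string to get H1
    data[0] = (data[0] + 1) % 256      # add 1 modulo 256 to the first character
    h2 = h(data)                       # apply h to the modified string to get H2
    return (h1 << 8) | h2              # concatenate H1 and H2 into a 16-bit index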
Note that when this sixteen-bit hashing was applied to the same spell-checker words as the original hash function, there was only a marginal change in collisions, indicating that both perform about the same.
5) Collision handling can be improved by using a succession of hash indices. The original hashing function is well suited for this. By repeatedly incrementing the first character of the input string, modulo 256, one causes the hash index returned by h to pass through all 256 possible hash index values since strings differing by one character cannot produce the same hashing value.
6) This hashing algorithm lends itself to perfect and minimal hashing. Recall that a perfect hashing is one in which there are no collisions, and a minimal hashing is one in which the integers onto which the particular list of words is mapped form a contiguous set. Minimal perfect hashing is useful when a predetermined set of high-frequency words is expected as input and the resulting hash has to be used as an index into an array relating to those words. We can use this hashing algorithm to take a set of 31 common English words and map them onto unique integers between 1 and 31, in alphabetical order, by using a special case of the table T.
7) The advantages of Pearson's hashing function include the following:
there is no restriction on the length of the string
the length does not need to be computed beforehand
there is very little computation for each character of the input being hashed
similar strings are not likely to collide
The disadvantages are that the output value ranges of the hash are powers of 2 (getting other ranges is more complicated) and that more auxiliary memory is required for the table.

Tuesday, November 26, 2013

Pearson's function for fast hashing of variable-length text strings is an interesting hashing technique.
It proposes a hashing function that takes a word W as input, consisting of n characters C1, C2, ..., Cn, each character represented by one byte. The function returns an integer in the range 0 to 255. A table T of 256 randomish bytes is used in the process, for an indexed memory read and an exclusive-or operation against the input. The length of the string does not need to be known in advance because the null terminator at the end of the string marks the end of the input.
The steps for the hashing are described as follows:
h[0] = 0;
for (int i = 1; i <= n; i++)
  h[i] = T[ h[i-1] xor C[i] ];
return h[n];
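A direct translation of these steps into Python, with T generated as a random permutation (any fixed random table will do):

import random

random.seed(42)
T = list(range(256))
random.shuffle(T)                    # table of 256 randomish bytes

def pearson_hash(word: str) -> int:
    h = 0
    for c in word.encode("utf-8"):   # C1 .. Cn, one byte per character
        h = T[h ^ c]                 # h[i] = T[ h[i-1] xor C[i] ]
    return h                         # result in the range 0 to 255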
There are two desirable properties of this algorithm which derive from cryptographic checksums and message authentication codes. First, it ensures that small changes to the data result in large and nearly random changes to the checksum. Second, the effect of changing one part of data must not be canceled by an easily computed change to some other part and this is preserved by this algorithm.
The downside is that it is length-reducing and based on exclusive-ORs of substrings. There is no undesirable effect from using one particular T or another as long as it is random; on the other hand, manipulating T to intentionally disperse the results for short, similar strings turned out to be a poor choice.
h can be evaluated on the following:
h[n] is random
if two input strings differ by a single bit, they will never collide
h was applied to a spelling checker dictionary with 26,662 words and the distribution of the resulting output values was not very uneven.
We may recall the textbook example for hashing:
// Textbook shift-and-XOR hash: fold the raw bytes of the key into a size_t.
template<class T> class Hash {
public:
    size_t operator()(const T& key) const
    {
        size_t res = 0;
        size_t len = sizeof(T);
        const char* p = reinterpret_cast<const char*>(&key);
        while (len--) res = (res << 1) ^ *p++;   // shift left and XOR in each byte
        return res;
    }
};
Let us take a closer look at the insertion and deletion of nodes in B-trees. To ensure that there are very few disk accesses to locate a key, we have to maintain a balanced tree during insertion and deletion. During insertion, we check whether the child node is able to hold an additional key. If not, a new sibling node is added to the tree and the child's keys are redistributed to make room. During redistribution, the middle key of the overfilled child node is sent up to the parent and the node is split at that key. If, while descending the tree during insertion, we find the root is full, some of its keys are given to a new child; this change in the level of the tree keeps it balanced. During deletion, we check whether the root can absorb its child nodes; the key to be removed is replaced with the largest key of the left subtree or the smallest key of the right subtree, and this replacement is guaranteed to come from a leaf. If a node has too few keys after a deletion, say less than half full, it looks for keys from a sibling. If such a sibling exists, the node takes a key from the parent and the parent takes the extra key from the sibling. If the sibling also has too few keys, then the two nodes are joined to form one full node.
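As a sketch of the redistribution described above, here is how splitting an overfull child and promoting its middle key to the parent might look; the node layout and capacity are made up purely for illustration and are not any particular library's implementation.

class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []        # empty for leaf nodes
        self.leaf = leaf

MAX_KEYS = 4                      # hypothetical capacity; a node is "full" at MAX_KEYS keys

def split_child(parent, i):
    # Split the full child parent.children[i]; its middle key moves up into the parent.
    child = parent.children[i]
    mid = len(child.keys) // 2
    right = BTreeNode(leaf=child.leaf)
    right.keys = child.keys[mid + 1:]
    parent.keys.insert(i, child.keys[mid])       # promote the middle key
    parent.children.insert(i + 1, right)
    if not child.leaf:
        right.children = child.children[mid + 1:]
        child.children = child.children[:mid + 1]
    child.keys = child.keys[:mid]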
B* trees are similar; the only difference is that the nodes are kept 2/3 full. This results in better utilization of space in the tree and slightly better performance.
B+ trees store all the keys at the leaf level, with their associated data values. The parent nodes keep the duplicates of the keys to guide the search. During insertion and deletion, the parent nodes have to be updated correctly. For example, when modifying the first key in a leaf, the last right pointer found while descending the nodes will require modification to reflect the new key value. Also, since all the keys are in the leaf level, they may be linked for sequential access.
In the B++ tree, the keys are again all in the leaves. Each node can hold k keys and the root node can hold 3k keys. During insertion, we check to see if a child node is full before we descend the level in the tree. If it is, we take the child node and the two adjacent nodes, and merge and redistribute them. If the two adjacent nodes are also full, then another node is added resulting in four nodes each three-quarters full.  Similarly during deletion, we check to see if the child node is half full before we descend the level. If it is, the keys in the child nodes and the two adjacent nodes are all merged and redistributed. If the two adjacent nodes are also half-full, then they are merged into two nodes each three-quarters full.
We increase the level of the tree when the root is full during insertion in which case we distribute the keys to four new nodes, each three-quarters full. During deletion, we check the child nodes and if there are only three that are all half-full, then they are merged into the root and reduce the level of the tree.

Monday, November 25, 2013

We mentioned external sorting and dictionaries in the previous post for sorting very large data sets. Dictionaries are very large files too, and they are implemented as an index to the actual file containing the key and the record address of the data. A dictionary could be implemented with a red-black tree, replacing pointers with offsets from the beginning of the index file and using random access to reference nodes of the tree. However, every transition along a link could imply a disk access and would be prohibitively expensive. Since disk access is by sectors, it's better to group several keys together in each node to minimize the number of I/O operations. This is what B-trees do.
B-Trees arrange keys surrounded by pointers or offsets to keys that are less than or greater than the key value. For example, all keys less than the key value are to the left and all keys greater than the key value are to the right. We can locate any key in this tree with fewer disk accesses than in a regular tree: if we group 100 keys per node, we can search over 1,000,000 keys with just three node accesses. To ensure this property holds, we must maintain a balanced tree during insertion and deletion.
B-Trees come in many flavors: for example, we have the B-Tree, the B*-Tree and the B+-Tree. The standard B-Tree stores keys and data in both internal and leaf nodes. B*-trees are similar, except that the nodes are kept 2/3 full for better utilization of space in the tree and better performance.
In a B+ tree, all keys are stored at the leaf level with their associated data values. Duplicates of the keys appear in internal parent nodes to guide the search. Also, the pointers have a slightly different meaning: the left pointer designates all keys less than the value, while the right pointer designates all keys greater than or equal to the value. The author also mentions a B++-Tree that is similar to the B+ tree except for the split/join strategy, i.e. the keys are scattered to and gathered from other nodes during insertion and deletion.
In this post, I want to take a break to talk about sorting data from very large files. For data sets in memory, we can use one of several sorting algorithms. For data that is bigger than what can reside in memory, we will look at techniques for sorting externally and implementing dictionaries or B-Trees.
In external sorting, we implement what is known as replacement selection to establish initial runs, followed by a polyphase merge sort to merge the runs into one sorted file. A run is a sequence of numbers that is already sorted.
To create the initial runs, a buffer is allocated in memory to act as a holding area for several records. Initially the buffer is filled, then the following steps are repeated until the input is exhausted.
Select the record with the smallest key that is >= the last record written
If all the keys are smaller than the key of the last record written, then we have reached the end of the run.
Select the record with the smallest key for the first record of the next run.
Write the selected record
Replace the selected record with a new record from the input.
If we take a two record buffer, this is easy to illustrate. We load the buffer and write the smallest key to the destination. We replace this record with the next one and repeat the process. This process continues until all the keys written are smaller than the last key. At this point, we terminate the run and start another.
This strategy simply utilizes an intermediate buffer to hold values until the appropriate time for output.
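A minimal Python sketch of replacement selection, with a heap standing in for the holding buffer; record I/O is elided and the names are illustrative rather than taken from any particular implementation.

import heapq

def initial_runs(records, buffer_size):
    # Produce sorted runs that are usually longer than the buffer itself.
    it = iter(records)
    buf = []
    for r in it:                          # fill the buffer initially
        buf.append(r)
        if len(buf) == buffer_size:
            break
    heapq.heapify(buf)
    runs, run, held = [], [], []          # `held` keeps records deferred to the next run
    for r in it:
        smallest = heapq.heappop(buf)     # smallest key >= the last record written
        run.append(smallest)
        if r >= smallest:
            heapq.heappush(buf, r)        # r can still go out in the current run
        else:
            held.append(r)                # smaller than the last record written: next run
        if not buf:                       # current run is exhausted
            runs.append(run)
            run, buf, held = [], held, []
            heapq.heapify(buf)
    while buf:                            # drain what is left into the current run
        run.append(heapq.heappop(buf))
    if run:
        runs.append(run)
    if held:
        held.sort()                       # the deferred records form the final run
        runs.append(held)
    return runs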
Initially all the data is in one place. The source is read and the runs are distributed to other places. After the initial runs are created, they are merged. We use a Fibonacci number of runs on each input so that the number of passes can be minimized. Fibonacci numbers have the property that each number is the sum of the two preceding numbers. When we take the number of runs on each input and merge into the third, we are left with run counts that step down through this sequence; each merge takes the numbers down a notch. As mentioned, this reduces the number of passes over the data and is very helpful when we are considering sequential access. We want optimized access because we want to be efficient on a large dataset.
In poly-phase merge, we work with frames of data at a time and we take runs of data that are sequential from both source and merge them into a longer run on a third storage in each phase.  Then we relabel the storage so that the merged data acts as a source and merge this with the remainder of one of the sources. Finally we write it back to the destination.
The runs help us to merge perfectly in the sense that there are no extra runs on any source. That is why the step of creating the initial runs helps.
courtesy : Niemann
 

Sunday, November 24, 2013

Johnson's clustering scheme was first written up during the summer of 1965 when he was working at Bell Labs during a break from his PhD. It introduced cluster hierarchies and came with a program written in FORTRAN. Agglomerative hierarchical clustering places each object into its own cluster and gradually merges these atomic clusters into larger and larger clusters until there is only one. Divisive hierarchical clustering reverses this process by starting with all objects in one cluster and then subdividing it into smaller clusters. A hierarchical classification is a special sequence of partitional classifications.
Serial clustering handles the patterns one by one, whereas simultaneous clustering works with the entire set of patterns at the same time. A clustering algorithm can be expressed mathematically either by graph theory or by matrix algebra; the former is used to describe connectedness and completeness and the latter to describe, say, mean-square error.
For the purposes of clustering vectors based on KLD distances, we will use a simple partitioning scheme like the K-means we talked about. This is because we already know the source categories when we pick the sample documents. Besides, when we do keyword extraction, we will build a vocabulary from these documents. We may come across a large number of lemmas, and we planned on adding to an empty document the terms that keep the KLD distance above a threshold, i.e. D(avg-pt/q) > 1.
In addition, we mentioned varying the threshold on a sliding scale to retrieve more and more of the terms as needed. This we can set based on the samples at hand that give reasonable results. There is no single value of the threshold that is universally right; however, larger and larger values of the threshold are certain to give more discriminative keywords. This can be built with a visual tool that renders the keywords in a slightly bigger font than the rest, so that the keywords attract attention in place with the rest of the text. Together with a control to render the sliding scale, we can increase the number of terms that get this bigger font and make it easier for the reader to find the keywords and see the changes in their number. Building this visual tool should be fairly straightforward in that it renders the same text in different forms; the bigger font or magnifying-glass display for keywords can be enabled by setting the font and size markup to differ from the others.
I want to talk about using the K-means clustering algorithm to categorize documents based on Kullback-Leibler distance. We have discussed finding the Kullback-Leibler distance between one document and another based on the probabilities of the terms from the vocabulary. Next we normalize the distance and use this KLD distance for clustering.
Today we talk about how to cluster these documents. Let us take an example with K-means clustering. We take a sample set of documents from three categories, say news, reviews and editorials. We initially set the centroids to be representatives of their clusters. These we can pick from any of the documents that represent their categories, and they should ideally be as far apart from each other in terms of their cluster memberships as possible. Then for each of the documents, we find the KLD distance between the document and the centroids and assign it to the nearest cluster. After a round of assignments, we update the centroids. We do this by finding the average of all normalized distances within the cluster and picking the new centroid as the member closest to that average. Once the centroids are updated, we assign the labels again and re-adjust the cluster centers. We repeat this process until the cluster centers stabilize.
Clustering is independent of how the distance is calculated, so we can be as articulate about the distance as we want. In this case the strategy for distance was chosen as KLD, but this could be cosine, Jensen or other distances as well. The distance computations can also be materialized ahead of time if we know the vocabulary or features beforehand, for, say, keyword extraction. That is an optimization for known samples. In general, we have a distance algorithm that we plug into each clustering.
Similarly, we can even vary the clustering algorithm among the choices we listed earlier; several are mentioned in the earlier posts. We can visualize different clustering algorithms, such as K-means and agglomerative, over the same distances. In the implementation, letting the clustering algorithm vary independently from the distance computation makes us more flexible and lets us compare the algorithms.
When we take the sample data for three categories, we will need to extract the vocabulary to use as features for the documents. This is possible with the KLD as well.  The documents can be selected to be representatives of their categories.
Initially, to test the clusterer, we could select the sample data to be a small set and pre-label it. Then we can look at the precision and recall. However, the sample sets we choose for unsupervised runs must be at least five to ten times the number of features.
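Here is a rough sketch of the procedure described above, under the assumption that each document is represented as a numpy vector of term probabilities; the KLD is smoothed to avoid zero probabilities, and the centroid update follows the "member closest to the cluster average" reading. All names and parameters are illustrative.

import numpy as np

def kld(p, q, eps=1e-9):
    # Kullback-Leibler divergence D(p || q) with smoothing to avoid log(0).
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def kmeans_kld(docs, k, iters=20, seed=0):
    # docs: numpy array of shape (n_docs, n_terms) of term probabilities.
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest centroid by KLD
        labels = np.array([np.argmin([kld(d, c) for c in centroids]) for d in docs])
        # update step: average the members, then pick the member nearest the average
        new_centroids = []
        for j in range(k):
            members = docs[labels == j]
            if len(members) == 0:
                new_centroids.append(centroids[j])
                continue
            avg = members.mean(axis=0)
            new_centroids.append(members[np.argmin([kld(m, avg) for m in members])])
        centroids = np.array(new_centroids)
    return labels, centroids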

Saturday, November 23, 2013

The main advantage of the structural approach is that, in addition to classification, it provides a description of the pattern. It specifies how the given pattern can be constructed from primitives. This paradigm has been used in situations where the patterns have a definite structure which can be captured in terms of a set of rules. Here there is a correlation between the structure of the pattern and the syntax of a language. The patterns belonging to a category are viewed as sentences belonging to a language. A large set of complex patterns can be described by using a small number of primitives and grammatical rules. The grammar for each pattern class must be inferred from training samples.
Syntactic patterns have a great deal of intuitive appeal. However, implementation of this approach leads to many difficulties, such as segmenting noisy patterns to detect the primitives, and inferring grammars.

We talked about distributions. Let's look into this a little more closely. We mentioned that the Gaussian distribution leads to optimal rules. This is because the mathematical properties of this distribution are well understood, i.e. they are justified by the central limit theorem. Second, many applications find it easier to summarize data in the form of a mean vector and a covariance matrix. If the density function is unimodal, then this is simple and easy to apply. If X is a vector with d features, the density function f(x1, x2, ..., xd) is described in terms of the d x d covariance matrix, its determinant, mu - the d-vector of expected values - and x, a d-vector containing the variables x1, x2, ..., xd. The mean vector mu and the covariance matrix define a Gaussian distribution. The parameters of a multivariate Gaussian density are estimated by sample means and sample covariances to fit a Gaussian distribution to a set of patterns. The mean is found with an expectation, i.e. mu-i = E(Xi). The population covariance between Xi and Xj is sigma-ij = E[(Xi - mu-i)(Xj - mu-j)], and the diagonal elements sigma-ii are the variances. The covariance can also be written as sigma-ij = rho-ij * sigma-i * sigma-j, where rho-ij is the correlation coefficient between Xi and Xj. If rho-ij turns out to be zero, then Xi and Xj are uncorrelated, and for Gaussian features this also means they are statistically independent. Features are pictured as random variables and each pattern is a realization of these random variables.
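For reference, the multivariate Gaussian density that these parameters define can be computed directly from mu and the covariance matrix; a small numpy sketch:

import numpy as np

def gaussian_density(x, mu, Sigma):
    # f(x) = (2*pi)^(-d/2) * |Sigma|^(-1/2) * exp(-0.5 * (x-mu)^T Sigma^{-1} (x-mu))
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)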
 
In today's post we will continue our discussion of the algorithms and the issues encountered. We list some of them here.
One-shot versus hierarchical tree classifier: in the former, the distinctions between all the classes are made in one stage, which is especially helpful when the number of features is large. In the latter, the classification structure is a binary tree where the most obvious discriminations are done first and the subtle ones later. When the number of features per node is smaller than the total number of features, this is much faster.
Parametric versus non-parametric classification: the distinction rests on whether the forms of the class-conditional densities are known, since that determines which decision rules can be written. If the class-conditional densities are Gaussian, parametric techniques can be applied and the decision rules can be optimal or plug-in rules. However, non-parametric rules that are not based on assumed pattern class distributions can also be very helpful.
Dimensionality and sample size relationship: when the number of features for a vector is very large, the classifier designer has to choose a smaller feature set. This is often referred to as the curse of dimensionality and results in some loss of fidelity. It is mitigated by choosing the number of training samples per class to be at least five to ten times the number of features.
Feature Selection: Even among the set of possible features, the ones that are selected depend upon computational ease and cost considerations. Moreover, determining the optimal subset of features of size m from the d available features requires an exhaustive search over all possible subsets. Some intelligent approaches to alleviate this have been proposed and I've mentioned a few in my previous posts.
Error Estimation: In order for us to know whether the samples are correctly classified, we measure the classification error. We do this by splitting the available samples into training sets and test sets. The classifier is designed using the training set and then evaluated on the samples from the test set. Precision and recall are two metrics with which the classifier performance is evaluated. Common ways of splitting are referred to as the hold-out method and the leave-one-out method. The split between the training sets and test sets has been investigated.
So far we have discussed the geometrical or statistical pattern recognition algorithms.
The second paradigm of pattern recognition algorithms is based on the structural or syntactic approach. When the number of features required to establish a decision boundary is very large, it's helpful to view such patterns as being composed of simple sub-patterns. A sub-pattern could itself be built from simpler parts with grammatical techniques.

Friday, November 22, 2013

The book I mentioned in my previous post also has a section on pattern recognition. Pattern recognition refers to the classification or description of objects or patterns. The patterns themselves can range from characters in an image of printed text to biological waveforms. The recognition involves identifying the patterns and assigning labels for categories. We start with a set of training patterns. The main difference between pattern recognition and cluster analysis is the role of pattern class labels: in pattern recognition, we use the labels to formulate decision rules, while in cluster analysis we use them only to verify the results. Pattern recognition requires extrinsic information; cluster analysis uses only the data.
There are two basic paradigms to classify a pattern into one of K different classes. The first is a geometric or statistical approach.  In this approach, a pattern is represented in terms of d features and the pattern features are as independent from one another as possible. Then given training patterns for each pattern class, the objective is to separate the patterns belonging to different classes.
In statistical pattern recognition, the features are assumed to have a probability density function that is conditioned on the pattern class. A pattern vector x belonging to a class wj is a data point drawn from the conditional probability distribution P(x|wj), where j is one of the K different classes. Concepts from statistical decision theory and discriminant analysis are utilized to establish decision boundaries between the pattern classes. If the class-conditional densities are known, then Bayes decision theory gives the optimal decision rule. Since they are generally not known, a classifier is designed based on the nature of the information available; this requires supervised or unsupervised learning. In supervised learning, if the forms of the class-conditional densities are known, we use parametric decision rules, and otherwise non-parametric ones. In unsupervised learning, the density functions are estimated from the training samples themselves. Here the label on each training pattern represents the category to which the pattern belongs. The categories may be known beforehand, or they may be unknown.
When the number of pattern classes is unknown, the system tries to find natural groupings in the data. In the tree representing these dichotomies in statistical pattern recognition, the problems get more difficult as we traverse from top to bottom and from left to right in the outline given below.
Prior Information
  Complete -> Bayes Decision Theory
  Incomplete
    Supervised Learning
      Parametric -> Optimal rules, Plug-in rules
      Non-parametric -> Density estimation, Geometric rules
    Unsupervised Learning
      Categories known -> Mixture resolving
      Categories unknown -> Cluster analysis

Thursday, November 21, 2013

I came across a book, Algorithms for Clustering Data by Jain and Dubes. I want to start posting from that book, but first I want to jump to a section towards the end of the book which talks about graph theory. As we may know, a graph has a finite set of nodes, and they can represent the objects being clustered. A set of edges records the interactions between pairs of vertices. The vertices and edges are related by a function f that maps edges into unordered pairs of vertices; we discuss finite undirected graphs when we say unordered pairs of vertices. Therefore a graph is a tuple (V, E, f).
Some applications require that labels or weights be assigned to the vertices or to the edges.
As an example, the edges could be labeled by the distance between the patterns representing the objects. There are some common definitions as well. A subgraph is a subset of the nodes and edges of the original graph; it contains no vertices or edges that are not in the original graph. A path in a graph is an alternating sequence of vertices and edges that contains no repeated edges and no repeated vertices, and for which edge ei is incident to vertices vi and vi+1 for i = 1 to n.
 A graph is connected if a path exists between any two vertices in the graph.
A component is a maximal piece of a connected graph, where maximal means containing as many nodes as possible; a component is therefore not a proper subgraph of any other connected subgraph. A graph is complete if an edge is assigned to every possible pair of nodes. A complete graph on n nodes contains exactly n(n-1)/2 edges.
 There are some interesting observations as well. A maximal complete subgraph of G is a complete subgraph of G that is not a proper subgraph of any other complete subgraph of G. In other words, H is a complete subgraph of G if an edge is incident to every pair of vertices in H but no additional vertices of G can be added to H without destroying the completeness.  A maximal complete subgraph is also called a clique.
A cycle is the same as a path except that vertices v1 and vn are the same vertex.  A tree is a connected graph with no cycles. If a subgraph has m vertices, a tree containing these vertices has m-1 edges. A spanning tree is a tree containing all vertices in the graph. The weight of a tree is the sum of the weights of the edges when the weights represent dissimilarity or ratio scale. A minimal spanning tree is the one with the least weight  among all other spanning trees.
A number of algorithms are available for finding the minimum spanning tree. MSTs of complete graph are important in cluster analysis.
Building a minimum spanning tree is easy to try out by hand: the edges are ranked by their weights and inserted in order of increasing dissimilarity, the shortest first, skipping any edge that would create a cycle, until a tree is formed. (This edge-ordering construction is Kruskal's algorithm; Prim's algorithm instead grows the tree from a single vertex, always adding the cheapest edge that connects a new vertex to the tree.)
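A short sketch of this edge-ordering construction, with a union-find structure used to reject cycle-forming edges; the representation of the edges is just one convenient choice.

def minimum_spanning_tree(n, edges):
    # edges: list of (weight, u, v) tuples over vertices 0..n-1; returns MST edges.
    parent = list(range(n))
    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):     # rank edges by weight (dissimilarity)
        ru, rv = find(u), find(v)
        if ru != rv:                  # skip edges that would create a cycle
            parent[ru] = rv
            mst.append((u, v, w))
        if len(mst) == n - 1:
            break
    return mst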
A threshold graph is one which has n labeled nodes and N unlabeled edges. A random graph is one in which the edges are inserted at random. A rank graph is a threshold graph in which the edges are ranked (labeled). The N edges can be chosen from the n(n-1)/2 possible edges without replacement or ordering, and multiplying the number of such choices by N! orderings gives the number of rank graphs.
In the unsupervised category, MCMM was compared to K-means clustering on the same DS2 dataset. Three of the four clusters align very well with the labels given by Reuters, achieving very high precision scores, especially considering that label information was unavailable during training. The algorithm stops after it has added an additional cluster center that does not increase the gain function; that is why the first three clusters are more clearly delineated than the last one. The fourth cluster found by MCMM is a general subtopic within the dataset, almost like a miscellaneous group.
When compared with the earlier supervised run, where we found clusters that did not match the Reuters labels, a closer inspection showed that the clusters were well formed and that some documents were originally mislabeled.
Thus MCMM appears to be a practical machine learning method for text categorization. The ability to run essentially the same algorithm in both supervised and unsupervised mode helps evaluate existing labels. And there is flexibility to include the  documents in more than one category along with a ranked list of the degree of the inclusion in the cluster.

Wednesday, November 20, 2013

In the experimental results for MCMM, we find cluster coherency. The datasets used involved the Reuters document collection. Two small subsets of documents were used: the first consisted of 983 documents with 9 labels and 372 dimensions, and the second consisted of 240 articles with 3 labels and 360 dimensions, or words. This dataset was chosen so that MCMM could first be run on pre-labeled data before running in an unsupervised manner. There was a 70-30% split in each data set for training and test purposes. For comparison, a naive Bayesian classifier was also run. With the MCMM, the cluster activities produce a ranking for category assignment instead of a binary decision, which allows us to examine the tradeoff between precision and recall: if we increase the threshold for activity, there is more precision but less recall. It was found that precision and recall were good for all but three labels, which seemed more the exception than the norm; the documents from the first set in the Reuters collection under those three labels had multiple labels assigned to them. However, it was also found that the naive Bayes classifier performed better, and that performance was compared against a much larger Reuters collection.
MCMM was expected to have better performance in the unsupervised mode.
In the previous post, I talked about MCMM, Zipf's law and assumptions made.
In Zipf's law, the frequency of a word is roughly inversely proportional to its rank. In this model, we find a threshold for the number of times a word must occur before we mark the word as present in the binary document vector. Z-curves are drawn over each document's word frequencies to find this threshold for that document. A longer document will have a larger threshold, which eliminates spurious word occurrences from the document vector. In support of the binary representation of terms in the document vector, other studies have shown that there is very little difference in results between binary and weighted terms, and that categorization can be achieved without term weights.
Another assumption made is independence between the components of the input vector (i.e. words). There are no hidden units in the input.
To test the viability of MCMM, which has a high computational cost, it was tried out with different factors. For example, the model was made to work with reduced dimensions, and Zipf's law was used to eliminate words with very few or very many occurrences in the documents of the corpus.

Tuesday, November 19, 2013

We will continue the discussion on the Multiple Cause Mixture Model (MCMM). We looked at the global objective function as the aggregation of all the local objective functions pertaining to individual documents. We saw that the maximization of the objective function is performed in two steps :
1) First the cluster centroids c are fixed and then the mi,k value is found for a local maximum.
2) Second, the cluster activation values mi,k are fixed  and then the cj,k are found for a local maximum.
These steps are repeated until the objective function cannot be maximized further.
When the centers stabilize, one of the cluster centers is split into two  and the steps are repeated.
This increase in the number of cluster centers continues until the addition of a cluster center does not increase the objective function.
Note that the first step to find the activity vectors mi can be offloaded as in the case of supervised learning where a teacher provides these activity vectors and the algorithm focuses on finding the cluster centers.
We can represent this solution in a diagram with the cluster centers in a separate layer above the layer representing words. Activity m in the cluster-layer topic nodes flows top-down to cause activity in the word nodes r, which represent predictions for how likely the words are to appear in the document. So measurements from the observed document dj flow up and the predictions flow down during the iterations.
The assumptions made by the model are as follows:
First, all input data is binary; for text categorization, the document vectors record only the presence or absence of words.
Second, terms are selected based on Zipf's law, which states that the frequency of a word is inversely proportional to its rank.
In the multiple cause mixture model, we mentioned there can be more than one mapping function between activity m and weights c. One example was the soft disjunction function, under which the likelihood of any given word appearing in a document only increases with the presence of activity in multiple topic nodes capturing the document's topical content. Here the direction of flow is from activity in the clusters to predictions at the word nodes; the inverse flow can also be described. Any prediction vector r can be combined with a binary vector d, representing the words present in the document dj, through the objective function Sum-j[ log(dj.rj + (1-dj)(1-rj)) ]. By finding higher values for this function, we have a way to find the vector of cluster activities m that optimizes it. Now, given a corpus of documents indexed by i, the global objective is the aggregation of the individual objective functions. When the global objective is maximized, we arrive at a set of weights c reflecting clusters of words that co-occur in documents.
Training begins by initializing a single cluster centroid at a random point. For this initial cluster centroid, and at later stages when there are more centroids, the maximization occurs in two steps. First the cluster centroids are fixed and a local maximum over the cluster activities is found for all data points. In the second step, the cluster activation values are fixed at the values found in the previous step and gradient ascent is performed over the cluster centers. Both steps are repeated until the function cannot be maximized further.
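A small numpy sketch of the quantities involved: the soft-disjunction prediction r from the cluster activities m and the weights c, and the per-document objective being maximized (the epsilon is only there to keep the log finite and is not part of the model):

import numpy as np

def predictions(m, C):
    # soft disjunction: r_j = 1 - prod_k (1 - m_k * c_jk)
    # m: activities of the topic clusters; C: (num_words, num_clusters) weight matrix
    return 1.0 - np.prod(1.0 - m[np.newaxis, :] * C, axis=1)

def log_likelihood(d, r, eps=1e-9):
    # objective for one document: sum_j log(d_j * r_j + (1 - d_j) * (1 - r_j))
    return float(np.sum(np.log(d * r + (1 - d) * (1 - r) + eps)))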

Monday, November 18, 2013


I want to discuss how the P versus Q divergence varies with the sample sizes of P and Q, even though the event space is the same for both. The sample size for P is negligible compared to Q, so beyond a certain size Q yields roughly the same KLD for any larger collection. This means that we just have to choose a collection that has representative occurrences of the terms we are interested in.
Today I want to discuss a slightly different take on text categorization. This was presented in the paper "Applying the multiple cause mixture model to text categorization" by Sahami, Hearst and Saund. Here they talk about a novel approach using unsupervised learning that does not require a pre-labeled training data set; in other words, it discovers the categories from the samples themselves. I'm looking to see if we can consider such an approach for keyword extraction, where we don't have to come up with a predetermined vocabulary set. Supervised algorithms require a training set of pre-labeled documents, and there have been studies on intelligently reducing the size of this training set. However, this approach automatically induces the multiple category structure underlying an unlabeled document corpus. In addition, many algorithms assign one label per document, or else treat the classification task as a sequence of binary decisions, while this approach treats documents as members of multiple categories. We will see why this is important in just a bit.
Documents are represented by term vectors that have binary values; that is, the number of occurrences of a word does not matter after a certain point. A topic is associated with a substantial number of words, i.e. a cluster of words will be indicative of a topic. If there are several topics in the document, the overall topic will be the union of those clusters of words. If a word has two or more meanings and appears in different clusters, it will likely appear across topics rather than pointing to any individual topic. Taken as layers between clusters and nodes, any activity in the cluster layer will cause activity in the word nodes such that the latter reflects how likely the word is to appear in the document. The idea is to tell the clusters apart until we have probabilities for these words.
MCMM tries to discover these hidden clusters by looking for patterns in high-dimensional data. It differs from other models by permitting clusters not only to compete for data points but also to co-operate for accounting of observed data.
There can be many different functions that map the activities m in the cluster layer, together with the weights c, to the predictions r for the documents. One example is the soft disjunction, which is represented by the equation rj = 1 - PI(1 - mk.cjk), where the product PI runs over the clusters k, and rj and cjk are both between 0 and 1.
Wartena mentioned in this paper that keyword selection was done by filtering out general terms, which requires that a keyword be distinguishable from the background distribution. The technique is to use the Kullback-Leibler divergence to see whether the term is different enough, using a cutoff KLD(ptk, q) > 1. In the Kullback divergence measure, we normalized the distances against an empty document whose probability distribution consists entirely of epsilon values, and we measured P(tk|q) as tf(tk, q) / Sum-over-all-terms-x-in-q tf(tx, q). When we take the KLD distance for a term in a document containing just that term, weighted against the KLD for the same term in the distribution Q, we find that some terms have higher ratios. We use this with a threshold to select the terms that are not too frequent and not overly general. The threshold lets us reduce the potentially large number of terms in Q to a few that have higher discriminating power. Choosing terms this way lets us come up with representative keywords for any sample document.
From all the words in a sample document, each term can be weighted this way. This gives us a keyword selection technique. The number of terms to be extracted can be varied by adjusting the threshold on a sliding scale.
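One way to read this selection rule in code is to keep the terms whose pointwise contribution to the divergence from the background distribution Q exceeds the cutoff. A minimal sketch, assuming p and q are dictionaries of term probabilities with the epsilon back-off already applied so that no probability is zero:

from math import log

def keyword_candidates(p, q, cutoff=1.0):
    # keep the terms whose contribution p(t) * log(p(t)/q(t)) to KLD(p, q)
    # exceeds the cutoff; p is the document distribution, q the background
    selected = []
    for term, pt in p.items():
        if pt * log(pt / q[term]) > cutoff:
            selected.append(term)
    return selected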
 By increasing the size of the distribution Q or choosing a category that is similar to the sample document, we could further improve the selection technique.
Thereafter, we could use the document categories and sample document clustering for simultaneous feature extraction and document categorization.

Sunday, November 17, 2013

Today I wrote a document instead of my post. But I did follow up on my suggestions to use KLD starting from an empty document and adding terms to differentiate from the background based on a threshold.

Saturday, November 16, 2013

In the previous post, we discussed using KLD for narrow-domain short texts. Here we will cover the feature selection techniques used in that discussion. The first technique is Document Frequency (DF), which assigns to each term t the value dft, the number of texts in the collection in which t occurs. The second technique is term strength, where weights are assigned to each term t based on similar texts. The third is the transition point, which is found from the vocabulary frequencies of each text.
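A small sketch of just the first technique, document frequency, assuming each text is given as an iterable of its terms:

def document_frequency(texts):
    # df_t = number of texts in the collection in which term t occurs
    df = {}
    for text in texts:
        for term in set(text):
            df[term] = df.get(term, 0) + 1
    return df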
Apart from these, there was a mention of extracting keywords based on KLD itself, i.e. we find the terms that contribute most to the distance between the two probability distributions P and Q. We had proposed two approaches. The first was to select a few terms at a time and choose the ones that were consistently above a threshold. The second was to select one term at a time and add it to an initially empty document so that the KLD distance stays above a threshold.
The clustering techniques used by the authors include the agglomerative clustering proposed by Johnson.
This is described by Borgatti as follows:
Let there be N items to be clustered and an N * N distance matrix to begin with:
Assign each item to its own cluster so that we have N clusters, each containing just one item. The distances between the clusters equal the distances between the items they contain.
The closest (most similar) pair of clusters is found and merged into a single cluster, so that we now have one cluster less.
The distances between the new cluster and each of the old clusters are computed.
The above two steps of merging and computing distances are repeated until we are left with one cluster.
Computing the distance between the new cluster and the old clusters can be done in one of the following three ways:
single link - here the distance between one cluster and another is the shortest distance from any member of one cluster to any member of the other cluster. This method is also called the connectedness or minimum method.
complete link - here the distance between one cluster and another is the longest distance from any member of one cluster to any member of the other cluster. This method is also called the diameter or maximum method.
average link - here the distance between one cluster and another is the average distance from any member of one cluster to any member of the other cluster.
Notice that in the N * N matrix the distances along the diagonal, between an element and itself, are zero. And as the elements are combined into clusters, the cluster names are the combination of the names of the elements they contain.
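A naive sketch of this procedure, where dist gives the distance between two items; passing min as the linkage gives single link and max gives complete link (average link would average the pairwise distances instead):

def agglomerative(items, dist, linkage=min):
    # start with N singleton clusters; a cluster's name is the tuple of its items
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((merged, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges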
KLD and clustering
In a paper on clustering narrow-domain short texts with KLD by Pinto, Benedi & Rosso, there is a good example of such clustering. Although KLD is used to calculate a distance between two probability distributions, here the authors use it to compare how similar or dissimilar two documents are so that they can be clustered. In this study, they use narrow-domain texts to show that the method works even in such a case. The challenge with narrow-domain texts is that term frequencies are very low and term selection has to be done carefully. If we take the vocabulary across all the texts, only a very small percentage of those terms might appear in a short text, and even then their frequency may be only one or two. Moreover, a small change in frequency affects the clustering results widely. Therefore this is a good test for the clustering. A real-life example of such a problem space is the categorization of the abstracts of scientific papers by repositories. Although authors provide keywords for the same, the keywords are not sufficient to classify the text and some are often misleading, resulting in poor classification. Moreover, the number of categories is not known beforehand.
In their study, the authors have used hierarchical clustering methods and they use the symmetric KLD distance for divergence. There are various ways of expressing KLD.
We have seen the asymmetric version as D(P/Q) = Sum(P(x) log(P(x)/Q(x))).
Then we have seen the symmetric one as D(P/Q) = Sum((P(x) - Q(x)) log(P(x)/Q(x))).
Then there is D(P/Q) = 1/2 [ D(P / (P+Q)/2) + D(Q / (P+Q)/2) ].
And finally D(P/Q) = max [ D(P/Q), D(Q/P) ].
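A small sketch of these variants, assuming P and Q are dictionaries over the same event space with the back-off already applied so that no probability is zero:

from math import log

def kld(p, q):
    # asymmetric: sum of P(x) log(P(x)/Q(x))
    return sum(p[x] * log(p[x] / q[x]) for x in p)

def symmetric_kld(p, q):
    # symmetric: sum of (P(x) - Q(x)) log(P(x)/Q(x))
    return sum((p[x] - q[x]) * log(p[x] / q[x]) for x in p)

def mean_kld(p, q):
    # average of the divergences to the midpoint distribution M = (P + Q)/2
    m = {x: (p[x] + q[x]) / 2 for x in p}
    return 0.5 * (kld(p, m) + kld(q, m))

def max_kld(p, q):
    # maximum of the two directed divergences
    return max(kld(p, q), kld(q, p))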
Since the texts vary in their terms, the frequency of many terms is zero. To avoid zero probabilities, a back-off scheme is used, similar to what we discussed earlier. The document probability of a term is given as a constant times the probability of term tk in document dj if tk occurs in dj, or a constant epsilon otherwise; i.e. all the words from the vocabulary that don't appear in the document are treated the same as unknown words and given a default value. The probability of a term tk in dj is calculated as P(tk/dj) = tf(tk, dj) / Sum-over-x-in-dj tf(tx, dj). A document j is represented by a term vector of probabilities dj, and the distance between two documents is the KLD measure. Finally, all the term vectors are normalized so that the document vectors are of unit length; therefore the scaling factor for the terms occurring in the document is 1 - sum(all epsilons for terms not occurring in the document). Note that the convention has been to use a predetermined vocabulary; we don't invert the problem by building the vocabulary from the documents as we process them. In the case of this study, most of the corpora have several thousands of terms, the vocabulary consists of a few thousand terms, and the document texts consist of about a hundred terms.
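A sketch of this back-off and normalization step, assuming a predetermined vocabulary and a document given as a list of terms (the epsilon value and the function name are illustrative):

from collections import Counter

def document_distribution(doc_terms, vocabulary, epsilon=1e-6):
    # terms of the vocabulary absent from the document get the constant epsilon;
    # the occurring terms are scaled by beta = 1 - (number of absent terms) * epsilon
    # so that the whole vector sums to one
    # (assumes the document shares at least one term with the vocabulary)
    tf = Counter(t for t in doc_terms if t in vocabulary)
    total = sum(tf.values())
    absent = [t for t in vocabulary if t not in tf]
    beta = 1.0 - len(absent) * epsilon
    return {t: beta * tf[t] / total if t in tf else epsilon for t in vocabulary}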
 

Friday, November 15, 2013

Today let us just focus on an implementation topic for text categorizing software. How do we design a class for a clusterer? Let us take a specific example; we can generalize it later. We want a K-means clusterer that classifies text documents into categories/clusters. Text documents are represented by term vectors. The distance between documents and categories is measured by a distance function or metric. Documents can vary in number and can be quite large. The number of clusters is fixed. Clusters have centers. The distance to the nearest center for a document can be found. The clusterer has a method called Classify. It reads the input documents, computes the distances, assigns the clusters, adjusts the cluster centers and repeats until the cluster centers stabilize. The mechanism of the classify method is known. The class needs to be aware of the following:
cluster centers
distance function
document set
With the classify method, there will be an output where each document has been assigned a category label or cluster.
How do we test the classifier? We give it different sets of inputs and check the labels. What can we vary for this classifier? Surely we can vary the sample space: we can include a small or a large number of documents. We can check for the proper assignment of each document. We can vary the number of clusters. We can check the output for different distance functions. We may have to manually classify the samples before we run the clusterer. But how do we check the correctness of the program? We want an invariant, an initial condition and a final state. How do we check progress, especially when the classify method takes a long time? We could add some supportability.
Next we could consider some design factors for flexibility.
We could decouple the distance-calculating function from the clusterer. We could use design patterns to swap different distance functions into the class, for example a strategy pattern.
We could also parameterize the input and the output so that we don't need to know what the vectors are that we cluster. We could create a base class for the clusterer and derive others from it.
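A sketch of such a clusterer, with the distance function injected as a strategy; the class and method names are hypothetical and the documents are assumed to be numeric term vectors of equal length:

import random

class KMeansClusterer:
    # the distance function is passed in, so cosine, Euclidean, KLD, etc.
    # can be swapped without touching the class (strategy pattern)
    def __init__(self, k, distance, max_iterations=100):
        self.k = k
        self.distance = distance          # callable(vector, vector) -> float
        self.max_iterations = max_iterations
        self.centers = []

    def classify(self, documents):
        # returns a cluster label for each document
        self.centers = random.sample(documents, self.k)
        labels = None
        for _ in range(self.max_iterations):
            # assignment step: nearest center for each document
            new_labels = [min(range(self.k), key=lambda c: self.distance(d, self.centers[c]))
                          for d in documents]
            if new_labels == labels:
                break                     # assignments have stabilized
            labels = new_labels
            # update step: recompute each center as the mean of its members
            for c in range(self.k):
                members = [doc for doc, label in zip(documents, labels) if label == c]
                if members:
                    self.centers[c] = [sum(col) / len(members) for col in zip(*members)]
        return labels

A caller would then write something like KMeansClusterer(k=3, distance=euclidean).classify(vectors), where euclidean is any distance callable.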
Python code, in sketch form, for finding KLD (document is assumed to hold the raw text of the sample document):

import nltk
from nltk.corpus import brown
from math import log

tokens = nltk.word_tokenize(document)
words = sorted(set(tokens))

porter = nltk.PorterStemmer()
words = [porter.stem(t) for t in words]

wnl = nltk.WordNetLemmatizer()
lemma = [wnl.lemmatize(t) for t in words]
vocabulary = lemma[1:200]

# categories from the Brown corpus
news = brown.words(categories='news')
editorial = brown.words(categories='editorial')
reviews = brown.words(categories='reviews')
hobbies = brown.words(categories='hobbies')

fdist_news = nltk.FreqDist(news)
fdist_editorial = nltk.FreqDist(editorial)
fdist_reviews = nltk.FreqDist(reviews)
fdist_hobbies = nltk.FreqDist(hobbies)
fdist_document = nltk.FreqDist(tokens)

epsilon = 1e-6  # back-off value for terms that do not occur in a distribution

def P(term, fdist):
    # probability of a term under a frequency distribution, with epsilon back-off
    return fdist[term] / fdist.N() if fdist[term] > 0 else epsilon

def Ptk_d(term): return P(term, fdist_document)
def Ptk_editorial(term): return P(term, fdist_editorial)
def Ptk_reviews(term): return P(term, fdist_reviews)
def Ptk_hobbies(term): return P(term, fdist_hobbies)

# symmetric KLD between the editorial category and the document over the vocabulary
KLD_editorial = sum((Ptk_editorial(term) - Ptk_d(term)) * log(Ptk_editorial(term) / Ptk_d(term))
                    for term in vocabulary)

[in progress]

 

Thursday, November 14, 2013

In the previous post, we discussed the algorithm to differentiate probability distributions. With this we have seen how to differentiate a sample document against document categories.
Here we will walk through how to implement it. First we take a document collection that is already categorized. We then select a vocabulary V from these documents using lemmatization and stop-word removal. With the terms in the vocabulary we compute the probability distribution by counting the term frequencies in a document and taking the ratio with respect to the total frequency of all terms. We do the same for the categories so that we have vectors for both. Computing the vectors and normalizing them prepares us for computing the distance. This is measured using the four cases we discussed, normalized by the distance corresponding to the empty document vector. After computing the KLD distance, we assign the document to the category with the minimum distance.
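A sketch of the final assignment step, assuming the document and category distributions have already been computed with the back-off scheme, and that divergence is one of the KLD variants discussed earlier (the normalization against the empty-document distance is folded into the divergence callable here):

def categorize(doc_dist, category_dists, divergence):
    # pick the category whose distribution is closest to the document's
    best_category, best_distance = None, float('inf')
    for name, cat_dist in category_dists.items():
        d = divergence(cat_dist, doc_dist)
        if d < best_distance:
            best_category, best_distance = name, d
    return best_category, best_distance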
Notice that there are no iterations involved, since we are only finding distances. With the distances, we get a measure of how similar or dissimilar the vectors are.
We have said so far that the document can be assigned to the category with which it has the minimum distance. This distance is computed based on the difference between the document's distribution and the category's distribution. Notice that the sample for the distribution can be the category or the document, and we evaluate it on a term-by-term basis.
We will discuss a variation where, given a fairly large vocabulary, we select the keywords that are most distinguishing from the background of general terms. Perhaps we can choose a subset of the vocabulary and find the terms that repeatedly contribute most to the distance from the background distribution, or we could start with an empty document and add the terms that keep the distance above a threshold.
If we choose the latter approach, we may find it easier to build the subset of keywords we are interested in. At the same time, we may need to experiment with different threshold values. This is not a bad option, given that the number of words in the subset keeps shrinking as the threshold grows. The advantage of this approach is that we keep the background distribution the same; consequently the distance is directly related to the subset of keywords.
This way we have a higher likelihood of reducing the number of selections significantly.
In the former approach, we might have to try out different subsets, and again, since the distribution we compare against remains the same, the words that repeatedly contribute the most could be chosen as candidates. The sample can be of fixed size, with only those terms swapped out that don't contribute significantly to the distance, and we stop when we have a selection of the desired size.
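A sketch of the second, incremental approach with hypothetical names: make_distribution builds a smoothed term distribution from a list of terms (for example along the lines sketched earlier) and divergence compares it against the fixed background distribution:

def grow_keyword_set(candidate_terms, background, divergence, make_distribution, threshold):
    # add one candidate at a time, keeping it only if the selection still
    # diverges from the background by more than the threshold
    selected = []
    for term in candidate_terms:
        trial = selected + [term]
        if divergence(make_distribution(trial), background) > threshold:
            selected.append(term)
    return selected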

Wednesday, November 13, 2013

In the paper titled "Kullback-Leibler distance for text categorization", Brigitte Bigi introduces a tf-idf classifier and a KLD classifier for text categorization. Documents are represented by their term vectors, where each term is a content term and each document carries a weight for each of the selected content terms. Each term is assumed to occur at least once in some document. In the tf-idf case, each weight is calculated as the combination of the term frequency and the inverse document frequency. If the idf of a term is low, the term appears in many documents and is consequently not a good discriminating content word. The weight is zero when the term is not assigned to document dj; for the assigned terms, it combines the count of the term in that document with the idf, calculated as the log of the total number of training documents divided by the document frequency of the term. The classifier for the tf-idf scheme is based on cosine similarity, where we sum the component-wise products over the product of the vector magnitudes.
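A minimal sketch of the tf-idf weighting and the cosine measure described here, assuming the document frequencies, the total number of documents and the vocabulary are already available:

from math import log, sqrt

def tfidf_vector(term_counts, doc_freq, num_docs, vocabulary):
    # weight = tf * log(N / df); zero when the term is absent from the document
    return [term_counts.get(t, 0) * log(num_docs / doc_freq[t]) for t in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0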
The KLD classifier works similarly but measures how different two probability distributions are. The KL divergence of probability distributions P, Q on a finite set X is the expected log-ratio of P to Q under P, expressed as the sum over x of P(x) log(P(x)/Q(x)). The author mentions that a better distance measure is the symmetric form, the sum over x of (P(x) - Q(x)) log(P(x)/Q(x)). With this distance measure, we can compare the document against each category and pick the closest one. Note that the document vectors may have many zeros for the terms that don't appear in the document. For these, there is a back-off scheme mentioned by the author. In general, C is the number of categories and V is the number of terms from all the categories. For each category, a distribution based only on the selected terms is computed from the corpus. The term distribution of a sample document is then compared with the distributions of these categories. In the back-off, term frequencies are discounted and all terms that don't occur in the document are given a default value and treated as if they were unknown words. This is done so that the distances don't come out infinite.
A document is represented by a term vector of probabilities dj while a category is represented by a term vector of probabilities ci. The probability of a term tk in a document dj, P(tk/dj), is the ratio of the frequency of tk in dj to the total frequency of all terms in dj when the term appears in the document, and a small default (back-off) value otherwise. Similarly, the probability of a term tk in a category ci is the ratio of the term's frequency in that category to the total frequency of all terms in the category. The probability distributions for the documents and the categories are normalized to unit length, i.e. the default values and the computed ratios add up to 1. The categorization method based on the KL distance computes the distance as the sum, over terms, of the difference between the probability of the term in the category and in the document, multiplied by the log of their ratio.
Notice that this has four cases: tk appears in both document dj and category ci; tk appears in document dj but not in category ci; tk appears in category ci but not in document dj; and tk appears in neither, in which case the term degenerates to an unknown word. Finally, we normalize the distance using the distance computed for the document dj and the distance computed for an empty document.
So now, to extract keywords, we add to the empty document the terms that maximize this distance. It is also clear that we need the categories and documents for this keyword extraction; we do not want to rely on the test document or its summary alone.

Tuesday, November 12, 2013

While FuzzySKWIC works with a selected finite vocabulary for the documents in consideration, Wartena's paper focuses on treating documents as bags of words and attempting to find the topic by clustering those words. Here too we have a term collection T, each occurrence of which can be found in exactly one source document in a collection C. Wartena's approach includes lemmatization, which reduces all inflected forms of verbs, nouns and adjectives to their lexical form, substantially reducing the variation in the input data. Tagging is also used to distinguish content words from function words. This pre-processing let them select around 160 of the most frequent keywords for their term collection.
In this approach, they consider the probability distributions that measure the probability of selecting an occurrence of a term t, the probability of selecting a document from the collection, and the probability of selecting a term-document pair from C x T. They have two conditional distributions: one for the probability that a randomly selected occurrence of term t has source d, also called the source distribution of t, and one for the probability that a randomly selected term occurrence from document d is an instance of term t, also called the term distribution of d. These help them define the distribution of co-occurring terms, with which they perform clustering to determine the topic.
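A small sketch of those two conditional distributions, assuming the corpus is a dict mapping a document id to its list of term occurrences:

from collections import Counter

def term_distribution(doc_terms):
    # probability that a randomly picked occurrence in document d is term t
    counts = Counter(doc_terms)
    n = sum(counts.values())
    return {t: c / n for t, c in counts.items()}

def source_distribution(term, corpus):
    # probability that a randomly picked occurrence of the term has source d
    # (assumes the term occurs somewhere in the corpus)
    occurrences = {d: terms.count(term) for d, terms in corpus.items()}
    total = sum(occurrences.values())
    return {d: c / total for d, c in occurrences.items() if c > 0}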
The concern here is that  the vocabulary set may not be sufficient to find the topic in a test document or to extract keywords. In order to evaluate available words, a different scheme may be needed. Word similarities and clustering could be used.
In order to pick keywords out of the document, we must identify centers for the clusters within the document itself. Do we have a limit on the number of words we can cluster? No, not as long as the terms are considered by their weights or vectors. Typically the documents, not the terms, have vectors, because the terms only have weights. The document vectors suffer from high dimensionality due to the number of terms in the vocabulary set; the high dimensions are a problem for the documents but not for the keywords themselves. The weights associated with the keywords can be based on tf-idf or the Jensen-Shannon divergence. We find the most frequent content words. We can differentiate the keywords from the background based on the Kullback-Leibler divergence and a cutoff. However, documents and keywords are not treated independently. Often topics have to be categorized and keywords extracted together, because independently they don't give complete information. Take the example of keyword extraction alone: if we were looking merely for content words and worked only with the test document, reduced to a bag of words, we would be selecting words out of this document without any knowledge of the distribution of the words in other documents. That may or may not be satisfactory, since the document corpus gives valuable information about keywords that generally appear together in a category. Perhaps, then, the documents from the corpus could be summarized in a term-attributes lookup table that we pre-populate and summarize based on the studied corpus. Then, given the words we encounter in the document, we find the matches and, using the corresponding attributes, extract the keywords. This is where WordNet was a start, but we are talking about the categories the words appear in as attributes, among others. In addition, we are considering applying the Kullback-Leibler divergence with a cut-off to extract keywords. The suggestion is that we use existing knowledge as well as the differentiation within the given document, focusing on keyword extraction.

Monday, November 11, 2013

// PreOrder: A B C D E F
// InOrder:  C B D A E F
// Build the binary tree from its preorder and inorder traversals.

using System;

public class Node
{
    public char data;
    public Node left;
    public Node right;
}

public static Node GetTree(char[] PreOrder, char[] InOrder)
{
    int preIndex = 0;
    return GetTree(PreOrder, InOrder, ref preIndex, 0, InOrder.Length - 1);
}

// inStart..inEnd delimit the slice of InOrder that belongs to the current subtree;
// preIndex walks through PreOrder, consuming the next root on each call.
private static Node GetTree(char[] PreOrder, char[] InOrder, ref int preIndex, int inStart, int inEnd)
{
    if (inStart > inEnd || preIndex >= PreOrder.Length)
        return null;

    Node root = new Node();
    root.data = PreOrder[preIndex++];

    // Everything left of the root in InOrder is its left subtree,
    // everything to the right is its right subtree.
    int inIndex = Array.IndexOf(InOrder, root.data, inStart, inEnd - inStart + 1);

    root.left = GetTree(PreOrder, InOrder, ref preIndex, inStart, inIndex - 1);
    root.right = GetTree(PreOrder, InOrder, ref preIndex, inIndex + 1, inEnd);

    return root;
}