Wednesday, October 30, 2013

Today we will look at the distance measures used with Fuzzy SKWIC, specifically the weighted aggregated sum of the cosine-based distances along the individual dimensions. With an implementation in mind, we will also review which intermediate results are worth saving from our computations so that later iterations can reuse them.
We said the aggregated distance is built from the cosine-based distance along each individual dimension k, Dwc_ijk = 1/n - x_jk * c_ik, where i ranges from 1 to C, j from 1 to N and k from 1 to n, and n is the total number of terms in the collection of N documents. Their weighted aggregated sum, D'wc_ij = Sum over k of v_ik * Dwc_ijk, is what we maintain in a C x N matrix. Once the individual cosine-based distances have been computed, we compute their weighted aggregated sum and save this D'wc_ij. The feature weights v_ik are initialized so that we can compute this weighted aggregated sum in the first iteration. Note that the Dwc_ijk are also used when updating the feature weights. They could change when the cluster centers change, but for the duration of an iteration (our iterations are what update the cluster centers) we can reuse the same Dwc_ijk. Looking at the equation for updating the feature weights, v_ik = 1/n + (1 / (2 * tau_i)) * Sum over j from 1 to N of (u_ij^m) * [D'wc_ij / n - Dwc_ijk], we note that it uses both the weighted aggregated sum from the previous iteration and the Dwc_ijk we computed. With a suitable value for tau, and after initializing the fuzzy memberships u_ij and the fuzzifier m, we are ready to compute the feature weights. The tau_i changes in each iteration; it serves to keep the terms of the objective function we discussed at the same order of magnitude, so we initialize the values in the tau update equation with what we have from the first iteration. Since tau affects the feature weights, we will also need to adjust any negative feature relevance values. So far, all of the variables except the cosine-based distance along the individual dimensions are subject to change in each iteration.
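To make the bookkeeping concrete, here is a minimal numpy sketch, assuming arrays x (N x n document term frequencies), c (C x n cluster centers), v (C x n feature weights), u (C x N fuzzy memberships), tau (length C) and a fuzzifier m; the names and shapes are illustrative rather than taken from the paper.

import numpy as np

def per_dimension_distances(x, c):
    # Dwc[i, j, k] = 1/n - x[j, k] * c[i, k]; computed once per iteration and cached,
    # since it only changes when the cluster centers change
    n = x.shape[1]
    return 1.0 / n - c[:, None, :] * x[None, :, :]          # shape (C, N, n)

def weighted_aggregated_sum(Dwc, v):
    # D'wc[i, j] = sum_k v[i, k] * Dwc[i, j, k]; the C x N matrix we save
    return np.einsum('ik,ijk->ij', v, Dwc)

def update_feature_weights(Dwc, Dwc_agg, u, m, tau):
    # v[i, k] = 1/n + (1 / (2 * tau[i])) * sum_j u[i, j]^m * (D'wc[i, j] / n - Dwc[i, j, k])
    n = Dwc.shape[2]
    um = u ** m                                              # shape (C, N)
    bias = Dwc_agg[:, :, None] / n - Dwc                     # shape (C, N, n)
    return 1.0 / n + np.einsum('ij,ijk->ik', um, bias) / (2.0 * tau[:, None])

Caching the full (C, N, n) tensor of per-dimension distances trades memory for speed; for a very large vocabulary they could instead be recomputed one cluster at a time.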

Tuesday, October 29, 2013

In the previous post, we mentioned a few caveats, and perhaps we can elaborate on some of them. For example, a document may not contain all of the terms in the corpus, terms may be missing from both the document and the corpus, and some document terms may need to be discounted; computing the KLD for such terms can result in an infinite value, so we treat them the same as unknown terms and cap their contribution with an epsilon value. Further, we mentioned that the KLD may not be used for the clustering itself, as opposed to its use in pre-processing for clustering. This is because the KLD has the following shortcomings: it doesn't work for all of the data, and the choice of which terms to include makes a difference to the end result. Instead, for clustering we want something that is independent of the order of selections, even if items can belong to different clusters to varying degrees. Furthermore, the distance metric used for clustering has several components, often including the aggregated cosine distances.
On the other hand, we can better tune our clustering when we try it out on a smaller dataset. So while the KLD fits in very well as a way to reduce the data set, the clustering itself works well when applied to that smaller dataset. The clustering technique should work regardless of the size. As the data set improves, the clustering can be refined to the point where we can distinguish the different groups in it.
One way to test the clustering would be to use a smaller sample. If it works for a smaller sample, it should work for more data, because the clustering relies on a metric that is independent of the number of data points. In the degenerate case of just one word, the cluster center is the word itself.
When we talk of smaller data sets for clustering, we will find it easier to verify the results against manual classification and categorization. This helps with measuring precision and recall. Note that, from our previous examples, precision is the ratio of the true positives to all predicted positives, i.e. the sum of true positives and false positives. Recall, on the other hand, is the ratio of the true positives to the sum of true positives and false negatives. In other words, recall is how much of the relevant items were retrieved, while precision is how many of the retrieved items were relevant.
With smaller samples, we can easily measure the precision and recall and fine tune our clustering.
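Since these are just ratios of counts, a tiny sketch makes the two measures concrete (the counts in the usage line are made up for illustration):

def precision_recall(true_positives, false_positives, false_negatives):
    # precision: how many of the retrieved items were relevant
    precision = true_positives / (true_positives + false_positives)
    # recall: how much of the relevant items were retrieved
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# e.g. 8 documents correctly placed, 2 wrongly included, 4 relevant ones missed
print(precision_recall(8, 2, 4))    # (0.8, 0.666...)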
One suggestion is that regardless of the sample we work with, we try text from different categories and not just from the corpus. Web data can be used for the trials since this has a lot more variety.
The discussion on KLD above was read from the paper on the same by Brigitte Bigi, and the discussion on the clustering was read from the paper on topic detection by Christian Wartena and Roger Brussee. The paper by Wartena and Brussee describes the clustering and the pre-processing we refer to.
We discussed processing text documents, and I wanted to mention the GATE framework in this regard. It stands for General Architecture for Text Engineering. It is open source software, widely adopted by organizations, corporations and universities, with a community of developers, educators and scientists. It allows different text processing tasks to be completed by CREOLE plugins. The tokenizer splits text into very simple tokens. The grammar rules are more adaptable and can be used for interpreting the text. A gazetteer is used for identifying proper nouns and named entities in the text. Annotations are used with the gazetteer lists to record the type and sub-types, and these can then be looked up via grammar rules. For example, to find the 'day' of the week in a text, the gazetteer list can be used to look up Thursday.
Part-of-speech tagging is also handled by this framework, although we can rely on TreeTagger for the same. The tagging then lets us further filter the content words by part of speech.
We haven't yet reviewed the application of the Kullback-Leibler mechanism to differentiate the keywords from the background text. The suggestion is that the background text could be considered a different document or bag of words. We want to differentiate the keyword candidates from the background using the individual probability distributions of the keywords and applying a Kullback-Leibler cutoff. This measure is also called relative entropy. The more dissimilar the two probability distributions are, the higher the divergence and the more useful it is for keyword selection. A term is selected only when it contributes significantly to the divergence.
Each word is assigned a probability, and each document corresponds to a vector of term probabilities.

The goal is to require the keyword distribution P to be different from the words with little discerning power, that is, from the general background distribution Q. We can then set a cutoff on a term's contribution to the divergence of P from Q; typically a value greater than 1 is a good threshold.
Something along the lines of the small Python sketch below:

import re, math, collections

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}   # illustrative stop list

def tokenize_and_filter(sent):
    # tokenize, lowercase and remove stop words; lemmatization would also go here
    return [t for t in (w.lower() for w in re.findall(r"\w+", sent)) if t not in STOP_WORDS]

def probability_distribution(document):
    # assign probabilities to terms from their relative frequencies
    counts = collections.Counter(document)
    total = float(sum(counts.values()))
    return {term: c / total for term, c in counts.items()}

def kl_differentiation(pd1, pd2, epsilon=1e-6, cutoff=1.0):
    # keep the words of distribution 1 whose contribution to the divergence from
    # distribution 2 exceeds the cutoff; unseen terms are capped at epsilon
    return [term for term, p in pd1.items()
            if p * math.log(p / pd2.get(term, epsilon)) > cutoff]
A caveat with the Kullback-Leibler divergence is that it doesn't work well when a term's probability is zero in either of the documents, or when words that could otherwise have been left out are included in both documents.

That is why we plan to use it for our pre-processing and not with our clustering.

Monday, October 28, 2013

We talked about Kullback-Leibler divergence in the previous post. We will look into it now.
It is useful for its ability to differentiate two probability distributions P and Q, where P represents the true distribution and Q represents the computed statistical distribution.
This divergence measure is computed using expected values, in terms of information entropy.
The divergence is given as the cross entropy of P and Q minus the entropy of P. It can also be represented as -Sum(p(x)log(q(x))) + Sum(p(x)log(p(x))), i.e. Sum(p(x)log(p(x)/q(x))). We will defer the application of the Kullback-Leibler divergence for later. Let us discuss the next steps in the processing, such as lemmatization and tagging.
Both lemmatization, that is finding the lemma, and tagging with parts of speech such as noun, verb, adjective, etc. can be done with TreeTagger. It can be downloaded and experimented with separately. Given a sentence of words, it can detect the parts of speech and the lemmas that go with them, producing two lists. Internally, it uses a decision tree classifier to identify the parts of speech. The decision tree is built recursively from a training set of trigrams. At each recursion step, a test is added which divides the trigram set into two subsets with maximum distinctness regarding the probability distribution of the third tag. The test examines whether one of the two preceding tags is identical to a tag t from a known tagset. In the same recursion step, all possible tests are compared and the best one, i.e. the one yielding the most information, is attached to the current node. Then this node is expanded recursively on each of the two subsets of the training set. This results in a yes-no binary decision tree.
Stemmers are included with the TreeTagger to help with identifying the lemma. The stemmers, for example, have the ability to detect common modifying suffixes on words. Lemmatization generally reduces the number of unique terms drastically, and this helps reduce the arbitrarily high number of term dimensions we would otherwise have to work with.
Stop lists are another tool to filter out the frequently occurring words or words that have no discernible information content. This list can be maintained external to the processing and is quite useful for reducing the count of relevant terms.
Together both the above steps are prerequisites for clustering based on document frequency.
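As a rough sketch of what these pre-processing steps could look like in code, here is a version that uses NLTK as a stand-in for TreeTagger and the GATE pipeline (it assumes the punkt, stopwords, wordnet and POS tagger resources have been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # tokenize, tag parts of speech, keep content words, drop stop words, lemmatize
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)
    content = [(w, t) for w, t in tagged
               if w.isalpha() and w not in STOPS and t[0] in ('N', 'V', 'J')]
    return [lemmatizer.lemmatize(w) for w, t in content]

print(preprocess("The clusters were merged into larger topics."))

Keeping only nouns, verbs and adjectives and then lemmatizing is what drives down the number of unique terms before document frequencies are computed.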

We alluded to several stages of text processing that the implementation needs before we can do clustering. We also compared agglomerative to K-means clustering. Recently, agglomerative processing has been made very fast and efficient, and since it represents a bottom-up merge strategy, it has some of the advantages we discussed. However, K-means is considered better for text purposes. For the rest of our implementation discussions, we will focus on K-means only.
We select C documents based on random sampling. We initialize the fuzzy partition matrix of size C x N, where each value of the matrix is < 1, each row-wise sum is < N and each column-wise sum equals 1.
We populate the document frequency vectors xi, one per document, each with components for the n terms in the collection of N documents.
To calculate the cosine similarity between document frequency vectors xi and xj, we take their dot product divided by the product of their magnitudes.
To calculate the dissimilarity between the document frequency vector xj and the ith cluster center vector ci, we aggregate the weighted distances over each of the n terms. The per-term distance is the cosine-based distance along the individual dimension, 1/n - x_jk * c_ik, where k varies from 1 to n.
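Putting the setup steps above together, here is a minimal sketch with hypothetical names, assuming a term frequency matrix x of shape N x n whose rows (and the cluster centers) are already unit normalized for the cosine-based distance:

import numpy as np

def initialize(x, C, rng=np.random.default_rng(0)):
    n_docs, n_terms = x.shape
    # select C documents at random as the initial cluster centers
    centers = x[rng.choice(n_docs, size=C, replace=False)].astype(float)
    # fuzzy partition matrix C x N: values < 1 and each column sums to 1
    u = rng.random((C, n_docs))
    u /= u.sum(axis=0, keepdims=True)
    return centers, u

def dissimilarity(xj, ci, v_i):
    # weighted aggregate of the cosine-based distances along each of the n dimensions:
    # sum_k v_ik * (1/n - x_jk * c_ik)
    n = xj.shape[0]
    return np.sum(v_i * (1.0 / n - xj * ci))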
Then the rest of the processing is per the algorithm we discussed earlier.
Note that N is typically much smaller than n, and that C, the number of clusters in the C x N matrix, is typically much, much smaller than n.
This means that all of our aggregations over the n terms have to be efficient. This can be achieved by reducing the number of passes over the n terms and possibly with some pre-computation.
Moreover, the aggregations over C and n may need to be repeated for several components, such as the error to the clusters and the feature weights. Similarly, the cosine-based distances could be computed once and stored for lookup later so that we don't have to perform the same operations again. This could speed up processing.
The fuzzy labels, i.e. the memberships, are found by aggregating over each of the clusters. They are in turn aggregated over the N documents, with a proper choice of m, the degree to which the fuzzy membership is raised, to compute the feature weights. They are also aggregated to update the cluster centers.
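As an illustration of those aggregations, here is what the membership and cluster center updates could look like in the standard fuzzy c-means form, using the saved C x N matrix of weighted aggregated distances; the exact update expressions in the Fuzzy SKWIC paper may differ in detail, so treat this as a sketch:

import numpy as np

def update_memberships(Dwc_agg, m, eps=1e-12):
    # u[i, j] = 1 / sum_l (D[i, j] / D[l, j]) ** (1 / (m - 1))
    d = np.maximum(Dwc_agg, eps)         # assumes positive distances; the paper treats
                                         # negative aggregated distances separately
    ratio = (d[:, None, :] / d[None, :, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)       # shape (C, N)

def update_centers(x, u, m):
    # c[i] = sum_j u[i, j]^m * x[j] / sum_j u[i, j]^m
    um = u ** m                          # shape (C, N)
    return (um @ x) / um.sum(axis=1, keepdims=True)    # shape (C, n)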

Sunday, October 27, 2013


In the previous two posts, we looked at K-means clustering and improving its performance for large data sets. K-means is very useful for forming a known number K of clusters. Besides, we can choose what we want to cluster. For example, we can bisect the data points into two clusters, and then bisect one of the clusters further if it does not satisfy, say, a threshold we set. The threshold could be based on distance or density. If we recursively bisect one of the topic clusters because it is larger than we expected, we will end up with a binary tree of clusters. In topic analysis, these clusters correlate to topics and subtopics without being restricted to the syntax or layout of the text. On the other hand, text segmentation relies on the layout and generates segmentation blocks based on a stochastic topic model. Here the topics are rudimentarily spotted from the text and formed into word clusters, and then word clusters sharing common words are merged together. This gives us independent topics. Next, we follow the text to see how similar each preceding block is to the succeeding block, and wherever there are valleys, or low similarity values, we can segment the text. Contrast this with the bisecting method, where the granularity of the segmentation can be left to user discretion, allowing as much or as little segmentation as desired. Thus clustering can be applied in both bisecting and agglomerative ways.
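As a rough sketch of the bisecting approach, here is a recursive split using scikit-learn's KMeans, where a cluster larger than a chosen threshold is bisected further (the helper name and the size-based threshold are illustrative; a distance or density criterion could be used instead):

import numpy as np
from sklearn.cluster import KMeans

def bisect(X, indices, max_size, tree=None):
    # recursively split a cluster in two until every leaf is small enough
    if tree is None:
        tree = []
    if len(indices) <= max_size:
        tree.append(indices)                             # leaf cluster
        return tree
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:                # degenerate split; stop here
        tree.append(indices)
        return tree
    bisect(X, left, max_size, tree)
    bisect(X, right, max_size, tree)
    return tree

# usage: leaves = bisect(X, np.arange(len(X)), max_size=50)

The recursion trace is exactly the binary tree of clusters mentioned above, with the leaves as the final topics.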
For the purposes of keyword extraction, we rely on term weighting. Here the weights are determined based on probabilistic latent semantic analysis or some other technique. These include distances based on cosine, Jensen-Shannon or the feature weights we discussed in the previous posts. We could also consider Kullback-Leibler distance based background suppression to extract keywords.
Meanwhile, We have yet to look at several other implementation considerations such as the following:
Lemmatization - Stemming can help remove the various forms in which words appear, be it in noun form, verb form or adjective form.
Text tagging - could possibly be done with TreeTagger
MultiWord Lookup - We may have to look at collocations and words that appear in titles.
Named-entity recognition and filtering - Proper noun detection and stop word filtering will help with improving the data.
All of the above are necessary before the topic / keyword extraction or assigning term weights.
We talked about summarizing data points when clustering very large data sets. This approach is common to many algorithms that improve on basic clustering, and in particular today we will discuss a simple summarizing technique called BFR (Bradley-Fayyad-Reina) that builds on the K-means clustering we discussed earlier.
Data points are read into memory a chunk at a time, and here too the initial k cluster centroids are chosen by some simple approach. For example, we can take a sample, pick a random point, and then pick k-1 more points, each as far as possible from those already chosen.
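A small sketch of that far-apart initialization on a sample (the function name is made up):

import numpy as np

def pick_initial_centroids(sample, k, rng=np.random.default_rng(0)):
    # start from a random point, then repeatedly add the point farthest
    # from all centroids chosen so far
    centroids = [sample[rng.integers(len(sample))]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(sample - c, axis=1) for c in centroids], axis=0)
        centroids.append(sample[int(np.argmax(dists))])
    return np.array(centroids)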
We maintain three sets of data points and these are called discard set, compression set, and retained set. The discard set are points close enough to a centroid to be summarized. The compression set is a group of  points that are close together but not close to any centroid.  They are summarized but not assigned to a cluster. The retained sets are isolated points.
The discard set is summarized by the number of points N, a vector SUM whose ith component is the sum of the coordinates of the points in the ith dimension, and a vector SUMSQ whose ith component is the sum of the squares of the coordinates in the ith dimension.
These 2d + 1 values represent any number of points, where d is the number of dimensions.
The centroid in the ith dimension is calculated as SUMi / N.
And the variance is calculated as (SUMSQi / N) - (SUMi / N)^2.
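A small sketch of this summarization with hypothetical names; note that merging two summaries (as for the compression set) is just component-wise addition of N, SUM and SUMSQ:

import numpy as np

class ClusterSummary:
    # 2d + 1 values (N, SUM, SUMSQ) that stand in for any number of d-dimensional points
    def __init__(self, d):
        self.n = 0
        self.sum = np.zeros(d)
        self.sumsq = np.zeros(d)

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n                                 # SUMi / N

    def variance(self):
        return self.sumsq / self.n - (self.sum / self.n) ** 2    # SUMSQi/N - (SUMi/N)^2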
This summarization data is used with the processing of the data points from the very large data set.
The data points are processed in the following manner:
Data points that are sufficiently close to a cluster centroid are added to that cluster and its discard set.
Cluster the remaining points together with the old retained set. Clusters that are formed go to the compression set and outliers go to the retained set. Since these clusters are not assigned to a centroid yet, we do not add them to the discard set.
The summary data for these clusters is then updated with the inclusion of the new points. Some compression set clusters can even be merged. If this is the last round, merge all the compression set clusters and the retained set points into their nearest clusters.