Friday, November 1, 2013

Following the discussion of the simulation results from the authors of Fuzzy SKWIC, here are their observations on the newsgroup simulation. Since the results showed that the inconsistently classified documents came from four miscellaneous classes, the run was repeated without those documents. Specifically, these were documents that lie in areas of overlap or fuzziness between distinct categories, documents that are outliers, and documents that affect the purity of the resulting partition. In a way, they studied where the clustering did not perform well. After discarding these documents, the clusters were more uniform and pure, and the keywords were richer in terms of their relevance weights and partitions.
This implies that without the noise, the clusters were homogeneous, compact and pure, and the cluster dependent keywords were more relevant and representative.
In order to measure the performance of the clustering in any scenario, they came up with a metric called the average entropy measure of all C clusters. This average is the sum, over the clusters, of each cluster's entropy weighted by the ratio of the documents in that cluster to the overall number of documents. The entropy of the ith cluster is in turn the class-wise sum of the fraction of that cluster's documents belonging to each class times the log of that fraction (with a minus sign), scaled by a normalizing constant, the inverse of the log of K.
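To make the metric concrete, here is a minimal sketch of how such an average entropy could be computed, assuming K is the number of document classes and that each cluster's documents carry their known class labels; the function name and structure are mine, not the authors'.

import math

def average_entropy(clusters, num_classes):
    # clusters: one list per cluster, holding the known class label of each document in it
    total_docs = float(sum(len(members) for members in clusters))
    average = 0.0
    for members in clusters:
        if not members:
            continue
        # entropy of this cluster: class-wise sum of -p*log(p), normalized by 1/log(K)
        entropy = 0.0
        for label in set(members):
            p = members.count(label) / float(len(members))
            entropy -= p * math.log(p)
        entropy /= math.log(num_classes)
        # weight the cluster's entropy by its share of the overall documents
        average += (len(members) / total_docs) * entropy
    return average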
Using this metric, they conclude that real unlabeled text data is the most challenging. Moreover, any manually labeled benchmark document set contains plenty of labeling errors that add to the noise, and documents that straddle more than one topic end up with an inadequate label. So there is no clean baseline against which to compare the unsupervised clustering, but the results have shown that automatic labeling is superior to manual labeling.
The authors also suggest that further improvement to this kind of clustering lies in using some form of probabilistic latent semantic indexing, which would allow contexts to be differentiated further. Fuzzy SKWIC does simultaneous partitioning in two different hyperspaces: the document space, to capture spatial document organization, and the keyword space, to capture context. Context can be inferred because it comes not from a single keyword but from co-occurring relevant keywords. By allowing fuzziness, that is, membership in different clusters to varying degrees, the cluster dependent keywords become richer and better suited to classify the documents in their clusters. The documents in a cluster are also more uniform, compact and homogeneous.

Thursday, October 31, 2013

Some observations from the implementation by the authors of Fuzzy SKWIC. We talk about their experiments with general SKWIC as well as the fuzzy improvement.
They first tried simulation with hard clustering. Their collection consisted of 200 documents spread equally across four categories. They filtered stop words and used stemming. The IDF of the terms, taken in descending order, was used to pick out 200 keywords. Each document was represented by a vector of its frequencies over these keywords, normalized to unit length. Four clusters were used and SKWIC converged in five iterations. Six cluster dependent keywords were chosen and verified, and they correctly represented the cluster category. Only one partition showed most of the error; documents that straddled two or more topics were found in this partition.
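A rough sketch of that preprocessing, assuming tokenized documents as input; the helper name, the IDF formula and the frequency representation are my reading of the description above rather than the authors' code.

import math
from collections import Counter

def build_vectors(tokenized_docs, num_keywords=200):
    # document frequency of each term across the collection
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    n_docs = float(len(tokenized_docs))
    # pick the keywords with the highest IDF, in descending order
    idf = {term: math.log(n_docs / df[term]) for term in df}
    keywords = sorted(idf, key=idf.get, reverse=True)[:num_keywords]
    index = {term: k for k, term in enumerate(keywords)}
    # represent each document by its frequencies over these keywords, normalized to unit length
    vectors = []
    for doc in tokenized_docs:
        vec = [0.0] * len(keywords)
        for term in doc:
            if term in index:
                vec[index[term]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return keywords, vectors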
One observation I would like to highlight is that, regardless of the keyword extraction method, their choice for the number of keywords was on the order of a few hundred. The ratio of keywords to documents is incidental here, but the choice to keep the number of keywords small is deliberate. We should follow a similar practice.
Now on to the run with fuzzy labels: the results were again with four clusters, this time with m = 1.1. The cluster center updates converged after 27 iterations. The results indicated a significant improvement over the former run, particularly with respect to the partition that had the erroneous documents; in this run, those documents were correctly classified. A closer look at those documents suggested that they could have had soft labels from the start, i.e. they should not have been assigned a single label. The results also indicate that the labels generated by Fuzzy SKWIC were richer and more descriptive of their categories, indicating a good application.
A third simulation was done with much more varied data: the 20 newsgroups collection with 20,000 messages. Here the messages were stripped of their structural data in addition to being processed with stemming and stop word filtering. The number of clusters was 40. The feature weights varied quite a bit from cluster to cluster, and a lot of document migrations between clusters were seen. Ten relevant keywords were chosen for each cluster and, again, they reflected their clusters well. Some indicated that the discussions were about an event, while others indicated that they pertained to an issue found interesting by the participants.
We were discussing the implementation of Fuzzy SKWIC, where we were going through the steps within each iteration, repeated until the cluster centers stabilize. So far we had looked at finding the cosine distances, updating and adjusting the relevance weights, and computing and adjusting the weighted aggregated cosine distance. We discuss the updates to the partition matrix, the cluster centers and tau next. The partition matrix is initialized as a C x N matrix of weights, each with a value ranging from 0 to 1, with the additional constraints that the column-wise sum over the clusters must equal 1 for each document and the row-wise sum over the documents must be less than N for each cluster. To update this matrix in each iteration, we use the computed weighted aggregated cosine distances: we take the ratio of the weighted aggregated Dwcij to Dwckj and aggregate this ratio over all the clusters. It may seem that we have not computed the denominator so far, since it has not been mentioned, but it follows the same computation as the numerator, only against a cluster k ranging from 1 to C instead of the specific ith cluster. Both i and k are cluster indices, and k runs from 1 to C for the same i in this computation, so Dwcij and Dwckj are both known before this step of updating the matrix. Once we have the ratios, we raise each to the power 1/(m-1), where m is the fuzzifier degree for the partition matrix values, sum them over the clusters, and invert the sum to get the new partition matrix value uij.
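A minimal sketch of that update, assuming the weighted aggregated distances are already available in a C x N structure D and that none of them are zero; it follows the description above rather than the paper's notation.

def update_partition_matrix(D, m):
    # D[i][j] holds the weighted aggregated cosine distance of document j to cluster i
    C, N = len(D), len(D[0])
    exponent = 1.0 / (m - 1.0)
    U = [[0.0] * N for _ in range(C)]
    for j in range(N):
        for i in range(C):
            # sum the ratios Dwcij / Dwckj over all clusters k, each raised to 1/(m-1),
            # then invert the sum to get the new membership uij
            total = sum((D[i][j] / D[k][j]) ** exponent for k in range(C))
            U[i][j] = 1.0 / total
    return U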
Next we look at updating the cluster centers. Here, if the feature weight vik = 0, then the kth keyword has no bearing on cluster i, so we do not update that component of the cluster center. If the feature weight is greater than zero, we update the component as the fuzzy-membership-weighted mean of the documents.
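A sketch of that center update under the same assumptions, passing in the previous centers so that components whose feature weight is zero are left untouched; the argument layout is mine.

def update_cluster_centers(X, U, V, m, old_centers):
    # X: N x n unit-length document vectors, U: C x N partition matrix, V: C x n feature weights
    C, N, n = len(U), len(X), len(X[0])
    centers = [row[:] for row in old_centers]
    for i in range(C):
        denom = sum(U[i][j] ** m for j in range(N)) or 1.0
        for k in range(n):
            if V[i][k] == 0.0:
                continue  # keyword k has no bearing on cluster i, so leave this component as is
            # fuzzy-membership-weighted mean of the documents along dimension k
            centers[i][k] = sum((U[i][j] ** m) * X[j][k] for j in range(N)) / denom
    return centers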
Finally, we update tau. Tau is the last to be updated in the iteration, and the only things that have changed since the last computation are the partition matrix and the newly computed Dwcij.
Thus we see that the data structures we need are just the partition matrix and a handful of per-iteration variables. The steps in each iteration are as discussed.
The authors of Fuzzy SKWIC mention a very interesting phenomenon called noise magnets. One cluster starts accumulating outliers and noise, while the others stay compact, with their constituents at similar distances from the cluster center. This happens because the outlier documents are neither similar to each other nor similar to any of the clusters, and they get assigned the maximum distance of 1, so they have no effect on the objective function. They join the cluster with the seed nearest to them and bring down that cluster's feature weights through extreme averaging. Consequently, that cluster ends up drawing in all the outliers.

Wednesday, October 30, 2013

In the previous post, we talked about the steps to compute the terms for the cosine distances and the feature/relevance weights in each iteration, which we could do with the parameters available during each iteration. We started talking about how to update and adjust the relevance weights, so let's wrap that up and get to computing the weighted aggregated Dwcij and adjusting it if it's negative. Picking up the thread from the relevance weights, we saw that each weight depends on a default value and a bias. We computed the bias after initializing the fuzzy membership values. Other than tau, everything else was known to compute the relevance weights. For tau at iteration t, we use something like the following equation: tau-t = K-constant * Sum-over-documents(fuzzy-membership^m * weighted-aggregated-cosine) / Sum-over-keywords(relevance-weight^2). Here we see that we have everything we need to compute it for iteration t, given that K is a constant, so the relevance weights can be calculated; we just need initial values for the first iteration and the previous iteration's values in each subsequent one.
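A sketch of that tau update under the reading above, with D as the C x N weighted aggregated distances, U the partition matrix and V the relevance weights; K is a constant the authors tune, so the default value here is only a placeholder.

def update_tau(D, U, V, m, K=0.1):
    # one tau per cluster: K * sum_j(uij^m * Dwcij) / sum_k(vik^2)
    C, N = len(D), len(D[0])
    taus = []
    for i in range(C):
        numerator = sum((U[i][j] ** m) * D[i][j] for j in range(N))
        denominator = sum(v * v for v in V[i]) or 1.0
        taus.append(K * numerator / denominator)
    return taus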
Now let's look at the weighted aggregated Dwcij and the per-dimension Dwcij. The weighted aggregated Dwcij is computed as the summation over the n keyword dimensions of vik times the cosine based distance along that dimension.
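A one-function sketch of that aggregation, assuming the per-dimension distances are stored as Dk[i][j][k]; the adjustment for negative values discussed next is left out here.

def aggregate_distances(Dk, V):
    # Dk[i][j][k]: cosine distance along dimension k; V[i][k]: feature weight of keyword k in cluster i
    C, N, n = len(Dk), len(Dk[0]), len(V[0])
    D_agg = [[0.0] * N for _ in range(C)]
    for i in range(C):
        for j in range(N):
            # weighted aggregated Dwcij: sum over the n keyword dimensions of vik times the distance along k
            D_agg[i][j] = sum(V[i][k] * Dk[i][j][k] for k in range(n))
    return D_agg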
It's the negative values we need to adjust. Here, for each of the clusters, we compute the absolute value of the weighted aggregated Dwcij and take the minimum; this is the adjustment value we add to the weighted aggregated Dwcij wherever it comes out negative.
Note that the previous iteration's values of Dwcij are not carried over to the next iteration; they are recomputed, because they are based on the cluster centers, which we update in each iteration. The adjustment of a negative Dwcij follows a pattern similar to the adjustment of negative relevance weights that we have seen earlier.
Today we will look at the distance measures used with Fuzzy SKWIC, specifically the cosine based distances along the individual keyword dimensions and their weighted aggregated sum. For a program to implement this, we will review what we need to save from our computations for later use so we can be more efficient.
The cosine based distance along the kth dimension, we said, is computed as Dwcij = 1/n - xjk.cik, where i ranges from 1 to C, j ranges from 1 to N and k ranges from 1 to n, with n the total number of terms in a collection of N documents. We maintain a C x N matrix for the weighted aggregated sums: when we have computed the individual cosine based distances, we compute their weighted aggregated sum, and this D'wcij needs to be saved. The feature weights are initialized, so we can compute this weighted aggregated sum in the first iteration. But note that the per-dimension Dwcij is also used when updating the feature weights. Dwcij changes when the cluster centers change, but within one iteration (our iterations are what update the cluster centers) we can reuse the same Dwcij. Looking at the equation for updating the feature weights, vik = 1/n + (1/(2.tau))(SumN(uij^m)[D'wcij/n - Dwcij]), we note that it uses both the weighted aggregated sum from the earlier step and the Dwcij we computed. With a suitable value for tau, and after initializing the fuzzy memberships and m, we are ready to compute the feature weights. Tau changes in each iteration; it serves to keep the terms of the objective function we discussed at the same order of magnitude, so we initialize the values in the equation for updating tau with what we have in the first iteration. Since tau affects the feature weights, we will also need to adjust any negative feature relevance values. Within one iteration, all the variables except the cosine based distances along the individual dimensions are subject to change.
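A sketch of that feature weight update with the quantities named as above (Dk for the per-dimension distances, D_agg for the saved weighted aggregated sum D'wcij, tau as a per-cluster list); the negative-weight adjustment is not included, and the structure is my own.

def update_feature_weights(Dk, D_agg, U, tau, m):
    # vik = 1/n + (1/(2*tau_i)) * sum_j uij^m * (D'wcij/n - Dwcij along dimension k)
    C, N, n = len(Dk), len(Dk[0]), len(Dk[0][0])
    V = [[0.0] * n for _ in range(C)]
    for i in range(C):
        for k in range(n):
            bias = sum((U[i][j] ** m) * (D_agg[i][j] / n - Dk[i][j][k]) for j in range(N))
            V[i][k] = 1.0 / n + bias / (2.0 * tau[i])
    return V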

Tuesday, October 29, 2013

In the previous post, we mentioned a few caveats, and perhaps we can elaborate on some of them. For example, a document may not contain all the terms that are in the corpus, terms may be missing from both, and some document terms may need to be discounted; computing the KLD for these can result in an infinite value, so we treat them the same as unknown terms and cap them with an epsilon value. Further, we mentioned that the KLD may not be suitable for the clustering itself, as opposed to its use in pre-processing for clustering. This is because the KLD has the following shortcomings: it doesn't work for all the data, and the choice of which terms to include makes a difference to the end result. Instead, for clustering we wanted something that is independent of the order of selections, even when documents can belong to different clusters to varying degrees. Furthermore, the distance metric used for clustering has several components, often including the aggregated cosine distances.
On the other hand, we can better tune our clustering when we try it out on a smaller dataset. So while KLD fits in very well for reducing the data set, the clustering also works well on the reduced, smaller dataset. The clustering technique should work regardless of the size. As the data set improves, the clustering can be refined to the point where it separates the different categories.
One way to test the clustering would be to use a smaller sample. If it works for a smaller sample, it should work for more data, since the clustering relies on a metric that is independent of the number of data points. In the degenerate case of just one word, the cluster center is the word itself.
When we talk of smaller data sets for clustering, we will find it easier to verify the results against manual classification and categorization. This helps with measuring precision and recall. Note that, from our previous examples, precision is the ratio of the true positives to the total positive predictions, i.e. the sum of true positives and false positives. Recall, on the other hand, is the ratio of the true positives to the sum of true positives and false negatives. In other words, recall is how many of the relevant items were retrieved, while precision is how many of the retrieved items were relevant.
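With a small labeled sample these two numbers are straightforward to compute; a tiny helper along these lines would do (the function name and set-based inputs are mine).

def precision_recall(retrieved, relevant):
    # retrieved, relevant: collections of document ids
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / float(len(retrieved)) if retrieved else 0.0
    recall = true_positives / float(len(relevant)) if relevant else 0.0
    return precision, recall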
With smaller samples, we can easily measure the precision and recall and fine tune our clustering.
One suggestion is that regardless of the sample we work with, we try text from different categories and not just from the corpus. Web data can be used for the trials since this has a lot more variety.
The discussion on KLD above was read from the paper on the same by Brigitte Bigi, and the discussion on the clustering was read from the paper on topic detection by Christian Wartena and Roger Brussee. The paper by Wartena and Brussee describes the clustering and the pre-processing we refer to.
We discussed processing text documents, and I wanted to mention the GATE framework in this regard. It stands for General Architecture for Text Engineering. It is open source software, widely adopted by organizations, corporations and universities, with a community of developers, educators and scientists. It allows different text processing tasks to be completed by CREOLE plugins. The tokenizer splits text into very simple tokens. The grammar rules are more adaptable and can be used for interpreting the text. A gazetteer is used for identifying proper nouns and named entities in the text. Annotations are used with the gazetteer lists to record the type and sub-types, which can then be looked up via grammar rules. For example, to find the 'day' of the week in a text, the gazetteer list can be used to look up Thursday.
Part of speech tagging is also handled by this framework, although we can rely on TreeTagger for the same. The tagging lets us further filter the content words by part of speech.
We haven't yet reviewed the application of the Kullback-Leibler mechanism to differentiate the keywords from the background text. The suggestion is that the background text could be considered a separate document or bag of words. We want to differentiate the keyword candidates from the background using the individual probability distributions of the keywords and applying a Kullback-Leibler cutoff; this measure is also called relative entropy. The more dissimilar the two probability distributions are, the higher the divergence, and the more useful it is for keyword selection. A term is selected only when it contributes significantly to the divergence.
Each word is assigned a probability, and each document is represented as a vector of term probabilities.

The goal is to require the keyword distribution P to be different from the words with little discerning power and from the general background distribution Q. We can then define a cutoff on the ratio P/Q, i.e. on a term's contribution to the divergence; typically a value greater than 1 is a good threshold.
Something along the lines of the following small Python sketch; the stop word list, epsilon and threshold defaults are placeholders of my own, not values from the papers.
import re, math, collections

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def tokenize_and_filter(text):
    # tokenize, lowercase and drop stop words (lemmatization omitted for brevity)
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def probability_distribution(tokens):
    # assign a relative-frequency probability to each term of the document
    counts = collections.Counter(tokens)
    total = float(sum(counts.values()))
    return {term: count / total for term, count in counts.items()}

def kl_differentiation(pd_doc, pd_background, epsilon=1e-9, threshold=1.0):
    # keep terms whose ratio p/q exceeds the threshold; terms missing from the background are capped at epsilon
    selected = {}
    for term, p in pd_doc.items():
        q = pd_background.get(term, epsilon)
        if p / q > threshold:
            selected[term] = p * math.log(p / q)
    return selected

A caveat with Kullback-Leibler divergence is that it doesn't work well when the probability of a term is zero in either of the documents, or when words that could otherwise have been left out are included in both documents.

That is why we plan to use it for our pre-processing and not with our clustering.