Monday, October 28, 2013

We talked about Kullback-Leibler divergence in the previous post. Let us look into it now.
It is useful for differentiating two probability distributions P and Q, where P represents the true distribution and Q represents the computed statistical distribution.
The divergence is computed using expected values, in terms of information entropy.
The divergence is given as the cross entropy of P and Q minus the entropy of P. It can also be written as D(P||Q) = -Sum(p(x)log(q(x))) + Sum(p(x)log(p(x))) = Sum(p(x)log(p(x)/q(x))). We will defer the application of the Kullback-Leibler divergence until later. Let us discuss the next steps in the processing, such as lemmatization and tagging.
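Before moving on, here is a minimal sketch of the divergence computation above, assuming P and Q are discrete distributions given as lists of probabilities over the same outcomes:

```python
# Sketch: Kullback-Leibler divergence D(P||Q) for discrete distributions.
import math

def kl_divergence(p, q):
    # D(P||Q) = sum over x of p(x) * log(p(x) / q(x))
    # Terms with p(x) == 0 contribute nothing; q(x) == 0 with p(x) > 0 makes it infinite.
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return float('inf')
        total += px * math.log(px / qx)
    return total

print(kl_divergence([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
```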
Both lemmatization (finding the lemma) and tagging with parts of speech such as noun, verb, adjective, etc. can be done with TreeTagger. It can be downloaded and experimented with separately. Given a sentence of words, it detects the parts of speech and the lemmas that go with them, returning two lists. Internally, it uses a decision tree classifier to identify the parts of speech. The decision tree is built recursively from a training set of trigrams. At each recursion step, a test is added which divides the trigram set into two subsets with maximum distinctness with regard to the probability distribution of the third tag. The test examines whether one of the two preceding tags is identical to a tag t from a known tagset. In the same recursion step, all possible tests are compared and the best one, yielding the most information, is attached to the current node. Then this node is expanded recursively on each of the two subsets of the training set. This results in a yes-no binary decision tree.
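TreeTagger itself ships as a standalone tool; as a rough stand-in for its two outputs (tags and lemmas), here is a minimal sketch using NLTK's tagger and WordNet lemmatizer, assuming the relevant NLTK data packages are installed:

```python
# Sketch: POS tags and lemmas for a sentence, using NLTK as a stand-in for TreeTagger.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

def penn_to_wordnet(tag):
    # Map Penn Treebank tags to the coarse categories the WordNet lemmatizer expects.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def tag_and_lemmatize(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)                  # list of (word, POS tag) pairs
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]
    return tagged, lemmas                          # two parallel lists: tags and lemmas

tags, lemmas = tag_and_lemmatize("The clusters were merged into larger topics.")
```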
Stemmers are included with TreeTagger to help with identifying the lemma. The stemmers, for example, have the ability to detect common modifying suffixes of words. Lemmatization generally reduces the number of unique terms drastically, and this helps reduce the arbitrarily high number of term dimensions we have to work with.
Stop lists are another tool, used to filter out frequently occurring words or words that carry no discernible information content. The list can be maintained externally to the processing and is quite useful for reducing the count of relevant terms.
Together both the above steps are prerequisites for clustering based on document frequency.
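As a rough sketch of how these two steps feed into document-frequency-based clustering (the stop list file name and the input format are assumptions for illustration):

```python
# Sketch: filter lemmas through a stop list and build per-document term vectors.
from collections import Counter

def load_stop_list(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def document_vectors(lemmatized_docs, stop_words):
    # lemmatized_docs: list of lists of lemmas, one list per document
    filtered = [[t.lower() for t in doc if t.lower() not in stop_words]
                for doc in lemmatized_docs]
    vocabulary = sorted({t for doc in filtered for t in doc})   # the n surviving terms
    vectors = []
    for doc in filtered:
        counts = Counter(doc)
        vectors.append([counts.get(term, 0) for term in vocabulary])
    return vocabulary, vectors   # N vectors, each of dimension n

stop_words = load_stop_list("stoplist.txt")   # hypothetical stop list file
```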

We alluded to several stages of text processing that the implementation performs before we can do clustering. We also compared agglomerative clustering to K-means clustering. Recently, agglomerative processing has been made very fast and efficient, and since it represents a bottom-up merge strategy, it has some advantages we discussed. However, K-means is considered better for text purposes. For the rest of our implementation discussion, we will focus on K-means only.
We select C documents by random sampling. We initialize the fuzzy partition matrix of size C x N, where each value of the matrix lies between 0 and 1, each row-wise sum lies between 0 and N, and each column-wise sum equals 1.
We populate the document frequency vectors xj, one for each of the N documents, with components for each of the n terms in the collection.
To calculate the cosine similarity between document frequency vectors xi and xj, we compute their dot product divided by the product of their magnitudes.
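A minimal sketch of that computation:

```python
# Sketch: cosine similarity between two document frequency vectors.
import math

def cosine_similarity(xi, xj):
    dot = sum(a * b for a, b in zip(xi, xj))
    norm_i = math.sqrt(sum(a * a for a in xi))
    norm_j = math.sqrt(sum(b * b for b in xj))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)
```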
To calculate the dissimilarity between document frequency vector xj and the ith cluster center vector, we aggregate the weighted distances over each of the n terms. The distance along an individual dimension k is the cosine-based distance 1/n - xjk.cik, where k varies from 1 to n.
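A sketch of those per-dimension distances and their feature-weighted aggregation (assuming, as the cosine-based form requires, that xj and ci are unit-normalized vectors of length n, and that v_i holds the relevance weights for cluster i):

```python
# Sketch: cosine-based distance along each dimension and its weighted aggregate.

def dimension_distances(xj, ci):
    n = len(xj)
    return [1.0 / n - xjk * cik for xjk, cik in zip(xj, ci)]

def weighted_dissimilarity(xj, ci, v_i):
    return sum(vik * dik for vik, dik in zip(v_i, dimension_distances(xj, ci)))
```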
Then the rest of the processing is per the algorithm we discussed earlier.
Note that N is typically much smaller than n, and that C, the number of rows in the C x N partition matrix, is typically much, much smaller than n.
This means that all our aggregations over the n terms have to be efficient. This could be achieved by reducing the number of iterations over n and possibly with some pre-computations.
Moreover, the aggregations over C and n may need to be repeated for many components, such as the error to the clusters and the feature weights. Similarly, the cosine-based distances could be computed once and stored for lookup later so that we don't have to perform the same operations again. This could speed up processing.
The fuzzy labels are found by aggregating over each of the clusters. They are aggregated over the number of documents N with a proper choice of m, the degree to which the fuzzy membership is raised when computing the feature weight. They are also aggregated to update the cluster centers.

Sunday, October 27, 2013


In the previous two posts, we looked at K-means clustering and improving its performance for large data sets. K-means is very useful for forming a known number K of clusters. Besides, we can choose what we want to cluster. For example, we can bisect the data points into two clusters. Then we can bisect one of the clusters further if it does not satisfy, say, a threshold we set. The threshold could be based on distance or density. If we bisect one of the topic clusters further, recursively, because it is larger than we expected, we will now have a binary tree of clusters. In topic analysis, these clusters correlate to topics and subtopics without restricting ourselves to the syntax or layout of the text. On the other hand, text segmentation relies on the layout and generates segmentation blocks based on a stochastic topic model. Here the topics are rudimentarily spotted from the text and formed into word clusters, and then word clusters sharing common words are merged together. This gives us independent topics. Next, we follow the text to see how similar each preceding block is to the succeeding block, and wherever there are valleys, or low similarity values, we can segment the text. Contrast this with the bisecting method, where the granularity of text segmentation could also be left to user discretion, with as much segmentation or as little as desired. Thus clustering can be applied in both bisecting as well as agglomerative ways.
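As a rough illustration of the bisecting approach, here is a sketch that repeatedly applies 2-means to any cluster larger than a size threshold; the use of scikit-learn's KMeans and the size-based splitting rule are assumptions for illustration:

```python
# Sketch: recursive bisecting clustering using 2-means, splitting oversized clusters.
import numpy as np
from sklearn.cluster import KMeans

def bisect(points, max_size):
    # points: array of shape (num_docs, num_features); returns lists of row indices.
    pending = [np.arange(len(points))]
    leaves = []
    while pending:
        idx = pending.pop()
        if len(idx) <= max_size:
            leaves.append(idx)                       # small enough: keep as a topic cluster
            continue
        labels = KMeans(n_clusters=2, n_init=10).fit(points[idx]).labels_
        left, right = idx[labels == 0], idx[labels == 1]
        if len(left) == 0 or len(right) == 0:
            leaves.append(idx)                       # cannot split further
        else:
            pending.extend([left, right])            # each half may be bisected again
    return leaves
```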
For the purposes of keyword extraction, we rely on term weighting. Here the weights are determined based on probabilistic latent semantic analysis or some other technique. These include distances based on cosine, Jensen-Shannon or the feature weights we discussed in the previous posts. We could also consider Kullback-Leibler distance based background suppression to extract keywords.
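A rough sketch of the last idea, scoring terms by how much their in-document probability diverges from a background distribution (the smoothing constant and the count-dictionary format are assumptions for illustration):

```python
# Sketch: Kullback-Leibler style term scoring against a background corpus.
# Terms much more probable in the document than in the background score high.
import math

def keyword_scores(doc_counts, background_counts, epsilon=1e-9):
    doc_total = sum(doc_counts.values())
    bg_total = sum(background_counts.values())
    scores = {}
    for term, count in doc_counts.items():
        p = count / doc_total                                     # probability in the document
        q = background_counts.get(term, 0) / bg_total + epsilon   # smoothed background probability
        scores[term] = p * math.log(p / q)                        # per-term divergence contribution
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```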
Meanwhile, we have yet to look at several other implementation considerations, such as the following:
Lemmatization - Stemming can help remove the various forms in which words appear, be it in noun, verb or adjective form.
Text tagging - could possibly be done with TreeTagger
MultiWord Lookup - We may have to look at collocations and words that appear in titles.
Named-entity recognition and filtering - Proper noun and stop word filtering will help with improving the data.
All of the above are necessary before the topic / keyword extraction or assigning term weights.
We talked about summarizing data points when clustering very large data sets. This approach is common to many algorithms that improve clustering, and in particular today we will discuss a simple summarizing technique called BFR for the K-means clustering we discussed earlier.
Data points are read into memory, and here too the initial k cluster centroids are chosen by some simple approach. For example, we can take a sample, pick a random point, and then pick k-1 more points, each as far away as possible from the previously chosen points.
We maintain three sets of data points, called the discard set, the compression set and the retained set. The discard set holds points close enough to a centroid to be summarized. The compression set holds groups of points that are close together but not close to any centroid; they are summarized but not assigned to a cluster. The retained set holds isolated points.
The discard set is summarized by the number of points N.
It has a vector SUM whose ith component is the sum of the co-ordinates of the points in the ith dimension.
It has a vector SUMSQ whose ith component is the sum of the squares of the co-ordinates in the ith dimension.
Thus 2d + 1 values represent any number of points, where d is the number of dimensions.
The centroid in the ith dimension is calculated as SUMi/N.
And the variance in the ith dimension is calculated as (SUMSQi/N) - (SUMi/N)^2.
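A minimal sketch of that summary structure, assuming plain Python lists of co-ordinates:

```python
# Sketch: the (N, SUM, SUMSQ) summary kept for a discard or compression set.
class ClusterSummary:
    def __init__(self, d):
        self.N = 0                 # number of summarized points
        self.SUM = [0.0] * d       # per-dimension sum of co-ordinates
        self.SUMSQ = [0.0] * d     # per-dimension sum of squared co-ordinates

    def add(self, point):
        self.N += 1
        for i, x in enumerate(point):
            self.SUM[i] += x
            self.SUMSQ[i] += x * x

    def centroid(self):
        return [s / self.N for s in self.SUM]

    def variance(self):
        return [sq / self.N - (s / self.N) ** 2
                for s, sq in zip(self.SUM, self.SUMSQ)]
```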
This summarization data is used with the processing of the data points from the very large data set.
The data points are processed in the following manner:
Data points that are sufficiently close to a cluster centroid are added to that cluster and its discard set.
Cluster the remaining points together with the old retained set. Clusters that are formed go to the compression set and outliers go to the retained set. Since these clusters are not assigned to any centroid yet, we do not add them to the discard set.
The summary data for these clusters is then updated with the inclusion of the new points. Some compression set clusters can even be merged. If this is the last round, merge all the compression set clusters and the retained set points into their nearest cluster.

Saturday, October 26, 2013

We discussed SKWIC and Fuzzy SKWIC, in that order. I wanted this post to be in favor of implementation, starting from the basics. Actual implementation details may follow subsequently, but in this post we start with the basics.
Given a set of tuples with a fixed set of attributes, we can implement different clustering algorithms. Let us pick something real simple, like classifying each tuple into one of three different categories, say X, Y and Z. Let's say the tuples represent points with (x, y) co-ordinates. On a graph with x and y axes, these points appear in the Cartesian plane we may be familiar with. Visualizing the points on this graph helps us find the clusters X, Y and Z as separate from each other. It also helps us calculate the distance between any two points as the square root of the sum of the squares of the differences of their x and y co-ordinates.
One approach is to use a centroid - the point closest to the other points in a cluster - and represent the cluster with the centroid. This works in our case since we can keep track of three such points for the three categories we want our points in. This is called K-means, and in our case K = 3.
So the steps involved are something like this:
1) Initialize the clusters by picking three points as centroids. Pick one point randomly, then the other two points as far away as possible from the previously picked points.
2) For each point, place it in the nearest cluster, i.e. the cluster whose centroid is at the minimum distance. Recompute the centroid for this cluster.
3) After all the points are assigned, fix the centroids of these three clusters.
4) Re-assign all the points to their closest centroid.
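A bare-bones sketch of those steps for 2-D points, using the standard batch iteration and simple random initialization (the far-apart seeding can be substituted in):

```python
# Sketch: plain K-means for 2-D points, K = 3.
import math
import random

def distance(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def kmeans(points, k=3, iterations=20):
    centroids = random.sample(points, k)                 # step 1: pick initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # steps 2 and 4: assign to nearest centroid
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):           # step 3: recompute the centroids
            if members:
                centroids[i] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, clusters
```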
Sometimes K is not clear initially,
and we may need to try different values of K and watch the change in the average distance to the centroid. As K increases, this drops rapidly until, beyond the right K, it changes very little.
This is K-means clustering. Next we will look at optimizing memory and CPU.
For now, we can assume that initialization costs can be kept low with random picks, re-selecting if the picks are not far enough apart.
In general, when there are lots of tuples, we load them in batches that fit in memory. Then we maintain three classes of points and use summarization so that we can represent them with shorter data structures and continue our clustering.
For each cluster, the summarization info includes the number of points N and the vectors whose ith components hold the sum of the co-ordinates of the points in the ith dimension and the sum of the squares of the co-ordinates in the ith dimension.
We now summarize the Fuzzy SKWIC algorithm below:
Fix the number of clusters C;
Fix m; m can range from 1 to infinity. m is the exponent applied to the fuzzy memberships, so a suitable value can be chosen to amplify them.
Initialize the centers by randomly selecting C documents.
Initialize the fuzzy partition matrix U = C x N matrix = [uij], where the uij are the fuzzy memberships subject to the following constraints: a) uij is bounded between 0 and 1; b) the sum of uij for j = 1 to N is bounded between 0 and N; and c) the sum of uij over the clusters i = 1 to C must equal 1.
REPEAT
1) Find the cosine distance along the individual dimensions k as 1/n - xjk.cik, where cik is the kth component of the ith cluster center vector. Do this for i ranging from 1 to C, j ranging from 1 to N and k ranging from 1 to n (n is the total number of terms in the collection of N documents).
2) Before we find the aggregated cosine distance, update the relevance weights vik using the solution of the partial differential equations, where vik is assigned a default value of 1/n plus a bias involving no fuzzy memberships and only the aggregated differences in cosine distances.
3) Now compute the aggregated cosine-based distance for each cluster i from 1 to C and each document j from 1 to N, using the relevance weights and the cosine distances along the individual dimensions and aggregating them.
4) If the aggregated cosine-based distance from above turns out to be negative, adjust it so that it does not affect the sign of the fuzzy memberships. We adjust it using the minimum magnitude of the aggregated cosine-based distances.
5) Update the partition matrix by taking, for each cluster i and document j, the ratio of the aggregated cosine-based distances Dwcij and Dwckj raised to the power 1/(m-1), summing this ratio over the clusters k, and inverting the result.
6) Update the centers by taking into account the fuzzy memberships. Specifically, we set cik to 0 when the relevance weight is zero, and otherwise set it to the fuzzy-membership-normalized average of xjk. The fuzzy membership is amplified by raising it to the power m and aggregated over the documents j ranging from 1 to N in both the numerator and the denominator that normalizes the document frequency vector.
7) Update the tau for the ith cluster using the cosine-based distances along the individual dimensions and the feature weights, summed over all the terms, divided by the sum of the squares of the feature weights, and multiplied by the sum of the amplified fuzzy memberships and a constant K. We find the value of tau by repeating the iterations with the values of the fuzzy membership, the feature weight and the cluster center cik from the previous iteration.
(Continue the iterations) UNTIL centers stabilize.
Note that we have used the fuzzy memberships in the initialization and in steps 4, 5, 6 and 7. So Fuzzy SKWIC improves the partition matrix, the center updates and the choice of tau in each iteration.
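Putting the steps together, here is a rough numpy sketch of the loop. The specific expressions, especially the relevance-weight bias (written here with the fuzzy memberships included, as described in the next post) and the step 4 adjustment, are assumptions following the descriptions above rather than a faithful transcription of the published algorithm:

```python
# Sketch of the Fuzzy SKWIC iteration described above.
import numpy as np

def fuzzy_skwic(X, C, m=2.0, K=1.0, iterations=50, eps=1e-9, seed=0):
    # X: N x n matrix of unit-normalized document frequency vectors.
    N, n = X.shape
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(N, size=C, replace=False)].copy()   # C randomly selected documents
    U = rng.random((C, N))
    U /= U.sum(axis=0, keepdims=True)                          # column-wise sums equal 1
    V = np.full((C, n), 1.0 / n)                               # relevance weights, default 1/n
    tau = np.ones(C)

    for _ in range(iterations):
        # 1) per-dimension cosine-based distances D[i, j, k] = 1/n - xjk * cik
        D = 1.0 / n - X[None, :, :] * centers[:, None, :]

        # 2) relevance weights: default 1/n plus a bias from the aggregated distance differences
        Um = U ** m
        Dwc = np.einsum('ik,ijk->ij', V, D)
        bias = np.einsum('ij,ijk->ik', Um, Dwc[:, :, None] / n - D)
        V = 1.0 / n + bias / (2.0 * tau[:, None])

        # 3) aggregated cosine-based distance for each cluster i and document j
        Dwc = np.einsum('ik,ijk->ij', V, D)

        # 4) shift negative aggregated distances so the memberships keep their sign
        if Dwc.min() < 0:
            Dwc = Dwc - Dwc.min()

        # 5) fuzzy membership update (fuzzy c-means form)
        ratio = (Dwc[:, None, :] + eps) / (Dwc[None, :, :] + eps)   # Dwc_ij / Dwc_kj
        U = 1.0 / np.power(ratio, 1.0 / (m - 1.0)).sum(axis=1)

        # 6) center update: fuzzy-membership-normalized average of the documents
        Um = U ** m
        centers = (Um @ X) / (Um.sum(axis=1, keepdims=True) + eps)
        centers[V <= 0] = 0.0                                   # components with no relevance weight are zeroed

        # 7) tau update from the memberships, weights and distances of this iteration
        tau = K * (Um * Dwc).sum(axis=1) / ((V ** 2).sum(axis=1) + eps)

    return U, centers, V
```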

Friday, October 25, 2013


We will continue to solve our objective function for the overlapping clusters so that documents can have fuzzy labels. Our Fuzzy SKWIC objective function has similar components to the one we discussed earlier, i.e. the first component is the error to the clusters and the second component is the feature weights. In addition, the first component now has the fuzzy membership degrees aggregated over the various clusters and terms. The objective function is subject to the condition that the feature weights range between 0 and 1 and that the sum of all feature weights equals 1. The second component has a tau parameter that lets us tune the function so that both components have equal orders of magnitude. This, we mentioned, is required to keep both components equally relevant. Each cluster has its own set of feature weights Vi = vi1, ..., vin.
Now we use the same Lagrange multiplier technique to optimize the objective function with respect to V. Since the rows of V are independent of each other, we reduce the objective function to C independent problems.
Then we find the partial differential equations by setting the gradient of the objective function to zero. We do this with respect to both the multiplier and the feature weights.
Solving the differential equations for the feature weight vik, we get two components: the first term is the default value, again 1/n in this case, and the second is the bias, which now has the fuzzy membership degree included in the aggregated dot product.
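Written out, the update has roughly the following form (a sketch based on the description above, where Dwcij is the aggregated weighted distance and Dijk the cosine-based distance along dimension k; the exact normalization inside the bias should be checked against the derivation):

```latex
v_{ik} = \frac{1}{n}
       + \frac{1}{2\tau_i}\sum_{j=1}^{N} u_{ij}^{m}
         \left[\frac{D_{wc_{ij}}}{n} - D_{ij,k}\right]
```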
The choice of tau is very important since it reflects the importance of the second term relative to the first. Choosing a small tau will result in choosing only one keyword per cluster with feature weight 1, whereas choosing a larger value will result in all the words in a cluster being relevant and assigned equal weights.
Since the second component of the objective function does not depend on the fuzzy membership, the update equation for the fuzzy membership is similar to that of Fuzzy C-Means: it is the inverse of the cluster-wise sum of the ratios of the aggregated cosine-based distances, each ratio raised to the power 1/(m-1).
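In symbols, that update is roughly (again a sketch, following the standard fuzzy c-means form):

```latex
u_{ij} = \left[\sum_{k=1}^{C}
         \left(\frac{D_{wc_{ij}}}{D_{wc_{kj}}}\right)^{\frac{1}{m-1}}\right]^{-1}
```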