Thursday, October 31, 2013

Some observations from the implementation by the authors of Fuzzy SKWIC. We discuss their experiments with the general SKWIC as well as with the fuzzy improvement.
They first ran simulations with hard clustering. Their collection consisted of 200 documents spread equally over four categories. They filtered stop words and used stemming. The IDF values of the terms were sorted in descending order to pick out 200 keywords. Each document was represented by a vector of its document frequencies, normalized to unit length. Four clusters were used and the SKWIC converged in five iterations. Six cluster-dependent keywords were chosen per cluster and verified to correctly represent the cluster category. Only one partition showed most of the error; documents that straddled two or more topics were found in this partition.
One observation I would like to highlight here is that regardless of the keyword extraction method, their choice for the number of keywords was on the order of a few hundred. The ratio of keywords to documents is incidental here, but the choice to keep the number of keywords small is deliberate. We should follow a similar practice.
Now on to the run based on fuzzy labels: the results were again with four clusters and m = 1.1. Their cluster center updates converged after 27 iterations. The results indicated a significant improvement over the former run, particularly with respect to the partition that had the erroneous documents. In this run, those documents were correctly classified. A closer look at those documents suggested that they could have had soft labels from the start, i.e. they should not have been assigned a single label. The results also indicate that the labels generated by the Fuzzy SKWIC were richer and more descriptive of their categories, indicating a good application.
A third simulation was done with much more varied data from 20 newsgroups with 20,000 messages. Here the messages were stripped of their structural data in addition to being processed with stemming and stop word filtering.
The number of clusters here was 40. The feature weights varied quite a bit from cluster to cluster, and a lot of document migrations between clusters were seen. Ten relevant keywords were chosen for each cluster and, again, they reflected their clusters well. Some indicated that the discussions were about an event, while others indicated that they pertained to an issue found interesting by the participants.
We were discussing the implementation of Fuzzy SKWIC, where we were going through the steps within each iteration, which we repeat until the cluster centers stabilize. So far we had looked into finding cosine distances, updating and adjusting relevance weights, computing the weighted aggregated cosine distance and adjusting it. We will discuss the updates to the partition matrix, cluster centers and tau next. The partition matrix was initialized as a C x N matrix of weights, each with values ranging from 0 to 1. The partition matrix has the additional constraints that the column-wise sum along clusters must equal 1 and the row-wise sum along documents must be less than N. To update this matrix in each iteration, we use the computed weighted aggregated cosine distances. Here we take the ratio of the weighted aggregated Dwcij and Dwckj and aggregate the ratio over all the clusters. Note that we may feel like we have not computed the denominator till now since it has not been mentioned so far. It follows the same computation as the numerator but against a cluster k ranging from 1 to C instead of the specific ith cluster. Both i and k are cluster iterators, and k changes from 1 to C for the same i in this computation. Therefore Dwcij and Dwckj are both known before this step to update the matrix. Once we calculate the ratios, we raise each to the power 1/(m-1), aggregate them, and then invert the result to calculate the new partition matrix value uij. m is just the degree to which the fuzzy memberships are raised.
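A minimal numpy sketch of this partition update, assuming Dwc is the C x N matrix of adjusted weighted aggregated cosine distances (the names and shapes are illustrative, not from the paper):

import numpy as np

def update_partition_matrix(Dwc, m=1.1, eps=1e-12):
    # Dwc: C x N adjusted weighted aggregated cosine distances, assumed non-negative
    # u_ij = 1 / sum over k of (Dwc_ij / Dwc_kj) ** (1 / (m - 1))
    C, N = Dwc.shape
    U = np.zeros((C, N))
    for j in range(N):
        for i in range(C):
            ratios = (Dwc[i, j] + eps) / (Dwc[:, j] + eps)
            U[i, j] = 1.0 / np.sum(ratios ** (1.0 / (m - 1)))
    return U

Each column of U then sums to 1, matching the constraint on the partition matrix.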
Next we look at updating the cluster centers. Here, if the feature weight vik = 0, then the kth feature has no bearing on the cluster, so we leave that component of the cluster center at zero. If the feature weight is greater than zero, we update the cluster center component with the fuzzy-membership-weighted mean of the documents.
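A sketch of that center update under the same illustrative shapes as above (U is C x N, X is the N x n matrix of unit-length document vectors, V is the C x n matrix of feature weights):

import numpy as np

def update_cluster_centers(U, X, V, m=1.1):
    # amplified fuzzy memberships weight each document's contribution to the center
    C, N = U.shape
    n = X.shape[1]
    centers = np.zeros((C, n))
    weights = U ** m
    for i in range(C):
        for k in range(n):
            if V[i, k] > 0:
                centers[i, k] = np.dot(weights[i], X[:, k]) / np.sum(weights[i])
            # else leave cik at 0 since the feature is irrelevant to this cluster
    return centers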
Finally, we update tau. Tau is the last to be updated in the iteration, and here the only things that have changed since the last computation are the partition matrix and the newly computed Dwcij.
Thus we see that the data structures we need are just the partition matrix and variables per iteration.  The steps in the iteration are as discussed.
The authors of Fuzzy SKWIC mention a very interesting phenomenon called noise magnets. One cluster starts accumulating outliers and noise while the others remain more compact, with their constituents at similar distances from the cluster center. This happens because the outlier documents are neither similar to each other nor similar to any cluster. In addition, they get assigned the maximum distance of 1, so they have little effect on the objective function. They join the cluster whose seed is nearest to them. Further, they bring down the feature weights of this cluster through extreme averaging, and consequently this cluster draws in all the outliers.

Wednesday, October 30, 2013

In the previous post we talked about the steps to compute the terms for the cosine distances and the feature/relevance weights in each iteration. This we could do with the parameters we had during each iteration. We started talking about how to update and adjust the relevance weights. So let's wrap that up and get to computing the weighted aggregated Dwcij and adjusting it if it's negative. Picking up the thread from the relevance weights, we saw that each depends on a default value and a bias. We computed the bias by initializing the fuzzy membership values. Other than tau, everything else was known to compute the relevance weights. For tau, we compute it for iteration t with something like the following: tau_i(t) = K * (Sum over documents j of u_ij^m * weighted aggregated cosine distance Dwc_ij) / (Sum over terms k of v_ik^2), where K is a constant. Here we see that we have everything we need for iteration t. So the relevance weights can be calculated; we just need initial values for the first iteration and the previous iteration's values in the next one.
Now let's look at the weighted aggregated Dwcij and the per-dimension distances Dwcijk.
The weighted aggregated distance is computed as Dwc_ij = Sum over the n terms k of (v_ik * Dwc_ijk).
It's the negative values we need to adjust. For each cluster we compute the absolute values of the weighted aggregated Dwcij and take the minimum; this is the adjustment value we add to the weighted aggregated Dwcij that comes out to be the minimum.
Note that the previous iteration values of Dwcij are not used in the next iteration but they are recomputed. This is because they are based on cluster centers which we update in each iteration. The adjustment to the negative value for Dwcij follows a similar pattern as adjusting the negative value for relevance weights that we have seen earlier.
Today we will look at the distance measures used with the Fuzzy SKWIC, specifically the aggregated sum of cosine based distances along the ith dimension. For a program to implement it, we will review what we need to save from our computations for use later so we can be more efficient.
The aggregated sum of cosine based distances, we said, is built from the per-dimension distances Dwc_ijk = 1/n - x_jk * c_ik, where i ranges from 1 to C, j ranges from 1 to N and k ranges from 1 to n; n is the total number of terms in a collection of N documents. For this we maintain a C x N matrix. When we have computed the individual cosine based distances, we will compute their weighted aggregated sum, and we will need this D'wc_ij to be saved. The feature weights are initialized, so we can compute this weighted aggregated sum for the first iteration. But note that Dwc_ijk is used when updating the feature weights. Dwc_ijk changes when the cluster center changes, but for the duration of one iteration (our iterations are to update the cluster centers) we can use the same Dwc_ijk. Looking at the equation for updating the feature weights, v_ik = 1/n + (1/(2*tau_i)) * Sum over j from 1 to N of u_ij^m * [D'wc_ij/n - Dwc_ijk], we note that it uses both the weighted aggregated sum from the previous iteration and the Dwc_ijk we computed. With a suitable value for tau and after initializing the fuzzy memberships and m, we are good to compute the feature weights. The tau changes in each iteration and serves to keep both components of the objective function at the same order of magnitude, so we initialize the values in the equation that updates tau for each iteration with what we have in the first iteration. Since tau affects the feature weights, we will need to adjust any negative feature relevance values. So far, all the variables except the cosine based distance along the individual dimensions are subject to change in each iteration.
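A sketch of that feature weight update, with assumed shapes (U is C x N, Dwc_dim is C x N x n holding the per-dimension distances, Dwc_agg is the C x N weighted aggregated sum from the previous iteration, tau is a length-C vector); the handling of negative weights here is an assumption, since the text only says they must be adjusted:

import numpy as np

def update_feature_weights(U, Dwc_dim, Dwc_agg, tau, m=1.1):
    C, N, n = Dwc_dim.shape
    V = np.full((C, n), 1.0 / n)                  # default value 1/n
    Um = U ** m
    for i in range(C):
        for k in range(n):
            # bias: aggregated difference between the average and the per-dimension distance
            bias = np.sum(Um[i] * (Dwc_agg[i] / n - Dwc_dim[i, :, k]))
            V[i, k] += bias / (2.0 * tau[i])
    # assumed adjustment: clip negative weights to zero and renormalize each row to sum to 1
    V = np.clip(V, 0.0, None)
    V /= V.sum(axis=1, keepdims=True)
    return V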

Tuesday, October 29, 2013

In the previous post, we mentioned a few caveats and perhaps we can elaborate on some of them. For example, a document may not contain all of the terms in the corpus, terms may be missing from both, and some document terms may need to be discounted; computing the KLD for these could result in an infinite value, so we treat them the same as unknown terms and cap them with an epsilon value. Further, we mentioned that the KLD may not be used for the clustering itself as opposed to its use in pre-processing for clustering. This is because the KLD has the following shortcomings: it doesn't work for all the data, and the choice of inclusions makes a difference to the end result. Instead we want something for clustering that is independent of the order of selections, even if documents can belong to different clusters to varying degrees. Furthermore, the distance metric used for clustering has several components, often including the aggregated cosine distances.
On the other hand, we can better tune our clustering when we try it out on a smaller dataset. So while KLD fits in very well to reduce the data set, the clustering also benefits from working with a smaller dataset. The clustering technique itself should work regardless. As the data set improves, the clustering can be refined further until we can reliably separate the different groups.
One way to test the clustering would be to use a smaller sample. If it works for a smaller sample, it should work for more data. The clustering relies on a metric that is independent of the number of data points. If it's just one word, then the cluster center is the word itself.
When we talk of smaller data sets for clustering, we will find it easier to verify with manual classification and categorization. This will help with the precision and recall. Note that from our previous examples, the precision is the ratio of the true positive to the total positive cases i.e. the sum of both true positives and false positives. On the other hand, recall is the ratio of the true positives to the sum of true positives and false negatives. In other words, recall is how much of the relevant items were retrieved while precision is how many items that were retrieved were relevant.
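A tiny sketch of those two measures for a manually labeled sample (the counts here are made-up numbers for illustration):

def precision_recall(true_positives, false_positives, false_negatives):
    # precision: fraction of retrieved items that were relevant
    precision = true_positives / float(true_positives + false_positives)
    # recall: fraction of relevant items that were retrieved
    recall = true_positives / float(true_positives + false_negatives)
    return precision, recall

# e.g. 40 documents correctly placed, 10 wrongly included, 5 relevant ones missed
print(precision_recall(40, 10, 5))   # (0.8, 0.888...)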
With smaller samples, we can easily measure the precision and recall and fine tune our clustering.
One suggestion is that regardless of the sample we work with, we try text from different categories and not just from the corpus. Web data can be used for the trials since this has a lot more variety.
The discussion on KLD above was read from the paper on the same by Brigitte Bigi, and the discussion on the clustering was read from the paper on topic detection by Christian Wartena and Roger Brussee. The paper by Wartena and Brussee describes the clustering and the pre-processing we refer to.
We discussed processing text documents, and I wanted to mention the GATE framework in this regard. It stands for General Architecture for Text Engineering. It is open-source software, widely adopted by organizations, corporations and universities, and has a community of developers, educators and scientists. It allows different text processing tasks to be completed by CREOLE plugins. The tokenizer splits text into very simple tokens. The grammar rules are more adaptable and can be used for interpreting the text. A gazetteer is used for identifying proper nouns and named entities in the text. Annotations are used with this gazetteer list to mark the type and sub-types, which can then be looked up via grammar rules. For example, to find the 'day' of the week in a text, this gazetteer list can be used to look up Thursday.
Part of Speech tagging is also handled by this framework although we can rely on TreeTagger for the same. The tree tagging lets us filter the content words by part of speech further.
We haven't yet reviewed the application of the Kullback-Leibler mechanism to differentiate the keywords from the background text. The suggestion is that the background text could be considered a different document or bag of words. We want to differentiate the keyword candidates from the background using the individual probability distributions of the keywords and applying a Kullback-Leibler cutoff. This technique is also called relative entropy. The more dissimilar the two probability distributions are, the higher the divergence and the more useful it is for keyword selection. A term is selected only when it contributes significantly to the divergence.
Each word is assigned a probability, and each document is represented by a vector of term probabilities.

The goal is to require the keyword distribution P to be different from the words with little discerning power and from the general background distribution Q. We can then define a cutoff on each term's contribution to the divergence of P from Q; typically a value greater than 1 is a good threshold.
Something along the lines of the small Python sketch below (the function bodies are a rough filling-in of the original pseudocode):
import re, math, collections

def tokenizeandfilter(sent, stop_words=frozenset()):
    # tokenize into lower-case word tokens
    tokens = re.findall(r"[a-z]+", sent.lower())
    # remove stop words; lemmatization would be done with an external tool such as TreeTagger
    return [t for t in tokens if t not in stop_words]

def ProbabilityDistributions(document):
    # assign probabilities to terms based on their relative frequencies
    counts = collections.Counter(document)
    total = float(sum(counts.values()))
    return {term: count / total for term, count in counts.items()}

def KLDifferentiation(PD1, PD2, threshold=1.0, epsilon=1e-10):
    # select the words whose contribution to the divergence of PD1 from PD2 exceeds the threshold
    return [term for term, p in PD1.items()
            if p > 0 and p * math.log(p / max(PD2.get(term, 0.0), epsilon)) > threshold]

A caveat with the Kullback-Leibler divergence is that it doesn't work well when a term's probability is zero in either of the documents, or when words are included in both documents that could otherwise have been left out.

That is why we plan to use it for our pre-processing and not with our clustering.

Monday, October 28, 2013

We talked about Kullback-Leibler divergence in the previous post. We will look into it now.
It is useful for differentiating two probability distributions P and Q, where P represents the true distribution and Q represents the computed statistical distribution.
The divergence measure is computed using expected values in terms of information entropy.
The divergence is given as the cross entropy of P and Q minus the entropy of P. It can also be written as -Sum(p(x)log(q(x))) + Sum(p(x)log(p(x))). A small sketch of this computation follows; after that, we will defer the application of the Kullback-Leibler divergence for later and discuss the next steps in the processing, such as lemmatization and tagging.
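A minimal sketch, assuming the distributions are dictionaries of term probabilities and that missing or zero probabilities in Q are capped with a small epsilon value:

import math

def kl_divergence(p, q, epsilon=1e-10):
    # KL(P || Q) = sum over x of p(x) * log(p(x) / q(x))
    divergence = 0.0
    for term, px in p.items():
        if px > 0:
            qx = max(q.get(term, 0.0), epsilon)
            divergence += px * math.log(px / qx)
    return divergence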
Both lemmatization, i.e. finding the lemma, and tagging with parts of speech such as noun, verb, adjective, etc. can be done with TreeTagger. It can be downloaded and experimented with separately. Given a sentence of words, it can detect the parts of speech and the lemmas that go with them, returned as two lists. Internally, it uses a decision tree classifier to identify the parts of speech. The decision tree is built recursively from a training set of trigrams. At each recursion step, a test is added which divides the trigram set into two subsets with maximum distinctness regarding the probabilistic distribution of the third tag. The test examines whether one of the two preceding tags is identical to a tag t from a known tagset. In the same recursion step, all possible tests are compared and the best one, yielding the most information, is attached to the current node. Then this node is expanded recursively on each of the two subsets of the training set. This results in a yes-no binary decision tree.
Stemmers are included with TreeTagger to help with identifying the lemma. The stemmers, for example, have the ability to detect common modifying suffixes of words. Lemmatization generally reduces the number of unique terms drastically, and this helps reduce the arbitrarily high dimensionality of the term space.
Stop lists are another tool to filter out the frequently occurring words or words that have no discernible information content. This list can be maintained external to the processing and is quite useful for reducing the count of relevant terms.
Together both the above steps are prerequisites for clustering based on document frequency.

We alluded to several stages of text processing during implementation before we can do clustering. We also compared agglomerative to K-means clustering. Recently, agglomerative processing has been made very fast and efficient, and since it represents a bottom-up merge strategy, it has some advantages we discussed. However, K-means is considered better for text purposes. For the rest of our implementation discussions, we will focus on K-means only.
We select C documents based on random sampling. We initialize the fuzzy partition matrix of size C x N, where each of the values of the matrix lies between 0 and 1, the row-wise sum is less than N, and the column-wise sum equals 1.
We populate the document frequency vectors xi for each of the n terms in a collection of N documents.
To calculate the cosine similarity between document frequency vectors xi and xj, we take their aggregated dot product divided by the product of their magnitudes.
To calculate the dissimilarity between document frequency vector xj and the ith cluster center vector, we aggregate the weighted distances over each of the n terms. The distance along an individual dimension is the cosine based distance Dwc_ijk = 1/n - xjk * cik, where k varies from 1 to n.
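A small sketch of these two measures, assuming unit-length numpy vectors:

import numpy as np

def cosine_similarity(xi, xj):
    # aggregated dot product divided by the product of the magnitudes
    return np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

def per_dimension_distances(xj, ci):
    # Dwc_ijk = 1/n - xjk * cik for each of the n dimensions
    n = len(xj)
    return 1.0 / n - xj * ci

def weighted_dissimilarity(xj, ci, vi):
    # aggregate the per-dimension distances with the feature relevance weights vi
    return np.dot(vi, per_dimension_distances(xj, ci))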
Then the rest of the processing is per the algorithm we discussed earlier.
Note that N is typically much smaller than n and that the matrix for the C x N has C typically much much smaller than n.
This means that all our aggregations over n terms have to be efficient. This could be achieved with reducing the number of iterations over n and possibly some pre-computations.
Moreover, the aggregations over C and n may need to be repeated for many components, such as the error to the clusters and the feature weights. Similarly, the cosine based distances could be computed once and stored for lookup later so that we don't have to perform the same operations again. This could speed up processing.
The fuzzy labels are found by aggregating over each of the clusters. They are also aggregated over the number of documents N, with a proper choice of m, the degree to which the fuzzy membership is raised when computing the feature weights, and they are aggregated again when updating the cluster centers.

Sunday, October 27, 2013


In the previous two posts, we looked at K-means clustering and at improving its performance for large data sets. K-means is very useful for forming a known number K of clusters. Besides, we can choose what we want to cluster. For example, we can bisect the data points into two clusters, and then bisect one of the clusters further if it does not satisfy, say, a threshold we set. The threshold could be based on distance or density. If we bisect one of the topic clusters further recursively because it's larger than we expected, we will end up with a binary tree of clusters. In topic analysis, these clusters correlate to topics and subtopics without being restricted to the syntax or layout of the text. On the other hand, text segmentation relies on the layout and generates segmentation blocks based on a stochastic topic model. Here the topics are rudimentarily spotted from the text and formed into word clusters, and then word clusters sharing common words are merged together. This gives us independent topics. Next, we follow the text to see how similar each preceding block is to the succeeding block, and wherever there are valleys or low similarity values, we can segment the text. Contrast this with the bisecting method, where the granularity of text segmentation can be left to user discretion to have as much or as little segmentation as desired. Thus clustering can be applied in both bisecting and agglomerative ways.
For the purposes of keyword extraction, we rely on term weighting. Here the weights are determined based on probabilistic latent semantic analysis or some other technique. These include distances based on cosine, Jensen-Shannon or the feature weights we discussed in the previous posts. We could also consider Kullback-Leibler distance based background suppression to extract keywords.
Meanwhile, We have yet to look at several other implementation considerations such as the following:
Lemmatization - Stemming can help remove the various forms in which words appear, be it noun, verb or adjective form.
Text tagging - could possibly be done with TreeTagger
MultiWord Lookup - We may have to look at collocations and words that appear in titles.
Named-entity recognition and filtering - Proper noun recognition and stop word filtering will help with improving the data.
All of the above are necessary before the topic / keyword extraction or assigning term weights.
We talked about summarizing data points when clustering very large data sets. This approach is common to many algorithms that improve on clustering, and in particular today we will discuss a simple summarizing technique called BFR (Bradley-Fayyad-Reina) for the K-means clustering we discussed earlier.
Data Points are read into memory and here too the initial k-cluster centroids are chosen by some simple approach. For example, we can take a sample, pick a random point and then k-1 points as far apart from each other as possible.
We maintain three sets of data points and these are called discard set, compression set, and retained set. The discard set are points close enough to a centroid to be summarized. The compression set is a group of  points that are close together but not close to any centroid.  They are summarized but not assigned to a cluster. The retained sets are isolated points.
The discard set is summarized by the number of points N, a vector SUM whose ith component is the sum of the co-ordinates of the points in the ith dimension, and a vector SUMSQ whose ith component is the sum of the squares of the co-ordinates in the ith dimension. Together these are 2d + 1 values, where d is the number of dimensions, and they can represent any number of points.
Centroid in the ith dimension is calculated as SUMi/N
And the variance is calculated as the (SUMSQi/N) - (SUMi/N)^2
This summarization data is used with the processing of the data points from the very large data set.
The data points are processed in the following manner:
Data points that are sufficiently close to a cluster centroid are added to that cluster and its discard set.
Cluster the remaining points together with the old retained set. Clusters that are formed go to the compression set and outliers go to the retained set. Since these clusters are not assigned to a centroid yet, we do not add them to the discard set.
The summarization data for these clusters is then updated with the inclusion of the new points. Some compression set clusters can even be merged. If this is the last round, merge all of the compression set and the retained set into their nearest clusters.
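A small sketch of the 2d + 1 value summary described above (the class name and interface are just for illustration):

import numpy as np

class ClusterSummary:
    # 2d + 1 values: the count N plus the per-dimension SUM and SUMSQ vectors
    def __init__(self, d):
        self.N = 0
        self.SUM = np.zeros(d)
        self.SUMSQ = np.zeros(d)

    def add(self, point):
        point = np.asarray(point, dtype=float)
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def centroid(self):
        return self.SUM / self.N                               # SUMi / N

    def variance(self):
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2  # SUMSQi/N - (SUMi/N)^2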

Saturday, October 26, 2013

We discussed SKWIC and Fuzzy SKWIC in that order. I wanted this post to be in favor of implementation, starting from the basics. Actual implementation details may follow subsequently, but in this post we start with the basics.
Given a set of tuples with a fixed set of attributes, we can implement different clustering algorithms. Let us pick something really simple, like classifying each tuple into one of three different categories, say X, Y and Z. Let's say the tuples represent points with (x, y) co-ordinates. On a graph with x and y axes, these points appear in the familiar cartesian plane. Visualizing the points on this graph helps us find the clusters X, Y and Z as separate from each other. It also helps us calculate the distance between any two points as the square root of the sum of the squares of the differences in their x-axis and y-axis projections.
One approach is to use a centroid - the point closest to the other points in a cluster - and represent the cluster with that centroid. This works in our case since we can keep track of three points for the three categories we want our points in. This is called K-means, and in our case K = 3.
So the steps involved are something like this (a small sketch follows the list):
1) Initialize the cluster by picking three points for centroids. Pick one point randomly, then the other two points as far away as possible from the previous points.
2) For each point, place it in the cluster whose centroid is nearest, i.e. the distance to that centroid is minimum. Recompute the centroid for this cluster.
3) After all the points are assigned, fix the centroids of these three clusters.
4) Re-assign all the points to their closest centroid.
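A minimal numpy sketch of these steps for our (x, y) points (the number of iterations and the seed are arbitrary choices):

import numpy as np

def kmeans(points, k=3, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # step 1: pick one point at random, then the others as far away as possible
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dists)])
    centroids = np.array(centroids)
    for _ in range(iterations):
        # steps 2-4: assign each point to its nearest centroid, then recompute the centroids
        labels = np.argmin([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = points[labels == i].mean(axis=0)
    return centroids, labels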
Sometimes K is not clear initially.
In that case we may need to try different values of K and watch the changes in the average distance to the centroid. As K increases, this drops rapidly until the right K is reached, after which it changes very little.
This is K-means clustering. Next we will look at optimizing memory and CPU.
For now, we can assume that initialization is handled by random selection, with re-selection of the picks if they are not far enough apart.
In general when there are lots of tuples, we load them in batches that can fit in memory.  Then we maintain three classes of points and use summarization so that we can represent them with shorter data structures and continue our clustering.
For each cluster, the summarization info includes the number of points N the vectors for the sum of the co-ordinates of the points in the ith dimension and the sum of the  square of the co-ordinates in the ith dimension.
We now summarize the Fuzzy SKWIC algorithm below:
Fix the number of clusters C;
Fix m; m can range from 1 to infinity. m goes with the fuzzy memberships, so a suitable value can be chosen to amplify them.
Initialize the centers by randomly selecting C documents.
Initialize the fuzzy partition matrix U = [uij], a C x N matrix, where uij are the fuzzy memberships subject to the following constraints: a) uij is bounded between 0 and 1; b) the sum of uij for j = 1 to N is bounded by 0 and N; and c) the sum of uij over clusters 1 to C must equal 1.
REPEAT
1) Find the cosine distance along the individual dimensions k as 1/n - xjk.cik where the cik is the kth component of the ith cluster center vector. Do this for i ranging from 1 to C, j ranging from 1 to N and k ranging from 1 to n ( n is the total number of terms in a collection of N documents)
2) Before we find the aggregated cosine distance, update the relevance weights vik using the closed-form solution from the partial derivative equations, where vik is assigned a default value of 1/n plus a bias involving the amplified fuzzy memberships and the aggregated differences in cosine distances.
3) Now compute the aggregated cosine based distance for each cluster i from 1 to C and each document j from 1 to N using the relevance weight and the cosine distances along the individual dimensions  and aggregating them
4) If the aggregated cosine based distance from above turns out to be negative, adjust it so that it does not affect the sign of the fuzzy memberships. We adjust by updating it with the minimum magnitude of the aggregated cosine based distance
5) Update the partition matrix U using the ratio of the aggregated cosine based distances Dwcij and Dwckj, with the ratio summed over all clusters. The result is then inverted to update each entry of the matrix.
6) Update the centers by taking into account the fuzzy memberships. Specifically, we set cik to 0 when the relevance weight is zero and set it to fuzzy membership normalized average xjk.  The fuzzy membership is amplified by raising to power m and aggregated over the number of document j ranging from 1 to N in both the numerator and the denominator for normalizing the document frequency vector
7) Update the tau for the ith cluster using the cosine based distances along the individual dimensions and the feature weights, summed over all the terms and divided by the sum of the squares of the feature weights. This is then multiplied by the sum of the amplified fuzzy memberships and a constant K. We find the value of tau by repeating the iterations with the values of the fuzzy membership, the feature weight and the cluster center cik from the previous iteration.
(Continue the iterations) UNTIL centers stabilize.
Note that we have used the fuzzy memberships in the initialization and steps 4,5,6 and 7. So fuzzy SKWIC improves the partition matrix, the center updates and the choice of tau in each iteration.
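Pulling the steps together, here is one possible numpy sketch of a single pass through the loop above. The shapes (X as N x n unit-length document vectors, centers as C x n, U as C x N, V as C x n, tau as a length-C vector) and the constant K are illustrative, and the handling of negative relevance weights and negative aggregated distances is an assumption in the spirit of steps 2 and 4 rather than the exact scheme from the paper:

import numpy as np

def fuzzy_skwic_iteration(X, centers, U, V, tau, m=1.1, K=0.01):
    N, n = X.shape
    C = centers.shape[0]
    tau = np.asarray(tau, dtype=float)
    eps = 1e-12
    Um = U ** m

    # step 1: cosine distance along each dimension, D[i, j, k] = 1/n - x_jk * c_ik
    D = 1.0 / n - X[np.newaxis, :, :] * centers[:, np.newaxis, :]

    # step 2: feature weights, v_ik = 1/n + (1/(2 tau_i)) * sum_j u_ij^m * (D'_ij/n - D_ijk)
    D_agg_prev = np.einsum('ik,ijk->ij', V, D)
    V_new = 1.0 / n + np.einsum('ij,ijk->ik', Um, D_agg_prev[:, :, None] / n - D) / (2.0 * tau[:, None] + eps)
    V_new = np.clip(V_new, 0.0, None)             # assumed adjustment of negative weights
    V_new /= V_new.sum(axis=1, keepdims=True)

    # steps 3-4: weighted aggregated distance, adjusted if any value is negative
    D_agg = np.einsum('ik,ijk->ij', V_new, D)
    for i in range(C):
        if D_agg[i].min() < 0:
            D_agg[i] += np.abs(D_agg[i]).min()    # assumed adjustment following step 4

    # step 5: update the fuzzy partition matrix
    U_new = np.zeros_like(U)
    for j in range(N):
        ratios = (D_agg[:, j][:, None] + eps) / (D_agg[:, j][None, :] + eps)
        U_new[:, j] = 1.0 / np.sum(ratios ** (1.0 / (m - 1)), axis=1)

    # step 6: update the centers with the amplified-fuzzy-membership weighted mean
    Um_new = U_new ** m
    centers_new = (Um_new @ X) / (Um_new.sum(axis=1, keepdims=True) + eps)
    centers_new[V_new == 0] = 0.0

    # step 7: update tau for the next iteration
    tau_new = K * np.einsum('ij,ij->i', Um_new, D_agg) / (np.sum(V_new ** 2, axis=1) + eps)
    return centers_new, U_new, V_new, tau_new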

Friday, October 25, 2013


We will continue to solve our objective function for overlapping clusters so that documents can have fuzzy labels. Our Fuzzy SKWIC objective function has components similar to the one we discussed earlier, i.e. the first component is the error to the clusters and the second component is the feature weights. In addition, the first component now has the fuzzy membership degrees aggregated over the various clusters and terms. The objective function is subject to the condition that the feature weights range between 0 and 1 and that the sum of all feature weights equals 1. The second component has a tau parameter that lets us tune the function so that both components have equal orders of magnitude. This, we mentioned, is required to keep both components equally relevant. Each cluster has its own set of feature weights Vi = vi1, ..., vin.
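Written out in the notation of these posts, the objective function takes roughly the following form (a sketch; the exact formulation and constraints are in the Frigui and Nasraoui paper):

J(C, U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m} \sum_{k=1}^{n} v_{ik} \, D_{wc,ijk} + \sum_{i=1}^{C} \tau_{i} \sum_{k=1}^{n} v_{ik}^{2}
subject to \; v_{ik} \in [0, 1], \quad \sum_{k=1}^{n} v_{ik} = 1, \quad \sum_{i=1}^{C} u_{ij} = 1 \text{ for every document } j.

The first double sum is the fuzzy-membership-weighted error to the cluster centers and the second is the tau-weighted sum of squared feature weights.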
Now we use the same Lagrange multiplier technique to optimize the objective function with respect to V. Since the rows of V are independent of each other, we reduce the objective function to C independent problems.
Then we find the partial differential equations by setting the gradient of the objective function to zero. We do this with respect to both the multiplier and the feature weights.
Solving the differential equations for the feature weight vik, we again get two components: the first term is the default value, 1/n in this case, and the second is the bias, which in this case has the fuzzy membership degree included in the aggregated dot product.
The choice of tau is very important since it reflects the importance of the second term relative to the first term. The consequence of choosing a small tau will result in choosing only one keyword per cluster with feature weight 1 whereas choosing a larger value will result in all the words in a cluster to be relevant and assigned equal weights.
Since the second component of the objective function does not depend on the fuzzy membership, the update equation for the fuzzy membership is similar to Fuzzy C-Means: it is the inverse of the cluster-wise aggregation of the ratios of the aggregated cosine based distances, each raised to the power 1/(m-1).



When we discussed how SKWIC works in the previous post, we described a cluster partitioning method that assigns a document to a single cluster. However, it is rarely the case that a document covers only a single subject. Most documents tend to straddle the subjects of two or more different categories, and even manual classification is difficult and error-prone. When a document is classified into a single group, it affects the retrieval ability once a classification model is built. K-means and SKWIC are both hard partitioning models, and by assigning each document to a single category they have limited use in real, large document collections.
Instead, if we assign soft labels to documents, that is, the documents are not confined to a single category, then this can be improved. Especially when we use fuzzy or soft partition models, the data is modeled better than with their hard partitioning counterparts. This is because fuzzy memberships are richer than crisp memberships in that they describe the degrees of association of data points lying in overlapping clusters.
If we denote this fuzzy membership as uij, which a datapoint xj may have in varying degrees to each cluster i, then the fuzzy partition is described as a C x N matrix U = [uij]. This matrix is subject to the following constraints:
1. The degrees of association uij can range only between 0  and 1. This means that we assign them continuous values that are bounded by 0 and 1
2. The sum of the degrees of association uij over all the N documents must range between 0 and N
3. The sum of the degrees of association uij for all the clusters must equal 1.
We define a new objective function where we take these fuzzy memberships together with the sum of the errors to the cluster center as the first component. This component is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. The other component is the sum of the squared keyword weights; its global minimum is achieved when all the keywords are equally weighted. The goal of the objective function is to reduce the sum of the intra-cluster weighted distances.
We find similar partial differential equations to solve this objective function in the same manner as the previous one.

Thursday, October 24, 2013

In the previous post, we talked about finding tau and adjusting it so that the two components of the objective function have the same magnitude. What we will see next is that the cluster partition that minimizes J is the one that assigns each data sample to the cluster with the nearest center. In other words, the weighted aggregated cosine based distance from xj to its assigned cluster center will always be less than or equal to the weighted aggregated cosine based distance to any other cluster center i. Ties are resolved arbitrarily among candidates that vie for the center.
Since the optimization function cannot be minimized with respect to the centers, we compute new cluster centroids and normalize them to unit-length.
Notice that this results in two cases for the feature weight vik which we may remember from the previous post was the default value (1/n) and the bias. The latter can be both positive and negative.
So the first case is when the vik = 0
This means that the kth feature is completely irrelevant relative to the ith cluster. Hence, regardless of the value of cik, the value of this feature will not contribute to the overall weighted distance computation. Therefore, in this situation, any arbitrary value can be chosen for cik and is generally set to 0.
The second case is when vik != 0
In this case the center is recomputed: cik is equal to the normalized aggregated sum of the document frequency vectors across the different attribute directions, i.e. since the kth feature has some relevance to the ith cluster, we simply pick something close to the mid-point.
Now, let us look at summarizing the above steps for clustering a collection of N normalized document vectors defined over a vocabulary of n keywords:
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partitions X, using equal feature weights (1/n), where n is the total number of terms in a collection of N documents.
REPEAT
Compute the cosine based distance between document xj and the ith cluster center vector using the formula 1/n - xjk*cik, which involves the kth component of the ith cluster center vector. Repeat this for every cluster from 1 to C, for every document from 1 to N, and along every dimension from 1 to n.
Update the relevance weights vik by using the default value and computing the bias along this dimension after computing that across all dimensions
Compute the weighted aggregated sum of the cosine based distances along the individual dimensions (the dot product of the feature weights and the cosine based distances).
Update the cluster partitioning with the method discussed above
Update the tau
UNTIL centers stabilize
The difference between some similar approaches in pattern recognition and here is that the former has an assumption that data has a multivariate Gaussian distribution where as this treats them as independent.
We saw that the objective function is expressed in cluster center errors and feature weights. Each cluster i is allowed to have its own set of feature weights Vi = [vi1, vi2, ..., vin]  The feature weights are of at most unit length and after clustering, the sum of all feature weights vi1 +vi2 ... + vin = 1. If we denote the objective function by J, then by setting the gradient of J to zero, we get the partial derivative equations we wanted to solve. The partial derivatives are with respect to the Lagrange multiplier in the first equation and with respect to feature weight in the second equation.
The objective function has reached its final state when it does not change with regard to changes in the multiplier or the feature weights. That is why we set the gradient to zero and then solve for the feature weight vik.
We solve this equation mathematically to write an equation in terms of vik. This is done only once. The resulting equation for vik has two parts. The first part is a constant, 1/n, and is the default value if all attributes/keywords are treated equally and no discrimination is performed. The second term is a bias that can be either positive or negative. It comprises the aggregated difference between the average and the actual cosine based distances along the individual dimensions. It is positive for compact attributes where the distance along this dimension is, on average, less than the total distance using all of the dimensions. If an attribute is very compact, it is much more relevant to the cluster and could even define the center. Since the individual cosine based distance along a term-wise individual dimension can be negative, we allow the vectors to be dissimilar, and this shows up as a negative value; the resulting bias only increases, emphasizing that dimension further. If the total aggregate dissimilarity becomes negative, then it no longer forms part of the cluster, since we partition the data based on minimum distance. Thus we have obtained the attribute weight vik and are comfortable with the second component of the objective function.
The choice of the tuning parameter tau with the objective function is important because it reflects the importance of the second term with respect to the first term. If the tau for the ith cluster is very small then only one keyword in that cluster will be relevant. The rest will receive zero weights. If it is large, then all the terms in that cluster will be relevant and may receive equal weights. The tau could be chosen such that both the terms of the objective function have the same order of magnitude.
We compute this tuning parameter tau numerically, in iterations. In iteration t we use the aggregated cluster center errors from iteration t-1 divided by the sum of the squared feature weights, also from iteration t-1. The resulting fraction is multiplied by a constant to adjust the magnitude of tau.
The feature relevance values could change to a value in excess of 1 or below 0, which is why they may need adjustment.
In our iterations we may find the values change a lot initially; however, there are little or no changes towards the later iterations.

Wednesday, October 23, 2013

In this post we will take an in-depth look at all the formulas behind each of the steps mentioned in the previous post. Our goal is to understand not only how SKWIC works but also to describe it in stages that enable an implementation. We will assume that, given distance vectors with precomputed distances, we know how to cluster them.
In our case we begin with the similarity measure between document frequency vectors xi and xj.
This similarity is measured as the dot product over their magnitudes. By that we mean we take an arbitrary term k between 1 and n, where n is the number of terms. The components xik and xjk of the document frequency vectors can be calculated for each term. This has to be repeated for k = 1 to n, so we take the aggregated sum of the products xik * xjk over this range and normalize it by dividing by the magnitudes to compute the cosine similarity. Intuitively, cosine similarity can be positive or negative since the vectors may point in different directions. The steps involved are iteration, summation and normalization, so this is easy to do.
We have a computational complexity of O(N^2). We will call this the first step.
Now let's look at the next formula. Here we said we need to find the dissimilarity measure across the different attribute directions. We defined the dissimilarity between document frequency vector xj and the ith cluster center vector using the cosine measure. This we define as the second equation, which requires a third formula for the distance along the individual dimensions.
The third formula is easy because the distance along an individual dimension is simply 1/n minus the dot product of xjk and cik, something smaller than 1/n.
Note that cik is the kth component of the ith cluster center vector, and the cluster centers are initialized from documents themselves, so cik starts out as a component of a specially marked document vector. That is why the third step is easy to compute given that we did the first step.
Next we look at the objective function. This has several components so we will take a look at each of them one at a time. The objective function searches for optimal cluster centers and the optimal set of feature weights simultaneously.
The first part of the objective function is the error to the cluster centers. It is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. So we sum over each of the clusters.
The Second part of the objective function is the sum of the squared keyword weights. When all the keywords have equal weights, we get the global minimum of this component.  We also have a tuning parameter to use with the second component that we can set. Together, the objective function minimizes the sum of intra-cluster weighted distances, where the keyword weights are optimized for each cluster.
Instead of using the objective function directly, it is decomposed into C independent problems. This way we remove the summation over all the clusters, and we use a Lagrange multiplier technique with partial derivatives. To get the equations for the partial derivatives, we set the gradient to zero. We will also look at these equations in detail shortly.
In the previous post, we introduced SKWIC from the paper by Frigui and Nasraoui. In this post, we take a look at how they update keyword weights and their clustering.
SKWIC algorithm is given as :
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partitions X, using equal feature weights (1/n), where n is the total number of terms in a collection of N documents.
REPEAT
Compute cosine based distance between document xj and ith cluster center vector using a formula that involves the kth component of the ith cluster center vector
Compute the aggregated sum of the above
Update the cluster partitioning
Update the centers
Update the delta-i
UNTIL centers stabilize
We will discuss  the steps above. Let xi and xj be the document frequency vectors defined on a vocabulary of n terms. Then the cosine similarity measure between xi and xj to categorize the corresponding documents is taken as the fraction of the dot product of xik and xjk over their magnitudes.
They also introduce a dissimilarity measure between document xj and the ith cluster center vector as the weighted aggregated sum of cosine based distances along the individual dimensions.
Each such distance is bounded above by the inverse of the total number of terms and measures the difference between that inverse and the product of xjk and the kth component of cluster i.
The weighted aggregated sum is not normalized with the magnitudes because both the weights and the distances are normalized to unit length.
The weights in the weighted aggregated sum are dynamically obtained as the relevance weight of keyword k in cluster i.
Thus we have both the similarity measure and the dissimilarity measure for use with the document.
SKWIC is designed to search for the optimal cluster centers C and the optimal set of features weights V simultaneously. Each cluster is allowed to have its own Vi = [vi1,...,vin] and there is an objective function to get to the optimal.

Tuesday, October 22, 2013


While we discussed graph based algorithms, we mentioned that there have been improvements since. We will discuss some of these today, using the survey by Frigui and Nasraoui, who come up with their own technique as well. One improvement has been to use clustering. Clustering is efficient for finding the nearest neighbors, be they terms or documents. We use a vector space model where each term is represented by a vector in the document space, and the document is considered a bag of words. The vectors can be assigned based on term frequencies or other statistical or probabilistic metrics. If term frequencies are used, frequent words with little discriminating power can be discounted with their Inverse Document Frequency in the document collection. Since documents vary a lot, neither TF nor IDF nor their combination is sufficient. So we go by a large corpus where the text has been manually classified, and this provides a rich base for classifying terms or documents. However, both the categories and their respective keyword sets need to be discovered simultaneously. Frigui and Nasraoui mention that selecting and weighting subsets of keywords in text documents is similar to the problem of feature selection and weighting in pattern recognition and data mining. Selecting the best set of features is important for real world tasks because noise from irrelevant features can severely degrade performance. Even if the data samples have already been classified into known classes, the authors suggest that it is preferable to model each complex class by several simple sub-classes or clusters and to use a different set of feature weights for each cluster. This can help in classifying documents into one of the preexisting categories. Since clustering and feature selection have been dealt with independently, the authors suggest a scheme that does both simultaneously. They call this scheme SCAD, which stands for Simultaneous Clustering and Attribute Discrimination.
They also improve on weighting by using a continuous feature weighting method, so the feature relevance representation is better than a binary or other discrete basis. Also, SCAD learns a different feature relevance representation for each cluster in an unsupervised manner. Hence the keyword weighting is both dynamic and category-dependent while being simultaneous with document clustering. One thing to note is that the distance metrics we described can be Euclidean; however, that is not appropriate for text. For text, we rely on cosine similarity or the Jaccard index to assess the similarity between documents. Moreover, a text document can belong to several categories with different degrees. This is handled by assigning fuzzy or soft labels instead of assigning documents to a single label category. There are also noise magnet clusters that take away the outliers from the other clusters. So SCAD is refined to work on text documents, and this is called Simultaneous Keyword Identification and Clustering of text documents (SKWIC).

We briefly discuss other synonym extraction methods. One is the Distance method and another is the ArcRank method. In the distance method, one possible way of defining a synonym distance is to see whether the words appear in the definitions of many common words or have many common words in their definitions. A way of formalizing this is to count the number of words that appear in one of the definitions but not in both, in which case the distance can be expressed as |Ai - Aj| + |Aj - Ai|, where A is the adjacency matrix.
Unlike the other methods presented here, we can apply this algorithm directly to the entire dictionary graph rather than the neighborhood graph. However, it has been found that better results are obtained when the neighborhood graph is used.
ArcRank is a method for building a thesaurus. The intent of the approach was not to find synonyms but related words. Both the ArcRank method and the method discussed in the previous post used the 1913 Webster dictionary as the source. Here a pagerank is associated with each vertex of the dictionary graph in the following way: all vertices start with an identical initial ranking and then iteratively distribute it to the vertices they point to, while receiving the sum of the ranks of the vertices pointing to them. The normalized ranking converges to the principal eigenvector of the adjacency matrix of the graph, so there is a natural order in the ranking. The algorithm is slightly modified so that the edge cases, i.e. nodes with no incoming or no outgoing edges, don't take extreme values. If as is the number of outgoing edges of vertex s and pi is the pagerank of vertex i, the edge relevance between s and t is defined as r(s,t) = (ps/as)/pt.
These edge relevances are then converted to rankings, which are computed only once. Related words for a word w can then be found by selecting the edges with the best rankings and extracting the incident vertices.
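A small sketch of that edge relevance computation, assuming the dictionary graph is a dict mapping each word to the words in its definition and that the pageranks have already been computed (both inputs are hypothetical placeholders):

def edge_relevances(graph, pageranks, word):
    # r(s, t) = (p_s / a_s) / p_t, where a_s is the number of outgoing edges of s
    relevances = {}
    for s, targets in graph.items():
        for t in targets:
            if word in (s, t):
                relevances[(s, t)] = (pageranks[s] / len(targets)) / pageranks[t]
    # related words are the other endpoints of the best-ranked incident edges
    return relevances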

Monday, October 21, 2013

In today's post we describe the automatic discovery of similar words in a dictionary. In particular we discuss the method proposed by the same authors as in the previous post. This is a refinement over the generalized method provided by Kleinberg for searching the web. These methods could have  been improved by others but their reference here is a good read nevertheless.
First we construct a dictionary graph G where each word is a vertex in G and there is an edge from u to v if v appears in the definition of u. Then associated with a given query word we construct a neighborhood graph which is a subgraph of G whose vertices are those that are pointing to or pointed by the vertex in G.
The problem is simplified to searching for authoritative pages in a subgraph that is focused on the original page. These pages are similar to the hubs in the structure graph.  An authority page is one that categorizes the other pages such as automobile makers for Ford, Toyota and other car makers whereas web pages that list these as home pages are hubs.
The method to find hubs and authorities proceeds by initially setting a weight of 1 for both the hub and authority scores and then simultaneously updating them according to a mutually reinforcing rule: the hub score of vertex i is set equal to the sum of the authority scores of the vertices pointed to by i, and similarly the authority score of vertex j is set equal to the sum of the hub scores of all vertices pointing to j. The hub and authority scores can be seen as measures of similarity between the vertices of the graph and the two vertices of the hub-authority structure graph.
However hub and authority pages are both not good at discovering synonyms.
This can be improved by maintaining three scores, or, more generally, as many scores as there are vertices in the structure graph. We then iteratively update the scores simultaneously based on a mutually reinforcing rule: the first score of vertex i is set to the sum of the second scores of the vertices pointed to by i, the second score is set equal to the sum of the third scores of the vertices pointed to by i plus the sum of the first scores of the vertices pointing to i, and the third score is set to the sum of the second scores of the vertices pointing to i. Let M be the adjacency matrix associated with G. The process converges, with the list of synonyms given by the words in decreasing order of their entries in the principal eigenvector of M·M^T + M^T·M, to which the second scores converge.
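A small numpy sketch of that iteration, assuming M is the adjacency matrix of the neighborhood graph built around the query word:

import numpy as np

def synonym_ranking(M, iterations=50):
    # x1, x2, x3 are the similarity scores with the three vertices of the structure graph 1 -> 2 -> 3
    n = M.shape[0]
    x1 = x2 = x3 = np.ones(n)
    for _ in range(iterations):
        x1, x2, x3 = M @ x2, M @ x3 + M.T @ x1, M.T @ x2
        norm = np.linalg.norm(np.concatenate([x1, x2, x3]))
        x1, x2, x3 = x1 / norm, x2 / norm, x3 / norm
    # x2 converges to the principal eigenvector of M M^T + M^T M; rank words by decreasing x2
    return np.argsort(-x2)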
I came across a paper on Automatic Discovery of Similar Words by Senellart and Blondel. It talks about algorithms to extract similar words from a large corpus of documents. In this paper it refers to a partial syntactic  analysis from another paper which is interesting. This method is called the SEXTANT (Semantic Extraction from Text via Analyzed Network of Terms) and uses the steps as mentioned below:
Lexical Analysis :
Words in the corpus are separated using a simple lexical analysis. Proper names are also recognized. Then each word is looked up in a lexicon and is assigned a part of speech. If a word has several possible parts of speech, a disambiguator is used to choose the most probable one.
Noun and verb phrase bracketing:
Noun and verb phrases are then detected in the sentences of the corpus. This is done with starting, ending and continuation rules such as a determiner can start a noun phrase, a noun can follow a determiner in a noun phrase, an adjective cannot start, end or follow any kind of word in a verb phrase and others.
Using multiple passes to parse the text, several syntactic relations or contexts are then extracted from the bracketed phrases.
These are denoted as follows
ADJ : an adjective modifies a noun eg. civil unrest
NN :  a noun modifies  a noun eg. animal rights
NNPREP : a noun that is the object of a preposition modifies a preceding noun eg. measurements along the chest
SUBJ: a noun is the subject of a verb. eg. the table shook
DOBJ:  a noun is the direct object of a verb. (e.g. shook the table )
IOBJ: a noun in a prepositional phrase modifying a verb. (eg. the book was placed on the table)

This kind of parsing was computationally intensive and multiple passes over large corpus added to it. Hence only the few above were used.
Only the similarity between nouns is focused on; other parts of speech are not discussed. After the parsing step, a noun has a number of attributes: all the words that modify it, along with the kind of syntactic relation. This leads to high dimensionality; for example, a noun that appears 83 times may have 67 unique attributes.
Each attribute is assigned a weight using the probability of the attribute among all those for the noun.
These attributes are then used with a similarity measure such as a weighted Jaccard similarity.
An interesting observation from such similarity measures is that the corpus has a great impact on the meaning of the word according to which similar words are selected. 

Sunday, October 20, 2013

Search algorithms are of different types. We discuss four algorithms
1) Unordered Linear Search algorithm : This is very similar to a student searching for an exam score in a collection of unordered exams
2) Ordered Linear Search : Sorting helps exclude portions of the list during search.
3) Chunk Search : This is the search typically used in a phone book. A few pages may be looked at a time. These are called chunks. Each chunk could then be searched using an ordered linear search.
4) Binary Search : We use a divide and conquer approach where we split the data set into two halves and proceed along only one half of the set at each level.
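A minimal sketch of the last of these, on a sorted list of exam scores (the sample data is made up):

def binary_search(sorted_scores, target):
    # repeatedly halve the portion of the list that can still contain the target
    low, high = 0, len(sorted_scores) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_scores[mid] == target:
            return mid
        elif sorted_scores[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1   # not found

print(binary_search([55, 62, 70, 81, 90, 97], 81))   # 3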
I mentioned the Search algorithms as an introduction to a book I want to pick up for readings on this topic. But meanwhile today I'm experimenting with an https based web service deployment to cloud that will try out OAuth redirections.
This is very similar to a usual deployment of any .Net web application to the cloud but differs in the following steps:
1) obtain an SSL certificate. This is done by generating a certificate signing request with IIS Manager to send to the Certificate Authority.
2) After submitting the CSR to the Certificate Authority, an SSL certificate is obtained.
3) The CSR is then completed with the certificate provided by the certificate authority.
4) The certificate is then exported from the IIS Manager.
5) It is then added to the certificate store for the hosted service.
6) The thumbprint of the certificate is then added to the service configuration file.
7) The service definition file is updated to identify the certificate to the hosted service.
8) The HTTPs endpoint is then configured in the service definition file.
9) a CNAME record for the domain name is then setup. This is the name the SSL certificate is issued against to redirect traffic to the service.
The HTTPS endpoint can be tested in a standalone environment where all authentication occurs using a self-signed certificate provided for localhost.

With the development of the Cassini web server or IIS Express, it is easier to develop the web service for HTTPS. In the properties view of the project, one can set SSL Enabled to true.

Intermediate tutorial on data mining
We continue our discussion on using Analysis Services to provide an integrated environment for creating and working with data mining models. We mentioned how we bind to data sources, create and test models and use with predictive analysis. In this tutorial, we build on the previous review to include new scenarios such as forecasting and market basket analysis.
In forecasting we create a time series model to forecast the sales of products in different regions around the world. You will develop individual models for each region and learn how to use cross prediction.
In market basket analysis, we will create an association model to analyze groupings of products that are purchased together online so that we can recommend products to customers.
A forecasting structure is created from an existing relational database or data warehouse. The Microsoft Time Series algorithm is selected, and the SalesByRegion table is used in this case.
The input and predictable columns can then be selected, followed by specifying column contents and data types. The columns can be dragged and dropped into the forecasting mining structure and its mining models.
The forecasting model can be customized with the FORECAST_METHOD and PREDICTION_SMOOTHING parameters. The forecast method parameter controls whether the time series algorithm is optimized for short-term or long-term predictions. The prediction smoothing parameter controls how short-term and long-term predictions are blended; a fifty-fifty split typically works.
Missing data can also be handled.
The forecasting model can be explored with a mining model viewer. Predictions and deviations can both be viewed here.
The market basket analysis follows a similar path. The mining structure is created using the wizard by specifying the association rules algorithm and the available data sources. We select the option to allow drill-through and display the mining structure we create.
In this case, the algorithm parameters are MINIMUM_PROBABILITY and MINIMUM_SUPPORT. Typical values are 0.1 and 0.01 respectively.
The association viewer contains three tabs: Rules, Itemsets, and Dependency Network. The dependency network allows us to investigate the interaction of the different items in the model. Each node in the viewer represents an item, while a line between two nodes represents a rule; a line connecting two items means that these items are likely to appear in a transaction together.
Associations from the model can then be used to make predictions. This is done by building prediction queries against the mining models.
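To make the two thresholds concrete, here is a small, self-contained Python sketch that counts itemset support and rule confidence over a toy set of transactions. It only illustrates the idea behind MINIMUM_SUPPORT and MINIMUM_PROBABILITY; it is not the Microsoft Association Rules algorithm, and the item names are made up.

```python
from itertools import combinations

# Toy transactions; item names are hypothetical.
transactions = [
    {"bike", "helmet"}, {"bike", "bottle"}, {"helmet", "bottle"},
    {"bike", "helmet", "bottle"}, {"bike", "helmet"},
]
MINIMUM_SUPPORT = 0.01      # fraction of transactions containing the itemset
MINIMUM_PROBABILITY = 0.1   # confidence of the rule A -> B

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate simple one-to-one rules A -> B that pass both thresholds.
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    for lhs, rhs in ((a, b), (b, a)):
        sup = support({lhs, rhs})
        conf = sup / support({lhs}) if support({lhs}) else 0.0
        if sup >= MINIMUM_SUPPORT and conf >= MINIMUM_PROBABILITY:
            print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```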

Saturday, October 19, 2013

This post is about the Microsoft basic data mining tutorial on MSDN. The goal is to create a mining database, set up several data mining models, and use the models to make predictions. This feature is provided as part of Analysis Services, and I wanted to try out the Microsoft SQL Server 2012 Data Mining Add-Ins together with Excel, but it turns out that I'm running into an installation bug with the add-in that keeps asking for database permissions for my login even when they are all there. It doesn't occur with any other database connections. Nevertheless, I will first lay out the interpretations from the text and then attempt to work around the bug. Meanwhile, back to the data mining tutorial.
The tutorial teaches us how to create and work with different types of data mining models. It also teaches us how to create a copy of a mining model and apply a filter to it. After the model is complete, we can use it to drill through to results.
At the time of creating the model, we can split the data into training and test sets. This lets the accuracy of the model be assessed on held-out data.
Model filters can be applied to both the training and test sets.
Drill-through follows the patterns identified by the mining model back to the supporting cases in the data source.
The first step in these is to use the Analysis Services Multidimensional and Data-Mining Project templates. We can change the instance where the data mining objects are stored.
The data source is specified using the connection manager by selecting the Native OLE DB\SQL Server Native Client provider. The tables and views can then be selected from the source database.
The second step is to build a targeted mining model structure. This can be done by selecting the definition method from existing relational database or data warehouse and then choosing say the Microsoft Decision trees as the data mining structure. Table types and training data will then need to be specified. Specifying the data type and the content type will be next.  There is a wizard option  that automatically detects these but they may need to be reviewed.
The test data set is then carved out from the sample. The default value is thirty percent and this could be retained as-is.
Adding new models is easy: switch to the Mining Models tab in SSDT, right-click the structure column, and select the option to create new models. Naive Bayes and clustering models can be added in the same way.
Each of the models can then be explored. For example in the decision tree, we can view all the nodes.
The models can then be tested using lift charts. A lift chart helps show how well each model makes predictions and lets us compare the results of each model directly against those of the other models.
After the accuracy has been found satisfactory,  the prediction query builder can then be used to design and run prediction queries.
In this post, we will talk about fairness, both weak and strong. Every program has a skip action, and a program generally has multiple actions. In an informal model, we reach into a hat and pick an action. If we are very unlucky, we pick the skip action over and over again; none of the others are ever chosen, and the program does nothing. To prevent this kind of selection, we impose a fairness requirement on the selection of actions. There are two kinds of fairness - weak and strong.
Under weak fairness, every action is guaranteed to be selected infinitely often. Said another way, between any two selections of the same action, there are a finite number of selections of other actions. There is, however, no guarantee of how many actions may be selected before all actions have been selected at least once.
We take an example where the program maintains a boolean variable b and a counter n. One action increments the counter modulo four, assigning n the result; another assigns false to b when n is zero.
A directed graph can be drawn to represent the program, with the vertices as states of the program and the directed edges representing the actions. Weak fairness then says that each edge label is guaranteed to repeat infinitely often along an infinite path.
If the program begins execution in a state satisfying the predicate n = 1 with initial state <false,1>, then every possible computation leaves the state unchanged. If, on the other hand, the initial state is <true,1>, then there is no guarantee that the state reaches the fixed point <false,1>; it can cycle repeatedly through <true,1>, <true,2>, <true,3> and <true,0>. The action that assigns false to b must be selected an infinite number of times, but it may be selected only when n = 3, where it has no effect. In such a schedule every action is selected infinitely often, so the weak fairness requirement is satisfied.
Under strong fairness, in addition to the weak fairness requirement, any action that is enabled infinitely often must be selected infinitely often while it is enabled. Note that in the previous example the selection was extraordinarily unlucky, because one action was only chosen at particular times when it happened to be disabled. Under strong fairness, the cycle would not be allowed to repeat forever without selecting the action that assigns false to b while it is enabled, so the program would be guaranteed to reach the fixed point and terminate.
In general, a property proved assuming only weak fairness also holds under strong fairness. However, if a property relies on strong fairness, it may or may not hold under weak fairness alone. A program designer can therefore choose to assume only weak fairness so that the property holds under either scheduler.
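A minimal Python sketch of this example. The guards are my reading of the description above and are an assumption: the increment is taken to be enabled only while b is true, and the reset of b only when n is zero. The "unlucky" schedule is weakly fair (every action is picked infinitely often) yet never terminates, while a scheduler that also picks the reset while it is enabled reaches a fixed point with b false.

```python
# Two guarded actions, assumed for illustration:
#   inc:   enabled while b is true; sets n to (n + 1) mod 4
#   reset: enabled when n is zero;  sets b to false
def enabled(state, action):
    b, n = state
    return b if action == "inc" else n == 0

def apply_action(state, action):
    if not enabled(state, action):
        return state                       # a disabled action behaves like skip
    b, n = state
    return (b, (n + 1) % 4) if action == "inc" else (False, n)

def run(pick, state=(True, 1), steps=20):
    for i in range(steps):
        state = apply_action(state, pick(state, i))
    return state

# Weakly fair but unlucky: starting from (True, 1), this pattern selects "reset"
# exactly once per cycle, and always when n == 3, i.e. when it is disabled.
unlucky = lambda state, i: "reset" if i % 5 == 2 else "inc"

# Strongly fair in spirit: select "reset" whenever it is enabled.
eager = lambda state, i: "reset" if enabled(state, "reset") else "inc"

print(run(unlucky))   # still some <True, n>: the cycle repeats forever
print(run(eager))     # a state with b == False: the program reaches a fixed point
```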


Friday, October 18, 2013

In today's post we will talk about one more distributed computing topic before we move on to the next subject: time synchronization of physical clocks. We mentioned that virtual clocks and logical time can capture a partial order of events in a distributed system. In this post, we talk about a real clock. Such a clock advances at an appropriate rate, e.g. one tick per millisecond, and each process may have its own clock.
The issue here is that the current time displayed by the clock is not shared state. Moreover, a watch that ticks accurately is not enough: it cannot by itself be used to tell the time, though it can be used to time the duration of an interval. The clocks need to be synchronized to tell the time, and even then they drift, requiring synchronization again.
Consider a process receiving a time notification from another process. The receiving process has no way of knowing the delay from the transmission. One solution involves sending a request first and then waiting for an echo; the requesting process then knows the total elapsed round-trip time.
Since the process knows the timestamp on the echo, it can split the round-trip time to bound the delay in receiving the echo.
Sending a request and receiving an echo can be repeated with the hope to improve accuracy. If the delays are the same, there is no improvement. If the delays vary, then the repetitions narrow the possible interval. This improves accuracy of the clock.
As an example, let's say the timestamp on the first echo is 3:00 and that on the second echo is 3:30. If the elapsed round-trip time for the first was ten minutes and that for the second was twelve minutes, we have two intervals for the current time at the other process. The intersection of these two intervals is much smaller, which improves the accuracy in predicting the time at the other process.
When this is repeated with several servers, and each server yields its own interval, the intersection of all these intervals is even narrower, further increasing the accuracy of the time.
It might happen that the intersection for the intervals could come out to be empty.
This means that one or more of the servers have a wider drift. So we take each interval as a likely vote and go with the majority.
One way to do such a tally is to see how many intervals contain a given time. We increment a counter whenever we enter an interval and decrement it whenever we leave one, updating the minimum and maximum bounds on each entry and exit. At the maximum count of overlapping intervals, we know the intersection that the majority agrees on.
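A minimal sketch of that tally, in the spirit of Marzullo's algorithm, assuming each server's estimate is given as a (low, high) interval; the sample intervals are made up.

```python
def best_interval(intervals):
    """Return the count and the (low, high) range covered by the most intervals."""
    # Turn each interval into entry (+1) and exit (-1) events, sorted by time.
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))
        events.append((hi, -1))
    events.sort()

    best, count, best_range = 0, 0, None
    for i, (t, delta) in enumerate(events):
        count += delta
        # The range between this event and the next is covered by `count` intervals.
        if i + 1 < len(events) and count > best:
            best, best_range = count, (t, events[i + 1][0])
    return best, best_range

# Hypothetical time offsets (in minutes) reported by four servers.
print(best_interval([(0, 10), (2, 8), (3, 12), (20, 25)]))
# -> (3, (3, 8)): three of the four intervals agree on [3, 8]
```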

Thursday, October 17, 2013

In some of the previous posts we discussed distributed computing. My next book reading is going to be on search.
Meanwhile here is a quick summary of things that we discussed.
1) We talked about reasoning of programs and how to give proof of correctness for the program.
2) We talked about some sample programs such as Earliest Meeting time and Greatest Common Divisor of two integers X and Y
3) We talked about time, clocks and synchronization. In particular we discussed what a logical time is and how to maintain a logical clock. There are vector clocks  that keep track of time at all other processes. We talked about synchronizing clocks.
4) We talked about diffusing computations or gossip and how we know that it progresses and terminates. We mentioned interesting applications of gossip.
5) We discussed mutual exclusion as a layer that resolves conflicts between processes.
6) We discussed solutions for mutual exclusion that relied on distributed atomic variables and non-token based solutions including Lamport's algorithm of sending acknowledgements and optimizations.
7) We discussed token based solutions including a simple token ring, a token ring with requests, and a token tree.
8) The token tree approach let us generalize a token graph solution. The key ideas were that tokens were neither created nor destroyed which guarantees safety.
9) The tree is directed and the token is always at the root. A process sends only one request at a time and may maintain a list of pending requests. The tree is generalized to a partial order; to maintain the partial order, all edges are made incoming toward the node that holds the token.
10) We discussed dining philosophers problem and the partial order solution aka hygienic solution. The key idea was to break the symmetry by forming a partial order. To lower a philosopher in the partial order, make all the edges outgoing. Tokens called forks are used for mutual exclusion. Priority is encoded in clean and dirty forks. Request tokens are used to determine whether a  neighbor is hungry.
11) We discussed snapshots and how to cut through timeline for various processes.  We discussed solutions to taking snapshot including logical time based solution and marker algorithm. We included a proof of correctness. The marker algorithm was meant to flush the channel of messages.
12) We discussed termination detection and the use of special detector process. We applied snapshot to help with the detection.
13) We discussed garbage collection and its relationship to termination detection. We discussed the principle of superposition and the use of a marker with the mutator. We made first and second attempts with a propagator to spread the marks in a gossip-like manner and to check which garbage is not a root, i.e. it's manure.
14) We talked about tolerating faults in a distributed system both for byzantine failures and crashes. We talked about the use of authenticated messages and building a consensus tree.
15) We also talked about discrete event simulation with sequential and time driven simulation. We talked about conservative approaches using null events and look aheads. We also talked about optimistic approaches where we allow mistakes that can be undone i.e allow stragglers and do roll-backs.

Wednesday, October 16, 2013

In addition to the previous post, we can consider some optimizations as well. In the basic algorithm, when a request enters the queue it is acknowledged immediately. However, not all acknowledgements are needed. For example, if this process has already sent a request with a later timestamp, then that request acts as the acknowledgement: its timestamp guarantees a known time for this process that is greater than the request time.
Another optimization, from Ricart-Agrawala, eliminates acknowledgements for even more requests. If we have already sent our own request with an earlier timestamp than a received request, and our request is still pending, there is no need to send an acknowledgement; when our request is granted and then released, the release message serves as the acknowledgement. So when a process Pi receives a <request, timestamp> from a process Pj, it defers the acknowledgement if Pi is in its critical section or if Pi's own pending request has an earlier timestamp than Pj's. When Pi leaves the critical section, it sends out the deferred acknowledgements.
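A minimal sketch of that deferral rule, with made-up message handling; timestamps are (logical clock, process id) pairs so that ties break deterministically. This only illustrates the rule above, not a full Ricart-Agrawala implementation.

```python
class Node:
    def __init__(self, pid):
        self.pid = pid
        self.in_cs = False
        self.pending = None        # (clock, pid) of our own outstanding request, if any
        self.deferred = []         # processes whose acknowledgements we postponed

    def on_request(self, ts, sender):
        """Handle <request, ts> from sender: acknowledge now or defer."""
        if self.in_cs or (self.pending is not None and self.pending < ts):
            self.deferred.append(sender)     # our claim is older, or we are in the CS
        else:
            self.send_ack(sender)

    def on_exit_critical_section(self):
        self.in_cs = False
        self.pending = None
        for sender in self.deferred:         # the release doubles as the acknowledgement
            self.send_ack(sender)
        self.deferred = []

    def send_ack(self, sender):
        print(f"P{self.pid} -> P{sender}: ack")   # stand-in for a real network send

# Example: P1 has a pending request at clock 3, so P2's request at clock 5 is deferred.
p1 = Node(1)
p1.pending = (3, 1)
p1.on_request((5, 2), sender=2)   # deferred: our timestamp (3, 1) is earlier
p1.on_exit_critical_section()     # the acknowledgement is sent now
```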
This completes the non-token based approach for mutual exclusion that we wanted to discuss.
Reverting to the program proof of correctness topic, here is one more:
Take the program to find the greatest common divisor of two given integers. The program is described as follows.
Let X and Y be two given positive integers; we compute values x and y such that, at termination, x = y = gcd(X, Y).
Initially the program sets x = X > 0 and y = Y > 0.
At each step the program assigns x := x - y if x > y, or y := y - x if y > x. By doing this repeatedly, the program converges to the gcd.
From the above, the fixed point of the program can be interpreted as: y = 0 whenever x > y, and x = 0 whenever y > x. Combined with the invariant that x and y stay positive, the fixed point amounts to x = y.
The invariant is that gcd(x, y) remains equal to gcd(X, Y) and that both x and y stay positive integers.
We can use x + y as the metric: it is bounded below and decreases on every effective step because a positive number is subtracted from either x or y. The metric is guaranteed to change because, whenever x and y differ, the next step decreases their sum. Thus the program terminates.
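A minimal sketch of this program in Python, with assertions for the invariant and the metric; the variable names follow the description above.

```python
import math

def gcd_by_subtraction(X, Y):
    """Compute gcd(X, Y) for positive integers by repeated subtraction."""
    assert X > 0 and Y > 0
    x, y = X, Y                            # initially x = X > 0 and y = Y > 0
    while x != y:                          # fixed point not yet reached
        metric = x + y
        if x > y:
            x = x - y
        else:
            y = y - x
        assert x > 0 and y > 0             # invariant: both stay positive
        assert math.gcd(x, y) == math.gcd(X, Y)   # invariant: the gcd is preserved
        assert x + y < metric              # the metric strictly decreases, so we terminate
    return x                               # at the fixed point, x = y = gcd(X, Y)

print(gcd_by_subtraction(12, 18))          # 6
```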


Non-token based solutions to mutual exclusion in distributed computing include Lamport's algorithm. There was a single centralized pending request queue which allowed the processes to enter their critical sections on a first come, first served basis. This queue is just a data structure, and just as we could support a distributed version of any other data structure, say an integer with increment or double operations, we can use the same strategy to distribute the pending request queue.
Each process keeps a queue of time-stamped requests for critical section sorted in ascending order and a list of known times for all other processes.
To request entry to its critical section P broadcasts <req,t> to all other processes.
When Pj receives a request to enter critical section, a timestamped acknowledgement is returned.
To enter the critical section, P must have req at the head of the reqQ and its timestamp should be smaller than all the known times for all other processes.
On the other hand, to release the critical section, P removes the request from its reqQ and broadcasts a release message to all the other processes.
When a process receives a release message, it removes the P's request from its reqQ. Note that this may cause this process to enter its critical section since it may now have the request with the minimum timestamp.
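A condensed sketch of these rules in Python, showing one process's local bookkeeping; message passing and the logical clock updates are stubbed out, so this only illustrates the entry condition.

```python
import heapq

class LamportMutex:
    """Local state of one process in Lamport's mutual exclusion algorithm (sketch)."""
    def __init__(self, pid, peers):
        self.pid = pid
        self.req_queue = []                      # heap of (timestamp, pid) requests
        self.known_time = {p: 0 for p in peers}  # latest timestamp seen from each peer

    def request(self, ts):
        heapq.heappush(self.req_queue, (ts, self.pid))
        # broadcast <req, ts> to all peers (omitted)

    def on_request(self, ts, sender):
        heapq.heappush(self.req_queue, (ts, sender))
        self.known_time[sender] = max(self.known_time[sender], ts)
        # return a timestamped acknowledgement (omitted)

    def on_ack(self, ts, sender):
        self.known_time[sender] = max(self.known_time[sender], ts)

    def can_enter(self):
        """Enter the CS iff our request heads the queue and precedes all known times."""
        if not self.req_queue or self.req_queue[0][1] != self.pid:
            return False
        my_ts = self.req_queue[0][0]
        return all(my_ts < t for t in self.known_time.values())

    def release(self):
        heapq.heappop(self.req_queue)            # remove our own request from the head
        # broadcast a release so peers drop our request too (omitted)

# Example: P1 requests at time 2, hears later timestamps from P2 and P3, and may enter.
m = LamportMutex(1, peers=[2, 3])
m.request(2)
m.on_ack(5, 2)
m.on_ack(6, 3)
print(m.can_enter())   # True
```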
We can prove that this works by contradiction. Suppose two processes were both in their critical sections. These processes Pa and Pb would have <reqa, ta> and <reqb, tb> respectively at the head of their sorted reqQs.
Assume without loss of generality that ta < tb. Since Pb is in its critical section, tb must be less than the known time of Pa at Pb. Hence reqa, with its earlier timestamp ta, must already be in the reqQ at Pb, and therefore reqb is not at the head of Pb's reqQ, which is a contradiction.
Thus we know that only one process can be in the critical section meeting our goal.

Tuesday, October 15, 2013

When defining the correctness of a program, it is helpful to do it in steps. First, consider the invariant of the program and a fixed point. If the program reaches the fixed point, it terminates, and if the program terminates we know both conditions are satisfied: the invariant holds and the fixed point has been reached, which together give the desired result.
The second step is to establish that the program indeed terminates. This step involves finding a metric. A metric must be guaranteed to change eventually, otherwise the program does not progress. Further, a metric should be non-decreasing and bounded above, or non-increasing and bounded below, so that it indicates the fixed point is eventually reached.
Skilled programmers can correlate this to recursive functions. The termination condition for the recursion is specified first so that the function can terminate, and the quantity that shrinks toward that condition on each recursive call plays the role of the metric, enforcing convergence to the fixed point.
Take the example of the Earliest meeting time. This problem consists of finding the first time at which three people are all simultaneously available. The three people are named A, B and C. With each person, we associate a function f, g and h respectively. These functions represent the earliest time each is available. For example, f.t is the earliest time at or after time t at which A is available.
f.t = t means that A is available at time t.
The earliest time that all three can meet is M, the minimum t at which t = f.t = g.t = h.t.
To calculate M, we maintain a variable r that denotes a candidate time. We initialize it to zero. At each step we assign r := f.r, r := g.r, or r := h.r.
The goal is to get to a fixed point which we define as
r = f.r = h.r = g.r
This fixed point implies that r >= M. The value of r is guaranteed not to decrease because f.t >= t for every t.
The invariant we define as r <= M
Therefore at termination r = M
The steps above are in the order we stated earlier for discussing the correctness of a program. We already noted the fixed point and the invariant; what remains is the metric. We guarantee that r will change whenever r < M: in that case at least one of the persons is not available at time r, otherwise all three would be available and r would equal M (proof by contradiction). Therefore one of f.r, g.r, or h.r exceeds r, and the corresponding assignment increases the metric.
Since r increases, never exceeds M, and keeps changing until it reaches M, we know that the program terminates.
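A minimal sketch of this fixed-point iteration in Python, assuming each person's availability function is given; the example calendars are made up.

```python
def earliest_meeting(avail_fns, start=0):
    """Iterate r := f.r over the availability functions until r = f.r for all of them."""
    r = start
    while True:
        for f in avail_fns:
            nxt = f(r)
            assert nxt >= r            # each f.t is at or after t, so r never decreases
            if nxt > r:
                r = nxt
                break                  # r changed; start another round
        else:
            return r                   # fixed point: r = f.r for every person

# Hypothetical availability: earliest time at or after t that each person is free.
f = lambda t: t if t % 2 == 0 else t + 1              # A is free at even hours
g = lambda t: t if t % 3 == 0 else t + (3 - t % 3)    # B is free at multiples of 3
h = lambda t: max(t, 6)                               # C is free from hour 6 onward

print(earliest_meeting([f, g, h], start=1))           # 6: the earliest common time
```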
In the previous post we discussed the conservative approach to handling event queues, together with its refinements and extensions. By conservative we meant that we handle only events that are safe: events we can process knowing that no event with an earlier timestamp can still arrive.
In today's post, we discuss the optimistic approach, which does not require that condition to be met. The processes simulate their queues with the hope that events with an earlier timestamp will not arrive. If an event with an earlier timestamp does arrive, the process must undo part of the simulation. This going back in time is called time warp, and an event that arrives with an earlier timestamp is called a straggler. The optimistic algorithms deal with stragglers.
To address the straggler that arrives with a timestamp t, the process records the last valid state before time t. The process then rolls back to this last valid state.
However, events may have been sent to other processes after time t. These events must be undone as well. So the process can send a so called anti event to these other processes. This anti-event cancels the corresponding event.
The anti-event can arrive in two different situations. It could arrive with a timestamp greater than the current local time; in that case the corresponding event has not been processed yet, so the cancellation simply removes the corresponding event from the queue.
The second case is when the anti-event arrives with a timestamp less than the current time: the anti-event is itself a straggler. To handle this straggler, more anti-events may need to be fired, so there can be a cascade of anti-events. How do we know that this does not continue indefinitely? We use the concept of a global virtual time (GVT), the minimum timestamp in the system. There are two advantages now.
The algorithm never rolls back to before GVT.
The event scheduled at time GVT can be executed without causing a rollback.
In the approach discussed so far, we need not roll back all actions. As an optimization, we can skip the rollback for events that would happen regardless. If a process proceeded optimistically on an assumption that later turned out to be false, it may be that the same events would have been scheduled anyway. So lazy cancellation is introduced: after a process rolls back, it resumes the simulation without sending anti-events, and an anti-event is sent only when the optimistic run generated an event that the new run does not generate.
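A small sketch of the straggler handling described above, assuming a process keeps a list of saved states and a log of events it has sent; the details (checkpoint granularity, delivery of anti-events) are simplified for illustration.

```python
class TimeWarpProcess:
    """Sketch of rollback on a straggler: restore state, cancel later sends."""
    def __init__(self, state):
        self.now = 0
        self.state = state
        self.saved = [(0, state)]      # (time, state) checkpoints
        self.sent = []                 # (time, event, destination) log of outputs

    def save(self):
        self.saved.append((self.now, self.state))

    def on_straggler(self, t):
        # Roll back to the last saved state strictly before time t.
        while self.saved and self.saved[-1][0] >= t:
            self.saved.pop()
        self.now, self.state = self.saved[-1]
        # Emit anti-events for everything we sent at or after time t.
        anti = [(ts, ev, dst) for ts, ev, dst in self.sent if ts >= t]
        self.sent = [(ts, ev, dst) for ts, ev, dst in self.sent if ts < t]
        return anti                    # the caller delivers these cancellations

p = TimeWarpProcess(state="idle")
p.now, p.state = 5, "busy"
p.save()
p.sent.append((6, "complete c1", "sink"))
print(p.on_straggler(4))   # [(6, 'complete c1', 'sink')] and the state rolls back to time 0
```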

Monday, October 14, 2013

Today we will continue our discussion on the DES and see if we can consider each source and server to be a separate process. Clients are passed as messages between the processes. We assume a queue for transit so that the messages appear sequentially.
Now since every process maintains a single queue of events, it has to decide when to process each one. If it processed an event right away, there might be another message from a different source that should have been processed earlier. So the server has to wait for incoming messages on all its inbound channels before deciding which one to execute. But this waiting could lead to a deadlock, because two processes could each be waiting on the other's inbound channel. In order to advance, we must ensure that such events arrive.
One solution is to send update events called null events. They don't do anything other than break the impasse and ensure progress, but even the null events need to be timestamped properly.
Since null events don't take any time to process, they don't advance the clock by themselves. If each process in the cycle waited for an event carrying the minimum time on its empty queue, that is exactly the deadlock. So a null event has to be timestamped strictly larger than the current time at the process that sends it, and it is then sent out to all downstream processes.
The null events break the impasse because they represent a guarantee that no events will arrive earlier than this event. The increment of the null message timestamp above the current time is the lookahead.
Several refinements of this basic algorithm are available. Some of these are :
1) Pre-compute service time. If the service time is independent of the particular client, the length of time required for the next service can be calculated before the next event arrives, and this is typically larger than the original lookahead.
2) Request time updates explicitly - Rather than sending null events unprompted, processes can explicitly request them. This occurs when the queue for a particular in-neighbor becomes empty.
3) Allow deadlock - Yet another approach is to not use null messages at all. Instead a deadlock detection scheme is used.
4) Increased synchronization -  A tighter synchronization between processes can be used to ensure that no processes get too far ahead of any other. A safety window is maintained.
All of these variants behave differently and often some experimentation is involved. Generally speaking, the magnitude of the look-ahead affects performance and this is highly application specific.


In the time-driven simulation approach we mentioned earlier, we have to "discretize" continuous state changes. The simulation is advanced by small increments in physical time. This time interval must be chosen carefully: if it is too large, it will not detect the events of interest, and if it is too small, it will make the simulation very slow.
Together with the overall amount of physical time being simulated, the time granularity adds to the computational complexity of the simulation. Hence both the factors deserve attention so that we can simulate just what we want.
In a true DES, the absolute sizes of the time intervals between events do not affect how quickly the simulation can proceed. Only the number of events matters.
In the event queue example we mentioned earlier, we use an arrival event to seed the queue. The event queue contained a single event with the description arrival of c1 and timestamp 1, represented as <arrival c1, 1>. This event is dequeued, the "current time" is updated to 1, and the event is simulated. Simulating this event sends the first client to the server, where there is no wait; the only time taken is the time to service the client, so the client begins service immediately.
This immediately causes a future event to be scheduled: the completion of service of c1 at this server, represented as, say, <complete c1, 10>.
When the arrival event is simulated, a second event is also generated: the arrival of the next client, say <arrival c2, 4>. Both events are inserted into the now empty event queue in increasing order of time.
Thus we set up a loop for events to be processed at this queue. Note that we keep track of timestamps only, so from our earlier discussion this is local time; when we want to capture state across different systems we use snapshots.
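A minimal sketch of that loop in Python using a heap as the event queue. The service time and inter-arrival gap are fixed constants chosen to match the example timestamps above, purely for illustration.

```python
import heapq

SERVICE_TIME = 9        # hypothetical time to serve one client
INTERARRIVAL = 3        # hypothetical gap between client arrivals
N_CLIENTS = 3

events = [(1, "arrival", "c1")]      # seed the queue: <arrival c1, 1>
heapq.heapify(events)
arrived = 1
while events:
    time, kind, client = heapq.heappop(events)   # always the earliest event
    print(f"t={time}: {kind} {client}")
    if kind == "arrival":
        # the client begins service immediately (no waiting modelled in this tiny example)
        heapq.heappush(events, (time + SERVICE_TIME, "complete", client))
        if arrived < N_CLIENTS:                  # schedule the next arrival
            arrived += 1
            heapq.heappush(events, (time + INTERARRIVAL, "arrival", f"c{arrived}"))
```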
Notice that the events are queued in a single queue. To alleviate the bottleneck, we could have a few more queues depending on how many the processor can handle. Typically one more queue per processor could help.
If the processor would like to keep track of the priority of the clients, then that could be maintained as separate queues for priority or another data structure.
In the parallelized case, we still honor the arrival events and mark with appropriate time stamps.
This is how DES systems do time based simulation. 

Sunday, October 13, 2013

Discrete Event simulation
This is yet another interesting topic in distributed computing. Consider a network of nodes and directed links between the nodes. There are three categories of nodes: sources, servers and sinks.
Sources generate clients that travel through this network along the links. The clients get service at the server nodes. A server can provide service to at most a single client at a time, so clients wait in a queue at the server. Finally, the clients disappear at the sinks.
The behavior of the system over time can be defined as a series of discrete events over time.
There are four different kinds of events:
A client arrives in the system
A client arrives at a server
A client completes service
A client leaves the system
With precise event times, we can assume no two events happen at the same time.
DES is therefore characterized by a list of time-stamped events described above that occur in increasing order of time. One use for this list is that we can gather metrics  such as average queue length etc.
These systems can be simulated to gather the relevant measurements and statistics. System can then be tuned to optimize performance.
Clients are generally not affected in their service by previous clients at the same server. Therefore clients can be treated as independent.
The data structure used in DES is an event queue. This queue maintains a list of scheduled future events in increasing order of time. Simulating an event from this queue could generate new events that are inserted into the event queue.
Clients are generated at the source using an initial arrival event. Now simulating this arrival event involves both the insertion of new clients and the scheduling the next arrival.
One limitation with the approach mentioned here is that we have a single event queue that can be a bottleneck. Since events have to wait for earlier events to be simulated, one way to remove this bottleneck would be to parallelize the events.
There are also other simulations such as time-driven simulations. In discrete event simulation, the absolute sizes of the time intervals between events do not affect how quickly the simulation can proceed; only the number of events matters.
For continuous systems, this is not suitable. For example, take the cueing of billiard balls or the moment when they collide: if we discretize the state changes too coarsely we cannot predict the next event, so we have to take finer time slices.
In general, the computational complexity is proportional to the amount of physical time being simulated and the granularity of the time steps.

In the previous post, we discussed synchronous networks with faults. In today's post, we will continue the discussion and look at specific faults. One specific failure is called the Byzantine fault. Here two processes must agree but their messages may get lost in transit. If there were n rounds of messages, each round could have a failure, so the processes may each start out with a binary value and, at the end of every round, may still hold opposing binary values. Consequently, there is no convergence as the rounds decrease from n to 1. Such faults are generally more difficult to tolerate than crash faults, where the outcome is simply to stop execution. Without authentication, there is no solution when there are only three processes and one of them may be faulty.
Authenticated messages alleviate this problem, because now the processes can know that a message was signed by the sender and no one else. The message is also tamper-proof, and the receiver can know that this was the message the sender intended.
The same idea can be used to address the Byzantine fault with just two rounds. The signed messages are compared, and a proposed value is discarded if the same value was not proposed to all the parties.
So each process takes the following steps (a sketch follows the list):
1) broadcast signed value to all other processes
2) broadcast set of received (signed) values
3) compare received values for each of the other processes
4) if they agree, include the process that sent the value, else exclude it
5) choose minimum value of the included process
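A toy Python sketch of these five steps, assuming four processes of which one equivocates in the first round, and modelling a "signature" simply as a (sender, value) pair that cannot be forged; relays in the second round are taken to be honest. This only illustrates the comparison-and-exclusion rule, not a full Byzantine agreement protocol.

```python
N = 4
honest_value = {1: 5, 2: 7, 3: 6}          # initial values of the honest processes

# Round 1: everyone broadcasts its signed value. The faulty process P4 equivocates,
# sending 1 to some processes and 9 to others.
round1 = {p: {} for p in range(1, N + 1)}   # round1[receiver][sender] = value
for receiver in round1:
    for sender, v in honest_value.items():
        round1[receiver][sender] = v
    round1[receiver][4] = 1 if receiver <= 2 else 9

# Round 2: everyone relays the set of signed values it received.
round2 = {p: {q: dict(round1[q]) for q in round1} for p in round1}

# Steps 3-5: exclude any sender whose signed value differs across relays, take the min.
def decide(p):
    included = []
    for sender in range(1, N + 1):
        relayed = {round2[p][q].get(sender) for q in round2[p]}
        if len(relayed) == 1:               # the same signed value was seen everywhere
            included.append(relayed.pop())
    return min(included)

print([decide(p) for p in range(1, N + 1)])   # the honest processes all agree: [5, 5, 5, 5]
```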
Compared with the crash failure, where the fault could occur in the first or the second round, here the faulty process continues to execute, so there may be faulty messages in both rounds. After the second round, the processes know whether any faulty message was sent in the first round, regardless of any faulty messages in the second round.
This problem becomes more interesting when more than one process can fail. In this case, two rounds don't suffice. Instead a consensus tree is constructed. In this tree, the nodes at level i represent message chains of length i; a message at depth 3 has signed messages contained one within the other from three processes. The root has N children representing the messages received in the broadcast of the first round. Each of those nodes has N-1 children from the messages received in the second round, each of those has N-2 children from the messages received in the third round, and so on.
At a given level, say N-3, the message chains look like (l,k,j,1), (l,k,j,2), (l,k,j,3) ... (l,k,j,N). In this case Pj relayed the same message regarding Pk, which in turn relayed Pl's value, to all the processes. So the messages all agree, and the decision value of these children can be bubbled up to the parent.
In order for this tree to calculate the same value for all processes, every message chain in the leaf nodes must include at least one non-faulty process. Thus in order to tolerate f faults, the leaf nodes must represent message chains of length f + 1. That is a consensus tree of height f+1 is required.
The consensus tree also helps when authenticated signatures are not available. In that case, the value from the children is bubbled up to the parent only when a majority agrees. This is based on the assumption that the majority are non-faulty, so a quorum suffices and the correct value is bubbled up.

Saturday, October 12, 2013

Asynchronous consensus with process failures.
When even one fault is possible, it is impossible to solve the consensus problem in a distributed system under the asynchronous model. Unfortunately, asynchronous design is common, and faults such as disk failures do happen. It only takes one fault to disrupt consensus.
This has been proved based on the observation that a protocol begins in a state from which either decision value, 0 or 1, is still possible, and it can be kept forever in a state in which either result is still possible.
In transaction systems, we used the two-phase commit protocol for commit/abort. The problem is also common in group membership, leader election, and reliable multicast. Hence, some suggestions to modify the assumptions of the problem include:
1) Synchrony - If a system can provide an upper bound on the delay or recovery, then this problem can be solved.
2) Randomness - Some probabilistic algorithms can help
3) Failure Detectors - a delayed process is difficult to differentiate from a crashed process without a failure detector.
4) Approximation - Even if all the processes cannot agree on exactly the same value, approximately equal values can suffice.
Let us look at synchronous agreement with process failures.
Every process broadcasts to all other processes its initial value and this can be done in a single round of messages. All of them can then agree on say the minimum.
If there were a crash before the first round completes, then the processes may not all have the same values. So a second round of messages can be initiated carrying all the values from the first round, so that everyone ends up with the same set.
Notice that if there are more failures, then there must be more rounds before the agreement. In such a case, there must be an upper bound on the number of rounds otherwise this may not converge.
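A minimal sketch of this flooding-style agreement under crash faults, assuming a known bound f on crashes and f + 1 rounds; the crash schedule and values are made up, and partial message delivery during a crash is not modelled.

```python
def flooding_consensus(initial, f, crashed_after_round):
    """Each process rebroadcasts every value it knows; after f + 1 rounds, take the min."""
    known = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for r in range(1, f + 2):                          # f + 1 rounds in total
        sent = {p: set(known[p]) for p in alive}       # everyone alive broadcasts its set
        alive -= {p for p, cr in crashed_after_round.items() if cr == r}
        for p in alive:                                # survivors merge what they received
            for vals in sent.values():
                known[p] |= vals
    return {p: min(known[p]) for p in alive}

# Three processes; P2 crashes after sending in round 1. Survivors agree on the minimum.
print(flooding_consensus({1: 8, 2: 3, 3: 9}, f=1, crashed_after_round={2: 1}))
# -> {1: 3, 3: 3}
```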
There could also be authentication and encryption involved so that the messages are not tampered with and can be shared safely. We don't focus on those at this time because we are addressing a different issue.
Having discussed the solution for synchronous systems, we can see how the modified assumptions listed above become useful for asynchronous networks.

We will continue our discussion on garbage collection. The process has some steps that are not obvious. We will go over them in this post.
First, we look at the propagator step we added. The propagator tests a condition for each edge. The condition marks an edge between x and y as ok if, when x is a root and the edge from x to y exists, y is connected as a root as well. In this way the propagator connects what is reachable to the food. It continues execution until a termination condition is reached where the root is connected to all the vertices visited and each such edge (x, y) is ok.
For an edge between x and y there is only one condition under which the edge is not ok: when x is not a root. If x is not a root, it is manure, and manure is now easy to collect.
The fact that x is manure and is not a root is maintained as an invariant as we make progress by adding edges. Notice that in the ok condition we checked and maintained that y is a root. This is a refinement of the earlier definition where we considered y to be food. This lets us end up with two separate connected graphs, one of which is manure and the other food; everything else is unconnected and is garbage.
To prove that this is the case, we consider our actions and maintain the invariant that if x is manure then it is not a root. Said differently, no vertex is both manure and a root while the marker propagates. We show that this invariant holds throughout the progress.
Initially there is a root for the food, and manure is part of the garbage. This is easy to establish because food and manure are mutually exclusive: by definition, food is any vertex that is reachable from the root. So our initial condition holds.
Next, among our actions, we look at the two that change whether a vertex is or is not a root.
The first is when we add an edge between x and y and y is food. In this case, since y is food, it is not manure. Since the invariant holds for every vertex other than y, adding an edge from such a vertex to y maintains the invariant.
The other action is when x is a root and an edge now exists between x and y. In this case we proceed from the left: if x is not manure and we connect y, then y is not manure, and if y is not manure, then y is a root. This re-establishes our invariant.
Thus, by maintaining our invariant and making progress, we are able to propagate. We detect the termination condition when the root is chosen and all the edges reaching it are ok.