Saturday, November 2, 2013

Trend and behavior detection is applied not only in monitoring but also in information retrieval. Wang, Bownas and Berry discuss it in the context of web queries.
They discuss the type and nature of query characteristics that can be mined from web server logs. They studied the vocabulary of the queries submitted to a website over a long period of time and found that the vocabulary did not follow a well-defined Zipf distribution. Instead, the trends could be better represented as piecewise polynomial data fits, i.e. based on regression analysis.
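Not the authors' actual fit, but a minimal sketch of the piecewise polynomial idea, assuming we already have a rank-frequency table; the breakpoints, degree and synthetic data are arbitrary placeholders:

import numpy as np

def piecewise_polyfit(ranks, freqs, breakpoints, degree=2):
    # Fit a separate low-degree polynomial to each segment of the log-log
    # rank-frequency curve; segments are defined by breakpoints in log10(rank) units.
    x, y = np.log10(ranks), np.log10(freqs)
    edges = [x.min()] + list(breakpoints) + [x.max()]
    fits = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x <= hi)
        coeffs = np.polyfit(x[mask], y[mask], degree)  # least-squares regression per segment
        fits.append(((lo, hi), coeffs))
    return fits

# Synthetic example; real input would be the term frequencies mined from the log.
ranks = np.arange(1, 1001)
freqs = 1000.0 / ranks ** 0.8
segments = piecewise_polyfit(ranks, freqs, breakpoints=[1.0, 2.0])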
Studies of traditional information retrieval systems reveal many problems that searchers encounter: the complexity of query syntaxes, the semantics of Boolean logic operators, and the linguistic ambiguities of natural language. Since user education was unlikely, systems were designed to facilitate self-learning. Search logs from different search engines were collected and studied. Even though the web search engines differed widely in data collection, processing, focus of search and so on, their results were comparable. Web queries were short, averaging two words, and simple in structure; few queries used advanced search engine features, and many contained errors. These observations came from web logs that were quite large but covered a very short period of time. Hence the study by the authors instead used logs from a university website spanning a longer time, which facilitated studying trends over time.
The log data included a date stamp, the query statement and hits. Zero hits were caused by several factors. First, the stop words set by the engine excluded many words. Second, the queries contained a high percentage of misspelled words. Third, many searches entered names of individuals. Fourth, the default Boolean operator was AND; the other Boolean operators had to be included explicitly. Fifth, empty queries also contributed to zero hits.
The queries submitted to the search engine were very short and came from different users, so it was difficult to predict whether the vocabulary would show similar statistical distributions. A plot of the logarithmic rank-frequency distribution of terms showed lines that were neither smooth nor straight.
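A minimal sketch of how one might reproduce such a plot from a query log; the tab-separated (date, query, hits) record layout is an assumption, not the paper's actual format:

from collections import Counter

def rank_frequency(log_lines):
    # Tally the query vocabulary from log records assumed to be of the form
    # "date<TAB>query<TAB>hits" and return (rank, frequency) pairs; plotting both
    # on logarithmic axes would give a straight line for a true Zipf distribution.
    counts = Counter()
    for line in log_lines:
        date_stamp, query, hits = line.rstrip("\n").split("\t")
        counts.update(query.lower().split())
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))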

Friday, November 1, 2013

Today I want to take a short break to discuss an interesting idea for monitoring API calls. APIs are ubiquitous and their usage is relevant to operations, development and test teams. There are various kinds of tools available for monitoring, and some come with facilitators such as Mashery and Splunk. These are often based on non-intrusive capture, index and search mechanisms, or a subset of them. Often they are cumbersome to use, involve heavy investment and don't provide real-time information. But I want to talk about a specific monitoring usage similar to a real-time chart of call volumes. The APIs I want to target for my monitoring have a log counter, i.e. they log the number of calls to an API, identified by its name, in a database table with read access for all. In this case, I'm interested in charts similar to performance counters on desktops.
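To make the idea concrete, here is a minimal sketch of such a log counter; the table name, schema and SQLite backend are assumptions rather than any particular product's API:

import sqlite3

conn = sqlite3.connect("api_counters.db")   # hypothetical shared counter store
conn.execute("""CREATE TABLE IF NOT EXISTS api_counters (
                    api_name TEXT PRIMARY KEY,
                    call_count INTEGER NOT NULL DEFAULT 0)""")

def log_call(api_name):
    # Bump the per-API counter; the row is created on first use.
    conn.execute("INSERT OR IGNORE INTO api_counters (api_name, call_count) VALUES (?, 0)", (api_name,))
    conn.execute("UPDATE api_counters SET call_count = call_count + 1 WHERE api_name = ?", (api_name,))
    conn.commit()

log_call("login")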
There are a few metrics I'm interested in. For example, one metric is reported via graphs based on grids, with call volumes on the y-axis and time on the x-axis. The peaks and valleys in the graph can correspond to alerts.
Another metric is the histogram of log counters, which gives a relative comparison between APIs. This may show similar patterns over time, but the idea is to view the first metric in comparison with the others at the same time.
A third metric is the averages over time slices. This can be useful for detecting the time periods in which bursts occur.
Another metric is the rate of change or velocity, i.e. the number of calls per second. This can be useful to spot and thwart possible denial of service attacks, since an attack is often preceded by a steep slope. This could be particularly useful for login calls.
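A hedged sketch of the velocity metric, assuming the counter is polled periodically into (timestamp, cumulative count) samples; the alert threshold is a placeholder:

def call_rates(samples):
    # samples: list of (timestamp_in_seconds, cumulative_call_count), oldest first,
    # obtained by polling the counter table.
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) / float(t1 - t0)))   # calls per second
    return rates

def burst_alerts(rates, threshold=100.0):
    # Flag the sample times where velocity exceeds a threshold; for a login API a
    # sustained steep slope here could be an early warning of a denial of service.
    return [t for t, rate in rates if rate > threshold]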
Data may also get stored for analysis later. There could be studies such as event correlation and capacity planning.
This tool is expected to serve a specific purpose. It's not meant as a general purpose tool spanning audit, compliance, governance and reporting. The data set is limited, the intended audience is targeted, and the expected actionable items for the alerts are few but critical. This is more for the development and test teams to know what's going on without reaching out.
These monitors can be considered internal use, so private and confidential information can also be monitored. One more important point is that the instrumentation is added based on what the development team is interested in.
After the discussion on the results of simulations conducted by the authors of Fuzzy SKWIC, here are their observations on the simulation using the newsgroups. Since the results showed that the documents that were inconsistently classified came from four miscellaneous classes, the run was repeated without these documents. Specifically, these documents included those that lie in areas of overlap or fuzziness between distinct categories, documents that are outliers, and those that affect the purity of the resulting partition. In a way, they studied where the clustering did not perform so well. After discarding these documents, the results showed that the clusters were more uniform and pure, and that the keywords were richer in terms of their relevance weights and partitions.
This implies that without the noise, the clusters were homogeneous, compact and pure, and the cluster-dependent keywords were more relevant and representative.
In order to measure the performance of the clustering in any scenario, they came up with a metric called the average entropy measure of all C clusters. This average weights the entropy of each cluster by the ratio of the documents in that cluster to the overall number of documents. The entropy of the ith cluster is in turn the class-wise sum of the fraction of that cluster's documents belonging to each class times the log of that fraction, tuned by a constant, the inverse of the log of K.
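My reading of that measure as a sketch; I'm taking K to be the number of distinct classes, and the variable names are mine, not the authors':

import math
from collections import Counter

def average_entropy(cluster_labels):
    # cluster_labels[i] is the list of true class labels of the documents assigned to cluster i.
    total_docs = sum(len(cluster) for cluster in cluster_labels)
    K = len({label for cluster in cluster_labels for label in cluster})
    if K < 2 or total_docs == 0:
        return 0.0
    average = 0.0
    for cluster in cluster_labels:
        n_i = len(cluster)
        if n_i == 0:
            continue
        entropy = 0.0
        for count in Counter(cluster).values():
            p = count / float(n_i)            # class-wise ratio within the cluster
            entropy -= p * math.log(p)
        entropy /= math.log(K)                # the 1/log(K) tuning constant
        average += (n_i / float(total_docs)) * entropy   # weight by cluster size
    return average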
Using this metric, they arrive at the conclusion that real unlabeled text data is the most challenging. Moreover, any manually labeled benchmark document set contains plenty of errors that add to the noise. Documents that straddle more than one topic also end up with an inadequate label. So there is no clean baseline against which to compare the unsupervised clustering, but the results have shown that automatic labeling is superior to manual labeling.
The authors also suggest that an improvement to this kind of clustering would be to use some kind of probabilistic latent semantic indexing, which allows contexts to be differentiated further. Fuzzy SKWIC does simultaneous partitioning in two different hyperspaces: the document space to capture spatial document organization and the keyword space to capture context. Context can be inferred because it comes not from one keyword but from co-occurring relevant keywords. By providing fuzziness, i.e. memberships in different clusters to varying degrees, the cluster-dependent keywords are richer and better suited to classify the documents in those clusters. The documents in a cluster are also more uniform, compact and homogeneous.

Thursday, October 31, 2013

Some observations from the implementation by the authors of Fuzzy SKWIC. We talk about their experiments with general SKWIC as well as the fuzzy improvement.
They first tried simulations with hard clustering. Their collection consisted of 200 documents spread equally across four categories. They filtered stop words and used stemming. The IDF of the terms, in descending order, was used to pick out 200 keywords. Each document was represented by a vector of its word frequencies, normalized to unit length. Four clusters were used and SKWIC converged in five iterations. Six cluster-dependent keywords were chosen and verified, and they correctly represented the cluster category. Only one partition showed most of the error; documents that straddled two or more topics were found in this partition.
One observation I would like to highlight here: regardless of the keyword extraction method, their choice for the number of keywords was on the order of a few hundred. The ratio of keywords to documents is incidental here, but the choice to keep the number of keywords small is deliberate. We should follow a similar practice.
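A minimal sketch of that practice, building unit-length frequency vectors over a small IDF-ranked keyword set; aside from the 200-keyword cutoff mentioned above, the details are assumptions:

import math
from collections import Counter

def build_vectors(documents, num_keywords=200):
    # documents: list of token lists, already stemmed and with stop words removed.
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    idf = {term: math.log(n_docs / float(df)) for term, df in doc_freq.items()}
    # Keep only the top terms by descending IDF, as in the experiment described above.
    keywords = sorted(idf, key=idf.get, reverse=True)[:num_keywords]
    index = {term: k for k, term in enumerate(keywords)}
    vectors = []
    for tokens in documents:
        vec = [0.0] * len(keywords)
        for term in tokens:
            if term in index:
                vec[index[term]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])   # unit-length normalization
    return keywords, vectors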
Now on to the run based on fuzzy labels: the results were again with four clusters and m = 1.1. The cluster center updates converged after 27 iterations. The results indicated a significant improvement over the former run, particularly with respect to the partition that had erroneously classified documents; in this run, those documents were correctly classified. A closer look at those documents suggested that they could have had soft labels from the start, i.e. they should not have been assigned a single label. The results also indicate that the labels generated by Fuzzy SKWIC were richer and more descriptive of their categories, indicating a good application.
A third simulation was done with much more varied data from 20 newsgroups comprising 20000 messages. Here the messages were stripped of their structural data in addition to being processed with stemming and stop word filtering.
The number of clusters was 40. The feature weights varied quite a bit from cluster to cluster, and a lot of document migrations between clusters were seen. Ten relevant keywords were chosen for each cluster and, again, they reflected their clusters well. Some indicated that the discussions were about an event, while others indicated that they pertained to an issue found interesting by the participants.
We were discussing the implementation of Fuzzy SKWIC, where we were going through the steps within each iteration, repeated until the cluster centers stabilize. So far we had looked into finding cosine distances, updating and adjusting relevance weights, and computing the weighted aggregated cosine distance and adjusting it. We will discuss the updates to the partition matrix, cluster centers and tau next. The partition matrix is initialized as a C x N matrix of weights, each with a value ranging from 0 to 1. The partition matrix has the additional constraints that the column-wise sum along clusters must equal 1 and the row-wise sum along documents must be less than N. To update this matrix in each iteration, we use the computed weighted aggregated cosine distances. Here we take the ratio of the weighted aggregated Dwcij and Dwckj and aggregate this ratio over all the clusters. Note that it may seem we have not computed the denominator until now since it has not been mentioned so far; it follows the same computation as the numerator but against a cluster k ranging from 1 to C instead of the specific ith cluster. Both i and k are cluster iterators, and k runs from 1 to C for the same i in this computation. Therefore Dwcij and Dwckj are both known before this step to update the matrix. We raise each of these ratios to the power 1/(m-1), where m is just the degree of fuzziness for the partition matrix values, sum them over the clusters, and then take the inverse of the result to get the new partition matrix value uij.
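A sketch of that membership update as I read it (ignoring the corner case of a zero distance, which would need special handling):

def update_partition_matrix(D, m=1.1):
    # D: C x N matrix of (adjusted, non-negative) weighted aggregated distances Dwcij.
    # Returns the new C x N fuzzy membership matrix whose columns sum to 1.
    C, N = len(D), len(D[0])
    exponent = 1.0 / (m - 1.0)
    U = [[0.0] * N for _ in range(C)]
    for j in range(N):
        for i in range(C):
            # u_ij = 1 / sum over k of (Dwcij / Dwckj)^(1/(m-1))
            total = sum((D[i][j] / D[k][j]) ** exponent for k in range(C))
            U[i][j] = 1.0 / total
    return U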
Next we look at updating the cluster centers. Here, if the feature weight vik = 0, then the kth keyword dimension has no bearing on the cluster, so we don't update that component of the cluster center. If the feature weight is greater than zero, we update that component with the fuzzy-membership-weighted mean of the documents.
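A matching sketch of the center update under the same assumptions:

def update_centers(X, U, V, m=1.1):
    # X: N x n unit-length document vectors, U: C x N memberships, V: C x n feature weights.
    # Each component c_ik is the fuzzy-membership-weighted mean of the documents' kth
    # components, and is left at zero when the feature weight v_ik is zero.
    C, N, n = len(U), len(X), len(X[0])
    centers = [[0.0] * n for _ in range(C)]
    for i in range(C):
        weights = [U[i][j] ** m for j in range(N)]
        denom = sum(weights)
        for k in range(n):
            if V[i][k] > 0 and denom > 0:
                centers[i][k] = sum(weights[j] * X[j][k] for j in range(N)) / denom
    return centers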
Finally, we update tau. Tau is the last to be updated in the iteration, and here the only things that have changed since the last computation are the partition matrix and the newly computed Dwcij.
Thus we see that the data structures we need are just the partition matrix and a few per-iteration variables. The steps in each iteration are as discussed.
The authors of Fuzzy SKWIC mention a very interesting phenomenon called noise magnets. One cluster starts accumulating outliers and noise while the others remain compact, with their constituents at similar distances from the cluster center. This happens because the outlier documents are neither similar to each other nor similar to any cluster. In addition, they get assigned a maximum distance of 1, so they don't have any effect on the objective function. They join the cluster whose seed is nearest to them and bring down the feature weights of that cluster due to extreme averaging. Consequently, that cluster draws in all the outliers.

Wednesday, October 30, 2013

In the previous post we talked about the steps to compute the terms for the cosine distances and the feature/relevance weights in each iteration, which we could do with the parameters we had during each iteration. We started talking about how to update and adjust the relevance weights, so let's wrap that up and get to computing the weighted aggregated Dwcij and adjusting it if it's negative. Picking up the thread from the relevance weights, we saw that each weight depends on a default value and a bias. We computed the bias by initializing the fuzzy membership values. Other than tau, everything else was known to compute the relevance weights. For tau, we compute it for iteration t with an equation of roughly this form:
tau_i(t) = K * [ sum over documents j of (u_ij)^m * Dwcij ] / [ sum over keywords k of (v_ik)^2 ]
where Dwcij is the weighted aggregated cosine distance, itself a sum of weighted per-keyword distances. Here we see that we have everything we need for iteration t, given that K is a constant. So the relevance weights can be calculated; we just need initial values for the first iteration and the previous iteration's values in the next one.
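A small sketch of that tau computation under my reading of the formula; the matrices are plain Python lists and the value of K is a placeholder:

def update_tau(U, D_agg, V, K=0.1, m=1.1):
    # tau_i(t) = K * sum over j of (u_ij^m * Dwcij) / sum over k of (v_ik^2),
    # using the previous iteration's memberships, aggregated distances and relevance weights.
    C, N = len(U), len(U[0])
    taus = []
    for i in range(C):
        numerator = sum((U[i][j] ** m) * D_agg[i][j] for j in range(N))
        denominator = sum(v * v for v in V[i])
        taus.append(K * numerator / denominator if denominator else 0.0)
    return taus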
Now let's look at the weighted aggregated Dwcij and the individual Dwcij terms.
The weighted aggregated Dwcij is computed as the summation, over the keyword dimensions, of vik times the cosine distance along that dimension.
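As a one-line sketch under that reading, with the per-dimension distances and weights passed in as plain lists:

def weighted_aggregated_distance(v_i, d_ij):
    # v_i: relevance weights of cluster i over the keyword dimensions,
    # d_ij: per-dimension cosine distances between document j and cluster i.
    return sum(v_ik * d_ijk for v_ik, d_ijk in zip(v_i, d_ij))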
It's the negative values we need to adjust. For each cluster, we compute the absolute values of the weighted aggregated Dwcij and take the minimum; this is the adjustment value we add to the weighted aggregated Dwcij values that come out negative.
Note that the previous iteration's values of Dwcij are not reused in the next iteration; they are recomputed, because they are based on the cluster centers, which we update in each iteration. The adjustment for a negative Dwcij follows a similar pattern to the adjustment for negative relevance weights that we saw earlier.