Sunday, November 3, 2013

Today I'm going to describe a few more steps for the implementation of Fuzzy SKWIC. First I want to talk about the data to be used with this implementation. I'm comparing the nltk corpus to the Gutenberg project. What we need is a set of documents over which we can use a conditional frequency distribution instead of computing IDF, or Inverse Document Frequency, so that we can rank the words and take the top few keywords as our vocabulary list. We will also try out populating a sample vocabulary list from test data with the Python nltk library. It's also possible to implement Fuzzy SKWIC itself in Python.
import nltk

# html is assumed to hold the page source fetched earlier
raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)
vocab = sorted(set(tokens))
# normalize tokens either by stemming...
porter = nltk.PorterStemmer()
stems = [porter.stem(t) for t in tokens]
# ...or by lemmatizing with WordNet
wnl = nltk.WordNetLemmatizer()
lemmas = [wnl.lemmatize(t) for t in tokens]
I take back the use of conditional frequency distributions and will switch to using IDF. The conditional frequency distribution is useful for Wartena's PLSI-based implementation. We will stick to IDF for this implementation since we are only interested in a limited vocabulary and want to focus on Fuzzy SKWIC. IDF can be computed from the frequency distribution of the words in each text of a collection of documents. The frequency distribution is readily available from the nltk package, so if a word is present in a text, we can increment the document frequency for that word.
We will also stick to four categories of document samples, just as in the implementation by the authors of Fuzzy SKWIC, and slip the test document we would like to extract keywords from into one of the categories. We assume that the given test document is similar to one or more of the categories we have chosen. We fix the C clusters. For each cluster we will maintain an array of feature weights indexed from 1 to n, the number of words in our chosen vocabulary. The order of the feature weights must correspond to the order of the words in the vocabulary. We also maintain, for each term, a variable for the weighted aggregated sum of the cosine-based distances along the individual dimensions. These will be computed for the different pairwise matchings of terms and the ith cluster center vector. Both the cosine distances and the feature weights could be stored in a data structure with one cell per entry of a C x n matrix. We will also need another data structure for a C x N fuzzy partition matrix, which we will use to keep the computed values of the fuzzy membership labels.
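Before getting into the iteration bookkeeping, here is a minimal sketch of the IDF step described above, assuming the collection is already available as a list of token lists; the names documents and vocabulary are placeholders I am using for illustration, not part of any library.

import math
import nltk

def compute_idf(documents):
    # documents: a list of token lists, one per text in the collection
    num_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        fd = nltk.FreqDist(tokens)
        for word in fd:
            # a word present in a text increments its document frequency once
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return dict((word, math.log(float(num_docs) / df)) for word, df in doc_freq.items())

# rank by IDF and keep the top few keywords as the vocabulary list
# idf = compute_idf(documents)
# vocabulary = sorted(idf, key=idf.get, reverse=True)[:100]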
We will also keep some data structures between iterations that let us reuse the values computed for tau in the previous iteration. The cosine-based distance is probably the only quantity that requires both individual and aggregated values to be tracked, both within an iteration and across iterations. The individual cosine-based dissimilarity is computed once for each 1 <= i <= C, 1 <= j <= N and 1 <= k <= n, so it represents a 3D matrix of values. We can aggregate it over k = 1 to n and store the results in separate lookup variables in a C x N matrix.
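To make that bookkeeping concrete, here is a rough sketch of those structures using numpy arrays; the sizes and the variable names are placeholders I picked for illustration.

import numpy as np

C, N, n = 4, 200, 100                        # clusters, documents, vocabulary terms (placeholder sizes)

feature_weights = np.full((C, n), 1.0 / n)   # one relevance weight per cluster per vocabulary term
cosine_dist = np.zeros((C, N, n))            # individual cosine-based distances along each dimension
fuzzy_partition = np.zeros((C, N))           # fuzzy membership labels for each document in each cluster

# aggregate the weighted distances over the n dimensions into a C x N lookup
aggregated_dist = np.einsum('ik,ijk->ij', feature_weights, cosine_dist)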
This is a review of an interesting article written several years back by Daniel Phelps, then Leader of the Information Mining Group at Eastman Kodak Company. He talks about Emerging Trend Detection (ETD) in text data mining, and his description of how an ETD tool would work in an ideal scenario includes:
1. Raw data processed into useful information.
2. Sufficient information processed to meet the current need.
3. No irrelevant information presented.
4. All information available immediately when needed.
5. Information prioritized for the current need.
6. Information presented in a format that is intuitively easy to understand.
7. Information can be viewed at different levels of detail.
8. Information can be viewed from multiple perspectives.
He mentions that the objective of ETD is to provide an automated alert when new developments are happening in a specific area of interest.
An event may be the result of an underlying development. Whether the development is important is determined by the situation and the particular information needs of the person evaluating the data. The speed to detect developments and to enable a response is what sets ETD tools apart. ETD tools generally have a user interface to navigate the trends, data and summaries. They are also usually customized from client to client. ETD tools involve knowledge management and text mining. Primarily, they include what is necessary for a decision maker to look at developments and make a call.
 

Saturday, November 2, 2013

We will continue our discussion on trend analysis of web searching. In the sample discussed, the number of queries per month showed a similar pattern. In a plot of the natural log of frequency versus the natural log of rank, one line represents the ranks of all words ordered by descending frequency and another line represents the ranks of the unique frequencies. The first line was not straight but had a reasonably steady slope. The second line overlapped it closely and then dropped down at the higher ranks. The first did not account for words that shared the same frequency. The second did not reflect the size of the vocabulary.
There were a few more graphs made to study whether they showed a similar pattern. Many terms had the same frequencies, and the low frequencies clustered many words. Different time periods were chosen for the additional graphs. They all showed similar trends: there would be overlap for the high-frequency words and a divergence for the lower ones. Consequently, two equations were formed, one for each half. The lower-frequency, higher-rank portion was represented by a line, while the higher-frequency, lower-rank portion was represented by a polynomial.
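As a rough sketch of that kind of fit, assuming the query log has already been reduced to a flat list of words, something along these lines could reproduce the two halves; the split rank and the polynomial degree here are arbitrary choices of mine, not values from the paper.

import numpy as np

def rank_frequency_fit(query_terms, split_rank=100):
    # count term frequencies and rank them in descending order
    freqs = {}
    for term in query_terms:
        freqs[term] = freqs.get(term, 0) + 1
    counts = sorted(freqs.values(), reverse=True)
    log_rank = np.log(np.arange(1, len(counts) + 1))
    log_freq = np.log(np.array(counts, dtype=float))
    # higher-frequency, lower-rank half fitted with a polynomial; the rest with a line
    poly_fit = np.polyfit(log_rank[:split_rank], log_freq[:split_rank], 2)
    line_fit = np.polyfit(log_rank[split_rank:], log_freq[split_rank:], 1)
    return poly_fit, line_fit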
The size of the vocabulary increased much more slowly than the size of the queries. Investigations into the overlapping vocabulary across years showed that the overlapping words had higher frequencies each year. Some of these words were site specific and others were more seasonal in nature. The matrix of vocabulary terms to term pairs is sparse, with only a small portion of the vocabulary words co-occurring. In terms of improvements to the search engine, the following observations were made. First, the zero hits due to the use of stop-words could be addressed either by automatically excluding them or by providing context-sensitive information. Second, the search engine could interpret term pairs and queries, although users may have to be informed about this advanced query construction. Third, the word associations measured by the frequencies of word pairs may be predictable from word frequencies alone. Finally, collection of user vocabulary could help with the generation of content-based metadata.
Trend and behavior detection is applied not only in monitoring but also in information retrieval. Wang, Bownas and Berry discuss it in regard to web queries in their paper.
They discuss the type and nature of query characteristics that can be mined from web server logs. They studied the vocabulary of the queries to a website over a long period of time and found that the vocabulary did not have a well-defined Zipf distribution. Instead, the trends could be better represented by piecewise polynomial data fits, i.e. based on regression analysis.
Studies of traditional information retrieval systems reveal many problems that searchers encountered. These include the complexity of query syntaxes, the semantics of Boolean logic operators, and the linguistic ambiguities of natural language. Since user education was unlikely, systems are designed to facilitate self-learning. Search logs from different search engines were collected and studied. Even though the web search engines differed widely in data collection, processing, focus of search and so on, their results were comparable. Web queries were short: they averaged two words, were simple in structure, rarely used advanced search engine features, and many contained errors. These observations came from web logs that were quite large but covered a very short period of time. Hence the study by the authors used logs from a university website spanning a longer time instead. This facilitated the study of trends over time.
The log data included a date stamp, the query statement and the hits. The zero hits were caused by many factors. First, the stop words set by the engine excluded many words. Second, the queries contained a high percentage of misspelled words. Third, many searches entered names of individuals. Fourth, the default Boolean operator was AND; the other Boolean operator had to be included explicitly. Fifth, empty queries also contributed to zero hits.
The queries submitted to the search engine were very short and came from different users, so it was difficult to predict whether the vocabulary would show similar statistical distributions. A plot of the logarithmic rank-frequency distribution of terms showed lines that were neither smooth nor straight.

Friday, November 1, 2013

Today I want to take a short break to discuss an interesting idea for monitoring API calls. APIs are ubiquitous and their usage is relevant to operations, development and test teams. There are various kinds of tools available for monitoring and some come with facilitators such as Mashery and Splunk. These are often based on non-intrusive capture, index and search mechanisms or a subset of them. Often they are cumbersome to use, involve heavy investment and don't provide real-time information.  But I want to talk about a specific monitoring usage similar to real-time chart of call volumes. The APIs I want to target for my monitoring have a log counter i.e. they log the number of calls to an API identified by its name in a table in a database for read access by all. In this case, I'm interested in charts similar to performance counters on desktops.
There are a few metrics I'm interested in. For example, one metric is reported via graphs based on grids, with call volumes on the y-axis and time on the x-axis. The peaks and the valleys in the graph can correspond to alerts.
Another metric is the histogram of log counters, which indicates the relative comparison between APIs. This may show similar patterns over time, but the idea is to view the first metric in comparison with the others at the same time.
A third metric is the averages over time slices. This can come in useful for detecting the time periods in which bursts occur.
Another metric is the rate of change, or velocity, i.e. the number of calls per second. This can come in useful to see and thwart possible denial of service attacks, because an attack is often preceded by a steep rise. This could be particularly useful for login calls.
Data may also get stored for analysis later. There could be studies such as event correlation and capacity planning.
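As a minimal sketch of the rate metric, assuming the counters live in a table like api_counters(api_name, call_count) — a schema I am making up for illustration — a simple poller could look like this:

import sqlite3
import time

def poll_rates(db_path, interval=1.0, threshold=100.0):
    # the threshold and the schema are hypothetical; adjust to the actual counter table
    conn = sqlite3.connect(db_path)
    previous = {}
    while True:
        rows = conn.execute("SELECT api_name, call_count FROM api_counters").fetchall()
        for name, count in rows:
            if name in previous:
                rate = (count - previous[name]) / interval   # calls per second since the last poll
                if rate > threshold:
                    print("possible burst on %s: %.1f calls/sec" % (name, rate))
            previous[name] = count
        time.sleep(interval)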
This tool is expected to serve a specific purpose. It's not meant as a general-purpose tool to span audit, compliance, governance and reporting. The data set is limited. The intended audience is targeted. The expected actionable items for the alerts are few but critical. This is more for the development and test teams to know what's going on without reaching out.
These monitors can be considered internal use, so private and confidential information can also be monitored. One more important thing is that the instrumentation is added based on what the development team is interested in.
After the discussion on the results of the simulations conducted by the authors of Fuzzy SKWIC, here are their observations on their simulation using the newsgroups. Since the results showed that the documents that were inconsistently classified came from four miscellaneous classes, the run was repeated without these documents. Specifically, these documents included those that lie in areas of overlap or fuzziness between distinct categories, documents that are outliers, and those that affect the purity of the resulting partition. In a way, they studied where the clustering did not perform so well. After discarding these documents, the results showed that the clusters were more uniform and pure, and the keywords were richer in terms of their relevance weights and partitions.
This implies that without the noise, the clusters were homogeneous, compact and pure, and the cluster-dependent keywords were more relevant and representative.
In order to measure the performance of the clustering in any scenario, they came up with a metric: the average entropy measure over all C clusters. This average weights the entropy of each cluster by the ratio of the documents in that cluster to the overall number of documents. The entropy of the ith cluster is then represented as the class-wise sum of a ratio times the log of that ratio, tuned by a constant, and this constant is the inverse of the log of K.
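As I read that description, the measure probably has a form along these lines, where n_i is the number of documents assigned to cluster i, n_i^k is the number of those documents carrying class label k, N is the total number of documents and K is the number of class labels; this is my reconstruction of the verbal description, not a formula quoted from the paper:

E_i = -\frac{1}{\log K} \sum_{k=1}^{K} \frac{n_i^k}{n_i} \log \frac{n_i^k}{n_i}, \qquad E = \sum_{i=1}^{C} \frac{n_i}{N} E_i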
By using this metric, they arrive at the conclusion that real unlabeled text data is the most challenging. Moreover, with any manually labeled benchmark document set, there are plenty of errors adding to the noise. Documents that straddle more than one topic also end up with an inadequate label. So there is no baseline against which to compare the unsupervised clustering, but the results have shown that automatic labeling is superior to manual labeling.
The authors also suggest that the improvement they see for this kind of clustering lies in using some kind of probabilistic latent semantic indexing, which allows contexts to be differentiated further. Fuzzy SKWIC does simultaneous partitioning in two different hyperspaces: the document space, to capture the spatial document organization, and the keyword space, to capture context. Context can be inferred not from one keyword but from co-occurring relevant keywords. By providing fuzziness, i.e. memberships in different clusters to varying degrees, the cluster-dependent keywords are richer and better suited to classify the documents in those clusters. The documents in a cluster are also more uniform, compact and homogeneous.