The UI controls for an application are built from custom libraries over .NET, which is in turn built over the Windows framework. Controls that are not visible in a layer should still be accessible from the layers below.
Tuesday, April 2, 2013
Monday, April 1, 2013
BIRCH review
This is a review of "BIRCH: A New Data Clustering Algorithm and Its Applications".
BIRCH stands for "Balanced Iterative Reducing and Clustering using Hierarchies". It is an efficient and scalable data clustering method that has been used to build an interactive pixel classification tool and to generate the initial codebook for compression. Data clustering refers to the problem of dividing N data points into K groups so that the points within a group are more similar to each other than to points in other groups. BIRCH works well with very large data sets because it first generates a compact summary that retains as much distribution information as possible, and then clusters that summary. Other clustering algorithms can also be applied to the summary BIRCH produces.
BIRCH proposes a data structure called the CF-tree to efficiently store the summary as clusters. It builds the tree as it encounters the data points, inserting each point according to one or more distance metrics, and then applies a clustering algorithm to the leaf nodes. As opposed to partitioning clustering, notably the widely used k-means algorithm, BIRCH is a hierarchical clustering algorithm. In partitioning, we pick initial cluster centers, assign every data instance to the closest center, and compute new centers for the k clusters, repeating until convergence. Hierarchical clustering instead produces multiple levels of clusterings, as in a dendrogram: level 1 has n clusters and level n has one cluster. Take n = 8 as an example. First, we calculate the similarity between all possible pairs of data instances; merging the closest pair gives level 2 with k = 7 clusters. Merging again gives level 3 with k = 6 clusters, and then level 4 with k = 5. At level 5 (k = 4), the most similar of the previously formed clusters merge into a new cluster. At level 6 (k = 3), we calculate the similarity between the new cluster and all remaining clusters. Level 7 leaves only two clusters, and level 8 a single cluster. Thus partitioning clustering runs in roughly O(n) per iteration while hierarchical clustering takes O(n² log n), but it is the richer output for which the latter is chosen. Both need to store the data in memory.
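The level-by-level merging above can be sketched with a naive agglomerative pass. This is a minimal illustration, not the paper's algorithm; the function name, the 1-D input points, and the single-linkage distance are all illustrative assumptions.

```python
# A naive agglomerative clustering sketch of the dendrogram walkthrough above:
# start with n singleton clusters and repeatedly merge the closest pair,
# recording the clustering at each level.

def agglomerate(points):
    levels = []                       # levels[0] has n clusters, the last has 1
    clusters = [[p] for p in points]  # level 1: every point is its own cluster
    levels.append([list(c) for c in clusters])
    while len(clusters) > 1:
        # find the closest pair of clusters by single linkage
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]    # merge cluster j into cluster i
        del clusters[j]
        levels.append([list(c) for c in clusters])
    return levels

levels = agglomerate([1, 2, 10, 11, 20, 21, 30, 31])
# with n = 8 points there are 8 levels, from 8 clusters down to 1
```

The quadratic pair search in the inner loops is what drives the higher running time of hierarchical clustering compared to partitioning.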
BIRCH does incremental and dynamic clustering of incoming objects with only one scan of data and subsequent application of clustering algorithms to leaf nodes.
BIRCH clustering works in the agglomerative style, where cluster centers are not decided up front. Only the maximum number of cluster summaries and the threshold for any cluster are decided. These two factors are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.
The algorithm reads each of the data points sequentially and proceeds in the following manner:
Step 1: Compute the distance between record r and each of the cluster centers. Let i be the index of the cluster whose center is closest to r.
Step 2: For that i-th cluster, recompute the radius as if record r were inserted into it. If the cluster threshold is not exceeded, insert r and proceed to the next record. Otherwise, start a new cluster containing only record r.
Step 3: Check that step 2 has not exceeded the maximum number of cluster summaries. If it has, increase the threshold so that existing clusters can accommodate more records or can be merged into fewer cluster summaries.
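The three steps above can be sketched in a few lines of Python. This is a simplified sketch, not the paper's CF-tree: the class and function names are illustrative, each cluster keeps only its summary statistics (count, linear sum, square sum), and step 3 is reduced to doubling the threshold rather than rebuilding the tree.

```python
import math

# A sketch of the threshold-based insertion loop described above. Each cluster
# stores only summary statistics, so the radius can be recomputed without
# keeping the raw points.

class Cluster:
    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)                 # per-dimension sum of points
        self.square_sum = sum(x * x for x in point)   # sum of squared norms

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def radius_if_added(self, point):
        # radius of the cluster if `point` were absorbed, from summaries alone
        n = self.n + 1
        ls = [s + x for s, x in zip(self.linear_sum, point)]
        ss = self.square_sum + sum(x * x for x in point)
        c = [s / n for s in ls]
        return math.sqrt(max(0.0, ss / n - sum(x * x for x in c)))

    def add(self, point):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)

def birch_like(points, threshold, max_clusters):
    clusters = []
    for p in points:
        if clusters:
            # Step 1: find the cluster whose center is closest to p
            best = min(clusters, key=lambda c: math.dist(c.center(), p))
            # Step 2: absorb p if the threshold still holds
            if best.radius_if_added(p) <= threshold:
                best.add(p)
                continue
        clusters.append(Cluster(p))          # otherwise start a new cluster
        if len(clusters) > max_clusters:     # Step 3 (simplified): raise the
            threshold *= 2                   # threshold; a full implementation
                                             # would also rebuild and merge here
    return clusters
```

Because only the summaries are kept, memory use depends on the number of clusters rather than the number of records, which is the point of the two knobs described above.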
Sunday, March 31, 2013
How does classifier-based text data mining work?
Let's say we want an implementation for detecting keywords in text using a semantic lookup and clustering-based data mining approach. The data structures and the steps of the control flow are explained below.
In general for any such implementation, the steps are as follows:
The raw data undergoes a data selection step based on stemming and part-of-speech tagging.
The data is cleaned to remove stop words.
In the data mining step, we extract the keywords by evaluating each word and building our tree or running a decision tree.
In the evaluation step, we filter and present the results.
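The selection and cleaning steps above can be sketched as a minimal pipeline. The stop-word list and the suffix-stripping "stemmer" below are toy stand-ins for a real stemmer and tagger such as NLTK's; all names are illustrative.

```python
# A minimal sketch of the selection/cleaning steps above. The stop-word set
# and crude_stem are deliberately tiny stand-ins, not a real NLP stack.

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def crude_stem(word):
    # illustrative suffix stripping, not a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation/numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]  # cleaning step
    return [crude_stem(t) for t in tokens]               # selection via stemming

preprocess("The clusters are forming in the trees")
# → ['cluster', 'form', 'tree']
```

The surviving stems are what would then be fed to the tree-building or decision-tree step of the mining phase.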
To begin with, we view an unstructured document as a bag of words and attempt to find hierarchical clusters. We use a CF-tree data structure, as in the BIRCH system (Zhang et al.), to populate a data set on which we can run our algorithms and others. This structure is scalable and efficient for the task at hand.
The CF-tree represents hierarchical clusters of keywords that we insert into the tree one by one. We find the cluster a keyword belongs to based on a distance function that tells us the distance of the keyword from the cluster centers. The distance functions are interchangeable.
We could tag parts of speech this way:
import nltk
from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
classifier.classify(pos_features('cats'))
'NNS'
Saturday, March 30, 2013
A review of emerging trends in search user interface
Chapter review from the book: Search User Interfaces, Marti Hearst, 2009.
Search interface technology has newer trends in the areas of mobile search, multimedia search, social search, and a hybrid of natural language and command-based queries in search.
Mobile search interfaces stand out because mobile is predicted to be the primary means of connecting to the internet by 2020 (Rainie, 2008). Here, time and location information could be used to better anticipate a mobile user's information needs. For example, a user may want to reserve a table at a restaurant in the next half-hour within a one-mile radius. Another factor, observed by Kamvar and Baluja (2006) from a study of requests to mobile search engine sites, was that queries were shorter on handheld devices than on desktops and had fewer variations. They also observed that users did more retries and reformulations than changes of topic. Context-based search, where context can be defined as current activity, location, time, and conversation, becomes relevant to mobile queries. Query entry on mobile devices is also aided by dynamic term suggestions and anticipation of common queries. Search results are presented differently on handheld devices as well, such as with formatting that lists essential content only; formatting could preserve the layout with thumbnails while highlighting the relevant text. Information visualization and categorization also help improve usability on these devices, as advocated for navigation of pages in the FaThumb (Karlson et al.) interface. Results could also be specialized for certain queries, such as showing maps when a query involves locations. Besides presenting results, some browsers are also optimized for handheld devices.
Images, video, and audio are also increasingly being searched on mobile devices. While automatic image recognition is still difficult, automatic speech recognition has greatly improved, and there are techniques to input queries by speaking them. Image searches remain an important part of the overall searches performed on mobile devices, and the problem is harder when the text or metadata associated with an image does not describe its content adequately. Videos are searched by segmenting the data into scenes and then shots; text associated with the shots is attributed to the corresponding index in the video. Audio search improves when the associated audio can be converted, at least in part, to text.
When the different multimedia results are to be included for a user's query input, it is called blended results or universal search. In general search results are improved with keywords, text labels from anchor texts, surrounding text in web pages or human assisted tags.
Another interesting trend for search is social search. Web 2.0 is considered to be the interactive web, where people interact with one another for their online needs. Social ranking of web pages, collaborative search, and human-powered question answering are all examples of user interaction in search. Social ranking draws on the "wisdom of crowds" and is generally indicated by the number of people's recommendations, which a search engine can then rank to display the results. Social tagging of websites, liking a website on social networking tools such as Facebook, and search engine features that let users rank are all used for ranking results based on social interaction.
Multi-person and collaborative search, on the other hand, is for users doing common tasks involving collaboration, such as travel planning, grocery shopping, literature search, technical information search, and fact finding. Typically such search provides a centralized location for assembling a list and a view into participants' actions. The latter helps with precision and recall, two terms used to measure search: precision is the fraction of the retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Therefore precision considers the retrieved results, and recall considers the universe of relevant results. Along the same lines, another pair of metrics is freshness and relevance: freshness is about documents not yet looked at, and relevance is about documents that match the user's query. These two metrics counterbalance each other and are continuously updated.
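The precision and recall definitions above can be written out directly; the sets of result IDs below are made-up examples for illustration.

```python
# Precision and recall as defined above, computed from sets of result IDs.

def precision(retrieved, relevant):
    # fraction of the retrieved instances that are relevant
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # fraction of the relevant instances that were retrieved
    return len(retrieved & relevant) / len(relevant)

retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5, 6, 7, 8}
precision(retrieved, relevant)  # → 0.5   (2 of 4 retrieved are relevant)
recall(retrieved, relevant)     # → 0.333... (2 of 6 relevant were retrieved)
```

The asymmetry of the two denominators is exactly the point made above: precision looks only at what was retrieved, recall at the whole universe of relevant results.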
Massive scale human question answering is another technique for social search where people enter questions and thousands of others suggest answers in near real-time. This is particularly helpful in certain languages other than English.
Lastly, semantic search is another trend where keywords are discerned by looking up related synonyms to enhance search results.
Thursday, March 28, 2013
Search functionality in UI
In a web application, domain object values are often very relevant and useful for tracking records or pulling up case information. For example, a case could be referred to by the name of an individual or the product purchased by the individual. In such cases, the relational storage schema and the business-object mapping are not the user's concern; the value of an entity is all the user wants to provide. The search functionality then has to pull up the corresponding data from the data source or store. The values provided for search are simple strings, and it can be assumed that the corresponding entity attribute stores such values as strings. Therefore, the lookup of these values could be implemented with stored procedures that use full-text search.
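As a rough sketch of value-based lookup, the snippet below uses SQLite's FTS5 full-text index as a stand-in for a SQL Server full-text stored procedure; the table, columns, and sample rows are all hypothetical.

```python
import sqlite3

# A sketch of value-based case lookup with full-text search. SQLite's FTS5 is
# used here as a stand-in for a server-side full-text stored procedure; the
# schema and data are invented for illustration.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE cases USING fts5(case_id, individual, product)")
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    [("C-100", "Alice Johnson", "laptop charger"),
     ("C-101", "Bob Smith", "wireless mouse")],
)

def search_cases(value):
    # the user supplies only a value; the full-text index matches it in any column
    cur = conn.execute("SELECT case_id FROM cases WHERE cases MATCH ?", (value,))
    return [row[0] for row in cur]

search_cases("mouse")  # → ['C-101']
```

The same shape carries over to a stored procedure: the caller passes a bare string, and the full-text engine decides which rows and columns it matches.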
Wednesday, March 27, 2013
Book Review: Key Words and Corpus Analysis in Language Education
Title: Textual Patterns: Key Words and Corpus Analysis in Language Education
Author: Mike Scott & Christopher Tribble (2006)
Summary of Chapter 4 and 5 relevant to keyword extraction.
In chapter 4 of the book, the authors introduce a method for identifying keywords in a text. They propose that there are two main kinds of output in a keyword list: aboutness and style.
Scott says keyness is a quality words may have in a given text or set of texts; it is textual. He cites an example of a sentence that can be broken into a dozen trivia, each of which has a keyness. What we select out of these could be based on a threshold, which is the third factor in this analysis. Once we have a list of words and their frequencies, we can compare them to those from a known corpus to eliminate usual and frequently occurring words so that only the outstanding ones remain. The contrast between the reference and the target text is handled in a statistical way: statistical tests of probability compare a given finding to what can be expected, using a quantitative number.
But statistically significant contrast words need not reflect importance and aboutness; they can be considered to represent "style". Aboutness can be gathered from parts-of-speech analysis. Usually there is a target that references a "genid0" that ties it to the various parts of speech. This often depends on context, which can be temporal and spatial. Context also has many levels, such as collocations, the same sentence, the same paragraph, the same section or chapter, the same story or the whole text, a set of texts, or the context of culture.
Therefore there are several issues with detecting style and aboutness. First is the issue of selecting a text section versus a text versus a corpus versus a sub-corpus. Second is the statistical issue of what can be claimed. The third issue is how to choose a reference corpus. The fourth issue is handling related forms such as antonyms. The fifth issue is the status of the keywords one may identify and what is to be done with them.
Choosing a reference corpus can be tackled with a moderately sized mixed bag of corpora. The keyword procedure using this reference corpus, as mentioned earlier, is fairly robust, because even an unrelated reference corpus can detect words that indicate aboutness. And aboutness may not be one thing but a combination of several different ones.
The issue of related forms can be avoided with a tool such as WordSmith, by using stemming and ignoring semantic relations.
The status of a keyword can be determined with the help of context or purpose, since it is not evident from the text itself. As an example, "banana" and "weapons of mass destruction" may not have status without context. But status can be a pointer to a specific textual aboutness and/or style, or even a pattern. Or the keywords may have been statistically arrived at but not established. Thus keyword candidacy could be determined.
Keyword clusters or key phrases are also important for describing the aboutness and style of a given text or set of texts. When such groups of co-occurring words are selected, how many are positively or negatively key, and are there any patterns in these two types? These are further lines of research.
Thus the authors conclude that keyness is a pointer to importance, which can be sub-textual, textual, or intertextual.
Tuesday, March 26, 2013
Java Continued part 3
A Java server is implemented with code-behind, JSP pages, server-side scripts, etc. It is usually a best practice to keep code out of JSP pages, because syntax errors in Java code in a JSP page are not detected until the page is deployed, as opposed to tag libraries and servlets. JSP pages may be maintained by page authors who are not Java experts, mixed HTML markup and Java code can be hard to read, and JSP pages are primarily intended for presentation logic.
Servers are also implemented with Struts framework. Do not mix DynaForm beans with Struts. Use scaffolding to your advantage.