Sunday, March 31, 2013

How does classifier-based text data mining work?

Let's say we want an implementation for detecting keywords in text using a semantic lookup and a clustering-based data mining approach. The data structures and the steps of the control flow are explained below.
In general, for any such implementation, the steps are as follows:
The raw data undergoes a data selection step based on stemming and part-of-speech tagging.
The data is cleaned to remove stop words.
In the data mining step, we extract the keywords by evaluating each word and building a tree or running a decision tree.
In the evaluation step, we filter and present the results.
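A minimal sketch of this pipeline, with a toy stop-word list and simple frequency scoring standing in for a real stemmer, tagger and decision tree:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on"}  # toy list

def select(raw_text):
    # Data selection: tokenize; a real system would also stem and POS-tag here
    return re.findall(r"[a-z]+", raw_text.lower())

def clean(tokens):
    # Data cleaning: remove stop words
    return [t for t in tokens if t not in STOP_WORDS]

def mine(tokens):
    # Data mining: score each word; raw frequency stands in for a decision tree
    return Counter(tokens)

def evaluate(scores, top_n=3):
    # Evaluation: filter and present the results
    return [word for word, _ in scores.most_common(top_n)]

text = "The cat sat on the mat and the cat chased a mouse"
keywords = evaluate(mine(clean(select(text))))
print(keywords)  # 'cat' should rank first
```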


To begin with, we view an unstructured document as a bag of words and attempt to find hierarchical clusters. We use a specific CF tree data structure, as in the BIRCH system (Zhang et al.), to build a data set we can run our algorithms and others against; it is scalable and efficient for the task at hand.
The CF tree represents hierarchical clusters of keywords, which we insert into the tree one by one. For each keyword we find the cluster it belongs to using a distance function that measures the keyword's distance from the cluster centers. The distance functions are interchangeable.
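A minimal flat sketch of the clustering-feature idea, with CF entries kept as (N, LS) running sums, an interchangeable distance function, and a threshold for opening a new cluster; a real BIRCH implementation maintains these entries in a height-balanced tree:

```python
import math

def euclidean(a, b):
    # Interchangeable distance function
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class CF:
    """A clustering feature: count N and linear sum LS of the points absorbed."""
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def absorb(self, point):
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, point)]

def insert(clusters, point, threshold=1.5, distance=euclidean):
    # Find the nearest cluster center; absorb the point or start a new cluster
    if clusters:
        nearest = min(clusters, key=lambda c: distance(c.centroid(), point))
        if distance(nearest.centroid(), point) <= threshold:
            nearest.absorb(point)
            return clusters
    clusters.append(CF(point))
    return clusters

clusters = []
for p in [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]:
    insert(clusters, p)
print(len(clusters))  # two well-separated groups -> 2 clusters
```

Keywords would first be mapped to feature vectors before insertion; the threshold and distance function are the tunable parts.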


We could tag parts of speech this way, following the NLTK suffix-based decision tree example:

import nltk
from nltk.corpus import brown

# Build a frequency distribution over the last one, two and three
# characters of each word in the Brown corpus
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = [s for s, _ in suffix_fdist.most_common(100)]

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
classifier.classify(pos_features('cats'))
# 'NNS'

Saturday, March 30, 2013

A review of emerging trends in search user interface

Chapter review from book: Search User Interfaces Marti Hearst 2009
Search interface technology has newer trends in the areas of mobile search, multimedia search, social search, and a hybrid of natural-language and command-based queries.
Mobile search interfaces stand out because mobile is predicted to be the primary means of connecting to the internet by 2020 (Rainie, 2008). Here, time and location information could be used to better anticipate a mobile user's information needs. For example, a user may want to reserve a table at a restaurant in the next half-hour within a one-mile radius. Another observation, by Kamvar and Baluja (2006) from a study of requests to mobile search engine sites, was that queries were shorter on handheld devices than in desktop searches and had fewer variations. They also observed that users retried and reformulated queries more often than they changed topic. Context-based search, where context can be defined as current activity, location, time and conversation, becomes relevant to mobile queries. Query entry on mobile devices is also helped by dynamic term suggestions and anticipation of common queries. Search results are also presented differently on handheld devices, for example formatted to list only essential content; the layout with thumbnails may be preserved while the relevant text is highlighted. Information visualization and categorization also help improve usability on these devices; this is advocated for navigating pages in the FaThumb interface (Karlson et al.). Results could also be specialized for certain queries, such as showing maps when a query involves locations. Besides presenting results, some browsers are also optimized for handheld devices.
Images, video and audio are also increasingly being searched on mobile devices. While automatic image recognition is still difficult, automatic speech recognition has greatly improved, and there are techniques for entering spoken queries. Image searches remain an important part of the overall searches performed on mobile devices, and the problem is harder when the text or metadata associated with an image does not describe its content adequately. Videos are searched by segmenting the data into scenes and then into shots; text associated with the shots is attributed to the corresponding index points in the video. Audio searches improve when the audio can be converted, at least partially, to text.
When the different multimedia results are to be included for a user's query input, it is called blended results or universal search. In general search results are improved with keywords, text labels from anchor texts, surrounding text in web pages or human assisted tags.
Another interesting trend for search is social search. Web 2.0 is considered the interactive web, where people interact with one another for their online needs. Social ranking of web pages, collaborative search, and human-powered question answering are all examples of user interaction for search. Social ranking draws on the "wisdom of crowds" and is generally indicated by the number of people's recommendations, which a search engine can then use to rank the results it displays. Social tagging of websites, liking a website on social networking tools such as Facebook, and search engine features that let users rank results are all used to rank results based on social interaction.
Multi-person, collaborative search, on the other hand, lets users carry out common tasks involving collaboration, such as travel planning, grocery shopping, literature search, technical information search and fact finding. Typically a centralized location for assimilating a list, and a view into participants' actions, is provided for such search. The latter helps with precision and recall. Precision and recall are two terms used to measure search: precision is the fraction of the retrieved instances that are relevant, while recall is the fraction of the relevant instances that are retrieved. Therefore precision considers the retrieved results and recall considers the universe of relevant results. Along the same lines, another pair of metrics is freshness and relevance: freshness is about documents not yet looked at, and relevance is about documents that match the user's query. These two metrics counterbalance each other and are continuously updated.
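For example, precision and recall can be computed directly from the retrieved set and the relevant set:

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of 4 retrieved documents are relevant; 3 of 6 relevant documents were found
p, r = precision_recall({"d1", "d2", "d3", "d9"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
print(p, r)  # 0.75 0.5
```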
Massive scale human question answering is another technique for social search where people enter questions and thousands of others suggest answers in near real-time. This is particularly helpful in certain languages other than English.
Lastly, semantic search is another trend where keywords are discerned by looking up related synonyms to enhance search results.
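A toy sketch of such synonym expansion, using a hand-made thesaurus in place of a real lexical resource such as WordNet:

```python
# Hypothetical miniature thesaurus; a real system would use WordNet or similar
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "buy": {"purchase", "acquire"},
}

def expand_query(terms):
    # Add related synonyms to each query term to enhance recall
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= SYNONYMS.get(term, set())
    return expanded

print(sorted(expand_query(["buy", "car"])))
# ['acquire', 'automobile', 'buy', 'car', 'purchase', 'vehicle']
```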


Thursday, March 28, 2013

Search functionality in UI

In a web application, domain object values are often very relevant and useful for tracking records or pulling up case information. For example, a case could be referred to by the name of an individual or by the product the individual purchased. In such cases, the relational storage schema and the business-object mapping are of no concern to the user; the value of an entity is all the user wants to provide. The search functionality then has to pull up the corresponding data from the data source or store. The values provided for search are simple strings, and it can be assumed that the corresponding entity attributes store such values as strings. Therefore, the lookup of these values could be implemented with stored procedures that use full-text search.
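As a rough illustration of the idea in Python rather than SQL, a word-level inverted index over entity values can play the role the full-text stored procedure would play in the database (the case records here are invented):

```python
from collections import defaultdict

# Hypothetical case records keyed by id
cases = {
    1: {"name": "Alice Smith", "product": "laptop"},
    2: {"name": "Bob Jones", "product": "phone"},
}

# Map each word occurring in a string value to the case ids containing it
index = defaultdict(set)
for case_id, record in cases.items():
    for value in record.values():
        for word in value.lower().split():
            index[word].add(case_id)

def search(term):
    # The user supplies only a value; pull up the matching records
    return [cases[i] for i in sorted(index.get(term.lower(), set()))]

print(search("laptop"))  # the record for Alice Smith
```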

Wednesday, March 27, 2013

Book Review: Textual Patterns: Key Words and Corpus Analysis in Language Education


Title: Textual Patterns: Key Words and Corpus Analysis in Language Education.
Author: Mike Scott & Christopher Tribble (2006)
Summary of Chapter 4 and 5 relevant to keyword extraction.
In chapter 4 of the book, the authors introduce a method for identifying keywords in a text. They propose that there are two main kinds of output in a keyword list: aboutness and style.
They say keyness is a quality words may have in a given text or set of texts; it is textual. They cite an example of a sentence that can be broken into a dozen trivia, each of which has a keyness. What we select out of these could be based on a threshold, which is the third factor in this analysis. Once we have a list of words and their frequencies, we can compare them to those from a known reference corpus to eliminate usual and frequently occurring words, so that only the outstanding ones remain. The contrast between the reference and the target text is handled statistically: tests of probability compare a given finding to what could be expected, yielding a quantitative measure.
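This contrast can be sketched with a simplified Dunning-style log-likelihood score comparing a word's frequency in the target text against a reference corpus (the counts below are invented for illustration; a real analysis would use a full contingency table and a significance threshold):

```python
import math

def keyness(word, target_counts, ref_counts, target_total, ref_total):
    # Simplified log-likelihood of the word's observed frequencies in the
    # target text and the reference corpus versus their expected frequencies
    a = target_counts.get(word, 0)
    b = ref_counts.get(word, 0)
    e1 = target_total * (a + b) / (target_total + ref_total)
    e2 = ref_total * (a + b) / (target_total + ref_total)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

target = {"the": 50, "genome": 12}       # counts in the target text
reference = {"the": 50000, "genome": 3}  # counts in the reference corpus
t_total, r_total = 1000, 1000000

# "genome" is unusually frequent in the target, "the" is not
print(keyness("genome", target, reference, t_total, r_total) >
      keyness("the", target, reference, t_total, r_total))
```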
But statistically significant contrast words need not reflect importance and aboutness; they can instead be considered to represent "style". Aboutness can be gathered from parts-of-speech analysis. Usually there is a target that references a "genid0" that ties it to the various parts of speech.
This often depends on context which can be temporal and spatial. Context also has many levels such as collocations, same sentence, same paragraph, same section or chapter, same story or the whole text, a set of texts or the context of culture.
Therefore there are several issues with detecting style and aboutness. The first is selecting a text section versus a text versus a corpus versus a sub-corpus. The second is the statistical issue of what can be claimed. The third is how to choose a reference corpus. The fourth is handling related forms such as antonyms. The fifth is the status of the keywords one may identify and what is to be done with them.
Choosing a reference corpus can be tackled with a moderately sized mixed bag of corpora. The keyword procedure using such a reference corpus, as mentioned earlier, is fairly robust, because even an unrelated reference corpus can detect words that indicate aboutness. And aboutness may not be one thing but a combination of several different ones.
Related forms can be handled with a tool such as WordSmith by using stemming and ignoring semantic relations.
Status of a keyword can be determined with the help of context or purpose, since it's not evident from the text itself. As an example, "banana" and "weapons of mass destruction" may not have status without context. But status can be a pointer to a specific textual aboutness and/or style, or even a pattern. Or the keywords may have been statistically arrived at but not established. Thus keyword candidacy could be determined.
Keyword clusters or key phrases are also important for describing the aboutness and style of a given text or set of texts. When such groups of co-occurring words are selected, how many are positively or negatively key, and are there any patterns in these two types? These are further lines of research.
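Such co-occurring groups can be found by counting adjacent word pairs; a minimal sketch:

```python
from collections import Counter

def key_bigrams(tokens, min_count=2):
    # Count adjacent word pairs and keep those that recur
    pairs = Counter(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

tokens = ("weapons of mass destruction were discussed ; "
          "weapons of mass destruction dominated the debate").split()
result = key_bigrams(tokens)
print(result)  # the recurring pairs from "weapons of mass destruction"
```

A real analysis would extend this to longer n-grams and compare the counts against a reference corpus, just as for single keywords.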
Thus the authors conclude that keyness is a pointer to the importance which can be sub-textual, textual or intertextual.

Tuesday, March 26, 2013

Java Continued part 3

Java servers are implemented with code-behind, JSP pages, server-side scripts, etc. It's usually best practice to keep Java code out of JSP pages, because syntax errors in Java code in a JSP page are not detected until the page is deployed, as opposed to tag libraries and servlets. JSP pages may be maintained by page authors who are not Java experts, mixed HTML markup and Java code can be hard to read, and JSP pages are primarily intended for presentation logic.
Servers are also implemented with the Struts framework. Do not mix DynaForm beans with Struts. Use scaffolding to your advantage.

Turney's paper on learning algorithms for keyphrase extraction

Turney's paper on keyphrase extraction uses the C4.5 decision tree induction algorithm (Quinlan, 1993) and proposes GenEx, a hybrid genetic algorithm for keyphrase extraction. C4.5 is a general-purpose algorithm, and the paper uses it in a few different ways. GenEx, however, is tailor-made for learning to extract key phrases. It has two main components, the Genitor genetic algorithm (Whitley, 1989) and the Extractor keyphrase extraction algorithm (Turney, 1997, 1999). Extractor takes a document as input and produces a list of keyphrases as output. Twelve parameters govern how it processes its input, and Genitor tunes them to maximize fitness on training data. The twelve parameters of Extractor are as follows:
NUM_PHRASES
NUM_WORKING
FACTOR_TWO_ONE
FACTOR_THREE_ONE
MIN_LENGTH_LOW_RANK
MIN_RANK_LOW_LENGTH
FIRST_LOW_THRESH
FIRST_HIGH_THRESH
FIRST_LOW_FACTOR
FIRST_HIGH_FACTOR
STEM_LENGTH
SUPPRESS_PROPER
The Extractor algorithm works something like this:
Step 1 : Find single stems
Step 2 : Score single stems - The score is the number of times the stem appears in the text, multiplied by a factor based on where the stem first appears. If the stem first appears before FIRST_LOW_THRESH, multiply the frequency by FIRST_LOW_FACTOR; if it first appears after FIRST_HIGH_THRESH, multiply the frequency by FIRST_HIGH_FACTOR. Typically, FIRST_LOW_FACTOR is greater than one and FIRST_HIGH_FACTOR is less than one, so stems that appear early in the document get a higher score and stems that first appear late get a lower score.
Step 3 : Select top single stems : Rank the stems in the order of decreasing score and make a list of the top NUM_WORKING single stems.
Step 4 : Find stem phrases - Make a list of all phrases (unigrams, bigrams, trigrams) that appear without intervening gaps or punctuation. Stem each phrase by truncating each word in the phrase at STEM_LENGTH characters.
Step 5 : Score stem phrases - For each stem phrase, count how often the stem phrase appears in the text and note where it appears. Assign a score to each phrase just as in step 2, then adjust the score based on the number of words in the stem phrase: if there are two words, multiply the score by FACTOR_TWO_ONE, and if there are three words, multiply by FACTOR_THREE_ONE. Thus longer phrases get higher scores, boosting them over rare single-word stems.
Step 6 : Expand single stems - For each stem in the list of the top NUM_WORKING single stems, find the highest-scoring stem phrase of one, two or three stems that contains the given stem; this yields a list of NUM_WORKING stem phrases sorted by score.
Step 7 : Drop duplicates: For example, when two single stems expand to the same two word stem phrase, they are duplicates. Preserve the ranking among the remaining stems.
Step 8 : Add suffixes : For each of the remaining stem phrases, find the most frequent corresponding whole phrases
Step 9 : Add capitals : Prefer the lowest number of capitals in the phrases. This is usually the best unless the capitalization is inconsistent where one of the words is say a proper noun.
Step 10 : Final output - We now have an ordered list of mixed-case (upper and lower case, as appropriate) phrases with suffixes added. The list is ordered by the scores calculated in step 2.
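The first few steps can be sketched as follows; the parameter values are illustrative placeholders, not Turney's tuned settings, and the truncation stemmer and scoring are simplified:

```python
from collections import Counter

# Illustrative parameter values (not Turney's tuned ones)
STEM_LENGTH = 5
FIRST_LOW_THRESH, FIRST_HIGH_THRESH = 10, 40
FIRST_LOW_FACTOR, FIRST_HIGH_FACTOR = 2.0, 0.5
NUM_WORKING = 3

def stem(word):
    # Step 1: crude stemming by truncation at STEM_LENGTH characters
    return word.lower()[:STEM_LENGTH]

def score_stems(words):
    # Step 2: frequency times a factor based on position of first occurrence
    stems = [stem(w) for w in words]
    counts = Counter(stems)
    first = {}
    for i, s in enumerate(stems):
        first.setdefault(s, i)
    scores = {}
    for s, n in counts.items():
        factor = 1.0
        if first[s] < FIRST_LOW_THRESH:
            factor = FIRST_LOW_FACTOR      # boost stems that appear early
        elif first[s] > FIRST_HIGH_THRESH:
            factor = FIRST_HIGH_FACTOR     # penalize stems that appear late
        scores[s] = n * factor
    return scores

def top_stems(words):
    # Step 3: rank by decreasing score, keep the top NUM_WORKING stems
    scores = score_stems(words)
    return sorted(scores, key=scores.get, reverse=True)[:NUM_WORKING]

words = ["keyphrase", "extraction"] * 2 + ["filler"] * 40 + ["conclusion"] * 3
print(top_stems(words))
```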
The Genitor algorithm may be viewed as a method for optimizing a string of bits using techniques inspired by biological evolution. Think of each bit string as an individual in a population. New bit strings are created by randomly changing existing individuals (this operation is called mutation) and by combining sub-strings from parents to make new children (this operation is called crossover). Each individual is assigned a score called its fitness, based on some measure of the quality of the bit string with respect to a given task.
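The bit-string evolution described can be sketched with a toy fitness function (counting 1-bits, the classic one-max problem) in place of GenEx's keyphrase-match fitness; the steady-state replacement of the worst individual loosely mirrors Genitor:

```python
import random

random.seed(0)  # deterministic for illustration

def fitness(bits):
    # Toy fitness: number of 1 bits; GenEx instead measures keyphrase match
    return sum(bits)

def mutate(bits, rate=0.05):
    # Mutation: randomly flip bits
    return [b ^ 1 if random.random() < rate else b for b in bits]

def crossover(mom, dad):
    # Crossover: combine sub-strings from two parents at a random point
    point = random.randrange(1, len(mom))
    return mom[:point] + dad[point:]

def evolve(length=20, pop_size=30, generations=60):
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        # Steady-state replacement: a child of the two best replaces the worst
        mom, dad = population[0], population[1]
        population[-1] = mutate(crossover(mom, dad))
    return max(population, key=fitness)

best = evolve()
print(fitness(best))
```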
GenEx: The parameters in GenEx are set using the standard machine learning paradigm of supervised learning. The algorithm is tuned with a dataset consisting of a set of documents paired with their target keyphrase lists. The document set is divided into training and testing subsets. The learning process adjusts the parameters to maximize the match between the output of Extractor and the target keyphrase lists on the training data. The success of the learning process is measured by examining the match between Extractor's output and the target lists on the testing data.

Monday, March 25, 2013

Java continued ...

Java has excellent support for client-server programming; in general, it has good support for networking and particularly for client-server application development. Java has built-in support for applets and rich support for the document object model. Applets enable many of the common UI operations we perform on Windows. Microsoft shipped its own Java runtime and tooling as Visual J++.