Tuesday, March 26, 2013

Turney's paper on learning algorithms for keyphase extraction

Turney's paper on keyphrase extraction uses the C4.5 decision tree induction algorithm (Quinlan, 1993) and proposes GenEx -  a hybrid genetic algorithm for keyphrase extraction. C4.5 is general purpose algorithm and this paper has used it in a few different ways. However, their proposal GenEx is tailor made for learning to extract key phrases. It has two main components, the Genitor genetic algorithm (Whitley, 1989) and the Extractor keyphrase extraction algorithm (Turney, 1997, 1999). Extractor takes a document as input and produces a list of keyphrases as output. There are twelve parameters that govern how it processes the output. The Genitor takes these parameters and tunes them to maximize fitness on training data. These twelve parameters of Extractor are as follows:
NUM_PHRASES
NUM_WORKING
FACTOR_TWO_ONE
FACTOR_THREE_ONE
MIN_LENGTH_LOW_RANK
MIN_RANK_ROW_LENGTH
FIRST_LOW_THRESH
FIRST_HIGH_THRESH
FIRST_LOW_FACTOR
FIRST_HIGH_FACTOR
STEM_LENGTH
SUPPRESS_PROPER
The Extractor algorithm works something like this:
Step 1 : Find single stems
Step 2 : Score single stems - The score is the number of times the stem appears in the text, multiplied by a factor. If the stem first appears before FIRST_LOW_THRESH, then multiply the frequency by the FIRST_HIGH_FACTOR. Typically, FIRST_LOW_FACTOR is greater than one and FIRST_HIGH_FACTOR is less than one. Thus frequent stems get a higher score and rare stems get a lower score.
Step 3 : Select top single stems : Rank the stems in the order of decreasing score and make a list of the top NUM_WORKING single stems.
Step 4 : Find Stem phrases : Make a list of all phrases ( unigrams, bigrams, trigrams) that repeat without gap or punctuation. Stem each phrase by truncating each word in the phrase at STEM_LENGTH characters
Step 5 : Score stem phrases : For each stem phrase, count how often the stem phrase appears in text and note when it appears. Assign a score to each phrase just like step 2, then adjust based on the number of words in the stem. If there are two words in stem, multiply each word by FACTOR_TWO_ONE and if there are three stems, multiply the score by FACTOR_THREE_ONE. Thus longer phrases have higher score to boost them over rare single word stems.
Step 6 : Expand single stems: For each stem in the list of the top NUM_WORKING single stems, find the highest scoring stem phrase of one, two or three stems that contain the given stem to build a list of same count stem phrases sorted by their scores.
Step 7 : Drop duplicates: For example, when two single stems expand to the same two word stem phrase, they are duplicates. Preserve the ranking among the remaining stems.
Step 8 : Add suffixes : For each of the remaining stem phrases, find the most frequent corresponding whole phrases
Step 9 : Add capitals : Prefer the lowest number of capitals in the phrases. This is usually the best unless the capitalization is inconsistent where one of the words is say a proper noun.
Step 10 : Final output : We now have an ordered list of mixed-case ( upper and lower case, if appropriate ) phrases with suffixes added. The list is ordered by scores calculated in the step 2.
The Genitor algorithm may be viewed as a method for optimizing a string of bits using techniques that are inspired by biological evolution. Think of the bit string as a population of individuals. New bit strings are created by randomly changing existing individuals ( this operation is called mutation ) and by combining sub-strings from parents to make new children (this operation is called crossover). Each individual is assigned a score called its fitness, based on some measure of the quality of the bit string, with respect to a given task.
GenEx  The parameters in GenEx are set using the standard machine language paradigm of supervised learning. The algorithm is tuned with a dataset consisting of a set of documents paired with their target lists.  The document set is divided into test and training subsets. The learning process involves adjusting the parameters  to maximize the match between the output of the extractor and the target keyphrase  lists, using the training data. The success of the learning process is measured by the examining the match between the testing data.

Monday, March 25, 2013

Java continued ...

Java has excellent support for client server programming. In general, it has good support for networking and particularly for client server application development. Java has built in support for applets and has rich support for document model. Applets enable many of the common UI operations we perform on Windows. Microsoft has a runtime support for Java called Microsoft J++.

Sunday, March 24, 2013

Java

I'm trying to take a java test today so here is a quick overview:
 Java features: simple, object-oriented, robust, multi-threaded, architecture-neutral, interpreted and high-performance, distributed and dynamic.
Code is written usually in one of two ways : viz. procedural  -  set of instructions that defines "what is happening" or second object-oriented  - code is organized around data to describe "who is being affected".

Common data types include byte, int, double etc. You can use instanceOf
String operations : StartsWith, equalsIgnoreCase, regionMatches, reverse etc.

Common data structures are Vector, Dictionary, Hashtable, StringTokenizer, BitSet, Observable,
use static and this keywords for class methods liberally eg:
 class Box
{
public static void main(string args[])

use implements and extends judiciously
class Deadlock implements Runnable 
class Box extends Shape

 use Thread, ThreadGroup and Runnable
 use Cloneable interface
 use Observable class and Observe interface. Observable calls notifyObservers() and Observer declares update(Observable, Object) method.

Friday, March 22, 2013

Expression Tree and WPF

DependencyProperty in WPF represents a property that can be set through methods such as styling, data binding, animation and inheritance. It also reports information such as whether changing a property value can be coerced. Further, it can be edited in the Visual Studio designer just like other properties.
QueryControls can make use of dependencyproperty when representing the expression tree on the group canvas. This is very helpful as the clauses are added, deleted, modified or grouped/ungrouped. Expressions are linear textual representations. They are adapted with ExpressionAdapters to form query trees. LINQ provides the functionalities for expression evaluation.

UI automation continued

There are several diagnostics that can help with UI automation. . For example, WPF has a method called DrawHighlight() that can be called on its controls which makes it appear highlighted on the site. Another way is to use the default spy tool that comes with visual studio template for coded UI project that lets you identify the properties with which to search a control. Controls can be searched by first explicitly setting the property to search for such as a className, name or display text and then calling the FindMatchingControls method that recursively searches the document model. Logging can also be turned on for WPF which helps further with the asynchronous calls made through the UI.

Thursday, March 21, 2013

User Interface automation

When automating UI test cases, there are several challenges that cost time and effort. First of these is that the controls on the winforms may not be named. To find the contol, one may have to walk the document model. Second the controls may not be at the level in the tree as they visually appear on the UI. One may have to use a tool to detect the level and the location of the tree node. Sometimes the tools also don't give the correct information. Further, the properties or attributes used for identifying the control may require a recursive search over the controll hierarchy in the document. Third the contols may not all be consistent across UI panes and their access and usage varies even though they are the same controls. These cause additional code to be written for each such case. Fourth, there are many different ways to implement the same functionality involving different controls or generally different workflow. Code for automation has to deal with all these cases. Fifth, the contols may not always be out of box. They may provide advanced features or they may have additional test cases required. Custom controls have their own test requirements. All these add to the code variations. Fortunately, they can be organized and handled as much as possible.

Wednesday, March 20, 2013

Test harness and drivers

Frequently, we develop modules for automted test execution but don't integrate all of them. We do this mostly because we are short on time or resource. A test framework that's useful for manual tests can also be useful for automated tests. Just the road-map of the integration needs to be worked out. This is where a technical specification and planning tools help. Because it lists all the dependencies, the layers, the object model and mappings, the translations and operations, the stages of development, the costs for development and test and the trade-offs and the deliverables from sprint planning. With a knowledge of the means and methods to integrate test tools and framework captured in written format, resources can pick them up or execute independent of each other. The benefits of integration is mutual to the individual software products being integrated and adds overall value to existing products. I want to be able to do this.