Tuesday, March 26, 2013

Turney's paper on learning algorithms for keyphase extraction

Turney's paper on keyphrase extraction uses the C4.5 decision tree induction algorithm (Quinlan, 1993) and proposes GenEx -  a hybrid genetic algorithm for keyphrase extraction. C4.5 is general purpose algorithm and this paper has used it in a few different ways. However, their proposal GenEx is tailor made for learning to extract key phrases. It has two main components, the Genitor genetic algorithm (Whitley, 1989) and the Extractor keyphrase extraction algorithm (Turney, 1997, 1999). Extractor takes a document as input and produces a list of keyphrases as output. There are twelve parameters that govern how it processes the output. The Genitor takes these parameters and tunes them to maximize fitness on training data. These twelve parameters of Extractor are as follows:
NUM_PHRASES
NUM_WORKING
FACTOR_TWO_ONE
FACTOR_THREE_ONE
MIN_LENGTH_LOW_RANK
MIN_RANK_ROW_LENGTH
FIRST_LOW_THRESH
FIRST_HIGH_THRESH
FIRST_LOW_FACTOR
FIRST_HIGH_FACTOR
STEM_LENGTH
SUPPRESS_PROPER
The Extractor algorithm works something like this:
Step 1 : Find single stems
Step 2 : Score single stems - The score is the number of times the stem appears in the text, multiplied by a factor. If the stem first appears before FIRST_LOW_THRESH, then multiply the frequency by the FIRST_HIGH_FACTOR. Typically, FIRST_LOW_FACTOR is greater than one and FIRST_HIGH_FACTOR is less than one. Thus frequent stems get a higher score and rare stems get a lower score.
Step 3 : Select top single stems : Rank the stems in the order of decreasing score and make a list of the top NUM_WORKING single stems.
Step 4 : Find Stem phrases : Make a list of all phrases ( unigrams, bigrams, trigrams) that repeat without gap or punctuation. Stem each phrase by truncating each word in the phrase at STEM_LENGTH characters
Step 5 : Score stem phrases : For each stem phrase, count how often the stem phrase appears in text and note when it appears. Assign a score to each phrase just like step 2, then adjust based on the number of words in the stem. If there are two words in stem, multiply each word by FACTOR_TWO_ONE and if there are three stems, multiply the score by FACTOR_THREE_ONE. Thus longer phrases have higher score to boost them over rare single word stems.
Step 6 : Expand single stems: For each stem in the list of the top NUM_WORKING single stems, find the highest scoring stem phrase of one, two or three stems that contain the given stem to build a list of same count stem phrases sorted by their scores.
Step 7 : Drop duplicates: For example, when two single stems expand to the same two word stem phrase, they are duplicates. Preserve the ranking among the remaining stems.
Step 8 : Add suffixes : For each of the remaining stem phrases, find the most frequent corresponding whole phrases
Step 9 : Add capitals : Prefer the lowest number of capitals in the phrases. This is usually the best unless the capitalization is inconsistent where one of the words is say a proper noun.
Step 10 : Final output : We now have an ordered list of mixed-case ( upper and lower case, if appropriate ) phrases with suffixes added. The list is ordered by scores calculated in the step 2.
The Genitor algorithm may be viewed as a method for optimizing a string of bits using techniques that are inspired by biological evolution. Think of the bit string as a population of individuals. New bit strings are created by randomly changing existing individuals ( this operation is called mutation ) and by combining sub-strings from parents to make new children (this operation is called crossover). Each individual is assigned a score called its fitness, based on some measure of the quality of the bit string, with respect to a given task.
GenEx  The parameters in GenEx are set using the standard machine language paradigm of supervised learning. The algorithm is tuned with a dataset consisting of a set of documents paired with their target lists.  The document set is divided into test and training subsets. The learning process involves adjusting the parameters  to maximize the match between the output of the extractor and the target keyphrase  lists, using the training data. The success of the learning process is measured by the examining the match between the testing data.

No comments:

Post a Comment