Friday, August 7, 2015

Today we continue our discussion  on the Text Summarization by Eduard Hovy.
When we were mentioning topic identification as stage 1 of the processing. One way to evaluate this stage is to use metrics like precision and recall. Using ratios of the correctly identified sentences by both machine and human to the overall total of those correctly and incorrectly identified and total of correctly identified and those missed respectively, we get a sense of just how effective the topic identification has been. Precision gives a sense of the accuracy because the higher the precision the lesser the number of incorrectly chosen text. Recall gives a sense of information content because the lesser the number of missed text units, the higher the overall topic identification.
Also when we discussed positional criteria to identify sentences, we should included a policy that determine a genre and domain oriented policy to determine the best positions. This ranked list of sentence positions produce the highest yield of extracts and its possible to automate both the  policy creation and the extraction.
Let us look at some of the criteria for these policies:
Cue phrase indicator criteria: Certain words and phrases may be canned  and signal importance. For example a sentence beginning with "in this paper we show"  suggests a high correlation to the topic of the text. Teufel and Moens 1997 report 54 percent joint recall and precision using a manually built list of 1423 cue phrases.
Word and phrase frequency criteria: Luhn 1959 used Zipf's law of word distribution to develop the following extraction criterion : if a text contains some words unusually frequently, then the sentences containing these words are important.
Other metrics involve Term-Frequency-inverse-document-frequency, information gain and so on.Witbrock and Mittal 1999 compute a statistical model that describes the likelihood that each individual word in the text will appear in the summary, in the context of certain features (part-of-speech tag, word length, neighboring words, average sentence length etc)
Query and title overlap criteria : A simple but useful method to describe and score each sentence by the number of desirable words it contains. Desirable words are those that are contained in the title or in the users query
Cohesive or lexical connectedness criteria: Words can be selected in various ways including repetition, conference, synonymy and semantic association. Sentences and paragraphs can be scored based on the degree of connectedness of their words.Mani and Bloedorn 1997 represent the text as a graph in which words are nodes and edges represent adjacency, coreference and lexical similarity.
Discourse structure criteria: Certain lengthy text such as discourses have an underlying structure that can be used to calculate the discourse centrality of text units.
Combination of various module scores, researchers have found that no single method of scoring performs as well as a combination and there is no best strategy. Using a bayesian classifier, Aone  1999 showed that even within a single genre, different newspaper require different features to achieve the same performance. Using "Summarist",  Lin compared eighteen different features and a naive combination of them, and an optimal combination using machine  learning algorithm. While combination is the top scorer, the second best is achieved by query term overlap and  the third best score is achieved by word frequency, the lead method and the naive combination function. The evaluation of different approaches showed that no 5 percent summary achieved a combined precision and recall of (f-score) of over 0.25
#codingexercise
How many times does the digit 1 occur in the integers from 1 to n ?
 For example, if n  is 12, there are four numbers from 1 to 12 containing the digit 1, which are 1, 10, 11, and 12, and the digit 1 occurs five times.

We use a recursive strategy to count the number of ones by divide and conquer into two ranges one without the most significant digit and one with it. The number of ones for a fixed number of digits can be counted as permutations by fixing one of the digits as one and varying other digits from 0 to 9.
Int numberOf1(string N)
{
int first, length;
int numOtherDigits, numRecursive and numFirstDigit;

num = int.TryParse(N);
if (num == 0) return 0;

first = N[0];
length = N.length;
if length == 1 and first == 0 return 0;
if length == 1 and first > 0 return 1;

//number is split into two ranges : lo and hi, hi contains the most significant digit (leftmost) while lo doesn't 

//numFirstDigit is when the most significant digit is 1 in the hi range
numFirstDigit = 0;
if first > 1
       numFirstDigit = PowerBase10(length – 1);
else if (first == 1)
       numFirstDigit   = int(N[1])+1;

//numOtherDigits is the number of 1s in the hi range excluding the above case
numOtherDigits = first * (length – 1) * PowerBase10(length -2);

// numRecursive is the number of 1s in the lo range;
numRecursive = numberOf1(N.trimLeft());

return numFirstDigit+numOtherDigits+numRecursive;

}

PowerBase10 multiplies by 10 the parameter number of times.

int PowerBase10(unsigned int n)
{
int result = 1;
for (int I = 0; I < n ; I ++)
{
result *= 10;
}
return result;
}

No comments:

Post a Comment