Cluster computing

Saturday, August 8, 2015

Today we conclude reading the Text Summarization with multi-document summarization and evaluating summaries:

When summarizing multiple documents, one has to identify and locate thematic overlaps. One also has to decide what to include of the remainder to overcome potential inconsistencies between documents, and when necessary to arrange events from various sources along a single timeline.

Assuming that all input documents are parsed into templates (whose standardization makes comparison easier), systems like SUMMONs clusters the templates according to their contents and then applies rules to extract items of major import.

To determine what additional material should be included, first identify the units most relevant to the user’s query and then estimate the marginal relevance of all remaining units.

Lastly, some domains lend themselves to simple rules that can even get near perfect results. For examples, news articles covering the same event can be summarized by picking the most recent one in the timeline and taking the first two or three paragraphs.

How can you evaluate the quality of a summary ? This can be both intrinsic and extrinsic – the former determines output quality and the latter measures user assistance in task performance. The intrinsic ones use sentence or phrase recall and precision. A second intrinsic method is to require evaluators who judge based on readability, informative-ness, fluency and coverage. Extrinsic evaluation is easy to motivate.IR methods produce the best query based extracts. Much of the scoring process becomes simplified to IR like recall and precision measures against human extracts. That said the biggest challenge here is that summaries receive different scores with different measures.

There are two basic measures to capture the extent to which a summary meet s the requirements to be smaller than the input text and contain the most important information of the original. These are:

1) Compression ratio : CR = (length S)/(length T)

2) Retention ratio : RR = info in S / info in T, where T is the text and S is the summary.

The latter can be judged in four different ways:

1) The expert game: Ask experts to underline or extract the most interesting or informative fragments of the text.

2) The classification game: This is an extrinsic measure

3) The Shannon game: This is an information theory method that relies on the probability of the reader guessing the message, as say p and the measure is –p log p.

4) The question game: This measure approximates the information content of S by allowing how well it allows readers to answer questions drawn up about T

#codingexercise

Big numbers can be formed if numbers in an array are concatenated together. How do you print the minimum concatenated number of a given array? For example, if the input array is {3, 32, 321}, there are six concatenated numbers and the minimum one is 321323.

There are n! permutations for an array of size n. So the brute force approach would be O(n!)

The key observation here is that sorting doesn’t work but if we change the sort rule in such a way that when sorted this special way, their concatenation yields the smallest number.

Void PrintConcatenations(int number[])

{

String strNumber = new String[number.length];

For (int I = 0; I < number.length; i++)

strNumber[I] = number[i].ToString();

Array.Sort(strNumber, new NumericComparer());

For (int I = 0; I < number.length; i++)

Console.WriteLine(strNumber[I]);

}

Class NumericComparer implements IComparer<String> {

Public int Compare(String num1, String num2) {

String str1 = num1 + num2;

String str2 = num2 + num1;

Return str1.CompareTo(str2);

}

#codingexercise

If two English words have the same characters and the occurrence number of each character is also identical respectively, they are anagrams. The only difference between a pair of anagrams is the order of characters. For example, “silent” and “listen”, “evil” and “live” are two pairs of anagrams. Please implement a function to verify whether two words are a pair of anagrams.

bool IsAnagram(string one, string two)
{
if (one.length == two.length) return false;
var h = new HashTable();
for (int i =0; i<one.length;i++)
InsertHash(ref h, one[i]);
for (int i = 0; i < two.length; i++)
if (DecrementHash(ref h, two[i]) == false)
return false;
return h.IsEmpty();

}

Cluster computing

Saturday, August 8, 2015

No comments:

Post a Comment