Tuesday, August 11, 2015

#codingexercise

Question: Define a queue in which we can get its maximum number with a function max. In this queue, the time complexity of max, push_back, and pop_front is O(1) (amortized for push_back).

Answer:

// Deque is assumed to support push_back, pop_back, pop_front, front, back
// and empty, as in the C++ std::deque interface.
public class QueueWithMax<T> where T : IComparable<T>
{
    private class InternalData
    {
        public T number { get; set; }
        public int index { get; set; }
    }

    private Deque<InternalData> data = new Deque<InternalData>();
    private Deque<InternalData> maximums = new Deque<InternalData>();
    private int currentIndex;

    public QueueWithMax()
    {
        currentIndex = 0;
    }

    public void push_back(T number)
    {
        // Drop candidates from the back that the new element dominates.
        // Each element is pushed and popped at most once overall, which is
        // why push_back is O(1) amortized.
        while (!maximums.empty() && number.CompareTo(maximums.back().number) >= 0)
            maximums.pop_back();

        var idata = new InternalData { number = number, index = currentIndex };
        data.push_back(idata);
        maximums.push_back(idata);
        ++currentIndex;
    }

    public T pop_front()
    {
        if (data.empty())
            throw new InvalidOperationException("queue is empty");

        // If the departing element is the current maximum, retire it too.
        if (maximums.front().index == data.front().index)
            maximums.pop_front();

        return data.pop_front().number;
    }

    public T max()
    {
        if (maximums.empty())
            throw new InvalidOperationException("queue is empty");

        return maximums.front().number;
    }
}
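The same idea reads compactly in Python with collections.deque; this is a sketch, and the class name and test values are illustrative:

```python
from collections import deque

class QueueWithMax:
    """Queue supporting push_back, pop_front and max in O(1) amortized time."""

    def __init__(self):
        self._data = deque()      # (value, index) pairs in arrival order
        self._maximums = deque()  # decreasing sequence of maximum candidates
        self._index = 0

    def push_back(self, value):
        # Drop candidates that the new element dominates; they can never
        # be the maximum again while the new element is in the queue.
        while self._maximums and self._maximums[-1][0] <= value:
            self._maximums.pop()
        self._data.append((value, self._index))
        self._maximums.append((value, self._index))
        self._index += 1

    def pop_front(self):
        if not self._data:
            raise IndexError("queue is empty")
        value, index = self._data.popleft()
        # Retire the maximum if the departing element is it.
        if self._maximums[0][1] == index:
            self._maximums.popleft()
        return value

    def max(self):
        if not self._maximums:
            raise IndexError("queue is empty")
        return self._maximums[0][0]
```

Each element enters and leaves each deque at most once, which is where the amortized O(1) bound for push_back comes from.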

Monday, August 10, 2015

#codingexercise

Question: If a number's only prime factors are 2, 3, and 5, it is an ugly number. For example, 6 and 10 are ugly numbers, but 14 is not because it has a factor of 7. Usually 1 is considered to be the first ugly number. What is the kth ugly number, and can we find it efficiently?
Solution:
The straightforward solution tests each candidate by repeatedly dividing out 2, 3, and 5; if the result is 1 the number is ugly, otherwise it is not. However, this is inefficient since it performs the same divisions on many non-ugly numbers. Instead we can store previously generated ugly numbers and build each new one from their multiples.
int GetUglyNumber(int index)
{
if (index <= 0)
return 0;
var uglyNums = new int[index];
uglyNums[0] = 1;
int nextUglyIndex = 1;
int index2 = 0;
int index3 = 0;
int index5 = 0;
while (nextUglyIndex < index)
{
  int min = Math.Min(uglyNums[index2]*2, uglyNums[index3]*3);
  min = Math.Min(min, uglyNums[index5]*5);
  uglyNums[nextUglyIndex] = min;
  while (uglyNums[index2] * 2 <= uglyNums[nextUglyIndex])
      ++index2;
  while (uglyNums[index3] * 3 <= uglyNums[nextUglyIndex])
      ++index3;
  while (uglyNums[index5] * 5 <= uglyNums[nextUglyIndex])
      ++index5;
++nextUglyIndex;

}
return uglyNums[nextUglyIndex-1];
}
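A Python transcription of the same merge idea, useful for quick checking (names here are illustrative):

```python
def get_ugly_number(k):
    """Return the k-th ugly number (1-indexed), where 1 is the first."""
    if k <= 0:
        return 0
    ugly = [1]
    i2 = i3 = i5 = 0  # next candidate positions for factors 2, 3 and 5
    while len(ugly) < k:
        nxt = min(ugly[i2] * 2, ugly[i3] * 3, ugly[i5] * 5)
        ugly.append(nxt)
        # Advance every pointer whose candidate is no longer larger,
        # which also skips duplicates like 2*3 and 3*2.
        if ugly[i2] * 2 <= nxt:
            i2 += 1
        if ugly[i3] * 3 <= nxt:
            i3 += 1
        if ugly[i5] * 5 <= nxt:
            i5 += 1
    return ugly[-1]
```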

We briefly discuss SUMMONS, the first example of a multi-document summarization system; it illustrates one of the stages of text summarization, topic identification and fusion. It tackles news events about a single topic and produces a briefing that merges relevant information about each event and about how reports by different news agencies have evolved over time. Rather than working with raw text, SUMMONS reads a database previously built by a template-based message understanding system. There are therefore two stages: one takes full text as input, fills template slots, and synthesizes a summary; the other takes summary operators and a set of rules and smooths out the summary with connective phrases.
#codingexercise
public class Compare : IComparer<int>
{
    // Orders integers by their last digit; Compare must return an int,
    // not a bool.
    public int Compare(int i, int j)
    {
        return (i % 10).CompareTo(j % 10);
    }
}
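This comparer orders integers by their last digit. The same ordering can be sketched in Python with a sort key (the sample numbers are invented):

```python
# Sort integers by their last digit rather than by value.
nums = [23, 14, 7, 40, 35]
ordered = sorted(nums, key=lambda n: n % 10)
# Last digits: 40 -> 0, 23 -> 3, 14 -> 4, 35 -> 5, 7 -> 7
```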

Saturday, August 8, 2015


Today we conclude reading the Text Summarization with multi-document summarization and evaluating summaries:

When summarizing multiple documents, one has to identify and locate thematic overlaps. One also has to decide what to include of the remainder to overcome potential inconsistencies between documents, and when necessary to arrange events from various sources along a single timeline.

Assuming that all input documents are parsed into templates (whose standardization makes comparison easier), systems like SUMMONS cluster the templates according to their contents and then apply rules to extract items of major import.

To determine what additional material should be included, first identify the units most relevant to the user’s query and then estimate the marginal relevance of all remaining units.

Lastly, some domains lend themselves to simple rules that can achieve near-perfect results. For example, news articles covering the same event can be summarized by picking the most recent one in the timeline and taking its first two or three paragraphs.

How can you evaluate the quality of a summary? Evaluation can be both intrinsic and extrinsic: the former determines output quality, while the latter measures how much the summary assists the user in performing a task. Intrinsic methods use sentence or phrase recall and precision. A second intrinsic method is to have human evaluators judge readability, informativeness, fluency, and coverage. Extrinsic evaluation is easy to motivate. IR methods produce the best query-based extracts, and much of the scoring process reduces to IR-like recall and precision measures against human extracts. That said, the biggest challenge here is that summaries receive different scores with different measures.

There are two basic measures that capture the extent to which a summary meets the requirements of being smaller than the input text and containing the most important information of the original. These are:

1) Compression ratio: CR = (length S) / (length T)

2) Retention ratio: RR = (info in S) / (info in T), where T is the text and S is the summary.

The latter can be judged in four different ways:

1) The expert game: ask experts to underline or extract the most interesting or informative fragments of the text.

2) The classification game: this is an extrinsic measure.

3) The Shannon game: an information-theoretic method that relies on the probability p of the reader guessing the message; the measure is -p log p.

4) The question game: this measure approximates the information content of S by how well S allows readers to answer questions drawn up about T.
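As a toy illustration (the texts and the guess probability below are invented), both the compression ratio and the Shannon-style score are one-liners:

```python
import math

text = "The quick brown fox jumps over the lazy dog near the river bank."
summary = "A fox jumps over a dog."

# Compression ratio CR = length(S) / length(T), measured in words here.
cr = len(summary.split()) / len(text.split())

# Shannon-game style score for a guess probability p: -p * log2(p).
p = 0.25
shannon = -p * math.log2(p)
```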

 #codingexercise



Big numbers can be formed if numbers in an array are concatenated together. How do you print the minimum concatenated number of a given array? For example, if the input array is {3, 32, 321}, there are six concatenated numbers and the minimum one is 321323.

There are n! permutations for an array of size n, so the brute-force approach is O(n!).

The key observation is that a plain numeric sort doesn't work, but we can define a special comparison rule: number a precedes b if the concatenation ab is smaller than ba. Sorting with this rule and concatenating the result yields the smallest number.

void PrintMinConcatenation(int[] numbers)
{
    var strNumbers = new string[numbers.Length];
    for (int i = 0; i < numbers.Length; i++)
        strNumbers[i] = numbers[i].ToString();

    Array.Sort(strNumbers, new NumericComparer());

    // The sorted pieces, concatenated in order, form the minimum number.
    Console.WriteLine(string.Concat(strNumbers));
}

class NumericComparer : IComparer<string>
{
    public int Compare(string num1, string num2)
    {
        // num1 precedes num2 when num1+num2 is lexicographically smaller
        // than num2+num1.
        return string.CompareOrdinal(num1 + num2, num2 + num1);
    }
}
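A Python sketch of the same comparator-based sort; functools.cmp_to_key adapts the pairwise rule into a sort key:

```python
from functools import cmp_to_key

def min_concatenation(nums):
    """Concatenate the numbers in the order that minimizes the result."""
    strs = [str(n) for n in nums]
    # a precedes b when a+b < b+a; the lambda returns -1/0/1 accordingly.
    strs.sort(key=cmp_to_key(lambda a, b: (a + b > b + a) - (a + b < b + a)))
    return "".join(strs)
```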
#codingexercise
If two English words have the same characters and the occurrence number of each character is also identical respectively, they are anagrams. The only difference between a pair of anagrams is the order of characters. For example, “silent” and “listen”, “evil” and “live” are two pairs of anagrams. Please implement a function to verify whether two words are a pair of anagrams.
// InsertHash, DecrementHash and IsEmpty are assumed helper routines over a
// character-count table.
bool IsAnagram(string one, string two)
{
    if (one.Length != two.Length) return false;
    var h = new Hashtable();
    for (int i = 0; i < one.Length; i++)
        InsertHash(ref h, one[i]);                  // count up characters of one
    for (int i = 0; i < two.Length; i++)
        if (DecrementHash(ref h, two[i]) == false)  // false when a count underflows
            return false;
    return h.IsEmpty();
}
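A Python sketch of the same counting idea using collections.Counter:

```python
from collections import Counter

def is_anagram(one, two):
    """Two words are anagrams iff they have identical character counts."""
    return len(one) == len(two) and Counter(one) == Counter(two)
```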


#codingexercise
Given two strings, how do you delete characters contained in the second string from the first string? For example, if all characters in the string “aeiou” are deleted from the string “We are students.”, the result is “W r stdnts.”.
Solution: 
string DeleteCharacters(string input, string characters)
{
    // Record the characters to delete for O(1) lookup.
    var h = new Hashtable();
    for (int i = 0; i < characters.Length; i++)
        h[characters[i]] = 1;

    string ret = string.Empty;
    for (int r = 0; r < input.Length; r++)
        if (h[input[r]] == null)   // keep characters not in the table
            ret += input[r];
    return ret;
}
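A Python sketch of the same approach, using a set for the O(1) membership test:

```python
def delete_characters(text, chars):
    """Remove every character of `chars` from `text` in one pass."""
    banned = set(chars)  # O(1) membership tests
    return "".join(c for c in text if c not in banned)
```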

Next we continue our discussion of Text Summarization by Hovy. We now look at the other two stages, interpretation and generation.

Topic fusion or interpretation, as mentioned earlier, is what sets summarization systems apart from extract-type systems. During summarization, the topics identified as important are fused and represented in new terms, even using concepts or words not found in the original text. No system can perform interpretation without prior knowledge about the domain. At first glance, the template representation used in information extraction looks promising, but the difficulty of building and filling such structures makes large-scale summarization impractical at present.
Hahn and Reimer 1999 develop operators that condense knowledge structures in a terminological logic through conceptual abstraction. To date, however, no parser has been built to produce these knowledge structures.
Interpretation remains blocked by problems of domain knowledge acquisition.
The third major stage in summarization is generation. When the summary content has been created through abstracting and/or information extraction, it exists within the computer as internal notations and thus requires the techniques of natural language generation.

#codingexercise
How to generate Sudoku puzzles for a puzzle book:  http://1drv.ms/1K2Yeao

Friday, August 7, 2015

Today we continue our discussion  on the Text Summarization by Eduard Hovy.
Earlier we mentioned topic identification as stage 1 of the processing. One way to evaluate this stage is to use metrics like precision and recall. Taking the ratio of sentences correctly identified by both machine and human to the total of those identified correctly and incorrectly gives precision; the ratio to the total of those correctly identified and those missed gives recall. Together they show how effective the topic identification has been. Precision conveys accuracy, because higher precision means fewer incorrectly chosen text units. Recall conveys information coverage, because the fewer the missed text units, the more complete the topic identification.
Also, when we discussed positional criteria for identifying sentences, we should include a genre- and domain-oriented policy that determines the best positions. Such a ranked list of sentence positions produces the highest yield of extracts, and it is possible to automate both the policy creation and the extraction.
Let us look at some of the criteria for these policies:
Cue phrase indicator criteria: certain canned words and phrases signal importance. For example, a sentence beginning with "in this paper we show" suggests a high correlation with the topic of the text. Teufel and Moens 1997 report 54 percent joint recall and precision using a manually built list of 1423 cue phrases.
Word and phrase frequency criteria: Luhn 1959 used Zipf's law of word distribution to develop the following extraction criterion : if a text contains some words unusually frequently, then the sentences containing these words are important.
Other metrics involve term frequency-inverse document frequency (tf-idf), information gain, and so on. Witbrock and Mittal 1999 compute a statistical model that describes the likelihood that each individual word in the text will appear in the summary, in the context of certain features (part-of-speech tag, word length, neighboring words, average sentence length, etc.).
Query and title overlap criteria: a simple but useful method is to score each sentence by the number of desirable words it contains, where desirable words are those contained in the title or in the user's query.
Cohesive or lexical connectedness criteria: words can be connected in various ways, including repetition, coreference, synonymy, and semantic association. Sentences and paragraphs can be scored based on the degree of connectedness of their words. Mani and Bloedorn 1997 represent the text as a graph in which words are nodes and edges represent adjacency, coreference, and lexical similarity.
Discourse structure criteria: Certain lengthy text such as discourses have an underlying structure that can be used to calculate the discourse centrality of text units.
Combination of various module scores: researchers have found that no single scoring method performs as well as a combination, and there is no single best strategy. Using a Bayesian classifier, Aone 1999 showed that even within a single genre, different newspapers require different features to achieve the same performance. Using "Summarist", Lin compared eighteen different features, a naive combination of them, and an optimal combination found with a machine learning algorithm. While the learned combination was the top scorer, the second best was query term overlap, and the third best scores were achieved by word frequency, the lead method, and the naive combination function. The evaluation of different approaches showed that no 5 percent summary achieved a combined precision and recall (F-score) of over 0.25.
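As a toy sketch of the scoring-and-combination idea: each module scores sentences independently, and a weighted sum combines the module scores into one integrated score per sentence. The features, weights, and sentences here are invented for illustration only.

```python
def score_sentences(sentences, title_words, weights=(0.6, 0.4)):
    """Score each sentence by word frequency and title overlap, then combine."""
    # Module 1: frequency of each word across the whole text.
    freq = {}
    for s in sentences:
        for w in s.lower().split():
            freq[w] = freq.get(w, 0) + 1

    scores = []
    for s in sentences:
        words = s.lower().split()
        # Average corpus frequency of the sentence's words.
        freq_score = sum(freq[w] for w in words) / len(words)
        # Fraction of the sentence's words that appear in the title.
        title_score = sum(w in title_words for w in words) / len(words)
        # Combination module: a weighted sum of the module scores.
        scores.append(weights[0] * freq_score + weights[1] * title_score)
    return scores
```

A real system would select the n highest-scoring sentences according to the desired summary length.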
#codingexercise
How many times does the digit 1 occur in the integers from 1 to n ?
 For example, if n  is 12, there are four numbers from 1 to 12 containing the digit 1, which are 1, 10, 11, and 12, and the digit 1 occurs five times.

We use a recursive divide-and-conquer strategy to count the ones: split the range into two parts, one without the most significant digit and one with it. The number of ones for a fixed number of digits can be counted as permutations, by fixing one digit position as 1 and varying the other digits from 0 to 9.
int NumberOf1(string N)
{
    if (string.IsNullOrEmpty(N)) return 0;

    int first = N[0] - '0';
    int length = N.Length;

    if (length == 1 && first == 0) return 0;
    if (length == 1 && first > 0) return 1;

    // The range 1..N is split into two parts: lo (the numbers without the
    // most significant digit) and hi (the remaining numbers, which all
    // carry that digit).

    // numFirstDigit: ones contributed by the most significant position in hi.
    int numFirstDigit = 0;
    if (first > 1)
        numFirstDigit = PowerBase10(length - 1);
    else if (first == 1)
        numFirstDigit = int.Parse(N.Substring(1)) + 1;

    // numOtherDigits: ones in the remaining positions of hi, counted as
    // permutations with one position fixed at 1.
    int numOtherDigits = first * (length - 1) * PowerBase10(length - 2);

    // numRecursive: ones in lo, counted recursively.
    int numRecursive = NumberOf1(N.Substring(1));

    return numFirstDigit + numOtherDigits + numRecursive;
}

PowerBase10 returns 10 raised to the given power; it simply multiplies by 10 the given number of times.

int PowerBase10(int n)
{
    int result = 1;
    for (int i = 0; i < n; i++)
    {
        result *= 10;
    }
    return result;
}
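A Python transcription of the same recursion, handy for cross-checking against a brute-force count (illustrative):

```python
def count_ones(n):
    """Count occurrences of the digit 1 in all integers from 1 to n."""
    if n <= 0:
        return 0
    s = str(n)
    first, length = int(s[0]), len(s)
    if length == 1:
        return 1 if first >= 1 else 0

    # Ones contributed by the most significant position of the high range.
    if first > 1:
        num_first_digit = 10 ** (length - 1)
    else:  # first == 1: only numbers 10..n (e.g. 10..12) carry a leading 1
        num_first_digit = int(s[1:]) + 1

    # Ones in the other positions of the high range: each of the remaining
    # (length - 1) positions is 1 in 10**(length - 2) of its permutations,
    # repeated for each of the `first` leading blocks.
    num_other_digits = first * (length - 1) * 10 ** (length - 2)

    # Ones in the low range (the number with its leading digit removed).
    return num_first_digit + num_other_digits + count_ones(int(s[1:]))
```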

Thursday, August 6, 2015

#codingexercise
Please find the smallest k numbers (in value) out of n numbers. For example, if given an array
with eight numbers {4, 5, 1, 6, 2, 7, 3, 8}, please return the least four numbers 1, 2, 3, and 4.
The naive solution is to sort the n input numbers in increasing order and take the first k as the least k numbers. Since it needs to sort, its time complexity is O(n log n). Interviewers will probably ask you to explore more efficient solutions.
There is a way to do this with O(n) time complexity which involves similar logic as in the partition method:
// Partition is the standard quicksort partition routine, assumed available;
// it returns the final index of the pivot within [start, end].
void GetLeastNumbers_2(int[] input, int[] output)
{
    int start = 0;
    int end = input.Length - 1;
    int k = output.Length;

    int index = Partition(input, start, end);
    while (index != k - 1)
    {
        if (index > k - 1)
        {
            end = index - 1;
        }
        else
        {
            start = index + 1;
        }
        index = Partition(input, start, end);
    }

    // The first k slots now hold the k smallest values (in arbitrary order).
    for (int i = 0; i < k; i++)
        output[i] = input[i];
}
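A Python sketch of the same partition-based selection (note that it reorders the input; heapq.nsmallest is a simpler library alternative when mutation is unwanted):

```python
def partition(arr, start, end):
    """Lomuto partition around arr[end]; returns the pivot's final index."""
    pivot = arr[end]
    i = start
    for j in range(start, end):
        if arr[j] < pivot:
            arr[i], arr[j] = arr[j], arr[i]
            i += 1
    arr[i], arr[end] = arr[end], arr[i]
    return i

def least_k(nums, k):
    """Rearrange nums so its first k slots hold the k smallest values.

    Average time is O(n), since each partition narrows the search range.
    """
    if k <= 0 or k > len(nums):
        return []
    start, end = 0, len(nums) - 1
    index = partition(nums, start, end)
    while index != k - 1:
        if index > k - 1:
            end = index - 1
        else:
            start = index + 1
        index = partition(nums, start, end)
    return nums[:k]
```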

Today we start discussing  "Text Summarization" by Eduard Hovy. It provides an overview of the principal approaches in summarization, describes some of the summarization systems and how to evaluate the summaries.
The author begins by first reminding us that a summary is defined as text derived from another text, containing a significant portion of the information in the original while not exceeding half its length.
That said there can be different types of summaries:
1) indicative summaries - that provide an idea of what the text is about without giving any content
2) informative summaries - that provide some of the content
3) Extracts that are formed by reusing some of the words or sentences from the original text verbatim.
4) Abstracts that are formed by regenerating the extracted content.
Automated text summarization generally goes through three stages. The first stage is topic identification, which by itself yields the simplest type of summary.
The second stage is interpretation and the last stage is summary generation.

Interpretation is generally hard because it is subjective: given one input, the output may vary from human to human, differing because of fusion of concepts, evaluation, or other processing. This stage generally occurs after topic identification. Since the result is something new and requires domain knowledge, few systems perform interpretation.
The result of the interpretation stage is a set of incoherent representations with dangling references, omitted discourse linkages, and repeated or omitted material. Systems therefore include a stage of summary generation to produce human-readable text.

Stage 1: To perform topic identification, all systems employ several independent modules. Each module assigns a score to each unit of input (word, sentence, or longer passage); a combination module then merges the scores to assign a single integrated score to each unit. The system then selects the n highest-scoring units, according to the length desired by the user.
Note that sentence boundary is not always a good unit.  But positional criteria for emphasis such as headings, titles, first paragraphs is often helpful. Some variations of the positional method exist in the literature. 
Moreover, collocation is even more helpful; collocation and co-emphasis can be determined with statistical methods.
Combinations can exist to improve the processing in this stage. 

Stage 2: Interpretation or topic fusion 
The topics identified by the system are fused. No system can perform interpretation without prior knowledge about the domain. Operators that condense knowledge fall in this category. 
 Topic signatures are often alluded to in the interpretations and mentioned as helpful in both identification and interpretation but their application has been tricky. Domain knowledge acquisition is the biggest challenge. 

Stage 3: Summary generation. Note that extract summaries require no generation. A process of smoothing is used to identify and repair typical disfluencies; a summary revision program applies multiple transformations to smooth out the incoherencies. Text compression is another promising approach. Some approaches, such as those based on the EM algorithm, have gained recognition. Cut-and-paste methods together with a bigram language model have often been mentioned as helpful.

These are the three stages to summarize the text that this article delves into.