Thursday, August 6, 2015

#codingexercise
Please find the smallest k numbers (in value) out of n numbers. For example, if given an array
with eight numbers {4, 5, 1, 6, 2, 7, 3, 8}, please return the least four numbers 1, 2, 3, and 4.
The naive solution is to sort the n input numbers in increasing order and take the first k numbers as the least k. Since it needs to sort, its time complexity is O(n log n). Interviewers will probably ask you to explore more efficient solutions.
There is a way to do this in O(n) average time using logic similar to the partition step of quicksort (quickselect): keep partitioning until the pivot lands at index k-1, at which point the first k elements are the k smallest:
void getLeastNumbers_2(int[] input, int[] output) {
    int start = 0;
    int end = input.length - 1;
    int k = output.length;
    int index = partition(input, start, end);
    while (index != k - 1) {
        if (index > k - 1) {
            // pivot landed past position k-1: keep partitioning the left part
            end = index - 1;
        } else {
            // pivot landed before position k-1: keep partitioning the right part
            start = index + 1;
        }
        index = partition(input, start, end);
    }
    // the first k elements of input now hold the k smallest values (in arbitrary order)
    for (int i = 0; i < k; i++)
        output[i] = input[i];
}
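The partition routine is not shown above; the following is a minimal sketch, assuming a Lomuto-style scheme that pivots on the last element of the range and returns the pivot's final index:

    // Minimal sketch of the partition helper assumed above (Lomuto scheme,
    // pivot = last element of the range). Returns the pivot's final index.
    static int partition(int[] a, int start, int end) {
        int pivot = a[end];
        int store = start;
        for (int i = start; i < end; i++) {
            if (a[i] < pivot) {
                int tmp = a[i]; a[i] = a[store]; a[store] = tmp;
                store++;
            }
        }
        int tmp = a[store]; a[store] = a[end]; a[end] = tmp;
        return store;
    }

Note that this approach rearranges the input array in place, and the k smallest values come out in no particular order.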

Today we start discussing "Text Summarization" by Eduard Hovy. It provides an overview of the principal approaches to summarization, describes some of the summarization systems, and explains how to evaluate summaries.
The author begins by recalling that a summary is defined as a text derived from another text that conveys a significant portion of the information in the original while being no more than half its length.
That said, there can be different types of summaries:
1) indicative summaries - that give an idea of what the text is about without conveying the content itself
2) informative summaries - that convey some of the content
3) extracts - that are formed by reusing some of the words or sentences from the original text verbatim
4) abstracts - that are formed by regenerating the extracted content
Automated text summarization generally goes through three stages. The first stage is topic identification, which by itself yields the simplest kind of summary.
The second stage is interpretation and the last stage is summary generation.

Interpretation is generally hard because it is subjective. Given the same input, the output may vary from human to human because of differences in how concepts are fused, evaluated, or otherwise processed. This stage generally occurs after topic identification. Since the result is something new and requires domain knowledge, few systems perform interpretation.
The result of the interpretation stage is typically an incoherent representation with dangling references, omitted discourse linkages, and repeated or omitted material. Systems therefore include a summary generation stage to produce human-readable text.

Stage 1: Topic identification. To perform this stage, all systems employ several independent modules. Each module assigns a score to each unit of input (word, sentence, or longer passage); a combination module then merges these scores to assign a single integrated score to each unit. The system finally selects the n highest-scoring units, according to the length desired by the user.
Note that the sentence is not always a good unit. Positional criteria that exploit emphasis, such as headings, titles, and first paragraphs, are often helpful, and several variations of the positional method exist in the literature.
Moreover, collocation information is even more helpful. Collocation and co-emphasis can be determined with statistical methods.
Combinations of these criteria can improve the processing in this stage, as the sketch below illustrates.
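As a purely illustrative sketch (not code from the article), the score-and-combine skeleton of Stage 1 might look like this, with a hypothetical position module, a hypothetical word-frequency module, and an arbitrary 0.5/0.5 weighted combination:

    import java.util.*;

    // Illustrative sketch of Stage 1 (topic identification): independent scoring
    // modules rate each unit (here, a sentence), a combination module merges the
    // scores, and the n highest-scoring units are selected. The scoring functions
    // and the 0.5/0.5 weights are assumptions, not taken from the article.
    class TopicIdentificationSketch {

        // Position module: earlier sentences score higher (a positional criterion).
        static double positionScore(int sentenceIndex, int totalSentences) {
            return 1.0 - (double) sentenceIndex / totalSentences;
        }

        // Frequency module: sentences whose words are frequent in the document score higher.
        static double frequencyScore(String sentence, Map<String, Integer> wordCounts, int totalWords) {
            String[] words = sentence.toLowerCase().split("\\W+");
            double sum = 0;
            for (String w : words)
                sum += wordCounts.getOrDefault(w, 0);
            return words.length == 0 ? 0 : sum / (words.length * (double) totalWords);
        }

        static List<String> selectTopN(List<String> sentences, int n) {
            // Document-level word counts for the frequency module.
            Map<String, Integer> counts = new HashMap<>();
            int totalWords = 0;
            for (String s : sentences)
                for (String w : s.toLowerCase().split("\\W+")) {
                    counts.merge(w, 1, Integer::sum);
                    totalWords++;
                }

            // Combination module: weighted sum of the individual module scores.
            double[] combined = new double[sentences.size()];
            for (int i = 0; i < sentences.size(); i++)
                combined[i] = 0.5 * positionScore(i, sentences.size())
                            + 0.5 * frequencyScore(sentences.get(i), counts, totalWords);

            // Select the n highest-scoring sentences, then restore document order.
            Integer[] order = new Integer[sentences.size()];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(combined[b], combined[a]));
            List<Integer> chosen = new ArrayList<>(Arrays.asList(order).subList(0, Math.min(n, order.length)));
            Collections.sort(chosen);

            List<String> summary = new ArrayList<>();
            for (int idx : chosen) summary.add(sentences.get(idx));
            return summary;
        }
    }

Real systems would plug in more sophisticated modules (cue phrases, topic signatures, and so on), but the score-and-combine skeleton stays the same.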

Stage 2: Interpretation or topic fusion.
The topics identified by the system are fused. No system can perform interpretation without prior knowledge about the domain; operators that condense knowledge fall into this category.
Topic signatures are often alluded to in this stage and are mentioned as helpful for both identification and interpretation, but applying them has proved tricky. Domain knowledge acquisition remains the biggest challenge.
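As an illustrative sketch of topic fusion with a topic signature (the signature and threshold below are hand-built assumptions, not from the article), related terms found in the text can be fused into a single higher-level concept:

    import java.util.*;

    // Illustrative sketch of Stage 2 (interpretation / topic fusion): a hand-built
    // topic signature maps a head concept to terms that signal it; if enough of the
    // signal terms appear among the identified topics, they are fused into the head.
    class TopicFusionSketch {
        static Optional<String> fuse(Set<String> identifiedTopics,
                                     String head, Set<String> signatureTerms) {
            long hits = signatureTerms.stream().filter(identifiedTopics::contains).count();
            // Arbitrary threshold: require at least half of the signature terms.
            return hits * 2 >= signatureTerms.size() ? Optional.of(head) : Optional.empty();
        }

        public static void main(String[] args) {
            Set<String> topics = new HashSet<>(Arrays.asList("menu", "waiter", "order", "pay"));
            Set<String> restaurantSignature = new HashSet<>(Arrays.asList("menu", "waiter", "order", "eat", "pay"));
            System.out.println(fuse(topics, "restaurant-visit", restaurantSignature));
            // prints Optional[restaurant-visit]
        }
    }

In practice the signatures are learned from domain corpora rather than written by hand, which is exactly where the knowledge acquisition bottleneck shows up.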

Stage 3: Summary generation. Note that extract summaries require no generation at all. A process of smoothing is used to identify and repair typical dysfluencies, and a summary revision program applies multiple transformations to smooth out incoherencies. Text compression is another promising approach; work based on the EM algorithm, as well as cut-and-paste methods combined with a bigram language model, has often been mentioned as helpful.
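As an illustrative sketch (not the article's method), a bigram language model can score candidate compressed or cut-and-paste sentences so that the most fluent candidate is kept; the tiny counts and add-one smoothing here are assumptions for the sake of the example:

    import java.util.*;

    // Illustrative sketch for Stage 3 (generation): score a candidate sentence with
    // a bigram language model; higher log-probability suggests a smoother sentence.
    class BigramScoringSketch {
        static double logProb(String sentence, Map<String, Map<String, Integer>> bigramCounts,
                              Map<String, Integer> unigramCounts, int vocabSize) {
            String[] w = sentence.toLowerCase().split("\\s+");
            double lp = 0;
            for (int i = 1; i < w.length; i++) {
                int bi = bigramCounts.getOrDefault(w[i - 1], Collections.emptyMap()).getOrDefault(w[i], 0);
                int uni = unigramCounts.getOrDefault(w[i - 1], 0);
                // add-one (Laplace) smoothing so unseen bigrams do not zero out the score
                lp += Math.log((bi + 1.0) / (uni + vocabSize));
            }
            return lp;
        }
    }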

These are the three stages of text summarization that this article delves into.
