Saturday, November 18, 2017

Unix shell command to summarize text:
Programs in UNIX are founded on a simple principle: do one thing and do it well. The idea is that complex tasks can be composed from granular, well-defined, easy-to-use building blocks. Tools that process text on UNIX systems are typically limited to searching, slicing and dicing, find-and-replace, or differencing commands. They scan the text and perform their operations in a stream-like manner, so that the output of one processor can become the input of another. They do not modify the original text; they merely produce an output text or stream.
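The stream model above can be illustrated with a toy sketch in which each "tool" consumes lines and yields lines, so tools compose like a shell pipeline (the `grep` and `upper` functions here are invented stand-ins, not real commands):

```python
# Tiny illustration of the stream model: each "tool" consumes an
# iterable of lines and yields lines, so tools compose like a pipeline.
def grep(pattern, lines):
    # keep only lines containing the pattern
    return (l for l in lines if pattern in l)

def upper(lines):
    # transform each line without touching the original input
    return (l.upper() for l in lines)

text = ["alpha one", "beta two", "alpha three"]
out = list(upper(grep("alpha", text)))
# out == ["ALPHA ONE", "ALPHA THREE"]
```

The original list is untouched; each stage merely produces a new stream, just as the text-processing commands described above do.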
While command-line tools are generally less popular than a user interface, say from a browser, there is still a lot of value in these commands because they are helpful to groups such as system administrators. Consequently, writing a program for text-processing jobs such as tokenizing, stemming, or summarizing text can be just as helpful. Imagine being able to read a text file with contents equivalent to a novel and print a summary by extracting the sentences that together capture most of it.
Programs on UNIX are generally written in C and are portable across almost all flavors of such systems. Text-processing commands can also be written in C. Higher-level languages such as Python offer more packages and helpers, which result in smaller, more succinct code, but writing such a program on UNIX became possible with Word2Net. A text summarization command for UNIX is made possible with the following steps:
  1. Convert the text to word vectors using the word2net command.
  2. Generate a k-means classification of the word vectors and find their centroids.
  3. Generate the summary using the centroids.
  4. This step is optional, but it helps to keep track of the positions where the centroids occur, since collocation data is used to generate the word vectors; the most contributing positions of the determined vector can then be used to select sentences for the summary. Previously the selection of sentences was heavier in logic and occurred in step 3, where the words closest to the centroid were used to determine the location of the sentences to be selected, but here we propose that the best occurrence of the centroid alone can be used to pick the sentences.
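Steps 1 through 3 can be sketched as follows. This is a minimal, hypothetical illustration: real word vectors would come from the word2net step, whereas here toy two-dimensional vectors stand in for them, and the k-means and sentence-selection code is a plain from-scratch version, not any particular library's implementation:

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20):
    # Plain k-means: assign each vector to its nearest centroid,
    # then recompute each centroid as the mean of its cluster.
    random.seed(0)
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k), key=lambda c: dist(centroids[c], v))
            clusters[j].append(v)
        centroids = [
            [sum(xs) / len(cl) for xs in zip(*cl)] if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def summarize(sentences, word_vecs, k=2):
    # Score each sentence by how close its words come to any centroid,
    # and return the best-scoring sentence as a one-line summary.
    centroids = kmeans(list(word_vecs.values()), k)
    def score(sent):
        vs = [word_vecs[w] for w in sent.lower().split() if w in word_vecs]
        if not vs:
            return float("inf")
        return min(dist(v, c) for v in vs for c in centroids)
    return min(sentences, key=score)

# Toy run with invented vectors; a real run would load word2net output.
vecs = {"cats": [0.9, 0.1], "dogs": [0.8, 0.2], "tax": [0.1, 0.9]}
summary = summarize(["cats chase dogs", "tax season ends"], vecs)
```

A full command would read sentences from standard input and emit several top-scoring sentences rather than one, but the shape of the computation is the same.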
Packaging the program involves conventional means of packaging shell scripts. A variety of utilities such as bpkg, jean, or sparrow can be used. We can also make it available to install from repositories, as we usually do on Ubuntu, by publishing the repository. These commands, however, can require a tremendous amount of disk space because they need to download a corpus of text data that is usually very large in itself and available separately. Alternatively, a processed word-vector file from a known corpus may be shipped with the script; while this also takes disk space, it bootstraps and deploys the command for immediate use.
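Loading such a shipped vector file is straightforward if we assume the common plain-text word2vec convention of one `word v1 v2 ... vn` entry per line; the sketch below makes that assumption and skips any malformed lines:

```python
# Sketch: load a pre-processed word-vector file shipped with the
# script, assuming the plain-text format "word v1 v2 ... vn" per line.
def load_vectors(lines):
    vecs = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

sample = ["king 0.1 0.2", "queen 0.1 0.3"]
vectors = load_vectors(sample)
# vectors["king"] == [0.1, 0.2]
```

In a deployed command the `lines` argument would simply be an open file handle over the bundled vector file.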
Conclusion: Writing a text processor, such as one for text summarization, is a viable option on UNIX-flavor systems.

#codingexercise
Check if two expressions are the same.
For example:  
3 - (2-1) and 3-2+1

The expressions differ only in the presence of brackets.
Solution: one approach is to evaluate the expressions using a stack. When we encounter a closing parenthesis, we pop all the elements and their operators up to and including the opening parenthesis, evaluate that sub-expression, and push the result back on the stack. We use a helper method Eval that evaluates bracket-free expressions by performing the operations in sequence.
bool AreEqual(List<char> exp1, List<char> exp2)
{
    var s1 = new Stack<char>();
    for (int i = 0; i < exp1.Count; i++)
    {
        if (exp1[i] == ')')
        {
            var ret = new List<char>();
            while (s1.Count > 0 && s1.Peek() != '(')
                ret.Add(s1.Pop());
            if (s1.Count == 0) throw new InvalidExpressionException();
            s1.Pop();       // discard the matching '('
            ret.Reverse();  // Pop() returned the tokens in reverse order
            // assumes intermediate results stay single digits
            s1.Push((char)('0' + Eval(ret)));
        }
        else
        {
            s1.Push(exp1[i]);
        }
    }
    var flat = new List<char>(s1);
    flat.Reverse();         // a stack enumerates top-first
    return Eval(flat) == Eval(exp2);
}

// Evaluates a bracket-free expression with single-digit operands
// strictly left to right, e.g. "3-2+1" -> 2.
int Eval(List<char> tokens)
{
    int result = tokens[0] - '0';
    for (int i = 1; i + 1 < tokens.Count; i += 2)
    {
        int operand = tokens[i + 1] - '0';
        result = (tokens[i] == '+') ? result + operand : result - operand;
    }
    return result;
}
