Saturday, November 18, 2017

unix-shell command to summarize text:  
Programs in UNIX are founded on a principle – to do one thing and do it well. The idea is that complex tasks can be easily composed from granular, well-defined and easy-to-use building blocks. Tools that process text on UNIX systems are limited to search, slice-and-dice, find-and-replace or differencing commands. These scan the text and perform their operations in a stream-like manner, so that the output of one processor can become the input of another. They do not modify the original text and merely produce an output text or stream.
While command-line usage is generally less popular than a user interface, say in a browser, there is still a lot of value in commands because they are helpful to groups such as system administrators. Consequently, writing a program for text-processing jobs such as tokenizing, stemming or summarizing text can be just as helpful. Imagine being able to read a text file with contents equivalent to a novel and print a summary by extracting the sentences that together convey the most of it.
Programs in UNIX are generally written in the C language and are portable across almost all flavors of such systems. Text-processing commands can also be written in C. There are more packages and helpers available in higher-level languages such as Python, which result in smaller or more succinct code, but Word2Net makes it possible to write such a program on Unix. A text summarization command for Unix is made possible with the following steps:
  1. Convert text to word vectors using the word2net command
  2. Generate a k-means classification of the word vectors and find their centroids
  3. Generate the summary using the centroids
  4. Optionally, keep track of the positions where the centroid words occur. Since collocation data is used to generate the word vectors, the positions that contribute most to a determined vector can be used to select sentences for the summary. Previously, sentence selection was heavier in logic and occurred in step 3, where the words closest to a centroid determined the locations of the sentences to select; here we propose that the best occurrence of the centroid alone can be used to pick the sentences.
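The centroid-based selection in the steps above can be sketched as follows. This is a minimal sketch, assuming the word vectors (which word2net would produce) are already available as a dictionary; the vectors below are toy values, and a single mean centroid stands in for the full k-means step:

```python
def centroid(vectors):
    """Mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def summarize(sentences, word_vectors, count=1):
    """Rank sentences by how close their closest word lies to the corpus centroid."""
    all_vecs = [word_vectors[w] for s in sentences for w in s.split() if w in word_vectors]
    c = centroid(all_vecs)
    def score(s):
        vecs = [word_vectors[w] for w in s.split() if w in word_vectors]
        return min(distance(v, c) for v in vecs) if vecs else float('inf')
    return sorted(sentences, key=score)[:count]
```

With real k-means, the same scoring would be run once per cluster centroid so that the summary covers each topic in the text.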
Packaging the program involves conventional means for packaging shell scripts. A variety of utilities such as bpkg, jean or sparrow can be used. We can also make it available to install from repositories, as we usually do on Ubuntu, by publishing the repository. These commands, however, require a tremendous amount of disk space because they need to download a corpus of text data that is usually very large and available separately. A processed word-vector file from a known corpus may alternatively be shipped with the script; while this still takes disk space, it bootstraps and deploys the command for immediate use.
Conclusion: Writing a text processor, such as one for text summarization, is a viable option on UNIX-flavor systems.

#codingexercise
Check if two expressions are the same:
For example:
3 - (2-1) and 3-2+1

The expressions differ only in the presence of brackets.
Solution: One approach is to evaluate each expression using a stack. When we encounter a closing parenthesis, we pop all elements and their operators up to the matching opening parenthesis, evaluate them, and push the result back on the stack. We use a helper method eval that can evaluate bracket-free expressions by performing the operations in sequence.
bool AreEqual(List<char> exp1, List<char> exp2)
{
    var s1 = new Stack<char>();
    for (int i = 0; i < exp1.Count; i++)
    {
        if (exp1[i] == ')')
        {
            var ret = new List<char>();
            while (s1.Count > 0 && s1.Peek() != '(')
                ret.Add(s1.Pop());
            if (s1.Count == 0) throw new InvalidExpressionException();
            s1.Pop(); // discard the matching '('
            ret.Reverse(); // items were popped in reverse order
            s1.Push(eval(ret)); // sketch: eval returns the bracket-free result as a token
        }
        else
        {
            s1.Push(exp1[i]);
        }
    }
    // a Stack enumerates top-first, so reverse to recover the original order
    return eval(s1.Reverse().ToList()) == eval(exp2);
}
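The same stack approach can be sketched in Python; the helper `eval_flat` plays the role of the eval method described above, handling only `+` and `-` left to right:

```python
def eval_flat(tokens):
    """Evaluate a bracket-free token list (digits/ints and +,- operators) left to right."""
    result = int(tokens[0])
    i = 1
    while i < len(tokens):
        op, val = tokens[i], int(tokens[i + 1])
        result = result + val if op == '+' else result - val
        i += 2
    return result

def evaluate(tokens):
    """Evaluate an expression with brackets using a stack of pending tokens."""
    stack = []
    for tok in tokens:
        if tok == ')':
            inner = []
            while stack and stack[-1] != '(':
                inner.append(stack.pop())
            stack.pop()          # discard the matching '('
            inner.reverse()      # items were popped in reverse order
            stack.append(eval_flat(inner))
        else:
            stack.append(tok)
    return eval_flat(stack)

def are_equal(exp1, exp2):
    """Two expressions are the same if they evaluate to the same value."""
    return evaluate(list(exp1)) == evaluate(list(exp2))
```

For example, `are_equal("3-(2-1)", "3-2+1")` holds because both sides evaluate to 2.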

Friday, November 17, 2017

We continue our discussion on determining a model to represent the system. A model articulates how a system behaves quantitatively. 
An inverse model is a mathematical model that fits experimental data. It aims to provide a best fit to the data. 
There are two ways by which we can select an appropriate model. The first is by observing trend lines that correspond to some well-known mathematical formula. The second is based on observing the underlying physical processes that contribute to the system. These physical interpretations contribute to the model parameters.
In order to fit the data points, a model may use least squares of errors. The errors, called residuals, may be either positive or negative, so their plain sum is an inaccurate measure. Instead, the squares of the errors can be minimized to give a better fit.
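The least-squares idea above can be shown with the closed-form fit of a straight line y = a*x + b, which minimizes the sum of squared residuals:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b over paired data points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope: covariance of x,y divided by variance of x
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx  # intercept follows from the means
    return a, b
```

For the exact data points (0,1), (1,3), (2,5), (3,7) this recovers a = 2 and b = 1 with zero residual.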
#codingexercise
Find the next greater number using the same digits as the given number. If no other number is possible, return the original.
This is an alternative to the technique discussed earlier.
int GetNextHigher(int n)
{
    var digits = Integer.ToDigits(n);
    for (int i = n + 1; i < INT_MAX; i++)
    {
        var newDigits = Integer.ToDigits(i);
        // the same multiset of digits means i is a rearrangement of n
        if (digits.ToHashTable().Equals(newDigits.ToHashTable()))
            return i;
    }
    return -1;
}
There are two things to observe here:
1) we don't need to compare the full hash table on each increment. With a slot array for digits 0 to 9, we can quickly discard candidates that use a digit not present in the original number.
2) the range from the number to its next higher arrangement is very small.
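The brute-force idea can be sketched in a few lines; comparing sorted digit strings stands in for the digit-count table, and the upper bound follows from the number's digit length:

```python
def next_higher_brute(n):
    """Roll forward from n + 1; a candidate qualifies when its digit counts match n's."""
    target = sorted(str(n))
    for m in range(n + 1, 10 ** len(str(n))):
        if sorted(str(m)) == target:
            return m
    return n  # no higher arrangement of the same digits exists
```

For example, `next_higher_brute(132)` scans forward to 213, while `next_higher_brute(21)` returns 21 because the digits are already in descending order.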

Thursday, November 16, 2017

We continue our discussion on modeling. A model articulates how a system behaves quantitatively. Models use numerical methods to examine complex situations and come up with predictions. Most common techniques involved for coming up with a model include statistical techniques, numerical methods, matrix factorization and optimizations.  
An inverse model is a mathematical model that fits experimental data. It aims to provide a best fit to the data. Values for the parameters are obtained from estimation techniques. It generally involves an iterative process to minimize the average difference. The quality of the inverse model is evaluated using well known mathematical techniques as well as intuition. 
The steps for inverse modeling of data include:
1) selecting an appropriate mathematical model, using, say, polynomial or other functions
2) defining an objective function that measures the agreement between the data and the model
3) adjusting model parameters to get a best fit, usually by minimizing the objective function
4) evaluating the goodness of fit to the data, allowing that it will not be perfect due to measurement noise
5) estimating the accuracy of the best-fit parameter values
6) determining whether a much better fit is possible, which might be necessary if the optimization is stuck in a local minimum
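Steps 2 and 3 above can be sketched with a sum-of-squares objective and a brute-force parameter search; the model, data and grids below are toy values chosen only to illustrate the shape of the procedure:

```python
from itertools import product

def sum_squared_error(params, xs, ys, model):
    """Objective function: disagreement between the data and the model."""
    return sum((y - model(x, *params)) ** 2 for x, y in zip(xs, ys))

def grid_fit(xs, ys, model, grids):
    """Step 3 by brute force: pick the parameter combination minimizing the objective.
    A real fit would use gradient or simplex minimization instead of a grid."""
    return min(product(*grids), key=lambda p: sum_squared_error(p, xs, ys, model))
```

A grid search like this also illustrates step 6: scanning the whole grid sidesteps local minima that an iterative minimizer could get stuck in.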

There are two ways by which we can select an appropriate model. The first is by observing trend lines that correspond to some well-known mathematical formula. The second is based on observing the underlying physical processes that contribute to the system. These physical interpretations contribute to the model parameters. In order to fit the data points, a model may use least squares of errors. The errors, called residuals, may be either positive or negative, so their plain sum is an inaccurate measure. Instead, the squares of the errors can be minimized to give a better fit.
#codingexercise
Find next greater number using the same digits as the given number. If no other number is possible return the original
int GetNextGreater(uint n)
{
    var digits = Integer.ToDigits(n);
    if (digits.IsEmpty()) return 0;
    // find, from the right, the first position whose digit exceeds its left neighbor
    int i = digits.Count - 1;
    for (; i > 0; i--)
    {
        if (digits[i] > digits[i - 1])
            break;
    }
    if (i == 0) return (int)n; // digits are in descending order; no greater number
    // find the smallest digit to the right of i-1 that is larger than digits[i-1]
    int min = i;
    for (int j = i + 1; j < digits.Count; j++)
    {
        if (digits[j] > digits[i - 1] && digits[j] < digits[min])
            min = j;
    }
    Swap(digits, min, i - 1);
    // sort the suffix ascending and stitch the number back together
    var suffix = digits.GetRange(i, digits.Count - i);
    suffix.Sort();
    return digits.GetRange(0, i).Concat(suffix).ToList().ToInteger(); // Concat keeps duplicate digits that Union would drop
}

There is an alternative to getting the number as above. It simply rolls the number forward until a candidate has the same count of each digit as the original.
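The pivot-swap-sort algorithm above (the classic next-permutation technique) can be sketched compactly on the digit string:

```python
def next_greater(n):
    """Next permutation of the digits of n; returns n if no greater arrangement exists."""
    d = list(str(n))
    # find, from the right, the first position whose digit exceeds its left neighbor
    i = len(d) - 1
    while i > 0 and d[i - 1] >= d[i]:
        i -= 1
    if i == 0:
        return n  # digits are in descending order
    # smallest digit right of the pivot that is larger than the pivot
    j = min((k for k in range(i, len(d)) if d[k] > d[i - 1]), key=lambda k: d[k])
    d[i - 1], d[j] = d[j], d[i - 1]
    d[i:] = sorted(d[i:])  # ascending suffix gives the smallest such number
    return int(''.join(d))
```

For example, 218765 becomes 251678: the pivot is the 1, the 5 is swapped in, and the suffix 8761 is sorted to 1678.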

Wednesday, November 15, 2017

We continue our discussion on modeling. A model articulates how a system behaves quantitatively. Models use numerical methods to examine complex situations and come up with predictions. Most common techniques involved for coming up with a model include statistical techniques, numerical methods, matrix factorization and optimizations.  
A forward model is a mathematical model that is detailed enough to include the desired level of real-world behaviour or features. It is used for simulating realistic experimental data which, under the right constraints, can be used to test hypotheses. While it may be too complicated to fit experimental data, it can be used to generate synthetic data sets for evaluating parameters.
An inverse model is a mathematical model that fits experimental data. It aims to provide a best fit to the data. Values for the parameters are obtained from estimation techniques. It generally involves an iterative process to minimize the average difference. The quality of the inverse model is evaluated using well-known mathematical techniques as well as intuition.
Forward-inverse modeling is a process that combines data simulation with model fitting so that all parameters can be sufficiently evaluated for robustness, uniqueness and sensitivity. This is very powerful for improving data analysis and understanding its limitations.
A good inverse model should have a good fit and describe the data adequately so that some insights may follow. Its parameters are unique, and their values are consistent with the hypothesis and change with the experimental data in response to alterations in the system.
The steps for inverse modeling of data include:
1) selecting an appropriate mathematical model, using, say, polynomial or other functions
2) defining an objective function that measures the agreement between the data and the model
3) adjusting model parameters to get a best fit, usually by minimizing the objective function
4) evaluating the goodness of fit to the data, allowing that it will not be perfect due to measurement noise
5) estimating the accuracy of the best-fit parameter values
6) determining whether a much better fit is possible, which might be necessary if the optimization is stuck in a local minimum rather than the global minimum.
#codingexercise
Given an array and an integer k, find the maximum for each and every contiguous subarray of size k.
List<int> GetMaxInSubArrayOfSizeK(List<int> A, int k)
{
    var ret = new List<int>();
    var q = new Deque<int>(); // indexes of useful elements, largest value at the front
    // seed the deque with the first window
    for (int i = 0; i < k; i++)
    {
        while ((q.IsEmpty() == false) && A[i] >= A[q.PeekLast()])
            q.PopLast(); // smaller elements can never be a window max

        q.AddLast(i);
    }

    for (int i = k; i < A.Count; i++)
    {
        ret.Add(A[q.PeekFirst()]); // max of the previous window

        // drop indexes that have slid out of the current window
        while ((q.IsEmpty() == false) && q.PeekFirst() <= i - k)
            q.PopFirst();

        while ((q.IsEmpty() == false) && A[i] >= A[q.PeekLast()])
            q.PopLast();

        q.AddLast(i);
    }

    if (q.IsEmpty() == false)
        ret.Add(A[q.PeekFirst()]); // max of the last window
    return ret;
}
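As a cross-check, the same monotonic-deque idea runs directly in Python, where `collections.deque` is the standard-library double-ended queue:

```python
from collections import deque

def max_in_windows(a, k):
    """Max of every contiguous window of size k, via a deque of indexes
    kept in decreasing order of value."""
    q, out = deque(), []
    for i, x in enumerate(a):
        while q and a[q[-1]] <= x:   # smaller elements can never be a window max
            q.pop()
        q.append(i)
        if q[0] <= i - k:            # front index slid out of the window
            q.popleft()
        if i >= k - 1:
            out.append(a[q[0]])
    return out
```

Each index enters and leaves the deque at most once, so the whole scan is O(n) rather than the O(nk) of the brute-force version.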


Tuesday, November 14, 2017

We resume our discussion on modeling. A model articulates how a system behaves quantitatively. Models use numerical methods to examine complex situations and come up with predictions. Most common techniques involved for coming up with a model include statistical techniques, numerical methods, matrix factorization and optimizations.  
Sometimes we rely on experimental data to corroborate the model and tune it. Other times, we simulate the model to see the predicted outcomes and check whether they match the observed data. There are some caveats with this form of analysis. A model is merely a representation of our understanding based on our assumptions; it is not the truth. The experimental data is closer to the truth than the model. Even the experimental data may be tainted by how we question nature rather than by nature itself. This is what Heisenberg and Covell warn against. A model that is inaccurate may not be reliable in prediction. Even if the model is closer to the truth, garbage in may result in garbage out.
Any model has a test measure to determine its effectiveness. Since the observed and the predicted values are both known, a suitable test metric may be chosen; for example, the sum of squares of errors or the F-measure may be used to compare and improve models.
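Both test metrics mentioned above are a few lines apiece; the counts in the usage note below are illustrative values only:

```python
def sum_squared_error(observed, predicted):
    """Aggregate disagreement between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall, from
    true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, a classifier with 8 true positives, 2 false positives and 2 false negatives has precision 0.8 and recall 0.8, hence an F1 of 0.8.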
A forward model is a mathematical model that is detailed enough to include the desired level of real-world behaviour or features. It is used for simulating realistic experimental data which, under the right constraints, can be used to test hypotheses. While it may be too complicated to fit experimental data, it can be used to generate synthetic data sets for evaluating parameters.
An inverse model is a mathematical model that fits experimental data. It aims to provide a best fit to the data. Values for the parameters are obtained from estimation techniques. It generally involves an iterative process to minimize the average difference. The quality of the inverse model is evaluated using well known mathematical techniques as well as intuition. 
Forward-inverse modeling is a process that combines data simulation with model fitting so that all parameters can be sufficiently evaluated for robustness, uniqueness and sensitivity. This is very powerful for improving data analysis and understanding its limitations.
A good inverse model should have a good fit and describe the data adequately so that some insights may follow. Its parameters are unique, and their values are consistent with the hypothesis and change with the experimental data in response to alterations in the system.

#codingexercise
Given an array and an integer k, find the maximum for each and every contiguous subarray of size k.
List<int> GetMaxFromSubArraysOfSizeK(List<int> A, int k)
{
    var ret = new List<int>();
    for (int i = 0; i <= A.Count - k; i++)
    {
        // scan each window of size k for its maximum
        int max = A[i];
        for (int j = 1; j < k; j++)
        {
            if (A[i + j] > max)
                max = A[i + j];
        }
        ret.Add(max);
    }
    return ret;
}

Monday, November 13, 2017

We were discussing our recommender software yesterday and the day before. The recommender might get the geographical location of the user, the time of day and search terms from the owner. These are helpful to predict the activity the owner may take. The recommender does not need to rebuild the activity log for the owner, but it can perform correlations for the window it operates on. If it helps build the activity log for the year to date, then the correlation becomes easier by translating to queries and data mining over the activity log.
The activity log has a natural snowflake schema and works well for warehouse purposes as well. Additions to the activity log may be very granular or coarse or both, and these may be defined and labeled based on information found in the past or input from the user. The activity log has a progressive order, as in a time-series database, and allows standard query operators over date ranges. By virtue of allowing the user to accrue this log from anywhere and on any device, this database is well suited to be in the cloud. Public cloud databases or virtual data warehouses are well suited for this purpose. When the recommender performs correlations for the owner, it discovers activities by the owner. These activities are recorded in this database. If the recommender needs to search over date ranges, it can quickly use the activity log it helped build. The activity log for the owner gives the most information about the owner and consequently helps with recommendations.
Since the recommender queries many data sources, it is not limited to the activity log, but it eventually grows the activity log to be the source of truth about the owner.
#codingexercise
Segregate and sort even and odd integers:
List<int> SegregateAndSort(List<int> input)
{
    var odd = input.Where(x => x % 2 != 0).ToList(); // Where filters; Select would only project booleans
    var even = input.Where(x => x % 2 == 0).ToList();
    odd.Sort();
    even.Sort();
    return even.Concat(odd).ToList(); // Concat keeps duplicates that Union would drop
}
and test for our k-means classifier: https://github.com/ravibeta/cexamples/blob/master/classifiertest.c
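The segregate-and-sort exercise is a one-screen sketch in Python as well; note that `x % 2 != 0` also classifies negative odds correctly, which `x % 2 == 1` would not in languages where `%` can return a negative remainder:

```python
def segregate_and_sort(xs):
    """Evens sorted first, then odds sorted; duplicates are kept."""
    evens = sorted(x for x in xs if x % 2 == 0)
    odds = sorted(x for x in xs if x % 2 != 0)
    return evens + odds
```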

Sunday, November 12, 2017

We were discussing the personal recommender yesterday. The recommender has access to more data sources than conventional web applications and can perform more correlation than ever before. When integrated with a social networking application such as Facebook or Twitter, the recommender finds information about the friends of the owner. Places they have visited or activities they have posted become relevant to the current context for the owner. In this case, the recommender superimposes what others have shared with the owner. The context may be a place or activity to be used in the search query to pull data from these social networking applications. This is not intrusive to others and does not raise privacy issues. Similarly, it does not instigate movements or flash mobs, because the will to act on the analysis still rests with the owner. The level of information made available by the social networking applications is a setting in those applications and independent of the recommender. There are no HIPAA violations, and whether a user shares his or her visit to a pharmacy or hospital is entirely up to the user. It does provide valuable insight to the owner of the recommender when she decides to find a doctor.
The recommender does not have to take any actions. Whether the owner chooses to act on the recommendations or publish them on Facebook is entirely her choice. This feedback loop may be appealing to her friends and their social networking applications, but it is an opt-in.
The recommender is a personal advisor who can use intelligent correlation algorithms. For example, it can perform collaborative filtering using the owner's friends as data points. In this technique, the algorithm finds people with tastes similar to the owner's and evaluates a list to determine the collective rank of items. While standard collaborative filtering uses viewpoints from a distributed data set, the recommender may adapt it to include only the owner's friends.
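The friends-only collaborative filtering above can be sketched as follows. The `similarity` and `recommend` helpers are hypothetical and minimal: similarity here is a simple inverse of rating disagreement on co-rated items, standing in for the fancier measures (cosine, Pearson) a real recommender would use:

```python
def similarity(a, b):
    """Hypothetical similarity: 1 / (1 + total rating disagreement on co-rated items)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    diff = sum(abs(a[i] - b[i]) for i in common)
    return 1.0 / (1.0 + diff)

def recommend(owner, friends):
    """Rank items the owner hasn't rated by similarity-weighted friend ratings."""
    scores, weights = {}, {}
    for f in friends:
        w = similarity(owner, f)
        for item, rating in f.items():
            if item not in owner:
                scores[item] = scores.get(item, 0.0) + w * rating
                weights[item] = weights.get(item, 0.0) + w
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
```

Restricting `friends` to the owner's actual friends, rather than a distributed pool of strangers, is exactly the adaptation described above.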
The recommender might get the geographical location of the user, the time of day and search terms from the owner. These are helpful to predict the activity the owner may take. The recommender does not need to rebuild the activity log for the owner, but it can perform correlations for the window it operates on. If it helps build the activity log for the year to date, then the correlation becomes easier by translating to queries and data mining over the activity log.