Cluster computing

Friday, November 18, 2016

Today we take a break to discuss websockets
Websockets come in helpful where previously we were doing polling over HTTP. Here we maintain a connection different from HTTP and over TCP where both client and server can push events. Yes events is a good example of using web sockets because they are time series and are also numerous in nature. Websockets are full duplex connections so both sides can be producer and consumer.
Some examples of Client side libraries include :
socket.io and
web-socket-js
while python implementations include
Autobahn
and crossbar.io
Crossbar.io is particularly helpful in synchronous python stacks that want the latitude of near realtime behavior to Django applications.
it enables us to trigger notifications from our synchronous python web stack because it comes with a pusher service that can be configured in the json configuration file.
As an example use case, consider that we want to display real time monitoring of cpu, memory metrics on a dashboard using a python django server. the clients publish ata from each monitored machine to a central django server that in turn publishes to the frontend. the frontend updates on each new statistic because it listens for those events with say Autobahn. Here the difference is highlighted from the traditional model where the front-end pulls the information from a time series database by running a query every once in a while.
<pre>
<code>
#codingexercise
Print next greater palindrome of a given array with each element denoting a single digit of an integer
        static List<int> GetNextPalin(List<int> digits)
{
var current = digits.Select(x => x).ToList();
while (IsPalindrome(current, 0, current.Count -1) == false)
{
int start = 0;
int end = current.Count - 1;
while(start<end)
{
current[end] = current[start];
end--;
start++;
}
    if (compareTo(current, digits) == -1) // current < digits
{
if (start == end)
{
        if (current[start] + 1 >= 10){
            AddOne(ref current, start);
               continue;
        }
        current[start] = current[end] + 1;
}
if (start > end)
{
        if (current[start] + 1 >= 10){
            AddOne(ref current, start);
             continue;
        }
        current[start] = current[start] + 1;
        current[end] = current[end] + 1;
        end++;
        start--;
}
}
}
return current;
}
</code>
</pre>
Int = 123456, Palindrome = 124421
Int = 1, Palindrome = 1
Int = 12, Palindrome = 22
Int = 92, Palindrome = 99
Int = 31, Palindrome = 33
Int = 13, Palindrome = 22
Int = 83422469, Palindrome = 83433438
Int = 12921, Palindrome = 12921
Int = 125322, Palindrome = 125521
Int = 14587678322, Palindrome = 14587678541
Int = 94187978322, Palindrome = 94188088149

bool isPalindrome( List<int> digits, int start, int end)
{bool ret = true;
while(start < end)
{
if (digits[start] != digits[end]) return false;
start++;
end--;
}
return ret;

}

Thursday, November 17, 2016

Yesterday we were discussing Metropolis Hastings algorithm. It helps with drawing independent samples from a high dimensional distribution. This algorithm is iterative. In each iteration, more and more samples are produced where the next sample depends only on the current in a Markov chain like manner.The number of samples is proportional to the number of iterations. In each iteration it picks a sample based on the current value. then it measures how close this sample is to the desired distribution. if the sample is accepted, then the new sample is used as the current value.if it is rejected, the current value is reused. The method uses two steps to generate the next in the sequence.
The parameters to the algorithm are the number of samples to draw denoted by n and the standard deviation denoted by sigma. When the standard deviation is too small, the proposal distribution is set too narrow. When the standard deviation is very large, the proposal distribution is set too wide, the acceptance rate is low and the distribution is highly irregular.
In addition, there are two parameters that help with improving the sampling. The first is called the burnin by which we delay the actual collection until the algorithm has run for a while. The other parameter is called the lag by which we allow several iterations of the sampler to elapse in between successive samples
We were looking at a few interesting applications of this algorithm. One such application was in mapping sequences of unknown text to sequences of labels in known text so as to decipher unknown text. Let us say x is the set of words in unknown text that mapped to y labels in known set. Where there is some association of words to labels and we denote it by f. We take a well known text and count the consecutive text associations from x to y in a matrix M(x,y) of transitions. Then we define a Markov chain function for the consecutive occurrences. We compute this function for two symbols as a starting guess. We change to a candidate by choosing randomly from the values f assigns and then we perform the two steps of the algorithm - first two compute the value of the function for the candidate and then to accept/reject it. Accordingly we move to the next candidate or perform another iteration. This way we move from an unknown space to a sequence in a known space that we understand.
Another application of this algorithm is the "Nested sampling for general Bayesian computation" as described by Skilling. Nested sampling estimates directly how the likelihood function relates to previous value of the discrete random variable. It can be used to compare two data models for the same set by comparing the conditional probability that is assigned after the relevant evidence is taken into account.
#codingexercise
we were discussing several methods to find the maximum non-repeated substring in a given string. We mentioned an approach that lets us merge two sub ranges by counting the desired length in both directions from the join. Additionally, we know the max length of the desired substring in both sub ranges as left and right. Consequently the maximum on merge is max(left, right, length from join)
As we compute bottom up, we know the left and right progressively since we start from single element length sub ranges and keep merging up until we get the original sequence.Since the join forms at every element boundary, we know the maximum propagates to the top.
int GetMergeBottomUpMax(string input, int start, int end)
{
int max = 0;
if (start < end)
{
int mid = (start+end)/2;
int left = GetMergeBottomUpMax(input, start, mid);
int right = GetMergeBottomUpMax(input,mid+1,end);
max = merge_with_init(input, start,mid,end,left,right);
}
return max;
}

Wednesday, November 16, 2016

\prod M (f (s i), f (s i + 1))

then we do the following :
Compute Pl(f).
• Change to fdash by making a random transposition of the values f assigns to two symbols.
• Compute Pl(fdash); if this is larger than Pl(f), accept fdash.
• If not, flip a Pl(fdash)/Pl(f) coin; if it comes up heads, accept fdash.
• If the coin toss comes up tails, stay at f.

#codingexercise
we were discussing several methods to find the maximum non-repeated substring in a given string.
When they are O(N^2) it may take a long time to get to the desired substring if it is skewed towards one end.
Instead, in one way, we can distribute the search in a scatter-gather mode where we pick different starting points throughout the length of the string for a faster search. This starting point is denoted with the parameter mid below:

static int merge(string input, int start, int mid, int end, int left, int right)
{
if (mid == start || mid == end) return 0;
var ret = new List<int>(){left,right};
var h = new List<int>(256);
h.AddRange(Enumerable.Repeat(0, 256));
h[input[mid]]++;
int lcount = 1;
for (int i = mid - 1; i >= start; i--)
{
if (h[input[i]] != 0)
{
break;
}
else
{
h[input[i]]++;
lcount++;
}
}
var t = new List<int>(256);
t.AddRange(Enumerable.Repeat(0, 256));
t[input[mid]]++;
int rcount = 1;
for (int j = mid + 1; j <= end; j++)
{
if (t[input[j]] != 0)
{
break;
}
else
{
t[input[j]]++;
rcount++;
}
}
int ccount = lcount;
for (int k = mid + 1; k <= end; k++)
{
if (h[input[k]] != 0)
{
break;
}
else
{
h[input[k]]++;
ccount++;
}
}
ret.Add(lcount);
ret.Add(rcount);
ret.Add(ccount);
return ret.Max();
}

the above implementation is only to show that the we could start for all points between start and end as mid and because we could go about that in any order we get an early convergence towards max.

Tuesday, November 15, 2016

Yesterday we were discussing Metropolis Hastings algorithm. It helps with drawing independent samples from a high dimensional distribution. This algorithm is iterative. In each iteration, more and more samples are produced where the next sample depends only on the current in a Markov chain like manner.The number of samples is proportional to the number of iterations. In each iteration it picks a sample based on the current value. then it measures how close this sample is to the desired distribution. if the sample is accepted, then the new sample is used as the current value.if it is rejected, the current value is reused. The method uses two steps to generate the next in the sequence. The first stage is to generate a candidate from the proposal distribution which depends on the current state of the Markov chain. The proposal distribution is usually defined with a standard deviation. The second stage is the accept-reject step. We saw that the acceptance probability depends on the ratios of the probabilities for the candidate in the given and proposal distribution taken together. The first ratio does not depend on the normalizing factor which makes the calculation easier and the second ratio is a correcting factor which can be simplified for a special case of this algorithm.
With the candidate generated and the acceptance probability calculated, the candidate is either accepted or rejected. To make this distinction, a uniformly distributed random number is generated between 0 and 1 and denoted by u. Then if u is less than or equal to the acceptance probability, we accept the candidate and if it isn't, then we reject the candidate. We start the next iteration.
The parameters to the algorithm are the number of samples to draw denoted by n and the standard deviation denoted by sigma. When the standard deviation is too small, the proposal distribution is set too narrow. The acceptance rate is high so the chain doesn't get stuck in any one spot but it doesn't cover a wide range. When the standard deviation is very large, the proposal distribution is set too wide, the acceptance rate is low and the distribution is highly irregular. The chain gets stuck in one spot for long periods of time but it does manage to make big jumps, covering the whole range.
There are two parameters that can be introduced to help with improving the sampling. The first is called the burnin by which we delay the actual collection until the algorithm has run for a while. The other parameter is called the lag by which we allow several iterations of the sampler to elapse in between successive samples which comes in helpful when the Markov chain gets stuck in one location for long periods of time.
Courtesy : Navarro and Perfors

#codingexercise
Yesterday we were discussing a solution for finding non-repeating character sequence in a string. Here's another.
static int GetMaxFreq(string input)
{
Console.WriteLine("input={0}", input);
int count = 0;
int max = 0;
for (int i = 0; i < input.Length; i++)
{
var freq = new List<int>(256);
freq.AddRange(Enumerable.Repeat(0, 256));
freq[input[i]]++;
count = 1;
if (count > max)
max = count;
for (int j = i + 1; j < input.Length; j++)
{
if (freq[input[j]] != 0)
{
break;
}
else
{
count++;
}
if (count > max)
max = count;
freq[input[j]]++;
}
}
return max;
}

#puzzle given three multiple choices for a question as below, find the right answer:

A. Answer A
B. Answer A or B
C. Answer B or C

by process of mutial exclusion, the answer is C. if it were A or B there would be more than one choice possible.

Monday, November 14, 2016

Metropolis Hastings algorithm helps with performing independent sampling in a high dimensional distribution. In smaller dimensions, it is easier to tell apart samples by segregating batches that are not dependent on a factor. For example, if we were to find the effectiveness of a medicine, we would create two groups - one where the medicine is administered and one where it isn't. When there are many many factors, the samples are difficult to draw.
This algorithm generates samples iteratively with the desired distribution. In each iteration, more and more samples are produced where the next sample depends only on the current in a Markov chain like manner. hence it is also called Markov Chain Monte Carlo method. The number of samples is proportional to the number of iterations. In each iteration it picks a sample based on the current value. then it measures how close this sample is to the desired distribution. if the sample is accepted, then the new sample is used as the current value.if it is rejected, the current value is reused.
The idea is that we generate a sequence of x values one after the other say x0, x1, x2, .. xn such that as we increase their number to a large value, we can see that the last sample is the one with the Probability distribution P(x).
The method uses two steps to generate the next in the sequence. The first stage is to generate a candidate from the proposal distribution which depends on the current state of the Markov chain. The proposal distribution is usually defined with a standard deviation. The second stage is the accept-reject step.
In this step we calculate acceptance probability which depends on the ratios of the probabilities for the candidate in the given and proposal distribution taken together. Given the current state of the Markov chain as xn, the first ratio is that of the probability of the candidate to the probability of the current state. The second ratio is that of the conditional probabilities of the current state given the candidate and that of the candidate given the current state. The first ratio does not depend on the normalizing constant for the distribution P(x) which makes it easier to calculate. The second ratio is a correcting factor for any biases that the proposal distribution might induce. Since the second ratio involves the conditional distribution of the candidate given the current state as the denominator, it is a measure of what happened. The opposite is true for the numerator in the second ration which is therefore a measure of what would have occurred. The second ratio turns out to be 1 when the proposal distribution is normal. This is a special case of Metropolis Hastings algorithm which is then called the Metropolis algorithm
#codingexercise
Yesterday we were discussing an iterative solution for finding non-repeating character sequence in a string. Here is the recursive version:

static void GetMaxRecursive(String input, int start, int curLen, ref int max, ref List<int> h)

        {

               if (start > input.Length - 1)
               {
                   if (curLen >= max)
                   {
                       max = curLen;
                       Console.WriteLine(
                       "One Non repeating character substring found at index {0} is {1}",
                       input.Length - max,
                       input.Substring(input.Length - max, max));
                   }
                    return;
               }

                int prev = h[input[start]];
                if (prev == -1 || start - curLen > prev)
                {
                    curLen++;
                }
                else
                {
                    if (curLen > max)
                    {
                        max = curLen;
                        Console.WriteLine(
                        "One Non repeating character substring found at index {0} is {1}",
                        start - max,
                        input.Substring(start - max, max));
                    }
                    curLen = start - prev;
                }
                h[input[start]] = start;
                if (input.Length - max >= 0 && curLen >= max)
                {
                    max = curLen;
                    Console.WriteLine(
                    "One Non repeating character substring found at index {0} is {1}",
                    input.Length - max,
                    input.Substring(input.Length - max, max));
                }

            GetMaxRecursive(input, start+1, curLen, ref max, ref h);
        }

Puzzle : we have five pieces of chains each three links long. Breaking open a link costs 1 dollar and resealing costs 3 dollars. Can we form one long chain with fifteen dollars.
Answer Yes, break open all links of one chain. Use these three as connectors for the remaining pieces. Cost = 3 x 1 + 3 x 3 = 12.

Sunday, November 13, 2016

#codingexercise

Get the length of the longest non-repeating character subsequence in a given input string

static int GetLongestDistinctSubstringLength(String input)
{
var visited = new List<int>(256);
visited.AddRange(Enumerable.Repeat(-1, 256));
if (input.Length == 0) return 0;
visited[input[0]] = 0;
int prev = 0;
int curLen = 1;
int max = curLen;
for (int i = 1; i < input.Length; i++)
{
prev = visited[input[i]];
if (prev == -1 || i - curLen > prev)
{
curLen++;
}
else
{
if (curLen > max)
{
max = curLen;
Console.WriteLine(
"One non-repeating character substring found at index {0} is {1}",
i-max,
input.Substring(i-max, max));
}
curLen = i - prev;
}
visited[input[i]] = i;
}
if (curLen >= max)
{
max = curLen;
Console.WriteLine(
"One non-repeating character substring found at index {0} is {1}",
input.Length - max,
input.Substring(input.Length - max, max));
}
return max;
}

Console.WriteLine("Max = {0}", GetLongestDistinctSubstringLength("ABRACADABRA"));
Console.WriteLine("Max = {0}", GetLongestDistinctSubstringLength("A"));
Console.WriteLine("Max = {0}", GetLongestDistinctSubstringLength("AB"));
Console.WriteLine("Max = {0}", GetLongestDistinctSubstringLength("AA"));
Console.WriteLine("Max = {0}", GetLongestDistinctSubstringLength("ABC"));
Console.WriteLine("Max = {0}", GetLongestDistinctSubstringLength("ABCA")); // expected ABC
Console.WriteLine("Max = {0}", GetLongestDistinctSubstringLength("ABCB")); // expected ABC

One Non repeating character substring found at index 0 is ABR
One Non repeating character substring found at index 1 is BRAC
Max = 4
One Non repeating character substring found at index 0 is A
Max = 1
One Non repeating character substring found at index 0 is AB
Max = 2
One Non repeating character substring found at index 1 is A
Max = 1
One Non repeating character substring found at index 0 is ABC
Max = 3
One Non repeating character substring found at index 0 is ABC
One Non repeating character substring found at index 1 is BCA
Max = 3
One Non repeating character substring found at index 0 is ABC
Max = 3

We can make this recursive as well because the current character either participates in the max length or it doesn't.

Saturday, November 12, 2016

Yesterday we were discussing scoped persistent connections to emulate a terminal session. We said one way to do it would be to use a real session and overlay the emulated session on top of it. This translates to the user seeing an emulated session and not knowing the difference that there is no interpreter associated with it. The command are merely piped forward and backwards through the real shell console.
We pointed out that there are two independent objects that have a scope and maintain a timeline. One is the session concerning the real shell console which is instantiated and held so that user commands on the emulated window flow here and the results flow back. Typically this real shell console's lifetime is maintained across all the commands issued by the user. While each command may be stateless, it is the real console that maintains state between the commands so that the results of the previous commands are available to the user. Consequently the API to redirect the commands from the user to the console is stateless but the console is stateful.
This API middle layer in fact looks very much like any programming's shell command invocation. For example, in python we do:

class ScopedRealConsole():

  def __init__(self, ):
      self.process = None
      self.kill_proc = None
      self.timer = None

  def setup(self, preamble):
        import shlex, subprocess
        from threading import Timer
        from subprocess import Popen, PIPE
        self.process = subprocess.Popen(preamble, stdin = subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        self.kill_proc = lambda p: p.kill()
        self.timer = Timer(10, kill_proc, [process])

  def teardown(self):
      if self.process != None:
         self.process.kill()

  def execute(self, cmds):
        output = ''
        try:
          timer.start()
          for i in range (1,len(cmds)):
             self.process.stdin.write(str.encode(cmds[i]))
          output, stderr = self.process.communicate()
          output = output.decode('utf-8')
        finally:
          timer.cancel()
        return output

We are merely pointing out that the Popen is separated from the communicate. The Popen is done in a setup phase. The communicate happens for each user command as and when it is invoked and the teardown happens via the cancelation of the process. They are not all done together otherwise cmd_execute becomes stateless and we want this to be stateful. Furthermore the setup and teardown are called once for the duration of the real console. And we pointed out this is the case with any networking protocol. Consequently the considerations for securing and functionality of the communication are the same as with networking protocols.
The second connection is the user session which dictates how the commands are tied together. This is maintained with a tag such as a request.session.session_key. Over all the request flowing in to the real console, we can tell apart those that belong to a user by this key. When the commands and outputs are captured for the same session key, we have the content for the emulation window.
The emulation happens with an emulator like so:

<div id="term_demo"></div>
        <p>Javascript code:</p>
        <pre class="javascript">jQuery(function($, undefined) {
    $('#term_demo').terminal(function(command, term) {
        if (command !== '') {
            try {
                var result = cmd_execute(command);
                if (result !== undefined) {
                    term.echo(new String(result));
                }
            } catch(e) {
                term.error(new String(e));
            }
        } else {
           term.echo('');
        }
    }, {
        greetings: 'Emulation of a shell in an insolate Linux Container',
        name: 'bash_demo',
        height: 200,
        prompt: 'bash> '
    });
});</pre>

and the user session dictates how long the page for the emulator is shown. Each page is unique to the user session and the destination on which the commands are targeted. The destination is analogous to the well-known screen command of the terminal and comes with an identifier in addition to the IP address of the machine on which it is targeted.
The display of the emulated console is merely a repetition of single line <div> elements that all look the same with a black window and beat the text associated with the corresponding line in the real console.