Cluster computing

Friday, November 11, 2016

Scoped and Persistent Connections

Let us take a look at an example where we use both stateful and stateless APIs. We use the example of a terminal console. In a stateless API, the console takes one command at a time and returns the output. No state is maintained between the previous command and the current one. So if we want to list the files of a directory by changing to that directory, we must do it all at once in a single command or not at all. In a stateful API, we process each command but maintain the state, so we can issue one command to change the directory and another to list the files.

One way to mimic the stateful API over the stateless API would be to replay the history of commands. If there is only one command, we execute it. If there is a second command, we look up the history to find the earlier commands, execute them and then execute the given command. If we can tolerate the performance cost, we can use this method; however, it assumes the commands are idempotent, which means they can be called repeatedly to the same effect.
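The replay approach can be sketched as follows. This is only an illustration in Python: `run_stateless` is a hypothetical stand-in for the stateless API that understands just `cd` and `pwd`, and `ReplayConsole` is a made-up name for the layer that keeps the history.

```python
def run_stateless(commands):
    """Hypothetical stateless API: starts from a fresh state on every call,
    runs the given commands in order and returns the output of the last one."""
    cwd = "/"                      # state exists only for the duration of this call
    output = ""
    for command in commands:
        name, _, arg = command.partition(" ")
        if name == "cd":
            cwd = arg if arg.startswith("/") else cwd.rstrip("/") + "/" + arg
            output = ""
        elif name == "pwd":
            output = cwd
        else:
            raise ValueError("unsupported command: " + command)
    return output


class ReplayConsole:
    """Mimics a stateful console by replaying the command history each time."""

    def __init__(self):
        self.history = []

    def execute(self, command):
        # Replay every earlier command, then the new one, in one stateless call.
        output = run_stateless(self.history + [command])
        self.history.append(command)
        return output


console = ReplayConsole()
console.execute("cd /var")
console.execute("cd log")
print(console.execute("pwd"))      # the earlier cd commands are replayed first
```

Each call costs as much as the whole history, which is the performance price mentioned above, and the replay is only safe because `cd` and `pwd` give the same result every time they are repeated.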

Generally the two kinds of API are not interchanged since they serve different purposes, but one can be layered over the other. In this case, we take the example of state represented by a connection and run the stateless calls for the commands over it. The only difference is that the connection has a setup and a teardown, and we persist it for the duration of the calls. This is very similar to a socket, and in fact many networking protocols have a setup and a teardown phase.

The commands arriving over the stateless APIs are decoupled from the stateful setup and teardown, which maintains a single connection. All the commands run over that connection and their output is returned through the stateless APIs.
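A scoped connection of this kind might look like the sketch below. The `Connection` class is hypothetical and only records what it is asked to run; the point is that setup and teardown bracket the calls while each command remains an ordinary stateless call.

```python
from contextlib import contextmanager


class Connection:
    """Hypothetical stateful resource with an explicit setup and teardown."""

    def __init__(self):
        self.is_open = False
        self.log = []

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def send(self, command):
        if not self.is_open:
            raise RuntimeError("connection is closed")
        self.log.append(command)
        return "ran " + command


@contextmanager
def scoped_connection():
    """Keeps one connection alive for the duration of the with-block."""
    connection = Connection()
    connection.open()              # setup
    try:
        yield connection
    finally:
        connection.close()         # teardown, even if a command fails


def run_command(connection, command):
    """Stateless call: everything it needs arrives as arguments."""
    return connection.send(command)


with scoped_connection() as conn:
    outputs = [run_command(conn, c) for c in ("cd /tmp", "ls")]
print(outputs)
```

Once the block exits the connection is closed, so any further `run_command` call on it raises an error instead of silently running without state.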

Here the state is maintained for the resource. In some cases, the state is maintained for the user. In such a case the order is reversed. The stateful layer is over the stateless APIs. This is exactly the case for many portals where the user session needs to be tracked so that the users have a seamless experience between the actions they take and they don’t need to repeat their authentication each time.

Such state has a lifetime, which is either established beforehand or tied to user actions. In all these cases, the destruction of the state invalidates subsequent API calls, much like a socket that has been terminated so that no subsequent actions can be taken. Generally the state is referred to by a key or an identifier that the subsequent API calls use.
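As a rough illustration of state that is looked up by key and has a lifetime, consider the in-memory store below. `SessionStore` and its methods are invented for this sketch and do not correspond to any particular framework; the `now` parameter exists only so the expiry can be demonstrated without waiting.

```python
import time
import uuid


class SessionStore:
    """Hypothetical in-memory store: state is looked up by key and expires."""

    def __init__(self, lifetime_seconds):
        self.lifetime = lifetime_seconds
        self.sessions = {}

    def create(self, now=None):
        now = time.time() if now is None else now
        key = uuid.uuid4().hex
        self.sessions[key] = {"expires": now + self.lifetime, "state": {}}
        return key

    def get(self, key, now=None):
        now = time.time() if now is None else now
        session = self.sessions.get(key)
        if session is None or session["expires"] <= now:
            self.sessions.pop(key, None)   # destroyed state invalidates later calls
            raise KeyError("session expired or unknown")
        return session["state"]

    def destroy(self, key):
        self.sessions.pop(key, None)


store = SessionStore(lifetime_seconds=60)
key = store.create(now=0)
store.get(key, now=10)["cwd"] = "/tmp"     # first call stores some state
print(store.get(key, now=20))              # a later call sees the same state
try:
    store.get(key, now=61)                 # past the lifetime
except KeyError as error:
    print("rejected:", error)
```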

#codingexercise

Given four keys, the letter A, Ctrl A, Ctrl C and Ctrl V, find the maximum number of A's we can print using a combination of these keys in N keystrokes. The elements of repetition are:
A
Ctrl V
Ctrl V Ctrl V
Ctrl A Ctrl C Ctrl V
Therefore, one way to do this could be:

static int getCount(int N)
{
    if (N <= 0)
        return 0;
    if (N <= 6)
        return N; // pressing A every time is optimal for up to 6 keystrokes
    int max = 0;
    // Try every breakpoint b: reach the best count in b keystrokes, then spend
    // Ctrl A and Ctrl C and use the remaining N - b - 2 keystrokes on Ctrl V.
    for (int b = N - 3; b >= 1; b--)
    {
        int count = (N - b - 1) * getCount(b); // the original plus N - b - 2 pastes
        max = Math.Max(max, count);
    }
    return max;
}

Console.WriteLine("Max = {0}", getCount(6));
Console.WriteLine("Max = {0}", getCount(9));

Max = 6
Max = 16

We can also consider a variation of the above dynamic programming with varying-length repetitions of Ctrl V and of Ctrl A, Ctrl C, Ctrl V at any position beyond the sixth occurrence of A.
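One such variation is a bottom-up table in which every position may be followed by Ctrl A, Ctrl C and any number of pastes. The sketch below is in Python, and the name `max_a_count` is made up for illustration. Under this formulation nine keystrokes give 16: four A's, then Ctrl A, Ctrl C and three pastes.

```python
def max_a_count(n):
    """Maximum number of A's printable with n keystrokes (bottom-up table)."""
    if n <= 0:
        return 0
    best = list(range(n + 1))      # best[k] = k when we only ever press A
    for k in range(7, n + 1):
        # b keystrokes build the text, two go to Ctrl A and Ctrl C,
        # and the remaining k - b - 2 are pastes of best[b] characters each.
        for b in range(1, k - 2):
            best[k] = max(best[k], (k - b - 1) * best[b])
    return best[n]


for n in (6, 7, 9):
    print(n, max_a_count(n))
```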

Thursday, November 10, 2016

Today we take a break to discuss stateful APIs. With the move towards REST frameworks, most APIs are necessarily and correctly stateless. However, once in a while we do come across use cases where the APIs have to be stateful. Take for instance a native OS command. Within the APIs, we can invoke a native system command and return the output each time we encounter a command. Since all the commands issued are recorded in .bash_history, we don't always need to maintain state, and we can even replay the commands issued, assuming they are idempotent. But if we open a Python console and start issuing Python statements, we no longer have that state unless we redirect the input and output via pipes to the API, and even then nothing is stashed unless the APIs keep track of it separately.
Enter request.session.session_key: we now solve this for every session initiated by the user by attaching the state to the session. Since the state can be persisted, all we have to do is name the state with the session key so that we can tell the sessions apart. Still, this is mostly a front-end concern and the APIs remain stateless. Many UI frameworks support this model by facilitating the notion of sessions.
So the question really is: should the statefulness be pushed down to the API?
Pautasso et al. classify application integration styles as shared database, remote procedure call, message bus and file transfer.

#codingexercise

Problem: Given a sequence of words, print all anagrams together.

Sort the letters within a word
Sort the words in the sequence
This brings the anagram groupings
Since each anagram keeps track of its index in the sequence we can find the corresponding words and their groupings.

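// requires using System.Linq
void PrintAllAnagrams(List<string> words)
{
    // Pair the sorted letters of each word with the word's index in the sequence.
    var items = new List<Tuple<string, int>>();
    for (int i = 0; i < words.Count; i++)
        items.Add(Tuple.Create(new string(words[i].OrderBy(c => c).ToArray()), i));
    // Sorting by the sorted letters brings the anagrams next to each other.
    items.Sort((a, b) => string.CompareOrdinal(a.Item1, b.Item1));
    items.ForEach(x => Console.WriteLine("{0} at index {1}", words[x.Item2], x.Item2));
}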

Or we could simply cluster them based on anagram similarity. This clustering has to be hierarchical since we don’t know the number of anagram groups in advance, and the threshold has to be zero because we are looking for exact matches and nothing else qualifies.

// requires using System.Linq
void PrintAllAnagramsByClustering(List<string> words)
{
    // Every word starts out in its own cluster.
    var clusters = words.Select(w => new List<string> { w }).ToList();
    bool merged = true;
    while (merged)
    {
        merged = false;
        for (int i = 0; i < clusters.Count && !merged; i++)
            for (int j = i + 1; j < clusters.Count && !merged; j++)
            {
                // Threshold is zero: merge only clusters whose words are exact anagrams.
                if (Distance(clusters[i][0], clusters[j][0]) == 0)
                {
                    clusters[i].AddRange(clusters[j]);
                    clusters.RemoveAt(j);
                    merged = true; // rescan, since the cluster list has changed
                }
            }
    }
    clusters.ForEach(c => Console.WriteLine(string.Join(" ", c)));
}

// Assumed helper: the distance is zero exactly when both words use the same letters.
int Distance(string a, string b)
{
    var x = new string(a.OrderBy(c => c).ToArray());
    var y = new string(b.OrderBy(c => c).ToArray());
    return x == y ? 0 : 1;
}

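We could also use a hashing function that computes the sum of the ASCII values of the letters in each word. Words with different sums cannot be anagrams, but equal sums do not guarantee an anagram (for example "ad" and "bc" have the same sum), so the hash only narrows down the candidates and an exact comparison is still needed.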
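The grouping can also be sketched with a dictionary keyed by the sorted letters, which acts as a collision-free hash for this purpose. The Python below is an illustration only, and `group_anagrams` is a made-up name.

```python
from collections import defaultdict


def group_anagrams(words):
    """Groups (word, index) pairs whose sorted letters are identical."""
    groups = defaultdict(list)
    for index, word in enumerate(words):
        key = "".join(sorted(word))        # identical for all anagrams of a word
        groups[key].append((word, index))
    return list(groups.values())


for group in group_anagrams(["cat", "dog", "tac", "god", "act"]):
    print(group)

# The sum of character codes alone cannot tell these two words apart.
print(sum(map(ord, "ad")) == sum(map(ord, "bc")))
```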

Wednesday, November 9, 2016

We continue reading on Kubernetes. Kubernetes is not a traditional, all-inclusive PaaS. Unlike a PaaS that restricts applications, dictates the choice of application frameworks, restricts the supported language runtimes or distinguishes apps from services, Kubernetes aims to support an extremely diverse variety of workloads. As long as the application has been built to run in a container, it will work with Kubernetes. A PaaS provides databases, message buses and cluster storage systems, but those can run on Kubernetes. There is also no click-to-deploy service marketplace. Kubernetes does not build user code or deploy it. However, it allows CI workflows to run on it.
Kubernetes allows users to choose their logging, monitoring and alerting solutions. Kubernetes also does not require a comprehensive application language or system. It is independent of machine configuration or management. But a PaaS can run on Kubernetes and extend its reach to different clouds.
A Kubernetes cluster can be launched on machines running Ubuntu 16.04, CentOS 7 or HypriotOS v1.0.1+ with an installation tool called kubeadm. The process works with local VMs, physical servers and/or cloud servers. The tool assumes we already have a set of machines, virtual or physical, with virtual machines preferred over bare metal. Each machine should have at least 1 GB of RAM, and there should be full network connectivity between all machines in the cluster.
A Kubernetes cluster can also be installed in the cloud. While AWS used to be the preferred choice, Azure now supports Kubernetes installation as well. On AWS this is done with a tool called kops, which is well known for its high-availability support. It is self-healing and auto-scaling, and it can either provision directly or generate Terraform manifests. It uses fully qualified names for addressing within the cluster and outside. It creates a Route 53 domain for the cluster so that all the cluster machines can be looked up without remembering each one's IP address or name. It creates an S3 bucket to store the cluster state. It builds the cluster configuration. It creates the cluster in AWS and enables several add-ons, such as WeaveNet, Calico, Flannel, Canal and Romana for networking, WeaveScope and Dashboard for the web interface, and legacy add-ons.
#codingexercise
Find a tour for a trucker to visit all gas stations in a circle
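The naive solution can be O(n^2) but we can make it linear.
int Tour(List<int> fill, List<int> use)
{
    int min = int.MaxValue;
    int reserve = 0;
    int index = 0;
    for (int i = 0; i < fill.Count; i++)
    {
        reserve += fill[i] - use[i]; // running surplus after leaving station i
        if (reserve < min)
        {
            // lowest point of the running surplus: start right after it
            min = reserve;
            index = (i + 1) % fill.Count;
        }
    }
    if (reserve >= 0) // total fuel covers total usage, so a tour exists
        return index;
    return -1;
}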
Convert a binary tree into one where each node holds the sum of the nodes in its left and right subtrees
int ToSumTree(node root)
{
    if (root == null) return 0;
    int val = root.data; // remember the original value before overwriting it
    root.data = ToSumTree(root.left) + ToSumTree(root.right);
    return root.data + val; // sum of the whole subtree, including this node
}
The same algorithm holds even if we want the difference instead of the sum.

Given a sequence of words, print all anagrams together.

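// requires using System.Linq
void PrintAllAnagrams(List<string> words)
{
    // Pair the sorted letters of each word with the word's index in the sequence.
    var items = new List<Tuple<string, int>>();
    for (int i = 0; i < words.Count; i++)
        items.Add(Tuple.Create(new string(words[i].OrderBy(c => c).ToArray()), i));
    // Sorting by the sorted letters brings the anagrams next to each other.
    items.Sort((a, b) => string.CompareOrdinal(a.Item1, b.Item1));
    items.ForEach(x => Console.WriteLine("{0} at index {1}", words[x.Item2], x.Item2));
}
Or we could simply cluster them based on anagram similarity.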

Tuesday, November 8, 2016

Yesterday we were comparing Linux containers with virtual machines and cited ongoing efforts such as Kubernetes. Windows Azure now claims to support Kubernetes. Native containers are small and fast. They have two characteristics. First, the containers are isolated from each other and from the host, in that they even have their own file system, which makes them portable across clouds and OS distributions. Second, immutable container images can be created at build/release time rather than at the deployment time of the application, since each application doesn't need to be composed with the rest of the application stack nor tied to the production infrastructure environment. Kubernetes extends this idea of app+container all the way to where the hosts can be nodes of a cluster. Kubernetes evolved as an industry effort from the operating system's native Linux container support. It can be considered a step towards a truly container-centric development environment. Containers decouple applications from infrastructure, which separates dev from ops. Containers demonstrate better resource isolation and improved resource utilization.
At this point it is important to differentiate Kubernetes from PaaS. Kubernetes is not a traditional, all-inclusive PaaS. Unlike a PaaS that restricts applications, dictates the choice of application frameworks, restricts the supported language runtimes or distinguishes apps from services, Kubernetes aims to support an extremely diverse variety of workloads. As long as the application has been built to run in a container, it will work with Kubernetes. A PaaS provides databases, message buses and cluster storage systems, but those can run on Kubernetes. There is also no click-to-deploy service marketplace. Kubernetes does not build user code or deploy it. However, it allows CI workflows to run on it.
Kubernetes allows users to choose their logging, monitoring and alerting solutions. Kubernetes also does not require a comprehensive application language or system. It is independent of machine configuration or management. But a PaaS can run on Kubernetes and extend its reach to different clouds.
#codingexercise
Find the modulus of very large numbers represented as (base^exp)%m
Modulus distributes over multiplication and addition
(a*b)%m = ((a%m)*(b%m))%m
(a+b)%m = ((a%m) + (b%m))%m
int modulus(int b, int exp, int m)
{
    // assume parameter validation; b is the base (base is a reserved word in C#)
    long x = b % m;
    long result = 1;
    while (exp > 0)
    {
        if ((exp & 1) == 1)
            result = (result * x) % m; // long keeps the intermediate product from overflowing
        x = (x * x) % m;
        exp >>= 1;
    }
    return (int)result;
}
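For comparison, the same square-and-multiply loop can be written in Python and checked against the built-in three-argument `pow`, which performs modular exponentiation. The function name `mod_pow` is made up for this sketch, and it assumes a non-negative exponent and a positive modulus.

```python
def mod_pow(base, exp, m):
    """Computes (base ** exp) % m by repeated squaring."""
    base %= m
    result = 1 % m                 # handles m == 1, where every remainder is 0
    while exp > 0:
        if exp & 1:                # the lowest bit of exp is set
            result = (result * base) % m
        base = (base * base) % m   # square the base for the next bit
        exp >>= 1
    return result


print(mod_pow(2, 10, 1000))        # 1024 % 1000
print(mod_pow(3, 200, 13) == pow(3, 200, 13))
```

Reducing after every multiplication relies on the product rule above, so the intermediate values never grow beyond m squared.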

Alternatively we can compute the operand first and take the modulus afterwards. The operand can be very large, so this version uses BigInteger (requires using System.Numerics).
BigInteger power(BigInteger b, uint exp)
{
    BigInteger result = 1;
    while (exp > 0)
    {
        if ((exp & 1) == 1)
            result = result * b;
        b = b * b;
        exp = exp >> 1; // reduce exp by half
    }
    return result;
}
Then
int modulus(BigInteger operand, int m)
{
    // assumes a non-negative operand; processes it one decimal digit at a time
    var digits = operand.ToString();
    long remainder = 0;
    for (int i = 0; i < digits.Length; i++)
        remainder = (remainder * 10 + (digits[i] - '0')) % m;
    return (int)remainder;
}

And now we can also split the operand into two parts and, going further, make this recursive
int moduluswhole(int number, int m)
{
    // split once; the recursive form with a termination condition follows
    return (modulus(number / 2, m) + modulus(number - (number / 2), m)) % m;
}

int modulusWhole(int operand, int m)
{
    if (operand < 10000)
    {
        return modulus(operand, m);
    }
    return (modulusWhole(operand / 2, m) + modulusWhole(operand - (operand / 2), m)) % m;
}

https://github.com/ravibeta/PythonExamples/blob/master/iamapi.zip

Monday, November 7, 2016

Yesterday we were discussing Linux containers. We were comparing containers with virtual machines. Today we look at this comparison in depth.
First, containers are small and fast. Unlike virtual machines, where a hypervisor carves out the VM as a separate machine that can even run a different operating system from its own, containers share the host's operating system, so containers run wherever the operating system is the same. Specifically, containers are best used for packaging applications so that they can move around. The old way to deploy applications was to install them on a host using the operating system's package manager. This tied the application to the host OS. Instead we could build immutable VM images in order to achieve predictable rollouts and rollbacks, but VMs are heavyweight and non-portable. The new way is to deploy containers based on operating-system-level virtualization rather than hardware virtualization. There are two significant advantages to this. First, the containers are isolated from each other and from the host, in that they even have their own file system, which makes them portable across clouds and OS distributions. Second, immutable container images can be created at build/release time rather than at the deployment time of the application, since each application doesn't need to be composed with the rest of the application stack nor tied to the production infrastructure environment. In this case each application is built with its own set of container libraries, which enables a consistent environment to be carried from development into production. Moreover, containers are vastly more transparent than virtual machines because they facilitate monitoring and management. This is clear to see when the container process lifecycles are managed by the infrastructure rather than hidden by a process supervisor inside the container. Now managing the applications becomes the same as managing the containers, while applications have gained tremendous portability. Kubernetes extends this idea of app+container all the way to where the hosts can be nodes of a cluster.
With this introduction, we now list the differences between containers and virtual machines as follows:
1) Containers are a denser form of computing than virtual machines. While a hypervisor may support a few VMs, a single VM may support hundreds of containers.
2) Containers make the one-application-per-server model more formal, with isolation of compute and storage.
3) Containers have very little overhead as compared to VMs and are fast and small.
4) Containers make application creation and deployment easier.
5) Containers such as those in Kubernetes are an improvement over PaaS because images are created at build time and not just at deployment time.
6) Point number 5 implies that containers decouple applications from infrastructure, which separates dev from ops.
7) Containers show environmental consistency across development, testing and production because they run the same on a desktop or in a cluster.
8) Containers raise the level of abstraction and make it application centric.
9) Containers enable loosely coupled, distributed, elastic, liberated micro-services
10) Point number one implies now that containers demonstrate better resource isolation and improved resource utilization.
Kubernetes evolved as an industry effort from the native Linux containers support of the operating system. It can be considered as a step towards a truly container centric development environment.

#codingexercise
Given an array of positive numbers, find the maximum sum of a subsequence with the constraint that no two numbers in the subsequence are adjacent or next-to-next in the array. So 3 2 7 10 should return 13 (sum of 3 and 10), and 3 2 5 10 7 should also return 13 (sum of 3 and 10), since 3 and 5 are next-to-next and cannot both be chosen.

static int GetAltSum(List<int> nums, int start)
{
if (start >= nums.Count) return 0;
int incl_sum = GetAltSum(nums, start + 3) + nums[start]; // start + 3 is used to denote the element after adjacent two
int excl_sum = GetAltSum(nums, start + 1);
return Math.Max(incl_sum, excl_sum);
}
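The recursion above recomputes the same suffixes many times. A bottom-up table avoids that and runs in linear time; the Python below is a sketch of that idea with a made-up function name, using the same rule that a chosen element excludes the next two.

```python
def max_spaced_sum(nums):
    """Best sum when any two chosen elements are at least three positions apart."""
    n = len(nums)
    best = [0] * (n + 3)           # best[i] = best sum within nums[i:], zero past the end
    for i in range(n - 1, -1, -1):
        include = nums[i] + best[i + 3]    # take nums[i], skip the next two
        exclude = best[i + 1]              # leave nums[i] out
        best[i] = max(include, exclude)
    return best[0]


print(max_spaced_sum([3, 2, 7, 10]))
print(max_spaced_sum([3, 2, 5, 10, 7]))
```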

#algorithms
How do we perform independent sampling in a high-dimensional distribution?
Independent samples are those where we choose two different types of items such that the values of one sample do not affect the values of the other.
In small dimensions, this is relatively easy because we can tell the samples apart. For example, to prove the effectiveness of a medicine, we use a test group and a control group as independent samples. The control group does not get the medicine but gets, say, a placebo instead. In higher dimensions, there are many more factors involved than just one factor, the medicine.
In higher-dimensional distributions, we use the Metropolis-Hastings algorithm.
The algorithm generates samples iteratively so that they follow the desired distribution. Each iteration produces one more sample, and the next sample depends only on the current one in a Markov chain like manner; hence it is a Markov chain Monte Carlo method. The number of samples is proportional to the number of iterations. In each iteration it proposes a candidate sample based on the current value. Then it measures how likely this candidate is under the desired distribution relative to the current value. If the candidate is accepted, it is used as the new current value. If it is rejected, the current value is reused.
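The accept-or-reject loop described above can be sketched in Python as follows. This is a minimal random-walk variant with a symmetric Gaussian proposal; the target (five independent standard normals), the step size and all the names are assumptions chosen for illustration.

```python
import math
import random


def metropolis_hastings(log_density, start, steps, step_size=0.5, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = list(start)
    current_log = log_density(current)
    samples = []
    for _ in range(steps):
        # Propose a candidate that depends only on the current value.
        candidate = [x + rng.gauss(0.0, step_size) for x in current]
        candidate_log = log_density(candidate)
        # Accept with probability min(1, p(candidate) / p(current)).
        accept_probability = math.exp(min(0.0, candidate_log - current_log))
        if rng.random() < accept_probability:
            current, current_log = candidate, candidate_log
        samples.append(list(current))      # a rejected step reuses the current value
    return samples


def log_standard_normal(point):
    """Unnormalized log density of independent standard normals (assumed target)."""
    return -0.5 * sum(x * x for x in point)


samples = metropolis_hastings(log_standard_normal, start=[0.0] * 5, steps=20000)
first = [s[0] for s in samples]
mean = sum(first) / len(first)
variance = sum((x - mean) ** 2 for x in first) / len(first)
print(round(mean, 2), round(variance, 2))  # should land near 0 and 1
```

Only the ratio of densities is needed, so the target does not have to be normalized. Consecutive samples are correlated, so in practice one often keeps every k-th sample when near-independent draws are required.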

Text analysis often uses high dimensional vectors so this may be of use there.

It's important to note that the samples need not be considered synthetic. Instead, the algorithm draws samples from the distribution, which is why it is useful.