Tuesday, November 8, 2016

Yesterday we were comparing Linux Containers with virtual machines and cited ongoing efforts such as Kubernetes. Windows Azure now claims to support Kubernetes. Native containers are small and fast. They have two characteristics. First, the containers are isolated from each other and from the host in that they even have their own file system, which makes them portable across clouds and OS distributions. Second, the immutable container images can be created at build/release time rather than at the deployment time of the application, since each application doesn't need to be composed with the rest of the application stack nor tied to the production infrastructure environment. Kubernetes extends this idea of app+container all the way so that the hosts can be nodes of a cluster. Kubernetes evolved as an industry effort from the operating system's native support for Linux containers. It can be considered a step towards a truly container-centric development environment. Containers decouple applications from infrastructure, which separates dev from ops. Containers also demonstrate better resource isolation and improved resource utilization.
At this point it is important to differentiate Kubernetes from PaaS. Kubernetes is not a traditional, all-inclusive PaaS. Unlike PaaS offerings that restrict applications, dictate the choice of application frameworks, restrict the supported language runtimes, or distinguish apps from services, Kubernetes aims to support an extremely diverse variety of workloads. As long as an application has been compiled to run in a container, it will work with Kubernetes. PaaS provides databases, message buses, and cluster storage systems, but those can run on Kubernetes. There is also no click-to-deploy service marketplace. Kubernetes does not build user code or deploy it; however, it facilitates CI workflows running on it.
Kubernetes allows users to choose their own logging, monitoring and alerting. Kubernetes also does not require a comprehensive application language or system; it is independent of machine configuration or management. PaaS, in turn, can run on Kubernetes and extend its reach to different clouds.
#codingexercise
Find the modulus of very large numbers represented as (base^exp) % m.
The modulo operation distributes over multiplication and addition:
(a*b)%m = ((a%m)*(b%m))%m
(a+b)%m = ((a%m) + (b%m))%m
int modulus(int base, int exp, int m)
{
    // assume parameter validation (m > 0, exp >= 0)
    // note: result * base may overflow; use a wider type for large m
    base %= m;
    int result = 1;
    while (exp > 0)
    {
        if (exp & 1)
            result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}
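For comparison, here is the same square-and-multiply modular exponentiation as a small Python sketch (my own names); Python's arbitrary-precision integers sidestep the overflow caveat, and the result can be checked against the built-in three-argument pow:

```python
# Square-and-multiply modular exponentiation: process the exponent
# bit by bit, squaring the base at each step.
def modulus(base: int, exp: int, m: int) -> int:
    assert m > 0 and exp >= 0
    base %= m
    result = 1
    while exp > 0:
        if exp & 1:                  # current low bit of the exponent is set
            result = (result * base) % m
        base = (base * base) % m     # square the base each round
        exp >>= 1                    # move to the next bit
    return result

print(modulus(2, 10, 1000))   # 1024 % 1000 = 24
print(modulus(7, 256, 13))    # matches pow(7, 256, 13)
```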

Alternatively, we can compute the power first and take the modulus afterwards (feasible only while the intermediate result fits in the integer type):
int power(int base, uint exp)
{
    int result = 1;
    while (exp > 0)
    {
        if (exp & 1)
            result = result * base;
        base = base * base;
        exp = exp >> 1; // reduce exp by half
    }
    return result;
}
Then we take the modulus of the result digit by digit:
int modulus(int operand, int m)
{
    var digits = operand.ToString();
    int remainder = 0;
    for (int i = 0; i < digits.Length; i++)
        remainder = (remainder * 10 + (digits[i] - '0')) % m;
    return remainder;
}

And now we can also make this recursive by splitting the operand additively, since (a+b)%m = ((a%m) + (b%m))%m and the operand equals operand/2 plus the remaining half:

int modulusWhole(int operand, int m)
{
    if (operand < 10000) // termination condition for small operands
        return modulus(operand, m);
    return (modulusWhole(operand / 2, m) + modulusWhole(operand - operand / 2, m)) % m;
}
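The digit-by-digit reduction generalizes to numbers far too large for any fixed-width integer, since they can be supplied as a decimal string. A hedged Python sketch (the name big_mod is mine, not from the post):

```python
# Digit-by-digit modulus of a very large decimal number given as a
# string, using (a*10 + d) % m == ((a % m)*10 + d) % m at each step.
def big_mod(number: str, m: int) -> int:
    assert m > 0
    remainder = 0
    for ch in number:
        remainder = (remainder * 10 + int(ch)) % m
    return remainder

print(big_mod("123456789123456789", 1000000007))
```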

https://github.com/ravibeta/PythonExamples/blob/master/iamapi.zip 

Monday, November 7, 2016

Yesterday we were discussing Linux Containers and comparing them with virtual machines. Today we look at this comparison in depth.
First, containers are small and fast. Unlike virtual machines, where a hypervisor carves out the VM as a separate machine, possibly with a different operating system than its own, containers share the host kernel and run wherever the operating system is the same. Specifically, containers are best used for packaging applications so they can move around. The old way to deploy applications was to install them on a host using the operating system package manager, which tied the application to the host OS. Instead we could build immutable VM images in order to achieve predictable rollouts and rollbacks, but VMs are heavyweight and non-portable. The new way is to deploy containers based on operating-system-level virtualization rather than hardware virtualization. There are two significant advantages to this. First, the containers are isolated from each other and from the host in that they even have their own file system, which makes them portable across clouds and OS distributions. Second, the immutable container images can be created at build/release time rather than at the deployment time of the application, since each application doesn't need to be composed with the rest of the application stack nor tied to the production infrastructure environment. In this case each application is compiled with its own set of container libraries, which enables a consistent environment to be carried from development into production. Moreover, containers are vastly more transparent than virtual machines because they facilitate monitoring and management. This is clear to see when the container process lifecycles are managed by the infrastructure rather than hidden by a process supervisor inside the container. Now managing the applications becomes the same as managing the containers, while applications have gained tremendous portability. Kubernetes extends this idea of app+container all the way so that the hosts can be nodes of a cluster.
With this introduction, we now list the differences between containers and virtual machines as follows:
1) Containers are a denser form of computing compared to virtual machines. While a hypervisor may support only a few VMs, a single VM may support hundreds of containers.
2) Containers make the one-application-per-server model more formal with isolation of compute and storage.
3) Containers have very little overhead as compared to vms and are fast and small.
4) Containers make application creation and deployment easier
5) Containers such as those orchestrated by Kubernetes are an improvement over PaaS because images are assembled at build time and not just at deployment time.
6) Point 5 implies that containers decouple applications from infrastructure, which separates dev from ops.
7) Containers show Environmental consistency across development, testing and production because they run the same on a desktop or in a cluster.
8) Containers raise the level of abstraction and make it application centric.
9) Containers enable loosely coupled, distributed, elastic, liberated micro-services
10) Point 1 implies that containers demonstrate better resource isolation and improved resource utilization.
Kubernetes evolved as an industry effort from the operating system's native support for Linux containers. It can be considered a step towards a truly container-centric development environment.

#codingexercise
Given an array of positive numbers, find the maximum sum of a subsequence with the constraint that no two numbers in the sequence are adjacent or next-to-next in the array; in other words, the chosen indices must be at least three apart. So 3 2 7 10 should return 13 (sum of 3 and 10), and 3 2 5 10 7 should also return 13 (sum of 3 and 10).

        static int GetAltSum(List<int> nums, int start)
        {
            if (start >= nums.Count) return 0;
            int incl_sum = GetAltSum(nums, start + 3) + nums[start]; // start + 3 is used  to denote the element after adjacent two
            int excl_sum = GetAltSum(nums, start + 1);
            return Math.Max(incl_sum, excl_sum);
        }
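The same recursion can be checked quickly in Python (my own names); taking an element jumps three positions ahead, while skipping it advances by one:

```python
# Recursive max-sum where the chosen indices must be at least 3 apart
# (no adjacent and no next-to-next picks), mirroring GetAltSum above.
def get_alt_sum(nums, start=0):
    if start >= len(nums):
        return 0
    incl = get_alt_sum(nums, start + 3) + nums[start]  # take nums[start], skip the next two
    excl = get_alt_sum(nums, start + 1)                # skip nums[start]
    return max(incl, excl)

print(get_alt_sum([3, 2, 7, 10]))     # 13
print(get_alt_sum([3, 2, 5, 10, 7]))  # 13
```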

#algorithms
How do we perform independent sampling in a high dimensional distribution?
Independent samples are those where we choose two different types of items such that the values of one sample do not affect the values of the other.
In small dimensions this is relatively easy because we can tell the samples apart. For example, to prove the effectiveness of a medicine, we use a test group and a control group as independent samples. The control group does not get the medicine but gets, say, a placebo instead. In higher dimensions there are many more factors involved than just the one factor, the medicine.
In higher dimensional distributions, we use the Metropolis-Hastings algorithm.
The algorithm generates samples iteratively from the desired distribution. In each iteration more samples are produced, where the next sample depends only on the current one, in a Markov-chain-like manner; hence this approach is known as Markov chain Monte Carlo. The number of samples is proportional to the number of iterations. In each iteration the algorithm proposes a sample based on the current value, then measures how close this sample is to the desired distribution. If the sample is accepted, the new sample becomes the current value; if it is rejected, the current value is reused.

Text analysis often uses high dimensional vectors so this may be of use there.

It's important to note that the samples need not be considered synthetic; the algorithm draws actual samples from the target distribution, which is why it comes in useful.
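The accept/reject loop above can be sketched in a few lines of Python. This is an illustrative toy, not from the post: a random-walk Metropolis-Hastings sampler targeting a 1-D standard normal whose density we pretend to know only up to a constant.

```python
import math
import random

def target_log_density(x):
    return -0.5 * x * x          # log of the unnormalized N(0, 1) density

def metropolis_hastings(n_samples, step=1.0, seed=42):
    rng = random.Random(seed)
    current = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)   # symmetric random-walk proposal
        # accept with probability min(1, p(proposal) / p(current))
        if math.log(rng.random()) < target_log_density(proposal) - target_log_density(current):
            current = proposal   # accepted: the proposal becomes the new state
        samples.append(current)  # rejected proposals repeat the old state

    return samples

samples = metropolis_hastings(20000)
print(round(sum(samples) / len(samples), 2))   # sample mean should sit near 0
```

In higher dimensions only the state and the proposal change shape; the accept/reject rule is identical.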

Sunday, November 6, 2016

Today we discuss Containers versus Virtual Machines (vm).

As a cloud provider and consumer, we are aware that to the end user the computing power and storage matter more than how they are provisioned and where. The ease of requesting the compute, the cleanliness of instantiating and deleting the associated resource, and the ability to migrate applications and services far outweigh any consideration of the vendor or the mechanism, just so long as the performance, security and costs are comparable.

Essentially the provisioning of such compute is a result of partitioning some wholesale undivided resource, whether it is at the ESX level or an instance level. Each time a new resource is requested, a partition is made and the resource is provisioned. Partitions do not interact with each other, so we have isolation, and this works well on shared resources. The partitioning policy can depend on the size of the resource being cut, the availability, and the load on the underlying fabric. A private cloud provider, for example, has several regional datacenters to distribute the volume of requests for new virtual machine instances.

Yet it is widely understood in the computing world that we don't aggressively partition lest we shoot ourselves in the foot. Instead we partition resources lazily, as and when the need arises.
As an example, we don't excessively partition our hard disk into many drives up front only to realize later that one drive is filled up while another is relatively unused. This brings up the importance of elasticity of the resource we partition. If we partition a datacenter into several virtual machines, is the datacenter elastic to more and more usage? Yes; for example, we can add ESX blades as appropriate in, say, a VMware cloud. But what about a Linux container? The same is true: if the current resource is exhausted, we provision elsewhere. However, the container is restricted to a VM, and the VM may or may not be elastic.

Furthermore, containers are governed by the quotas and limits of the underlying system to manage the shared resources. One of the reasons Solaris zones suffered in application migration considerations from operating system consumers is that they kept running into application failures due to limits. Both Linux containers and Solaris zones must be understood as operating system features and not as standalone virtualization products.

That said, Linux containers have the advantage that they provide the same environment without the overhead of a separate kernel and hardware. This translates to lightning-fast availability of an isolated shell as well as a reduction in failure points. Moreover, hardware is not necessarily partitioned per virtual machine in any case. However, separation of the kernel does matter significantly in terms of performance, because emulation differs from the real thing, especially when compared to a standalone machine.

In addition, we now have integration with OpenStack for Linux containers. The plugin allows the hosts to be compute nodes while the container takes on the workloads from the user.

#codingexercise
Given an array of positive numbers, find the maximum sum of a subsequence with the constraint that no two numbers in the sequence should be adjacent in the array. So 3 2 7 10 should return 13 (sum of 3 and 10) or 3 2 5 10 7 should return 15 (sum of 3, 5 and 7)

        static int GetSum(List<int> nums, int start)
        {
            if (start >= nums.Count) return 0;
            int incl_sum = GetSum(nums, start + 2) + nums[start];
            int excl_sum = GetSum(nums, start + 1);
            return Math.Max(incl_sum, excl_sum);
        }

        static void Main(string[] args)
        {
            var nums = new List<int>() { 3, 2, 5, 7, 10, 9 };
            var nums1 = new List<int>() { 3, 2, 5, 10, 7 };
            var nums2 = new List<int>() { 3, 2, 7, 10 };
            var nums3 = new List<int>() { 3, 2 };
            var nums4 = new List<int>() { 3 };
            var nums5 = new List<int>() { };
            Console.WriteLine("Max = {0}", GetSum(nums, 0));
            Console.WriteLine("Max = {0}", GetSum(nums1, 0));
            Console.WriteLine("Max = {0}", GetSum(nums2, 0));
            Console.WriteLine("Max = {0}", GetSum(nums3, 0));
            Console.WriteLine("Max = {0}", GetSum(nums4, 0));
            Console.WriteLine("Max = {0}", GetSum(nums5, 0));
       }

Max = 19
Max = 15
Max = 13
Max = 3
Max = 3
Max = 0
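The recursion above recomputes the same suffixes repeatedly and so runs in exponential time. An equivalent O(n) single pass (a sketch with my own names) keeps two running sums, the best total that includes the current element and the best that excludes it:

```python
# Iterative O(n) max-sum with no two adjacent picks: 'incl' is the best
# sum that includes the current element, 'excl' the best that skips it.
def max_non_adjacent_sum(nums):
    incl, excl = 0, 0
    for x in nums:
        incl, excl = excl + x, max(incl, excl)
    return max(incl, excl)

print(max_non_adjacent_sum([3, 2, 5, 7, 10, 9]))  # 19, matching the output above
print(max_non_adjacent_sum([3, 2, 5, 10, 7]))     # 15
```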

Saturday, November 5, 2016

We continue to review the mechanics of Active Directory replication:
We are referring to intrasite replication for the most common use cases of Active Directory domain naming-context item changes. Such AD replication takes place between DCs in pull operations, not pushes. AD controls and manages the transmission of AD data. We discussed what makes up the replication metadata and we saw how it is put to use. We saw how it was used to address conflict resolution and USN rollbacks.
Today we look at read-only domain controllers. These are usually deployed in branch offices where there is no need to have a dedicated and full-strength server, personnel, or protection. There are no write operations and outbound replication is not permitted. Moreover, the objects are replicated without the users' password information. This creates an additional layer of security because if a DC "grows legs and walks away" there is no password information resident on the hard drive.

#codingexercise
We were discussing the celebrity problem yesterday and reducing it from O(N^2) to O(N) complexity.
We used a stack that compared pairs of elements. However, we can do away with the stack and use the two ends of the list instead.
int findCelebrity(int n)
{
    int a = 0;
    int b = n - 1;

    while (a < b)
    {
        if (knows(a, b))
            a++;
        else
            b--;
    }

    for (int i = 0; i < n; i++)
    {
        if ( (i != a) &&
                (knows(a, i) || !knows(i, a)) )
            return -1;
    }
    return a;
}

Remove half-nodes from a binary tree

Node RemoveHalfNodes(Node root)
{
    if (root == null) return null;
    root.left = RemoveHalfNodes(root.left);
    root.right = RemoveHalfNodes(root.right);
    if (root.left == null && root.right == null)
        return root; // leaf: keep as-is
    if (root.left == null)
        return root.right; // half-node: replace with its only child
    if (root.right == null)
        return root.left;
    return root; // full node: keep
}
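A runnable Python sketch of the same idea (names are mine): a half-node, i.e. a node with exactly one child, is spliced out and replaced by that child, and an inorder traversal confirms the result.

```python
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def remove_half_nodes(root):
    if root is None:
        return None
    root.left = remove_half_nodes(root.left)
    root.right = remove_half_nodes(root.right)
    if root.left is None and root.right is None:
        return root              # leaf: keep as-is
    if root.left is None:
        return root.right        # half-node: splice in the only child
    if root.right is None:
        return root.left
    return root                  # full node: keep

def inorder(root):
    return inorder(root.left) + [root.val] + inorder(root.right) if root else []

# 2 -> (7 -> right 1) and (5 -> right 8): both 7 and 5 are half-nodes
root = Node(2, Node(7, None, Node(1)), Node(5, None, Node(8)))
print(inorder(remove_half_nodes(root)))  # [1, 2, 8]
```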

Given an infinite string defined by the function f(x) = x + "0" + f(complement of x), find the k'th bit from the start.
Assuming the complement of x inverts its bits, we just keep track of the position and the iterations.
For example, if k is less than length(x), then the answer is trivial.
Otherwise we count length(x) + 1 + length(complement of x) as the new x for the next iteration.
We iterate until the length spanned is greater than or equal to k and return the bit at that position.
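One observation shortcuts the iteration: since f(x) = x + "0" + f(~x) and f(~x) = ~x + "0" + f(x), the infinite string is just the block x + "0" + ~x + "0" repeated forever, so the k'th bit can be read off directly. A hedged Python sketch (0-indexed k, which is my assumption):

```python
# f(x) expands to x "0" ~x "0" x "0" ~x "0" ..., i.e. it is periodic
# with period 2*(len(x) + 1), so k reduces modulo that period.
def kth_bit(x: str, k: int) -> str:
    n = len(x)
    pos = k % (2 * (n + 1))
    if pos == n or pos == 2 * n + 1:
        return "0"                                  # the inserted "0" separators
    if pos < n:
        return x[pos]                               # inside x itself
    return "1" if x[pos - n - 1] == "0" else "0"    # inside the complement of x

print("".join(kth_bit("101", k) for k in range(16)))  # 1010010010100100
```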

Facebook API to get all friends' id:
  FB.api('/me/friends?limit=200', function(response) {
    console.log('API response', response);
    document.getElementById('results').innerHTML = '';
    var names = response; // the SDK already returns a parsed object
    for (var i = 0; i < names["data"].length; i++) {
      document.getElementById('results').innerHTML += JSON.stringify(names["data"][i]["id"]);
    }
  });
Examples : https://github.com/ravibeta

Friday, November 4, 2016

Today we continue to review the mechanics of Active Directory replication:
We are referring to intrasite replication for the most common use cases of Active Directory domain naming-context item changes. Such AD replication takes place between DCs in pull operations, not pushes. AD controls and manages the transmission of AD data. We discussed what makes up the replication metadata and we saw how it is put to use. We saw how it was used to address conflict resolution and USN rollbacks.
In order to override automatic replication of AD objects, we could do one of the following:
1) invoke the repadmin command on the destination DC to pull the changes
2) invoke the repadmin command on the source DC to push changes to all other DCs
3) issue an Active Directory PowerShell cmdlet to invoke sync on a per-object basis

If we have a read-only domain controller, then the propagation is one-way. In such a case, we have some additional considerations.

#puzzle
Yesterday we were discussing a 'celebrity' problem.
In a party of N people, only one person is known to everyone. Such a person may be present in the party; if yes, she doesn't know anyone in the party. We can only ask questions like "does A know B?". Find the celebrity in the minimum number of questions.
We used a graph to solve this problem; however, it had O(N^2) complexity.
Today we use a stack to improve it.
Using a stack to keep track of the candidates, we exclude possibilities until we are left with one or none. We do so as follows:
Push all the persons onto a stack.
Pop two of them as A and B
    if A knows B then A is not a celebrity
    else B is not a celebrity
Push the remainder on the stack
Repeat the above until one or none remains on the stack.
Check that the person remaining on the stack doesn't know anyone else and is known by everyone. This last step is required because there may be no celebrity at all.
int findCelebrity(List<int> persons)
{
    var stk = new Stack<int>(persons);
    if (stk.Count == 0) return -1;
    if (stk.Count == 1) return stk.Peek();

    int A = stk.Pop();
    int B = stk.Pop();

    while (stk.Count > 0)
    {
        if (knows(A, B))
            A = stk.Pop(); // A knows B, so A cannot be the celebrity
        else
            B = stk.Pop(); // A doesn't know B, so B cannot be the celebrity
    }

    // one of A and B remains; eliminate one more with a single question
    int C = knows(A, B) ? B : A;

    // verify: C must know nobody and everybody must know C
    for (int i = 0; i < persons.Count; i++)
    {
        if (i != C && (knows(C, i) || !knows(i, C)))
            return -1;
    }
    return C;
}


Thursday, November 3, 2016

Today we continue to review the mechanics of Active Directory replication:
We are referring to intrasite replication for the most common use cases of Active Directory domain naming-context item changes. Such AD replication takes place between DCs in pull operations, not pushes. AD controls and manages the transmission of AD data. We discussed what makes up the replication metadata and we saw how it is put to use. We saw how it was used to address conflict resolution.
Today we will see one more mitigation, this time for USN rollback. It has to do with object deletions. Rather than simply removing the object, it is marked deleted so that the deletion is replicated to other DCs, which now know that the object is deleted. Also, the original object does not need to be passed around, since it may be large. Only a shell with the isDeleted attribute, aka a tombstone, is sent across. USN rollback occurs when a domain controller is so completely out of date that it has "missed" one or more tombstoned objects and thus is unable to bring its local copy of the Active Directory up to date. One way to avoid this is to have the DC recognize that it has been out of service longer than the tombstone lifetime and not come back online. Instead it is rebuilt from scratch.
Rebuilding, in general, is the right approach compared to repairing in many cluster-based technologies. In theory, it improves the mean time to data loss: the protection overhead is reduced, space is better utilized, and there is no net addition to management overhead.
The second kind of USN rollback occurs when a DC has been restored by inappropriate means and has not kept current with the other DCs. This is referred to as going back in time because the restore was not AD-aware. This is mitigated with event logging, where the text of the log and the necessary steps to recover are available.

In order to override automatic replication of AD objects, we could do one of the following:
1) invoke the repadmin command on the destination DC to pull the changes
2) invoke the repadmin command on the source DC to push changes to all other DCs
3) issue an Active Directory PowerShell cmdlet to invoke sync on a per-object basis

#puzzle
In a party of N people, only one person is known to everyone. Such a person may be present in the party; if yes, she doesn't know anyone in the party. We can only ask questions like "does A know B?". Find the celebrity in the minimum number of questions.
This problem is better visualized with a directed graph where there is only one vertex that has incoming edges from all N-1 other vertices and no outbound edges to anybody else. To construct the graph, we need to form the adjacency matrix, which is O(N^2). Then we can traverse it to see if any vertex satisfies the criteria.

int FindCelebrity(int[,] m, int row, int col)
{
    Debug.Assert(row == col);
    for (int i = 0; i < row; i++)
    {
        int known = 0;
        for (int j = 0; j < col; j++)
            known += m[i, j]; // outgoing edges: i knows j
        if (known != 0) continue; // a celebrity knows nobody
        for (int k = 0; k < row; k++)
            if (k != i && m[k, i] > 0)
                known += m[k, i]; // incoming edges: k knows i
        if (known == row - 1)
        {
            Console.WriteLine("Celebrity found at {0}", i);
            return i;
        }
    }
    return -1;
}

Wednesday, November 2, 2016

Today we continue to review the mechanics of Active Directory replication:
We are referring to intrasite replication for the most common use cases of Active Directory domain naming-context item changes. Such AD replication takes place between DCs in pull operations, not pushes. AD controls and manages the transmission of AD data with replication metadata. We saw that this metadata included:
1) the USN, which versions changes on the local machine
2) the HWMV, which informs whether a change is originating or replicating
3) the UTDV, which prevents repeated processing of replicated changes and helps with propagation dampening
The replicated change therefore includes the following:
the GUID of the DC that is replicating the change
the USN from the DC that is replicating the change
the GUID of the DC that originated the change
the USN from the DC that originated the change
Based on the UTDV table, the DC determines whether it needs to make the change or simply update the HWMV entry.
The DC also has to perform conflict resolution. There are three types of conflict that can occur in an Active Directory environment, each of which uses a different method of conflict resolution:
1) the same object is updated at two originating servers
2) a name collision, when two objects are created with the same name
3) one DC creates an object inside a container while another DC deletes that container
To address the first two types of conflict, the replication metadata includes two things: a version ID for the attribute and a timestamp. The third type of conflict is managed by moving the object to a specially designated container called the orphaned container.
We were comparing Active Directory replication to database replication and suggesting that the snapshot isolation method works very well in this regard. Just as databases can be distributed with namespaces and versions, we have a similar comparison with Active Directory.
Generally a database detects conflicts in one of the following ways:
1) current versus old : the receiving site detects an update conflict if there is any difference between the old values of the replicated row and the current values of the same row at the receiving site
2) duplicate : the receiving site detects a uniqueness conflict if a uniqueness constraint violation occurs during an insert or an update of the replicated data
3) missing : the receiving site detects a delete conflict if it cannot find a row for an update or delete statement because the primary key of the row does not exist.
Question: can the USN, HWMV and UTDV be simplified? Hint: think globally unique version numbers, timestamps, action names and idempotent operations.
Courtesy : Laura Hunter, Windows Server Administration

#codingexercise
Convert an infix expression to postfix expression
An infix expression looks like this: a + b*c + d
The corresponding postfix expression looks like this: abc*+d+
Operators have standard precedence
As we can see, the operators are reordered; consequently a stack comes in useful.
The method proceeds like this
For every operand, maintain its sequence in the result
For every ( push it on the stack
For every ), pop elements from the stack into the result until the matching ( is found (or report an invalid expression); the popped operators are added to the resulting expression.
For every operator, pop from the stack into the result while the top of the stack has precedence higher than or equal to the current operator, then push the current operator.
That's it. Since we only push operators and parentheses onto the stack, the sequence of operands is maintained.
The opening and closing parentheses are not included in the resulting expression because in postfix notation each operator directly follows its operands, making the evaluation order explicit.
For example,
(a*b)^c => ab*c^
a*(b^c) => abc^*
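The steps above can be sketched as a compact shunting-yard routine in Python (my own code; for simplicity it treats every operator, including ^, as left-associative): operands pass straight through, ( is pushed, ) pops back to its match, and an operator first pops anything of higher or equal precedence.

```python
PREC = {'+': 1, '-': 1, '*': 2, '/': 2, '^': 3}

def infix_to_postfix(expr: str) -> str:
    out, stack = [], []
    for ch in expr:
        if ch.isalnum():
            out.append(ch)                    # operands keep their order
        elif ch == '(':
            stack.append(ch)
        elif ch == ')':
            while stack and stack[-1] != '(':
                out.append(stack.pop())       # pop back to the matching '('
            stack.pop()                       # discard the '(' itself
        elif ch in PREC:
            while stack and stack[-1] != '(' and PREC[stack[-1]] >= PREC[ch]:
                out.append(stack.pop())       # higher/equal precedence first
            stack.append(ch)
    while stack:
        out.append(stack.pop())               # flush remaining operators
    return ''.join(out)

print(infix_to_postfix('(a*b)^c'))   # ab*c^
print(infix_to_postfix('a*(b^c)'))   # abc^*
```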