Tuesday, July 19, 2016

#codingexercise
In the previous exercise of 15-puzzle, we talked about finding inversion counts.
int GetInversions(int[,] board, int N)
{
    int count = 0;
    var serialized = board.Cast<int>().ToList(); // flatten in row-major order
    for (int i = 0; i < N*N - 1; i++)
        for (int j = i + 1; j < N*N; j++)
            if (serialized[i] != 0 && serialized[j] != 0 && serialized[i] > serialized[j])
                count++;
    return count;
}

// returns the row of the blank (0) counted from the bottom, 1-indexed
int findXPosition(int[,] board, int N)
{
    for (int i = N-1; i >= 0; i--)
        for (int j = N-1; j >= 0; j--)
            if (board[i, j] == 0)
                return N - i;
    return -1;
}

We were talking about Docker's usage of clusters. A simple way to take advantage of a cluster is to have a file system that stretches over it. OneFS is a notable example of such a file system: it can expand to a large number of nodes for very large storage. Using a cluster instead of a single host lets Docker have resources far beyond the capability of that host. In this model, Docker continues to operate out of a single host but has a file system that spans the cluster. This may require little or no change to how Docker operates on a single host.

Another way to take advantage of the cluster is to be cluster bound rather than host bound: an instance of Docker runs on each node, and a translation or mapping layer directs each workload to a specific node in the cluster. This calls for a dotted naming convention in which we can scope down progressively to a specific instance. This too may not require changes to the way Docker runs on a single host, only additional capabilities that maintain the mapping between applications and Docker instances.
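To make the dotted naming idea concrete, here is a minimal sketch of a registry that resolves such names to a node. The class, names and mapping scheme are hypothetical illustrations of the idea, not part of Docker:

```python
# Hypothetical sketch: resolve dotted workload names to a cluster node
# by progressively scoping down from the most specific prefix.
class ClusterRegistry:
    def __init__(self):
        self._routes = {}  # dotted prefix -> node address

    def register(self, dotted_name, node):
        self._routes[dotted_name] = node

    def resolve(self, dotted_name):
        # Try the full name first, then drop trailing components
        # until a registered prefix matches.
        parts = dotted_name.split(".")
        for end in range(len(parts), 0, -1):
            prefix = ".".join(parts[:end])
            if prefix in self._routes:
                return self._routes[prefix]
        return None
```

For example, registering "cluster1.rack2" lets any workload scoped under that prefix, such as "cluster1.rack2.app7", resolve to the same node.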

In terms of compute improvements, as opposed to storage, the runtime interfaces Docker uses to talk to Linux, such as libcontainer, libvirt, LXC and systemd-nspawn, could conceivably be replaced by interfaces that target similar functionality on a cluster, so that substituting a cluster for a host does not affect the operation of Docker.

Moreover, a cluster can improve batch processing via scatter-gather operations, as opposed to serial single-processor techniques. This provides additional leverage for task-parallel operations when Docker runs on a cluster.

Given a tree, implement a function which replaces a node's value with the sum of all its children's values, considering only those children whose value is less than the node's value.
void ToChildrenSumTree(ref node root)
{
    int left = 0;
    int right = 0;
    if (root == null || (root.left == null && root.right == null)) return;
    ToChildrenSumTree(ref root.left);
    ToChildrenSumTree(ref root.right);
    if (root.left != null) left = root.left.data;
    if (root.right != null) right = root.right.data;
    int sum = 0;
    if (left < root.data) sum += left;
    if (right < root.data) sum += right;
    int delta = sum - root.data;
    if (delta >= 0)
        root.data = root.data + delta;
    else
        increment(ref root, -delta);
}
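The same routine can be sketched in Python for cross-reference. The node class and helper names are mine; increment() pushes the deficit down one branch, as in the helper defined in the earlier post:

```python
# A sketch of the thresholded children-sum transformation: a node's value
# becomes the sum of those children whose value is smaller than its own;
# any deficit is pushed down one branch via increment().
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def increment(root, delta):
    # propagate the deficit down one branch so lower levels stay consistent
    if root.left:
        root.left.data += delta
        increment(root.left, delta)
    elif root.right:
        root.right.data += delta
        increment(root.right, delta)

def to_children_sum(root):
    if root is None or (root.left is None and root.right is None):
        return
    to_children_sum(root.left)
    to_children_sum(root.right)
    left = root.left.data if root.left else 0
    right = root.right.data if root.right else 0
    s = (left if left < root.data else 0) + (right if right < root.data else 0)
    delta = s - root.data
    if delta >= 0:
        root.data += delta
    else:
        increment(root, -delta)
```

For example, a root of 3 with children 2 and 2 has both children qualifying, so the root becomes 4; a root of 3 with children 5 and 1 has only the 1 qualifying, so the deficit of 2 is pushed into the left branch instead.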

Monday, July 18, 2016

#codingexercise


Print all possible words from phone digits. Given an n-digit number, list all the words that can be formed by pressing those digits.


var hashTable = new List<string>(){"", "", "abc", "def", "ghi", "jkl",
                                   "mno", "pqrs", "tuv", "wxyz"};

void combineKeypad(List<int> number, int n, int current, int level, ref List<char> candidate)
{
    if (current == n) {
        Console.Write(candidate);
        return;
    }
    if (number[current] == 0 || number[current] == 1) {
        combineKeypad(number, n, current + 1, level, ref candidate);
        return;
    }
    for (int i = 0; i < hashTable[number[current]].Length; i++)
    {
        candidate[level] = hashTable[number[current]][i];
        combineKeypad(number, n, current + 1, level + 1, ref candidate);
        candidate[level] = '\0';
    }
}


combineKeypad(number, n, 0,0, ref result);
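The same enumeration can be written compactly in Python with itertools.product; digits 0 and 1, which carry no letters, are simply skipped. This is a sketch of the same idea, not the C# code above:

```python
# Enumerate keypad words as the Cartesian product of the letter groups
# for each pressed digit; digits with no letters (0 and 1) are skipped.
from itertools import product

KEYS = ["", "", "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]

def keypad_words(digits):
    groups = [KEYS[d] for d in digits if KEYS[d]]
    if not groups:
        return []
    return ["".join(p) for p in product(*groups)]
```

For digits [2, 3] this yields the nine words ad, ae, af, bd, be, bf, cd, ce, cf.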


In the previous post, we talked about making Docker containers scale out with the resources of the host. One simple approach would be for Docker to treat the host as a cluster rather than a single instance. The abstraction layer can then assume more power than the original host while allowing additional nodes to be brought up or taken down. The most important ability of a cluster is the scale to which it can allow distributed computing. By introducing a layer of agents that redirects native Linux calls to different nodes of the cluster, Docker could continue to roll apps from one host to another and, for that matter, from one cluster to another.
We can also use the concept of a 'lid': the applications are not brought down, but all their data flows are temporarily cut off. This requires no changes to the network interfaces, only that data flow to a particular Docker instance be rerouted to another. We call this putting a lid on the container. Because the applications need not be powered down, Docker could even switch regions without any changes to what it supports. The mapping information between services and their dependencies is maintained by Docker as an extension to the format in which applications currently declare their dependencies.
The layer of redirection from the application to the host via docker does not necessarily have to be one that supports clusters. Consider rack compute and storage where there is a large number of cores and storage available. On such a fabric, virtual machines of all sizes can be spun up and shut down. If the redirection layer supports targeting the linux host dynamically, the application workloads can benefit from increased hardware resources dynamically as the applications are moved by docker from one host to another.
In a sense, this is similar to vMotion of virtual machines, but for hosts. The difference is that the physical resources of CPU and storage can grow and shrink with the workload without the applications having to know about it. This is dynamic scale up and down, as opposed to scale out.

#codingexercise

Determine if a particular arrangement of 15 puzzle is solvable.
for a given grid of width N, an N*N-1 puzzle will be solvable as follows:
bool IsSolvable(int[,] board, int N)
{
    int count = GetInversions(board, N);
    // if the grid width is odd, the puzzle is solvable when the inversion count is even
    if (N % 2 == 1)
        return count % 2 == 0;
    // for an even width, also consider the blank's row counted from the bottom
    int pos = findXPosition(board, N);
    if (pos % 2 == 1)
        return count % 2 == 0;
    return count % 2 == 1;
}
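The same rule can be checked end to end with a short, self-contained Python sketch (the function names are mine), combining the inversion count and the blank's row from the bottom:

```python
# 15-puzzle solvability: count inversions over the flattened board
# (ignoring the blank), and for even widths also use the blank's row
# counted from the bottom, 1-indexed.
def solvable(board):
    n = len(board)
    tiles = [v for row in board for v in row]
    inv = sum(1 for i in range(len(tiles)) for j in range(i + 1, len(tiles))
              if tiles[i] and tiles[j] and tiles[i] > tiles[j])
    if n % 2 == 1:
        return inv % 2 == 0
    blank_row = next(n - i for i in range(n) for j in range(n)
                     if board[i][j] == 0)
    if blank_row % 2 == 1:
        return inv % 2 == 0
    return inv % 2 == 1
```

The solved 3x3 board has zero inversions and is solvable; swapping the 7 and 8 tiles yields one inversion, which is unsolvable.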

Sunday, July 17, 2016

Docker Containers, Solaris Zones and vCenter Virtual machines
There are different levels of isolation and virtualization of applications. With the embrace of cloud computing, applications have become modular, with deep virtualizations such as separating code to run on different virtual machines. While the guest operating system in a virtual machine may provide another level of isolation in terms of resources, such as with cgroups and namespaces, these don't control data partitioning or load balancing. They improve portability so that the application can seamlessly run on premises, in a public cloud, in a private cloud, on bare metal, etc. Each level of isolation is encapsulated within another. An application running within a Docker container is bound to the single Linux instance that hosts it. Similarly, by their very nature, operating system zones or namespaces are bound to their machine. A virtual machine can move between datacenters or storage, such as with vMotion, but it is still bound to a single instance.
Most applications are indeed tied to their hosts. If they have dependencies on external services, these are mostly in the form of database connections or API calls. Eventually these API servers or database servers become consolidated, single points of existence for their purpose. The microservice model is a good example of such a dedicated single instance. While data can be partitioned, servers seldom are. This is because most operations are data-intensive rather than computation-intensive. Consequently, database servers and API servers grow into clusters and farms to take on their load.
An alternative approach to this model is a distributed container, one that spans multiple servers with their own partitions of data. Consider a service that interacts with a database server for a table. If the pair of database server and API server could be partitioned along with this data, then the workload would always stay within limits and performance would improve as the workload is split. This separation into pairs of servers can even be expanded to include clusters. In such a case the applications are written once with arbitrary data in mind but deployed almost always with their own regions to operate in. The API servers can then be differentiated from one another by their endpoints, such as us-region-1.apiserver.corp.company.com, or by qualifiers, such as apiserver.corp.company.com/region1/v1/exports.
However there is very little tie-up required between the servers in the example above because most of the data transferred is with the help of connection endpoints across the network. There is hardly any resource sharing between the clusters or the machines in the pool. Another way to look at this is to say that this is a deployment time concern and requires non-traditional partitioning of tables across database servers. Further it requires additional mapping relations between the code and data for each region. This means that there is pairing information between each server and database for its partition of data. This is simple but it still does not harness the power of a container as an elastic loop around several machines.
Clusters solve this problem by allowing additional nodes to be seamlessly brought online or taken offline. However, clusters are usually homogeneous, not a combination of API servers and database servers. Imagine a platform-as-a-service in an enterprise where several microservices are hosted, each with its own purpose and not necessarily sharing the same database, database server or database cluster. If one or two of these services develop hot spots, it may be easy to bolster them with web farms or load balancers. However, this is not the same as cloning the entire set of API servers for another region. In such a case we are looking at a container for a platform-as-a-service, not an application container alone.
This still does not require any operating system libraries the way Docker does for Linux. It's merely a production- or deployment-time inventory classification and mapping exercise. If Docker is to enable containers spanning multiple machines, it will require some messaging framework between the servers so that applications can move among the participating machines. The trouble is that code is easy to clone but data is sticky. As data grows, the database becomes larger and demands more servicing and application binding.
#codingexercise
Check if a tree satisfies children sum property
bool isChildrenSum(node root)
{
    if (root == null || (root.left == null && root.right == null))
        return true;
    int left = 0;
    int right = 0;
    if (root.left != null) left = root.left.data;
    if (root.right != null) right = root.right.data;
    if (root.data == left + right && isChildrenSum(root.left) && isChildrenSum(root.right))
        return true;
    return false;
}
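For cross-reference, the check is a direct recursion in Python as well (the node class and names here are mine):

```python
# Children-sum property: every internal node's value equals the sum of
# its immediate children's values; leaves and empty trees pass trivially.
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def is_children_sum(root):
    if root is None or (root.left is None and root.right is None):
        return True
    left = root.left.data if root.left else 0
    right = root.right.data if root.right else 0
    return (root.data == left + right
            and is_children_sum(root.left)
            and is_children_sum(root.right))
```

A root of 3 with children 1 and 2 satisfies the property; a root of 3 with children 2 and 2 does not.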
Convert an arbitrary tree into one that satisfies children sum property
void ToChildrenSumTree(ref node root)
{
    int left = 0;
    int right = 0;
    if (root == null || (root.left == null && root.right == null)) return;
    ToChildrenSumTree(ref root.left);
    ToChildrenSumTree(ref root.right);
    if (root.left != null) left = root.left.data;
    if (root.right != null) right = root.right.data;
    int delta = left + right - root.data;
    if (delta >= 0)
        root.data = root.data + delta;
    else
        increment(ref root, -delta);
}

// push the deficit down one branch so the property continues to hold below
void increment(ref node root, int delta)
{
    if (root.left != null) {
        root.left.data += delta;
        increment(ref root.left, delta);
    } else if (root.right != null) {
        root.right.data += delta;
        increment(ref root.right, delta);
    }
}

Saturday, July 16, 2016

In the previous post we discussed snapshots and clones. On vCenter, they are available as files; in fact, each virtual machine has its own folder in the datastore. To enumerate these files, here's how we would do it:

args = get_args()
search = vim.HostDatastoreBrowserSearchSpec()
search.matchPattern = "NameSubstring"  # relevant files in a vm folder usually include the vm name
search_ds = dsbrowser.SearchDatastoreSubFolders_Task(dsname, search)
while search_ds.info.state != "success":
    pass  # busy-wait on the task; a real script would sleep between polls

for rs in search_ds.info.result:
    dsfolder = rs.folderPath
    for f in rs.file:
        try:
            dsfile = f.path
            vmfold = dsfolder.split("]")[1][1:]
            vmxurl = "https://%s/folder/%s%s?dcPath=%s&dsName=%s" % \
                     (args.host, vmfold, dsfile, datacenter, fulldsname)
            VMX_PATH.append(vmxurl)
        except Exception as e:
            print("Caught exception: " + str(e))
            return -1

Generate the sequence 5, 25, 5+25, 125, 125+5, 125+5+25, ... in which each term is a sum of distinct powers of 5.
int GetKMagic(int k)
{
    // build the first k terms in order; the i-th term selects powers of 5
    // according to the binary representation of i
    var ret = new List<int>();
    for (int i = 1; i <= k; i++)
    {
        int n = i;
        int power = 1;
        int value = 0;
        while (n > 0)
        {
            power *= 5;
            if ((n & 1) != 0)
                value += power;
            n >>= 1;
        }
        ret.Add(value);
    }
    return ret.Last();
}

Alternatively, each magic number is a sum of powers of 5 with zero-or-one coefficients, selected by the binary representation of n:
int GetMagicN(int n)
{
    int power = 1;
    int result = 0;
    while (n > 0)
    {
        power = power * 5;
        if ((n & 1) != 0)
            result += power;
        n >>= 1;
    }
    return result;
}
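The bit trick translates directly to Python: the bits of n act as selectors over successive powers of 5, so the n-th magic number falls out of a simple shift loop.

```python
# n-th magic number: read n's binary digits from least significant,
# adding 5^(bit position + 1) for each set bit.
def magic(n):
    power, result = 1, 0
    while n:
        power *= 5
        if n & 1:
            result += power
        n >>= 1
    return result
```

So magic(1) is 5, magic(2) is 25, magic(3) is 5+25 = 30, and magic(5) (binary 101) is 5+125 = 130, matching the sequence above.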

Friday, July 15, 2016

Snapshots and Clones of Virtual machines in VMWare 
Snapshots are stored as changes from the parent state. To create a fully independent copy of a virtual machine from a snapshot, we make a clone. Both of them don’t require any downtime of the original Virtual machine.  A lease on the other hand cannot be taken without powering off the virtual machine. A lease is necessary to export the virtual machine. Exporting a virtual machine disk means it can be used elsewhere and not just at the origin datacenter. While the lease is held, the operations that alter the state of the virtual machines covered by the lease are blocked. Examples of blocked operations are PowerOn, Destroy, Migrate, etc. 
A snapshot preserves the state and data of a virtual machine at a specific point in time. The state refers to such things as the virtual machine's powered-on state. The data refers to all the files that make up the virtual machine. Changes from the original virtual machine disk are saved as child disks; hence the snapshots form a chain of disks. When a snapshot is created, the file system can be quiesced. Quiescing is the process of bringing the on-disk data of a physical or virtual computer into a state suitable for backups. This might include flushing dirty buffers from the operating system's in-memory cache to disk, or other higher-level application-specific tasks. Any running process that can alter the state of a virtual machine, particularly one that modifies the information stored on disk, must be quiesced during a backup to guarantee a consistent and usable backup. Quiescing is not necessary for memory snapshots and is primarily used for backups.
A collection of <vm>-<number>.vmdk and <vm>-<number>-delta.vmdk files comprises the snapshot, taken for each virtual disk connected to the virtual machine at the time of the snapshot. These files are referred to as child disks, redo logs or delta links. These child disks can later become parent disks for future child disks. From the original parent disk, each child constitutes a redo log pointing back from the present state of the virtual disk, one step at a time, to the original. A snapshot also includes a <vm>.vmsd file, which is a database of the virtual machine's snapshot information and the primary source of information for the Snapshot Manager, and may include a <vm>Snapshot<number>.vmsn file, which records the current configuration and optionally the active state of the virtual machine. With non-memory snapshots, reverting returns the virtual machine to a turned-off state.
The common operations on snapshots are CreateSnapshot, RemoveSnapshot, RemoveAllSnapshots, RevertToSnapshot and Consolidate. For each operation, if the memory option is specified, the snapshot captures memory state in addition to disk state. Similarly, if the quiesce option is specified, the guest operating system is notified to quiesce. The child disk created with a snapshot is a sparse disk. Sparse disks employ the copy-on-write (COW) mechanism, in which the virtual disk contains no data in places until it is copied there by a write; this optimization saves storage space. Sparse disks are measured in grains, a grain being a block of sectors containing virtual disk data.
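As a toy illustration of the copy-on-write idea (this is not the VMDK format, just the mechanism at grain granularity): a child disk stores only the grains written since the snapshot, and reads fall through to the parent for untouched grains.

```python
# Toy copy-on-write sparse disk: writes land in the child's grain map;
# reads of untouched grains fall through to the parent disk.
class SparseDisk:
    def __init__(self, parent=None):
        self.parent = parent
        self.grains = {}  # grain number -> data

    def write(self, grain, data):
        self.grains[grain] = data  # copy-on-write: stored in this disk only

    def read(self, grain):
        if grain in self.grains:
            return self.grains[grain]
        return self.parent.read(grain) if self.parent else None
```

A chain of such disks behaves like the snapshot chain described above: each child is a redo log over its parent.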
#codingexercise
Write a function rev(i) which reverses the segment of array ar[] from 0 to i (exclusive):
List<int> rev(List<int> ar, int i)
{
    var sub = ar.GetRange(0, i);
    sub.Reverse();
    return sub.Concat(ar.GetRange(i, ar.Count - i)).ToList(); // Union would drop duplicates
}
Write a function revLatter(i) which reverses the segment of array ar[] from i to the last element:
List<int> revLatter(List<int> ar, int i)
{
    var sub = ar.GetRange(i, ar.Count - i);
    sub.Reverse();
    return ar.GetRange(0, i).Concat(sub).ToList();
}
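In Python the same two operations are one-line slice expressions, which makes the intent easy to see:

```python
# Reverse the prefix ar[0:i], keeping the rest in place.
def rev_prefix(ar, i):
    return ar[:i][::-1] + ar[i:]

# Reverse the suffix ar[i:], keeping the prefix in place.
def rev_suffix(ar, i):
    return ar[:i] + ar[i:][::-1]
```

For example, rev_prefix([1, 2, 3, 4], 2) gives [2, 1, 3, 4] and rev_suffix([1, 2, 3, 4], 2) gives [1, 2, 4, 3].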

Thursday, July 14, 2016

#codingexercise
Given an array of strings, find if the given strings can be chained to form a circle. A string X can be put before another string Y in circle if the last character of X is same as first character of Y.
We can use a graph-based depth-first search, or we can try the various permutations of the strings.
Permutations:
void Permute(List<string> a, List<string> b, bool[] used)
{
    if (b.Count == a.Count) { if (isCircle(b)) Console.Write(b.ToString()); return; }
    for (int i = 0; i < a.Count; i++)
    {
        if (used[i]) continue;
        used[i] = true;
        b.Add(a[i]);
        Permute(a, b, used);
        b.RemoveAt(b.Count - 1);
        used[i] = false;
    }
}
bool isCircle(List<string> a)
{
    if (a.Count <= 1) return false;
    for (int i = 1; i < a.Count; i++)
        if (a[i].First() != a[i-1].Last())
            return false;
    return a[a.Count-1].Last() == a[0].First();
}
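The permutation approach fits in a few lines of Python using itertools; it is brute force and only suitable for small inputs, but it states the chaining condition plainly:

```python
# A set of words can be chained into a circle if some ordering links
# each word's last character to the next word's first character,
# wrapping around from the last word back to the first.
from itertools import permutations

def can_chain(words):
    if len(words) <= 1:
        return False
    for order in permutations(words):
        if (all(order[i - 1][-1] == order[i][0] for i in range(1, len(order)))
                and order[-1][-1] == order[0][0]):
            return True
    return False
```

For example, ["abc", "cda"] chains (c links to c, a wraps to a), while ["abc", "def"] does not.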
Yesterday we were reading a paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to perform as well as its overprovisioned counterpart racks by shifting the complexity into the software stack.
We will continue this shortly
Discussion on a service : https://1drv.ms/w/s!Ashlm-Nw-wnWk2xYkvf4o1W0hEJk

#codingexercise
Get the minimum number of dice throws required to reach the last cell from the first cell in the snake and ladder game.
// move[j] == -1 means cell j has no snake or ladder; otherwise move[j] is the destination cell
int getMinMoves(int[] move, int n)
{
    var visited = new bool[n];
    var q = new Queue<Tuple<int, int>>(); // (cell, distance)
    var position = new Tuple<int, int>(0, 0);
    visited[0] = true;
    q.Enqueue(position);
    while (q.Count > 0)
    {
        position = q.Dequeue();
        int v = position.Item1;
        if (v == n - 1) break;
        for (int j = v + 1; j <= v + 6 && j < n; j++)
        {
            if (!visited[j])
            {
                visited[j] = true;
                int dest = (move[j] != -1) ? move[j] : j;
                q.Enqueue(new Tuple<int, int>(dest, position.Item2 + 1));
            }
        }
    }
    return position.Item2;
}
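The same breadth-first search reads naturally in Python with collections.deque; as above, move[j] == -1 means cell j has no snake or ladder:

```python
# BFS over board cells: each dice throw reaches cells v+1..v+6, and a
# snake or ladder at a cell teleports immediately to move[cell].
from collections import deque

def min_throws(move, n):
    visited = [False] * n
    q = deque([(0, 0)])  # (cell, throws so far)
    visited[0] = True
    while q:
        cell, dist = q.popleft()
        if cell == n - 1:
            return dist
        for j in range(cell + 1, min(cell + 6, n - 1) + 1):
            if not visited[j]:
                visited[j] = True
                dest = move[j] if move[j] != -1 else j
                q.append((dest, dist + 1))
    return -1
```

On a 13-cell board with no snakes or ladders the answer is 2 throws (6 then 6); adding a ladder from cell 6 to cell 12 cuts it to 1.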

Wednesday, July 13, 2016

#codingexercise
Find the number of islands of 1s in a binary 2D matrix. This problem is similar to finding connected components in a graph. In a 2D matrix, every cell has eight neighbors, and a depth-first search explores all eight.
int getIslands(int[,] m, int row, int col)
{
    var visited = new bool[row, col];
    int count = 0;
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            if (m[i, j] == 1 && !visited[i, j])
            {
                DFS(m, ref visited, i, j); // marks all eight-connected 1s
                count++;
            }
    return count;
}
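A complete Python version, including the depth-first search over the eight neighbors that the fragment above leaves to a DFS helper:

```python
# Count islands of 1s, where cells connect through any of their
# eight neighbors; each DFS call floods one whole island.
def count_islands(grid):
    rows, cols = len(grid), len(grid[0])
    visited = [[False] * cols for _ in range(rows)]

    def dfs(i, j):
        visited[i][j] = True
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if (0 <= ni < rows and 0 <= nj < cols
                        and grid[ni][nj] == 1 and not visited[ni][nj]):
                    dfs(ni, nj)

    count = 0
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] == 1 and not visited[i][j]:
                dfs(i, j)
                count += 1
    return count
```

On the 5x5 example below there are five eight-connected islands.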

Earlier we started reading a paper titled "Pelican: a building block for exascale cold data storage". Cold data is data that is rarely accessed. Pelican is a rack-scale, hard-disk-based storage unit designed as the basic building block for exabyte-scale storage of cold data. A rack-scale set of computers is a class of hardware with pooled storage, compute and network such that thousands of cores are interconnected at high speed; the rack is increasingly replacing the individual server as the basic building block of modern data centers. At Microsoft, the Rack-scale Computing project aims to redesign hardware, storage, networking and OS stacks in a cross-layer co-design manner so as to improve efficiencies. Pelican aims to be right-provisioned for cold data workloads: provisioning is considered just right when the total cost of ownership is reduced compared to traditional disk-based storage. For example, only 8% of the drives can be concurrently spinning in this building block. This introduces complex resource management, which is handled by Pelican.
This is met in the following way. Resource restrictions are expressed as constraints over the hard drives. The data layout and IO scheduling ensure that these constraints are not violated. Pelican and conventional storage are compared side by side using a cross-validated simulator. Pelican demonstrates high throughput with acceptable latency for cold workloads.
The cost of storing data appears as a sliding scale. The storage in Amazon S3 for example is near the high end. But cold data is also growing phenomenally. Amazon Glacier and Facebook cold data storage are just a few examples where S3 doesn't suffice and cloud scale tiers need to be provisioned. Moreover these systems have to minimize up front capital costs of buying the storage as well as the costs of running the storage. By co-designing hardware and software stacks, Pelican aims to allow the entire rack's resources to be carefully balanced to support cold data workloads.
Pelican can sustain a peak read rate of 1 GB per second per PB of storage (1 GB/PB/sec) with 1 GB blobs. A single rack can support up to 5 PB, so at this rate the entire contents of Pelican can be transferred out every 13 days.
Pelican differs from a conventional rack in its behavior, not its structure. It uses a 52U standard-sized rack, 40 Gbps of full-duplex bandwidth, and 1152 archive-grade hard disks of currently available average size packed into the rack. While a traditional storage rack would be provisioned for peak performance involving all drives, Pelican allows only 96 drives to be spun up; all disks that are not spun up are in standby mode. Pelican nevertheless uses the software stack to give good performance even with these restricted resources, using innovative algorithms for data layout and IO scheduling.
Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
There are 48 groups of such units and each disk is assigned to a group. Files are stored within a single group and the scheduling is done using groups rather than disks.
With the help of resource constraints and scheduling units, Pelican aims to be better than its overprovisioned counterpart racks.