Friday, July 22, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With these resource constraints and scheduling units, Pelican aims to outperform conventionally overprovisioned racks by doing the work in the software stack.
Pelican draws its resources from a set of resource domains, where each domain supplies a resource to a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
They also have failure domains for disks, frames and chassis.
Data layout conventionally proceeds by choosing 18 disks at random from the full set of 1152 disks, which yields an enormous number of combinations: 1152 choose 18. The authors improve on this naive layout by creating l logical groups of disks and ensuring that the groups share nothing: each disk is a member of exactly one group, and all disks within a group can spin concurrently. This reduces the problem to reasoning about l^2 pairs of logical groups. Furthermore, the authors treat two groups as fully colliding if even one disk in one group collides with a disk in the other.
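As a rough sketch of this group-level reasoning, the l x l collision relation can be precomputed from per-disk resource-domain membership. This is a minimal illustration under assumed types (each disk modeled as a set of resource-domain ids, System.Linq available), not the paper's implementation:

// A minimal sketch: two groups fully collide if any disk of one shares a
// resource domain with any disk of the other.
// groups[g][d] is the set of resource-domain ids that disk d of group g belongs to.
bool[,] BuildGroupCollisions(List<List<HashSet<int>>> groups)
{
    int l = groups.Count;
    var collides = new bool[l, l];
    for (int a = 0; a < l; a++)
        for (int b = 0; b < l; b++)
            collides[a, b] = groups[a].Any(da => groups[b].Any(db => da.Overlaps(db)));
    return collides;
}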

#codingexercise
Given a binary tree, remove all of the half nodes (those that have exactly one child)
void RemoveHalfNodes(ref Node root)
{
    if (root == null) return;
    RemoveHalfNodes(ref root.left);
    RemoveHalfNodes(ref root.right);
    if (root.left == null && root.right == null)
        return;
    if (root.left == null)
    {
        // promote the only child in place of the half node
        var temp = root.right;
        root = temp;
    }
    else if (root.right == null)
    {
        var temp = root.left;
        root = temp;
    }
}


Thursday, July 21, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With these resource constraints and scheduling units, Pelican aims to outperform conventionally overprovisioned racks by doing the work in the software stack.
Pelican draws its resources from a set of resource domains, where each domain supplies a resource to a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks. The authors classify the constraints as hard or soft. Violating a hard constraint causes short-term or long-term failures; for example, power and cooling are hard constraints, and violating one takes disks offline so that they may have to be remounted. Violating a soft constraint does not lead to failure but instead causes performance degradation or inefficient power usage; for example, bandwidth is a soft constraint. The Pelican storage stack uses the following constraints (a small sketch of checking them follows the list):
1) one disk spinning or spinning up per cooling domain
2) two disks spinning or spinning up per power domain
3) shared links in the PCIe interconnectivity hierarchy and
4) disks located back to back in a tray share a vibration domain.
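As a rough illustration of how these constraints could gate a spin-up decision, here is a minimal sketch; the per-domain limits and the domain lookup arrays are assumptions for the example, not Pelican's code (assumes System.Linq):

// Admit a spin-up only if the hard constraints still hold: at most 1 disk
// spinning or spinning up per cooling domain and at most 2 per power domain.
bool CanSpinUp(int disk, int[] coolingDomainOf, int[] powerDomainOf, HashSet<int> active)
{
    int coolingBusy = active.Count(d => coolingDomainOf[d] == coolingDomainOf[disk]);
    int powerBusy = active.Count(d => powerDomainOf[d] == powerDomainOf[disk]);
    return coolingBusy < 1 && powerBusy < 2;
}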


#codingexercise
Given a sorted matrix (row-wise and column-wise), find the kth smallest element.
We keep a frontier of one candidate per row and count up to k.
int GetKth(int[,] m, int row, int col, int k)
{
    int r = 0;
    int c = 0;
    int count = 0;
    // front[i] is the column of the next unvisited element in row i;
    // because every row is sorted, the next smallest element overall must
    // sit at the head of one of these per-row frontiers, so the k smallest
    // elements are consumed as a contiguous staircase-shaped block.
    var front = new int[row];
    while (count < k)
    {
        int min = int.MaxValue;
        int selected = -1;
        for (int i = 0; i < row; i++)
        {
            int j = front[i];
            if (j < col && m[i, j] < min)
            {
                selected = i;
                min = m[i, j];
            }
        }
        if (selected == -1)
            break; // the matrix has fewer than k elements
        r = selected;
        c = front[selected];
        front[selected]++;
        count++;
    }
    return m[r, c];
}
For example, a matrix that is sorted row-wise and column-wise:
2 7 11
3 8 12
9 13 15
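With this matrix and k = 4, the frontier walk above would pick 2, 3, 7 and then 8, so GetKth would return 8 (assuming k is 1-indexed, as in the sketch).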

Wednesday, July 20, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With these resource constraints and scheduling units, Pelican aims to outperform conventionally overprovisioned racks by doing the work in the software stack.
Pelican is not merely a power-consumption optimization system. Previous work, such as massive arrays of idle disks (MAID), achieves power proportionality by assuming there is sufficient power and cooling to have all disks spinning and active when required; hence it provides power savings only. Pelican's peak power is 3.7kW and its average is 2.6kW. Pelican provides both a lower capital cost per disk and a lower running cost, whereas previous work improved only the latter.
Pelican stores unstructured, immutable chunks of data called blobs and has a key-value store interface with write, read and delete operations. Supported blob sizes range from 200MB to 1TB. Each blob is identified by a 20 byte key. Since Pelican serves to archive data, the initial period is spent mostly writing blobs, and all subsequent time is spent on infrequent reads. Reading and repairing constitute much of the normal mode of operation.
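A minimal sketch of what such a key-value interface could look like; the type names and signatures are assumptions for illustration, not Pelican's actual API:

// A blob store keyed by a 20-byte identifier, supporting write, read and delete.
interface IBlobStore
{
    void Write(byte[] key, Stream blob);   // key is expected to be 20 bytes
    Stream Read(byte[] key);               // blobs range from roughly 200MB to 1TB
    void Delete(byte[] key);
}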
Pelican serves as the lower tier in a cloud based storage system. Data is already staged in the higher layers, and Pelican controls the transfer time into its system, which means that writes can be scheduled for low-load periods. Pelican therefore focuses on read-dominated workloads.
The software storage stack plays the innovative role of reducing the impact on performance from the restrictions on hardware resources. Pelican uses resources from a set of resource domains. Each resource domain supplies its resource to a subset of disks simultaneously. Disks in the same resource domain are said to be domain-conflicting; two disks that share no common resource domain are domain-disjoint. Pelican proposes data layout and IO scheduling algorithms over these resource domains.
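As a rough sketch of what scheduling over these domains might involve, one round could greedily pick requests whose target disks are mutually domain-disjoint. The types and the greedy strategy are assumptions for illustration, not Pelican's scheduler:

// Greedily select disks whose resource domains do not overlap with the
// domains already claimed in this round.
List<int> PickConflictFreeBatch(List<int> pendingDisks, Dictionary<int, HashSet<int>> domainsOf)
{
    var used = new HashSet<int>();      // resource domains claimed so far
    var batch = new List<int>();
    foreach (var disk in pendingDisks)
    {
        if (!domainsOf[disk].Overlaps(used))
        {
            batch.Add(disk);
            used.UnionWith(domainsOf[disk]);
        }
    }
    return batch;
}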
Given a tree, implement a function which replaces a node’s value with the sum of all its children’s values, considering only those children whose value is more than the node’s value.
void ToChildrenSumTree(ref node root)
{
    int left = 0;
    int right = 0;
    if (root == null || (root.left == null && root.right == null)) return;
    ToChildrenSumTree(ref root.left);
    ToChildrenSumTree(ref root.right);
    if (root.left != null) left = root.left.data;
    if (root.right != null) right = root.right.data;
    int sum = 0;
    if (left > root.data) sum += left;
    if (right > root.data) sum += right;
    int delta = sum - root.data;
    if (delta >= 0)
        root.data = root.data + delta;
    else
        increment(ref root, 0 - delta);
}

Given a sorted matrix (row-wise and column-wise), find the kth smallest element.
We walk the matrix keeping one frontier index per row and count up to k.
int GetKth(int[,] m, int row, int col, int k)
{
    // repeatedly take the smallest element at the head of any row's frontier
    // until k elements have been consumed
    var front = new int[row];
    int r = 0, c = 0;
    for (int count = 0; count < k; count++)
    {
        int selected = -1;
        for (int i = 0; i < row; i++)
            if (front[i] < col && (selected == -1 || m[i, front[i]] < m[selected, front[selected]]))
                selected = i;
        if (selected == -1) break;
        r = selected;
        c = front[selected]++;
    }
    return m[r, c];
}

Tuesday, July 19, 2016

#codingexercise
In the previous exercise of 15-puzzle, we talked about finding inversion counts.
int GetInversions(int[,] board, int N)
{
    int count = 0;
    // flatten the board row by row; 0 denotes the blank tile and is skipped
    var serialized = board.Cast<int>().ToList();
    for (int i = 0; i < N * N - 1; i++)
        for (int j = i + 1; j < N * N; j++)
            if (serialized[i] != 0 && serialized[j] != 0 && serialized[i] > serialized[j])
                count++;
    return count;
}

int findXPosition(int[,] board, int N)
{
    // return the row of the blank (0), counted from the bottom and 1-indexed
    for (int i = N - 1; i >= 0; i--)
        for (int j = N - 1; j >= 0; j--)
            if (board[i, j] == 0)
                return N - i;
    return -1;
}

We were talking about docker usage of clusters. A simple way to take advantage of a cluster is to have a file system that stretches over the cluster. Notably, OneFS is an example of such a file system that can expand to a large number of nodes for very large storage. Using a cluster instead of a single host lets docker have resources far beyond the capability of one machine. In this model, docker can continue to operate out of a single host but have a file system that spans the cluster. This may require little or no change to how docker operates on a single host.

Another way to take advantage of the cluster is to not be host bound but cluster bound, where an instance of docker runs on each node and there is a translation or mapping of the workload on the cluster to target a specific node in the cluster. This calls for a dotted naming convention where we can scope down progressively to the specific instance. This too may not require changes to the way docker runs on a single host, but rather additional capabilities that maintain mapping information between apps and docker instances.

In terms of compute improvements as opposed to storage, the runtime interfaces docker uses, such as libcontainer, libvirt, lxc and systemd-nspawn, which interface with Linux, could possibly be replaced by something that targets similar functionality on a cluster, so that replacing a host with a cluster does not affect the operations of docker.

Moreover, a cluster can improve batch processing via scatter and gather operations, as opposed to serial single-processor techniques. This provides additional leverage for task-parallel operations in docker on a cluster.

Given a tree, implement a function which replaces a node’s value with the sum of all its children’s values, considering only those children whose value is less than the node’s value.
void ToChildrenSumTree(ref node root)
{
    int left = 0;
    int right = 0;
    if (root == null || (root.left == null && root.right == null)) return;
    ToChildrenSumTree(ref root.left);
    ToChildrenSumTree(ref root.right);
    if (root.left != null) left = root.left.data;
    if (root.right != null) right = root.right.data;
    int sum = 0;
    if (left < root.data) sum += left;
    if (right < root.data) sum += right;
    int delta = sum - root.data;
    if (delta >= 0)
        root.data = root.data + delta;
    else
        increment(ref root, 0 - delta);
}

Monday, July 18, 2016

#codingexercise


Print all possible words from phone digits. An n digit number is given, and all words that are possible by pressing these numbers must be listed.


var hashTable = new List<string>(){"", "", "abc", "def", "ghi", "jkl",
                                   "mno", "pqrs", "tuv", "wxyz"};

void combineKeypad(List<int> number, int n, int current, int level, ref List<char> candidate)
{
    if (current == n) {
        // a full combination has been built; print only the filled portion
        Console.WriteLine(new string(candidate.ToArray(), 0, level));
        return;
    }
    // digits 0 and 1 map to no letters, so skip over them
    if (number[current] == 0 || number[current] == 1) {
        combineKeypad(number, n, current + 1, level, ref candidate);
        return;
    }
    for (int i = 0; i < hashTable[number[current]].Length; i++)
    {
        candidate[level] = hashTable[number[current]][i];
        combineKeypad(number, n, current + 1, level + 1, ref candidate);
        candidate[level] = '\0';
    }
}

var result = new List<char>(new char[n]);
combineKeypad(number, n, 0, 0, ref result);


In the previous post, we talked about making docker containers scale out with the resources of the host. One simple approach to this would be for docker to treat the host as a cluster and not a single instance. This way the abstraction layer can assume more power than the original host while enabling additional nodes to be brought up or taken down. The most important ability of a cluster is the scale to which it allows distributed computing. By introducing a layer that redirects the native Linux calls to different nodes of the cluster via agents, docker can continue to roll apps from one host to another and, for that matter, from one cluster to another.
We can also use the concept of a 'lid', in which the applications are not brought down but all the data flows are temporarily cut off. This does not require changes to the network interfaces, only that data flow to a particular docker container is rerouted to another. We call this a lid on the container. By not requiring the applications to be powered down, docker can even switch regions without any changes to all that it supports. The mapping information between services and dependencies is now maintained by docker as an extension to the format in which the applications currently declare their dependencies to docker.
The layer of redirection from the application to the host via docker does not necessarily have to be one that supports clusters. Consider rack compute and storage where a large number of cores and a lot of storage are available. On such a fabric, virtual machines of all sizes can be spun up and shut down. If the redirection layer supports targeting the Linux host dynamically, the application workloads can benefit from increased hardware resources dynamically as the applications are moved by docker from one host to another.
In a sense, this is similar to vMotion of virtual machines but for hosts. The difference is that the physical resources of cpu and storage can grow and shrink with the workload without the applications having to know about it. This is dynamic scale up and down versus scale out.

#codingexercise

Determine if a particular arrangement of 15 puzzle is solvable.
For a given grid of width N, an N*N-1 puzzle is solvable as follows:
bool IsSolvable(int[,] board, int N)
{
    int count = GetInversions(board, N);
    // if the grid width is odd, the puzzle is solvable when the inversion count is even
    if ((N & 1) == 1)
        return (count & 1) == 0;
    // if the grid width is even, the row of the blank (counted from the bottom) also matters
    int pos = findXPosition(board, N);
    if ((pos & 1) == 1)
        return (count & 1) == 0;
    else
        return (count & 1) == 1;
}
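As a quick sanity check, assuming the GetInversions and findXPosition routines above, the solved 15-puzzle configuration should come out as solvable:

// blank (0) in the bottom-right corner, tiles 1..15 in order: zero inversions,
// blank on row 1 from the bottom (odd), so IsSolvable returns true
var solved = new int[4, 4] {
    { 1,  2,  3,  4 },
    { 5,  6,  7,  8 },
    { 9, 10, 11, 12 },
    {13, 14, 15,  0 }
};
Console.WriteLine(IsSolvable(solved, 4));   // expected: True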

Sunday, July 17, 2016

Docker Containers, Solaris Zones and vCenter Virtual machines
There are different levels of isolation and virtualization of applications. With the embrace of cloud computing, applications have become modular, with deep virtualizations such as separating code to run on different virtual machines. While the guest operating system in a virtual machine may provide another level of isolation in terms of resources, such as with cgroups and namespaces, it does not control data partitioning or load balancing. These mechanisms improve portability so that the application can seamlessly run on premises, in a public cloud, in a private cloud, on bare metal, etc. Each level of isolation is encapsulated within another. An application running within a docker container is bound to the single Linux instance it is hosted on. Similarly, by their very nature, operating system zones or namespaces are bound to that machine. A virtual machine can move between datacenters or storage, such as with vMotion, but it is still bound to a single instance.
Most applications are indeed tied to their hosts. If they have dependencies on external services, these are mostly in the form of database connections or api calls. Eventually these api servers or database servers become consolidated, single points of service for their purpose. The microservice model is a good example of such a dedicated single instance. While data can be partitioned, servers seldom are. This is due to the fact that most operations are data intensive rather than computation intensive. Consequently, database servers and api servers grow into clusters and farms to take on the load that they do.
An alternative approach to this model is a distributed container, one that spans multiple servers with their own partitions of data. Consider a service that interacts with a database server for a table. If the pair of database server and api server could be partitioned for this data, then the workload would always stay within limits and performance would improve as the workload is split. This separation into pairs of servers can even be expanded to include clusters. In such a case the applications are written once with arbitrary data in mind but deployed almost always with their own regions to operate in. The api servers can then be differentiated from one another by their endpoints, such as with us-region-1.apiserver.corp.company.com, or with qualifiers such as apiserver.corp.company.com/region1/v1/exports
However there is very little tie-up required between the servers in the example above because most of the data transferred is with the help of connection endpoints across the network. There is hardly any resource sharing between the clusters or the machines in the pool. Another way to look at this is to say that this is a deployment time concern and requires non-traditional partitioning of tables across database servers. Further it requires additional mapping relations between the code and data for each region. This means that there is pairing information between each server and database for its partition of data. This is simple but it still does not harness the power of a container as an elastic loop around several machines.
Clusters solve this problem by allowing additional nodes to be seamlessly brought online or taken offline. However clusters are usually homogeneous, not a combination of api servers and database servers. Imagine a platform-as-a-service in an enterprise where several microservices are hosted, each with its own purpose and not necessarily sharing the same database, database server or database cluster. If this set of services has hot spots in one or two services, it may be easy to bolster those services with web farms or load balancers. However this is not the same as cloning the entire set of api servers for another region. In such a case we are looking at a container for a platform-as-a-service and not an application container alone.
This still does not require any operating system libraries the way docker does for Linux. It is merely a production or deployment time inventory classification convenience and mapping exercise. When docker enables containers spanning multiple machines, it will require some messaging framework between the servers so that the applications can move in and around the participating machines. The trouble is that code is easy to clone but data is sticky. As data grows, the database becomes larger and demands more servicing and application binding.
#codingexercise
Check if a tree satisfies the children-sum property
bool isChildrenSum(node root)
{
    if (root == null || (root.left == null && root.right == null))
        return true;
    int left = 0;
    int right = 0;
    if (root.left != null) left = root.left.data;
    if (root.right != null) right = root.right.data;
    if (root.data == left + right && isChildrenSum(root.left) && isChildrenSum(root.right))
        return true;
    return false;
}
Convert an arbitrary tree into one that satisfies children sum property
void ToChildrenSumTree(ref node root)
{
    int left = 0;
    int right = 0;
    if (root == null || (root.left == null && root.right == null)) return;
    ToChildrenSumTree(ref root.left);
    ToChildrenSumTree(ref root.right);
    if (root.left != null) left = root.left.data;
    if (root.right != null) right = root.right.data;
    int delta = left + right - root.data;
    if (delta >= 0)
        root.data = root.data + delta;
    else
        increment(ref root, 0 - delta);
}

void increment(ref node root, int delta)
{
    // push the shortfall down one child so that the children again sum to the
    // parent; incrementing a child may break its own subtree, so recurse into it
    if (root.left != null) {
        root.left.data += delta;
        increment(ref root.left, delta);
    }
    else if (root.right != null) {
        root.right.data += delta;
        increment(ref root.right, delta);
    }
}
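A small usage sketch, assuming a minimal node class with an int data field and left/right references (the class below is an assumption for illustration; the posts above do not define it):

class node
{
    public int data;
    public node left;
    public node right;
}

// build a tree that violates the children-sum property, repair it, then verify
var root = new node { data = 50 };
root.left = new node { data = 7 };
root.right = new node { data = 2 };
ToChildrenSumTree(ref root);                 // pushes the shortfall of 41 into the left child
Console.WriteLine(isChildrenSum(root));      // expected: True (48 + 2 == 50)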

Saturday, July 16, 2016

In the previous post we discussed snapshots and clones. On vCenter, they are available as files; in fact each virtual machine has its own folder in the datastore. If we wanted to enumerate these files, here's how we would do it:

args = get_args()

# dsbrowser, dsname, datacenter, fulldsname and VMX_PATH are assumed to have been
# set up earlier in the script (the datastore browser, datastore name, datacenter
# name, full datastore name and the list collecting the resulting URLs)
search = vim.HostDatastoreBrowserSearchSpec()
search.matchPattern = "NameSubstring"  # relevant files in a vm folder usually include the vm name
search_ds = dsbrowser.SearchDatastoreSubFolders_Task(dsname, search)
while search_ds.info.state != "success":
    pass

# results = search_ds.info.result
# print results

for rs in search_ds.info.result:
    dsfolder = rs.folderPath
    for f in rs.file:
        try:
            dsfile = f.path
            # folderPath looks like "[datastore] vmname/"; keep only the part after "]"
            vmfold = dsfolder.split("]")
            vmfold = vmfold[1]
            vmfold = vmfold[1:]
            vmxurl = "https://%s/folder/%s%s?dcPath=%s&dsName=%s" % \
                     (args.host, vmfold, dsfile, datacenter, fulldsname)
            VMX_PATH.append(vmxurl)
        except Exception, e:
            print "Caught exception : " + str(e)
            return -1

Generate the sequence
5, 25, 5+25, 125, 125+5, 125+5+25, ... (every number that is a sum of distinct powers of 5, listed in increasing order)
int GetKMagic(int k)
{
    // every magic number is a sum of distinct powers of 5; each new power of 5
    // either stands alone or is added to every smaller magic number generated so
    // far, which keeps the list in increasing order
    var ret = new List<int>();
    int pow = 1;
    while (ret.Count < k)
    {
        pow = pow * 5;
        int prior = ret.Count;
        ret.Add(pow);
        for (int j = 0; j < prior; j++)
            ret.Add(pow + ret[j]);
    }
    return ret[k - 1];
}

Alternatively, a magic number like the above is a sum of powers of 5 taken with coefficients of zero or one, so the nth magic number follows directly from the binary representation of n:
int GetMagicN(int n)
{
    int power = 1;
    int result = 0;
    while (n > 0)
    {
        power = power * 5;
        // each binary digit of n decides whether the corresponding power of 5 is included
        if ((n & 1) == 1)
            result += power;
        n >>= 1;
    }
    return result;
}
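For example, n = 5 is binary 101, which selects 5^1 and 5^3, so GetMagicN(5) returns 5 + 125 = 130; this matches the fifth element of the sequence 5, 25, 30, 125, 130 generated above.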