Tuesday, July 26, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to match the performance of conventional over-provisioned racks while doing the work in the software stack.
Pelican draws resources from a set of resource domains, each of which supplies a resource to a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
Groups are used in place of the underlying hardware constraints: the IO scheduler simply considers which class a group is in rather than the constraints on all of its drives. By making groups either fully colliding or fully independent, we increase the number of independent groups, leading to better throughput and lower latency.
Let us review next how this IO scheduler works. Pelican reorders requests in order to minimize the impact of spin-up latency. A set of fully colliding groups forms a class. Within a class only one group can be spinning at a time because of the domain conflicts; the remaining domains are not conflicted and form other classes of groups. In this case, the 48 groups divide into four classes of 12 groups, and each class is independent from the others. An independent instance of the scheduler is run for each class and it services requests only for its class. Reordering is therefore class-level.
Traditional IO reordering attempts to order requests based on a continuous cost function. In contrast, in Pelican there is a fixed constant cost of spinning up a group which is independent of the current set of disks spinning. The cost function is binary: it is zero if the group currently spinning and the group that must perform the IO are the same, and one otherwise.
The cost c of a proposed schedule of IO requests is the sum of the costs for each request. Only one group per class can be spun up at any time, so if there are q requests queued randomly across the 12 groups of a class, the expected cost is 0.92q, because the probability that two consecutive operations address different groups is 11/12 ≈ 0.92; the worst case is one where every request causes a group to spin up. The goal of the IO scheduler is to minimize c. Each spin-up incurs a latency of about 8 seconds, so in the worst case only about 8 requests are serviced per minute.
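To make the binary cost model concrete, here is a minimal sketch of computing the cost c of a proposed schedule; representing the queue as a list of group ids is an assumption for illustration, not the paper's data structure.

int ScheduleCost(List<int> queuedGroupIds, int spinningGroup)
{
    // binary cost: an IO is free if its group is already spinning,
    // otherwise it pays the fixed spin-up cost of one
    int cost = 0;
    foreach (int g in queuedGroupIds)
    {
        if (g != spinningGroup)
        {
            cost++;            // this request forces a group spin-up
            spinningGroup = g; // the newly spun-up group stays active
        }
    }
    return cost;
}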
There is also a cost from failures in addition to the cost of successful operation. A disk failure triggers a set of rebuild operations to regenerate the lost stripe-stacks. This rebuild requires activity on the group for a length of time equal to the data on the failed disk divided by the disk throughput. Rebuild traffic cannot simply be prioritized over incoming requests. The IO scheduler addresses these two challenges using two mechanisms: request reordering and rate limiting.
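The paper's exact rate-limiting policy is not detailed here, so the following is only a sketch under the assumption of a simple proportional-share interleaving of rebuild IO with client IO; all names are hypothetical.

IEnumerable<int> Interleave(Queue<int> clientIo, Queue<int> rebuildIo, int clientPerRebuild)
{
    // admit a fixed number of client requests for every rebuild request so
    // that rebuild makes progress without starving incoming client IO
    while (clientIo.Count > 0 || rebuildIo.Count > 0)
    {
        for (int i = 0; i < clientPerRebuild && clientIo.Count > 0; i++)
            yield return clientIo.Dequeue();
        if (rebuildIo.Count > 0)
            yield return rebuildIo.Dequeue();
    }
}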

#codingexercise
Given a Binary Search Tree (BST) and a range [min, max], remove all keys which are outside the given range. The modified tree should also be a BST.

void Prune(ref Node root, int min, int max)
{
    if (root == null) return;
    Prune(ref root.left, min, max);
    Prune(ref root.right, min, max);
    if (root.data < min)
    {
        // the pruned left subtree is entirely below min, so only the
        // right subtree can still contain keys in range
        root = root.right;
    }
    else if (root.data > max)
    {
        // symmetric case: only the left subtree can contain keys in range
        root = root.left;
    }
}

Monday, July 25, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to match the performance of conventional over-provisioned racks while doing the work in the software stack.
Pelican draws resources from a set of resource domains, each of which supplies a resource to a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
Pelican uses a metadata service called Catalog that is durable and highly available. It maintains bookkeeping information such as the mapping from a blob key to the group and the 18 disks that store the blob's data. It is updated during write, rebuild and delete requests and looked up during read requests.
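A catalog entry can be pictured as a small record keyed by the blob key; the shape below is an illustrative assumption, not the paper's actual schema.

class CatalogEntry
{
    public byte[] BlobKey;  // the 20-byte key identifying the blob
    public int GroupId;     // the single group the blob was written to
    public int[] DiskIds;   // the 18 disks storing the blob's stripe-stacks
}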

#codingexercise
Write a method to delete all the nodes from a binary tree that lie on a path whose sum from root to leaf is less than a given value k.
Node DelPathSums(Node root, int k, ref int cur)
{
    if (root == null) return null;
    cur += root.data;
    int left = cur;
    int right = cur;
    root.left = DelPathSums(root.left, k, ref left);
    root.right = DelPathSums(root.right, k, ref right);
    // cur becomes the maximum root-to-leaf path sum through this node
    cur = Math.Max(left, right);
    if (cur < k)
    {
        // every root-to-leaf path through this node sums to less than k
        root = null;
    }
    return root;
}

#DataDaySeattle2016 continued
https://1drv.ms/w/s!Ashlm-Nw-wnWk3yjDmBBSqrd4PSr
https://1drv.ms/w/s!Ashlm-Nw-wnWk3y1Lm87oOphR2j6

Sunday, July 24, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to match the performance of conventional over-provisioned racks while doing the work in the software stack.
Pelican draws resources from a set of resource domains, each of which supplies a resource to a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
The data layout algorithm of Pelican divides the disks into l groups of size g. The value of l is maximized subject to l x g = 1152 and g >= k + r, where k is the number of data fragments of a blob and r is the number of additional fragments containing redundancy information generated with a Cauchy Reed-Solomon erasure code. The k + r fragments together are referred to as a stripe; it therefore follows that each group must be at least as large as k + r so that it can hold an entire stripe.
One of the advantages of using groups is that a group spans multiple failure domains: disks belonging to a group are distributed across the trays and all the backplanes. In addition, groups reduce the time required to recover from a failed disk because all the required data is contained within the same group.
Data is stored in unstructured immutable chunks called blobs. Blobs vary in size from 200 MB to 1 TB and each blob is uniquely identified by a 20-byte key. Pelican is designed to store blobs which are infrequently accessed. A blob is written to k + r disks in a single randomly selected group.
Disks within the group are selected to store the blob: they are first split into six sets, each containing disks from the same backplane failure domain; each set is then ordered by spare capacity, and the three disks with the highest spare capacity are selected from each set.
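A sketch of that selection using LINQ, assuming a simple Disk record with a backplane id and a spare-capacity figure (the names are illustrative):

class Disk { public int Backplane; public long SpareCapacity; }

List<Disk> SelectDisks(List<Disk> group)
{
    // split the group's disks into their six backplane failure domains and
    // take the three with the most spare capacity from each, giving 18 disks
    return group
        .GroupBy(d => d.Backplane)
        .SelectMany(s => s.OrderByDescending(d => d.SpareCapacity).Take(3))
        .ToList();
}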
#codingquestion
int Max(int[] nums)
{
    int max = int.MinValue;
    for (int i = 0; i < nums.Length; i++)
        if (nums[i] > max)
            max = nums[i];
    return max;
}

#DataDaySeattleContinued:
https://1drv.ms/w/s!Ashlm-Nw-wnWk3r3LxoLq77q9TSY 

Saturday, July 23, 2016

#codingexercise
Merge two binary search trees
Node Merge(Node tree1, Node tree2)
{
    var in1 = new List<int>();
    InOrder(tree1, in1);
    var in2 = new List<int>();
    InOrder(tree2, in2);
    var result = MergeSorted(in1, in2);
    return ToBST(result, 0, result.Count - 1);
}

void InOrder(Node root, List<int> sorted)
{
    if (root == null) return;
    InOrder(root.left, sorted);
    sorted.Add(root.data);
    InOrder(root.right, sorted);
}

// merge two sorted lists into a single sorted list
List<int> MergeSorted(List<int> a, List<int> b)
{
    var merged = new List<int>();
    int i = 0, j = 0;
    while (i < a.Count && j < b.Count)
        merged.Add(a[i] <= b[j] ? a[i++] : b[j++]);
    while (i < a.Count) merged.Add(a[i++]);
    while (j < b.Count) merged.Add(b[j++]);
    return merged;
}

Node ToBST(List<int> nums, int start, int end)
{
    if (start > end) return null;
    int mid = (start + end) / 2;
    var root = new Node();
    root.data = nums[mid];
    root.left = ToBST(nums, start, mid - 1);
    root.right = ToBST(nums, mid + 1, end);
    return root;
}
Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to match the performance of conventional over-provisioned racks while doing the work in the software stack.
Pelican draws resources from a set of resource domains, each of which supplies a resource to a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
Pelican reduces the disk layout problem from one over 1152-choose-18 combinations of disks to one involving l^2 pairs of logical groups. Furthermore, if one of the disks in a group collides with a disk in another group, all the disks of that group are treated as colliding with those in the other. This reduces the collision probability between two groups from being proportional to g^2 to being proportional to g. This is close to the lower bound on the domain collision probability, because g is the number of cooling domains used by a group (only one disk can be active per cooling domain and disks within a group are domain-disjoint).
l and g can be determined in the following manner: we maximize the number l of groups of size g given l x g = 1152 and g >= k + r, so that a group is large enough to store a stripe. In the current implementation the authors use g = 24 rather than g = 18 so that a blob can still reside entirely within a single group even after some disks have failed; stripe-stacks stored on failed drives are regenerated and stored on other drives in the group. This gives l = 48 groups, which divide into 4 classes of 12 groups, and each class is independent of the others.
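The arithmetic behind those numbers is easy to verify in a few lines (the paper's coding parameters are k = 15 and r = 3, so k + r = 18):

int totalDisks = 1152;
int k = 15, r = 3;           // data and redundancy fragments per stripe
int g = 24;                  // group size, chosen above k + r = 18 for headroom
int l = totalDisks / g;      // 1152 / 24 = 48 groups
int groupsPerClass = l / 4;  // 48 / 4 = 12 groups in each of the 4 classes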
Notice how the disk layout algorithm forms groups to improve concurrency. In the paper on local reconstruction codes, there was a similar attempt to form groups in order to find reconstruction coefficients. In Pelican's case each group holds all the constraints within itself, because disks in a group are domain-disjoint, which means they do not share any common resource domain.

Data day seattle talks:

https://1drv.ms/w/s!Ashlm-Nw-wnWk3Uo1l-ol2OcM4Nl

Friday, July 22, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to match the performance of conventional over-provisioned racks while doing the work in the software stack.
Pelican draws resources from a set of resource domains, each of which supplies a resource to a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
There are also failure domains for disks, frames and chassis.
Data layout conventionally proceeds by choosing 18 disks at random from the 1152 disks, which results in a large number of combinations: 1152 choose 18. The authors improve on this naive disk layout by creating l groups of disks and ensuring that the groups share nothing. Each disk is a member of a single group, and all disks within a group can be spinning concurrently. This reduces the complexity to determining l^2 pairs of logical groups. Furthermore, the authors treat all the disks of one group as fully colliding with another group even if only one of its disks collides with a disk of the other group.

#codingexercise
Given a binary tree, remove all of the half nodes (those that have exactly one child).
void RemoveHalfNodes(ref Node root)
{
    if (root == null) return;
    RemoveHalfNodes(ref root.left);
    RemoveHalfNodes(ref root.right);
    // a leaf is kept as it is
    if (root.left == null && root.right == null)
        return;
    // a half node is replaced by its only child
    if (root.left == null)
        root = root.right;
    else if (root.right == null)
        root = root.left;
}


Thursday, July 21, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to match the performance of conventional over-provisioned racks while doing the work in the software stack.
Pelican draws resources from a set of resource domains, each of which supplies a resource to a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks. The authors classify the constraints as hard or soft. Violating a hard constraint causes short- or long-term failure: for example, power and cooling are hard constraints, and violating one takes disks offline so that they may have to be remounted. Violating a soft constraint does not lead to failure but instead causes performance degradation or inefficient power usage; for example, bandwidth is a soft constraint. The Pelican storage stack uses the following constraints (a sketch of checking them follows the list):
1) one disk spinning or spinning up per cooling domain
2) two disks spinning or spinning up per power domain
3) shared links in the PCIe interconnectivity hierarchy and
4) disks located back to back in a tray share a vibration domain.
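As a minimal sketch, assuming each disk records the ids of its cooling and power domains (the Disk type and property names here are illustrative, not from the paper), the first two hard constraints could be checked like this:

bool CanSpinTogether(List<Disk> disks)
{
    // at most one disk spinning or spinning up per cooling domain,
    // at most two per power domain
    var cooling = new Dictionary<int, int>();
    var power = new Dictionary<int, int>();
    foreach (var d in disks)
    {
        cooling[d.CoolingDomain] = cooling.ContainsKey(d.CoolingDomain) ? cooling[d.CoolingDomain] + 1 : 1;
        power[d.PowerDomain] = power.ContainsKey(d.PowerDomain) ? power[d.PowerDomain] + 1 : 1;
        if (cooling[d.CoolingDomain] > 1 || power[d.PowerDomain] > 2)
            return false;
    }
    return true;
}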


#codingexercise
Given a sorted matrix (row-wise and column-wise), find the kth smallest element.
We walk the board and count until k.
int GetKth(int[,] m, int row, int col, int k)
{
    // the k smallest elements form a contiguous block; the next element for
    // inclusion is either to the right of or below this block, so we keep a
    // frontier holding the first unconsumed column index of each row and
    // propagate it one cell at a time
    var front = new int[row];
    int r = 0;
    int c = 0;
    for (int count = 0; count < k; count++)
    {
        int min = int.MaxValue;
        int selected = -1;
        for (int i = 0; i < row; i++)
        {
            if (front[i] < col && m[i, front[i]] < min)
            {
                selected = i;
                min = m[i, front[i]];
            }
        }
        if (selected == -1)
            break; // the matrix has fewer than k elements
        r = selected;
        c = front[selected];
        front[selected]++;
    }
    return m[r, c];
}
For example, in the matrix below the 4th smallest element is 8:
2 7 11
3 8 12
9 13 15

Wednesday, July 20, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to match the performance of conventional over-provisioned racks while doing the work in the software stack.
Pelican is not merely a power consumption optimization system. Previous work such as MAID (massive array of idle disks) achieves power proportionality by assuming there is sufficient power and cooling to have all disks spinning and active when required, and hence provides power savings only. Pelican's peak power is 3.7 kW and its average is 2.6 kW. Pelican provides both a lower capital cost per disk and a lower running cost, while previous work improved only the latter.
Pelican stores unstructured, immutable chunks of data called blobs and has a key-value store interface with write, read and delete operations. Supported blob sizes range from 200 MB to 1 TB, and each blob is identified by a 20-byte key. Since Pelican serves to archive data, the initial time is spent mostly writing blobs and all subsequent time is spent doing infrequent reads; reading and repairing constitute much of the normal mode of operation.
Pelican serves as the lower tier in a cloud-based storage system. Data is already staged in the higher layers, and Pelican controls the transfer time to its system. This means that writes can be scheduled for low-load periods, so Pelican focuses on read-dominated workloads.
The software storage stack plays the innovative role of reducing the impact on performance from the restrictions on hardware resources. Pelican uses resources from a set of resource domains; a resource domain supplies its resource to a subset of the disks simultaneously. Disks in the same resource domain are said to be domain-conflicting, and two disks that share no common resource domain are domain-disjoint. Pelican proposes data layout and IO scheduling algorithms on these resource domains.
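A minimal sketch of the domain-disjoint test, assuming each disk exposes the set of resource-domain ids it belongs to (an illustrative representation):

bool DomainDisjoint(HashSet<int> domainsOfA, HashSet<int> domainsOfB)
{
    // two disks are domain-disjoint when they share no resource domain;
    // otherwise they are domain-conflicting
    foreach (int domain in domainsOfA)
        if (domainsOfB.Contains(domain))
            return false;
    return true;
}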
#codingexercise
Given a tree, implement a function which replaces a node's value with the sum of all its children's values, considering only those children whose value is more than the node's value.
void ToChildrenSumTree(ref Node root)
{
    if (root == null || (root.left == null && root.right == null)) return;
    ToChildrenSumTree(ref root.left);
    ToChildrenSumTree(ref root.right);
    int left = 0;
    int right = 0;
    if (root.left != null) left = root.left.data;
    if (root.right != null) right = root.right.data;
    int sum = 0;
    if (left > root.data) sum += left;
    if (right > root.data) sum += right;
    int delta = sum - root.data;
    if (delta >= 0)
        root.data = root.data + delta;
    else
        // increment is assumed to distribute the deficit 0 - delta into the
        // children so that the children-sum property continues to hold
        increment(ref root, 0 - delta);
}

Given a sorted matrix (row-wise and column-wise), find the kth smallest element.
We walk the board and count until k.
int GetKth(int[,] m, int row, int col, int k)
{
    // compare the candidates on the frontier of the visited block and
    // consume the smallest one k times
    var front = new int[row]; // next unconsumed column per row
    int r = 0, c = 0;
    for (int count = 0; count < k; count++)
    {
        int selected = -1;
        for (int i = 0; i < row; i++)
            if (front[i] < col &&
                (selected == -1 || m[i, front[i]] < m[selected, front[selected]]))
                selected = i;
        if (selected == -1) break; // fewer than k elements
        r = selected;
        c = front[selected];
        front[selected]++;
    }
    return m[r, c];
}