Thursday, July 28, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than its over-provisioned counterpart racks by handling the constraints in the software stack.
Pelican draws resources from a set of resource domains, each of which is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
We were reading about the IO scheduler. It works on groups and classes rather than on the raw hardware constraints. It reorders requests within each class to reduce spin-up latency. The IO scheduler manages the cost of servicing requests as well as the cost of rebuild traffic. It does this using two mechanisms - request reordering and rate limiting.
If blob sizes are uniform, the reordering bound is expressed in terms of the number of operations; if they are non-uniform, it is expressed in terms of wall-clock time.



We now look at the rate limiting mechanism. The rebuild operation interferes with the regular operations that are queued, so the rate at which each queue is serviced is controlled. The goal is to make rebuild operations complete within an upper time bound, which helps ensure data durability when the annual failure rates of the hardware are known.
If x is the amount of data on the failed disk, t is the average throughput of a single disk, and w is the fraction of resources the scheduler allocates to rebuild, then x / t is the minimum time to transfer the data and the total time to repair after a single disk failure is (x / t) * (1 / w).
This is used to compute the weights for two queues - one for rebuild operations and one for regular operations - and the scheduler then simply does weighted fair queuing between them.
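As a rough illustration, here is a minimal sketch (my own, not from the paper; the parameter names and the target repair time are assumptions) of how the repair-time bound can be turned into the rebuild queue's weight, with the regular queue receiving the remaining 1 - w:

static double RebuildQueueWeight(double failedDataGB, double diskThroughputGBps, double targetRepairSeconds)
{
    // x / t: minimum time to transfer the failed disk's data at full disk throughput
    double minimalRepairSeconds = failedDataGB / diskThroughputGBps;
    // total repair time = (x / t) * (1 / w), so w must be at least (x / t) / targetRepairSeconds
    return Math.Min(1.0, minimalRepairSeconds / targetRepairSeconds);
}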


Building bridges problem


Consider a 2D map with a river flowing from left to right through the center. There are n cities on the southern bank and the same number on the northern bank. We have to connect city i on the northern bank to city i on the southern bank; however, the order of the cities can be arbitrary on both banks. We have to draw as many bridges as possible without any two of them crossing each other.

int GetBridges(List<int> cities)
{
    // cities[i] is the position on the opposite bank of the city at position i on this bank;
    // non-crossing bridges correspond to a longest increasing subsequence.
    int n = cities.Count;
    var best = new int[n];
    for (int i = 0; i < n; i++)
        best[i] = 1;                    // every city by itself gives one bridge
    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            if (cities[j] < cities[i] && best[j] + 1 > best[i])
                best[i] = best[j] + 1;
    return best.Max();                  // requires System.Linq
}

return Math.Min(GetBridges(NorthernCities), GetBridges(SouthernCities));

Find the largest subarray with equal number of zeros and ones
Tuple<int, int> GetLargestSubArray(List<int> nums)
{
    // Treat each 0 as -1 and each 1 as +1; a running sum of zero over [i..j]
    // means the window has an equal number of zeros and ones.
    Tuple<int, int> ret = null;
    int size = 0;
    for (int i = 0; i < nums.Count; i++)
    {
        int sum = (nums[i] == 0) ? -1 : 1;
        for (int j = i + 1; j < nums.Count; j++)
        {
            sum += (nums[j] == 0) ? -1 : 1;
            if (sum == 0 && size < j - i + 1)
            {
                size = j - i + 1;
                ret = Tuple.Create(i, j);   // start and end indices of the best window
            }
        }
    }
    return ret;
}

Wednesday, July 27, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than its over-provisioned counterpart racks by handling the constraints in the software stack.
Pelican draws resources from a set of resource domains, each of which is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
We were reading about the IO scheduler. It works on groups and classes rather than on the raw hardware constraints. It reorders requests within each class to reduce spin-up latency. The IO scheduler manages the cost of servicing requests as well as the cost of rebuild traffic. It does this using two mechanisms - request reordering and rate limiting.
The scheduler reorders requests to batch operations for the same group, amortizing the group spin-up latency over the whole batch. The larger the batch, the smaller the per-request cost, but the greater the queuing delay for some operations. Let us see how this delay is quantified.
Queued operations carry timestamps. An operation r has a timestamp tr and is assigned a reordering counter or. The difference between or and tr gives the absolute change in ordering compared to first-come first-served. There is an upper bound u on the tolerated reordering, and this difference must stay within the bound for every request. The scheduler first examines the queue to find l, the last queued operation in the same group as r. If there is no such operation, r is appended to the tail of the queue and the process completes. Otherwise, the scheduler quantifies the impact of inserting r after l by considering all operations i that follow l. If for any of these operations i the absolute change would exceed the upper bound, r is appended to the tail. Otherwise all the counters oi are incremented by one, and r is inserted after l with its reordering counter reduced by the number of requests it has jumped over.
When blob sizes are uniform, the reordering bound is expressed in terms of the number of operations; when they are non-uniform, it is expressed in terms of wall-clock time.
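A minimal sketch of that insertion check (the queue representation, the Op type and its fields are my own assumptions, not the paper's code):

// Sketch: try to insert request r right behind the last queued operation l of the same
// group, but only if every overtaken operation stays within the reorder bound u.
class Op { public int Group; public int ReorderCounter; }

static void Enqueue(LinkedList<Op> queue, Op r, int u)
{
    LinkedListNode<Op> last = null;
    for (var n = queue.First; n != null; n = n.Next)
        if (n.Value.Group == r.Group) last = n;            // l: last op in r's group

    if (last == null) { queue.AddLast(r); return; }        // no batch to join

    // If pushing any operation after l one place back would exceed u, give up.
    for (var n = last.Next; n != null; n = n.Next)
        if (n.Value.ReorderCounter + 1 > u) { queue.AddLast(r); return; }

    int jumped = 0;
    for (var n = last.Next; n != null; n = n.Next) { n.Value.ReorderCounter++; jumped++; }
    r.ReorderCounter -= jumped;                            // r moved forward by 'jumped' places
    queue.AddAfter(last, r);
}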
#codingexercise
Given a binary array and an integer m, find the positions of the zeroes which, when flipped, create the maximum number of consecutive 1s in the array.
The solution uses a sliding window:
while the number of zeros in the window is no more than m, expand the window to the right and update the zero count
while the zero count exceeds m, shrink the window from the left and update the zero count
track the widest valid window seen along the way.
The zeros to flip are the ones inside that widest window.
void findZeroes(List<int> arr, int n, int m)
{
    int left = 0, right = 0;
    int start = 0, window = 0;
    int count = 0;                          // zeros inside the current window
    while (right < n)
    {
        if (count <= m)
        {
            if (arr[right] == 0)
                count++;
            right++;
        }
        if (count > m)
        {
            if (arr[left] == 0)
                count--;
            left++;
        }
        if (count <= m && right - left > window)
        {
            window = right - left;          // widest window with at most m zeros so far
            start = left;
        }
    }
    for (int i = 0; i < window; i++)
    {
        if (arr[start + i] == 0)
            Console.WriteLine("{0}", start + i);
    }
}

#codingexercise
int GetMin(int[] num, int n)
{
    int min = int.MaxValue;
    for (int i = 0; i < n; i++)
        if (num[i] < min)
            min = num[i];
    return min;
}




Tuesday, July 26, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than its over-provisioned counterpart racks by handling the constraints in the software stack.
Pelican draws resources from a set of resource domains, each of which is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
Groups abstract away the underlying hardware constraints: the IO scheduler in Pelican simply considers which class a group is in rather than the constraints on all the drives. Increasing the number of groups that fully collide with each other also increases the number of mutually independent groups, leading to better throughput and lower latency.
Let us review next how this IO scheduler works. Pelican reorders requests in order to minimize the impact of spin-up latency. A set of groups forms a class. Within a class only one group can be spinning at a time because of the domain conflicts. Groups that do not conflict with it form other classes. In this case, the 48 groups divide into four classes of 12 groups, and each class is independent from the others. An independent instance of the scheduler is run for each class and it services requests only for its class. Reordering is therefore class-level.
Traditional IO reordering attempts to order the requests based on a continuous cost function. In contrast, in Pelican there is a fixed constant cost of spinning up a group which is independent of the current set of disks spinning. The cost function is binary: it is zero if the currently spinning group and the group that must perform the IO are the same, and one otherwise.
The cost of a proposed schedule of IO requests is defined as the sum of the costs of its requests. Only one group in a class can be spun up at any time, and if there are q requests queued, the expected cost is about 0.92q, because the probability of two consecutive operations being for different groups is 0.92; the worst case is one where every request causes a group to spin up. The goal of the IO scheduler is to minimize this cost. In the worst case each spin-up incurs a latency of 8 seconds, so only about 8 requests can be serviced per minute.
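As a small illustration of this binary cost function (a sketch; representing each request by its group id is my own simplification):

// Sketch: the cost of a proposed schedule is the number of spin-ups it causes.
// Each request costs 1 if it targets a group other than the one currently spinning, else 0.
static int ScheduleCost(int currentlySpinningGroup, int[] requestGroups)
{
    int cost = 0;
    int spinning = currentlySpinningGroup;
    foreach (int group in requestGroups)
    {
        if (group != spinning) cost++;    // a spin-up would be needed for this request
        spinning = group;
    }
    return cost;
}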
There is also a cost from failures in addition to normal operation. A disk failure triggers a set of rebuild operations to regenerate the lost stripe-stacks. The rebuild requires activity on the group for a length of time equal to the amount of data on the failed disk divided by the disk throughput. Rebuild traffic cannot simply be prioritized over incoming requests. The IO scheduler addresses these two challenges using two mechanisms - request reordering and rate limiting.

#codingexercise
Given a Binary Search Tree (BST) and a range [min, max], remove all keys which are outside the given range. The modified tree should also be BST.

void Prune(ref Node root, int min, int max)
{
    if (root == null) return;
    Prune(ref root.left, min, max);
    Prune(ref root.right, min, max);
    if (root.data < min)
    {
        // every key in the left subtree is even smaller, so keep only the right child
        root = root.right;
        return;
    }
    if (root.data > max)
    {
        // every key in the right subtree is even larger, so keep only the left child
        root = root.left;
    }
}

Monday, July 25, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than its over-provisioned counterpart racks by handling the constraints in the software stack.
Pelican draws resources from a set of resource domains, each of which is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
Pelican uses a metadata service called the Catalog that is durable and highly available. It maintains bookkeeping information such as the mapping from a blob key to the group and the 18 disks that store the blob. It is updated during write, rebuild and delete requests and looked up during read requests.
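A minimal sketch of the kind of bookkeeping the catalog holds (the type and member names here are illustrative assumptions; the real catalog is a separate durable, highly available service):

// Sketch: an in-memory stand-in for the catalog's mapping from a blob key
// to the group and the disks holding the blob's fragments.
class CatalogEntry
{
    public int GroupId;            // group the blob was written to
    public List<int> DiskIds;      // the disks holding its k + r fragments
}

class Catalog
{
    private readonly Dictionary<string, CatalogEntry> entries = new Dictionary<string, CatalogEntry>();

    public void Upsert(string blobKey, CatalogEntry entry) { entries[blobKey] = entry; }   // write and rebuild paths
    public void Remove(string blobKey) { entries.Remove(blobKey); }                        // delete path
    public CatalogEntry Lookup(string blobKey) { return entries[blobKey]; }                // read path
}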

#codingexercise
Write  a method to delete all the nodes from a binary tree that lie on a path whose sum from root to leaf is less than a given value K.  
Node DelPathSums(Node root, int k, ref int cur)
{
    if (root == null) return null;
    cur += root.data;
    int left = cur;
    int right = cur;
    root.left = DelPathSums(root.left, k, ref left);
    root.right = DelPathSums(root.right, k, ref right);
    cur = Math.Max(left, right);     // best root-to-leaf sum passing through this node
    if (cur < k)
        root = null;                 // every root-to-leaf path through this node sums to less than k
    return root;
}

#DataDaySeattle2016 continued
Here's the link to the file:

https://1drv.ms/w/s!Ashlm-Nw-wnWk3yjDmBBSqrd4PSr

https://1drv.ms/w/s!Ashlm-Nw-wnWk3y1Lm87oOphR2j6

Sunday, July 24, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than its over-provisioned counterpart racks by handling the constraints in the software stack.
Pelican draws resources from a set of resource domains, each of which is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
The data layout algorithm of Pelican divides the disks into l groups. The value of l is chosen to be the largest number of groups of size g such that l x g = 1152 and g >= k + r, where k is the number of data fragments and r is the number of additional fragments containing redundancy information generated with a Cauchy Reed-Solomon erasure code. The k + r fragments together are referred to as a stripe, so each group must be at least as large as k + r.
One of the advantages of using groups is that each group spans multiple failure domains: disks belonging to a group are distributed across the trays and all the backplanes. In addition, groups reduce the time required to recover from a failed disk because all the required data is contained within the same group.
Data is stored in unstructured, immutable chunks called blobs. Blobs vary in size from 200 MB to 1 TB and each blob is uniquely identified by a 20-byte key. Pelican is designed to store blobs that are infrequently accessed. A blob is written to k + r disks in a single, randomly selected group.
To select the disks that will store the blob, the group's disks are first split into six sets, each containing disks from the same backplane failure domain; each set is then ordered by spare capacity and the three disks with the highest spare capacity are selected from it.
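A minimal sketch of that selection step, assuming three disks are taken from each of the six backplane sets to make up the 18 needed for a stripe (the Disk type and property names are illustrative):

// Sketch: split the group's disks by backplane failure domain, then pick the
// three disks with the most spare capacity from each set.
class Disk
{
    public int Id;
    public int Backplane;
    public long SpareCapacityBytes;
}

static List<Disk> SelectDisksForBlob(IEnumerable<Disk> groupDisks)
{
    return groupDisks
        .GroupBy(d => d.Backplane)                                              // the six backplane sets
        .SelectMany(set => set.OrderByDescending(d => d.SpareCapacityBytes).Take(3))
        .ToList();                                                              // requires System.Linq
}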
#codingquestion
int GetMax(int[] nums)
{
    int max = int.MinValue;
    for (int i = 0; i < nums.Length; i++)
        if (nums[i] > max)
            max = nums[i];
    return max;
}

#DataDaySeattleContinued:
https://1drv.ms/w/s!Ashlm-Nw-wnWk3r3LxoLq77q9TSY 

Saturday, July 23, 2016

#codingexercise
Merge two binary search trees
Node Merge(Node tree1, Node tree2)
{
    var in1 = new List<int>();
    InOrder(tree1, in1);
    var in2 = new List<int>();
    InOrder(tree2, in2);
    var result = MergeSorted(in1, in2);
    return ToBST(result, 0, result.Count - 1);
}

void InOrder(Node root, List<int> tree)
{
    if (root == null) return;
    InOrder(root.left, tree);
    tree.Add(root.data);
    InOrder(root.right, tree);
}

List<int> MergeSorted(List<int> a, List<int> b)
{
    // merge two sorted in-order traversals into one sorted list
    var merged = new List<int>();
    int i = 0, j = 0;
    while (i < a.Count && j < b.Count)
        merged.Add(a[i] <= b[j] ? a[i++] : b[j++]);
    while (i < a.Count) merged.Add(a[i++]);
    while (j < b.Count) merged.Add(b[j++]);
    return merged;
}

Node ToBST(List<int> nums, int start, int end)
{
    if (start > end) return null;
    int mid = (start + end) / 2;
    var root = new Node();
    root.data = nums[mid];
    root.left = ToBST(nums, start, mid - 1);
    root.right = ToBST(nums, mid + 1, end);
    return root;
}
Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than its over-provisioned counterpart racks by handling the constraints in the software stack.
Pelican draws resources from a set of resource domains, each of which is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
Pelican reduces the disk layout problem from choosing 18 disks out of 1152 to one involving l^2 pairs of logical groups. Furthermore, if one of the disks in a group collides with another group, all the disks of that group collide with those in the other. This reduces the collision probability from being proportional to g^2 to being proportional to g. This is close to the lower bound on the domain collision probability, because g is the number of cooling domains used by a group (only one disk can be active per cooling domain and disks within a group are domain-disjoint).
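Because groups either fully collide or are fully independent, checking whether two groups conflict only needs one comparison of their resource-domain sets rather than a disk-by-disk check. A small sketch of that idea (the Group type and field names are illustrative):

// Sketch: by the layout's construction, two groups either share their constrained
// resource domains or share none, so one set comparison decides the conflict.
class Group
{
    public HashSet<int> DomainIds;    // resource domains used by the group's disks
}

static bool Collide(Group a, Group b)
{
    return a.DomainIds.Overlaps(b.DomainIds);
}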
l and g can be determined in the following manner: we want to maximize the number l of groups of size g given l x g = 1152 and g >= k + r, so that a group is large enough to store a stripe. In the current implementation the authors use g = 24 rather than g = 18 so that a blob can reside entirely within a single group even after some disks have failed. Stripe-stacks stored on failed drives are regenerated and stored on other drives in the group. This gives l = 48, which is then divided into 4 classes of 12 groups each. Each class is independent of the others.
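As a quick check of those numbers (a sketch of the arithmetic only; the class count is taken from the text):

// Sketch: with groups of 24 disks, the 1152-disk rack yields 48 groups,
// which are divided into 4 independent classes of 12 groups each.
int totalDisks = 1152;
int groupSize = 24;                      // g, chosen larger than k + r = 18
int groups = totalDisks / groupSize;     // l = 48
int classes = 4;
int groupsPerClass = groups / classes;   // 12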
Notice how the disk layout algorithm tries to form groups to improve concurrency. In the paper on local reconstruction codes, there was another attempt to form groups, in that case to find reconstruction coefficients. Here each group holds all the constraints within itself because the disks in a group are domain-disjoint, which means they don't share any common resource domain.

Data day seattle talks:

https://1drv.ms/w/s!Ashlm-Nw-wnWk3Uo1l-ol2OcM4Nl