Saturday, August 6, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibration and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than an over-provisioned counterpart rack by handling the constraints in the software stack.
Pelican draws its resources from a set of resource domains, where each domain is the subset of disks that share a resource. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks. We were discussing the evaluation of Pelican and the use of a Poisson process to generate a range of workloads. Most of these workloads focus on read requests: because write requests are offloaded to other storage tiers, we assume that servicing read requests is the key performance requirement for Pelican. The read requests are randomly distributed across all the blobs stored in the rack, and each request operates on a blob of size 1 GB. The evaluation used the following metrics:
Completion time - this is the end-to-end time between the request initiation by the client and the last byte of data received. It captures the queuing delay, spin-up latency, and the time to read and transfer the data. The completion time increased linearly with the number of requests for both the rack and the simulator. This was measured after the simulator and the rack were cross-validated, with the simulator configured to match the prototype Pelican rack hardware. It shows that both the sequential file read and the disk spin-ups take a toll on the completion time. Further, the disks are spun up only one at a time, with only one disk spinning per tray rather than two. The completion time increased by about a factor of three for a 64-fold increase in the number of requests, which covers the most practical range of workloads that Pelican would typically serve. It was also clear that the simulator could be used for extrapolation with just as much reliability as the Pelican rack hardware. Consequently, the completion time graph gave satisfying results in terms of predictability, and the linear plot made the distribution of workload and the effectiveness of Pelican clear.
#codingexercise
You are given partial information of a binary tree: for each node, its id, and the sum of the ids of its children. This sum is 0 if the node is a leaf, and if the node has just one child, then it is equal to the child's id. Given this, you are asked to output ids of all possible roots of such a tree, given that such a tree does exist.
int GetRootId(List<int> id, List<int> sum)
{
    assert(id.Count == sum.Count);
    assert(id.All(x => x > 0));
    assert(sum.All(x => x >= 0));   // leaves have a child sum of 0
    int root = 0;
    // Every id except the root appears exactly once in some node's
    // child sum, so the difference of the totals is the root's id.
    for (int i = 0; i < id.Count; i++)
    {
        root += id[i] - sum[i];
    }
    return root;
}
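To see why this works: every id except the root's appears exactly once among the child sums, so subtracting the two totals isolates the root. A small Python sketch with a made-up tree (the ids are purely illustrative) checks the identity:

```python
# Hypothetical tree, used only to illustrate the identity:
#        5
#       / \
#      3   8
#     /
#    4
ids = [5, 3, 8, 4]       # node ids
sums = [11, 4, 0, 0]     # sum of children's ids per node; 0 for leaves

# Every id except the root's appears exactly once among the child sums,
# so subtracting the two totals isolates the root.
root = sum(ids) - sum(sums)
print(root)  # prints 5
```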

Return the square root of a number using Newton's iteration:
double SqRt(int n)
{
    assert(n >= 0);
    if (n == 0) return 0;
    double g = n / 2.0;          // initial guess; avoid integer division
    double y = n / g;
    while (abs(y - g) > 0.001)
    {
        g = (g + n / g) / 2;     // Newton / Babylonian update
        y = n / g;
    }
    return y;
}
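The same Babylonian iteration g = (g + n/g)/2 in Python, shown purely as a sanity check of the convergence (the function name and tolerance are illustrative):

```python
def sqrt_newton(n, eps=0.001):
    """Babylonian / Newton iteration: g <- (g + n/g) / 2."""
    if n == 0:
        return 0.0
    g = n / 2.0 if n > 1 else 1.0   # any positive starting guess works
    y = n / g
    while abs(y - g) > eps:
        g = (g + n / g) / 2.0
        y = n / g
    return y
```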

Friday, August 5, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibration and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than an over-provisioned counterpart rack by handling the constraints in the software stack.
Pelican draws its resources from a set of resource domains, where each domain is the subset of disks that share a resource. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks. We were discussing the evaluation of Pelican.
Pelican was measured under different load regimes. The disk throughput was measured using a quiescent system. Seeks are simulated by including a constant latency of 4.2 ms for all disk accesses and it was found that seek latency has negligible impact on performance. It was also found that the latency of spinning the disk up and mounting the volume heavily dominate over seek time. The distribution for mount delays was widely affected by the load on the system. 
Pelican was compared with a system denoted FP that was subject to no resource constraints: the disks were never spun down, but the same topology was used. Pelican partitions the disks into 48 groups, whereas in FP all 48 groups could be accessed concurrently. Reads are done at full throughput because the stripe stacks are stored contiguously on disk.
Pelican supports a wide variety of workloads. A full range of workloads is covered by parameterizing the workload characteristics and using a Poisson process with an average arrival rate of 1/lambda, varying lambda from 0.125 to 16. The Poisson process lets us model random events along the real line with a single parameter. The results are reported against the average request rate per second rather than the lambda value, because this makes them clearer. The workloads are all such that Pelican can focus on servicing read requests, which are randomly distributed across all the blobs stored in the rack.
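As a sketch of how such a workload can be generated: a Poisson process with average arrival rate r has independent exponential inter-arrival gaps with mean 1/r. A minimal Python generator (the function name and parameter values are illustrative, not from the paper):

```python
import random

def poisson_arrivals(rate, horizon):
    """Request arrival times on [0, horizon) for a Poisson process with
    the given average arrival rate (requests/second): the inter-arrival
    gaps are i.i.d. exponential with mean 1/rate."""
    t, times = 0.0, []
    while True:
        t += random.expovariate(rate)   # exponential gap, mean 1/rate
        if t >= horizon:
            return times
        times.append(t)

random.seed(0)
arrivals = poisson_arrivals(rate=8.0, horizon=3600.0)
# On average we expect about rate * horizon = 28800 arrivals
print(len(arrivals))
```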

#codingexercise

Check if a given array is a pre-order traversal of a binary search tree

bool IsPre(List<int> nums)
{
    var stk = new Stack<int>();
    int root = INT_MIN;   // lower bound for the elements still to come
    for (int i = 0; i < nums.Count; i++)
    {
        if (nums[i] < root)
            return false;
        // Entering a right subtree: pop the smaller ancestors;
        // the last one popped becomes the new lower bound.
        while (!stk.empty() && stk.top() < nums[i])
        {
            root = stk.top();
            stk.pop();
        }
        stk.push(nums[i]);
    }
    return true;
}
For example, 50 30 35 70 80 is a valid preorder (root 50, left child 30 with right child 35, right child 70 with right child 80) and passes the check; 35 30 80 70 50 is the postorder of that same BST.
The same linear scan above can be applied to finding the next greater element for every element in an array.

void GetNextGreater(List<int> arr)
{
    if (arr.Count == 0) return;
    var stk = new Stack<int>();
    stk.Push(arr[0]);
    for (int i = 1; i < arr.Count; i++)
    {
        int next = arr[i];
        // Pop every pending element smaller than next; next is their answer.
        while (stk.Count > 0 && stk.Peek() < next)
        {
            Console.WriteLine("{0} next {1}", stk.Pop(), next);
        }
        stk.Push(next);
    }
    // Whatever remains has no greater element to its right.
    while (stk.Count > 0)
    {
        Console.WriteLine("{0} next {1}", stk.Pop(), -1);
    }
}

The same scan that works for the preorder check does not hold for postorder, because the root is no longer the first element encountered; in a postorder sequence it comes last.
So the solution is to scan the array in reverse (root, then right subtree, then left subtree) and mirror the bounds: maintain an upper bound instead of a lower bound.
bool IsPost(List<int> nums)
{
    var stk = new Stack<int>();
    int root = INT_MAX;   // upper bound for the elements still to come
    for (int i = nums.Count - 1; i >= 0; i--)
    {
        if (nums[i] > root)
            return false;
        // Entering a left subtree: pop the larger ancestors;
        // the last one popped becomes the new upper bound.
        while (!stk.empty() && stk.top() > nums[i])
        {
            root = stk.top();
            stk.pop();
        }
        stk.push(nums[i]);
    }
    return true;
}

For example, for the BST with preorder 50 30 35 70 80, the postorder 35 30 80 70 50 passes this check, while 50 30 35 70 80 itself is not a valid postorder.

Thursday, August 4, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibration and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than an over-provisioned counterpart rack by handling the constraints in the software stack.
Pelican draws its resources from a set of resource domains, where each domain is the subset of disks that share a resource. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks. We were discussing the evaluation of Pelican.
Pelican was measured under different load regimes. The disk throughput was measured using a quiescent system. Seeks are simulated by including a constant latency of 4.2 ms for all disk accesses and it was found that seek latency has negligible impact on performance. It was also found that the latency of spinning the disk up and mounting the volume heavily dominate over seek time. The distribution for mount delays was widely affected by the load on the system. 


We were analyzing the charts from the evaluation run. We said the spin-up and unmount delays are not affected by load and have a more or less uniform distribution, because they affect all the disks equally and are incurred once per spin-up. This is different for mount delays, where the higher the workload, the higher the chance of a disk being a straggler.


#codingexercise
There are n pairs, i.e. 2n people, arranged in random order. Find the minimum number of swaps required so that every pair becomes adjacent.
Notice this has an optimal substructure in terms of the cost per pair:
The cost to pair the first two elements, if they are already partners, is zero.
Otherwise it is one swap plus the minimum over the two ways to fix the front: bring the partner of the first element next to it, or bring the partner of the second element to the front.
Once we have fixed the front pair, we recurse on the rest.
Each time we revert the changes made, so that the alternative can be evaluated on the original order.

int GetMinSwaps(List<int> pairs, List<int> num, List<int> index, int start)
{
    // pairs[x] = partner of person x; num = current arrangement;
    // index[x] = current position of person x in num.
    int n = num.Count;
    if (start >= n) return 0;
    if (pairs[num[start]] == num[start + 1])
        return GetMinSwaps(pairs, num, index, start + 2);

    // Option 1: bring the partner of num[start] into position start+1.
    int pairIndex = index[pairs[num[start]]];
    Swap(num, index, start + 1, pairIndex);
    int firstCost = GetMinSwaps(pairs, num, index, start + 2);
    Swap(num, index, start + 1, pairIndex);   // undo

    // Option 2: bring the partner of num[start+1] into position start.
    int secondIndex = index[pairs[num[start + 1]]];
    Swap(num, index, start, secondIndex);
    int secondCost = GetMinSwaps(pairs, num, index, start + 2);
    Swap(num, index, start, secondIndex);     // undo

    return 1 + Math.Min(firstCost, secondCost);
}
void Swap(List<int> num, List<int> index, int i, int j)
{
    int t = num[i]; num[i] = num[j]; num[j] = t;
    index[num[i]] = i;
    index[num[j]] = j;
}
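As a cross-check of the recursion, a left-to-right greedy that always completes the pair at the current position is known to reach the same minimum, since each swap can complete at most one new front pair and the greedy completes exactly one per swap. A compact Python sketch (the `partner` map is an assumed input encoding, not from the original statement):

```python
def min_swaps_to_pair(arr, partner):
    """Greedy: fix positions (0,1), (2,3), ... left to right; each swap
    brings the current person's partner into the adjacent slot."""
    arr = arr[:]                              # work on a copy
    pos = {p: i for i, p in enumerate(arr)}   # person -> position
    swaps = 0
    for i in range(0, len(arr), 2):
        want = partner[arr[i]]
        if arr[i + 1] != want:
            j = pos[want]                     # where the partner sits now
            arr[i + 1], arr[j] = arr[j], arr[i + 1]
            pos[arr[i + 1]] = i + 1
            pos[arr[j]] = j
            swaps += 1
    return swaps
```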

Wednesday, August 3, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibration and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than an over-provisioned counterpart rack by handling the constraints in the software stack.
Pelican draws its resources from a set of resource domains, where each domain is the subset of disks that share a resource. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks. We were discussing implementation considerations. Now we discuss the evaluation of Pelican, for which a discrete event based Pelican simulator was developed and cross-validated against the storage stack using micro-benchmarks from the rack. The mount delay distributions were generated by sampling mount latencies during different load regimes on the Pelican. CDFs were plotted against spin-up delay (ms), mount delay (ms), unmount delay (ms), blob size (MB), and volume capacity (TB). The CDF is the cumulative distribution function; for a uniform distribution on [a, b] it is written Fa,b(x) as follows:
Fa,b(x) = 0 for x < a
Fa,b(x) = (x - a) / (b - a) for a <= x <= b
Fa,b(x) = 1 for x > b
It is monotonic, meaning non-decreasing, and lies in the range 0 to 1, which makes it great for charts!
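For reference, the piecewise definition above translates directly to code; a tiny Python helper (illustrative only) makes the monotone 0-to-1 behavior easy to check:

```python
def uniform_cdf(x, a, b):
    """CDF F_{a,b}(x) of the uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# The CDF rises linearly from 0 at a to 1 at b and is flat outside [a, b].
```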
The delays for spin-up, and the mount and unmount delays for a drive after spin-up, were measured. The disks were spun up and down 100,000 times to measure the spin-up and unmount delays. Both CDF charts, for spin-up and for unmount delays, showed a large increase initially and near saturation afterwards, indicating that the distribution remained roughly uniform across loads. This was different for mount delays, where the distribution was widely affected by the load on the system. In general, to study the effect of workloads, Pelican was measured under different load regimes.
The disk throughput was measured using a quiescent system. Seeks are simulated by including a constant latency of 4.2 ms for all disk accesses and it was found that seek latency has negligible impact on performance. It was also found that the latency of spinning the disk up and mounting the volume heavily dominate over seek time. 
As the blob size increased, the CDF showed an irregular increase as well, probably because large blobs were stored as a single contiguous file on disk. The simulator modeled disk sizes as a normal distribution.
#codingexercise
Yesterday we were discussing counting the number of squares within a matrix with an alternative approach. This includes iteration like this:
int GetCountSquaresFromPosition(int n, int m, int startrow, int startcol)
{
    // Count the squares whose top-left corner is (startrow, startcol)
    // in an m-row by n-column grid: every bottom-right corner (i, j)
    // that is the same distance down as it is across forms a square.
    int count = 0;
    for (int i = startrow; i <= m; i++) {
        for (int j = startcol; j <= n; j++) {
            if (i - startrow + 1 == j - startcol + 1)
                count++;
        }
    }
    return count;
}
Now we can use this method with each cell as the starting position:
int GetCountSquares(int n, int m)
{
    int count = 0;
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            count += GetCountSquaresFromPosition(n, m, i, j);
    return count;
}

Tuesday, August 2, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibration and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than an over-provisioned counterpart rack by handling the constraints in the software stack.
Pelican draws its resources from a set of resource domains, where each domain is the subset of disks that share a resource. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
We were discussing implementation considerations. Pelican has to ensure that the disks do not spin when first powered. It also needs to mount and check every drive, which it does in parallel with the help of groups: the check is initialized and scheduled as the first operation in each group, and because the disks in a group are domain-disjoint, the checks can proceed concurrently. Each disk is initialized with a GPT partition table and formatted with an NTFS file system. Furthermore, Pelican had to silence unexpected spin-ups of disks caused by services such as disk failure predictors polling the disk. It added a special No Access flag to the Windows Server driver stack that causes all IOs issued to a flagged disk to return with a "media not ready" error code. Pelican also had to tackle legacy issues, such as the ports exposed by the HBAs exhausting BIOS resources and causing the BIOS code to hang; the BIOS code was therefore modified to handle all 72 HBAs.

#codingexercise
Yesterday we were discussing counting the number of squares in a rectangle.
We counted the squares formed within the largest square of the rectangle and those contributed by the extra columns that make up the rest of the rectangle. We had suggested that additional squares might be formed exclusively in the region made up of the extra columns, and proposed splitting the count recursively. On closer inspection the recursion is unnecessary: the per-column term m + (m-1) + ... + 1 already counts every square whose rightmost column is an added one, including the unit squares, so the two terms of the closed form cover every square exactly once. This is shown as follows:
int GetSquaresCount(int n, int m)
{
    if (n < m)
        Swap(ref n, ref m);   // ensure n >= m
    if (n == 0 || m == 0) return 0;
    // Squares inside the m x m block, plus the squares added by each
    // of the n - m extra columns.
    return m * (m + 1) * (2 * m + 1) / 6 + (n - m) * m * (m + 1) / 2;
}

Next let us generalize the problem: given an N x M matrix, print all squares/rectangles of all possible sizes.
void PrintSquaresRectanglesFromPosition(int n, int m, int startrow, int startcol)
{
    // Print every rectangle whose top-left corner is (startrow, startcol),
    // including the 1 x 1 rectangle at the corner itself.
    for (int i = startrow; i <= m; i++) {
        for (int j = startcol; j <= n; j++) {
            Console.WriteLine("top left {0},{1} - bottom right {2},{3}",
                              startrow - 1, startcol - 1, i - 1, j - 1);
        }
    }
}
void PrintSquaresRectangles(int n, int m)
{
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            PrintSquaresRectanglesFromPosition(n, m, i, j);
}

Monday, August 1, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibration and failure domains are expressed as constraints over these units.
With the help of these resource constraints and scheduling units, Pelican aims to do better than an over-provisioned counterpart rack by handling the constraints in the software stack.
Pelican draws its resources from a set of resource domains, where each domain is the subset of disks that share a resource. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
We were reading about the IO scheduler which does request ordering and rate limiting.
We now look at some of the implementation considerations. The first was to not violate the power constraints from when power is applied until the OS has finished booting and the Pelican service is managing the disks. This is done by ensuring that the disks do not spin when first powered, which is possible via the Power Up In Standby (PUIS) option: the drives establish a PHY link on power-up but require an explicit SATA command to spin up. This ensures the disks do not spin without the Pelican storage stack managing the constraints.
Second, the Pelican storage stack needs to mount and check every drive. Done sequentially, this would adhere to the power constraints but yield a long boot time. Therefore it is done in parallel by exploiting the group abstraction: this is the first operation performed for each group.
#codingexercise
Count the number of squares in a rectangle. For example, if an n x m matrix is given, how many squares are in it?
We try out a few cases first:
If m = n = 1, we have 1.
If m = n = 2, we have 4 + 1: four unit squares and one 2 x 2.
If m = n = 3, we have 9 + 4 + 1 for increasing square sizes.
This follows the pattern n^2 + (n-1)^2 + ... + 1^2, which equals n(n+1)(2n+1)/6.

Now consider the case when m and n are not equal. Then we can count the additions one extra column at a time.
Say m <= n.
We first calculate for the m x m square as above.
For each additional column, the number of squares increases by m + (m-1) + (m-2) + ... + 1:
m new unit squares, m-1 new 2 x 2 squares, and so on.
This equals m(m+1)/2.
There are n - m such columns, so we add (n-m) * m(m+1)/2 squares.

The total is therefore:
int GetSquaresCount(int n, int m)
{
    if (n < m)
        Swap(ref n, ref m);   // ensure n >= m
    return m * (m + 1) * (2 * m + 1) / 6 + (n - m) * m * (m + 1) / 2;
}
Question: is it possible to add more squares contained exclusively within the n-m extra columns and m rows?
No: every such square has its rightmost column among the extra columns, and the per-column term m + (m-1) + ... + 1 already counts each square ending at that column, including the unit squares. Splitting the rectangle further would only double count.



