Wednesday, August 3, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to perform better than its over-provisioned counterpart racks by pushing the work into the software stack.
Pelican draws resources from a set of resource domains, where each resource domain is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks. We were discussing implementation considerations. Now we discuss the evaluation of Pelican, for which a discrete event based Pelican simulator was developed and cross-validated against the storage stack. The simulator was cross-validated using micro-benchmarks from the rack. The mount delay distributions were generated by sampling mount latencies during different load regimes on the Pelican. CDFs were plotted against spin-up delay (ms), mount delay (ms), unmount delay (ms), blob size (MB) and volume capacity (TB). The CDF stands for cumulative distribution function; for a uniform distribution on [a, b] it is expressed as
Fa,b(x) = 0 for x < a,
Fa,b(x) = (x - a) / (b - a) for a <= x <= b,
Fa,b(x) = 1 for x > b.
The CDF is monotonic, that is, non-decreasing, and it ranges from 0 to 1, which makes it convenient for charts.
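As a quick illustration, here is a minimal sketch of the uniform CDF above; this helper is ours and not part of the Pelican simulator:
// Uniform CDF on [a, b]: 0 below a, linear within, 1 above b.
static double UniformCdf(double x, double a, double b)
{
    if (x < a) return 0.0;
    if (x > b) return 1.0;
    return (x - a) / (b - a);
}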
The spin-up delay and the mount and unmount delays for a drive after spin-up were measured; the disks were spun up and down 100,000 times to measure the spin-up and unmount delays. The CDF charts for both spin-up and unmount delays rose sharply at first and then saturated, and the distributions remained the same across loads. This was different for mount delays, whose distribution was widely affected by the load on the system. In general, to study the effect of workloads, Pelican was measured under different load regimes.
The disk throughput was measured on a quiescent system. Seeks are simulated by adding a constant latency of 4.2 ms to all disk accesses, and it was found that seek latency has a negligible impact on performance: the latencies of spinning the disk up and mounting the volume heavily dominate the seek time.
As the blob size increased, the CDF rose unevenly as well, probably because large blobs were stored as a single contiguous file on disk. The simulator modeled the distribution of disk sizes as a normal distribution.
#codingexercise
Yesterday we were discussing counting the number of squares within a matrix with an alternative approach. This uses an iteration like this:
int GetCountSquaresFromPosition(int n, int m, int startrow, int startcol)
{
    // Counts the squares whose top-left corner is (startrow, startcol)
    // in a matrix with m rows and n columns: a bottom-right corner (i, j)
    // forms a square when its height equals its width. The 1 x 1 square
    // at the start position is included.
    int count = 0;
    for (int i = startrow; i <= m; i++) {
        for (int j = startcol; j <= n; j++) {
            if (i - startrow + 1 == j - startcol + 1)
                count++;
        }
    }
    return count;
}
Now we can use this method with each cell as the starting position:
int GetCountSquares(int n, int m)
{
    int count = 0;
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            count += GetCountSquaresFromPosition(n, m, i, j);
    return count;
}
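As a sanity check, GetCountSquares(2, 2) returns 5: four unit squares plus one 2 x 2 square. GetCountSquares(3, 2) returns 8, matching the closed form derived in the earlier posts.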

Tuesday, August 2, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to perform better than its over-provisioned counterpart racks by pushing the work into the software stack.
Pelican draws resources from a set of resource domains, where each resource domain is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
We were discussing implementation considerations. Pelican has to ensure that the disks do not spin when first powered. It also needs to mount and check every drive, which it does in parallel with the help of groups: the check is initialized and scheduled as the first operation in each group, and since the disks in a group are domain-disjoint, the checks can be performed concurrently. Each disk is initialized with a GPT partition table and formatted with NTFS. Furthermore, Pelican had to silence unexpected spin-ups of disks caused by services, such as disk failure predictors, polling the disk. A special No Access flag was added to the Windows Server driver stack that causes all IOs issued to a disk marked with it to return a "media not ready" error code. Pelican also had to tackle legacy issues, such as the ports exposed by the HBAs running out of space and causing the BIOS code to hang; therefore, the BIOS code was modified to handle all 72 HBAs.

#codingexercise
Yesterday we were discussing counting the number of squares in a rectangle. 
We counted the squares formed within the largest square in the rectangle and those from the overlap of the largest square and the extra columns that make up the rectangle. However, we suggested that squares can also be formed exclusively in the region made up of the extra columns beyond the largest square. Consequently, we split the count into a recursive call for the largest square of the rectangle, plus the squares from the overlap of that square with the extra columns, plus those formed exclusively within the extra columns. This is shown as follows:
int GetSquaresCount(int n, int m)
{
    if (n < m)
        Swap(ref n, ref m); // ensure n >= m
    if (n <= 0 || m <= 0) return 0;
    if (n == m && n == 1) return 1;
    int count = m * (m + 1) * (2 * m + 1) / 6;
    count += (n - m) * m * (m - 1) / 2
           + GetSquaresCount(n - m, n - m)
           + GetSquaresCount(n - m, 2 * m - n);
    return count;
}

Next, let us generalize the problem:
Given an N*M matrix, print all squares/rectangles of all possible sizes.
void PrintSquaresRectanglesFromPosition(int n, int m, int startrow, int startcol)
{
    // Prints every rectangle whose top-left corner is (startrow, startcol),
    // including the 1 x 1 one; the output uses zero-based coordinates.
    for (int i = startrow; i <= m; i++) {
        for (int j = startcol; j <= n; j++) {
            Console.WriteLine("top left ({0},{1}) - bottom right ({2},{3})",
                              startrow - 1, startcol - 1, i - 1, j - 1);
        }
    }
}
void PrintSquaresRectangles(int n, int m)
{
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            PrintSquaresRectanglesFromPosition(n, m, i, j);
}

Monday, August 1, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to perform better than its over-provisioned counterpart racks by pushing the work into the software stack.
Pelican draws resources from a set of resource domains, where each resource domain is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
We were reading about the IO scheduler, which does request ordering and rate limiting.
We now look at some of the implementation considerations. The first was to not violate the power constraints from the moment power is applied until the OS has finished booting and the Pelican service is managing the disks. This is done by ensuring that the disks do not spin when first powered, which is possible via the Power Up In Standby (PUIS) option: the drives establish a PHY link on power up but require an explicit SATA command to spin up. This ensures the disks do not spin without the Pelican storage stack managing the constraints.
Second, the Pelican storage stack needs to mount and check every drive. Done sequentially this would adhere to the power constraints but lead to a long boot time, so it is done in parallel by exploiting the group abstraction: it is the first operation performed for each group, and the disks within a group are domain-disjoint, so they can be processed concurrently. A minimal sketch of this idea follows.
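Here is a minimal sketch of that group-parallel initialization, assuming hypothetical Group, Disk and MountAndCheckAsync names that stand in for the real Pelican stack:
// Sketch only: mounts and checks drives one group at a time.
// Disks within a group are domain-disjoint, so they can spin up
// and be checked concurrently without violating the constraints.
// Group, Disks and MountAndCheckAsync are illustrative stand-ins.
async Task InitializeGroupsAsync(IEnumerable<Group> groups)
{
    foreach (var group in groups)
    {
        var checks = group.Disks.Select(disk => MountAndCheckAsync(disk));
        await Task.WhenAll(checks); // one group's disks in parallel
    }
}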
#codingexercise
Count the number of squares in a rectangle. For example, if an n x m matrix is given, how many squares are in it?
We try out a few cases first.
If m = n = 1, we have 1.
If m = n = 2, we have 4 + 1: four unit squares and one 2 x 2 square.
If m = n = 3, we have 9 + 4 + 1 for increasing square sizes.
This follows the pattern n^2 + (n-1)^2 + ... + 1^2, which equals n(n+1)(2n+1)/6.

Now let us consider the case when m and n are not equal. Then we can account for the additions one extra column at a time.
Say m <= n.
We can first calculate the count for the m x m square as above.
For each additional column, the number of squares increases by m + (m-1) + (m-2) + ... + 1,
with m unit squares, m-1 squares of size 2 x 2, and so on.
This equals m(m+1)/2.
There are n-m such columns, so we add (n-m) * m(m+1)/2 squares.

The total is therefore:
int GetSquaresCount(int n, int m)
{
    if (n < m)
        Swap(ref n, ref m); // ensure n >= m
    return m * (m + 1) * (2 * m + 1) / 6 + (n - m) * m * (m + 1) / 2;
}
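For example, GetSquaresCount(3, 2) = 2*3*5/6 + 1*2*3/2 = 5 + 3 = 8, which matches counting by hand: six 1 x 1 squares and two 2 x 2 squares in a 3 x 2 grid.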
Question: is it possible to add more squares contained exclusively within the n-m extra columns and the m rows?
We can do this for an (n-m) x (n-m) square and proceed in the same way:
each time we split the rectangle into a square and a smaller rectangle, until we are left with a unit square.

Sunday, July 31, 2016

In yesterday's post, we talked about extending OneFS over Mesos. There we maintained a separation between the layers of OneFS services and Mesos. However, we could also consider a hypothetical substitute for Mesos, one that is virtual and integrated with OneFS such that the nodes of the cluster are directly maintained by OneFS. The way this differs from the current use of OneFS clusters is that it could use its own cluster nodes as well as Mesos, so long as it can delegate. One of the requirements for OneFS is a high speed network between the nodes so that data transfer and metadata operations are supported. When we replace the current physical resources with virtual resources for OneFS, we should be mindful that the performance is on par so that the services can work seamlessly.
Another important consideration is that some datacenters, for the sake of performance, choose to pull away from commodity hardware and use racks as the basic building block of the modern data center instead of individual servers. A rack-scale set of computers is a class of hardware with pooled storage, compute and network such that thousands of cores are interconnected at high speed. For example, in Microsoft rack-scale computing, the hardware, storage, networking and OS stacks are redesigned in a cross-layer co-design manner so as to improve efficiencies. By using a direct server-to-server network topology, they provide the ability to overcome the hardware restrictions that conventional OneFS deployments have. While this is another dimension for sourcing the resources for OneFS, it highlights the abstraction that the OneFS core services can make in terms of what they operate on while continuing to provide their respective functionalities to form a scalable, self-healing file system.
The use of Mesos or of racks does not take away any of the functionality that OneFS services provide to stretch the file system over resources well beyond an individual server. The design of OneFS already makes the necessary assumptions to support growth and shrinkage of resources. The difference is that while this is currently done directly in terms of nodes, we now suggest a redirection through an abstraction that is suitable for all three: direct virtual resources, Mesos delegation and rack computing.
This abstraction implies that customers can expect the same seamless managed storage layer across deployments, whether in the on-premise enterprise, in datacenters or in large scale cloud deployments.
Moreover, by providing this managed layer instead of raw storage, applications are relieved of many storage and data maintenance considerations of scale, cost and performance, enabling them to focus more on compute and letting the storage layer move with them from one place to another. This improves operational benefits while reducing impact to architecture.
#codingexercise
Given a string and a pattern, state whether the string follows the pattern. For example, input = "redblueredgreen" matches the pattern "abac" but not "aaab", as 'red' means 'a', 'blue' means 'b' and 'green' means 'c' here. The input string only contains the words "red", "green" and "blue", and the pattern can have any characters.
We could optionally modify the problem to allow any text, so that the input elements red, green and blue no longer form well-known demarcations in the input.
Solution: we build a hashtable from pattern characters to the substrings they stand for, and backtrack over candidate prefixes.
bool IsSimilar(string text, string pattern, ref Hashtable h)
{
    if (text == "" && pattern == "") return true;
    if (text == "" || pattern == "") return false;
    for (int i = 1; i <= text.Length; i++)
    {
        var sub = text.Substring(0, i);
        bool added = false;
        if (h.ContainsKey(pattern[0]))
        {
            // The first pattern character is already bound; the prefix must match it.
            if ((string)h[pattern[0]] != sub) continue;
        }
        else
        {
            h.Add(pattern[0], sub);
            added = true;
        }
        if (IsSimilar(text.Substring(i), pattern.Substring(1), ref h))
            return true;
        if (added) h.Remove(pattern[0]); // backtrack
    }
    return false;
}
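A quick check of the intended behavior, using the System.Collections Hashtable:
var h = new Hashtable();
Console.WriteLine(IsSimilar("redblueredgreen", "abac", ref h)); // True
h.Clear();
Console.WriteLine(IsSimilar("redblueredgreen", "aaab", ref h)); // False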

Given an N*M matrix, how many different sizes of rectangles or squares can be formed?
int GetNumberOfSquaresRectangles(int n, int m)
{
    // Each choice of height and width gives one size,
    // so the count is simply n * m.
    int count = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            count += 1;
    return count;
}
Now we count the number of such rectangles formed with each cell as the starting (top-left) point:
int GetAll(int n, int m)
{
    int count = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            count += GetNumberOfSquaresRectangles(n - i + 1, m - j + 1);
    return count;
}
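As a sanity check (our note, not part of the original exercise), the double sum factorizes into the familiar closed form for the number of rectangles in a grid:
// GetAll(n, m) == (n * (n + 1) / 2) * (m * (m + 1) / 2)
// e.g. GetAll(2, 2) == 3 * 3 == 9 rectangles in a 2 x 2 grid.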

Saturday, July 30, 2016

Debunking "data is sticky" in cloud computing
Traditional enterprise software has focused on accumulating incredible amounts of data for its systems. As the data grows, it becomes more and more sticky, requiring applications and services to change accordingly. Entire ecosystems then evolve around the data, bringing with them many routines and a heavy maintenance onus. 
Compute and storage have often gone hand in hand, with requirements shaping each other mutually. Yet they have fundamentally different needs and require different services to form their own layers. While the separation between the layers may not matter for desktop applications, it does for cloud computing. On premise, data translated to a connection string, a file system or a message bus for the application. This was replaced by blobs, key-values, queue messages, file shares and object storage. A database or a file no longer needs to be static: it can be moved around, backed up and replicated, compressed, encrypted, aged and archived.
Object storage gave some of these benefits by exposing a raw storage platform that distributed the failure domains for the data. By adding metadata and a globally unique identifier, the data could now live on multiple virtual machines or storage units, increasing availability, scalability and self-healing abilities. Applications took advantage of this by storing photos and songs. 
However, object storage is not a managed platform; it is an alternative to file and block storage. If it were a managed platform, it would provide the services that storage appliances are used for, such as deduplication, backup, aging and archival, and self-healing enhancements. One platform came close to providing these managed services: Isilon OneFS provides a distributed file system over a storage cluster. It combines three layers of traditional storage architecture, namely the file system, the volume manager and data protection, and it scales out because it relies on intelligent software, commodity hardware and a distributed architecture.
However, OneFS uses cluster storage from on-premise hardware. If Isilon could build cluster storage out of cloud resources such as Mesos clusters, then it would harness more cloud computing power with very little physical on-premise footprint. By using a distributed systems kernel that runs on every machine, Isilon could use its API for resource management and scheduling across entire data centers and cloud environments. This provides several benefits: 
It provides linear scalability in that clusters can grow to tens of thousands of nodes, whereas the OneFS maximum is around 143 nodes of a hundred terabytes each.  
It provides high availability in that the masters and agents are replicated in a fault-tolerant manner, and even upgrades are non-disruptive.
It provides the same cluster to support both cloud-native and legacy applications, with pluggable scheduling policies, so that OneFS services can be ported or rewritten.
It provides APIs for operating and monitoring the cluster, leaving much of the onus to Mesos while enabling customizations. These APIs are already exercised in the web user interface made available from this layer.
It provides pluggable isolation in terms of CPU, memory, disks, ports and GPU, plus modules for custom resource isolation. This can help OneFS distribute and differentiate the workloads.
It provides a cross-platform ability to run on Linux, OSX and Windows while being cloud-provider agnostic. This makes it universal in deployments and thus provides significant benefits for adoption and onboarding. 
With an existing customer base to validate the viability of OneFS and Mesos individually, it should be safe to assume the viability and purpose of the one stacked over the other. Lastly, this provides a managed platform for storage rather than raw services, which lets applications be more compute-oriented and less storage-conscious.

#codingexercise
Given a number n and k (the number of swaps allowed), make the biggest number from n by making at most k swaps. If it is already the biggest, return the number itself.
To solve the problem, we greedily pick, for each position, the largest digit that appears later and swap it in.
int getnum(int num, int k)
{
    // toDigits, toNumber, getmax and Swap are helpers assumed from
    // earlier posts; getmax(digits, start) is taken to return the index
    // of the largest digit at or after position start.
    var digits = num.toDigits();
    int swaps = 0;
    for (int i = 0; i + 1 < digits.Count; i++)
    {
        int pos = getmax(digits, i + 1);
        if (digits[pos] > digits[i])
        {
            Swap(digits, i, pos);
            swaps++;
            if (swaps == k)
                return digits.toNumber();
        }
    }
    return digits.toNumber();
}
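For example, getnum(254, 1) swaps the 5 into the leading position and returns 524, while getnum(254, 2) goes on to return 542.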

Friday, July 29, 2016

Today we continue our discussion of the paper titled "Pelican: a building block for exascale cold data storage". Pelican treats a group of disks as a single schedulable unit. Resource restrictions such as power consumption, vibrations and failure domains are expressed as constraints over these units.
With the help of resource constraints and scheduling units, Pelican aims to perform better than its over-provisioned counterpart racks by pushing the work into the software stack.
Pelican draws resources from a set of resource domains, where each resource domain is a subset of the disks. Pelican proposes a data layout and IO scheduling algorithms by expressing these resource domains as constraints over the disks.
We were reading about the IO scheduler, which does request ordering and rate limiting. The former batches sets of operations for the same group to amortize the group spin-up latency over the set of operations. The latter makes rebuild operations complete within an upper time bound.
The scheduler maintains two queues - one for rebuild operations and another for regular operations - and performs weighted fair queuing between them.
If the blob size is uniform, the upper bound on the tolerated reordering is expressed as a number of operations; if it is non-uniform, it is expressed in wall-clock time. This process is repeated greedily for each queued request: batching is maximized unless it would violate fairness for some requests, and the reordering bound is enforced for each request. Both guarantees rely on the upper bound, which controls the reordering and expresses the tradeoff between throughput and fairness. If the value of the upper bound is close to zero, the queues behave fairly; if it is close to infinity, the scheduler minimizes the number of spin-ups, which increases throughput. A small sketch of this batching follows.
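Here is a minimal sketch of the batching policy, assuming uniform blob sizes so that the bound counts operations; the Request type and queue layout are our own illustration, not the paper's:
// Sketch: extend the batch for the group at the head of the queue,
// stopping once more than 'bound' other requests would be reordered
// past; bound = 0 degenerates to FIFO, a huge bound maximizes batching.
class Request { public int Group; }

List<Request> NextBatch(LinkedList<Request> queue, int bound)
{
    var batch = new List<Request>();
    if (queue.Count == 0) return batch;
    int group = queue.First.Value.Group;  // the head request picks the group
    int skipped = 0;                      // requests the batch jumps over
    var node = queue.First;
    while (node != null && skipped <= bound)
    {
        var next = node.Next;
        if (node.Value.Group == group)
        {
            batch.Add(node.Value);        // one spin-up amortized over these
            queue.Remove(node);
        }
        else
        {
            skipped++;
        }
        node = next;
    }
    return batch;
}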

#codingexercise
Print all possible strings that can be made by placing spaces between the characters of a given string.

void PlaceSpaces(string a, StringBuilder b, int i)
{
    // At each position we either append the next character directly or
    // precede it with a space; the first character is never preceded by
    // a space, so a string of length n yields 2^(n-1) outputs.
    if (i == a.Length)
    {
        Console.WriteLine(b.ToString());
        return;
    }
    b.Append(a[i]);
    PlaceSpaces(a, b, i + 1);
    b.Length -= 1;                // backtrack
    if (i > 0)
    {
        b.Append(' ');
        b.Append(a[i]);
        PlaceSpaces(a, b, i + 1);
        b.Length -= 2;            // backtrack the space and the character
    }
}
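Called as PlaceSpaces("abc", new StringBuilder(), 0), this prints abc, ab c, a bc and a b c.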
Given a sorted dictionary (array of words) of an alien language, find order of characters in the language.
Recall that a topological sort lists the vertices in an order such that for every edge (u, v), u appears before v in the list. The same criterion can be used to order the letters of a dictionary where the listing of words follows the language's order. All we have to do is add the right edges and run the topological sort.
TOPOLOGICAL-SORT-DFS(V, E)
  for each vertex v in V:
      v.color = white
      v.d = nil
  time = 0
  for each vertex v in V:
      if v.color == white:
          DFS-VISIT(V, E, v)

DFS-VISIT(V, E, u)
  time = time + 1
  u.d = time
  u.color = gray
  for each vertex v adjacent to u:
      if v.color == white:
          DFS-VISIT(V, E, v)
      else if v.color == gray:
          throw back-edge exception  // a cycle means no valid order
  u.color = black
  time = time + 1
  u.f = time
  prepend u to the output list  // decreasing finish time gives the order

void GetDictOrder(List<string> words, int V)
{
    var g = new Graph(V); // Graph with AddE(u, v) is the helper assumed here
    for (int i = 0; i < words.Count - 1; i++)
    {
        var first = words[i];
        var second = words[i + 1];
        for (int j = 0; j < Math.Min(first.Length, second.Length); j++)
        {
            // The first position where adjacent words differ yields one
            // precedence edge between the two characters.
            if (first[j] != second[j])
            {
                g.AddE(first[j] - 'a', second[j] - 'a');
                break;
            }
        }
    }
    // Run the DFS-based topological sort above over the whole graph and
    // emit the vertices in decreasing order of finish time.
    TopologicalSortDFS(g.V, g.E);
}
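For example, with words = { "caa", "aaa", "aab" }, comparing adjacent words yields the edges c -> a (from caa/aaa) and a -> b (from aaa/aab), so the derived character order is c, a, b.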

Given a number n and k (the number of swaps allowed), make the biggest number from n by making at most k swaps. If it is already the biggest, return the number itself.
int getnum(int num, int k)
{
    // A rough greedy sketch: compare against the digits sorted in
    // descending order and treat each mismatch as costing one swap.
    // toDigits and toNumber are the digit-list helpers assumed above;
    // Except has set semantics, so this assumes distinct digits.
    var digits = num.toDigits();
    int swaps = 0;
    var desc = digits.OrderByDescending(d => d).ToList();
    for (int i = 0; i < digits.Count; i++)
    {
        if (digits[i] != desc[i])
        {
            swaps++;
            if (swaps > k)
            {
                // Only the first k mismatches can be fixed; keep that
                // prefix from desc and append the remaining digits.
                var part = desc.GetRange(0, i);
                return part.Concat(digits.Except(part)).toNumber();
            }
        }
    }
    return desc.toNumber();
}