Tuesday, March 7, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We reviewed the partition layer which translates objects such as blob, table or queue to storage. We saw how the partition manager maintains a massive Object Table that is sharded on different partition servers, which maintain their consistency across concurrent transactions using a lock service. We read the partition layer data model and the supported data types and operations, the partition layer architecture and the range partition data structures. We reviewed the load balancing, split and merge operations. Then we reviewed intrastamp and interstamp replication. We also took stock of TCO.
The throughput for applications increases linearly as the application increases the number of compute resources used to access WAS objects. The throughput increases linearly up to eight VMs but tapers off as the network capacity is reached. For table operations, batch puts offer about three times more throughput than single entity puts because of reduced network round trips and fewer stream writes.
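To see where the batching gain comes from, here is a minimal sketch; the Entity, ITableClient, Insert and InsertBatch names below are hypothetical stand-ins and not the actual WAS client API:
class Entity { public string PartitionKey; public string RowKey; } // illustrative entity
interface ITableClient
{
    void Insert(Entity e);                // one network round trip per entity
    void InsertBatch(List<Entity> batch); // one round trip and fewer stream writes per batch
}
void WriteEntities(ITableClient client, List<Entity> entities)
{
    // Single-entity puts: one round trip for every entity.
    foreach (var e in entities)
        client.Insert(e);
    // Batch puts: roughly entities.Count / 100 round trips, which is where the ~3x gain comes from.
    for (int i = 0; i < entities.Count; i += 100)
        client.InsertBatch(entities.GetRange(i, Math.Min(100, entities.Count - i)));
}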
A variety of applications use WAS as cloud storage. The Facebook and Twitter search for Bing uses WAS. XBox GameSaves saves game data into the cloud for XBox users. Similarly, the XBox telemetry service stores console-generated diagnostics and telemetry information for later secure retrieval and offline processing. Microsoft's Zune backend uses Windows Azure for media file storage and delivery, where it stores files as Blobs. A comparison of Blobs, Tables and Queues was made for these applications in terms of requests, capacity, ingress and egress. Generally, for these applications, 17.9% of all requests used Blobs, 46.88% used Tables and 35.22% used Queues.



#codingexercise

Given a sequence of digits, find the number of subsequences that are divisible by m. 



int GetCount(List<int> digits, int index, int rem, int m)
{
    if (digits.Count == 0 || m == 0) return 0;
    if (index == digits.Count) return (rem == 0) ? 1 : 0;
    // either skip the current digit or append it to the subsequence formed so far
    return GetCount(digits, index + 1, rem, m)
         + GetCount(digits, index + 1, (rem * 10 + digits[index]) % m, m);
}
// Note: the recursion also counts the empty subsequence (nothing chosen, rem still 0),
// so callers should subtract one from the result.
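As a quick check, a small driver (it subtracts one to discard the empty subsequence counted by the recursion):
var digits = new List<int> { 1, 2, 3 };
int count = GetCount(digits, 0, 0, 3) - 1; // drop the empty subsequence
Console.WriteLine(count); // 3, for the subsequences 3, 12 and 123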

Alternatively, enumerate all subsequences using combinations of lengths 1 to digits.Count (and their permutations if reordering is requested) and check each for divisibility.
void Combine(string digits, StringBuilder b, int m, int start, int level, List<int> results)
{
    // level tracks the recursion depth (kept from the original signature)
    for (int i = start; i < digits.Length; i++)
    {
        b.Append(digits[i]);                      // choose the digit at position i
        int value = int.Parse(b.ToString());
        if (IsDivisible(value, m))
            results.Add(value);
        Combine(digits, b, m, i + 1, level + 1, results); // extend with digits after position i
        b.Remove(b.Length - 1, 1);                // backtrack
    }
}
The divisibility test (IsDivisible) can be treated as an interchangeable implementation.
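One such interchangeable implementation could simply be a modulus check:
bool IsDivisible(int value, int m)
{
    return m != 0 && value % m == 0; // guard against division by zero
}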

Monday, March 6, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We reviewed the partition layer which translates objects such as blob, table or queue to storage. We saw how the partition manager maintains a massive Object Table that is sharded on different partition servers, which maintain their consistency across concurrent transactions using a lock service. We read the partition layer data model and the supported data types and operations, the partition layer architecture and the range partition data structures. We reviewed the load balancing, split and merge operations. Then we reviewed intrastamp and interstamp replication.
Let's take a brief digression to talk about TCO. Yan Han, in his December 2011 tutorial on TCO, gave two examples of cost analysis using AWS. According to him, at that time the TCO of an AWS small instance over five years was $2,750-$3,750, all of which was operational cost; he put the hardware cost at $0. For a physical server comparable to an AWS small instance over five years, the hardware cost was $5,868-$7,608 and the operational expense was $7,190-$10,690, ignoring downtime and failure expenses, insurance cost, technology training, and the backup process. He cited similar figures for an example using storage instead of compute: the TCO of 10TB in Amazon S3 per year was $16,800, all of it operational expense, while the TCO of an on-premise 10TB physical store per year was $11,212-$12,612, which he broke down into hardware costs of $4,168 per year and operational expense of $1,438-$2,138 per year. Based on his analysis, he concluded that on-premise costs run higher than those of a public cloud offering. In that case, an organization will find it an uphill task to reduce or justify the cost of a private cloud to the point of tipping it below the cost incurred on the public cloud. Therefore a migration path to the public cloud needs to be paved.
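One way to make such figures easier to compare is to normalize them, for example to a cost per GB-month; a small sketch over the storage numbers quoted above, assuming 1 TB = 1024 GB:
double s3TenTBPerYear = 16800;                               // quoted S3 figure
double onPremLowPerYear = 11212, onPremHighPerYear = 12612;  // quoted on-premise range
double gbMonths = 10 * 1024 * 12;                            // 10 TB over twelve months
Console.WriteLine("S3: {0:F2} per GB-month", s3TenTBPerYear / gbMonths);          // ~ $0.14
Console.WriteLine("On-premise: {0:F2}-{1:F2} per GB-month",
                  onPremLowPerYear / gbMonths, onPremHighPerYear / gbMonths);     // ~ $0.09-$0.10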
The migration path for applications and customers could include the following initiatives:
1) Classifying customer usage as IaaS, PaaS, or SaaS and progressively separating workloads along those lines
2) Surveying all compute and storage usage to determine the priority and severity levels of workloads for their migration
3) Onboarding new applications to the public cloud with a translation of jargon for core services such as logging, backup, monitoring, databases and so on
4) Determining security and compliance defaults for application destinations in the public cloud
5) Transferring account ownership and tracking of these resources to individuals and groups based on accounts and roles in the cloud
6) Setting up alerts and notifications for the management and activities involved
7) Mapping project, resource and data transfers from the private to the public cloud
8) Performing automatic migrations where customer interaction can be avoided
The migration plan may also depend on the TCO. Martens, Walterbusch and Teuteberg note that a major issue with most TCO models is that they disregard the differentiation between private, public and hybrid cloud.


#codingexercise
Print Top view of a binary tree
void PrintTopView(Node root)
{
    // level-order traversal; keep the first node seen at each horizontal distance
    if (root == null) return;
    var d = new SortedDictionary<int, int>();
    var q = new Queue<Node>();
    root.dist = 0;
    q.Enqueue(root);
    while (q.Count > 0)
    {
        var item = q.Dequeue();
        if (d.ContainsKey(item.dist) == false)
            d[item.dist] = item.data;   // first (topmost) node at this horizontal distance
        if (item.left != null)
        {
            item.left.dist = item.dist - 1;
            q.Enqueue(item.left);
        }
        if (item.right != null)
        {
            item.right.dist = item.dist + 1;
            q.Enqueue(item.right);
        }
    }
    foreach (var kvp in d)
        Console.Write("{0} ", kvp.Value);
}
int GetMid(Node root)
{
    if (root == null) return 0;
    var current = root;
    while (current.left != null)
        current = current.left;
    var min = current.data;   // the leftmost node of a BST holds the minimum
    current = root;
    while (current.right != null)
        current = current.right;
    var max = current.data;   // the rightmost node of a BST holds the maximum
    return (min + max) / 2;   // midrange of the keys
}

Sunday, March 5, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We reviewed the partition layer which translates objects such as blob, table or queue to storage. We saw how the partition manager maintains a massive Object Table that is sharded on different partition servers, which maintain their consistency across concurrent transactions using a lock service. We read the partition layer data model and the supported data types and operations, the partition layer architecture and the range partition data structures. We reviewed the load balancing, split and merge operations.
We now look at partition layer interstamp replication. All access pertaining to an AccountName goes to the primary stamp associated with the account, as determined by the Location Service. WAS performs interstamp replication between primary and secondary stamps. One of the main scenarios for inter-stamp replication is to geo-replicate an account's data between two data centers for disaster recovery. The DNS naming convention is AccountName.service.core.windows.net, which points to the VIP for the primary stamp. For every write in the primary stamp, the change is fully replicated by intra-stamp replication. Once the changes are committed to the primary stamp, they are geo-replicated asynchronously to the secondary stamp using interstamp replication. When a change arrives at the secondary stamp, it is applied in the partition layer, which fully replicates it using intra-stamp replication within the secondary stamp.
Because interstamp replication is done asynchronously, recent updates that have not yet been inter-stamp replicated may be lost during disaster recovery. This is minimized by reducing the delay in interstamp replication. Interstamp replication is used for both account geo-replication and migration across stamps. In the disaster recovery case there is an abrupt failover, so changes may be lost, whereas in the migration case a clean migration can be performed. The URIs for blobs, tables and queues do not change after the failover.
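A minimal sketch of that asynchronous pattern follows; the Transaction and Stamp types are illustrative stand-ins, not WAS internals. The write is acknowledged once intra-stamp replication succeeds on the primary, and the inter-stamp copy is shipped later in the background, which is exactly why an abrupt failover can lose whatever is still queued.
class Transaction { }
class Stamp { public void ReplicateIntraStamp(Transaction txn) { /* elided */ } }
class GeoReplicator
{
    private readonly Queue<Transaction> pending = new Queue<Transaction>();
    public void Commit(Transaction txn, Stamp primary)
    {
        primary.ReplicateIntraStamp(txn);     // synchronous: the client is acknowledged here
        lock (pending) pending.Enqueue(txn);  // asynchronous: geo-replication happens later
    }
    public void PumpToSecondary(Stamp secondary)
    {
        while (true)
        {
            Transaction txn = null;
            lock (pending) { if (pending.Count > 0) txn = pending.Dequeue(); }
            if (txn != null)
                secondary.ReplicateIntraStamp(txn); // the secondary fans out with its own intra-stamp replication
            else
                Thread.Sleep(100);                  // anything still in 'pending' is what a failover can lose
        }
    }
}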
Compute and storage are traditionally separated into independent clusters because they can then do their own load balancing and scale independently. The throughput for applications increases linearly as the application increases the number of compute resources used to access WAS objects. The throughput increases linearly up to eight VMs but tapers off as the network capacity is reached, which suggests the throughput could have increased further if the network were not the limiting factor.
#codingexercise
int GetMax(Node root)
{
    if (root == null) return int.MinValue; // identity value for max on an empty tree
    while (root.right != null)
        root = root.right;                 // the rightmost node of a BST holds the maximum
    return root.data;
}

Does the bottom view of a tree equal its leaves? Not in general: an internal node appears in the bottom view when no node lies below it at the same horizontal distance, and a leaf can be hidden by a lower node at its horizontal distance. Still, the leaves of the tree can be printed as follows:
void GetLeaves(Node root, ref List<Node> leaves)
{
if (root == null) return;
GetLeaves(root.left, ref leaves);
if (root.left == null && root.right == null) 
     leaves.Add(root);
GetLeaves(root.right, ref leaves);
}

Print bottom view of a binary tree
void PrintBottomView(Node root)
{
    // level-order traversal; keep the last node seen at each horizontal distance
    if (root == null) return;
    var d = new SortedDictionary<int, int>();
    var q = new Queue<Node>();
    root.dist = 0;
    q.Enqueue(root);
    while (q.Count > 0)
    {
        var item = q.Dequeue();
        d[item.dist] = item.data;   // later nodes overwrite earlier ones at the same distance
        if (item.left != null)
        {
            item.left.dist = item.dist - 1;
            q.Enqueue(item.left);
        }
        if (item.right != null)
        {
            item.right.dist = item.dist + 1;
            q.Enqueue(item.right);
        }
    }
    foreach (var kvp in d)
        Console.Write("{0} ", kvp.Value);
}

Saturday, March 4, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We reviewed the partition layer which translates objects such as blob, table or queue to storage. We saw how the partition manager maintains a massive Object Table that is sharded on different partition servers, which maintain their consistency across concurrent transactions using a lock service. We read the partition layer data model and the supported data types and operations, the partition layer architecture and the range partition data structures. We saw that the RangePartitions are made up of the metadata stream, commit log stream, row data stream and blob data stream. The in-memory data structures use the Memory Table, Index Cache, Row Data Cache and Bloom Filters. When the PS receives a write request to the RangePartition, it appends the operations into the commit log, and then puts the newly changed row into the memory table. The Partition Manager is responsible for assigning the sharded massive Object Table to partition servers and for the associated maintenance operations such as load balance, split and merge. The load for each RangePartition, as well as the overall load for each partition server, is tracked on the basis of transactions per second, average pending transaction count, throttling rate, CPU usage, network usage, request latency and the data size of the RangePartition. This information is sent by the PS in response to the heartbeat request from the PM. The PM decides if a PS has too much load and the PS then performs the actual split of its RangePartitions. The PM could also take away RangePartitions and assign them to another PS. For each stamp, there are usually around 75 splits and merges and 200 RangePartition load balances per day. In order to keep the load time of a freshly assigned RangePartition low, the commit log is kept small, with a checkpoint taken prior to the handoff.
The split and merge operations are also done by the partition servers on request from the PM. Splits are made at the location specified by the key comprising AccountName and PartitionName. The RangePartitions maintain the sizes of the objects and the key at which the size can be halved. The PS chooses a key based on Adaptive Range Profiling, a technique for finding the key ranges that carry the most load and using them to determine the split. Prior to each split, the RangePartition is checkpointed. During the split, the PS uses a special operation called MultiModify that takes the source stream and creates a new set of streams for the splits with the same extents in the same order as in the source. This is fast because it is merely pointer manipulation.
The merge operation requires that the source RangePartitions all come to the same partition server; the PM moves them there. The PS checkpoints both and runs the MultiModify operation, which in this case is a concatenation of all of the extents from their respective streams: all of the extents of one followed by all of the extents of the other. The PS then creates the metadata stream with the names of the new commit log and data stream, as well as the root of the data index in the merged data stream, which can then be correctly loaded.
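The following sketch illustrates why the split is cheap; Extent and StreamOfExtents are simplified stand-ins for the stream layer's structures. The child streams reference the same extents as the source in the same order, so only pointers are copied, never the extent data.
class Extent { public string Id; }                          // sealed, immutable unit at the stream layer
class StreamOfExtents { public List<Extent> Extents = new List<Extent>(); }

// A MultiModify-style split: both children point at the parent's extents.
void SplitStream(StreamOfExtents source, out StreamOfExtents low, out StreamOfExtents high)
{
    low = new StreamOfExtents();
    high = new StreamOfExtents();
    low.Extents.AddRange(source.Extents);   // pointer copies only; no data movement
    high.Extents.AddRange(source.Extents);
    // Each child RangePartition later serves only its half of the key range.
}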
#codingexercise
Print top view of a binary tree
void PrintTopView(Node root)
{
    // level-order traversal; keep the first node seen at each horizontal distance
    if (root == null) return;
    var d = new SortedDictionary<int, int>();
    var q = new Queue<Node>();
    root.dist = 0;
    q.Enqueue(root);
    while (q.Count > 0)
    {
        var item = q.Dequeue();
        if (d.ContainsKey(item.dist) == false)
            d[item.dist] = item.data;   // first (topmost) node at this horizontal distance
        if (item.left != null)
        {
            item.left.dist = item.dist - 1;
            q.Enqueue(item.left);
        }
        if (item.right != null)
        {
            item.right.dist = item.dist + 1;
            q.Enqueue(item.right);
        }
    }
    foreach (var kvp in d)
        Console.Write("{0} ", kvp.Value);
}


Friday, March 3, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We started reviewing the partition layer which translates objects such as blob, table or queue to storage. It maintains a massively scalable namespace for the objects and performs load balancing across the available partition servers using a data structure called the Object Table. The partition layer consists of the partition manager, partition servers and the lock service. The partition manager assigns RangePartitions from the Object Table to individual partition servers and stores this assignment in the Partition Map Table. The partition servers handle concurrent transactions to objects for a RangePartition using the lock service.
The data structure representing a RangePartition includes the Metadata Stream, the Commit Log Stream, the Row Data Stream and the Blob Data Stream. The Metadata Stream acts as the identifier, carrying all the information on the RangePartition including its commit log and data streams. It also carries runtime information such as the states of outstanding split and merge operations. The commit log is used to store the recent insert, update and delete operations since the last checkpoint. The two data streams store the checkpointed row data and the blob data bits respectively. At the stream layer, each of the above translates to a stream and as such is persisted. We now look at the in-memory data structures.
The in-memory version of the commit log for a RangePartition, holding changes that have not yet been checkpointed, is cached in a memory table. The index cache stores the checkpoint indexes of the row data stream; this differs from the row data cache, which stores the checkpoint row data pages. Bloom filters are kept for the checkpoints so that a lookup can skip checkpoints that cannot contain the row instead of blindly examining every data stream.
The partition server records each change in the memory table and the corresponding operation in the commit log. When either of them grows to a certain size, a checkpoint is written with the contents of the memory table stored on the row data stream. This frees up the commit log. To keep the number of checkpoints finite, they are periodically combined into a larger checkpoint. Blobs are treated somewhat differently: the blob data bits are written directly to the commit log stream and the BlobType property of the row tracks their location.
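A condensed sketch of that write path; the Row and Operation types and the threshold value are illustrative stand-ins rather than the WAS implementation:
class Row { }
class Operation { }
class RangePartitionWriter
{
    private readonly List<Operation> commitLog = new List<Operation>();  // durable record of recent changes
    private readonly List<Row> memoryTable = new List<Row>();            // rows changed since the last checkpoint
    private const int CheckpointThreshold = 1000;                        // illustrative size limit

    public void Write(Operation op, Row changedRow)
    {
        commitLog.Add(op);            // 1. append the operation to the commit log stream
        memoryTable.Add(changedRow);  // 2. apply the change to the in-memory table
        if (commitLog.Count > CheckpointThreshold || memoryTable.Count > CheckpointThreshold)
            Checkpoint();
    }

    private void Checkpoint()
    {
        PersistToRowDataStream(memoryTable); // write the memory table as a checkpoint on the row data stream
        memoryTable.Clear();
        commitLog.Clear();                   // the commit log stays small, so reloads after reassignment are fast
    }

    private void PersistToRowDataStream(List<Row> rows) { /* elided */ }
}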
Extents are stitched together pretty easily because  it just involves pointer operations.
Having looked at the data flow and operations, let us now look at the operations performed by the partition manager on the Object Table. These involve load balancing, split and merge operations. Splitting is done when a single RangePartition has too much load; the resulting partitions are then reassigned to two or more partition servers. The merge operation merges lightly used RangePartitions together. WAS keeps the total number of partitions between a low watermark and a high watermark.
#codingexercise
int GetMin(Node root)
{
    if (root == null) return int.MaxValue; // identity value for min on an empty tree
    while (root.left != null)
        root = root.left;                  // the leftmost node of a BST holds the minimum
    return root.data;
}

Thursday, March 2, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We were discussing the replication flow in the Windows Azure Storage service and we looked at the sealing operation among extent nodes. We also saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We now look at the partition layer which translates objects such as blob, table or queue to storage. It maintains a massively scalable namespace for the objects and performs load balancing across the available partition servers. It enforces transaction ordering and strong consistency for access to objects. The key data structure maintained by this layer is the Object Table (OT), which can grow to several petabytes. This table is dynamically and horizontally sharded into different RangePartitions. There are several types of Object Tables. For example, the Account Table stores metadata and configuration for each storage account assigned to the stamp, the Blob Table stores all blob objects for all accounts in the stamp, the Entity Table stores all entity rows for all accounts in the stamp, and the Message Table stores all messages for all accounts' queues in the stamp. Each of these OTs has a fixed schema stored in the Schema Table. The primary key for the Blob, Entity and Message Tables consists of AccountName, PartitionName and ObjectName, which provide the indexing and sort order for those Object Tables.
Among the supported data types and operations, the BlobType is a special property introduced to store large amounts of data, backed by the blob data stream with a list of extent + offset, length pointers.
Because the partition layer deals with objects and their namespace, it uses a partition manager, partition servers and a lock service.
The partition manager splits the Object Table into RangePartitions and maintains a map of each RangePartition to its designated server. The partition manager has many instances; the instances compete for a lock from the lock service, and the holder of the lock is the leader. This ensures only one PM instance is actively making assignments at a time. Each partition server maintains its set of RangePartitions, persisting their state in streams and keeping track of the partitions. Although a set of RangePartitions is assigned to exactly one active partition server, that server uses the lock service to provide strong consistency and ordering of concurrent transactions to objects for a RangePartition it is serving. The PM also uses the lock service to ensure that there is only one active partition server for a set of RangePartitions in its map table. The same partition server can, however, handle more than one set of RangePartitions, usually from different Object Tables.
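A minimal sketch of that leader election, assuming a hypothetical lease-based lock service; the ILockService interface and its method names are illustrative, not the WAS API:
interface ILockService
{
    bool TryAcquire(string lockName, TimeSpan lease); // true if the caller now holds the lease
    bool Renew(string lockName);                      // must be called before the lease expires
}
class PartitionManagerInstance
{
    private readonly ILockService locks;
    public PartitionManagerInstance(ILockService locks) { this.locks = locks; }
    public void Run()
    {
        while (true)
        {
            if (locks.TryAcquire("PM-leader", TimeSpan.FromSeconds(30)))
            {
                do
                {
                    AssignRangePartitions();             // update the Partition Map Table, balance load, split, merge
                    Thread.Sleep(TimeSpan.FromSeconds(10));
                } while (locks.Renew("PM-leader"));      // keep leading while the lease can be renewed
            }
            Thread.Sleep(TimeSpan.FromSeconds(5));       // another instance is the leader; stand by
        }
    }
    private void AssignRangePartitions() { /* elided */ }
}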

#codingexercise
Given a BinaryTree and a value, check if the path sum from root to leaf equals the given value.
void getPathSum(Node root, int sum, ref List<Node> path, ref List<List<Node>> paths)
{
if (root == null) return;
path.Add(root);
getPathSum(root.left, sum, ref path, ref paths);
getPathSum(root.right, sum, ref path, ref paths);
if (root.left == null && root.right == null && path.Sum(x => x.data) == sum) // leaf reached; Sum requires System.Linq
    paths.Add(new List<Node>(path)); // add a copy of the current path
path.RemoveAt(path.Count - 1); // backtrack
}
We can compare the above to 
bool hasPathSum(Node root, int sum)
{
if (root == null) return sum == 0;
int newsum = sum - root.data;
if (newsum == 0 && root.left == null && root.right == null) return true;
return hasPathSum(root.left, newsum) || hasPathSum(root.right, newsum);
}
Two nodes of a BST are swapped, find them
void InOrderTraverse(Node root, ref Node prev, ref List<Node> pairs)
{
if (root == null) return;
InOrderTraverse(root.left, ref prev, ref pairs);
if (prev != null && root.data < prev.data)
{
    // the inorder sequence is out of order here
    if (pairs.Count == 0) pairs.Add(prev); // first violation: the earlier swapped node
    pairs.Add(root);                       // keep the latest offender as the later swapped node
}
prev = root;
InOrderTraverse(root.right, ref prev, ref pairs);
}
// The swapped nodes are pairs[0] and pairs[pairs.Count - 1].
The same pair can also be found by serializing the tree in order first and comparing the sequence with its sorted version.
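A sketch of that approach: collect the nodes in order, compare against the sorted order of their keys, and the first and last mismatch are the swapped nodes.
List<Node> SerializeInOrder(Node root)
{
    var nodes = new List<Node>();
    var stack = new Stack<Node>();
    var current = root;
    while (current != null || stack.Count > 0)        // iterative inorder traversal
    {
        while (current != null) { stack.Push(current); current = current.left; }
        current = stack.Pop();
        nodes.Add(current);
        current = current.right;
    }
    return nodes;
}
void FindSwapped(Node root, out Node first, out Node second)
{
    var inorder = SerializeInOrder(root);
    var sorted = inorder.OrderBy(n => n.data).ToList(); // requires System.Linq; the order the keys should be in
    first = null; second = null;
    for (int i = 0; i < inorder.Count; i++)
        if (inorder[i].data != sorted[i].data)
        {
            if (first == null) first = inorder[i];      // earliest mismatch
            second = inorder[i];                        // latest mismatch
        }
}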

Wednesday, March 1, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We were discussing the replication flow in the Windows Azure Storage service and we looked at the sealing operation among extent nodes. We also saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican.
WAS also had to come up with its own IO scheduling. This was due to the design of many hard disk drives that favored throughput over fairness: a disk servicing large pipelined reads or writes could lock out non-sequential IO for as long as 2,300 milliseconds. To avoid this problem, WAS avoids scheduling new IO to a spindle when there is already over 100ms of expected pending IO, or when any pending IO request has been scheduled but not serviced for over 200ms. This improved fairness but came at the cost of some latency on sequential requests.
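A simplified sketch of that admission rule; the PendingIo bookkeeping is illustrative. New IO is held back from a spindle that already has too much queued work or a request that has waited too long.
class PendingIo
{
    public DateTime ScheduledAt;
    public double ExpectedMilliseconds;
}
bool CanScheduleNewIo(List<PendingIo> pendingOnSpindle, DateTime now)
{
    double expectedPendingMs = pendingOnSpindle.Sum(io => io.ExpectedMilliseconds);              // requires System.Linq
    bool anyStarved = pendingOnSpindle.Any(io => (now - io.ScheduledAt).TotalMilliseconds > 200);
    // Hold back new IO when more than 100ms of work is queued or any request has waited over 200ms,
    // trading a little sequential throughput for fairness to non-sequential requests.
    return expectedPendingMs <= 100 && !anyStarved;
}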
WAS also improves durability by making sure the replicas are stored on power-safe storage, which safeguards the data against a cluster-wide power failure.
Moreover, on each extent node, a whole hard disk or SSD is reserved as a journal drive for all writes into the extent node. Since the journal data is time-series and entirely sequential, WAS can take advantage of the full throughput of the disk. When a write happens, it is first written in full to the journal drive and then queued up to the data disk where the extent files live on that EN. Once the journal write succeeds, the data is buffered in memory, which handles reads until it is flushed to the data disks. By virtue of buffering, writes can also be consolidated. The tradeoff is an extra write for each block in exchange for better and more consistent latency, since the data-disk write is taken off the critical path. It may be interesting to note that all the stream data is append-only, and yet journaling has benefits because the journal appends do not contend with the traffic going to the data disks. The journal allows the append times seen by the partition layer to be lower and more consistent.
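A minimal sketch of that journaled write path on an extent node; the journal, data-disk queue and in-memory buffer below are illustrative stand-ins:
class ExtentNodeWriter
{
    private readonly Queue<byte[]> toDataDisk = new Queue<byte[]>();                         // blocks waiting to reach the extent files
    private readonly Dictionary<long, byte[]> memoryBuffer = new Dictionary<long, byte[]>(); // serves reads until the flush completes

    public void Write(long offset, byte[] block)
    {
        AppendToJournal(block);                       // 1. sequential append to the dedicated journal drive
        memoryBuffer[offset] = block;                 // 2. acknowledged once journaled; reads come from memory
        lock (toDataDisk) toDataDisk.Enqueue(block);  // 3. flushed to the data disk later, possibly consolidated
    }

    public byte[] Read(long offset)
    {
        byte[] block;
        // Reads prefer the buffered copy so they do not contend with journal appends.
        return memoryBuffer.TryGetValue(offset, out block) ? block : ReadFromDataDisk(offset);
    }

    private void AppendToJournal(byte[] block) { /* sequential write to the journal drive */ }
    private byte[] ReadFromDataDisk(long offset) { /* read from the extent files */ return null; }
}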


#codingexercise
Print all root to leaf paths
void PrintAllRootToLeaf(Node root, ref List<Node> path)
{
if (root == null) return;
path.Add(root);
PrintAllRootToLeaf(root.left, ref path);
PrintAllRootToLeaf(root.right, ref path);
if (root.left == null && root.right == null)
{
    // reached a leaf: print the current root-to-leaf path
    path.ForEach(x => Console.Write("{0} ", x.data));
    Console.WriteLine();
}
path.RemoveAt(path.Count - 1); // backtrack
}
Given a BinaryTree and a value, check if the path sum from root to leaf equals the given value.
void getPathSum(Node root, int sum, ref List<Node> path, ref List<List<Node>> paths)
{
if (root == null) return;
path.Add(root);
getPathSum(root.left, sum, ref path, ref paths);
getPathSum(root.right, sum, ref path, ref paths);
if (root.left == null && root.right == null && path.Sum(x => x.data) == sum) // leaf reached; Sum requires System.Linq
    paths.Add(new List<Node>(path)); // add a copy of the current path
path.RemoveAt(path.Count - 1); // backtrack
}
We can compare the above to 
bool hasPathSum(Node root, int sum)
{
if (root == null) return sum == 0;
int newsum = sum - root.data;
if (newsum == 0 && root.left == null && root.right == null) return true;
return hasPathSum(root.left, newsum) || hasPathSum(root.right, newsum);
}