Friday, March 3, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We saw how erasure coding works in Windows Azure Storage (WAS) and how it differs from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We started reviewing the partition layer, which translates objects such as blobs, tables or queues to storage. It maintains a massively scalable namespace for the objects and performs load balancing across the available partition servers using a data structure called the Object Table. The partition layer consists of the partition manager, partition servers and a lock service. The partition manager assigns RangePartitions from the Object Table to individual partition servers and stores this assignment in the Partition Map Table. The partition servers handle concurrent transactions to objects for a RangePartition using the lock service.
The data structure representing a RangePartition includes the Metadata Stream, the Commit Log Stream, the Row Data Stream and the Blob Data Stream. The Metadata Stream acts as the identifier, with all the information on the RangePartition including pointers to its commit log and data. It also carries runtime information such as the states of outstanding split and merge operations. The commit log stores the recent insert, update and delete operations since the last checkpoint. The two data streams store the checkpoint row data and the blob data bits respectively. At the stream layer, each of the above translates to a stream and as such is persisted. We now look at the in-memory data structures.
The in-memory version of the commit log for a RangePartition, holding changes that have not yet been checkpointed, is cached in a memory table. The index cache stores the checkpoint indexes of the row data stream. This differs from the row data cache, which stores the checkpoint row data pages. Bloom filters are used to tell whether a checkpoint may contain the row being looked up, so that a search does not blindly examine every checkpoint in the data stream.
The primary server records changes in the memory table and the corresponding change logs in the commit log. When either of them grows to a certain size, a checkpoint is taken and the contents of the memory table are stored on the row data stream. This frees up the commit log. To keep the number of checkpoints finite, they are periodically combined into a larger checkpoint. Blobs are treated somewhat differently: the blob data bit writes are made directly to the commit log stream and the BlobType property of the row tracks their location.
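As a rough sketch of that flow (the class, the string-typed mutations and the size threshold are illustrative assumptions, not WAS code; requires System.Collections.Generic):

class RangePartitionState
{
    public List<string> CommitLog = new List<string>();
    public List<string> MemoryTable = new List<string>();
    public List<List<string>> RowDataCheckpoints = new List<List<string>>();
    const int Threshold = 1000;   // assumed size limit before a checkpoint is taken

    public void Apply(string mutation)
    {
        CommitLog.Add(mutation);     // log the change for durability
        MemoryTable.Add(mutation);   // serve reads from memory until checkpointed
        if (CommitLog.Count > Threshold || MemoryTable.Count > Threshold)
            Checkpoint();
    }

    void Checkpoint()
    {
        // Persist the memory table as a checkpoint on the row data stream,
        // then free the commit log entries it covers.
        RowDataCheckpoints.Add(new List<string>(MemoryTable));
        MemoryTable.Clear();
        CommitLog.Clear();
    }
}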
Extents can be stitched together cheaply because doing so only involves pointer operations rather than copying data.
Having looked at the data flow and operations, let us now look at the operations performed by the partition manager on the Object Table. These are load balancing, split and merge operations. A split is done when a single RangePartition has too much load: the partition is split and the resulting RangePartitions are reassigned to two or more partition servers. The merge operation combines lightly used RangePartitions. WAS keeps the total number of partitions between a low watermark and a high watermark, as sketched below.
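A rough sketch of that balancing loop; the RangePartition type, thresholds and merge policy here are assumptions for illustration, not the actual PM logic (requires System.Collections.Generic and System.Linq):

class RangePartition { public int Load; }

static void Balance(List<RangePartition> partitions, int lowWatermark, int highWatermark,
                    int splitLoad, int mergeLoad)
{
    // Split: a partition under too much load becomes two smaller ones, each of
    // which the PM can assign to a different partition server.
    foreach (var hot in partitions.Where(p => p.Load > splitLoad).ToList())
    {
        partitions.Remove(hot);
        partitions.Add(new RangePartition { Load = hot.Load / 2 });
        partitions.Add(new RangePartition { Load = hot.Load - hot.Load / 2 });
    }

    // Merge: when the total count drifts past the high watermark, combine the
    // most lightly used partitions until the count is back near the low watermark.
    if (partitions.Count > highWatermark)
    {
        var cold = partitions.Where(p => p.Load < mergeLoad).OrderBy(p => p.Load).ToList();
        while (partitions.Count > lowWatermark && cold.Count >= 2)
        {
            var merged = new RangePartition { Load = cold[0].Load + cold[1].Load };
            partitions.Remove(cold[0]);
            partitions.Remove(cold[1]);
            cold.RemoveRange(0, 2);
            partitions.Add(merged);
        }
    }
}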
#codingexercise
// Returns the minimum value in a binary search tree: keep following left children.
int GetMin(Node root)
{
    if (root == null) return int.MinValue;
    while (root.left != null)
        root = root.left;
    return root.data;
}

Thursday, March 2, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We were discussing the replication flow in the Windows Azure Storage service and we looked at the sealing operation among extent nodes. We also saw how erasure coding works in Windows Azure Storage (WAS) and how it differs from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We now look at the partition layer, which translates objects such as blobs, tables or queues to storage. It maintains a massively scalable namespace for the objects and performs load balancing across the available partition servers. It enforces transaction ordering and strong consistency for access to objects. The key data structure maintained by this layer is the Object Table (OT), which can grow to several petabytes. This table is dynamically and horizontally sharded into RangePartitions. There are several types of Object Tables. For example, the Account Table stores metadata and configuration for each storage account assigned to the stamp, the Blob Table stores all blob objects for all accounts in the stamp, the Entity Table stores all entity rows for all accounts in the stamp, and the Message Table stores all messages for all accounts' queues in the stamp. Each of these OTs has a fixed schema stored in the Schema Table. The primary key for the Blob, Entity and Message Tables consists of AccountName, PartitionName and ObjectName, which provide the indexing and sort order for those Object Tables, as in the sketch below.
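For illustration only, a composite key with that sort order might be compared like this (the struct is an assumption, not WAS code; requires System):

struct ObjectKey : IComparable<ObjectKey>
{
    public string AccountName, PartitionName, ObjectName;

    public int CompareTo(ObjectKey other)
    {
        // Order by AccountName first, then PartitionName, then ObjectName.
        int c = string.CompareOrdinal(AccountName, other.AccountName);
        if (c != 0) return c;
        c = string.CompareOrdinal(PartitionName, other.PartitionName);
        if (c != 0) return c;
        return string.CompareOrdinal(ObjectName, other.ObjectName);
    }
}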
Among the supported data types and operations, BlobType is a special property introduced to store large amounts of data: the row holds a list of (extent + offset, length) pointers into the blob data stream rather than the data itself, as in the sketch below.
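A hypothetical illustration of that layout (names assumed; requires System.Collections.Generic):

struct ExtentPointer
{
    public string ExtentId;   // which extent holds the bits
    public long Offset;       // where this piece starts within the extent
    public int Length;        // how many bytes this piece covers
}

class BlobTypeProperty
{
    // The row stores only these pointers; the blob bits live in the blob data stream.
    public List<ExtentPointer> Pieces = new List<ExtentPointer>();
}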
Because the partition layer deals with objects and a namespace over them, it uses a partition manager, partition servers and a lock service.
The partition manager splits the Object Table into RangePartitions and maintains a map of each RangePartition to its designated partition server. The partition manager runs as multiple instances; the instances compete for a lock from the lock service and the one that obtains it becomes the leader. Each partition server maintains its set of RangePartitions, persisting their state in streams and keeping track of the partitions it serves. Although a set of RangePartitions is assigned to exactly one active partition server, that server uses the lock service to provide strong consistency and ordering of concurrent transactions to objects for each RangePartition it is serving. The partition manager also uses the lock service to ensure that there is only one active partition server for a set of RangePartitions in its map table. The same partition server can, however, handle more than one set of RangePartitions, usually from different Object Tables. A sketch of the leader-lease loop follows.
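A minimal sketch of that leader-lease loop, with the lock service hidden behind hypothetical interfaces since its real API is not described here (requires System):

interface ILease : IDisposable { bool IsHeld { get; } void Renew(); }
interface ILockService { ILease Acquire(string lockName, string owner); }

class PartitionManagerInstance
{
    public void Run(ILockService locks, string instanceId)
    {
        while (true)
        {
            // Blocks or retries until this instance wins the leader lease.
            using (var lease = locks.Acquire("partition-manager-leader", instanceId))
            {
                while (lease.IsHeld)
                {
                    AssignAndBalance();   // only the leader updates the Partition Map Table
                    lease.Renew();
                }
            }
            // Lease lost: another instance leads now; go back to standby and retry.
        }
    }

    void AssignAndBalance() { /* assign RangePartitions to partition servers, load balance */ }
}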

#codingexercise
Given a BinaryTree and a value, check if the path sum from root to leaf equals the given value.
void getPathSum(Node root, int sum, ref List<Node> path, ref List<List<Node>> paths)
{
    if (root == null) return;
    path.Add(root);
    getPathSum(root.left, sum, ref path, ref paths);
    getPathSum(root.right, sum, ref path, ref paths);
    // At a leaf, compare the accumulated path sum against the target.
    if (root.left == null && root.right == null && path.Sum(x => x.data) == sum)
        paths.Add(new List<Node>(path));   // copy, since path keeps changing
    path.RemoveAt(path.Count - 1);
}
We can compare the above to 
bool hasPathSum(Node root, int sum)
{
    if (root == null) return false;
    int newsum = sum - root.data;
    if (newsum == 0 && root.left == null && root.right == null) return true;
    return hasPathSum(root.left, newsum) || hasPathSum(root.right, newsum);
}
Two nodes of a BST are swapped, find them
// A swapped pair shows up wherever an in-order predecessor exceeds the current node.
void InOrderTraverse(Node root, ref Node prev, ref List<Node> pairs)
{
    if (root == null) return;
    InOrderTraverse(root.left, ref prev, ref pairs);
    if (prev != null && root.data < prev.data) { pairs.Add(prev); pairs.Add(root); }
    prev = root;
    InOrderTraverse(root.right, ref prev, ref pairs);
}
The same pair can also be found by serializing the tree in order first and scanning the sequence for inversions.

Wednesday, March 1, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We were discussing the replication flow in the Windows Azure Storage service and we looked at the sealing operation among extent nodes. We also saw how erasure coding works in Windows Azure Storage (WAS) and how it differs from Pelican.
WAS also had to come up with its own IO scheduling. This was due to the design of many hard disk drives that favor throughput over fairness, which meant a long stream of sequential requests could lock non-sequential IO out of a disk for as long as 2300 milliseconds while large pipelined reads or writes were serviced. To avoid this problem, WAS does not schedule new IO to a spindle when there is already over 100ms of expected pending IO, or when any pending request has been scheduled but not serviced for over 200ms. This improved fairness but came at the cost of some latency on sequential requests. A simplified version of the admission rule is sketched below.
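A simplified sketch of such an admission rule; the two thresholds come from the description above, everything else is assumed (requires System.Collections.Generic and System.Linq):

bool CanScheduleNewIO(double expectedPendingMs, IEnumerable<double> pendingWaitTimesMs)
{
    if (expectedPendingMs > 100) return false;               // too much work already queued on the spindle
    if (pendingWaitTimesMs.Any(w => w > 200)) return false;  // a queued request is already being starved
    return true;                                              // safe to admit the new IO
}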
WAS also improves durability by making sure the replicas are stored on power-safe storage, which safeguards them from a cluster-wide power failure.
Moreover, on each extent node, a whole hard disk or SSD is reserved as a journal drive for all writes into that extent node. Since the journal data is written sequentially, this lets WAS take advantage of the full throughput of the disk. When a write happens, all of it is first written to the journal drive and then queued up to the data disk where the extent files live on that EN. When the journal write succeeds, the data is also buffered in memory, which handles reads until the data is flushed to the data disks. By virtue of buffering, writes can also be consolidated. The tradeoff is good latency at the cost of an extra write off the critical path. It may be interesting to note that all the stream data is append-only, and yet journaling has benefits because the journal appends do not contend with the traffic going to the data disks. The journal allows append times from the partition layer to have more consistent and lower latencies. The write path is sketched below.
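A rough sketch of that write path with hypothetical types and no real disk IO (requires System.Collections.Generic):

class ExtentNodeWritePath
{
    readonly List<byte[]> memoryBuffer = new List<byte[]>();     // serves reads until flushed
    readonly Queue<byte[]> dataDiskQueue = new Queue<byte[]>();  // flushed and consolidated later

    public void Append(byte[] block)
    {
        WriteToJournal(block);          // sequential append; gets the journal drive's full throughput
        memoryBuffer.Add(block);        // reads hit memory, not the journal
        dataDiskQueue.Enqueue(block);   // the extra write to the data disk is off the critical path
    }

    void WriteToJournal(byte[] block) { /* sequential write to the dedicated journal drive */ }
}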


#codingexercise
Print all root to leaf paths
void PrintAllRootToLeaf(Node root, ref List<Node> path)
{
    if (root == null) return;
    path.Add(root);
    PrintAllRootToLeaf(root.left, ref path);
    PrintAllRootToLeaf(root.right, ref path);
    if (root.left == null && root.right == null)
    {
        path.ForEach(x => Console.Write("{0} ", x.data));   // print one complete root-to-leaf path
        Console.WriteLine();
    }
    path.RemoveAt(path.Count - 1);
}
Given a BinaryTree and a value, check if the path sum from root to leaf equals the given value.
void getPathSum(Node root, int sum, ref List<Node> path, ref List<List<Node>> paths)
{
    if (root == null) return;
    path.Add(root);
    getPathSum(root.left, sum, ref path, ref paths);
    getPathSum(root.right, sum, ref path, ref paths);
    // At a leaf, compare the accumulated path sum against the target.
    if (root.left == null && root.right == null && path.Sum(x => x.data) == sum)
        paths.Add(new List<Node>(path));   // copy, since path keeps changing
    path.RemoveAt(path.Count - 1);
}
We can compare the above to 
bool hasPathSum(Node root, int sum)
{
    if (root == null) return false;
    int newsum = sum - root.data;
    if (newsum == 0 && root.left == null && root.right == null) return true;
    return hasPathSum(root.left, newsum) || hasPathSum(root.right, newsum);
}

Tuesday, February 28, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We were discussing the replication flow in the Windows Azure Storage service and we looked at the sealing operation among extent nodes.
Sealing the extent means that its commit length never changes again. To seal, all three ENs are asked for their current commit length; if one or more replicas have a different length, the minimum commit length is chosen. The partition layer has two different read patterns to handle failures: reads at a known offset and length for data streams, and sequential reads for metadata and commit logs. At the start of a partition load, a check for commit length is requested from the primary EN of the last extent of these two streams. This checks whether all the replicas are available and that they all have the same length. If not, the extent is sealed and reads during partition load are only performed against a replica sealed by the SM.
Extents that are sealed are erasure coded. The Reed-Solomon erasure coding algorithm is used for the Windows Azure Storage (WAS) service. Note that this is not the same as Pelican, which uses a different erasure coding scheme for exascale cold storage. Here, an extent is broken into n roughly equal-sized fragments, usually at block-level boundaries, and m error-correcting fragments are added; as long as no more than m fragments are lost, the extent can be recreated. Erasure coding does not conflict with replication; in fact, it increases the durability of the data.
In order to keep reads efficient, a deadline is added to read requests. If the deadline cannot be met, the read request is failed, in which case the client selects a different EN to read the data from. Since there are three replicas, this also gives some load balancing. The same method is used with erasure-coded data. A sketch of this failover follows.
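A sketch of that deadline-and-failover read, with a hypothetical IExtentNode interface standing in for the real client API (requires System and System.Collections.Generic):

interface IExtentNode
{
    bool TryRead(long offset, int length, TimeSpan deadline, out byte[] data);
}

byte[] ReadWithDeadline(IList<IExtentNode> replicas, long offset, int length, TimeSpan deadline)
{
    foreach (var en in replicas)
    {
        if (en.TryRead(offset, length, deadline, out byte[] data))
            return data;                       // served within the deadline
        // Deadline missed on this EN; fall through and try another replica.
    }
    throw new TimeoutException("no replica served the read within the deadline");
}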
#codingexercise 
Get the max sum level in a binary tree
// Level-order traversal with a delimiter node marking the end of each level;
// the traversal (including delimiters) is then split into levels and the level
// with the maximum sum of node values is returned.
int GetMaxLevel(Node root, Node delimiter)
{
    if (root == null) return -1;
    var q = new Queue<Node>();
    var traversal = new Queue<Node>();
    q.Enqueue(root);
    q.Enqueue(delimiter);
    while (q.Count > 0)
    {
        var node = q.Dequeue();
        traversal.Enqueue(node);
        if (node == delimiter)
        {
            if (q.Count > 0) q.Enqueue(delimiter);   // close off the next level
        }
        else
        {
            if (node.left != null) q.Enqueue(node.left);
            if (node.right != null) q.Enqueue(node.right);
        }
    }
    var levels = traversal.Split(delimiter);
    int max = int.MinValue;
    int result = 0;
    for (int level = 0; level < levels.Count; level++)
    {
        int sum = levels[level].Sum(x => x.data);
        if (sum > max) { max = sum; result = level; }
    }
    return result;
}
// Extension method (must live in a static class): consumes the queue and
// groups the nodes into per-level lists using the delimiter as a separator.
static List<List<Node>> Split(this Queue<Node> q, Node delimiter)
{
    var result = new List<List<Node>>();
    var current = new List<Node>();
    while (q.Count > 0)
    {
        var node = q.Dequeue();
        if (node == delimiter)
        {
            if (current.Count > 0) result.Add(current);
            current = new List<Node>();
        }
        else
        {
            current.Add(node);
        }
    }
    if (current.Count > 0) result.Add(current);
    return result;
}

Monday, February 27, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We were discussing the replication flow in the Windows Azure Storage service. We now look at the sealing operation among extent nodes.
Sealing the extent means that its commit length never changes again. To seal, all three ENs are asked for their current commit length; if one or more replicas have a different length, the minimum commit length is chosen. This does not cause data loss, because the primary EN does not return success to the client unless an append has been written to all the replicas, so the minimum commit length contains every acknowledged append. The chosen replica may even contain additional blocks that were never acknowledged to the client; this is fine because the client deals with timeouts and duplicates, and all of the commit records carry sequence numbers that allow duplicate records to be detected and discarded. A sketch of the seal decision follows.
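A minimal sketch of the seal decision under those assumptions; a null entry stands for a replica the SM could not reach (requires System.Collections.Generic and System.Linq):

long ChooseSealCommitLength(IList<long?> reportedCommitLengths)
{
    // Every acknowledged append reached all replicas, so the smallest reported
    // commit length still contains all acknowledged data.
    return reportedCommitLengths.Where(l => l.HasValue).Min(l => l.Value);
}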
If the SM cannot talk to an EN during sealing but the client can, the client may see a discrepancy between that replica and the sealed commit length. The partition layer has two different read patterns to handle this. If the records are at known locations, the partition layer reads at a specific offset for a specific length. This is true for the data streams, which are of two types: row and blob. The location information is available from a previous successful append at the stream layer; that append was successfully committed only if it made it to all three replicas.
The other read pattern is when a stream is read sequentially from start to finish. This occurs for the metadata and commit log streams; these are the only streams that the partition layer reads sequentially from a starting point to the very last record. During partition load, the partition layer prevents any appends to these two streams. At the start of the load, a check for commit length is requested from the primary EN of the last extent of these two streams. This checks whether all the replicas are available and that they all have the same length. If not, the extent is sealed and reads during partition load are only performed against a replica sealed by the SM. This works even for repeated loads.
It may be interesting to note that the two types of access are determined by the nature of the stream being read and not merely by access frequency: the access pattern for metadata and commit logs differs from that of data, which is read more frequently and at arbitrary locations.


#codingexercise
Print all nodes in a binary tree with k leaves.

// Returns the number of leaves under root, collecting nodes whose subtree has exactly k leaves.
int GetNodeWithKLeaves(Node root, int k, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null) return 1;
    int left = GetNodeWithKLeaves(root.left, k, ref result);
    int right = GetNodeWithKLeaves(root.right, k, ref result);
    if (left + right == k)
    {
        result.Add(root);
    }
    return left + right;
}

Sunday, February 26, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage.
The IaaS offerings from Azure storage services include disks and files, whereas the PaaS offerings include objects, tables and queues. The storage offerings are built on a unified distributed storage system with guarantees for durability, encryption at rest, strongly consistent replication, fault tolerance and automatic load balancing.
The redundancy features are handled by storage stamps and a location service. A storage stamp is a cluster of N racks of storage nodes, where each rack is built out as a separate fault domain with redundant networking and power. Each storage stamp is reached through a VIP and served by the front-end, partition and stream layers, with the stream layer providing intra-stamp replication. Intra-stamp replication is synchronous and ensures that the data is made durable within the storage stamp. Inter-stamp replication is asynchronous and is focused on replicating data across stamps. We were looking at blocks, extents and streams, the Stream Manager and the extent nodes.

We were discussing the replication flow; let us summarize it. First a stream is created; its last extent has three replicas, one primary and two secondary, distributed across fault and upgrade domains. The SM decides which replica is the primary, and the client learns which replica is the primary and which extent nodes hold the replicas; this state is maintained in the SM cache. Writes go to the primary and from the primary to the secondaries. The locations of the replicas do not change until the extent is sealed, so no leases are needed to designate the primary EN for an extent. When one extent gets sealed, the process repeats for another extent. The client can read the information from any replica, including the last block. The primary decides the offset of an append in the extent, orders the appends for concurrent requests, relays the chosen offset to the secondaries, and returns success to the client only after all three appends have succeeded. Successes are returned in the order of append offsets. Since appends commit in order on a replica, the last append position is the current commit length of the replica. This is helpful for determining the integrity of the replicas, because the primary EN does not change, it sequences the appends and it seals the extent.
When a write to any of the replicas fails, a write failure is returned to the client. The client then contacts the SM, which seals the extent at its current commit length. A new extent is allocated on a different set of ENs and its location is relayed to the client, so appends continue on the new extent. The append path is sketched below.
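A rough sketch of that append path, with a hypothetical IExtentReplica interface; the failure handling described above (asking the SM to seal and allocate a new extent) is left to the caller (requires System.Collections.Generic and System.IO):

interface IExtentReplica
{
    long CommitLength { get; set; }
    bool Write(long offset, byte[] block);
}

long AppendBlock(IExtentReplica primary, IList<IExtentReplica> secondaries, byte[] block)
{
    long offset = primary.CommitLength;           // the primary sequences concurrent appends
    if (!primary.Write(offset, block))
        throw new IOException("append failed on the primary");
    foreach (var s in secondaries)
        if (!s.Write(offset, block))
            throw new IOException("append failed on a secondary");
    primary.CommitLength = offset + block.Length; // only now is success returned to the client
    return offset;
}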

#codingexercise
Determine if two arrays represent the same BST without manipulating the arrays.
If the arrays A and B could be manipulated, we could check that sorting A and B yields the same sequence, which verifies that both arrays contain the same elements (a necessary condition, though not by itself proof that the two insertion orders build the same BST),
or we could compare the trees directly with
MakeBST(A) == MakeBST(B)
where MakeBST(A) = 
{
    Node bst = null;
    for (int i = 0; i < A.Length; i++)
        bst = Insert(bst, A[i]);   // Insert returns the (possibly new) root
    return bst;
}
If the two BSTs are identical then each element occupies the same level and has the same subtrees for both arrays. Consequently we could try a pairwise comparison of elements across the two arrays, but that costs up to n-1 comparisons for each of the n elements and is less efficient than simply building the BSTs by insertion. Moreover, this kind of checking requires partitioning the remaining items into separate ranges, those lower than the root and those higher than the root, and repeating the check on each range; keeping track of levels and siblings without building the tree is awkward. Finally, contrast this with just sorting the elements of the arrays and comparing them for equality: that is the cheapest test, but it only establishes that the arrays hold the same elements, while the range-by-range check below is what actually establishes that they describe the same BST.
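A minimal sketch of that range-by-range check, assuming distinct elements (requires System.Collections.Generic and System.Linq):

bool SameBST(List<int> a, List<int> b)
{
    if (a.Count != b.Count) return false;
    if (a.Count == 0) return true;
    if (a[0] != b[0]) return false;   // the first element of each array is the root
    // Elements smaller than the root form the left subtree, larger ones the right,
    // and their relative order within each array is preserved.
    var aLess = a.Skip(1).Where(x => x < a[0]).ToList();
    var bLess = b.Skip(1).Where(x => x < b[0]).ToList();
    var aMore = a.Skip(1).Where(x => x > a[0]).ToList();
    var bMore = b.Skip(1).Where(x => x > b[0]).ToList();
    return SameBST(aLess, bLess) && SameBST(aMore, bMore);
}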

Saturday, February 25, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage.
The IaaS offerings from Azure storage services include disks and files, whereas the PaaS offerings include objects, tables and queues. The storage offerings are built on a unified distributed storage system with guarantees for durability, encryption at rest, strongly consistent replication, fault tolerance and automatic load balancing.
The redundancy features are handled by storage stamps and a location service. A storage stamp is a cluster of N racks of storage nodes, where each rack is built out as a separate fault domain with redundant networking and power. Each storage stamp is reached through a VIP and served by the front-end, partition and stream layers, with the stream layer providing intra-stamp replication. Intra-stamp replication is synchronous and ensures that the data is made durable within the storage stamp. Inter-stamp replication is asynchronous and is focused on replicating data across stamps. We were looking at blocks, extents and streams, the Stream Manager and the extent nodes.
The stream layer is protected from malicious threats by the data center, the Fabric Controller and the Windows Azure Storage (WAS) service itself. Faults ranging from disk and node errors to power failures, network issues, bit flips, random hardware failures and software bugs may all result in data corruption; such corruption is caught by the block checksums mentioned earlier.
The replication flow proceeds like this. The Stream Manager assigns three replicas for the first extent (one primary and two secondary) to three extent nodes, chosen at random to spread the replicas across different fault and upgrade domains while considering extent node usage for load balancing. The primary is the main replica: all writes go to the primary, and the primary coordinates the writes to the secondaries. The primary EN and the locations of the three replicas do not change while the extent is being appended to, until it is sealed. When the SM allocates the extent, the extent locations are sent back to the client. When the last extent becomes sealed, a newly allocated extent becomes the last extent of the stream, and all new appends go to it.
If there are multiple outstanding appends to the same extent, the primary EN responds with success in the order of their append offsets. The last append position is considered to be the current commit length of the replica.
#codingexercise
Convert a binary tree to a binary search tree
// Collect the tree's values with an in-order walk, sort them, and rebuild a
// BST from the sorted list (SortedListToBST is defined below).
Node ToBSTFromBT(Node root)
{
    if (root == null) return root;
    var items = ToList(root);
    items.Sort();
    return SortedListToBST(items, 0, items.Count - 1, null);
}
List<int> ToList(Node root)
{
    var items = new List<int>();
    if (root == null) return items;
    items.AddRange(ToList(root.left));
    items.Add(root.data);
    items.AddRange(ToList(root.right));
    return items;
}

Node SortedListToBST(List<int> num, int start, int end, Node root)
{
    if (start > end) return null;   // base case: empty range

    int mid;
    if ((start + end) % 2 == 0)
        mid = (start + end) / 2;
    else
        mid = (start + end) / 2 + 1;

    var node = new Node();
    node.data = num[mid];

    // Optionally link to the parent passed in (the caller also links the
    // returned node, so this is redundant but harmless).
    if (root != null)
    {
        if (num[mid] < root.data)
            root.left = node;
        else
            root.right = node;
    }

    node.left = SortedListToBST(num, start, mid - 1, node);
    node.right = SortedListToBST(num, mid + 1, end, node);

    return node;
}