Friday, February 17, 2017

We continue with a detailed study of Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure networking. We discussed global DNS, dual stacks for IPV4 and IPV6, and load balancers. Azure load balancer services come at different levels: cross-region, in-region, application gateway and individual VMs.
Application Gateway can manage the backend with rich diagnostics including access and performance logs, VM scale set support and custom health probes.
The Web Application Firewall protects applications from web-based intrusions and is built using ModSecurity and the OWASP Core Rule Set. It is highly available and fully managed.
Microsoft Azure has far more ExpressRoute locations than any other public cloud. Microsoft Azure also has deeper insights into our network regardless of whether it is ExpressRoute, Virtual Network or Application Gateway.
We now review the Azure storage stack. The IaaS offerings from Azure storage services include disks and files, whereas the PaaS offerings include objects, tables and queues. The storage offerings are built on a unified distributed storage system with guarantees for durability, encryption at rest, strongly consistent replication, fault tolerance and automatic load balancing. The IaaS layer is made up of storage arrays, virtual machines and networking. The PaaS layer is made up of existing frameworks, web and mobile, microservices and serverless compute.
Disks are of two types in Azure - Premium disks (SSD) and Standard disks (HDD) - and are backed by page blobs in Azure storage. They offer high I/O performance and low latency, with more than 80,000 IOPS and more than 2,000 MB/sec of disk throughput per VM. Disks are offered out of about 26 Azure regions with server-side encryption at rest and Azure Disk Encryption with BitLocker/DM-Crypt. In addition, disks come with blob cache technology, enterprise-grade durability with three replicas, snapshots for backup, the ability to expand disks, and a REST interface for developers. Azure has additionally released features called Azure Backup support, Encryption at Rest, Azure Site Recovery Preview and Incremental Snapshot Copy. In the future, it plans to expand disk sizes and enhance disk analytics.
Azure Files is a fully managed cloud file storage for use with IaaS and on-premises instances. The scenarios covered include lift-and-shift migrations, hosting high-availability workload data, and enabling backup and disaster recovery. Azure Files supports multiple protocols and operating systems, including SMB 2.1 and 3.0. The files are globally accessible from both on-premises and IaaS instances and are available in all Azure regions, with high availability and durability.
Azure Files will also support snapshots, AD integration, increased scale limits, larger share sizes, encryption at rest and Backup integration.
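As a rough illustration of programmatic access (a sketch using the classic Azure Storage .NET SDK, Microsoft.WindowsAzure.Storage; the connection string, share and file names are placeholders and exact method names may vary by SDK version), the same share that is mounted over SMB can also be written through the API:
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.File;

class FileShareDemo
{
    static void Main()
    {
        // Placeholder connection string for the storage account.
        var account = CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...");
        var fileClient = account.CreateCloudFileClient();

        // Create (or reuse) a share; the same share can also be mounted over SMB from on-premises or IaaS VMs.
        var share = fileClient.GetShareReference("myshare");
        share.CreateIfNotExists();

        // Write a small file at the root of the share.
        var rootDir = share.GetRootDirectoryReference();
        var file = rootDir.GetFileReference("hello.txt");
        file.UploadText("Hello from Azure Files");
    }
}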
Azure blobs are of three types - block blobs, append blobs and page blobs. Block blobs are used for documents, images, video and similar objects. Append blobs are used for multi-writer, append-only scenarios such as logging and big data analytics output. Page blobs are used for page-aligned random reads and writes, such as IaaS disks, Event Hub and block-level backup.
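To make the three blob types concrete, here is a hedged sketch using the classic Azure Storage .NET SDK (Microsoft.WindowsAzure.Storage); the container and blob names are placeholders and exact method signatures may vary by SDK version:
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class BlobTypesDemo
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...");
        var client = account.CreateCloudBlobClient();
        var container = client.GetContainerReference("demo");
        container.CreateIfNotExists();

        // Block blob: documents, images, video and other whole objects.
        var blockBlob = container.GetBlockBlobReference("notes.txt");
        blockBlob.UploadText("a whole document uploaded as one object");

        // Append blob: multi-writer, append-only scenarios such as logging.
        var appendBlob = container.GetAppendBlobReference("app.log");
        if (!appendBlob.Exists()) appendBlob.CreateOrReplace();
        appendBlob.AppendText("service started" + System.Environment.NewLine);

        // Page blob: 512-byte page-aligned random reads and writes; this is what backs IaaS disks.
        var pageBlob = container.GetPageBlobReference("disk.vhd");
        pageBlob.Create(1024 * 1024);  // size must be a multiple of 512 bytes
    }
}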
#codingexercise
Convert a BST into a Min Heap
Node ToMinHeap(Node root)
{
    if (root == null) return null;
    // In-order traversal of a BST yields the keys in ascending order.
    var sorted = ToInOrderList(root);
    // A complete tree built from the sorted list in level order is a min heap.
    var heap = ToMinHeap(sorted);
    return heap;
}
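The two helpers are not defined above; a minimal sketch of what they might look like, assuming a Node type with data, left and right fields and using System.Collections.Generic:
List<int> ToInOrderList(Node root)
{
    var result = new List<int>();
    if (root == null) return result;
    result.AddRange(ToInOrderList(root.left));
    result.Add(root.data);
    result.AddRange(ToInOrderList(root.right));
    return result;
}

Node ToMinHeap(List<int> sorted)
{
    // Place the sorted keys into a complete binary tree in level order;
    // since the list is ascending, every parent is <= its children, i.e. a min heap.
    return BuildComplete(sorted, 0);
}

Node BuildComplete(List<int> sorted, int i)
{
    if (i >= sorted.Count) return null;
    var node = new Node { data = sorted[i] };  // assumes settable fields on Node
    node.left = BuildComplete(sorted, 2 * i + 1);
    node.right = BuildComplete(sorted, 2 * i + 2);
    return node;
}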

https://1drv.ms/w/s!Ashlm-Nw-wnWrwTmE6AXT0d_sX3k

Thursday, February 16, 2017

We continue with a detailed study of Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure networking.  We discussed global DNS, dual stacks for IPV4 and IPV6, and load balancers.
Azure load balancer services come at different levels. There is Traffic Manager for cross-region traffic direction, availability and exposure to the internet. There is the Azure Load Balancer which provides in-region scalability and availability. There is the Azure Application Gateway which offers URL/content-based routing and load balancing. Then there is load balancing on the VMs themselves for web servers.
Load balancers in Azure use multiple VIPs to simplify designs and reduce cost. They can be set up with internal and external VIPs with direct public association.
Application Gateway can manage the backend with rich diagnostics including access and performance logs, VM scale set support and custom health probes.
The Web Application Firewall protects applications from web-based intrusions and is built using ModSecurity and the OWASP Core Rule Set. It is highly available and fully managed.
Cross-premises connectivity is maintained with P2S SSTP tunnels and IPsec S2S VPN tunnels to Azure over the Internet. If there is a private WAN, an ExpressRoute connection is maintained to Azure.
There is dual redundancy with active-active gateways, which is new in Microsoft Azure and an improvement over the earlier active-standby mode. This leads to zero downtime during planned maintenance. It supports both cross-premises and VNet-to-VNet connectivity and spreads traffic over multiple tunnels simultaneously.
There are thirty-five ExpressRoute locations in Microsoft Azure, which is more than any other cloud; this has nearly doubled the peering locations and partners. There are also improvements in ExpressRoute: it supports up to 10 Gbps throughput to VNets, whereas the standard is usually 1 Gbps and the high-performance tier is 2 Gbps. It has the best cloud enterprise connectivity SLA. It offers more insights, self-help and troubleshooting tools, and improved monitoring, diagnostics and alerting than others. We can also see BGP routes, traffic statistics, and ARP tables.
Microsoft Azure has deeper insights into our network regardless of whether it is ExpressRoute, Virtual Network or Application Gateway. For example, ExpressRoute has peering connection statistics, an ARP table, a route summary and a route table. The virtual network has effective security rules on every NIC, and next hop and effective routes for every NIC in the subnet. The Application Gateway has metrics and alerts and backend health information.
#codingexercise
Find pairs with a given sum such that the pair's elements lie in different BSTs.
void GetPairSum(Node root1, Node root2, int sum)
{
    // In-order traversals give the keys of each BST in ascending order.
    var list1 = root1.ToInOrder();
    var list2 = root2.ToInOrder();
    // Two pointers: smallest of the first list against largest of the second.
    int left = 0;
    int right = list2.Count - 1;
    while (left < list1.Count && right >= 0)
    {
        if (list1[left] + list2[right] == sum)
        {
            Console.WriteLine("{0}:{1}", list1[left], list2[right]);
            left++;
            right--;
        }
        else if (list1[left] + list2[right] < sum)
        {
            left++;
        }
        else
        {
            right--;
        }
    }
}
It might be noted that this does not take care of duplicates.
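One hedged way to handle duplicates, operating directly on the two in-order lists (using System and System.Collections.Generic): when a matching pair is found, count the run of equal keys on each side, report every combination, and then advance past both runs.
void GetPairSumWithDuplicates(List<int> list1, List<int> list2, int sum)
{
    int left = 0;
    int right = list2.Count - 1;
    while (left < list1.Count && right >= 0)
    {
        int current = list1[left] + list2[right];
        if (current == sum)
        {
            // Count how many equal keys sit at each pointer.
            int count1 = 0;
            while (left + count1 < list1.Count && list1[left + count1] == list1[left]) count1++;
            int count2 = 0;
            while (right - count2 >= 0 && list2[right - count2] == list2[right]) count2++;
            // Every combination of the two runs forms a valid pair.
            for (int i = 0; i < count1 * count2; i++)
                Console.WriteLine("{0}:{1}", list1[left], list2[right]);
            left += count1;
            right -= count2;
        }
        else if (current < sum) left++;
        else right--;
    }
}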

Wednesday, February 15, 2017

We continue with a detailed study of Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure networking. 
The global DNS name resolution is pretty fast with very high availability. It is integrated with the Azure Resource Manager for role based access control, tagging and template based deployment - for both zones and record sets. 
Azure virtual machines support a native dual stack (IPV4 + IPV6) on both operating system flavors. It is available globally and maximizes the reach of Azure applications to mobile (4G) and IoT devices.
Azure load balancer services come at different levels. There is Traffic Manager for cross-region traffic direction, availability and exposure to the internet. There is the Azure Load Balancer which provides in-region scalability and availability. There is the Azure Application Gateway which offers URL/content-based routing and load balancing. Then there is load balancing on the VMs themselves for web servers.
Load balancers in Azure use multiple VIPs to simplify designs and reduce cost. There are multiple private VIPs on a load balancer, and the backend ports are reused using direct server return (DSR). Secondary NICs are also provided to enable connectivity to a restricted VNet. A load balancer can now be set up with internal and external VIPs with direct public association. Moreover, a NIC can now have multiple private IPs - static or dynamic - and multiple public IPs - static or dynamic - which unlocks NVA partners.
Application Gateway has layer 7 application delivery controller features. It can enforce security with SSL termination and allow/block SSL protocol versions. It can manage sessions and sites with cookie-based session affinity and multi-site hosting. It can manage content with URL-based routing. It can manage the backend with rich diagnostics including access and performance logs, VM scale set support and custom health probes.
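To illustrate the layer 7 behaviors described above (this is only a conceptual sketch of URL-based routing and cookie-based session affinity, not Application Gateway's implementation; all class and rule names are made up):
using System;
using System.Collections.Generic;
using System.Linq;

// Conceptual sketch only: URL-prefix routing rules plus cookie-based session affinity.
class Layer7Router
{
    readonly List<KeyValuePair<string, string[]>> rules;   // path prefix -> backend pool
    readonly Dictionary<string, string> affinity = new Dictionary<string, string>();
    int roundRobin;

    public Layer7Router(List<KeyValuePair<string, string[]>> rules) { this.rules = rules; }

    public string Route(string path, string sessionCookie)
    {
        // Cookie-based session affinity: a returning client sticks to the backend it was given.
        if (sessionCookie != null && affinity.TryGetValue(sessionCookie, out var pinned))
            return pinned;

        // URL-based routing: the first rule whose prefix matches the request path wins.
        var pool = rules.First(r => path.StartsWith(r.Key)).Value;
        var backend = pool[roundRobin++ % pool.Length];
        if (sessionCookie != null) affinity[sessionCookie] = backend;
        return backend;
    }
}
A rule list such as ("/images", image pool) followed by ("/", default pool) would then steer image requests to a dedicated pool while keeping each session pinned to one backend VM.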
The web application firewall has also been significantly improved based on the Open Web Application Security Project (owasp.org). WAF security protects applications from web-based intrusions and is built using ModSecurity and the OWASP Core Rule Set. It is highly available and fully managed. It is preconfigured for the most common web vulnerabilities such as SQL injection and XSS attacks.
Cross-premises connectivity is maintained with P2S SSTP tunnels and IPsec S2S VPN tunnels to Azure over the Internet. If there is a private WAN, an ExpressRoute connection is maintained to Azure.
#codingexercise
Find the size of the largest BST subtree in a given binary tree
int GetMaxBST(Node root)
{
    if (root == null) return 0;
    // If the whole subtree rooted here is a BST, its size is a candidate answer.
    if (IsBST(root))
        return TreeSize(root);
    return Math.Max(GetMaxBST(root.left), GetMaxBST(root.right));
}

bool IsBST(Node root)
{
    return IsBstHelper(root, int.MinValue, int.MaxValue);
}

bool IsBstHelper(Node root, int min, int max)
{
    if (root == null) return true;
    if (root.data < min || root.data > max) return false;
    return IsBstHelper(root.left, min, root.data - 1) &&
           IsBstHelper(root.right, root.data + 1, max);
}

int TreeSize(Node root)
{
    if (root == null) return 0;
    return TreeSize(root.left) + TreeSize(root.right) + 1;
}

We could also combine the operations above into a single post-order traversal so that each node is visited only once. Each call reports whether its subtree is a BST (via a -1 sentinel) together with the smallest and largest keys it contains, and the best size seen so far is threaded through by reference; the answer is obtained by calling it with best initialized to 0.
int LargestBst(Node root, ref int best, out int lo, out int hi)
{
    if (root == null) { lo = int.MaxValue; hi = int.MinValue; return 0; }
    int left = LargestBst(root.left, ref best, out int llo, out int lhi);
    int right = LargestBst(root.right, ref best, out int rlo, out int rhi);
    lo = Math.Min(llo, root.data);
    hi = Math.Max(rhi, root.data);
    // Not a BST here if either child subtree failed or the key ordering is violated.
    if (left < 0 || right < 0 || lhi >= root.data || root.data >= rlo) return -1;
    best = Math.Max(best, left + right + 1);
    return left + right + 1;
}

Tuesday, February 14, 2017

We continue with a detailed study of Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We discussed that Azure Stack is a hybrid approach to the cloud. Microsoft Azure has developer tools for all platform services targeting web and mobile, Internet of Things, microservices, data + analytics, identity management, media streaming, high performance compute and cognitive services. These platform services all utilize the core infrastructure of compute, networking, storage and security. The Azure Resource Manager has multiple resources, role based access control, custom tagging and self-service templates.
1) The compute services are made more agile with offerings for VM infrastructure, VM scale sets, container service orchestration and batch/job orchestration.
2) The PaaS platform of Azure can span both Azure and AWS. It can run on-premises, on GCP and elsewhere as well. Containers, serverless and microservices are different forms of computing.
3) Azure provides a data platform that transforms data into actions. The consumers of this transformation are people as well as apps and automated systems.
4) Azure networking is organized by scope: inside an Azure region, connecting Azure regions, and geographic reach with internet ecosystems. The networking inside an Azure region already comes with security, performance, load balancing, virtual networks and cross-premises connectivity. Azure also comes with accelerated networking.
We review some more features of Azure networking. Azure DNS is a globally distributed architecture which is resilient to multiple region failure. The global DNS name resolution is pretty fast with very high availability. It is integrated with the Azure Resource Manager for role based access control, tagging and template based deployment - for both zones and record sets. Azure DNS comes with REST API and SDKs for application integration.
Azure virtual machines support a native dual stack (IPV4 + IPV6) on both operating system flavors. It is available globally and maximizes the reach of Azure applications to mobile (4G) and IoT devices.
IPv6 is required by governments and their suppliers. IPv6 name resolution uses AAAA records.
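As a small illustration of dual-stack name resolution (a sketch using the .NET base class library, not an Azure-specific API; the host name is a placeholder), a client can inspect which address families a name resolves to:
using System;
using System.Net;
using System.Net.Sockets;

class DualStackCheck
{
    static void Main()
    {
        // "example.contoso.com" is a placeholder host name.
        foreach (var address in Dns.GetHostAddresses("example.contoso.com"))
        {
            // A records resolve to IPv4 addresses, AAAA records to IPv6 addresses.
            var family = address.AddressFamily == AddressFamily.InterNetworkV6 ? "IPv6 (AAAA)" : "IPv4 (A)";
            Console.WriteLine("{0} -> {1}", family, address);
        }
    }
}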
Azure load balancer services come at different levels. There is Traffic Manager for cross-region traffic direction, availability and exposure to the internet. There is the Azure Load Balancer which provides in-region scalability and availability. There is the Azure Application Gateway which offers URL/content-based routing and load balancing. Then there is load balancing on the VMs themselves for web servers.
Load balancers in Azure use multiple VIPs to simplify designs and reduce cost. There are multiple private VIPs on a load balancer, and the backend ports are reused using direct server return (DSR), which comes in very useful when there is concern that the load balancer will become a bottleneck. In such cases, the servers are allowed to respond to the client directly, such as when requests are small but responses are large.

#codingexercise
Find the closest element to a given value in a binary search tree

// Call with diff initialized to int.MaxValue; key receives the closest element to k.
void GetMinDiff(Node root, int k, ref int diff, ref int key)
{
    if (root == null) return;
    if (root.data == k)
    {
        // Exact match: nothing can be closer.
        diff = 0;
        key = root.data;
        return;
    }
    if (diff > Math.Abs(root.data - k))
    {
        diff = Math.Abs(root.data - k);
        key = root.data;
    }
    // Descend the side that can contain values nearer to k.
    if (k < root.data)
        GetMinDiff(root.left, k, ref diff, ref key);
    else
        GetMinDiff(root.right, k, ref diff, ref key);
}
The same traversal can be adapted to find the furthest element from a given value in the same binary search tree by changing the comparison on diff: in that case we maximize the difference. Note that the descent must then go away from k rather than toward it; in fact, the furthest element is always either the smallest or the largest key in the BST, so it suffices to compare those two.


Monday, February 13, 2017

Title: Improvements in stream processing of big data 
Introduction: In my discussion of online versus batch mode of processing for big data as shared here (http://1drv.ms/1OM29ee), I discussed that the online mode is made possible because of the summation form; however, I took the liberty of assuming that the merge operation of the summation form is linear after each node computes its summation. This could be improved further when the summaries are maintained as Fibonacci heaps because this data structure offers the following advantages:
  1. The merge operation is less than linear, if not constant time, and the potential does not change.
  2. The insert operation takes constant amortized time, rather than logarithmic time.
  3. The decrease-key operation also takes constant amortized time rather than logarithmic time.
Moreover, Fibonacci heaps have the property that the size of a subtree rooted in a node of degree k is at least the (k+2)th Fibonacci number. This lets us make approximations much faster by approximating at each node. The distribution of the Fibonacci heap also lets us propagate these approximations between heap additions and deletions, especially when the potential does not change.
In addition to the data structure, the Fibonacci series also plays an interesting role in the algorithm for online processing, given its nature of computing the next value from the last two steps only. In other words, we don't repeat the operations of the online processing in every iteration. We skip some and re-evaluate or adjust the online predictions at the end of the iterations corresponding to Fibonacci numbers. We use Fibonacci numbers rather than any other series, such as binomial, exponential or powers of two, because intuitively we are readjusting our approximations in our online processing and Fibonacci gives us a way to adjust them based on the previous and current approximations.
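To make the skipping schedule concrete, here is a minimal sketch; the per-iteration accumulation and the Adjust step are placeholders for whatever summation-form computation is being maintained:
using System;
using System.Collections.Generic;

class FibonacciSchedule
{
    // Yields the iteration numbers at which the online prediction is re-adjusted: 1, 2, 3, 5, 8, 13, ...
    static IEnumerable<long> Checkpoints(long limit)
    {
        long a = 1, b = 2;
        while (a <= limit)
        {
            yield return a;
            long next = a + b;
            a = b;
            b = next;
        }
    }

    static void Main()
    {
        var checkpoints = new HashSet<long>(Checkpoints(1000));
        double prediction = 0.0;
        for (long iteration = 1; iteration <= 1000; iteration++)
        {
            // ... accumulate the summation-form partial results for this iteration ...
            if (checkpoints.Contains(iteration))
            {
                // Re-adjust the running prediction only at Fibonacci-numbered iterations,
                // skipping the adjustment in the iterations in between.
                prediction = Adjust(prediction, iteration);
            }
        }
        Console.WriteLine(prediction);
    }

    // Placeholder adjustment: in practice this would smooth the previous and current approximations.
    static double Adjust(double previous, long iteration)
    {
        return (previous + iteration) / 2.0;
    }
}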
Straightforward aggregations using summation forms have the property that the online prediction improves on past predictions linearly, using only the previous approximation in the current prediction. However, if we were to use the predictions from the iterations corresponding to the Fibonacci series, then we refine the online iterations to be not just a linear extrapolation but also a Fibonacci-number-based smoothing.
The Fibonacci series is preferable to doubling or other exponential schedules because it grows with a smaller ratio (approaching the golden ratio), so the re-evaluation points stay denser as the iteration count grows.

Sunday, February 12, 2017

A private cloud provides resources such as compute, storage and networks to customers. These resources can be monitored for resource utilization, application performance, and operational health. The notion is borrowed from a public cloud such as AWS, where Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications that run on AWS. It provides system-wide visibility and helps keep applications running, which in turn helps customers keep their operations running smoothly. They can also use this service to set up alerts and notifications about the resources that interest them.
This kind of resource monitoring is different from cloud service monitoring because the former is useful for customers while the latter is useful for the cloud provider. The latter is used mostly for the health of the services which are offered to more than one customer. The former can be customer-specific, depending on the resources registered and the event subscriptions.
The implementation is also very different between the two. For example, the cloud services often report metrics directly to a metrics database from which health check reports are drawn. The data is often aged and stored in a time series database for cumulative charts. The cost of transforming data from collection into a time series database for reporting is usually paid by the query requesting the charts.
CloudWatch, on the other hand, is an event collection framework. In real time, events can be collected, archived, filtered and used for subsequent analysis. The collection of data again flows into a database from which queries can be meaningfully read and sent out. The events are much like the messages in a message broker except that the system is tuned for high performance and cloud scale. The events have several attributes and are often extensible as name-value pairs by the event producers, who register different types of event formats. The actual collection of each event, its firing and its subsequent handling are all done within the event framework. This kind of framework has to rely on small packet sizes for the events and robust, fast message broker functionality within the event broker. Handlers can be registered for each type of event or for the queues on which the events are stored.
Thus while the former is based on a time series database for metrics, the latter is based on an event collection and handling engine also referred to as an event driven framework. 
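A rough sketch of the event-collection side described above, with illustrative names only (this is not the CloudWatch API): events carry extensible name-value attributes and handlers are registered per event type.
using System;
using System.Collections.Generic;

// An event with extensible name-value attributes, as registered by event producers.
class MonitoringEvent
{
    public string Type { get; set; }
    public DateTime Timestamp { get; set; } = DateTime.UtcNow;
    public Dictionary<string, string> Attributes { get; } = new Dictionary<string, string>();
}

class EventBus
{
    readonly Dictionary<string, List<Action<MonitoringEvent>>> handlers =
        new Dictionary<string, List<Action<MonitoringEvent>>>();

    // Handlers are registered per event type, much like consumers on a queue.
    public void Subscribe(string type, Action<MonitoringEvent> handler)
    {
        if (!handlers.ContainsKey(type)) handlers[type] = new List<Action<MonitoringEvent>>();
        handlers[type].Add(handler);
    }

    // Publishing fires every handler registered for the event's type.
    public void Publish(MonitoringEvent e)
    {
        if (!handlers.TryGetValue(e.Type, out var list)) return;
        foreach (var handler in list) handler(e);
    }
}
A consumer might, for example, subscribe to a hypothetical "cpu-high" event type and raise an alert or notification from within the handler.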
#codingexercise
Replace every element with the least greater element to the right. For example, 
Input:  [8, 58, 71, 18, 31, 32, 63, 92, 43, 3, 91, 93, 25, 80, 28]
Output: [18, 63, 80, 25, 32, 43, 80, 93, 80, 25, 93, -1, 28, -1, -1]
void Replace(ref List<int> A)
{
    for (int i = 0; i < A.Count; i++)
    {
        // Find the smallest element to the right of i that is still greater than A[i].
        int min = int.MaxValue;
        for (int j = i + 1; j < A.Count; j++)
            if (A[j] < min && A[j] > A[i])
                min = A[j];
        if (min == int.MaxValue)
            min = -1;
        A[i] = min;
    }
}
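The nested loops above are O(n^2). As an alternative sketch (using System.Collections.Generic), we can scan from the right and keep the elements already seen in a SortedSet, querying for the smallest key greater than the current value, which brings the cost down to O(n log n):
void ReplaceWithLeastGreater(List<int> A)
{
    var seen = new SortedSet<int>();
    // Walk right to left so that 'seen' always holds the elements to the right of i.
    for (int i = A.Count - 1; i >= 0; i--)
    {
        int current = A[i];
        // View of all keys strictly greater than the current value (duplicates collapse, which is fine here).
        var greater = current < int.MaxValue
            ? seen.GetViewBetween(current + 1, int.MaxValue)
            : new SortedSet<int>();
        A[i] = greater.Count > 0 ? greater.Min : -1;
        seen.Add(current);
    }
}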

Saturday, February 11, 2017

A tutorial on asynchronous programming for DevOps:
Introduction – DevOps engineers increasingly rely on writing services for automation and for integrating new functionality on underlying systems. These services, which involve chaining one or more operations, often incur delays that exceed user tolerance on a web page. Consequently, engineers are faced with the challenge of giving an early response to the user even when the resource requested by the user may not be available yet. The following are some of the techniques commonly employed by these engineers:
1) Background tasks – Using a database and transactional behavior, engineers used to chain one or more actions within the same transaction scope. However, when each action takes a long time to complete, the chained actions amount to a significant delay. Although semantically correct, this does not lend itself to a reasonable response time. Consequently, the work is split into a foreground and a background task, where a database entry implies a promise that will be fulfilled subsequently by a background task or invalidated on failures. Django-background-tasks is an example of this functionality; it involves merely decorating a method to register it as a background task. Additionally, these registrations can specify the schedule in terms of a time period as well as the queues on which these tasks are stored. Internally they are implemented with their own background tasks table that allows retries and error handling.
2) Async tasks – Using a task library such as django-utils, tasks are merely executed on a separate thread in the runtime without the additional onus of persistence and formal task handling. While the main thread services the user and stays responsive, the async-registered method attempts to parallelize execution without guarantees.
3) Threads and locks – Using concurrent programming, developers such as those using python concurrent.futures look for ways to partition the tasks or stitch execution threads based on resource locks or time sharing. This works well to reduce the overall task execution time by identifying isolated versus shared or dependent actions. The status on the objects indicates their state. Typically this state is progressive so that errors are minimized and users can be informed of intermediate status and the progress made. A notion of optimistic concurrency control goes a step further by not requiring locks on those resources.
4) Message Broker – Using queues to store jobs, this mechanism allows actions to be completed later, often by workers independent of the original system that queued the task. This is very helpful for separating queues based on the processors handling them and for scaling out to as many workers as necessary to keep the jobs moving on the queue. Additionally, these message brokers come with options to maintain transactional behavior, hold dead letter queues, handle retries and journal the messages. Message brokers can also scale beyond one server to handle any volume of traffic (a minimal sketch of this queue-and-worker pattern follows after this list).
5) Django celery – Using this library, the onerous tasks associated with a message broker are internalized and developers are given a nifty, clean interface to perform their tasks in an asynchronous manner. This is a simple, flexible, reliable and distributed task queue that can process vast amounts of messages with a focus on real-time processing and support for task scheduling. While Django integration previously required a separate django-celery package, Celery now works with Django directly.
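The queue-and-worker idea in item 4 is language-agnostic; as a minimal illustrative sketch (shown here in C#, with an in-process queue standing in for a real message broker), the request path only enqueues work while independent workers drain the queue:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class JobQueueDemo
{
    static void Main()
    {
        // The queue decouples the request path (producer) from the workers (consumers).
        var jobs = new BlockingCollection<string>();

        // A pool of workers drains the queue independently of the code that enqueued the work.
        var workers = new Task[2];
        for (int w = 0; w < workers.Length; w++)
        {
            workers[w] = Task.Run(() =>
            {
                foreach (var job in jobs.GetConsumingEnumerable())
                    Console.WriteLine("processed " + job);
            });
        }

        // The request handler only enqueues and returns immediately.
        for (int i = 0; i < 5; i++)
            jobs.Add("job-" + i);

        jobs.CompleteAdding();   // signal that no more work is coming
        Task.WaitAll(workers);
    }
}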
Conclusion – Most application developers choose one or more of the strategies above depending on the service level agreements, the kind of actions, the number of actions per request, the number of requests, the size and scale of demand and other factors. There is a tradeoff between the added complexity and the layering or organization of tasks beyond synchronous programming.