Monday, February 13, 2017

Title: Improvements in stream processing of big data 
Introduction: In my discussion of online versus batch mode of processing for big data, shared here (http://1drv.ms/1OM29ee), I noted that the online mode is made possible by the summation form. However, I took the liberty of assuming that the merge operation of the summation form is linear after each node computes its summation. This could be improved further if the summaries are maintained as Fibonacci heaps, because this data structure offers the following advantages: 
  1. The merge operation takes constant amortized time rather than linear time, and the potential does not change.
  2. The insert operation takes constant amortized time rather than logarithmic time.
  3. The decrease-key operation also takes constant amortized time rather than logarithmic time.
Moreover, Fibonacci heaps have the property that the size of a subtree rooted at a node of degree k is at least the (k+2)th Fibonacci number. This lets us make approximations much faster by approximating at each node. The distribution of the Fibonacci heap also lets us propagate these approximations between heap insertions and deletions, especially when the potential does not change. 
In addition to the data structure, the Fibonacci series also plays an interesting role in the algorithm of the online processing, because each term is computed from the previous two alone. In other words, we don't repeat the operations of the online processing in every iteration. We skip some, and re-evaluate or adjust the online predictions at the end of the iterations corresponding to Fibonacci numbers. We use Fibonacci numbers rather than another series, such as binomial, exponential or powers of two, because intuitively we are readjusting our approximations in the online processing, and the Fibonacci series gives us a way to adjust them based on the previous and current approximations.  
Straightforward aggregations using summation forms have the property that the online prediction improves the prediction from the past in a straight line, using only the previous approximation in the current prediction. However, if we were to use the predictions from the iterations corresponding to the Fibonacci series, then we would refine the online iterations with not just a linear extrapolation but also a Fibonacci-number-based smoothing. 
The Fibonacci series is preferable to an exponential schedule because it has better behavior near the asymptote.
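To make the iteration-skipping idea concrete, here is a minimal Python sketch, assuming a running mean as the summation form and a simple average of the previous and current approximations as the smoothing rule; the function names and the blending rule are illustrative assumptions, not part of the original discussion.

```python
# Sketch: maintain a running mean (a summation form) incrementally, and
# re-adjust the online estimate only at Fibonacci-numbered iterations.

def fibonacci_checkpoints(limit):
    """Yield Fibonacci numbers up to limit (1, 2, 3, 5, 8, ...)."""
    a, b = 1, 2
    while a <= limit:
        yield a
        a, b = b, a + b

def online_mean_with_fibonacci_smoothing(stream):
    checkpoints = set(fibonacci_checkpoints(len(stream)))
    total = 0
    prev_estimate = None
    estimate = None
    for i, x in enumerate(stream, start=1):
        total += x            # the summation form: cheap, incremental
        current = total / i   # linear online prediction
        if i in checkpoints:
            # At Fibonacci-numbered iterations, blend the previous and
            # current approximations instead of extrapolating linearly.
            estimate = current if prev_estimate is None else (prev_estimate + current) / 2
            prev_estimate = current
        else:
            estimate = current
    return estimate
```

Between checkpoints the estimate is the plain linear prediction; at Fibonacci-numbered iterations it is readjusted from the previous and current approximations, mirroring how a Fibonacci term depends only on the last two.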

Sunday, February 12, 2017

A private cloud provides resources such as compute, storage and networks to customers. These resources can be monitored for resource utilization, application performance, and operational health. The notion is borrowed from public clouds such as AWS, where Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications that run on AWS. It provides system-wide visibility and helps keep applications running, which in turn helps customers keep their operations running smoothly. Customers can also use this service to set up alerts and notifications of interest about their resources. 
This kind of resource monitoring is different from cloud service monitoring, because the former is useful for customers while the latter is useful for the cloud provider. The latter is used mostly for the health of the services, which are offered to more than one customer. The former can be customer specific, depending on the customer's resource registrations and event subscriptions.  
The implementation is also very different between the two. For example, the cloud services often report metrics directly to a metrics database from which health check reports are drawn. The data is usually aged and stored in a time series database for cumulative charts. The cost of transforming data from collection to a time series database for reporting is usually borne by the query requesting the charts. 
CloudWatch, on the other hand, is an event collection framework. In real time, events can be collected, archived, filtered and used for subsequent analysis. The collected data again flows into a database from which query results can be meaningfully read and sent out. The events are much like the messages in a message broker, except that the framework is tuned for high performance and cloud scale. The events have several attributes and are often extensible as name-value pairs by the event producers, who register different types of event formats. The actual collection of each event, its firing and its subsequent handling are all done within the event framework. This kind of framework has to rely on small packet sizes for the events and a robust, fast message broker functionality within the event broker. Handlers can be registered for each type of event or for the queues on which the events are stored.  
Thus, while the former is based on a time series database for metrics, the latter is based on an event collection and handling engine, also referred to as an event-driven framework. 
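A minimal Python sketch of the event-driven pattern described above: handlers register per event type, and each fired event is archived and dispatched to its handlers. The class and method names are illustrative assumptions, not any actual CloudWatch API.

```python
from collections import defaultdict

class EventBroker:
    """Toy event-collection framework: handlers register per event type,
    and fired events are archived and dispatched to matching handlers."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.archive = []  # events kept for subsequent analysis

    def register(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def fire(self, event_type, **attributes):
        # Events are extensible name-value pairs, as described above.
        event = {"type": event_type, **attributes}
        self.archive.append(event)
        for handler in self.handlers[event_type]:
            handler(event)

broker = EventBroker()
alerts = []
broker.register("cpu.high", lambda e: alerts.append(e["host"]))
broker.fire("cpu.high", host="vm-42", utilization=97)
broker.fire("disk.full", host="vm-7")  # no handler registered; archived only
```

A real event broker would add persistence, retries and dead-letter queues on top of this dispatch core.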
#codingexercise
Replace every element with the least greater element to its right. For example,
Input:  [8, 58, 71, 18, 31, 32, 63, 92, 43, 3, 91, 93, 25, 80, 28]
Output: [18, 63, 80, 25, 32, 43, 80, 93, 80, 25, 93, -1, 28, -1, -1]
void Replace(ref List<int> A)
{
    for (int i = 0; i < A.Count; i++)
    {
        int min = int.MaxValue; // least element greater than A[i] seen so far
        for (int j = i + 1; j < A.Count; j++)
            if (A[j] < min && A[j] > A[i])
                min = A[j];
        if (min == int.MaxValue)
            min = -1; // no greater element exists to the right
        A[i] = min;
    }
}

Saturday, February 11, 2017

A tutorial on asynchronous programming for DevOps:
Introduction – DevOps engineers increasingly rely on writing services for automation and for integrating new functionality on underlying systems. These services, which often chain one or more operations, can incur delays that exceed user tolerance on a web page. Consequently, the engineers are faced with the challenge of giving an early response to the user even when the resource requested by the user may not be available. The following are some of the techniques commonly employed by these engineers:
1) Background tasks – Using a database and transactional behavior, engineers used to chain one or more actions within the same transaction scope. However, when each action takes a long time to complete, the chained actions amount to a significant delay. Although semantically correct, this does not keep the delay reasonable. Consequently, the work is split into a foreground and a background task, where a database entry implies a promise that will subsequently be fulfilled by a background task or invalidated on failure. Django-background-tasks is an example of this functionality: it involves merely decorating a method to register it as a background task. Additionally, these registrations can specify a schedule in terms of a time period as well as the queues on which the tasks are stored. Internally, it is implemented with its own background tasks table that allows retries and error handling.
2) Async tasks – Using task parallel library such as django-utils, these allow tasks to merely be executed on a separate thread in the runtime without the additional onus of persistence and formal handling of tasks. While the main thread can service the user and be responsive, the async registered method attempts to parallelize execution without guarantees.
3) Threads and locks – Using concurrent programming, developers such as those using python concurrent.futures look for ways to partition the tasks or stitch execution threads based on resource locks or time sharing. This works well to reduce the overall task execution time by identifying isolated versus shared or dependent actions. The status on the objects indicates their state. Typically this state is progressive so that errors are minimized and users can be informed of intermediate status and the progress made. A notion of optimistic concurrency control goes a step further by not requiring locks on those resources.
4) Message broker – Using queues to store jobs, this mechanism allows actions to be completed later, often by workers independent of the original system that queued the task. This is very helpful for separating queues based on the processors handling them and for scaling out to as many workers as necessary to keep the jobs moving on the queue. Additionally, these message brokers come with options to maintain transactional behavior, hold dead letter queues, handle retries and journal the messages. Message brokers can also scale beyond one server to handle any volume of traffic.
5) Django celery – Using this library, the onerous tasks associated with a message broker are internalized and developers are given a very nifty and clean interface to perform their tasks in an asynchronous manner. Celery is a simple, flexible, reliable and distributed task queue that can process vast amounts of messages, with a focus on real-time processing and support for task scheduling. While Django integration previously required a separate library (django-celery), Celery now supports Django directly.
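As a minimal illustration of technique 3, here is a sketch using Python's concurrent.futures to run independent actions on a thread pool while the main thread stays responsive. The task functions and their arguments are illustrative assumptions; real services would call out to databases or remote APIs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Illustrative independent (isolated) actions with no shared state,
# so they need no locks and can run in parallel safely.
def provision(resource):
    return f"provisioned {resource}"

def notify(user):
    return f"notified {user}"

# Submit the isolated actions to a thread pool and collect results
# as each future completes, rather than waiting on them in order.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(provision, "vm-1"), pool.submit(notify, "alice")]
    results = sorted(f.result() for f in as_completed(futures))
```

Dependent or shared-state actions would instead need locks or the optimistic concurrency control mentioned above.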
Conclusion – Most application developers choose one or more of the strategies above depending on the service level agreements, the kind of actions, the number of actions on each request, the number of requests, the size and scale of demand and other factors. There is a tradeoff between complexity and the layering or organization of tasks beyond the synchronous programming.

Friday, February 10, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We discussed that Azure Stack is a hybrid approach to cloud. Microsoft Azure has developer tools for all platform services targeting web and mobile, Internet of Things, microservices, data + analytics, identity management, media streaming, high performance compute and cognitive services. These platform services all utilize a core infrastructure of compute, networking, storage and security. The Azure resource manager has multiple resources, role based access control, custom tagging and self-service templates.
1)  The compute services are made more agile with the offerings from a VM infrastructure, VM scale sets infrastructure, Container service orchestration and batch/Job orchestration. 
2) The PaaS platform of Azure can span both Azure and AWS. It can occupy on-premise, GCP and other environments as well. Containers, serverless and microservices are different forms of computing. 
3) Azure provides a data platform. The consumers of this transformation from data to actions are people as well as apps and automated systems.
4) Azure Networking is divided into regions and includes networking inside the Azure region, connections between Azure regions, and geographic reach and internet ecosystems. The networking inside the Azure region already comes with security, performance, load balancing, virtual networks and cross-premises connectivity. Azure also comes with accelerated networking.
Azure allows VNets, which help us define our network and our policies. They come loaded with features such as DMZs, backend subnets and customizable routes, as discussed earlier. VNet-to-VNet traffic goes via a gateway, and a full mesh network between VNets is possible; the VNets can thus be peered. Peering is easy to set up, and latency and throughput are the same as within a single VNet.
#codingexercise
Find if a given sorted subsequence exists in a binary search tree.
bool HasSequence(Node root, List<int> A)
{
    int index = 0;
    inOrderTraverse(root, A, ref index); // visits the tree in increasing order of elements
    return index == A.Count;
}
void inOrderTraverse(Node root, List<int> A, ref int index)
{
    if (root == null || index == A.Count) return; // stop early once the sequence is matched
    inOrderTraverse(root.left, A, ref index);
    if (index < A.Count && root.data == A[index])
        index++;
    inOrderTraverse(root.right, A, ref index);
}

Thursday, February 9, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We discussed that Azure Stack is a hybrid approach to cloud. Microsoft Azure has developer tools for all platform services targeting web and mobile, Internet of Things, microservices, data + analytics, identity management, media streaming, high performance compute and cognitive services. These platform services all utilize a core infrastructure of compute, networking, storage and security. The Azure resource manager has multiple resources, role based access control, custom tagging and self-service templates. The compute services are made more agile with the offerings from a VM infrastructure, VM scale sets infrastructure, container service orchestration and batch/job orchestration.
Azure involves a lot of fine-grained, loosely coupled microservices. Microservices can be stateful or stateless and can be deployed in a multi-cloud manner.  

The PaaS platform of Azure can span both Azure and AWS. It can occupy on-premise, GCP and other environments as well. Containers, serverless and microservices are different forms of computing. A container packages an exe or a jar. Serverless dictates the operational/cost model. Microservices provide a development architecture: a 3-tier model involving a thin client, SOA and pub/sub. The core compute is provided by Batch, Container Service, VM scale sets and virtual machines, in that order. The platform is provided by Azure Functions, App Service, Service Fabric and Cloud Services.
We now look at the data platform. The purpose of this platform is to interpret data to gain intelligence that can guide actions. This transformation from data to actions is facilitated by layers of information management, big data stores, machine learning and analytics, and intelligence services as well as dashboards and visualizations. The data comes from sensors and devices, applications, and other data sources. The information management layer aggregates this data with Data Factory, Data Catalog, and Event Hubs. The big data stores work with Data Lake Store and SQL Data Warehouse. Machine learning and analytics involves all insight applications, such as those for machine learning, Data Lake Analytics, HDInsight and Stream Analytics.
The intelligence layer comprises Cognitive Services, Bot Framework, and Cortana. The dashboard usually involves Power BI.
The consumers for this data transformation to actions are people as well as apps and automated systems.
Azure Networking is divided into regions and includes networking inside the Azure region, connections between Azure regions, and geographic reach and internet ecosystems. It is the latter two that the internet exchange providers span. The connections to the Azure region are made over software defined WANs and optical networks or advanced MPLS services. The networking inside the Azure region already comes with security, performance, load balancing, virtual networks and cross-premises connectivity.
Azure now comes with accelerated networking, which provides up to 25 Gbps of throughput and reduces network latency by up to 10x. Without accelerated networking, the policies were applied in software on the host. With accelerated networking, the policies are applied in hardware accelerators.
#codingexercise
Check whether a BST has a dead end.
A dead end is an element after which we cannot insert any more elements. It is a leaf value x such that both x-1 and x+1 already exist in the tree. The BST contains positive integer values greater than zero, which makes the value 1 an exception: it is a dead end whenever 2 exists.
bool HasDeadEnd(Node root)
{
    if (root == null) return false;
    var all = new List<Node>();
    ToInOrderList(root, ref all);
    var values = new HashSet<int>(all.Select(x => x.data));
    var leaves = GetLeaves(all);
    foreach (var leaf in leaves)
    {
        if (leaf.data == 1 && values.Contains(2))
            return true; // 1 is a dead end when 2 exists, since values are positive
        if (values.Contains(leaf.data - 1) && values.Contains(leaf.data + 1))
            return true;
    }
    return false;
}
void ToInOrderList(Node root, ref List<Node> all)
{
    if (root == null) return;
    ToInOrderList(root.left, ref all);
    all.Add(root);
    ToInOrderList(root.right, ref all);
}
List<Node> GetLeaves(List<Node> all)
{
    return all.Where(x => x.left == null && x.right == null).ToList();
}
Alternatively,
void FindLeaves(Node root, ref List<Node> leaves)
{
    if (root == null) return;
    FindLeaves(root.left, ref leaves);
    if (root.left == null && root.right == null)
        leaves.Add(root);
    FindLeaves(root.right, ref leaves);
}

The order in which the leaves are enumerated depends on the order in which the traversal is done.

Wednesday, February 8, 2017



We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We discussed that Azure Stack is a hybrid approach to cloud. Microsoft Azure has developer tools for all platform services targeting web and mobile, Internet of Things, microservices, data + analytics, identity management, media streaming, high performance compute and cognitive services. These platform services all utilize a core infrastructure of compute, networking, storage and security. The Azure resource manager has multiple resources, role based access control, custom tagging and self-service templates. The compute services are made more agile with the offerings from a VM infrastructure, VM scale sets infrastructure, container service orchestration and batch/job orchestration. Container infrastructure layering allows even more scale because it virtualizes the operating system. Azure is an open cloud because it supports open source infrastructure tools such as Linux, Ubuntu, Docker, etc., layered with databases and middleware such as Hadoop, Redis, MySQL, etc., app frameworks and tools such as Node.js, Java, Python, etc., applications such as Joomla, Drupal, etc., management applications such as Chef, Puppet, etc., and finally devops tools such as Jenkins, Gradle, Xamarin, etc. Azure involves a lot of fine-grained, loosely coupled microservices. Microservices can be stateful or stateless and can be deployed in a multi-cloud manner.  
The PaaS platform of Azure can span both Azure and AWS. It can occupy on-premise, GCP and other environments as well. As a PaaS offering, this platform enables development and debugging using API, IDE and CI/CD interfaces. 
The web apps written using the Azure stack can handle mission-critical load that scales, because the stack has built-in auto-scale, load balancing and high availability with auto-patching. It enables continuous deployment from a variety of source control systems and supports a variety of applications. It is tightly integrated with Xamarin, which is loved by developers and trusted by enterprises.  
Writing APIs is a breeze if we have the framework already. We can publish, manage, secure and analyze our APIs in minutes. We secure the API with Active Directory, single sign-on and OAuth, and generate client proxies or APIs in the language of our choice. Similarly, enterprise APIs can be mashed up and integrated with API Management and Logic Apps.  
Instead of APIs, we can also use serverless apps that help process events. These are cloud-scale event handlers, and they can scale with customer demand so that payment is on a usage basis. All this involves is writing functions in C#, Node.js, Python or PHP and scheduling event-driven tasks across services. These functions are exposed as HTTP API endpoints, are fully open source and run on serverless infrastructure.  
Serverless means that there are no underlying servers to manage, which means we don't have to do OS patching or maintenance. Applications are built of event handlers, which makes it easy to connect services and components. There is micro-billing, which charges for execution in very short time increments, with significant cost efficiencies.  
Containers, serverless and microservices are different forms of computing. A container packages an exe or a jar. Serverless dictates the operational/cost model. Microservices provide a development architecture: a 3-tier model involving a thin client, SOA and pub/sub. 

A simple serverless microservice cloud platform usually involves callers/events, serverless microservices, and the platform itself. The core compute is provided by Batch, Container Service, VM scale sets and virtual machines, in that order. The platform is provided by Azure Functions, App Service, Service Fabric and Cloud Services.
#codingexercise 
Find the maximum value of a node between two nodes in a BST.
Solution: Find the LCA of the two nodes, then take the maximum value seen on the paths from the LCA down to each node.
Node GetLCA(Node root, int n1, int n2)
{
    if (root == null) return null;
    if (root.data == n1 || root.data == n2) return root;
    Node left = GetLCA(root.left, n1, n2);
    Node right = GetLCA(root.right, n1, n2);
    if (left != null && right != null) return root;
    if (left != null) return left;
    return right;
}
int GetMaxBetween(Node root, int n1, int n2)
{
    Node lca = GetLCA(root, n1, n2);
    return Math.Max(GetMax(lca, n1), GetMax(lca, n2));
}
int GetMax(Node root, int key) // maximum value on the path from root down to key in the BST
{
    int max = int.MinValue;
    var current = root;
    while (current != null)
    {
        max = Math.Max(max, current.data);
        if (current.data == key) break;
        current = (key < current.data) ? current.left : current.right;
    }
    return max;
}