Monday, July 25, 2022

This article continues a series on hosting solutions and services on the Azure public cloud, with the most recent discussion on multitenancy here, and picks up the checklist for architecting and building multitenant solutions. Administrators will find the list familiar.

The previous article introduced the checklist as structured around business and technical considerations and gave its specific examples in terms of Microsoft technologies. This article focuses on open-source scenarios on Azure, specifically with the Apache stack.

Each open-source product used in a multitenant solution must be carefully reviewed for the features it offers to support multitenancy. While the checklist alluded to general requirements around shared resources and tenant isolation, an open-source product might express isolation simply by naming its containers differently, for example one keyspace or database per tenant, as sketched below. Considerations for overcoming noisy-neighbor problems and scaling out the infrastructure must still be made to the degree that these products permit.
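
As an illustration of isolation by naming, the sketch below creates one Cassandra keyspace per tenant through the DataStax Java driver. This is a minimal sketch only; the tenant names, contact point, datacenter name, and replication settings are assumptions rather than recommendations.

import com.datastax.oss.driver.api.core.CqlSession;

import java.net.InetSocketAddress;
import java.util.List;

public class TenantKeyspaces {
    public static void main(String[] args) {
        // Hypothetical tenant identifiers; in practice these would come from an onboarding workflow.
        List<String> tenants = List.of("tenant_contoso", "tenant_fabrikam");

        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build()) {
            for (String tenant : tenants) {
                // One keyspace per tenant keeps data, access grants, and replication settings separated by name.
                session.execute("CREATE KEYSPACE IF NOT EXISTS " + tenant
                        + " WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
            }
        }
    }
}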

Let us take a few examples from the Apache stack. The data partitioning guidance for Apache Cassandra, for instance, describes how to separate data into partitions that can be managed and accessed independently; horizontal, vertical, and functional partitioning strategies must be applied as appropriate. Another example is Azure public multi-access edge compute, which must provide high availability to tenants; Cassandra can support this with geo-replication.
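
A hedged sketch of both points follows: the keyspace below geo-replicates across two datacenters, and the table applies horizontal partitioning by keeping the tenant identifier and day in the partition key. The keyspace, datacenter names, and column layout are assumptions for illustration.

import com.datastax.oss.driver.api.core.CqlSession;

import java.net.InetSocketAddress;

public class PartitionedTelemetry {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("eastus")
                .build()) {
            // Geo-replication: keep replicas in two regions (datacenter names are assumed).
            session.execute("CREATE KEYSPACE IF NOT EXISTS telemetry WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'eastus': 3, 'westeurope': 3}");

            // Horizontal partitioning: (tenant_id, day) forms the partition key, so each
            // tenant's daily readings hash to their own partition and scale independently.
            session.execute("CREATE TABLE IF NOT EXISTS telemetry.readings ("
                    + "tenant_id text, day date, event_time timestamp, device_id text, value double, "
                    + "PRIMARY KEY ((tenant_id, day), event_time, device_id))");
        }
    }
}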

Apache Storm is used in edge computing and offers true stream processing through low-level APIs. Trained AI models can be brought to the edge with Azure Stack Hub while Storm handles the event data. The advantage of hosting the AI models close to the edge is that predictions on incoming events incur minimal latency; the models can still be trained on high-performance processors, including GPUs, but heavy-duty compute is not needed to host them and serve predictions. Storm can act as the central point that receives all the events from the edge as well as their predictions.
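
The sketch below shows Storm's low-level spout/bolt API, with a spout standing in for edge event ingestion and a bolt standing in for model scoring. The class names, fields, and threshold check are illustrative assumptions, not a reference implementation.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.Map;
import java.util.Random;

public class EdgeScoringTopology {

    // Stand-in for an edge event source; in practice a queue or gateway would feed the spout.
    public static class EdgeEventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private Random random;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void nextTuple() {
            // Emit a synthetic (deviceId, reading) tuple.
            collector.emit(new Values("device-" + random.nextInt(10), random.nextDouble()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("deviceId", "reading"));
        }
    }

    // Stand-in for scoring against an edge-hosted model; a simple threshold substitutes for the model call.
    public static class ScoringBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            double reading = input.getDoubleByField("reading");
            String label = reading > 0.8 ? "anomaly" : "normal";
            collector.emit(new Values(input.getStringByField("deviceId"), label));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("deviceId", "label"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("edge-events", new EdgeEventSpout(), 2);
        builder.setBolt("score", new ScoringBolt(), 4).shuffleGrouping("edge-events");

        // Local mode for experimentation; StormSubmitter would be used on a real cluster.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("edge-scoring", new Config(), builder.createTopology());
            Thread.sleep(30_000);
        }
    }
}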

Since neither Big Data nor relational stores are well suited to the ingestion, processing, and analysis of streaming events, and can grow large enough to overwhelm the continuous processing those events require, it is better to let the edge generate the events and to use Storm for processing and holding them. Storm is taken here as the example of a stream processing system, but it is not the only one; readers are encouraged to review Apache Kafka, Apache Flink, and Apache Pulsar to weigh the nuances between their capabilities. There are also managed options on the public cloud, such as Azure HDInsight with Storm, which makes it easy to process and query data from Storm; interactive SQL queries can execute quickly and at scale over both structured and unstructured data. Stores like Azure Cosmos DB can accommodate diverse and unpredictable IoT workloads without sacrificing ingestion or query performance. If real-time processing is required, Storm and similar systems can help with capturing and analyzing events and generating reports or automated responses with minimal latency.
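
For comparison with the Storm example, the minimal sketch below publishes the same kind of edge events to Apache Kafka, one of the alternatives mentioned above. The broker address, topic name, and payload are assumptions.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class EdgeEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is a placeholder for this sketch.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by device id so events from one device stay ordered within a partition.
            producer.send(new ProducerRecord<>("edge-events", "device-1", "{\"reading\": 0.42}"));
        }
    }
}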

If batch processing is required, Apache Sqoop can help automate transfers over Big Data; for example, Sqoop jobs can be used to copy data between relational databases and HDFS. Data transfer options such as Azure Import/Export, Azure Data Box, and Sqoop can work against databases with little or no impact on their performance. Oozie and Sqoop together can manage batch workflows over captured real-time data.
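
As a hedged sketch of driving such a batch workflow programmatically, the snippet below submits an Oozie workflow through the Oozie Java client; the workflow definition at the given HDFS path is assumed to contain the Sqoop copy action, and the server URL, paths, and property names are all assumptions.

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

import java.util.Properties;

public class BatchCopyLauncher {
    public static void main(String[] args) throws OozieClientException {
        // Oozie server URL and HDFS application path are assumptions for this sketch.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // The workflow.xml at this path is assumed to define a Sqoop action that copies the table.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/etl/sqoop-copy-workflow");
        conf.setProperty("nameNode", "hdfs://namenode");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        String jobId = oozie.run(conf);
        System.out.println("Submitted Oozie workflow: " + jobId);
    }
}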

#codingexercise

Two nodes of a BST have been swapped; find them.
void InOrderTraverse(Node root, ref Node prev, ref List<Node> pairs)
{
    if (root == null) return;
    InOrderTraverse(root.left, ref prev, ref pairs);
    // A node smaller than its in-order predecessor marks a swapped pair.
    if (prev != null && root.data < prev.data) { pairs.Add(prev); pairs.Add(root); }
    prev = root;
    InOrderTraverse(root.right, ref prev, ref pairs);
}
