Sunday, December 26, 2021

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing; the most recent discussion covered Azure SQL Edge, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.

The Gen 1 Data Lake was not integrated with Blob Storage, but Gen 2 is. Gen 2 adds support for file-system semantics and file-level security. Because these features are provided by Blob Storage, the service inherits storage-engineering best practices such as replication groups, high availability, tiered data storage and storage classes, and aging and retention policies.

Gen 2 is the current standard for building enterprise data lakes on Azure. A data lake must store petabytes of data while sustaining throughput of gigabytes of data transfer per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention encodes these folder paths by including the folder separator character in the name itself. With this organization, and with folder-level access performed directly against the object store, the overall performance of the data lake improves.
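
To make the hierarchical namespace concrete, here is a minimal sketch using the azure-storage-file-datalake Python SDK to create a nested folder path and upload a file into it. The account URL, credential, file system name, and folder path are placeholder assumptions for illustration, not values from this article.

# Minimal sketch: creating a hierarchical folder path in ADLS Gen2.
# Assumes the azure-storage-file-datalake package and placeholder account values.
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://<storage-account>.dfs.core.windows.net"   # placeholder
credential = "<account-key-or-token>"                             # placeholder

service_client = DataLakeServiceClient(account_url=account_url, credential=credential)
file_system = service_client.get_file_system_client("raw")        # a container is a file system

# The folder separator characters in the name define the hierarchy.
directory = file_system.create_directory("telemetry/2021/12/26")

# Upload a small file directly into that folder.
file_client = directory.create_file("readings.json")
data = b'{"deviceId": "sensor-01", "temperature": 21.7}'
file_client.upload_data(data, overwrite=True)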

Both the object store containers and the containers exposed by Data Lake are transparently available to applications and services. Blob Storage features such as diagnostic logging, access tiers, and lifecycle management policies are available to the account. The integration with Blob Storage is only one aspect of Azure Data Lake's integration; many other services integrate with it to support data ingestion, data analytics, and reporting with visual representations. Data management and analytics form the core scenarios supported by Data Lake. Fine-grained access control lists and Active Directory integration round out the data security considerations.

Even if the data lake comprises only a few data asset types, a planning phase is required to avoid the dreaded data swamp. Governance and organization are key to avoiding this situation. When the data systems are numerous and large, a robust data catalog is required. Since Data Lake is a PaaS service, it can support multiple accounts at no overhead. A minimum of three lakes is recommended during the discovery and design phase due to the following factors:

1. Isolation of data environments and predictability

2. Features and functionality at the storage account level, or regional versus global data lakes

3. The use of a data catalog, data governance, and project tracking tools

For multi-region deployments, it is recommended to land the data in one region and then replicate it globally using AzCopy, Azure Data Factory, or third-party products that assist with migrating data from one place to another.

The best practices for Azure Data Lake involve evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing, and analysis from several data sources, and leveraging monitoring telemetry.

 

 

Saturday, December 25, 2021

Event-driven or Database - the choice is yours.

 

Public cloud computing must deal with events at an unprecedented scale. The right choice of architectural style plays a big role in the total cost of ownership for a solution involving events. IoT traffic, for instance, can be channeled either through the event-driven stack available from Azure or through Azure SQL Edge, also available from Azure. The distinction between these may not be fully recognized or appreciated by development teams focused on agile and expedient delivery of work items, but a sound architecture is like a good investment that multiplies its return, as opposed to one that requires frequent scaling, revamping, or even rewriting. This article explores the differences between the two. It is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing; the most recent discussion covered Azure SQL Edge, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category.

Event-driven architecture consists of event producers and event consumers. Producers generate a stream of events, and consumers listen for those events.

The scale-out can be adjusted to suit the demands of the workload, and events can be responded to in real time. Producers and consumers are isolated from one another. In some extreme cases, such as IoT, events must be ingested at very high volumes. There is scope for a high degree of parallelism since the consumers run independently and in parallel, but they are tightly coupled to the events they process. Network latency for message exchanges between producers and consumers is kept to a minimum. Consumers can be added as necessary without impacting existing ones.
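
As an illustration of this decoupling, the sketch below uses the azure-eventhub Python package to publish events and, separately, to consume them. The connection string and hub name are placeholders, and checkpointing and error handling are omitted for brevity.

# Minimal sketch of decoupled producers and consumers with Azure Event Hubs.
# Connection string and event hub name are placeholder assumptions.
from azure.eventhub import EventHubProducerClient, EventHubConsumerClient, EventData

CONN_STR = "<event-hubs-connection-string>"   # placeholder
HUB_NAME = "<event-hub-name>"                 # placeholder

def produce():
    # The producer only publishes; it knows nothing about the consumers.
    producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name=HUB_NAME)
    with producer:
        batch = producer.create_batch()
        batch.add(EventData('{"deviceId": "sensor-01", "temperature": 21.7}'))
        producer.send_batch(batch)

def on_event(partition_context, event):
    # Each consumer instance processes events independently, partition by partition.
    # Checkpointing (needed for resuming after restarts) is omitted in this sketch.
    print(partition_context.partition_id, event.body_as_str())

def consume():
    consumer = EventHubConsumerClient.from_connection_string(
        CONN_STR, consumer_group="$Default", eventhub_name=HUB_NAME)
    with consumer:
        consumer.receive(on_event=on_event, starting_position="-1")  # read from the beginning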

Some of the benefits of this architecture include the following: publishers and subscribers are decoupled; there are no point-to-point integrations; it is easy to add new consumers to the system; consumers can respond to events immediately as they arrive; the systems are highly scalable and distributed; and subsystems can maintain independent views of the event stream.

Some of the challenges faced with this architecture include the following: event loss is tolerated, which is a problem when guaranteed delivery is needed, and some IoT traffic mandates guaranteed delivery. Another challenge is processing events in exactly the order they arrive. Each consumer type typically runs in multiple instances for resiliency and scalability, which can be a problem if the processing logic is not idempotent or the events must be processed in order.

Some of the best practices for this architecture include the following: events should be lean and not bloated; services should share only IDs and/or a timestamp; transferring large payloads between services is an antipattern; and loosely coupled event-driven systems work best. A minimal sketch of these practices follows.
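
In the sketch below the event fields and the load_order helper are hypothetical; the point is that the event carries only an identifier and a timestamp, and the consumer looks the details up and deduplicates so that repeated delivery is harmless.

# Sketch of a lean event and an idempotent consumer (illustrative names only).
import json
import time

def make_event(order_id: str) -> str:
    # Lean payload: share only the ID and a timestamp, not the whole record.
    return json.dumps({"orderId": order_id, "occurredAt": time.time()})

processed_ids = set()  # stand-in for a durable de-duplication store

def handle(event_json: str, load_order) -> None:
    event = json.loads(event_json)
    order_id = event["orderId"]
    if order_id in processed_ids:
        return                      # duplicate delivery: safe to ignore
    order = load_order(order_id)    # the consumer fetches the full data on demand
    # ... process the order ...
    processed_ids.add(order_id)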

Azure SQL Edge is an optimized relational database engine geared towards edge computing. It provides a high-performance data storage and processing layer for IoT applications. It provides capabilities to stream, process, and analyze data that can vary from relational to document to graph to time-series, which makes it a good choice for a variety of modern IoT applications. It is built on the same database engine as SQL Server and Azure SQL, so applications can seamlessly use queries written in T-SQL. This makes applications portable between devices, datacenters, and the cloud.

Azure SQL Edge uses the same streaming capabilities as Azure Stream Analytics on IoT Edge. This native implementation of data streaming is called T-SQL streaming. It can handle fast streaming from multiple data sources, and the patterns and relationships extracted from several IoT input sources can be used to trigger actions, alerts, and notifications. A T-SQL streaming job consists of a stream input that defines the connection to the data source from which the data stream is read, a stream output that defines the connection to the data source to which the data stream is written, and a stream query that defines the transformations, aggregations, filtering, sorting, and joins applied to the input stream before it is written to the stream output.
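
The sketch below outlines, in Python with pyodbc, how such a job might be wired up. The stream names, external data sources, and the sys.sp_create_streaming_job and sys.sp_start_streaming_job procedure calls are written from recollection of the SQL Edge documentation and should be treated as assumptions to verify, not authoritative syntax.

# Sketch only: defining and starting a T-SQL streaming job on Azure SQL Edge.
# Stream names, data sources, and the exact procedure syntax are assumptions to verify.
import pyodbc

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=tcp:localhost,1433;Database=edgedb;"
                      "Uid=sa;Pwd=<password>")                      # placeholder credentials
cur = conn.cursor()

# Stream input and output over previously created external data sources
# and file format (not shown here).
cur.execute("""
CREATE EXTERNAL STREAM SensorInput WITH (
    DATA_SOURCE = EdgeHubInput, FILE_FORMAT = JsonFormat, LOCATION = N'sensor-readings')
""")
cur.execute("""
CREATE EXTERNAL STREAM AlertOutput WITH (
    DATA_SOURCE = SqlOutput, LOCATION = N'dbo.TemperatureAlerts')
""")

# Stream query: filter the input before writing to the output.
cur.execute("""
EXEC sys.sp_create_streaming_job @name = N'TemperatureJob',
     @statement = N'SELECT * INTO AlertOutput FROM SensorInput WHERE temperature > 30'
""")
cur.execute("EXEC sys.sp_start_streaming_job @name = N'TemperatureJob'")
conn.commit()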

Both the storage and the message queue handle large volumes of data, and the execution can be staged as processing and analysis. The processing can be either batch oriented or stream oriented. The analysis and reporting can be offloaded to a variety of technology stacks with impressive dashboards. While the processing handles the requirements for batch and real-time processing of the big data, the analytics supports exploration and rendering of output from big data. The style utilizes components such as data sources, data storage, batch processors, stream processors, a real-time message queue, an analytics data store, analytics and reporting stacks, and orchestration.

Some of the benefits of this approach include the following: the ability to offload processing to a database, elastic scale, and interoperability with existing solutions.

Some of the challenges faced with this architectural style include the complexity of handling isolation for multiple data sources and the difficulty of building, deploying, and testing data pipelines over a shared architecture. Different products demand just as many skillsets and maintenance efforts, along with a requirement for data and query virtualization. For example, U-SQL, which is a combination of SQL and C#, is used with Azure Data Lake Analytics, while SQL APIs are used with Edge, Hive, HBase, Flink, and Spark. With event-driven processing over a heterogeneous stack, the emphasis on data security gets diluted and spread over a very large number of components.

 

Friday, December 24, 2021

 

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing; the most recent discussion covered Azure SQL Edge, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category. This article continues that discussion of SQL Edge.

SQL Edge is an optimized relational database engine geared towards edge computing. It provides a high-performance data storage and processing layer for IoT applications. It provides capabilities to stream, process, and analyze data that can vary from relational to document to graph to time-series, which makes it a good choice for a variety of modern IoT applications. It is built on the same database engine as SQL Server and Azure SQL, so applications can seamlessly use queries written in T-SQL. This makes applications portable between devices, datacenters, and the cloud.

Azure SQL Edge uses the same streaming capabilities as Azure Stream Analytics on IoT Edge. This native implementation of data streaming is called T-SQL streaming. It can handle fast streaming from multiple data sources. A T-SQL streaming job consists of a stream input that defines the connection to the data source from which the data stream is read, a stream output that defines the connection to the data source to which the data stream is written, and a stream query that defines the transformations, aggregations, filtering, sorting, and joins applied to the input stream before it is written to the stream output.

Azure SQL Edge is also noteworthy for bringing machine learning directly to the edge by running ML models for edge devices. SQL Edge supports the Open Neural Network Exchange (ONNX) format, and a model can be deployed with T-SQL. The model can be pre-trained or custom-trained outside SQL Edge with a choice of frameworks; it just needs to be in ONNX format. The ONNX model is simply inserted into the models table in the onnx database, and a connection string is sufficient to send data into SQL. The PREDICT function can then be run on the data using the model.
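
A sketch of that flow from Python, using pyodbc, follows. The table and column names are made up for illustration, and the PREDICT ... WITH (...) syntax with the RUNTIME = ONNX clause follows my recollection of the SQL Edge documentation, so verify it against the current docs.

# Sketch: deploy an ONNX model into SQL Edge and score data with PREDICT.
# Table and column names are illustrative; verify the exact PREDICT syntax for your version.
import pyodbc

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=tcp:localhost,1433;Database=onnx;"
                      "Uid=sa;Pwd=<password>")                     # placeholder credentials
cur = conn.cursor()

# 1. Insert the pre-trained ONNX model as binary data into a models table.
with open("regression.onnx", "rb") as f:
    cur.execute("INSERT INTO dbo.models (name, model) VALUES (?, ?)",
                "regression", pyodbc.Binary(f.read()))
conn.commit()

# 2. Run PREDICT over a data table using the stored model.
cur.execute("""
DECLARE @model VARBINARY(MAX) = (SELECT model FROM dbo.models WHERE name = 'regression');
SELECT r.deviceId, p.score
FROM PREDICT(MODEL = @model, DATA = dbo.readings AS r, RUNTIME = ONNX)
WITH (score FLOAT) AS p;
""")
for row in cur.fetchall():
    print(row.deviceId, row.score)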

ML pipelines are a newer technology compared to traditional software development stacks, and such pipelines have generally been run on-premises, simply due to the latitude in using different frameworks and development styles. Also, experimentation can exceed the limits allowed for free tiers in the public cloud. In some cases, event processing systems such as Apache Spark and Kafka find it easier to replace the Extract-Transform-Load solutions that proliferated with data warehouses. The use of SQL Edge avoids the requirement to perform ETL, and machine learning models are end products. They can be hosted in a variety of environments, not just the cloud or SQL Edge; some ML users would like to load the model on mobile or edge devices. Many IoT practitioners agree that the streaming data from edge devices can be heavy enough that a database system will outperform any edge device-based computing. Internet TCP relays incur latencies on the order of 250-300 milliseconds, whereas the ingestion rate for database processing can be upwards of thousands of events per second. These are some of the benefits of using machine learning within the database.

 

Thursday, December 23, 2021

A summary of the book “The Burnout Fix” written by Jacinta M. Jimenez

This is a book about how to overcome overwhelm, beat busy, and sustain success in the new world of work. One would think that burnout is a problem whose solution is screaming at you, but there is no quick-relief ointment without systemic changes. The book recognizes burnout as a pervasive social problem in the United States. Jacinta M. Jimenez attributes it to factors that must be addressed by both the individual and the organization. The organization must not merely treat the symptoms of burnout but prevent it altogether. The individual must foster resilience with Jimenez's science-backed PULSE practices. These practices help people lead a more purpose-driven life and support team members' well-being.

We see right away that there is a shift of focus from grit to resilience. A hyperconnected world demands more from individuals than their own beliefs can sustain, pushing them to take on unsustainable volumes of work and remain constantly available to tackle tasks. When this goes on for too long, burnout results. Even if we work harder or smarter, neglecting to nurture a steady personal pulse makes even our successes short-lived.

There are five capabilities suggested to avoid burnout and lead to improvements in the following areas:

1. Behavioral – where we boost our professional and personal growth by developing a healthy performance pace.

2. Cognitive – where we rid ourselves of unhealthy thought patterns.

3. Physical – where we embrace the power of leisure as a strategy to protect and restore our reserves of energy.

4. Social – where we build a diverse network of social support to make ourselves more adaptable and improve our thinking.

5. Emotional – where, even when we do not control our priorities or time, we evaluate the effort we exert and take control of ourselves.

Most people tackle their goals by breaking them into smaller concrete steps, which helps avoid cognitive and emotional exhaustion. There are three P's involved: 1. Plan, where we assess our skills and progress towards bigger goals using progress indicators; 2. Practice, where we commit to continuous learning by experimenting and receiving feedback while journaling our progress; and 3. Ponder, where we reflect on what worked and what did not.

We must reduce distracting thoughts and work towards mental clarity using three C's: 1. Curiosity, where we identify recurring thoughts and check whether they are grounded in reality; 2. Compassion, where we treat ourselves with kindness; and 3. Calibration, where we calibrate between showing ourselves more compassion and seeking more information. We can cultivate this by stacking new habits onto existing ones, scheduling reminders for mind space, breathing, writing down thoughts and learning from binary thinking, sticking with self-compassion, and being consistent.

We must prioritize leisure time. The ability to enjoy stress-free leisure time is essential to staying calm and centered.

We tend to focus on how fast others respond rather than on the quality of their responses. To give ourselves more space, we must practice three S's: 1. Silence, where we eliminate the duress of devices or go on a meditation retreat; 2. Sanctuary, where we spend time in nature to improve our mood; and 3. Solitude, where we spend time by ourselves to slow down sensory input.

Social wellness is equally important, and it is often diminished by the symptoms that precede burnout. When we feel we belong and can securely access support from a community, we reduce the stress on our brains and create conditions for improved productivity. This can be done with three B's: 1. Belonging, where we strengthen our sense of belonging by actively working to be more compassionate; 2. Breadth, where we create a visual map of our circles of support; and 3. Boundaries, where we reflect on our personal values.

Energy is finite, so we must manage it carefully. We do this with three E's: 1. Enduring principles, where we determine what guides us in our current stage; 2. Energy expenditure, where we assess how we spend our energy; and 3. Emotional acuity, where we resist the tendency to ignore our emotions.

Similarly, we lead healthy teams by embracing 1. Agency, 2. Benevolence, and 3. Community. When leaders demonstrate and implement techniques to increase resilience, it percolates through the rank and file.


Wednesday, December 22, 2021

 

Azure Machine Learning provides an environment to create and manage the end-to-end life cycle of machine learning models. Unlike general-purpose software, machine learning has significantly different requirements, such as the use of a wide variety of technologies, libraries, and frameworks; the separation of training and testing phases before a model is deployed and used; and iterations of model tuning independent of model creation and training. Azure Machine Learning's compatibility with open-source frameworks and platforms like PyTorch and TensorFlow makes it an effective all-in-one platform for integrating and handling data and models, which tremendously relieves the onus on the business to develop new capabilities. Azure Machine Learning is designed for all skill levels, with advanced MLOps features as well as simple no-code model creation and deployment.

We will compare this environment with TensorFlow, but for those unfamiliar with the latter, here is a use case. A JavaScript application performs image processing with a machine learning algorithm. When enough training images have been processed, the model learns the characteristics of the drawings that determine their labels. Then, as it runs through the test data set, it can predict the label of a drawing using the model. TensorFlow includes the Keras API, which can help author the model and deploy it to an environment such as Colab where the model can be trained on a GPU. Once training is done, the model can be loaded and run anywhere else, including a browser. The power of TensorFlow is in its ability to load the model and make predictions in the browser itself.

The labeling of drawings starts with a sample of, say, one hundred classes. The data for each class is available on Google Cloud as numpy arrays, with some number of images, say N, per class. The dataset is pre-processed for training, where it is converted into batches, and the model outputs class probabilities.

As with any ML example, the data is split into a 70% training set and a 30% test set. There is no inherent order to the data, and the split is taken over a random shuffle.
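
A minimal sketch of that split with NumPy; the array shapes and the dummy data are illustrative assumptions.

# Sketch: shuffle and split features/labels 70/30 (illustrative shapes).
import numpy as np

def split_70_30(images: np.ndarray, labels: np.ndarray, seed: int = 42):
    # Shuffle indices, then take the first 70% for training and the rest for testing.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(images))
    cut = int(0.7 * len(images))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return images[train_idx], labels[train_idx], images[test_idx], labels[test_idx]

# Example with dummy data shaped like 28x28 grayscale drawings and 100 classes:
x = np.zeros((1000, 28, 28, 1), dtype=np.float32)
y = np.zeros(1000, dtype=np.int64)
x_train, y_train, x_test, y_test = split_70_30(x, y)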

TensorFlow makes it easy to construct this model using the TensorFlow Lite Model Maker. It can only present the output after the model is trained; in this case, the model must be run after the training data has labels assigned, which might be done by hand. The model works better with fewer parameters. It might contain three convolutional layers and two dense layers. A pooling size is specified for each of the convolutional layers, and the layers are stacked onto the model. The model is compiled with a loss function, the Adam optimizer (tf.train.AdamOptimizer in the older TensorFlow API), and a metric such as top-k categorical accuracy. The summary of the model can be printed for review. With a set number of epochs and batches, the model can be trained. Annotations help the TensorFlow Lite converter fuse TF.Text API operations, and this fusion leads to a significant speedup over conventional models. The model architecture can also be tweaked to include a projection layer along with the usual convolutional layers and an attention encoder mechanism, which achieves similar accuracy with a much smaller model size. There is native support for HashTables for NLP models.
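
The following Keras sketch mirrors that description: three convolutional layers each followed by pooling, two dense layers, the Adam optimizer, and a top-k categorical accuracy metric. The 28x28x1 input shape and 100 classes are assumptions carried over from the drawing-classification example above.

# Sketch of the described model: 3 conv layers with pooling plus 2 dense layers.
# Input shape (28, 28, 1) and 100 classes are assumptions for the drawings dataset.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(100, activation='softmax'),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),        # modern equivalent of tf.train.AdamOptimizer
    loss='sparse_categorical_crossentropy',      # integer class labels assumed
    metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)])

model.summary()                                  # prints the layer-by-layer summary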

With the model and the training and test sets defined, it is now easy to evaluate the model and run inference. The model can also be saved and restored. Execution is faster when a GPU is added to the computation.

Training proceeds in batches of a predefined size, and the number of passes over the entire training dataset, called epochs, can also be set up front. A batch size of 256 and five steps could be used, for example; these are called model tuning parameters. Every model has a speed, a Mean Average Precision, and an output, and the higher the precision, the lower the speed. It is helpful to visualize training with a chart that updates with the loss after each epoch. Usually there is a downward trend in the loss, which indicates that the model is converging.
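
Continuing the sketch above (same assumed variables), training with those tuning parameters and inspecting the per-epoch loss looks roughly like this:

# Sketch: train with a batch size of 256 for 5 epochs and check the loss trend.
history = model.fit(x_train, y_train,
                    batch_size=256,
                    epochs=5,
                    validation_data=(x_test, y_test))

# A generally decreasing loss indicates the model is converging.
for epoch, loss in enumerate(history.history['loss'], start=1):
    print(f"epoch {epoch}: loss={loss:.4f}")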

Training the model might take a long time, say about four hours. Once the test data has been evaluated, the model's effectiveness can be assessed using precision and recall: precision is the fraction of the model's positive inferences that were indeed positive, and recall is the fraction of actual positives that the model identified.
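
For concreteness, here is a tiny sketch of those two measures computed from true and predicted labels for a single class (hypothetical arrays and class index):

# Sketch: precision and recall for one class from true and predicted labels.
import numpy as np

def precision_recall(y_true, y_pred, positive_class):
    pred_pos = (y_pred == positive_class)
    true_pos = (y_true == positive_class)
    tp = np.sum(pred_pos & true_pos)          # predicted positive and actually positive
    fp = np.sum(pred_pos & ~true_pos)         # predicted positive but actually negative
    fn = np.sum(~pred_pos & true_pos)         # actual positives the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example usage (with the earlier sketch's test set):
# p, r = precision_recall(y_test, model.predict(x_test).argmax(axis=1), positive_class=7)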

Azure Machine Learning has a drag-and-drop interface that can be used to train and deploy models. It uses a machine learning workspace to organize shared resources such as pipelines, datasets, compute resources, registered models, published pipelines, and real-time endpoints. A visual canvas helps build an end-to-end machine learning workflow: models are trained, tested, and deployed entirely in the designer. Datasets and components can be dragged and dropped onto the canvas, a pipeline draft connects the components, and a pipeline run can be submitted using the resources in the workspace. Training pipelines can be converted to inference pipelines, and pipelines can be published so that new runs can be submitted with different parameters and datasets. A training pipeline can be reused for different models, and a batch inference pipeline can be used to make predictions on new data.
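
The designer is code-free, but the same workspace and pipeline concepts are exposed through the Python SDK. The sketch below uses the v1 azureml-core style API with placeholder names, script files, and compute target, showing a two-step pipeline being defined, submitted, and published.

# Sketch using the v1 azureml-core SDK; names, scripts, and compute target are placeholders.
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()                      # reads config.json for the workspace

train_step = PythonScriptStep(name="train",
                              script_name="train.py",
                              compute_target="cpu-cluster",       # placeholder compute
                              source_directory="./src")
score_step = PythonScriptStep(name="batch-score",
                              script_name="score.py",
                              compute_target="cpu-cluster",
                              source_directory="./src")
score_step.run_after(train_step)

pipeline = Pipeline(workspace=ws, steps=[train_step, score_step])
run = Experiment(ws, "designer-comparison").submit(pipeline)

# Publish so the pipeline can be re-run later with different parameters and datasets.
published = pipeline.publish(name="training-pipeline", description="train then score")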

Tuesday, December 21, 2021

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing; the most recent discussion covered Azure Maps, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category. In this article, we explore Azure Logic Apps.

Each logic app is a workflow that implements some process. This might be a system-to-system process, such as connecting two or more applications. Alternatively, it might be a user-to-system process, one that connects people with software and potentially has long delays. Logic Apps is designed to support either of these scenarios.

Azure Logic Apps is a member of Azure Integration Services. It simplifies the way legacy, modern, and niche systems are connected across cloud, on-premises, and hybrid environments, and the integrated solutions are very valuable for B2B scenarios. Integration services distinguish themselves with four common components in their design: APIs, Events, Messaging, and Orchestration. APIs are a prerequisite for interactions between services; they facilitate functional programmatic access as well as automation. For example, a workflow orchestration might implement a complete business process by invoking different APIs in different applications, each of which carries out some part of that process.

Integrating applications commonly requires implementing all or part of a business process. It can involve connecting a software-as-a-service implementation such as Salesforce CRM, updating on-premises data stored in SQL Server and Oracle databases, and invoking operations in an external application. These translate to specific business purposes and custom orchestration logic. Many backend operations are asynchronous by nature and require background processing; even when APIs are written for asynchronous processing, long-running API calls are not easily tolerated, so some form of background processing is required, and situations like this call for a message queue. Events facilitate the publisher-subscriber pattern so that polling a queue for messages can be avoided. For example, Event Grid supports subscribers: rather than requiring a receiver to poll for new messages, the receiver registers an event handler for the event source it is interested in, and Event Grid invokes that handler when the specified event occurs. Azure Logic Apps are workflows, and a workflow can easily span all four of these components during its execution.
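
To illustrate push delivery rather than polling, here is a sketch of a webhook endpoint that Event Grid could invoke: it answers the subscription-validation handshake and then handles events as they arrive. The Flask framing and the route name are assumptions for illustration.

# Sketch: a webhook endpoint that Event Grid pushes events to (no polling).
# The Flask app and route are illustrative; the validation handshake follows Event Grid's contract.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/events", methods=["POST"])
def handle_events():
    for event in request.get_json():
        if event.get("eventType") == "Microsoft.EventGrid.SubscriptionValidationEvent":
            # Echo the validation code so Event Grid confirms the subscription.
            code = event["data"]["validationCode"]
            return jsonify({"validationResponse": code})
        # Any other event type: the handler is invoked as events occur.
        print(event.get("eventType"), event.get("subject"))
    return "", 200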

Azure Logic Apps can be multi-tenant. It is easier to write an application as multi-tenant when we create a workflow from the templates gallery; these templates range from simple connectivity for software-as-a-service applications to advanced B2B solutions. Multi-tenancy means there is a shared, common infrastructure serving numerous customers simultaneously, leading to economies of scale.

Monday, December 20, 2021

 

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing; the most recent discussion covered Azure Maps, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category. In this article, we explore Azure SQL Edge.

SQL Edge is an optimized relational database engine geared towards edge computing. It provides a high-performance data storage and processing layer for IoT applications. It provides capabilities to stream, process, and analyze data that can vary from relational to document to graph to time-series, which makes it a good choice for a variety of modern IoT applications. It is built on the same database engine as SQL Server and Azure SQL, so applications can seamlessly use queries written in T-SQL. This makes applications portable between devices, datacenters, and the cloud.

Azure SQL Edge uses the same streaming capabilities as Azure Stream Analytics on IoT Edge. This native implementation of data streaming is called T-SQL streaming. It can handle fast streaming from multiple data sources. A T-SQL streaming job consists of a stream input that defines the connection to the data source from which the data stream is read, a stream output that defines the connection to the data source to which the data stream is written, and a stream query that defines the transformations, aggregations, filtering, sorting, and joins applied to the input stream before it is written to the stream output.

Data can be transferred in and out of SQL Edge. For example, data can be synchronized from SQL Edge to Azure Blob storage by using Azure Data Factory. As with all SQL instances, the client tools help create the database and the tables. SqlPackage.exe is used to create and apply a DAC package file to the SQL Edge container. A stored procedure or trigger is used to update the watermark levels for a table. A watermark table stores the last timestamp up to which data has already been synchronized with Azure Storage, and the stored procedure is run after every synchronization.

A Data Factory pipeline is used to synchronize data to Azure Blob storage from a table in Azure SQL Edge. The pipeline is created using the Data Factory user interface, and the PeriodicSync property must be set at the time of creation. A lookup activity is used to get the old watermark value. A dataset is created to represent the data in the watermark table; this table contains the old watermark that was used in the previous copy operation. A new linked service is created to source the data from the SQL Edge server using connection credentials. When the connection is tested, it can be used to preview the data to eliminate surprises during synchronization. In the pipeline editor, a designer tool, the WatermarkDataset is selected as the source dataset. A second lookup activity gets the new watermark value from the table that contains the source data so that it can be copied to the destination: a query added in the pipeline editor selects the maximum value of the timestamp, and only the first row is taken as the new watermark. Incremental progress is maintained by continually advancing the watermark. Both the source and the sink must be specified in the editor; the sink uses a new linked service to the blob storage. The success output of the copy activity is connected to a stored procedure activity, which then writes the new watermark. Finally, the pipeline is scheduled to be triggered periodically.
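
The watermark pattern that this pipeline implements can be summarized in a few lines. The sketch below, in Python with pyodbc and made-up table, column, and procedure names, shows the lookup of the old and new watermarks, the incremental copy, and the stored-procedure update that advances the watermark.

# Sketch of the watermark pattern behind the Data Factory pipeline.
# Table, column, and procedure names are illustrative assumptions.
import pyodbc

def copy_to_blob(rows):
    # Placeholder for the Copy activity: in the real pipeline, Azure Data Factory
    # writes these rows to Azure Blob storage.
    print(f"copying {len(rows)} rows to blob storage")

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=tcp:localhost,1433;Database=edgedb;Uid=sa;Pwd=<password>")
cur = conn.cursor()

# 1. Lookup activity: the old watermark from the watermark table.
old_watermark = cur.execute(
    "SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'Readings'").fetchval()

# 2. Lookup activity: the new watermark is the max timestamp in the source table.
new_watermark = cur.execute("SELECT MAX(Timestamp) FROM dbo.Readings").fetchval()

# 3. Copy activity: only rows between the two watermarks are synchronized.
rows = cur.execute(
    "SELECT * FROM dbo.Readings WHERE Timestamp > ? AND Timestamp <= ?",
    old_watermark, new_watermark).fetchall()
copy_to_blob(rows)

# 4. Stored procedure activity: advance the watermark after a successful copy.
cur.execute("EXEC dbo.usp_update_watermark @table = ?, @value = ?",
            "Readings", new_watermark)
conn.commit()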