Wednesday, December 22, 2021

 

Azure Machine Learning provides an environment to create and manage the end-to-end life cycle of machine learning models. Unlike general-purpose software, machine learning workloads have significantly different requirements: the use of a wide variety of technologies, libraries, and frameworks; separation of training and testing phases before a model is deployed and used; and iterations of model tuning that are independent of model creation and training. Azure Machine Learning's compatibility with open-source frameworks and platforms like PyTorch and TensorFlow makes it an effective all-in-one platform for integrating and handling data and models, which greatly reduces the burden on the business to develop new capabilities. Azure Machine Learning is designed for all skill levels, with advanced MLOps features as well as simple no-code model creation and deployment.
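
As a minimal sketch of what working against this environment can look like, the following connects to a workspace and submits a training run with the Azure Machine Learning Python SDK (v1, azureml-core). The curated environment name, training script, and compute cluster below are placeholders rather than prescribed values:

# A minimal sketch with the Azure ML Python SDK (azureml-core, v1 API).
# Environment, script, and compute names are placeholders.
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment

ws = Workspace.from_config()                       # reads config.json downloaded from the portal
env = Environment.get(ws, name="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu")

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",                             # hypothetical training script
    compute_target="gpu-cluster",                  # hypothetical compute cluster
    environment=env,
)

run = Experiment(ws, name="drawing-classifier").submit(src)
run.wait_for_completion(show_output=True)          # stream logs until the run finishes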

We will compare this environment with TensorFlow, but for those unfamiliar with the latter, here is a use case with TensorFlow. A JavaScript application performs image processing with a machine learning algorithm. When enough training images have been processed, the model learns the characteristics of the drawings that determine their labels. Then, as it runs through the test data set, it can predict the label of a drawing using the model. TensorFlow includes the Keras API, which can be used to author the model and deploy it to an environment such as Colab where the model can be trained on a GPU. Once training is done, the model can be loaded and run anywhere else, including a browser. The power of TensorFlow is in its ability to load the model and make predictions in the browser itself.

The labeling of drawings starts with a sample of, say, a hundred classes. The data for each class is available on Google Cloud as NumPy arrays containing some number of images, say N, for that class. The dataset is pre-processed for training: it is converted into batches, and the model outputs a probability for each class.

As with any machine learning example, the data is split into a 70% training set and a 30% test set. There is no inherent order to the data, and the split is taken over a randomly shuffled set.
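
A minimal sketch of that preparation, assuming per-class NumPy files and a 28x28 drawing size (both assumptions for illustration), might look like this:

# Load per-class NumPy arrays and make a random 70/30 split.
# File names, image size, and the number of classes are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

num_classes = 100
images, labels = [], []
for class_id in range(num_classes):
    data = np.load(f"class_{class_id}.npy")        # (N, 28, 28) drawings for this class
    images.append(data)
    labels.append(np.full(len(data), class_id))

x = np.concatenate(images).astype("float32") / 255.0   # normalize pixel values
y = np.concatenate(labels)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, shuffle=True, random_state=42)  # random 70/30 split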

TensorFlow makes it easy to construct this model, for example with the TensorFlow Lite Model Maker. Output can only be presented after the model is trained, and the training data must have labels assigned beforehand, which might be done by hand. The model works better with fewer parameters. It might contain 3 convolutional layers and 2 dense layers. The pooling size is specified for each of the convolutional layers, and the layers are stacked onto the model. The model is compiled with a loss function, an Adam optimizer (tf.train.AdamOptimizer in TensorFlow 1.x, tf.keras.optimizers.Adam in TensorFlow 2), and a metric such as top-k categorical accuracy. The summary of the model can be printed for inspection. With a chosen number of epochs and batch size, the model can be trained. Annotations help the TensorFlow Lite converter fuse TF.Text API operations, and this fusion leads to a significant speedup over conventional models. The architecture can also be tweaked to include a projection layer along with the usual convolutional layers and an attention encoder mechanism, which achieves similar accuracy with a much smaller model size. There is also native support for HashTables for NLP models.
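
A sketch of such a model in Keras, with filter counts and kernel sizes that are illustrative choices rather than prescribed values, could be:

# Three convolutional layers with pooling, two dense layers, an Adam optimizer,
# and top-5 categorical accuracy. Layer sizes are assumptions.
import tensorflow as tf

num_classes = 100

model = tf.keras.Sequential([
    tf.keras.layers.Reshape((28, 28, 1), input_shape=(28, 28)),
    tf.keras.layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),   # TF 2 equivalent of tf.train.AdamOptimizer
    loss="sparse_categorical_crossentropy",
    metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)])

model.summary()                              # print the model summary for inspection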

With the model and training/test sets defined, it is now easy to evaluate the model and run inference. The model can also be saved and restored. Execution is faster when a GPU is available to the compute.
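
Continuing the earlier sketch, once the model has been trained, evaluation, saving, and restoring might look like this (the file path is a placeholder):

# Evaluate on the held-out test set, then save and restore the model.
loss, top5 = model.evaluate(x_test, y_test, batch_size=256)
print(f"test loss={loss:.4f}, top-5 accuracy={top5:.4f}")

model.save("saved_model/drawing_classifier")                    # SavedModel format

restored = tf.keras.models.load_model("saved_model/drawing_classifier")
predictions = restored.predict(x_test[:10])                     # class probabilities per drawing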

Training is done in batches of a predefined size. The number of passes over the entire training dataset, called epochs, can also be set up front. For example, a batch size of 256 and a step count of 5 could be used. These are model tuning parameters, or hyperparameters. Every model has a speed, a mean average precision, and an output; generally, the higher the precision, the lower the speed. It is helpful to visualize the training with a chart that is updated with the loss after each epoch. Usually there will be a downward trend in the loss, which indicates that the model is converging.
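
A sketch of that training loop and loss visualization, continuing the earlier model and data and using a batch size of 256 with an illustrative 5 epochs:

# Train with a batch size of 256 for a few epochs, then plot the per-epoch loss
# to check that the model is converging.
import matplotlib.pyplot as plt

history = model.fit(
    x_train, y_train,
    batch_size=256,
    epochs=5,
    validation_split=0.1)   # hold out a slice of the training data for validation

# A downward trend in the loss across epochs indicates convergence.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()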

Training the model might take a long time, say about 4 hours. When the test data has been evaluated, the model's effectiveness can be measured using precision and recall: precision is the fraction of the model's positive inferences that were indeed positive, and recall is the fraction of actual positives that the model correctly identified.
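
Computed over the test set from the earlier sketches, these metrics might be obtained as follows:

# Precision and recall over the test set, computed per class and macro-averaged.
from sklearn.metrics import precision_score, recall_score

y_pred = model.predict(x_test).argmax(axis=1)

# Precision: of the drawings predicted as a class, how many really belong to it.
# Recall: of the drawings that belong to a class, how many were found.
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
print(f"precision={precision:.3f}, recall={recall:.3f}")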

Azure Machine Learning has a drag-and-drop interface that can be used to train and deploy models. It uses a machine learning workspace to organize shared resources such as pipelines, datasets, compute resources, registered models, published pipelines, and real-time endpoints. A visual canvas helps build an end-to-end machine learning workflow: models are trained, tested, and deployed all in the designer. Datasets and components can be dragged and dropped onto the canvas, and a pipeline draft connects the components. A pipeline run can be submitted using the resources in the workspace. Training pipelines can be converted to inference pipelines, and pipelines can be published so that new runs can be submitted with different parameters and datasets. A training pipeline can be reused for different models, and a batch inference pipeline can be used to make predictions on new data.
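
A comparable pipeline can also be expressed in code. The following is a hedged sketch using the Azure Machine Learning Python SDK (v1), where the dataset name, script, and compute target are placeholders and not what the designer generates:

# A sketch of the pipeline pattern with the Azure ML Python SDK (v1).
# Script name, compute target, and dataset name are placeholders.
from azureml.core import Workspace, Experiment, Dataset
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
training_data = Dataset.get_by_name(ws, name="drawings")        # registered dataset

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    inputs=[training_data.as_named_input("drawings")],
    compute_target="cpu-cluster",
    source_directory=".")

pipeline = Pipeline(workspace=ws, steps=[train_step])

# Submit a run, then publish the pipeline so it can be re-run with different
# parameters and datasets.
run = Experiment(ws, "designer-equivalent").submit(pipeline)
published = pipeline.publish(name="training-pipeline",
                             description="Reusable training pipeline")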

Tuesday, December 21, 2021

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing, which included the most recent discussion on Azure Maps, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category. In this article, we explore Azure Logic Apps.

Each logic app is a workflow that implements some process. This might be a system-to-system process, such as connecting two or more applications. Alternatively, it might be a user-to-system process, one that connects people with software and potentially has long delays. Logic Apps is designed to support either of these scenarios.

Azure Logic Apps is a member of Azure Integration Services. It simplifies the way legacy, modern, and niche systems are connected across cloud, on-premises, and hybrid environments. The integrated solutions are very valuable for B2B scenarios. Integration services distinguish themselves with four common components in their design: APIs, Events, Messaging, and Orchestration. APIs are a prerequisite for interactions between services; they facilitate functional programmatic access as well as automation. For example, a workflow orchestration might implement a complete business process by invoking different APIs in different applications, each of which carries out some part of that process. Integrating applications commonly requires implementing all or part of a business process. It can involve connecting a software-as-a-service implementation such as Salesforce CRM, updating on-premises data stored in SQL Server and Oracle databases, and invoking operations in an external application. These translate to specific business purposes and custom logic for orchestration. Many backend operations are asynchronous by nature and require background processing. Even when APIs are written for asynchronous processing, long-running API calls are not easily tolerated, so some form of background processing is required. Situations like this call for a message queue. Events facilitate the publisher-subscriber model so that polling a queue for messages can be avoided. For example, Event Grid supports subscribers: rather than requiring a receiver to poll for new messages, the receiver registers an event handler for the event source it is interested in, and Event Grid invokes that event handler when the specified event occurs. Azure Logic Apps are workflows, and a workflow can easily span all four of these components during its execution.

Azure Logic Apps can be multi-tenant. It is easier to write a multi-tenant application when we create a workflow from the templates gallery. These templates range from simple connectivity for software-as-a-service applications to advanced B2B solutions. Multi-tenancy means there is a shared, common infrastructure serving numerous customers simultaneously, leading to economies of scale.

Monday, December 20, 2021

 

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing, which included the most recent discussion on Azure Maps, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category. In this article, we explore Azure SQL Edge.

SQL Edge is an optimized relational database engine that is geared towards edge computing. It provides a high-performance data storage and processing layer for IoT applications. It provides capabilities to stream, process, and analyze data, where the data can vary from relational to document to graph to time series, which makes it the right choice for a variety of modern IoT applications. It is built on the same database engine as SQL Server and Azure SQL, so applications can seamlessly use queries written in T-SQL. This makes applications portable between devices, data centers, and the cloud.

Azure SQL Edge uses the same streaming capabilities as Azure Stream Analytics on IoT Edge. This native implementation of data streaming is called T-SQL streaming. It can handle fast streaming from multiple data sources. A T-SQL streaming job consists of a stream input that defines the connection to a data source to read the data stream from, a stream output that defines the connection to a data source to write the data stream to, and a stream query that defines the data transformations, aggregations, filtering, sorting, and joins to be applied to the input stream before it is written to the stream output.
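
A rough sketch of those three parts, issued over pyodbc, is shown below. The external stream definitions, data source and format names, and the query itself follow the general T-SQL streaming pattern but should be treated as assumptions; the exact options depend on how the edge hub and database are configured:

# A rough sketch of a T-SQL streaming job; names and options are assumptions.
import pyodbc

conn = pyodbc.connect("DSN=SqlEdge")   # connection to the Azure SQL Edge instance
cur = conn.cursor()

# Stream input: where the job reads the data stream from.
# EdgeHubInput and JsonFormat are assumed to be pre-created objects.
cur.execute("""
CREATE EXTERNAL STREAM SensorInput
WITH (DATA_SOURCE = EdgeHubInput, FILE_FORMAT = JsonFormat, LOCATION = N'sensors')
""")

# Stream output: where the transformed stream is written to (a local table here).
cur.execute("""
CREATE EXTERNAL STREAM SensorOutput
WITH (DATA_SOURCE = SqlOutput, LOCATION = N'dbo.SensorReadings')
""")

# Stream query: the transformation applied between input and output.
cur.execute("""
EXEC sys.sp_create_streaming_job
    @name = N'TemperatureJob',
    @statement = N'SELECT deviceId, AVG(temperature) AS avgTemp
                   INTO SensorOutput
                   FROM SensorInput TIMESTAMP BY ts
                   GROUP BY deviceId, TumblingWindow(second, 30)'
""")
cur.execute("EXEC sys.sp_start_streaming_job @name = N'TemperatureJob'")
conn.commit()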

Data can be transferred in and out of SQL Edge. For example, data can be synchronized from SQL Edge to Azure Blob storage by using Azure Data Factory. As with all SQL instances, the client tools help create the database and the tables. SqlPackage.exe is used to create and apply a DAC package file to the SQL Edge container. A stored procedure or trigger is used to update the watermark levels for a table. A watermark table stores the last timestamp up to which data has already been synchronized with Azure Storage, and the stored procedure is run after every synchronization. A Data Factory pipeline is used to synchronize data to Azure Blob storage from a table in Azure SQL Edge; it is created using the Data Factory user interface, and the PeriodicSync property must be set at the time of creation. A lookup activity is used to get the old watermark value. A dataset is created to represent the data in the watermark table; this table contains the old watermark that was used in the previous copy operation. A new linked service is created to source the data from the SQL Edge server using connection credentials. When the connection is tested, it can be used to preview the data to eliminate surprises during synchronization. In the pipeline editor, a designer tool, the WatermarkDataset is selected as the source dataset. Another lookup activity gets the new watermark value from the table that contains the source data so that the delta can be copied to the destination; a query is added for selecting the maximum value of the timestamp from that table, and only the first row is taken as the new watermark. Incremental progress is maintained by continually advancing the watermark. Not only the source but also the sink must be specified in the editor; the sink uses a new linked service to the Blob storage. The success output of the copy activity is connected to a stored procedure activity, which then writes the new watermark. Finally, the pipeline is scheduled to be triggered periodically.
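
To make the watermark pattern concrete, here is a hedged sketch of the same logic expressed directly in Python/pyodbc. In practice the lookup, copy, and stored procedure steps are Data Factory activities; the table, column, and procedure names below are placeholders:

# The watermark pattern behind the incremental sync; names are placeholders.
import csv
import pyodbc

edge = pyodbc.connect("DSN=SqlEdge")          # source: Azure SQL Edge
cur = edge.cursor()

# 1. Look up the old watermark: the timestamp up to which data was already copied.
old_watermark = cur.execute(
    "SELECT WatermarkValue FROM dbo.Watermark WHERE TableName = ?",
    "dbo.SensorReadings").fetchval()

# 2. Look up the new watermark: the maximum timestamp in the source table.
new_watermark = cur.execute(
    "SELECT MAX(Timestamp) FROM dbo.SensorReadings").fetchval()

# 3. Copy only the rows between the two watermarks (the copy activity's query).
cur.execute(
    "SELECT * FROM dbo.SensorReadings WHERE Timestamp > ? AND Timestamp <= ?",
    old_watermark, new_watermark)
with open("delta.csv", "w", newline="") as f:  # stand-in for the Blob storage sink
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur.fetchall())

# 4. Advance the watermark so the next run picks up where this one left off
#    (usp_UpdateWatermark is a hypothetical stored procedure).
cur.execute("EXEC dbo.usp_UpdateWatermark @Table = ?, @Value = ?",
            "dbo.SensorReadings", new_watermark)
edge.commit()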

Sunday, December 19, 2021

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing, which included the most recent discussion on Azure Maps, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category. In this article, we explore Azure SQL Edge.

Edge computing has developed differently from mainstream desktop, enterprise, and cloud computing. The focus has always been on speed rather than heavy data processing, which is delegated to the core or to cloud computing. Edge servers work well for machine data collection and the Internet of Things. Edge computing is typically associated with the event-driven architecture style and relies heavily on asynchronous backend processing. Some form of message broker becomes necessary to maintain ordering of events and to support retries and dead-letter queues.

SQL Edge is an optimized relational database engine that is geared towards edge computing. It provides a high-performance data storage and processing layer for IoT applications. It provides capabilities to stream, process, and analyze data, where the data can vary from relational to document to graph to time series, which makes it the right choice for a variety of modern IoT applications. It is built on the same database engine as SQL Server and Azure SQL, so applications can seamlessly use queries written in T-SQL. This makes applications portable between devices, data centers, and the cloud.

Azure SQL Edge supports two deployment modes: connected deployment through Azure IoT Edge and disconnected deployment. The connected deployment requires Azure SQL Edge to be deployed as a module for Azure IoT Edge. In the disconnected deployment mode, it can be deployed as a standalone Docker container or on a Kubernetes cluster.

There are two editions of Azure SQL Edge, a developer edition and a production SKU, with the specification going from 4 cores/32 GB to 8 cores/64 GB. Azure SQL Edge uses the same streaming capabilities as Azure Stream Analytics on IoT Edge. This native implementation of data streaming is called T-SQL streaming. It can handle fast streaming from multiple data sources. The patterns and relationships in the data are extracted from several IoT input sources, and the extracted information can be used to trigger actions, alerts, and notifications. A T-SQL streaming job consists of a stream input that defines the connection to a data source to read the data stream from, a stream output that defines the connection to a data source to write the data stream to, and a stream query that defines the data transformations, aggregations, filtering, sorting, and joins to be applied to the input stream before it is written to the stream output.

SQL Edge also supports machine learning models by integrating with the Open Neural Network Exchange (ONNX) runtime. The models are developed independently of the edge but can be run on the edge.
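
A hedged sketch of that workflow, training a small model off the device, exporting it to ONNX, and verifying it with the ONNX runtime; the feature count and file name are illustrative, and on SQL Edge the resulting .onnx model would then be invoked for native scoring:

# Train off the edge, export to ONNX, and sanity-check with onnxruntime.
import numpy as np
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

# Train a small model on synthetic sensor data (three features are assumed).
X = np.random.rand(200, 3).astype(np.float32)
y = X @ np.array([1.5, -2.0, 0.5], dtype=np.float32)
model = LinearRegression().fit(X, y)

# Convert to ONNX with a declared input type and save it.
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 3]))])
with open("sensor_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Verify the exported model with the ONNX runtime before shipping it to the edge.
session = ort.InferenceSession("sensor_model.onnx")
prediction = session.run(None, {"input": X[:1]})[0]
print(prediction)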

 

 

Saturday, December 18, 2021

Azure Maps and heatmaps

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing. In this article, we continue the discussion on Azure Maps, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category. We focus on one of the features of Azure Maps that enables overlays of images and heat maps.

Azure Maps is a collection of geospatial services and SDKs that fetches the latest geographic data and provides it as context to web and mobile applications. Specifically, it provides REST APIs to render vector and raster maps as overlays, including satellite imagery; creator services to enable indoor map data publication; search services to locate addresses, places, and points of interest from indoor and outdoor data; various routing options such as point-to-point, multipoint, multipoint optimization, isochrone, electric vehicle, commercial vehicle, traffic-influenced, and matrix routing; traffic flow and incident views for applications that require real-time traffic information; time zone and geolocation services; elevation services with a Digital Elevation Model; a geofencing service and mapping data storage, with location information hosted in Azure; and location intelligence through geospatial analytics.
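
Outside the browser SDKs, the REST surface can be called directly. A hedged sketch of geocoding an address with the Search API follows; the subscription key is a placeholder and api-version 1.0 is assumed here:

# Call the Azure Maps Search (address geocoding) REST API.
import requests

params = {
    "api-version": "1.0",
    "subscription-key": "<your-azure-maps-key>",   # placeholder key
    "query": "400 Broad St, Seattle, WA",
}
resp = requests.get("https://atlas.microsoft.com/search/address/json", params=params)
resp.raise_for_status()

# The first result carries the latitude/longitude of the matched address.
best = resp.json()["results"][0]
print(best["position"])   # e.g. {'lat': ..., 'lon': ...}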

The Web SDK for Azure Maps enables several features through its map control. We can create a map, change the style of the map, add controls to the map, add layers on top of the map, add HTML markers, show traffic, cluster point data, use data-driven style expressions, use image templates, react to events, and make the app accessible.

Heat maps are also known as point density maps because they represent the density of data, and the relative density of each data point, using a range of colors. They can be overlaid on the map as a layer. Heat maps can be used in different scenarios, including temperature data, data from noise sensors, and GPS traces.

Adding a heat map layer is as simple as:

map.layers.add(new atlas.layer.HeatMapLayer(datasource, null, { radius: 10, opacity: 0.8 }), 'labels');

The opacity or transparency is normalized between 0 and 1. The intensity is a multiplier on the weight of each data point. The weight is a measure of how many times the data point should be counted on the map.

Azure Maps provides a consistent, zoomable heat map: as the user zooms out, the data points aggregate together, so the heat map may look different from how it appears at the original zoom level. Scaling the radius also changes the heat map, because the radius doubles with each zoom level.

All of this processing happens on the client side when rendering the given data points.

Friday, December 17, 2021

 

Location queries
Location is a data type. It can be represented either as a point or as a polygon, and each helps with answering questions such as finding the top 3 stores nearest to a geographic point or the stores within a region. Since it is a data type, there is some standardization available. SQL Server defines not one but two data types for the purpose of specifying location: the geography data type and the geometry data type. The geography data type stores ellipsoidal data such as GPS latitude and longitude, and the geometry data type stores data in a Euclidean (flat) coordinate system. The point and the polygon are examples of the geography data type. Both the geography and the geometry data types must reference a spatial reference system, and since there are many such systems, each value must be associated with a specific one. This is done with the help of a parameter called the Spatial Reference Identifier, or SRID for short. SRID 4326 is the well-known GPS coordinate system that gives information in the form of latitude/longitude. Translation of an address to a latitude/longitude/SRID tuple is supported with the help of built-in functions that drill down progressively from the overall coordinate span. A table such as ZipCode could have an identifier, code, state, boundary, and center point with the help of these two data types: the boundary would be the polygon formed by the zip code, and the center point its central location. Distances between stores and their membership in a zip code can be calculated based on this center point. The geography data type also lets us perform clustering analytics, which answers questions such as the number of stores or restaurants satisfying a certain spatial condition and/or matching certain attributes; these are implemented using R-tree data structures that support such clustering techniques. The geometry data type supports operations such as area and distance because it translates to coordinates. It has its own rectangular coordinate system that we can use to specify the boundaries, or the 'bounding box', that the spatial index covers.
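
As a concrete illustration of the nearest-stores question, here is a hedged sketch of the query against a hypothetical Stores table; the DSN, table, and column names are placeholders:

# Top 3 nearest stores using the geography data type and SRID 4326.
import pyodbc

conn = pyodbc.connect("DSN=StoresDb")
cur = conn.cursor()

latitude, longitude = 47.6062, -122.3321   # point of interest (GPS, SRID 4326)

# geography::Point(lat, long, srid) builds the point; STDistance returns meters.
cur.execute("""
    DECLARE @p geography = geography::Point(?, ?, 4326);
    SELECT TOP (3) StoreId, Name, Location.STDistance(@p) AS DistanceInMeters
    FROM dbo.Stores
    ORDER BY Location.STDistance(@p);
""", latitude, longitude)

for store_id, name, distance in cur.fetchall():
    print(store_id, name, round(distance, 1))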

The operations performed with these data types include the distance between two geography objects, the method to determine a range from a point such as a buffer or a margin, and the intersection of two geographic locations. The geometry data type supports operations such as area and distance because it translates to coordinates. Some other methods supported with these data types include contains, overlaps, touches, and within. 

A note about the use of these data types now follows. One approach is to store the coordinates in a separate table where the primary key is the pair of latitude and longitude, declared unique so that a pair of latitude and longitude does not repeat. Such an approach is questionable because the uniqueness constraint on locations has a maintenance overhead. For example, two records could refer to the same point, and unreferenced rows might need to be cleaned up. Locations also change ownership; for example, store A could own a location that was previously owned by store B, but B never updates its location. Moreover, stores could undergo renames or conversions. Thus, it may be better to keep the spatial data associated in a repeatable way along with the information about the location. Also, these data types do not participate in set operations. Such operations are easier to implement with collections and enumerables in the programming language of choice and usually consist of the following four steps: initialization of the answer, returning the answer on termination, accumulation called for each row, and merge called when combining the processing from parallel workers. These steps are like a map-reduce algorithm. These data types and operations are improved with the help of a spatial index. These indexes are like indexes for other data types and are stored using B-trees. Since this is an ordinary one-dimensional index, the reduction of the two-dimensional spatial data to one dimension is performed by means of tessellation, which divides the area into small subareas and records the subareas that intersect each spatial instance. For example, with the geography data type, the entire globe is divided into hemispheres and each hemisphere is projected onto a plane. When a given geography instance covers one or more subsections, or tiles, the spatial index has an entry for each such tile that is covered. The geometry data type has its own rectangular coordinate system that we define, which we can use to specify the boundaries, or the 'bounding box', that the spatial index covers. Visualizers support overlays with spatial data, which is popular with mapping applications that superimpose information over the map with the help of transparent layers. An example is Azure Maps with geofencing, as described here.

Thursday, December 16, 2021

Adding Azure Maps to an Android Application

 


This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing. In this article, we continue the discussion on Azure Maps, a full-fledged general-availability service that provides Service Level Agreements comparable to others in its category, but with an emphasis on writing mobile applications. Specifically, we target the Android platform.

We leverage an event-driven architecture style where the Service Bus delivers the messages that the mobile application processes. As with the case of geofencing, different messages can be used for different handling. The mobile application is a consumer of the messages, making occasional API calls that generate messages on the backend of a web-queue-worker service. The scope of this document is to focus on just the mobile application stack. The tracking and producing of messages are done in the backend, and the mobile application uses Azure Maps to display the location. We will need an active Azure Maps account and key for this purpose. The subscription, resource group, name, and pricing tier must be determined beforehand. The mobile application merely adds an Azure Maps control to the application.

An Android application will require a Java-based implementation. Since the communication is over HTTP, the technology stack can be independent between the backend and the mobile application. The Azure Maps Android SDK will be leveraged for this purpose. The top-level build.gradle file will define the repository URL https://atlas.microsoft.com/sdk/android. Java 8 can be chosen as the appropriate version to use. The SDK can be imported into build.gradle with the artifact description "com.azure.android:azure-maps-control:1.0.0". The application will introduce the map control as <com.azure.android.maps.control.MapControl android:id="@+id/mapcontrol" android:layout_width="match_parent" android:layout_height="match_parent" /> in the main activity XML file. The corresponding Java file will add imports for the Azure Maps SDK, set the Azure Maps authentication information, and get the map control instance in the onCreate method. SetSubscriptionKey and SetAadProperties can be used to add the authentication information on every view. The control will display the map even on the emulator. A sample Android application can be seen here.

As with all applications, the activity control flow must be tight and guide the user through specific workflows. The views, their lifetime, and their activity must be controlled, and the user should not perceive the application as hung or spinning. Interactivity for the control is assured if the application recycles and cleans up the associated resources as the user moves from one page to another.

It is highly recommended to get the activity framework and navigation worked out and planned independently of the content. The views corresponding to the content are restricted to the one that displays the map control, so the application can focus mostly on user navigation and activities.