Monday, August 9, 2021

A search service over blobs

 

The cloud attracts unstructured data in the form of blobs in object storage. Since this accumulation is inevitable and unstructured content escapes the usual organization associated with relational data, some form of search service can help the end-user who must search blobs from different data sources. This document focuses on the technical considerations for making this leap and leaves the business case out of scope.

Desktop search, enterprise search, and even internet search speak to the versatility and popularity of search tools. These tools are popular for many reasons, but they are all targeted at the end-user who would otherwise have to navigate a large collection unaided. The query layer for a search service sometimes requires its own pipeline, albeit over the same dataset, much as relational data management provided the foundation on which SQL queries could be written.

Let us consider object storage as a destination for logs, so that the ability to search the store drastically improves production support. A query layer can directly reuse the object storage as its data source. The store is virtually limitless and requires no maintenance. Log stores are typically time-series databases. A time-series database makes progressive buckets as each one fills with events, and this can be done easily with object storage too. The namespace-bucket-object hierarchy is well suited for time-series data. There is no limit to the number of objects within a bucket, and we can roll over buckets in the same hot-warm-cold manner that time-series databases do. Moreover, with the data available in object storage, it is easily accessible to all users for reading over HTTP. The only caveat is that some production support requests may ask for the object storage that persists cached objects to be kept separate from the object storage that persists logs. This is quite reasonable and may be accommodated on-premises or in the cloud, depending on the worth of the data and the cost incurred. The log stores can be periodically trimmed as well. In addition, the entire querying stack for reading these entries can be built on copies or selections of buckets and objects. More about saving logs and their indexes in object storage is available at: https://1drv.ms/w/s!Ashlm-Nw-wnWt3eeNpsNk1f3BZVM
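As a rough illustration of this layout, the following PowerShell sketch writes a log file into a time-bucketed container and blob hierarchy with the Az.Storage module. The storage account name, the key variable, and the one-container-per-day naming convention are assumptions made for this example, not part of any particular system.

# Minimal sketch: write a log file into a time-bucketed container/blob hierarchy.
Import-Module Az.Storage

$context = New-AzStorageContext -StorageAccountName "logsaccount" `
                                -StorageAccountKey $env:LOG_STORAGE_KEY

# One container per day acts as the bucket that rolls over, hot-warm-cold style.
$bucket = "logs-{0:yyyy-MM-dd}" -f (Get-Date).ToUniversalTime()
if (-not (Get-AzStorageContainer -Name $bucket -Context $context -ErrorAction SilentlyContinue)) {
    New-AzStorageContainer -Name $bucket -Context $context | Out-Null
}

# Object names carry a finer-grained time prefix so listings come back time-ordered.
$blobName = "{0:HH/mm}/app-{1}.log" -f (Get-Date).ToUniversalTime(), $env:COMPUTERNAME
Set-AzStorageBlobContent -File "C:\logs\app.log" `
                         -Container $bucket `
                         -Blob $blobName `
                         -Context $context | Out-Null

Listing a bucket by time prefix then becomes the natural way for a query layer to select a range of log objects.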

Sunday, August 8, 2021

 Introduction:  


This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss Azure services for containers, including the Azure Kubernetes Service (AKS) and the Azure Container Instance, after the last article on Azure Stream Analytics. AKS provides a Kubernetes cluster in a serverless environment hosted in the cloud, which can elastically scale the cluster to meet the demands of the workload. The workload can run in the cloud, at the edge, or as a hybrid, with support for running .NET applications on Windows Server containers, Java applications on Linux containers, or microservice applications in different languages and environments. Essentially, this provisions a datacenter on top of Azure Stack HCI. The hyper-converged infrastructure is an Azure service that provides security, performance, and feature updates and allows the datacenter to be extended to the cloud. When AKS is deployed on a Windows Server 2019 Datacenter host, the cluster is local to the Windows Server, but when it is deployed to Azure Stack HCI it can scale seamlessly because the HCI is hosted on its own set of clusters.
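For a rough sense of how provisioning looks from PowerShell, the sketch below creates a cluster with a Linux system node pool and adds a Windows Server node pool using the Az.Aks module. The resource names, region, and sizes are placeholders, and cmdlet parameters vary across Az.Aks versions, so treat it as a sketch rather than a prescribed configuration.

# Hedged sketch: an AKS cluster with a default Linux pool plus a Windows pool for .NET workloads.
Import-Module Az.Aks

New-AzResourceGroup -Name "demo-aks-rg" -Location "westus2"

$winCred = Get-Credential -Message "Administrator account for the Windows nodes"
New-AzAksCluster -ResourceGroupName "demo-aks-rg" `
                 -Name "demo-aks" `
                 -NodeCount 2 `
                 -NetworkPlugin azure `
                 -WindowsProfileAdminUserName $winCred.UserName `
                 -WindowsProfileAdminUserPassword $winCred.Password

# Windows node pool names are limited to a few characters, hence the short name.
New-AzAksNodePool -ResourceGroupName "demo-aks-rg" `
                  -ClusterName "demo-aks" `
                  -Name "win1" `
                  -OsType Windows `
                  -Count 2

# Merge the cluster credentials into the local kubeconfig for kubectl access.
Import-AzAksCredential -ResourceGroupName "demo-aks-rg" -Name "demo-aks"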

Well-known container orchestration frameworks such as Kubernetes rose meteorically in popularity with their widespread acceptance on Linux hosts. The challenge with running Kubernetes is that the Kubernetes control plane must be hosted on Linux-flavor virtual machines. While Kubernetes can support worker nodes on both Linux and Windows, it is not easy to get much further by trying to run the control plane on Windows Server. Minikube tried to reduce this gap but could not avoid this requirement either. Once we have VirtualBox on a Windows server, an Ubuntu-flavor image can be used to launch a host, and then deploying Kubernetes is easy. Azure Kubernetes Service also requires Linux-flavor servers for the Kubernetes control plane, and it can run worker nodes on either operating system.

Kubernetes uses pods to run an instance of a microservice or user application. The pods, deployments, stateful sets, and daemon sets are the worker-node components. The pod hosts, deployment controllers, and Kubernetes scheduler ensure that the replicas are distributed efficiently. When the pods have images that correspond to Windows Server OS flavors, they can run .NET applications. Mixed-mode operating systems can also be used to host an application by leveraging Kubernetes node selectors, taints, and tolerations, as the sketch below shows.
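A hedged sketch of that scheduling technique follows: taint the Windows nodes so that only pods that both tolerate the taint and select the Windows OS are placed there. The taint key, deployment name, and image are placeholders chosen for illustration; kubernetes.io/os is the standard node label.

# Keep general workloads off the Windows nodes with a taint.
kubectl taint nodes -l kubernetes.io/os=windows os=windows:NoSchedule

# Deploy a Windows container that selects Windows nodes and tolerates the taint.
@"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dotnet-sample                # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels: { app: dotnet-sample }
  template:
    metadata:
      labels: { app: dotnet-sample }
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: "os"
        operator: "Equal"
        value: "windows"
        effect: "NoSchedule"
      containers:
      - name: web
        image: mcr.microsoft.com/dotnet/framework/samples:aspnetapp   # sample Windows image
"@ | kubectl apply -f -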

The issues faced in hosting the Kubernetes control plane on Windows with Minikube, MicroK8s, and other Kubernetes runtimes include the following:

1)      Use of an external insecure docker registry 

a.       The term insecure refers only to HTTP versus HTTPS. It was required because Docker and Minikube on Windows work together by taking an --insecure-registry start-up parameter. This is not the case on Linux, where we can do without this parameter and have Minikube host its own Docker registry. On Windows, we install the Docker Toolbox and Minikube separately. This gives us two virtual machines, named 'default' for Docker and 'minikube' for the Kubernetes cluster. Both are Linux environments.

2)      Sufficient disk space and NAT traversal

a.       The insecure registry requires sufficient disk space. The default size is about 18 GB, and this is not sufficient when multiple images are required. The typical size for commercial workloads is at least 30 GB. The Docker Toolbox is preferred over other software packages for installing Docker on Windows. It is preferable to have Minikube start with at least 2 CPUs and 8 GB of memory; small Minikube deployments can be expected to restart frequently due to low resources. Specifying CPU, memory, and disk for the whole host does not necessarily lower the size of the various clusters used with workloads, but a minimal-dev-values file can specify the lower end of the number of replicas and containers and their consumption. (A start-up sketch follows this list.)

3)      Minikube's networking is usually host-facing as opposed to facing the external world, which is also why it requires an insecure registry.

4)      Security of the overall Minikube installation, as opposed to a Kubernetes deployment on managed Linux nodes, is a concern because the instance is completely owned by its user rather than managed and locked down.
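To make items 1 and 2 above concrete, a typical start-up on Windows looks like the sketch below. The registry address is a placeholder for the Docker Toolbox 'default' VM, the sizes follow the guidance above, and flag names (for example --vm-driver versus --driver) vary between Minikube versions.

# Sketch: start Minikube on Windows with more generous resources and an
# insecure (HTTP) registry exposed by the separate Docker Toolbox VM.
minikube start --vm-driver virtualbox `
               --cpus 2 `
               --memory 8192 `
               --disk-size 30g `
               --insecure-registry "192.168.99.100:5000"

# Verify that the single-node cluster came up and is Ready.
kubectl get nodes -o wide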

Cloud services do not encounter these issues because they have the luxury of elastic and even hyper-converged resources and they have absolutely no need to host the Kubernetes control plane on Windows. 

 


Saturday, August 7, 2021

 

<#

 

.SYNOPSIS

 

This script can be called from a runbook and uses Azure REST methods.

Unlike a user identity, applications and service principals cannot connect to an Az account interactively.

This module shows how to get a token so that resources can be created, updated and deleted using REST methods.

#https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-client-creds-grant-flow

#similar to Connect-AzAccount -identity

#>

 

 

 

# System.Web provides HttpUtility for URL-encoding the client secret; load it explicitly for Windows PowerShell.
Add-Type -AssemblyName System.Web

function Get-Payload() {

param (

    [Parameter(Mandatory=$true)][string]$ClientId,

    [Parameter(Mandatory=$true)][string]$ClientSecret,

    [string]$Resource = "https://management.core.windows.net/"

)

$encoded=[System.Web.HttpUtility]::UrlEncode($ClientSecret)

$payload = "grant_type=client_credentials&client_id=$ClientId&client_secret=$encoded&resource=$Resource"

return $payload

}

 

function Get-Token(){

param (

    [Parameter(Mandatory=$true)][string]$TenantId,

    [Parameter(Mandatory=$true)][string]$ClientId,

    [Parameter(Mandatory=$true)][string]$ClientSecret,

    [string]$Resource = "https://management.core.windows.net/",

     [string]$RequestAccessTokenUri = "https://login.microsoftonline.com/$TenantId/oauth2/token"

)

$payload = Get-Payload $ClientId $ClientSecret

$Token = Invoke-RestMethod -Method Post -Uri $RequestAccessTokenUri -body $payload -ContentType 'application/x-www-form-urlencoded'

return $Token

}

 

function Get-ResourceGroups(){

param (

    [Parameter(Mandatory=$true)][string]$TenantId,

    [Parameter(Mandatory=$true)][string]$SubscriptionId,

    [Parameter(Mandatory=$true)][string]$ClientId,

    [Parameter(Mandatory=$true)][string]$ClientSecret,

    [Parameter(Mandatory=$true)][string]$ResourceGroupName,

    [string]$Resource = "https://management.core.windows.net/",

    [string]$environment = "AzureCloud",

    [string]$RequestAccessTokenUri = "https://login.microsoftonline.com/$TenantId/oauth2/token"

)

$Token = Get-Token $TenantId $ClientId $ClientSecret $Resource $RequestAccessTokenUri

$ApiUri = "https://management.azure.com/subscriptions/$($SubscriptionId)/resourcegroups?api-version=2017-05-10"

$Headers = @{}

$Headers.Add("Authorization","$($Token.token_type) $($Token.access_token)")

$ResourceGroups = Invoke-RestMethod -Method Get -Uri $ApiUri -Headers $Headers

return $ResourceGroups

}

 

function Get-Cache() {

param (

    [Parameter(Mandatory=$true)][string]$TenantId,

    [Parameter(Mandatory=$true)][string]$SubscriptionId,

    [Parameter(Mandatory=$true)][string]$ClientId,

    [Parameter(Mandatory=$true)][string]$ClientSecret,

    [Parameter(Mandatory=$true)][string]$ResourceGroupName,

    [Parameter(Mandatory=$true)][string]$CacheName,

    [string]$Resource = "https://management.core.windows.net/",

    [string]$environment = "AzureCloud",

    [string]$RequestAccessTokenUri = "https://login.microsoftonline.com/$TenantId/oauth2/token"

)

$Token = Get-Token $TenantId $ClientId $ClientSecret $Resource $RequestAccessTokenUri

$ApiUri="https://management.azure.com/subscriptions/$SubscriptionId/resourceGroups/$ResourceGroupName/providers/Microsoft.Cache/redis/$($CacheName)?api-version=2020-06-01"

$Headers = @{}

$Headers.Add("Authorization","$($Token.token_type) $($Token.access_token)")

$Cache = Invoke-RestMethod -Method Get -Uri $ApiUri -Headers $Headers

return $Cache

}

 

function New-Cache() {

param (

    [Parameter(Mandatory=$true)][string]$TenantId,

    [Parameter(Mandatory=$true)][string]$SubscriptionId,

    [Parameter(Mandatory=$true)][string]$ClientId,

    [Parameter(Mandatory=$true)][string]$ClientSecret,

    [Parameter(Mandatory=$true)][string]$ResourceGroupName,

    [Parameter(Mandatory=$true)][string]$CacheName,

    [string]$Resource = "https://management.core.windows.net/",

    [string]$environment = "AzureCloud",

    [string]$RequestAccessTokenUri = "https://login.microsoftonline.com/$TenantId/oauth2/token"

)
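# Note: the -CacheName argument is replaced below with a generated unique name of the form "AGS-redis-<guid>".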

$CacheName = "AGS-redis-"

$guid = New-Guid

$CacheName = $CacheName + $guid.Guid

$Token = Get-Token $TenantId $ClientId $ClientSecret $Resource $RequestAccessTokenUri

$ApiUri = "https://management.azure.com/subscriptions/$SubscriptionId/resourceGroups/$ResourceGroupName/providers/Microsoft.Cache/redis/$($CacheName)?api-version=2020-06-01"

$payload = @"

{

  "location": "West US 2",

  "properties": {

    "sku": {

      "name":"Premium",

      "family":"P",

      "capacity":1

    },

    "size": "P1"

  }

}

"@

$Headers = @{}

$Headers.Add("Authorization","$($Token.token_type) $($Token.access_token)")

$Cache = Invoke-RestMethod -contentType "application/json" -Method Put -Uri $ApiUri -Headers $Headers -Body $payload

return $Cache

}

 

function Remove-Cache() {

param (

    [Parameter(Mandatory=$true)][string]$TenantId,

    [Parameter(Mandatory=$true)][string]$SubscriptionId,

    [Parameter(Mandatory=$true)][string]$ClientId,

    [Parameter(Mandatory=$true)][string]$ClientSecret,

    [Parameter(Mandatory=$true)][string]$ResourceGroupName,

    [Parameter(Mandatory=$true)][string]$CacheName,

    [string]$Resource = "https://management.core.windows.net/",

    [string]$environment = "AzureCloud",

    [string]$RequestAccessTokenUri = "https://login.microsoftonline.com/$TenantId/oauth2/token"

)

$Token = Get-Token $TenantId $ClientId $ClientSecret $Resource $RequestAccessTokenUri

$ApiUri="https://management.azure.com/subscriptions/$SubscriptionId/resourceGroups/$ResourceGroupName/providers/Microsoft.Cache/redis/$($CacheName)?api-version=2020-06-01"

$Headers = @{}

$Headers.Add("Authorization","$($Token.token_type) $($Token.access_token)")

$Cache = Invoke-RestMethod -Method Delete -Uri $ApiUri -Headers $Headers

return $Cache

}

 

Export-ModuleMember -Function Get-Token

Export-ModuleMember -Function Get-ResourceGroups

Export-ModuleMember -Function New-Cache

Export-ModuleMember -Function Get-Cache

Export-ModuleMember -Function Remove-Cache
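For completeness, a hedged usage sketch of the module follows; the module file name, identifiers, and resource group name are placeholders.

# Usage sketch: import the module and exercise the exported functions.
Import-Module .\CacheManagement.psm1          # placeholder module file name

$token  = Get-Token -TenantId $TenantId -ClientId $ClientId -ClientSecret $ClientSecret
$groups = Get-ResourceGroups -TenantId $TenantId -SubscriptionId $SubscriptionId `
                             -ClientId $ClientId -ClientSecret $ClientSecret `
                             -ResourceGroupName "rg-placeholder"
$cache  = New-Cache -TenantId $TenantId -SubscriptionId $SubscriptionId `
                    -ClientId $ClientId -ClientSecret $ClientSecret `
                    -ResourceGroupName "rg-placeholder" -CacheName "unused"   # name is generated inside
Remove-Cache -TenantId $TenantId -SubscriptionId $SubscriptionId `
             -ClientId $ClientId -ClientSecret $ClientSecret `
             -ResourceGroupName "rg-placeholder" -CacheName $cache.name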

 #codingexercise https://1drv.ms/w/s!Ashlm-Nw-wnWrwRgdOFj3KLA0XSi

Friday, August 6, 2021

 

Introduction:

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we continue our study of Azure Stream Analytics from the last article. We had been comparing Apache Flink and Kafka with Azure Stream Analytics, observing how Flink and Azure Stream Analytics determine watermarks and sequence events along their timelines. We now investigate automation for performing stream analysis.

Jobs and clusters form the two main components of Azure Stream Analytics. When the job is created, the deployment can be validated. The job itself is represented by an ARM template, a JSON document that defines the infrastructure and configuration for the project. The template uses declarative syntax, so there is no need to write commands to create the deployment. The template takes parameters such as the location, the Stream Analytics job name, and the number of streaming units, which are then applied to the resource when the job is created.

Deployment using the template can be kicked off directly from the CLI, PowerShell, portal and SDK. All of these provide programmability options.  Resources can be cleaned up by deleting the resource group.
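As a hedged example of kicking off that deployment from PowerShell, the template file name and resource names below are placeholders, and the parameter names are taken to match the template parameters described above.

# Sketch: deploy the Stream Analytics job from its ARM template.
New-AzResourceGroup -Name "asa-demo-rg" -Location "westus2"

New-AzResourceGroupDeployment -ResourceGroupName "asa-demo-rg" `
                              -TemplateFile ".\streamanalytics-job.json" `
                              -streamAnalyticsJobName "asa-demo-job" `
                              -numberOfStreamingUnits 3 `
                              -location "westus2"

# Clean-up later is a single call, as noted above:
# Remove-AzResourceGroup -Name "asa-demo-rg"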

The input to the job can also be configured with the help of a PowerShell cmdlet. The New-AzStreamAnalyticsInput cmdlet takes the job name, the job input name, the resource group name, and the job input definition as parameters. Even blob storage can be passed in as an input; the AccessPolicyKey and SharedAccessKey are derived from the connection strings to the data source. The output of the job is similarly configured with the help of a JobOutputDefinition, which takes the storage account access key as a parameter, and blobs will be stored in a container from that account. Finally, the transformation query can be specified via the New-AzStreamAnalyticsTransformation cmdlet, which takes the job name, the job transformation name, the resource group name, and the job transformation definition as parameters. This definition contains a query property that defines the transformation query.

The Start-AzStreamAnalyticsJob cmdlet takes the job name, resource group name, output start mode, and start time as parameters and starts the job.
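A hedged sketch that ties these cmdlets together is shown below. The definition file names, job name, and resource group name are placeholders, and parameter sets differ between Az.StreamAnalytics versions.

# Sketch: wire up an input, an output, and a transformation query, then start the job.
# The *.json files are placeholders for the definitions described above.
New-AzStreamAnalyticsInput -ResourceGroupName "asa-demo-rg" -JobName "asa-demo-job" `
                           -Name "MyInput" -File ".\JobInputDefinition.json"

New-AzStreamAnalyticsOutput -ResourceGroupName "asa-demo-rg" -JobName "asa-demo-job" `
                            -Name "MyOutput" -File ".\JobOutputDefinition.json"

New-AzStreamAnalyticsTransformation -ResourceGroupName "asa-demo-rg" -JobName "asa-demo-job" `
                                    -Name "MyTransformation" -File ".\JobTransformationDefinition.json"

Start-AzStreamAnalyticsJob -ResourceGroupName "asa-demo-rg" -Name "asa-demo-job" `
                           -OutputStartMode "JobStartTime"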

Thursday, August 5, 2021

 

Introduction:

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we continue our study of Azure Stream Analytics from the last article. We had been comparing Apache Flink and Kafka with Azure Stream Analytics and observing how Kubernetes is used to leverage containers for the clusters and to run jobs for analysis. One of the most interesting aspects of stream processing is the support for watermarks, and we explore this comparison next.

Flink provides three different types of processing based on timestamps, independent of the two methods discussed earlier. The three types of timestamps correspond to processing time, event time, and ingestion time.

 

Of these, only event time guarantees completely consistent and deterministic results. All three processing types can be set on the StreamExecutionEnvironment prior to the execution of queries.

Event time also supports watermarks. Watermarks are the mechanism in Flink for measuring progress in event time. They are simply inlined with the events. As an operator advances its timestamp, it introduces a watermark for the downstream operators to process. In a distributed setting where an operator receives input from more than one stream, the watermark on the outgoing stream is determined by the minimum of the watermarks from the incoming streams. As the input streams update their event times, so does the operator. Flink also provides a way to coalesce events within a window.

 

The Flink connector has an EventTimeOrderingOperator. It uses watermarks and managed state to buffer elements, which helps order them by event time. This class extends AbstractStreamOperator and implements OneInputStreamOperator. The last-seen watermark is initialized to the minimum value. The operator uses a timer service and a MapState stashed in the runtime context and processes stream records one by one. If an event does not have a timestamp, it is simply forwarded; if it has a timestamp, the operator buffers all the events between the current and the next watermark.

When the event timer fires due to watermark progression, the operator emits all buffered events whose timestamps are less than or equal to the current watermark. If no timestamps remain, the queued state is cleared; otherwise a timer for the next watermark is registered. The sorted list of timestamps from buffered events is maintained in a priority queue.
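To make the buffering idea concrete, here is a language-agnostic sketch written in PowerShell to stay consistent with the rest of this blog; it is not Flink code, only an illustration of holding events back until the watermark has passed their timestamps.

# Illustration only: buffer events keyed by timestamp and release them in order
# once the watermark has advanced past them (the essence of event-time ordering).
$buffer = New-Object 'System.Collections.Generic.SortedDictionary[System.Int64,System.Collections.ArrayList]'

function Add-BufferedEvent([long]$Timestamp, $Record) {
    if (-not $buffer.ContainsKey($Timestamp)) {
        $buffer[$Timestamp] = New-Object System.Collections.ArrayList
    }
    [void]$buffer[$Timestamp].Add($Record)
}

function Invoke-Watermark([long]$Watermark) {
    # Emit every buffered event whose timestamp is <= the watermark, in timestamp order.
    $ready = @($buffer.Keys | Where-Object { $_ -le $Watermark })
    foreach ($ts in $ready) {
        foreach ($record in $buffer[$ts]) { Write-Output $record }   # downstream emission
        [void]$buffer.Remove($ts)
    }
}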

AscendingTimestampExtractor is a timestamp assigner and watermark generator for streams where timestamps are monotonically ascending. The timestamps increase continuously for data such as log files. The local watermarks are easy to assign because they follow the strictly increasing timestamps and are emitted periodically.

 

Azure Stream Analytics also follows a timeline for events. There are two choices: arrival time and application/event time. It bases its watermark on the largest event time the service has seen so far minus the out-of-order tolerance window size. If there are no incoming events, the watermark is the current estimated arrival time minus the late-arrival tolerance window. This can only be estimated because the real arrival time is known only to the data forwarders, such as Event Hubs. Beyond generating watermarks, this design serves two additional purposes: the system generates results in a timely fashion with or without incoming events, and the system behavior is repeatable. Since the data forwarder guarantees continuously increasing streaming data, the service disregards the out-of-order tolerance and late-arrival tolerance configurations when an analytics application chooses arrival time as the event time.
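The rule can be stated compactly; the sketch below restates the watermark calculation described above, with variable names of my own choosing rather than the service's.

# Illustration of the watermark rule described above:
#   with events:    watermark = max event time seen - out-of-order tolerance window
#   without events: watermark = estimated arrival time - late-arrival tolerance window
function Get-EstimatedWatermark {
    param (
        $MaxEventTimeSeen,                 # [datetime] of the largest event time observed, or $null when no events have arrived
        [datetime]$EstimatedArrivalTime,   # the service's estimate of the current arrival time
        [timespan]$OutOfOrderTolerance,
        [timespan]$LateArrivalTolerance
    )
    if ($null -ne $MaxEventTimeSeen) {
        return ([datetime]$MaxEventTimeSeen) - $OutOfOrderTolerance
    }
    return $EstimatedArrivalTime - $LateArrivalTolerance
}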

 

Wednesday, August 4, 2021

 Introduction: 

This article is a continuation of the series of articles starting with the description of the SignalR service.

In this article, we continue our study of Azure Stream Analytics from the last article. We had been comparing Apache Flink and Kafka with Azure Stream Analytics and observing how Kubernetes is used to leverage containers for the clusters and to run jobs for analysis.

Windows Azure also hosts Kubernetes for customers. Native containers are small and fast, and they have two defining characteristics. First, containers are isolated from each other and from the host, to the point of having their own file system, which makes them portable across clouds and OS distributions. Second, immutable container images can be created at build/release time rather than at deployment time, since each application doesn't need to be composed with the rest of the application stack or tied to the production infrastructure environment. Kubernetes extends this idea of app-plus-container all the way to clusters, where the hosts are nodes of a cluster. Kubernetes evolved as an industry effort from the operating system's native Linux container support. It can be considered a step towards a truly container-centric development environment. Containers decouple applications from infrastructure, which separates dev from ops, and they demonstrate better resource isolation and improved resource utilization.

At this point, it is important to differentiate Kubernetes from PaaS. Kubernetes is not a traditional, all-inclusive PaaS. Unlike a PaaS that restricts applications, dictates the choice of application frameworks, restricts supported language runtimes, or distinguishes apps from services, Kubernetes aims to support an extremely diverse variety of workloads. If an application has been compiled to run in a container, it will work with Kubernetes. A PaaS provides databases, message buses, and cluster storage systems, but those can also run on Kubernetes. There is no click-to-deploy service marketplace either. Kubernetes does not build user code or deploy it, but it facilitates CI workflows that run on it.

Kubernetes allows users to choose their own logging, monitoring, and alerting solutions. It also does not require a comprehensive application language or configuration system and is independent of machine configuration or management. But a PaaS can run on Kubernetes and extend its reach to different clouds.

This section talks about Azure Stream Analytics clusters. Such a cluster offers a single-tenant deployment for complex streaming scenarios and can process more than 200 MB per second in real time. The jobs running on these clusters can leverage all the features of Azure Stream Analytics and can directly read from inputs and outputs that are private to the organization. Clusters are sized by streaming units (SUs), which represent a unit of CPU and memory resources allocated to a cluster. Cluster size can range from 36 SUs to 216 SUs.

Since these are single-tenant, dedicated clusters, they can be run in a multi-tenant environment with complete isolation from other tenants. These clusters can be scaled, and a virtual network can be added so that jobs connect to resources securely over private endpoints. They come with zero maintenance cost, so the application team can focus on the streaming jobs and share results with multiple teams.


Tuesday, August 3, 2021

Azure Stream Analytics

 Introduction:

This article is a continuation of the series of articles starting with the description of the SignalR service some time back. In this article, we focus on stream analytics from Azure. As the name suggests, this is a service for analyzing events that are ordered and arrive in a continuous manner. Like its industry counterparts, this service defines notions of jobs and runs them on clusters. The analysis done on data arriving in this form includes identifying patterns and relationships, and it applies to data sources that range from devices, sensors, and clickstreams to social media feeds and other applications. Actions can be taken when certain patterns appear, and workflows can be started that provide alerts and notifications to users. Data can also be transformed and channeled via pipelines for automation. This service is available on the Azure IoT Edge runtime environment and enables processing data on those devices.

Data from device traffic usually carry timestamps and are discrete, often independent of one another. They are also characterized as unstructured data arriving in an ordered manner, where it is generally not possible to store all of them at once for subsequent analysis. When the analysis is done in batches, it becomes a batch processing job that runs on a cluster and scales out batches to different nodes, as many as the cluster will allow. Holding sets of events in batches can introduce latency, so the notion of micro-batching is introduced for more frequent processing. Stream processing takes it even further and processes one event at a time.

Some of the use cases for continuous events involve geospatial analytics for fleet management and driverless vehicles, weblog and clickstream analytics, and point-of-sale data for inventory control. In all these cases there is a point of ingestion from data sources, typically via Azure Event Hubs, IoT Hub, or blob storage. Event ordering options and time windows can be suitably adjusted to perform aggregations. The query language is SQL, and it can be extended with JavaScript or C# user-defined functions. Queries written in SQL are easy to apply for filtering, sorting, and aggregation. The topology between ingestion and delivery is handled by the Stream Analytics service while allowing extensions with the help of reference data stores, Azure Functions, and real-time scoring via machine learning services. Event Hubs, Azure Blob storage, and IoT Hubs can collect data on the ingestion side, while results are distributed after analysis via alerts and notifications, dynamic dashboarding, data warehousing, and storage/archival. The fan-out of data to different services is itself a value addition, but the ability to transform events into processed events also creates more possibilities for downstream usage, including reporting and visualization. As with all the services in the Azure portfolio, it comes with standard deployment using Azure Resource Manager templates, health monitoring via Azure Monitor, billing that can drive down costs, and various programmability options such as SDKs, REST-based APIs, command-line interfaces, and PowerShell automation. It is a fully managed PaaS offering, so the infrastructure and workflow initializers need not be set up by hand. It runs in the cloud and scales to many events with relatively low latency. The service is not only production-ready but also reliable in mission-critical deployments, and security and compliance are not sacrificed for the sake of performance. Finally, it integrates with Visual Studio to bring comprehensive testing, debugging, publishing, and authoring convenience.
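As an illustration of the kind of query involved, a tumbling-window aggregation in the Stream Analytics query language is shown below, wrapped in a PowerShell here-string the way a transformation definition would carry it. The input and output aliases and the column names are placeholders.

# Illustrative transformation query: count events per device over 30-second tumbling windows.
$transformationQuery = @"
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO [alerts-output]
FROM [clickstream-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 30)
"@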