Sunday, August 15, 2021

Zone-Down simulation 


Introduction:  

A zone-down exercise is a drill for availability zones in public cloud computing. Availability zones are massive assets for a public cloud: they build resilience and availability into applications and services. Each zone comprises multiple datacenters, and the zones are almost indistinguishable from one another. A region may have three availability zones for redundancy and availability, and each zone may host a variety of cloud computing resources, small or big. Since there are several stakeholders in an availability zone, it is a challenge to see what happens when one fails. Even the usual method of finding that out is quite drastic, because it involves powering down the zone. There are many network connections to and from cloud resources, so it is hard to find an alternative. This article proposes a solution based on diverting traffic between zones.

Proposal:  

This is a multi-tiered approach to the problem statement. 

Step 1. The first tier is a virtual network around the PaaS services provisioned under a subscription. This can be done with the help of Azure Private Link, which connects resources over an internal subnet so that they do not have to be accessed publicly. Setting up this virtual network is critical to the stated goal of isolating zones that would otherwise have participated in shared zone redundancy. If each zone is put in a separate virtual network and the networks share nothing, then sending traffic to only one network will prevent the utilization of the other zones, thus simulating a zone-down at the network level. The internal traffic travels over the Microsoft backbone network and does not have to traverse the internet.

The New-AzPrivateLinkService cmdlet used for this has the following syntax:

New-AzPrivateLinkService
   -Name <String>
   -ResourceGroupName <String>
   -Location <String>
   -LoadBalancerFrontendIpConfiguration <PSFrontendIPConfiguration[]>
   -IpConfiguration <PSPrivateLinkServiceIpConfiguration[]>

It is used in a sequence of steps like this:

# A Private Link service is fronted by a Standard internal load balancer, so the frontend uses a private IP from the subnet rather than a public IP.
$vnet = Get-AzVirtualNetwork -Name 'myvnet' -ResourceGroupName 'myresourcegroup'
$subnet = $vnet | Select-Object -ExpandProperty Subnets | Where-Object Name -eq 'mysubnet'
$IPConfig = New-AzPrivateLinkServiceIpConfig -Name 'IP-Config' -Subnet $subnet -PrivateIpAddress '10.0.0.5'
$frontend = New-AzLoadBalancerFrontendIpConfig -Name 'FrontendIpConfig01' -Subnet $subnet
$lb = New-AzLoadBalancer -Name 'MyLoadBalancer' -ResourceGroupName 'myresourcegroup' -Location 'West US' -Sku 'Standard' -FrontendIpConfiguration $frontend
New-AzPrivateLinkService -Name 'mypls' -ResourceGroupName 'myresourcegroup' -Location 'West US' -LoadBalancerFrontendIpConfiguration $frontend -IpConfiguration $IPConfig

Step 2. The second tier is an application gateway that onboards user traffic to one of the virtual networks:

$ResourceGroup = New-AzResourceGroup -Name "ResourceGroup01" -Location "West US" -Tag @{Department = "Marketing"} 

$Subnet = New-AzVirtualNetworkSubnetConfig -Name "Subnet01" -AddressPrefix 10.0.0.0/24 

$VNet = New-AzVirtualNetwork -Name "VNet01" -ResourceGroupName "ResourceGroup01" -Location "West US" -AddressPrefix 10.0.0.0/16 -Subnet $Subnet 

$VNet = Get-AzVirtualNetwork -Name "VNet01" -ResourceGroupName "ResourceGroup01" 

$Subnet = Get-AzVirtualNetworkSubnetConfig -Name "Subnet01" -VirtualNetwork $VNet  

$GatewayIPconfig = New-AzApplicationGatewayIPConfiguration -Name "GatewayIp01" -Subnet $Subnet 

$Pool = New-AzApplicationGatewayBackendAddressPool -Name "Pool01" -BackendIPAddresses 10.10.10.1, 10.10.10.2, 10.10.10.3 

$PoolSetting = New-AzApplicationGatewayBackendHttpSettings -Name "PoolSetting01" -Port 80 -Protocol "Http" -CookieBasedAffinity "Disabled" 

$FrontEndPort = New-AzApplicationGatewayFrontendPort -Name "FrontEndPort01" -Port 80 

$PublicIp = New-AzPublicIpAddress -ResourceGroupName "ResourceGroup01" -Name "PublicIpName01" -Location "West US" -AllocationMethod "Dynamic" 

$FrontEndIpConfig = New-AzApplicationGatewayFrontendIPConfig -Name "FrontEndConfig01" -PublicIPAddress $PublicIp 

$Listener = New-AzApplicationGatewayHttpListener -Name "ListenerName01" -Protocol "Http" -FrontendIpConfiguration $FrontEndIpConfig -FrontendPort $FrontEndPort 

$Rule = New-AzApplicationGatewayRequestRoutingRule -Name "Rule01" -RuleType basic -BackendHttpSettings $PoolSetting -HttpListener $Listener -BackendAddressPool $Pool 

$Sku = New-AzApplicationGatewaySku -Name "Standard_Small" -Tier Standard -Capacity 2 

$Gateway = New-AzApplicationGateway -Name "AppGateway01" -ResourceGroupName "ResourceGroup01" -Location "West US" -BackendAddressPools $Pool -BackendHttpSettingsCollection $PoolSetting -FrontendIpConfigurations $FrontEndIpConfig  -GatewayIpConfigurations $GatewayIpConfig -FrontendPorts $FrontEndPort -HttpListeners $Listener -RequestRoutingRules $Rule -Sku $Sku 

 

An application gateway can be used with both public and private addresses. It comprises frontend IP addresses, listeners, request routing rules, HTTP settings, and backend pools. A backend pool routes requests to the backend servers, which serve the request. Backend pools can contain:

  

NICs 

Virtual machine scale sets 

Public IP addresses 

Internal IP addresses 

FQDN 

Multitenant backends (such as App Service) 

This allows an application gateway to communicate with instances outside of the virtual network that it is in. Application Gateway backend pool members are not tied to an availability set. As a result, the members of the backend pools can be across clusters, across datacenters, or outside Azure, as long as there is IP connectivity.

  

To use internal IPs as backend pool members, virtual network peering or a VPN gateway must be used. Virtual network peering is supported and is beneficial for load-balancing traffic in other virtual networks. Different backend pools can be used for different types of requests; for example, one backend pool for private network traffic and another backend pool for public IP members.
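As a sketch, two such pools could be declared with the same cmdlet used earlier; the pool names, private IP addresses, and the FQDN below are illustrative placeholders rather than real resources:

# One pool for private network members, another for a multitenant or public member
$PrivatePool = New-AzApplicationGatewayBackendAddressPool -Name "PrivatePool01" -BackendIPAddresses 10.1.0.4, 10.1.0.5
$PublicPool = New-AzApplicationGatewayBackendAddressPool -Name "PublicPool01" -BackendFqdns "contoso.azurewebsites.net"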

We do not require more than one load balancer, nor a separate load balancer per availability zone.

Step 3. The third tier is a traffic manager that switches between sending traffic to the application gateway for experiment purposes and the default path, where the traffic is not diverted and reaches the public endpoints. The traffic manager is not a replacement for the application gateway, which is still required; it facilitates having a designated public IP address as the destination for the undiverted traffic and the application gateway IP address for the diverted traffic, as the sketch below shows.
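A minimal sketch of this tier follows, assuming a hypothetical profile name and DNS prefix, and that $DefaultPublicIp and $PublicIp refer to the public IP resources of the undiverted endpoint and the application gateway respectively:

$tmProfile = New-AzTrafficManagerProfile -Name "ZoneDrillProfile" -ResourceGroupName "ResourceGroup01" -TrafficRoutingMethod Priority -RelativeDnsName "zonedrill" -Ttl 30 -MonitorProtocol HTTP -MonitorPort 80 -MonitorPath "/"
# Priority 1 is the default (undiverted) endpoint; priority 2 is the application gateway used for the drill.
New-AzTrafficManagerEndpoint -Name "DefaultEndpoint" -ProfileName "ZoneDrillProfile" -ResourceGroupName "ResourceGroup01" -Type AzureEndpoints -TargetResourceId $DefaultPublicIp.Id -EndpointStatus Enabled -Priority 1
New-AzTrafficManagerEndpoint -Name "DrillEndpoint" -ProfileName "ZoneDrillProfile" -ResourceGroupName "ResourceGroup01" -Type AzureEndpoints -TargetResourceId $PublicIp.Id -EndpointStatus Enabled -Priority 2
# During the experiment, disabling the default endpoint diverts traffic to the application gateway; re-enabling it restores the default path.
Disable-AzTrafficManagerEndpoint -Name "DefaultEndpoint" -ProfileName "ZoneDrillProfile" -ResourceGroupName "ResourceGroup01" -Type AzureEndpoints -Force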

https://1drv.ms/u/s!Ashlm-Nw-wnWzWDDWMXCinJgxN0Q  

Conclusion: 

The use of a multi-tiered approach is needed only because the virtual network is internal while the customer traffic is external, but diversion of traffic is a simple and elegant alternative to powering down a zone for the drill.

Saturday, August 14, 2021

 Azure Cognitive Search  

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure Cognitive Search service, also known as Azure Search, after the last article on Azure Stream Analytics.

 

Azure Cognitive Search differs from the Do-It-Yourself techniques in that it is a fully managed search-as-a-service, but it is primarily a full-text search. It provides a rich user experience with searching all types of content including vision, language, and speech. It provides machine learning features to contextually rank search results. It is powered by deep learning models. It can extract and enrich content using AI-powered algorithms. Different content can be consolidated to build a single index.  

 

The search service supports primarily indexing and querying. Indexing is associated with the input data path to the search service. It processes the content and converts it into JSON documents. If the content includes mixed files, searchable text can be extracted from the files. Heterogeneous content can be consolidated into a private, user-defined search index. Large amounts of data stored in external repositories, including Blob storage, Cosmos DB, or other storage, can now be indexed. The index can be protected against data loss, corruption, and disasters via the same mechanisms that are used for the content. The index is also independent of the service, so if one service goes down, another can read the same index.
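As a rough sketch, an external repository such as Blob storage can be wired up to the indexing path through the Search REST API; the service name, keys, connection string, container, and index names below are placeholders rather than actual values:

$service = "myservice"
$headers = @{ "api-key" = "<admin-api-key>" }
# Register the blob container as a data source for the search service.
$dataSource = @{ name = "logs-blob"; type = "azureblob"; credentials = @{ connectionString = "<storage-connection-string>" }; container = @{ name = "logs" } } | ConvertTo-Json -Depth 5
Invoke-RestMethod -Method Post -Uri "https://$service.search.windows.net/datasources?api-version=2020-06-30" -Headers $headers -Body $dataSource -ContentType "application/json"
# Create an indexer that pulls from the data source into an existing index on a schedule.
$indexer = @{ name = "logs-indexer"; dataSourceName = "logs-blob"; targetIndexName = "logs-index"; schedule = @{ interval = "PT1H" } } | ConvertTo-Json -Depth 5
Invoke-RestMethod -Method Post -Uri "https://$service.search.windows.net/indexers?api-version=2020-06-30" -Headers $headers -Body $indexer -ContentType "application/json"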

 

We evaluate the features of the Azure Cognitive Search service next.

 

The indexing features of Azure Cognitive Search include a full-text search engine, persistent storage of search indexes, integrated AI, and APIs and tools. The data sources for indexing can be arbitrary as long as the mode of transferring data is JSON documents. Indexers automate data transfer from these data sources and turn it into searchable content in the primary storage. Connectors help the indexers with the data transfer and are specific to data sources such as Azure SQL Database, Cosmos DB, or Azure Blob storage. Complex data types and collections allow us to model any type of JSON data structure within a search index. The use of collections and complex types helps with one-to-many and many-to-many mappings. Analyzers can be used for linguistic analysis of the data ingested into the indexes.
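A hedged sketch of such an index definition follows; it assumes the same placeholder service and admin key as any REST call, and the field names are made up for illustration:

$service = "myservice"
$headers = @{ "api-key" = "<admin-api-key>" }
# 'tags' is a collection (one-to-many) and 'address' is a complex type with nested sub-fields.
$index = @{
    name = "hotels-sample"
    fields = @(
        @{ name = "id"; type = "Edm.String"; key = $true },
        @{ name = "description"; type = "Edm.String"; searchable = $true; analyzer = "en.lucene" },
        @{ name = "tags"; type = "Collection(Edm.String)"; searchable = $true; filterable = $true },
        @{ name = "address"; type = "Edm.ComplexType"; fields = @(
            @{ name = "city"; type = "Edm.String"; searchable = $true },
            @{ name = "postalCode"; type = "Edm.String"; filterable = $true }
        ) }
    )
} | ConvertTo-Json -Depth 10
Invoke-RestMethod -Method Put -Uri "https://$service.search.windows.net/indexes/hotels-sample?api-version=2020-06-30" -Headers $headers -Body $index -ContentType "application/json"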

 

The standard Lucene analyzer is used by default, but it can be overridden with a language analyzer, a custom analyzer, or one of many predefined analyzers that produce the tokens used for search. AI processing for image and text analysis can be applied to an indexing pipeline at the time of extracting text information; some examples of built-in skills include optical character recognition and key phrase extraction. The pipeline can also be integrated with Azure Machine Learning authored skills.
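A rough sketch of attaching such skills to the pipeline is shown below; the skillset name is hypothetical, and the inputs and outputs would need adjustment for a real pipeline (for example, image extraction must be enabled on the indexer before OCR receives any images):

$service = "myservice"
$headers = @{ "api-key" = "<admin-api-key>" }
$skillset = @{
    name = "drill-skillset"
    skills = @(
        @{ "@odata.type" = "#Microsoft.Skills.Vision.OcrSkill"
           context = "/document/normalized_images/*"
           inputs = @( @{ name = "image"; source = "/document/normalized_images/*" } )
           outputs = @( @{ name = "text"; targetName = "ocrText" } ) },
        @{ "@odata.type" = "#Microsoft.Skills.Text.KeyPhraseExtractionSkill"
           context = "/document"
           inputs = @( @{ name = "text"; source = "/document/content" } )
           outputs = @( @{ name = "keyPhrases" } ) }
    )
} | ConvertTo-Json -Depth 10
Invoke-RestMethod -Method Put -Uri "https://$service.search.windows.net/skillsets/drill-skillset?api-version=2020-06-30" -Headers $headers -Body $skillset -ContentType "application/json"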

 

The indexing pipeline can also generate a knowledge store. Instead of sending only tokenized terms to an index, it can enrich documents and send them to a knowledge store. This store can be native to Azure in the form of Blob storage or Table storage. The purpose of the knowledge store is to support downstream analysis and processing. With the availability of a separate knowledge store, analysis and reporting stacks can now be decoupled from the indexing pipeline.

Another feature of the indexing pipeline is cached content. This limits processing to just the documents that are affected by specific edits to the pipeline. Most usages read from the cache.

The query pipeline also has several features to enhance the analysis from the Lucene search store. These include free-form text search, relevance tuning, and geo-search. Free-form text search is the primary use case for queries. The simple syntax includes logical operators, phrase operators, suffix operators, precedence operators, and others. Extensions to this search include proximity search, term boosting, and regular expressions. Simple scoring is a key benefit of this search: a set of scoring rules is used to model relevance for the documents, and these rule sets can be built using tags for personalized scoring based on customer search preferences.
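To make this concrete, the sketch below runs a free-form query with the full Lucene syntax, combining a boosted term and a proximity phrase; the service, index, field names, and keys are placeholders:

$service = "myservice"
$headers = @{ "api-key" = "<query-api-key>" }
$query = @{
    search = '"zone down"~3 drill^2'       # proximity phrase plus a boosted term
    queryType = "full"                     # full Lucene parser enables boosting, proximity, and regex
    searchMode = "any"
    filter = "region eq 'westus'"          # 'region' is a hypothetical filterable field
    orderby = "lastUpdated desc"           # 'lastUpdated' is a hypothetical sortable field
    top = 10
} | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri "https://$service.search.windows.net/indexes/logs-index/docs/search?api-version=2020-06-30" -Headers $headers -Body $query -ContentType "application/json"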


Friday, August 13, 2021

Azure Cognitive Search

 


This article is a continuation of the series of articles starting with the description of SignalR service. In this article, we begin to discuss Azure cognitive search service aka Azure Search, after the last article on Azure Stream Analytics.   

Azure Cognitive Search differs from the Do-It-Yourself techniques in that it is a fully managed search-as-a-service, but it is primarily a full-text search. It provides a rich user experience with searching all types of content including vision, language, and speech. It provides machine learning features to contextually rank search results. It is powered by deep learning models. It can extract and enrich content using AI-powered algorithms. Different content can be consolidated to build a single index.

The search service supports primarily indexing and querying. Indexing is associated with the input data path to the search service. It processes the content and converts it into JSON documents. If the content includes mixed files, searchable text can be extracted from the files. Heterogeneous content can be consolidated into a private, user-defined search index. Large amounts of data stored in external repositories including Blob storage, Cosmos DB, or other storage can now be indexed. The index can be protected against data loss, corruption, and disasters via the same mechanisms that are used for the content. The index is also independent of the service, so if one service goes down, another can read the same index.

The querying service supports search experience from a variety of clients and occurs on the outbound path of the search service. The index and the querying service are separate. In this article, we will compare this service with other search services. 

Microsoft Search differs from the Azure Search service in that it searches SharePoint. It enables users who are authenticated by Microsoft 365 to query content in SharePoint; the content flows into the libraries via connectors. The Cognitive Search service, on the other hand, searches an index that the user defines, and the user specifies what content must be indexed. The indexing pipeline can be enhanced with machine learning and text analysis. This service is also positioned as a plugin for a variety of applications.

The Bing Search API maintains an index for the internet using web crawlers. There is an option for custom search where the same technology can be scoped to individual web sites for different content types. Cognitive Search is geared toward content from Azure data sources. It can index any JSON document that conforms to its schema, across clients and data sources.

Database search technology is an offering from database platforms that provide a built-in search capability for the content stored in their databases. This is probably where the overlap in indexable data is greatest. Both database platforms and Cognitive Search can index this data very well, but the latter provides advanced features for deep learning. If search and storage must be combined, SQL Server and Cosmos DB have out-of-the-box features to support this use case. Many solutions use both, but only the Cognitive Search service can perform advanced text and natural-language processing with its features for autocorrection of misspelled words, synonyms, suggestions, scoring controls, facets, and custom tokenization. Azure Cognitive Search persists data only in the form of an inverted index and provides no solution for data storage. A use case where data storage might need to be independent of the search service is one where the database is targeted toward online transaction processing and the cloud search service is externalized to adjust elastically to query volume.

 

If dedicated search functionality is desired, on-premises solutions and the cloud service can be compared. The cloud service provides a turnkey, one-stop-shop solution for search. The on-premises solutions provide greater flexibility and control over indexing, querying, and results-filtering syntax. There might be specialized solutions that span the cloud, but they are meant for advanced users and still might not match the experience from the Azure Search service.

These are some of the advantages of using the Azure Search Service.

Thursday, August 12, 2021

 

Azure Cognitive Search

This article is a continuation of the series of articles starting with the description of SignalR service. In this article, we begin to discuss Azure cognitive search service after the last article on Azure Stream Analytics. 

Azure Cognitive Search differs from the Do-It-Yourself techniques in that it is a fully managed search-as-a-service, but it is primarily a full-text search. It provides a rich user experience with searching all types of content including vision, language, and speech. It provides machine learning features to contextually rank search results. It is powered by deep learning models. It can extract and enrich content using AI-powered algorithms. Different content can be consolidated to build a single index.

The full-text search query is based on Lucene functionality that has been customized with extensions and lockdowns to enable core scenarios. There are four stages to query execution: query parsing, lexical analysis, document matching, and scoring. When the query text comes in, the query parser must separate query terms from the query operators and create the query tree to be sent to the search engine. The separated terms are sent to the analyzers, which must perform stemming, canonicalization, and removals to efficiently utilize the terms. The analyzed terms are sent back to the parser. The terms proceed to the search engine, which must store and organize searchable terms extracted from indexed documents. This index lives separately from the documents, and it can be regenerated independently of query execution. Finally, the search engine scores and retrieves the contents of the inverted index to display the top matches. A sample program to illustrate this is included below.
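A minimal sketch of such a sample program follows; it assumes a hypothetical service, query key, and index, so the names are placeholders rather than real resources:

$service = "myservice"
$headers = @{ "api-key" = "<query-api-key>" }
$body = @{
    search = "beach access wifi"        # free-form terms handed to the query parser
    searchFields = "description,tags"   # restrict analysis and matching to these fields
    searchMode = "any"                  # a document matches if any term matches
    queryType = "simple"                # simple query syntax
    top = 5                             # return the five highest-scoring matches
} | ConvertTo-Json
$result = Invoke-RestMethod -Method Post -Uri "https://$service.search.windows.net/indexes/hotels-index/docs/search?api-version=2020-06-30" -Headers $headers -Body $body -ContentType "application/json"
$result.value | Select-Object '@search.score', description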

The REST API for Azure Cognitive Search takes a payload with properties such as “search”, “searchFields”, “searchMode”, “filter”, “orderby”, and “queryType”. The query is broken down into sub-queries such as a term query, a phrase query, and a prefix query. The search terms can include wildcards, for example to match several terms by prefix. The search engine scans the fields specified in the searchFields property for documents that match one or more of the search terms. The resulting sets are ordered, and it is easy to specify geography data type-based queries so that results can be sorted by proximity.

The search service supports primarily indexing and querying. Indexing is associated with the input data path to the search service. It processes the content and converts it into JSON documents. If the content includes mixed files, searchable text can be extracted from the files. Heterogeneous content can be consolidated into a private, user-defined search index. Large amounts of data stored in external repositories including Blob storage, Cosmos DB, or other storage can now be indexed. The index can be protected against data loss, corruption, and disasters via the same mechanisms that are used for the content. The index is also independent of the service, so if one service goes down, another can read the same index.

The querying service supports the search experience from a variety of clients and occurs on the outbound path of the search service. The index and the querying service are separate. In the next article, we will compare this service with other search services.

Wednesday, August 11, 2021

Azure Cognitive Search

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure Cognitive Search service after the last article on Azure Stream Analytics. We had also discussed the Azure Kubernetes Service, which provides a Kubernetes cluster in a serverless environment hosted in the cloud. The workload can be run in the cloud, at the edge, or as a hybrid, with support for running .NET applications on Windows Server containers, Java applications on Linux containers, or microservice applications in different languages and environments. Essentially, this provisions a datacenter on top of Azure Stack HCI. The hyper-converged infrastructure is an Azure service that provides security, performance, and feature updates and allows the datacenter to be extended to the cloud. When AKS is deployed on a Windows Server 2019 datacenter, the cluster is local to the Windows Server, but when it is deployed to Azure Stack HCI, it can scale seamlessly because the HCI is hosted on its own set of clusters. We also reviewed the Azure Stream Analytics service, which provides a scalable approach to analyzing streams with its notion of jobs and clusters. Let us review the Azure Cognitive Search service for its search capabilities.

Azure Cognitive Search differs from the Do-It-Yourself techniques in that it is a fully managed search-as-a-service, but it is primarily a full-text search. It provides a rich user experience with searching all types of content including vision, language, and speech. It provides machine learning features to contextually rank search results. It is powered by deep learning models. It can extract and enrich content using AI-powered algorithms. Different content can be consolidated to build a single index.

The full-text search query is based on Lucene functionality that has been customized with extensions and lockdowns to enable core scenarios. There are four stages to query execution: query parsing, lexical analysis, document matching, and scoring. When the query text comes in, the query parser must separate query terms from the query operators and create the query tree to be sent to the search engine. The separated terms are sent to the analyzers, which must perform stemming, canonicalization, and removals to efficiently utilize the terms. The analyzed terms are sent back to the parser. The terms proceed to the search engine, which must store and organize searchable terms extracted from indexed documents. This index lives separately from the documents, and it can be regenerated independently of query execution. Finally, the search engine scores and retrieves the contents of the inverted index to display the top matches. A sample program to illustrate this is included below.
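A hedged sketch that exercises just the lexical-analysis stage through the Analyze Text API is shown below; the service, index, and sample text are placeholders, and the call returns the tokens the analyzer would hand to the search engine:

$service = "myservice"
$headers = @{ "api-key" = "<admin-api-key>" }
# Ask the index which tokens the standard Lucene analyzer would produce for this text.
$body = @{ text = "Zone-down drills for West US"; analyzer = "standard.lucene" } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri "https://$service.search.windows.net/indexes/logs-index/analyze?api-version=2020-06-30" -Headers $headers -Body $body -ContentType "application/json"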

The REST API for Azure Cognitive Search takes a payload with properties such as “search”, “searchFields”, “searchMode”, “filter”, “orderby”, and “queryType”. The query is broken down into sub-queries such as a term query, a phrase query, and a prefix query. The search terms can include wildcards, for example to match several terms by prefix. The search engine scans the fields specified in the searchFields property for documents that match one or more of the search terms. The resulting sets are ordered, and it is easy to specify geography data type-based queries so that results can be sorted by proximity.

Tuesday, August 10, 2021

 

Introduction:  

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure Cognitive Search service after the last article on Azure Stream Analytics. We had also discussed the Azure Kubernetes Service, which provides a Kubernetes cluster in a serverless environment hosted in the cloud. The workload can be run in the cloud, at the edge, or as a hybrid, with support for running .NET applications on Windows Server containers, Java applications on Linux containers, or microservice applications in different languages and environments. Essentially, this provisions a datacenter on top of Azure Stack HCI. The hyper-converged infrastructure is an Azure service that provides security, performance, and feature updates and allows the datacenter to be extended to the cloud. When AKS is deployed on a Windows Server 2019 datacenter, the cluster is local to the Windows Server, but when it is deployed to Azure Stack HCI, it can scale seamlessly because the HCI is hosted on its own set of clusters. We also reviewed the Azure Stream Analytics service, which provides a scalable approach to analyzing streams with its notion of jobs and clusters.

Jobs and clusters form the two main components of Stream Analytics. When the job is created, the deployment can be validated. The job itself is represented by an ARM template, which is a JSON notation that defines the infrastructure and configuration for the project. The template uses declarative syntax, so there is no need to write commands to create the deployment. The template takes parameters such as the location, the Stream Analytics job name, and the number of streaming units, which are then applied to the resource, and the job is created.
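A minimal sketch of such a deployment from PowerShell follows; the template file name and the parameter names (location, jobName, streamingUnits) are placeholders that would have to match the actual template:

# Validate the deployment before creating the job.
Test-AzResourceGroupDeployment -ResourceGroupName "StreamAnalyticsRG" -TemplateFile ".\streamanalytics-job.json" -TemplateParameterObject @{ location = "West US"; jobName = "myasajob"; streamingUnits = 3 }
# Create the job from the template.
New-AzResourceGroupDeployment -ResourceGroupName "StreamAnalyticsRG" -TemplateFile ".\streamanalytics-job.json" -TemplateParameterObject @{ location = "West US"; jobName = "myasajob"; streamingUnits = 3 }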

Since the infrastructure is no longer a concern at this point, we can now review a few Do-It-yourself approaches to implementing a search service before we start with the Cognitive search service. These include:

1) Implementing a query layer that can search directly over object storage. The store is limitless and requires no maintenance. Log stores are typically time-series databases. A time-series database makes progressive buckets as each one fills with events, and this can be done easily with object storage too. The namespace-bucket-object hierarchy is well suited for time-series data. There is no limit to the number of objects within a bucket, and we can roll over buckets in the same hot-warm-cold manner that time-series databases do. Moreover, with the data available in the object storage, it is easily accessible to all users for reading over HTTP. The only caveat is that production support may ask for the object storage that persists cached objects to be separate from the object storage that persists logs. This is quite reasonable and may be accommodated on-premises or in the cloud, depending on the worth of the data and the cost incurred. The log stores can be periodically trimmed as well.
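A rough sketch of such a query layer, assuming logs land in a hypothetical Azure blob container under time-bucketed prefixes such as 2021/08/15/, could enumerate one bucket and scan it for a term:

$ctx = New-AzStorageContext -StorageAccountName "mylogaccount" -UseConnectedAccount
# Buckets roll over by date, so a time-range query becomes a prefix listing.
$blobs = Get-AzStorageBlob -Container "logs" -Prefix "2021/08/15/" -Context $ctx
foreach ($blob in $blobs) {
    $tmp = Join-Path $env:TEMP ([IO.Path]::GetFileName($blob.Name))
    Get-AzStorageBlobContent -Blob $blob.Name -Container "logs" -Destination $tmp -Context $ctx -Force | Out-Null
    if (Select-String -Path $tmp -Pattern "zone-down" -Quiet) { $blob.Name }
}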

2) The use of indexable key-values in full-text search deserves special mention. On one hand, Lucene has ways to populate meta keys and meta values, which it calls fields, in the indexer documents. On the other hand, each of the objects in the bucket can store not only the raw document but also the meta keys and meta values. This calls for keeping the raw data and the indexer fields together in the same object. When we search over the objects enumerated from the bucket, we no longer use the actual object and thus avoid searching through large objects. Instead, we search the metadata and select only those objects where the metadata has the relevant terms. However, we can improve this model by separating the index objects from the raw objects. The raw objects then no longer need to be touched when the metadata changes. Similarly, the indexer objects can be deleted and recreated independently of the raw objects so that we can re-index at different sites. Also, keeping the indexer documents as key-value entries reduces space and keeps them together so that a greater range of objects can be searched. This technique has been quite popular with many indexes.
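The sketch below illustrates that separation under the same assumptions as the previous one: the inverted index is built as its own small object in a separate container, so lookups never touch the raw log objects; the account, container, and term are placeholders:

$ctx = New-AzStorageContext -StorageAccountName "mylogaccount" -UseConnectedAccount
$index = @{}
foreach ($blob in (Get-AzStorageBlob -Container "logs" -Prefix "2021/08/15/" -Context $ctx)) {
    $tmp = Join-Path $env:TEMP "doc.txt"
    Get-AzStorageBlobContent -Blob $blob.Name -Container "logs" -Destination $tmp -Context $ctx -Force | Out-Null
    # Map each unique term in the document to the object that contains it.
    foreach ($term in ((Get-Content $tmp -Raw) -split '\W+' | Where-Object { $_ } | Sort-Object -Unique)) {
        if (-not $index.ContainsKey($term)) { $index[$term] = @() }
        $index[$term] += $blob.Name
    }
}
# Persist the index object alongside, but independent of, the raw objects.
$index | ConvertTo-Json -Depth 3 | Set-Content ".\inverted-index.json"
Set-AzStorageBlobContent -File ".\inverted-index.json" -Container "indexes" -Blob "2021/08/15/inverted-index.json" -Context $ctx -Force
# A lookup reads only the small index object, never the raw documents.
$lookup = Get-Content ".\inverted-index.json" -Raw | ConvertFrom-Json
$lookup.'zone'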

Inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the space for enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm on internet documents also use tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking, and sorting algorithm we like. Those algorithms are then independent of the organization of the index in the object storage, and each one can use the same index.

The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerable. SQL has been supporting user defined operators for a while now. These user defined operators help with additional computations that are not present as built-ins. In the case of relational data, these generally have been user defined functions or user defined aggregates. With the enumerable data set, the SQL is somewhat limited for LogParser. Any implementation of a query execution layer over the object storage could choose to allow or disallow user defined operators. These enable computation on say user defined data types that are not restricted by the system defined types. Such types have been useful with say spatial co-ordinates or geographical data for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user defined data types and user defined operators.