Saturday, August 14, 2021

 Azure Cognitive Search  

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure Cognitive Search service, also known as Azure Search, after the last article on Azure Stream Analytics.

 

Azure Cognitive Search differs from the Do-It-Yourself techniques in that it is a fully managed search-as-a-service, but it is primarily a full-text search. It provides a rich user experience when searching all types of content, including vision, language, and speech. It provides machine learning features to contextually rank search results. It is powered by deep learning models. It can extract and enrich content using AI-powered algorithms. Different content can be consolidated to build a single index.

 

The search service supports primarily indexing and querying. Indexing is associated with the input data path to the search service. It processes the content and converts it to JSON documents. If the content includes mixed files, searchable text can be extracted from the files. Heterogeneous content can be consolidated into a private user-defined search index. Large amounts of data stored in external repositories, including Blob storage, Cosmos DB, or other storage, can now be indexed. The index can be protected against data loss, corruption, and disasters via the same mechanisms that are used for the content. The index is also independent of the service, so if one service goes down, another can read the same index.
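As a sketch of what the index looks like in practice, the definition below is shaped like the JSON body the service's REST API accepts when creating an index; the field names ("hotelId", "description", and so on) are hypothetical placeholders, not part of any real dataset:

```python
import json

# Shaped like the REST index-definition body:
#   PUT https://{service}.search.windows.net/indexes/{index}?api-version=...
# The field names below are hypothetical placeholders.
index_definition = {
    "name": "hotels-sample",
    "fields": [
        {"name": "hotelId", "type": "Edm.String", "key": True, "searchable": False},
        {"name": "hotelName", "type": "Edm.String", "searchable": True, "sortable": True},
        # A language analyzer overriding the default standard Lucene analyzer:
        {"name": "description", "type": "Edm.String", "searchable": True,
         "analyzer": "en.lucene"},
        {"name": "tags", "type": "Collection(Edm.String)",
         "filterable": True, "facetable": True},
    ],
}

body = json.dumps(index_definition)  # the JSON payload a PUT request would carry
print(json.loads(body)["name"])  # → hotels-sample
```

A real call would send this body with an api-key header; the request itself is omitted here.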

 

We evaluate the features of the Azure Cognitive Search service next.

 

The indexing features of Azure Cognitive Search include a full-text search engine, persistent storage of search indexes, integrated AI, and APIs and tools. The data sources for indexing can be arbitrary as long as the mode of transferring data is a JSON document. Indexers automate data transfer from these data sources and turn it into searchable content in the primary storage. Connectors help the indexers with the data transfer and are specific to the data sources, such as Azure SQL Database, Cosmos DB, or Azure Blob storage. Complex data types and collections allow us to model any type of JSON data structure within a search index. The use of collections and complex types helps with one-to-many and many-to-many mappings. Analyzers can be used for linguistic analysis of the data ingested into the indexes.
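To illustrate the indexer-and-connector arrangement, here is a hedged sketch of a data-source and indexer definition pair for Blob storage; the names, container, and connection string are placeholders for illustration:

```python
# Hypothetical data-source and indexer definitions connecting Blob storage
# to a target index; names and the connection string are placeholders.
data_source = {
    "name": "blob-datasource",
    "type": "azureblob",
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "content"},
}

indexer = {
    "name": "blob-indexer",
    "dataSourceName": "blob-datasource",   # where documents come from
    "targetIndexName": "hotels-sample",    # where searchable content lands
    "schedule": {"interval": "PT2H"},      # ISO-8601 duration: run every two hours
}

print(indexer["dataSourceName"] == data_source["name"])  # → True
```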

 

The standard Lucene analyzer is used by default, but it can be overridden with a language analyzer, a custom analyzer, or one of many predefined analyzers that produce the tokens used for search. AI processing for image and text analysis can be applied to an indexing pipeline at the time of extracting text information. Some examples of built-in skills include optical character recognition and key phrase extraction. The pipeline can also be integrated with Azure Machine Learning authored skills.
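As an illustration of attaching built-in skills to the pipeline, the sketch below wires an OCR skill and a key phrase extraction skill into a skillset. The `@odata.type` strings follow the documented built-in skills, but treat the input/output wiring as illustrative rather than a verified configuration:

```python
# A sketch of a skillset chaining OCR over images with key phrase
# extraction over text; the source/target paths are illustrative.
skillset = {
    "name": "enrichment-skillset",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
            "outputs": [{"name": "text", "targetName": "ocrText"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        },
    ],
}

print(len(skillset["skills"]))  # → 2
```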

 

The indexing pipeline also generates a knowledge store. Instead of sending tokenized terms to an index, it can enrich documents and send them to a knowledge store. This store could be native to Azure in the form of Blob storage or Table storage. The purpose of the knowledge store is to support downstream analysis and processing. With the availability of a separate knowledge store, the analysis and reporting stacks can now be decoupled from the indexing pipeline.

Another feature of the indexing pipeline is cached content. The cache limits processing to just the documents that are changed by specific edits to the pipeline; most usages read from the cache.

The query pipeline also has several features to enhance the analysis from the Lucene search store. These include free-form text search, relevance tuning, and geo-search. Free-form text search is the primary use case for queries. The simple syntax includes logical operators, phrase operators, suffix operators, precedence operators, and others. Extensions to this search include proximity search, term boosting, and regular expressions. Simple scoring is a key benefit of this search: a set of scoring rules is used to model relevance for the documents, and these rule sets can be built using tags for personalized scoring based on customer search preferences.
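A query using the simple syntax and tag-based scoring might look like the following request body. The field names, the scoring profile name, and the scoring-parameter format are assumptions for illustration:

```python
# A query request body using the simple syntax: "+" requires a term,
# "-" excludes one, and "*" is a suffix wildcard. Field names, the
# "byTags" profile, and the parameter value are hypothetical.
query = {
    "search": "+beach hotel* -motel",
    "queryType": "simple",
    "searchMode": "all",          # every required clause must match
    "searchFields": "hotelName,description",
    "scoringProfile": "byTags",   # a scoring profile defined on the index
    "scoringParameters": ["tags-pool,spa"],  # personalize by customer tags
}

print(query["searchMode"])  # → all
```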

 

 

 

 

 

 

 

 

 

Friday, August 13, 2021

Azure Cognitive Search

 


This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure Cognitive Search service, also known as Azure Search, after the last article on Azure Stream Analytics.

Azure Cognitive Search differs from the Do-It-Yourself techniques in that it is a fully managed search-as-a-service, but it is primarily a full-text search. It provides a rich user experience when searching all types of content, including vision, language, and speech. It provides machine learning features to contextually rank search results. It is powered by deep learning models. It can extract and enrich content using AI-powered algorithms. Different content can be consolidated to build a single index.

The search service supports primarily indexing and querying. Indexing is associated with the input data path to the search service. It processes the content and converts it to JSON documents. If the content includes mixed files, searchable text can be extracted from the files. Heterogeneous content can be consolidated into a private user-defined search index. Large amounts of data stored in external repositories, including Blob storage, Cosmos DB, or other storage, can now be indexed. The index can be protected against data loss, corruption, and disasters via the same mechanisms that are used for the content. The index is also independent of the service, so if one service goes down, another can read the same index.

The querying service supports the search experience from a variety of clients and occurs on the outbound path of the search service. The index and the querying service are separate. In this article, we will compare this service with other search services.

Microsoft Search differs from the Azure Search service in that it searches SharePoint. It enables users who are authenticated by Microsoft 365 to query over content in SharePoint. The content flows into the libraries via connectors. The Cognitive Search service, on the other hand, searches an index that the user defines, and the user specifies what content must be indexed. The indexing pipeline can be enhanced with machine learning and text analysis. This service is also positioned as a plugin for a variety of applications.

The Bing Search API maintains an index for the internet using web crawlers. There is an option for custom search, where the same technology can be scoped to individual web sites for different content types. Cognitive Search is geared toward content from Azure data sources. It can index any JSON document that conforms to its service, across clients and data sources.

Database search is an offering from database platforms that provide a built-in search capability for the content stored in their databases. This is probably where the most overlap exists in the data that can be indexed. Both database platforms and Cognitive Search can index this data very well, but the latter provides advanced features for deep learning. If search and storage must be combined, SQL Server and Cosmos DB have out-of-box features to support this use case. Many solutions use both, but only the Cognitive Search service can perform advanced text and natural language processing, with its features for autocorrection of misspelled words, synonyms, suggestions, scoring controls, facets, and custom tokenization. Azure Cognitive Search persists data only in the form of an inverted index, and it provides no solution for data storage. A use case where data storage might need to be independent from the search service is when the database is targeted toward online transaction processing and the cloud search service is externalized to adjust elastically to query volume.

 

If dedicated search functionality is desired, on-premises solutions and the cloud service can be compared. The cloud service provides a turn-key, one-stop-shop solution for search. The on-premises solutions provide greater flexibility and control over indexing, querying, and results-filtering syntax. There might be specialized solutions that span the cloud, but they are meant for advanced users and still might not match the experience from the Azure Search service.

These are some of the advantages of using the Azure Search Service.

 

 

 

 

 

 

 

Thursday, August 12, 2021

 

Azure Cognitive Search

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure Cognitive Search service after the last article on Azure Stream Analytics.

Azure Cognitive Search differs from the Do-It-Yourself techniques in that it is a fully managed search-as-a-service, but it is primarily a full-text search. It provides a rich user experience when searching all types of content, including vision, language, and speech. It provides machine learning features to contextually rank search results. It is powered by deep learning models. It can extract and enrich content using AI-powered algorithms. Different content can be consolidated to build a single index.

The full-text search query is based on Lucene functionality that has been customized with extensions and lockdowns to enable core scenarios. There are four stages to query execution: query parsing, lexical analysis, document matching, and scoring. When the query text comes in, the query parser must separate query terms from query operators and create the query tree to be sent to the search engine. The separated terms are sent to the analyzers, which must perform stemming, canonicalization, and removals to efficiently utilize the terms. The analyzed terms are sent back to the parser. The terms proceed to the search engine, which must store and organize searchable terms extracted from indexed documents. This index lives separately from the documents, and it is easy to regenerate it offline from query execution. Finally, the search engine scores and retrieves the contents of the inverted index to display the top matches. A sample program to illustrate these stages is included here.
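The four stages can be illustrated with a toy program; this is a minimal sketch of the pipeline's shape (a crude analyzer, an inverted index, and tf-idf scoring), not the engine's actual implementation:

```python
import math
import re
from collections import Counter, defaultdict

DOCS = {
    1: "Budget hotel near the beach with free parking",
    2: "Luxury beach resort and spa",
    3: "Downtown motel with parking",
}

def analyze(text):
    """Lexical analysis: lowercase, tokenize, and crudely strip plural 's'."""
    terms = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") else t for t in terms]

# Indexing side: an inverted index mapping each term to its posting list.
index = defaultdict(set)
for doc_id, text in DOCS.items():
    for term in analyze(text):
        index[term].add(doc_id)

def search(query):
    """Parse and analyze the query, match documents, then score by tf-idf."""
    terms = analyze(query)                                  # parsing + analysis
    matches = set().union(*(index.get(t, set()) for t in terms))
    scores = Counter()
    for doc_id in matches:                                  # matching + scoring
        doc_terms = analyze(DOCS[doc_id])
        for t in terms:
            if t in index:
                tf = doc_terms.count(t)
                idf = math.log(len(DOCS) / len(index[t]))
                scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]

print(search("beach hotels"))  # → [1, 2]: doc 1 matches both analyzed terms
```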

The REST API for Azure Cognitive Search takes a payload with properties such as “search”, “searchFields”, “searchMode”, “filter”, “orderby”, and “queryType”. The query is broken down into three sub-queries: a term query, a phrase query, and a prefix query. The search terms can include wildcards for matching several terms, say as a prefix. The search engine scans the fields specified in the searchFields property for documents that match one or more of the search terms. The resulting sets are ordered, and it is easy to specify geography data type-based queries that sort the results by proximity.
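The sketch below assembles such a payload with a term, a quoted phrase, and a prefix sub-query, plus a geo-distance ordering; the field names ("description", "rating", "location") are hypothetical:

```python
# A request body combining a term ("pool"), a quoted phrase, and a prefix
# sub-query ("bea*"), filtered and ordered by geo-distance. Field names
# are hypothetical placeholders.
payload = {
    "search": 'pool "free breakfast" bea*',
    "queryType": "simple",
    "searchMode": "any",
    "searchFields": "description,tags",
    "filter": "rating ge 4",
    "orderby": "geo.distance(location, geography'POINT(-122.33 47.60)')",
}

print(sorted(payload))  # the property names the POST body would carry
```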

The search service supports primarily indexing and querying. Indexing is associated with the input data path to the search service. It processes the content and converts it to JSON documents. If the content includes mixed files, searchable text can be extracted from the files. Heterogeneous content can be consolidated into a private user-defined search index. Large amounts of data stored in external repositories, including Blob storage, Cosmos DB, or other storage, can now be indexed. The index can be protected against data loss, corruption, and disasters via the same mechanisms that are used for the content. The index is also independent of the service, so if one service goes down, another can read the same index.

The querying service supports the search experience from a variety of clients and occurs on the outbound path of the search service. The index and the querying service are separate. In the next article, we will compare this service with other search services.

 

 

 

 

 

 

Wednesday, August 11, 2021

Azure Cognitive Search

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure Cognitive Search service after the last article on Azure Stream Analytics. We had also discussed the Azure Kubernetes Service, which provides a Kubernetes cluster in a serverless environment hosted in the cloud. The workload can be run in the cloud, at the edge, or as a hybrid, with support for running .NET applications on Windows Server containers, Java applications on Linux containers, or microservice applications in different languages and environments. Essentially, this provisions a datacenter on top of Azure Stack HCI. The hyper-converged infrastructure is an Azure service that provides security, performance, and feature updates and allows the datacenter to be extended to the cloud. When AKS is deployed on a Windows Server 2019 datacenter, the cluster is local to the Windows Server, but when it is deployed to Azure Stack HCI, it can scale seamlessly because the HCI is hosted on its own set of clusters. We also reviewed the Azure Stream Analytics service, which provides a scalable approach to analyzing streams with its notion of jobs and clusters. Let us review the Azure Cognitive Search service for its search capabilities.

Azure Cognitive Search differs from the Do-It-Yourself techniques in that it is a fully managed search-as-a-service, but it is primarily a full-text search. It provides a rich user experience when searching all types of content, including vision, language, and speech. It provides machine learning features to contextually rank search results. It is powered by deep learning models. It can extract and enrich content using AI-powered algorithms. Different content can be consolidated to build a single index.

The full-text search query is based on Lucene functionality that has been customized with extensions and lockdowns to enable core scenarios. There are four stages to query execution: query parsing, lexical analysis, document matching, and scoring. When the query text comes in, the query parser must separate query terms from query operators and create the query tree to be sent to the search engine. The separated terms are sent to the analyzers, which must perform stemming, canonicalization, and removals to efficiently utilize the terms. The analyzed terms are sent back to the parser. The terms proceed to the search engine, which must store and organize searchable terms extracted from indexed documents. This index lives separately from the documents, and it is easy to regenerate it offline from query execution. Finally, the search engine scores and retrieves the contents of the inverted index to display the top matches.

The REST API for Azure Cognitive Search takes a payload with properties such as “search”, “searchFields”, “searchMode”, “filter”, “orderby”, and “queryType”. The query is broken down into three sub-queries: a term query, a phrase query, and a prefix query. The search terms can include wildcards for matching several terms, say as a prefix. The search engine scans the fields specified in the searchFields property for documents that match one or more of the search terms. The resulting sets are ordered, and it is easy to specify geography data type-based queries that sort the results by proximity.

 

 

 

Tuesday, August 10, 2021

 

Introduction:  

This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure Cognitive Search service after the last article on Azure Stream Analytics. We had also discussed the Azure Kubernetes Service, which provides a Kubernetes cluster in a serverless environment hosted in the cloud. The workload can be run in the cloud, at the edge, or as a hybrid, with support for running .NET applications on Windows Server containers, Java applications on Linux containers, or microservice applications in different languages and environments. Essentially, this provisions a datacenter on top of Azure Stack HCI. The hyper-converged infrastructure is an Azure service that provides security, performance, and feature updates and allows the datacenter to be extended to the cloud. When AKS is deployed on a Windows Server 2019 datacenter, the cluster is local to the Windows Server, but when it is deployed to Azure Stack HCI, it can scale seamlessly because the HCI is hosted on its own set of clusters. We also reviewed the Azure Stream Analytics service, which provides a scalable approach to analyzing streams with its notion of jobs and clusters.

Jobs and clusters form the two main components of Stream Analytics. When the job is created, the deployment can be validated. The job itself is represented by an ARM template, which is a JSON notation that defines the infrastructure and configuration for the project. The template uses declarative syntax, so there is no need to write commands to create the deployment. The template takes parameters such as the location, the Stream Analytics job name, and the number of streaming units, which are then applied to the resource when the job is created.
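A pared-down sketch of such a template follows, written here as a Python dict for illustration; in practice this is a JSON file handed to a deployment command. The resource type string follows the Stream Analytics provider, but treat the property details as assumptions:

```python
# A minimal ARM-template shape for a Stream Analytics job, as a dict.
# Property details beyond the resource type are illustrative.
template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "jobName": {"type": "string"},
        "location": {"type": "string", "defaultValue": "[resourceGroup().location]"},
        "streamingUnits": {"type": "int", "defaultValue": 3},
    },
    "resources": [{
        "type": "Microsoft.StreamAnalytics/streamingjobs",
        "apiVersion": "2019-06-01",
        "name": "[parameters('jobName')]",
        "location": "[parameters('location')]",
        "sku": {"name": "Standard"},
    }],
}

print(template["resources"][0]["type"])  # → Microsoft.StreamAnalytics/streamingjobs
```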

Since the infrastructure is no longer a concern at this point, we can now review a few Do-It-Yourself approaches to implementing a search service before we start with the Cognitive Search service. These include:

1) Implementing a query layer that can directly search over an object storage. The store is limitless and has no maintenance. Log stores are typically time-series databases. A time-series database makes progressive buckets as each one fills with events, and this can be done easily with object storage too. The namespace-bucket-object hierarchy is well suited for time-series data. There is no limit to the number of objects within a bucket, and we can roll over buckets in the same hot-warm-cold manner that time-series databases do. Moreover, with the data available in the object storage, it is easily accessible to all users for reading over HTTP. The only caveat is that some production support requests may be made to accommodate a separate object storage for the persistence of objects in the cache from the object storage for the persistence of logs. This is quite reasonable and may be accommodated on-premises or in the cloud depending on the worth of the data and the cost incurred. The log stores can be periodically trimmed as well.

2) The use of indexable key values in full-text search deserves special mention. On one hand, Lucene has ways to populate meta keys and meta values, which it calls fields, in the indexer documents. On the other hand, each of the objects in the bucket can store not only the raw document but also the meta keys and meta values. This calls for keeping the raw data and the indexer fields together in the same object. When we search over the objects enumerated from the bucket, we no longer use the actual object and thus avoid searching through large objects. Instead, we search the metadata, and we load only those objects where the metadata has the relevant terms. However, we make an improvement to this model by separating the index objects from the raw objects. The raw objects are no longer required to be touched when the metadata changes. Similarly, the indexer objects can be deleted and recreated independently of the raw objects, so that we can re-index at different sites. Also, keeping the indexer documents as key-value entries reduces space and keeps them together so that a greater range of objects can be searched. This technique has been quite popular with many indexes.
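The separation described in item 2 can be sketched as a toy model, with small index objects stored alongside (but apart from) the large raw objects in the same bucket, so a search touches only the metadata:

```python
# A toy model: raw objects and their companion index (metadata) objects
# live as separate entries in the same bucket; searching scans only the
# small index objects and never the large raw blobs.
bucket = {}

def put(key, raw_bytes, meta_terms):
    bucket[key] = raw_bytes                     # large raw object
    bucket["index/" + key] = set(meta_terms)    # small companion index object

def find(term):
    """Scan only index objects; return keys of raw objects that match."""
    return sorted(
        k[len("index/"):]
        for k, v in bucket.items()
        if k.startswith("index/") and term in v
    )

put("logs/2021-08-09.log", b"...gigabytes of log lines...", {"error", "timeout"})
put("logs/2021-08-10.log", b"...gigabytes of log lines...", {"info"})
print(find("error"))  # → ['logs/2021-08-09.log']
```

Note that re-indexing here means rewriting only the `index/` entries; the raw objects are never touched.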

The inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the space for enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm on internet documents also use tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking, and sorting algorithm we like. Those algorithms are now independent of the organization of the index in the object storage, and each one can use the same index.

The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerables. SQL has been supporting user-defined operators for a while now. These user-defined operators help with additional computations that are not present as built-ins. In the case of relational data, these have generally been user-defined functions or user-defined aggregates. With an enumerable data set, the SQL supported by LogParser is somewhat limited. Any implementation of a query execution layer over object storage could choose to allow or disallow user-defined operators. These enable computation on, say, user-defined data types that are not restricted to the system-defined types. Such types have been useful with, say, spatial coordinates or geographical data for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user-defined data types and user-defined operators.
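The vector-addition example can be sketched with a user-defined type carrying a user-defined operator, applied as an aggregate over an enumerable of records:

```python
# A user-defined type with a user-defined "+" operator: the kind of
# computation that SQL built-ins don't cover for custom data types.
from dataclasses import dataclass

@dataclass
class Vector:
    x: float
    y: float

    def __add__(self, other):          # the user-defined operator
        return Vector(self.x + other.x, self.y + other.y)

# Aggregating displacement vectors from an enumerable of records:
moves = [Vector(1.0, 0.0), Vector(0.0, 2.0), Vector(-1.0, 1.0)]
total = sum(moves, Vector(0.0, 0.0))
print(total)  # → Vector(x=0.0, y=3.0)
```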

 

Monday, August 9, 2021

A search service over blobs

 

Cloud attracts unstructured data in the form of blobs in object storage. Since this is inevitable, and unstructured content escapes the usual organization associated with relational data, some form of search service could be helpful to the end-user who must search blobs from different data sources. This document focuses on the technical considerations for making this leap and leaves the possibility of making a business case as out of scope in this discussion.

Desktop search, enterprise search, and even internet search speak to the versatility and popularity of search tools. These tools are popular for many reasons, but they are all targeted toward the end-user, who would otherwise have to navigate through a large collection without them. The query layer for a search service sometimes requires its own pipeline, albeit from the same dataset. Relational data management provided a foundation for SQL queries to be written.

Let us consider object storage as a destination for logs to improve production support drastically with the ability to search the store. A query layer can directly re-use the object storage as its data source. The store is limitless and has no maintenance. Log stores are typically time-series databases. A time-series database makes progressive buckets as each one fills with events, and this can be done easily with object storage too. The namespace-bucket-object hierarchy is well suited for time-series data. There is no limit to the number of objects within a bucket, and we can roll over buckets in the same hot-warm-cold manner that time-series databases do. Moreover, with the data available in the object storage, it is easily accessible to all users for reading over HTTP. The only caveat is that some production support requests may be made to accommodate a separate object storage for the persistence of objects in the cache from the object storage for the persistence of logs. This is quite reasonable and may be accommodated on-premises or in the cloud depending on the worth of the data and the cost incurred. The log stores can be periodically trimmed as well. In addition, the entire querying stack for reading these entries can be built on copies or selections of buckets and objects. More about saving logs and their indexes in object storage is available at: https://1drv.ms/w/s!Ashlm-Nw-wnWt3eeNpsNk1f3BZVM
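The bucket-rollover idea can be sketched as a key-naming scheme (the namespace and key format here are hypothetical):

```python
# A toy of the namespace-bucket-object hierarchy for time-series logs:
# events roll into a bucket per day, mirroring a time-series database's
# progressive buckets.
from datetime import datetime, timezone

def object_key(namespace, event_time, seq):
    """namespace / daily bucket / object — buckets roll over by date."""
    bucket = event_time.strftime("%Y-%m-%d")
    return f"{namespace}/{bucket}/{seq:08d}.log"

t = datetime(2021, 8, 9, 14, 30, tzinfo=timezone.utc)
print(object_key("prod-logs", t, 42))  # → prod-logs/2021-08-09/00000042.log
```

Hot, warm, and cold tiers then map naturally onto recent, older, and archived date buckets.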

Sunday, August 8, 2021

 Introduction:  


This article is a continuation of the series of articles starting with the description of the SignalR service. In this article, we begin to discuss the Azure services for containers, including the Azure Kubernetes Service and the Azure Container Instance, after the last article on Azure Stream Analytics. AKS provides a Kubernetes cluster in a serverless environment hosted in the cloud, which can elastically scale your Kubernetes cluster to meet the challenges of your workload. The workload can be run in the cloud, at the edge, or as a hybrid, with support for running .NET applications on Windows Server containers, Java applications on Linux containers, or microservice applications in different languages and environments. Essentially, this provisions a datacenter on top of Azure Stack HCI. The hyper-converged infrastructure is an Azure service that provides security, performance, and feature updates and allows the datacenter to be extended to the cloud. When AKS is deployed on a Windows Server 2019 datacenter, the cluster is local to the Windows Server, but when it is deployed to Azure Stack HCI, it can scale seamlessly because the HCI is hosted on its own set of clusters.

Well-known container orchestration frameworks such as Kubernetes rose in popularity meteorically with their widespread acceptance on Linux hosts. The challenge with running Kubernetes is that the Kubernetes control plane must be hosted on Linux-flavor virtual machines. While it can support workload servers on both Linux and Windows, it is not easy to get much further by trying to run the control plane on Windows Server. Minikube tried to reduce this gap, but it could not avoid this requirement either. Once we have VirtualBox on a Windows server, an Ubuntu-flavor image can be used to launch a host, and then deploying Kubernetes is easy. Azure Kubernetes Service also requires Linux-flavor servers for launching the Kubernetes control plane, and it can run worker nodes on either operating system.

Kubernetes uses pods to run an instance of the microservice or user application. The pods, deployments, stateful sets, and daemon sets are the worker-node components. The pod hosts, deployment controllers, and the Kubernetes scheduler ensure that the replicas are distributed efficiently. When the pods have images that correspond to Windows Server OS flavors, they can run .NET applications. Mixed operating systems can also be utilized for hosting an application by leveraging Kubernetes node selectors, taints, and tolerations.

The issues faced in hosting the Kubernetes control plane on Windows with Minikube, MicroK8s, and other Kubernetes runtimes include the following:

1)      Use of an external insecure Docker registry

a.       The insecure term is used only for HTTP versus HTTPS. It was required because Docker and Minikube on Windows work together by taking an --insecure-registry as a start-up parameter. This is not the case on Linux, where we can do without this parameter and have Minikube host its own Docker registry. On Windows, we install the Docker Toolbox and Minikube separately. This gives us two virtual machines, named ‘default’ for Docker and ‘minikube’ for the Kubernetes cluster. Both are Linux environments.

2)      Sufficient disk space and NAT traversal

a.       The insecure registry requires sufficient disk space. The default size is about 18 GB, and this is not sufficient when multiple images are required. The typical size for commercial workloads is at least 30 GB. The Docker Toolbox is preferred over other software packages for installing Docker on Windows. It is preferable to have Minikube start with at least 2 CPUs and 8 GB of memory. Small Minikube deployments are expected to have frequent restarts due to low resources. The specification of CPU, memory, and disk for the whole host does not necessarily lower the size of the various clusters used with workloads. A minimal-dev-values file could help specify the lower end of the number of replicas and containers and their consumption.

3)      Minikube’s support is usually host-facing as opposed to the external world, which is also why it requires an insecure registry.

4)      Security of the overall Minikube installation, as opposed to the deployment of the Kubernetes stack on Linux nodes, is a concern because the instance is already completely owned, as opposed to a managed and locked-down instance.

Cloud services do not encounter these issues because they have the luxury of elastic and even hyper-converged resources, and they have absolutely no need to host the Kubernetes control plane on Windows.