Tuesday, August 10, 2021

 

Introduction:  

This article continues the series that began with a description of the SignalR service. Having covered Azure Stream Analytics in the last article, we now begin a discussion of the Azure Cognitive Search service. We had also discussed Azure Kubernetes Service (AKS), which provides a Kubernetes cluster in a serverless environment hosted in the cloud. The workload can run in the cloud, at the edge, or as a hybrid, with support for .NET applications on Windows Server containers, Java applications on Linux containers, and microservice applications in different languages and environments. Essentially, this provisions a datacenter on top of Azure Stack HCI. The hyper-converged infrastructure is an Azure service that provides security, performance, and feature updates, and allows the datacenter to be extended to the cloud. When AKS is deployed on a Windows Server 2019 datacenter, the cluster is local to the Windows Server, but when it is deployed to Azure Stack HCI it can scale seamlessly because the HCI is hosted on its own set of clusters. We also reviewed the Azure Stream Analytics service, which provides a scalable approach to analyzing streams with its notion of jobs and clusters.

Jobs and clusters form the two main components of Stream Analytics. When a job is created, the deployment can be validated. The job itself is represented by an ARM template, a JSON document that defines the infrastructure and configuration for the project. The template uses declarative syntax, so there is no need to write commands to create the deployment. The template takes parameters such as the location, the Stream Analytics job name, and the number of streaming units; these are applied to the resource, and the job is created.
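As a rough sketch of the declarative style described above, a minimal ARM template for a Stream Analytics job might look like the fragment below. The parameter names, API version, and placeholder query are illustrative assumptions, not a deployable production template:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "jobName": { "type": "string" },
    "location": { "type": "string", "defaultValue": "[resourceGroup().location]" },
    "streamingUnits": { "type": "int", "defaultValue": 1 }
  },
  "resources": [
    {
      "type": "Microsoft.StreamAnalytics/streamingjobs",
      "apiVersion": "2017-04-01-preview",
      "name": "[parameters('jobName')]",
      "location": "[parameters('location')]",
      "properties": {
        "sku": { "name": "Standard" },
        "transformation": {
          "name": "Transformation",
          "properties": {
            "streamingUnits": "[parameters('streamingUnits')]",
            "query": "SELECT * INTO [Output] FROM [Input]"
          }
        }
      }
    }
  ]
}
```

Notice that the template only states the desired end state; the deployment engine works out the commands to get there.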

Since the infrastructure is no longer a concern at this point, we can now review a few do-it-yourself approaches to implementing a search service before we start with the Cognitive Search service. These include:

1) Implementing a query layer that can search directly over an object storage. The store is limitless and requires no maintenance. Log stores are typically time-series databases. A time-series database makes progressive buckets as each one fills with events, and this can be done easily with object storage too. The namespace-bucket-object hierarchy is well suited for time-series data. There is no limit to the number of objects within a bucket, and we can roll over buckets in the same hot-warm-cold manner that time-series databases do. Moreover, with the data available in the object storage, it is easily accessible to all users for reading over HTTP. The only caveat is that some production support requests may ask for the object storage that persists cache objects to be kept separate from the object storage that persists logs. This is quite reasonable and may be accommodated on-premises or in the cloud, depending on the worth of the data and the cost incurred. The log stores can be periodically trimmed as well.
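The bucket-rollover idea above can be sketched in a few lines. This is a minimal illustration, assuming hourly buckets and a hypothetical "logs" namespace; a lifecycle policy would separately demote old buckets from hot to warm to cold storage:

```python
from datetime import datetime, timezone

def bucket_for(event_time: datetime, namespace: str = "logs") -> str:
    """Roll events into hourly buckets under the namespace-bucket hierarchy."""
    return f"{namespace}/{event_time:%Y/%m/%d/%H}"

def object_key(event_time: datetime, event_id: str) -> str:
    # Objects within the hourly bucket are keyed by minute-second plus an id,
    # so enumeration within a bucket stays in time order.
    return f"{bucket_for(event_time)}/{event_time:%M%S}-{event_id}.json"

t = datetime(2021, 8, 10, 14, 35, 7, tzinfo=timezone.utc)
print(bucket_for(t))         # logs/2021/08/10/14
print(object_key(t, "e42"))  # logs/2021/08/10/14/3507-e42.json
```

Because keys sort lexicographically by time, trimming old data is just deleting the oldest prefixes.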

2) The use of indexable key-values in full-text search deserves special mention. On one hand, Lucene has ways to populate meta keys and meta values, which it calls fields, in the indexer documents. On the other hand, each of the objects in the bucket can store not only the raw document but also the meta keys and meta values. This calls for keeping the raw data and the indexer fields together in the same object. When we search over the objects enumerated from the bucket, we no longer read the actual object and thus avoid searching through large objects. Instead, we search the metadata and we list only those objects where the metadata has the relevant terms. However, we can improve this model by separating the index objects from the raw objects. The raw objects then no longer need to be touched when the metadata changes. Similarly, the indexer objects can be deleted and recreated independently of the raw objects, so that we can re-index at different sites. Also, keeping the indexer documents as key-value entries reduces space and keeps them together, so that a greater range of objects can be searched. This technique has been quite popular with many indexes.
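The separation of index objects from raw objects can be illustrated with a small sketch. The object store is mocked here as an in-memory dict, and the `raw/` and `index/` prefixes are assumptions standing in for separate buckets:

```python
# Mock object store: key -> payload. In practice these would be
# two buckets (or prefixes) in an object storage service.
store = {}

def put_document(name: str, raw: bytes, metadata: dict) -> None:
    store[f"raw/{name}"] = raw         # large payload, rarely read
    store[f"index/{name}"] = metadata  # small key-value index entries

def search(term: str) -> list:
    """Scan only the index/ objects; raw payloads are never touched."""
    hits = []
    for key, meta in store.items():
        if key.startswith("index/") and any(term in str(v) for v in meta.values()):
            hits.append(key.removeprefix("index/"))
    return hits

put_document("a.log", b"...large body...", {"service": "search", "level": "error"})
put_document("b.log", b"...large body...", {"service": "billing", "level": "info"})
print(search("error"))  # ['a.log']
```

Re-indexing is then just deleting and rebuilding the `index/` entries; the `raw/` objects stay untouched.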

The inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the space for enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm over internet documents also use tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking, and sorting algorithm we like. Those algorithms are then independent of the organization of the index in the object storage, and each one can use the same index.
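The generic organization of an inverted index is simple enough to show in full. This is a minimal sketch with whitespace tokenization; ranking or grouping would be layered on the postings lists independently, as the paragraph above notes:

```python
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Inverted index: token -> set of document names (postings list)."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in text.lower().split():
            index[token].add(name)
    return index

def lookup(index: dict, *terms: str) -> set:
    # AND-query: intersect the postings lists of all query terms
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {"d1": "azure cognitive search", "d2": "azure stream analytics"}
idx = build_index(docs)
print(sorted(lookup(idx, "azure", "search")))  # ['d1']
```

Each postings list could itself be stored as one small object in the bucket, which is what keeps the index organization independent of the algorithms that consume it.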

The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerables. SQL has supported user-defined operators for a while now. These user-defined operators help with additional computations that are not available as built-ins. In the case of relational data, these have generally been user-defined functions or user-defined aggregates. With an enumerable data set, the SQL supported by LogParser is somewhat limited. Any implementation of a query execution layer over the object storage could choose to allow or disallow user-defined operators. These enable computation on, say, user-defined data types that are not restricted to the system-defined types. Such types have been useful with, say, spatial coordinates or geographical data, for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user-defined data types and user-defined operators.
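The vector-addition example can be sketched as a hypothetical user-defined type plus a user-defined aggregate, of the kind a query layer might register alongside its SQL built-ins. The `Vector` type and `sum_vectors` aggregate below are illustrative assumptions, not part of any particular engine:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vector:
    """A user-defined data type with a user-defined operator (+)."""
    x: float
    y: float

    def __add__(self, other: "Vector") -> "Vector":
        return Vector(self.x + other.x, self.y + other.y)

def sum_vectors(rows) -> Vector:
    """A user-defined aggregate: fold vector addition over an enumerable,
    the way a SQL engine folds SUM over a column."""
    total = Vector(0.0, 0.0)
    for v in rows:
        total = total + v
    return total

print(sum_vectors([Vector(1, 2), Vector(3, 4)]))  # Vector(x=4.0, y=6.0)
```

The abstraction pays off because the query text can then say `SUM(position)` instead of spelling out component-wise arithmetic.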

 
