Introduction:
This article continues the series that began with a description
of the SignalR service. Having covered Azure Stream Analytics in the last
article, we now begin discussing the Azure Cognitive Search service. We had also discussed
the Azure Kubernetes Service, which provides a Kubernetes cluster in a serverless
environment hosted in the cloud. The workload can run in the cloud, at the
edge, or as a hybrid, with support for running .NET applications in Windows
Server containers, Java applications in Linux containers, and microservice
applications in different languages and environments. Essentially, this
provisions a datacenter on top of Azure Stack HCI. The hyper-converged
infrastructure is an Azure service that provides security, performance, and
feature updates and allows the datacenter to be extended to the cloud. When
AKS is deployed on a Windows Server 2019 Datacenter host, the cluster is local
to that server, but when it is deployed to Azure Stack HCI it can
scale seamlessly because the HCI is hosted on its own set of clusters. We also
reviewed the Azure Stream Analytics service, which provides a scalable approach to
analyzing streams with its notion of jobs and clusters.
Jobs and clusters form the two main
components of Stream Analytics. When a job is created, the deployment can
be validated. The job itself is represented by an ARM template, a JSON
notation that defines the infrastructure and configuration for the project. The
template uses declarative syntax, so there is no need to write commands to
create the deployment. The template takes parameters such as the location,
the Stream Analytics job name, and the number of streaming units; these are applied
to the resource and the job is created.
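As a rough illustration, an ARM template for such a job might look like the following sketch; the parameter names, query, and API version here are illustrative, and the input/output aliases are placeholders:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "location": { "type": "string", "defaultValue": "[resourceGroup().location]" },
    "streamAnalyticsJobName": { "type": "string" },
    "numberOfStreamingUnits": { "type": "int", "defaultValue": 1 }
  },
  "resources": [
    {
      "type": "Microsoft.StreamAnalytics/streamingjobs",
      "apiVersion": "2017-04-01-preview",
      "name": "[parameters('streamAnalyticsJobName')]",
      "location": "[parameters('location')]",
      "properties": {
        "sku": { "name": "Standard" },
        "transformation": {
          "name": "Transformation",
          "properties": {
            "streamingUnits": "[parameters('numberOfStreamingUnits')]",
            "query": "SELECT * INTO [YourOutputAlias] FROM [YourInputAlias]"
          }
        }
      }
    }
  ]
}
```

Deploying this template with different parameter values creates jobs in different regions or with different capacities without changing the template itself.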
Since the infrastructure is no longer a concern at this
point, we can now review a few do-it-yourself approaches to implementing a
search service before we start with the Cognitive Search service. These
include:
1) Implementing a query layer that can search directly
over object storage. The store is virtually limitless and requires no maintenance. Log
stores are typically time-series databases. A time-series database creates
progressive buckets as each one fills with events, and this can be done easily
with object storage too. The namespace-bucket-object hierarchy is well suited
for time-series data. There is no limit to the number of objects within a
bucket, and we can roll over buckets in the same hot-warm-cold manner that time-series
databases do. Moreover, with the data available in object storage,
it is easily accessible to all users for reading over HTTP. The only caveat is that some production
support requests may ask to separate the object storage used for the
persistence of cached objects from the object storage used for the persistence
of logs. This is quite reasonable and may be accommodated on-premises or in the
cloud, depending on the worth of the data and the cost incurred. The log stores
can be periodically trimmed as well.
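The bucket rollover described above can be sketched as follows; the naming scheme and tier cutoffs are assumptions for illustration, not a prescribed layout:

```python
from datetime import datetime, timezone

# Hypothetical sketch: derive a time-series bucket name and object key for an
# event, so that buckets roll over per hour much like a time-series database.
def object_key_for_event(namespace: str, event_time: datetime, event_id: str):
    bucket = f"{namespace}-{event_time:%Y-%m-%d-%H}"   # hourly rollover
    key = f"{event_time:%M%S}-{event_id}.json"
    return bucket, key

# Buckets older than a cutoff are demoted hot -> warm -> cold, and the coldest
# can eventually be trimmed; the 24-hour / 7-day boundaries are illustrative.
def tier_for_bucket(bucket_time: datetime, now: datetime) -> str:
    age_hours = (now - bucket_time).total_seconds() / 3600
    if age_hours < 24:
        return "hot"
    if age_hours < 24 * 7:
        return "warm"
    return "cold"
```

Because the bucket name encodes the time window, a reader can enumerate only the buckets relevant to a query's time range instead of scanning the whole namespace.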
2) The use of indexable key-values in full-text search
deserves special mention. On one hand, Lucene has ways to populate meta keys
and meta values, which it calls fields, in the indexer documents. On the other
hand, each of the objects in a bucket can store not only the raw document but
also the meta keys and meta values. This calls for keeping the raw data and the
indexer fields together in the same object. When we search over the objects
enumerated from the bucket, we no longer read the actual objects and thus avoid
searching through large objects. Instead, we search the metadata and list
only those objects where the metadata has the relevant terms. However, we can make
an improvement to this model by separating the index objects from the raw
objects. The raw objects then no longer need to be touched when the metadata
changes. Similarly, the indexer objects can be deleted and recreated independently
of the raw objects, so that we can re-index at different sites. Also, keeping the
indexer documents as key-value entries reduces space and keeps them together, so
that a greater range of objects can be searched. This technique has been quite
popular with many indexes.
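The separation of index objects from raw objects can be sketched as below; the entry shape and field names are illustrative assumptions:

```python
# Hypothetical sketch: indexer documents are kept as small (raw_object_key,
# metadata_dict) entries, separate from the raw objects they describe. A search
# scans only these lightweight entries and returns the keys of matching raw
# objects, which are fetched only if the caller actually needs them.
def search_index(index_entries, term):
    """index_entries: iterable of (raw_object_key, metadata_dict) pairs."""
    matches = []
    for raw_key, fields in index_entries:
        # Match against metadata values only; the raw object is never read.
        if any(term in str(value) for value in fields.values()):
            matches.append(raw_key)
    return matches

index = [
    ("logs/0001.json", {"component": "auth", "level": "error"}),
    ("logs/0002.json", {"component": "billing", "level": "info"}),
]
```

Since the index entries live in their own objects, they can be deleted and rebuilt without touching the raw log objects at all.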
Inverted indexes over object storage may not be as
performant as query execution over relational tables in a SQL database, but
they fill the space for enabling search over the storage. Spotlight on macOS
and Google's PageRank algorithm over internet documents also use tokens and
lookups. Moreover, by recognizing the
generic organization of inverted indexes, we can apply any grouping, ranking,
and sorting algorithm we like. Those algorithms are then independent of the
organization of the index in the object storage, and each one can use the same
index.
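A minimal sketch of this separation between the index and the ranking algorithm, using whitespace tokenization purely for illustration:

```python
from collections import defaultdict

# Hypothetical sketch: a generic inverted index mapping token -> set of object
# keys. Ranking or sorting is supplied afterwards as a pluggable function,
# independent of how the index itself is laid out in object storage.
def build_inverted_index(documents):
    """documents: dict of object_key -> text."""
    index = defaultdict(set)
    for key, text in documents.items():
        for token in text.lower().split():
            index[token].add(key)
    return index

def lookup(index, query, rank=sorted):
    """Intersect posting lists for all query tokens, then rank the hits."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return rank(hits)

docs = {"o1": "azure stream analytics", "o2": "azure object storage"}
```

Any other ranking function (by recency, by score, by path) can be passed as `rank` without changing the index at all.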
The language for the query has traditionally been SQL.
Tools like LogParser allow SQL queries to be executed over enumerables. SQL has
supported user-defined operators for a while now. These user-defined
operators help with additional computations that are not available as built-ins.
In the case of relational data, these have generally been user-defined
functions or user-defined aggregates. With an enumerable data set, the SQL
supported by LogParser is somewhat limited. Any implementation of a query execution layer
over object storage could choose to allow or disallow user-defined
operators. These enable computation on, say, user-defined data types that are not
restricted to the system-defined types. Such types have been useful with, say,
spatial coordinates or geographical data, for easier abstraction and simpler
expression of computational logic. For example, vector addition can be done
with user-defined data types and user-defined operators.
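In the spirit of such a user-defined aggregate, the vector-addition example can be sketched as follows; the record shape is a hypothetical (x, y) pair, standing in for a user-defined data type:

```python
# Hypothetical sketch: a user-defined aggregate over an enumerable of records,
# analogous to a SQL user-defined aggregate. It sums coordinate pairs
# component-wise, something a system-defined scalar SUM cannot express directly.
def vector_sum(rows):
    """rows: iterable of (x, y) tuples; returns their component-wise sum."""
    total_x, total_y = 0, 0
    for x, y in rows:
        total_x += x
        total_y += y
    return (total_x, total_y)

points = [(1, 2), (3, 4), (5, -1)]
```

A query layer that admits such operators lets users aggregate over their own types exactly as they would with built-in aggregates over scalars.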