Cluster computing

Azure Cognitive Search

This article continues the series that began with a description of the SignalR service. The previous article covered Azure Stream Analytics; here we begin discussing the Azure Cognitive Search service.

Azure Cognitive Search differs from do-it-yourself techniques in that it is a fully managed search-as-a-service offering, although at its core it is primarily a full-text search engine. It provides a rich search experience over all types of content, including vision, language and speech. Its machine learning features, powered by deep learning models, rank search results contextually. AI-powered algorithms can extract and enrich content, and different kinds of content can be consolidated into a single index.

Full-text search queries are based on Lucene functionality that has been customized with extensions and lockdowns to enable core scenarios. Query execution has four stages: query parsing, lexical analysis, document matching, and scoring. When the query text comes in, the query parser separates the query terms from the query operators and builds the query tree that is sent to the search engine. The separated terms are passed to the analyzers, which perform stemming, canonicalization and removal of non-essential words so that the terms can be matched efficiently. The analyzed terms are handed back to the parser and then proceed to the search engine, which stores and organizes the searchable terms extracted from indexed documents in an inverted index. This index lives separately from the documents, and it is easy to regenerate it offline, independently of query execution. Finally, the search engine retrieves the matching documents from the inverted index and scores them to display the top matches. A sample program to illustrate this example is included here.
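The four stages can also be sketched as a toy pipeline in Python. This is only a minimal illustration under stated assumptions, not how Lucene or Azure Cognitive Search is implemented: the stopword list, the crude plural stemmer, the "+term" required-operator handling and the term-frequency scoring are simplifications chosen for the example, and every function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a tiny stopword list, just enough for the example.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "with"}

def analyze(text):
    """Lexical analysis: lowercase, tokenize, drop stopwords, crude plural stemming."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOPWORDS:
            continue
        # Toy stemmer: strip a trailing "s" except for words like "class" or "spacious".
        if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us")):
            token = token[:-1]
        terms.append(token)
    return terms

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def parse_query(query):
    """Query parsing: separate the '+' (required) operator from the raw terms."""
    required, optional = [], []
    for raw in query.split():
        if raw.startswith("+"):
            required.append(raw[1:])
        else:
            optional.append(raw)
    return required, optional

def search(index, query, top=3):
    """Run parsing, analysis, matching and scoring; return [(doc_id, score), ...]."""
    required_raw, optional_raw = parse_query(query)
    required = analyze(" ".join(required_raw))
    optional = analyze(" ".join(optional_raw))

    # Document matching and scoring: sum term frequencies (a stand-in for real relevance scoring).
    scores = defaultdict(int)
    for term in required + optional:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf

    # Documents missing any required term are dropped.
    for term in required:
        postings = index.get(term, {})
        for doc_id in list(scores):
            if doc_id not in postings:
                del scores[doc_id]

    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]

docs = {
    "h1": "Ocean view hotel with spacious rooms",
    "h2": "Budget motel near the airport",
    "h3": "Spacious hotels in the city, hotel bar included",
}
index = build_index(docs)
print(search(index, "spacious +hotels"))
```

Because the index is built separately from the documents, `build_index` can be rerun at any time without touching the query path, which mirrors the point about regenerating the index offline.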

The REST API for Azure Cognitive Search takes a payload with properties such as “search”, “searchFields”, “searchMode”, “filter”, “orderby”, and “queryType”. The parser can break the search text down into sub-queries such as a term query, a phrase query and a prefix query. The search terms can include wildcards to match several terms at once, for example a trailing asterisk for prefix matching. The search engine scans the fields specified in the searchFields property for documents that match one or more of the search terms. The resulting set is then ordered, and fields of a geography data type make it easy to sort the results by proximity.
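As a minimal sketch of such a request, the Python below builds (but does not send) a POST to the Search Documents endpoint. The service name, index name, key placeholder and the field names `description`, `title`, `price` and `location` are assumptions made up for the example, and the API version shown is one that was current when this article was written.

```python
import json
import urllib.request

def build_search_request(service, index, api_key, payload, api_version="2020-06-30"):
    """Build a POST request for the Search Documents REST endpoint without sending it."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={api_version}"
    )
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# Hypothetical hotel index; all field names are assumptions for illustration.
payload = {
    # A term ("Spacious"), a prefix ("air-condition*") and a required phrase ("Ocean view").
    "search": 'Spacious, air-condition* +"Ocean view"',
    "searchFields": "description, title",
    "searchMode": "any",
    "filter": "price ge 60 and price lt 300",
    # Sort by distance from a point, using a geography-typed field.
    "orderby": "geo.distance(location, geography'POINT(-159.476235 22.227659)')",
    "queryType": "full",
}

request = build_search_request("my-service", "hotels", "<query-key>", payload)
print(request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would issue the call.
```

Setting `queryType` to `full` selects the full Lucene query syntax; the default `simple` syntax also accepts the prefix and phrase forms used here.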

The search service primarily supports two operations: indexing and querying. Indexing is the inbound data path of the search service. It processes the content and converts it into JSON documents. If the content includes mixed file types, searchable text can be extracted from the files. Heterogeneous content can be consolidated into a private, user-defined search index. Large amounts of data stored in external repositories, including Blob storage, Cosmos DB or other storage, can be indexed. The index can be protected against data loss, corruption and disasters by the same mechanisms that are used for the content. The index is also independent of the service, so if one service instance goes down, another can read the same index.
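The inbound path can be illustrated with a small Python sketch that normalizes content from different sources into JSON documents and wraps them in the batch format accepted by the REST indexing endpoint (`POST /indexes/{index}/docs/index`). The document fields `id`, `source` and `content`, the key scheme and the helper names are assumptions for the example; a real index defines its own schema.

```python
import json

def to_search_document(source, key, text):
    """Normalize one item from any source into a JSON-ready search document."""
    return {
        "@search.action": "mergeOrUpload",   # insert the document or update it if the key exists
        "id": f"{source}-{key}",             # hypothetical key scheme: source name plus original key
        "source": source,
        "content": " ".join(text.split()),   # collapse whitespace left over from text extraction
    }

def build_index_batch(items):
    """Wrap documents in the {"value": [...]} envelope used by the indexing endpoint."""
    return json.dumps({"value": [to_search_document(*item) for item in items]})

# Heterogeneous content consolidated into one index: a blob and a database record.
items = [
    ("blob", "report-2021", "Quarterly   report\nextracted from a PDF"),
    ("cosmos", "42", "Customer record description"),
]
batch = build_index_batch(items)
print(batch)
```

In practice, built-in indexers can pull from Blob storage or Cosmos DB without hand-written code like this; the sketch only shows the shape of the documents that end up in the index.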

The querying service supports search experiences from a variety of clients and sits on the outbound path of the search service. The index and the querying service are separate. In the next article, we will compare this service with other search services.

Thursday, August 12, 2021
