Azure Cognitive Search
This article continues the series of articles that started with the
description of the SignalR service. The last article covered Azure Stream
Analytics; in this one, we begin to discuss the Azure Cognitive Search service.
Azure Cognitive Search differs from do-it-yourself techniques in that it is a fully managed search-as-a-service, although
it is primarily a full-text search engine. It provides a rich user experience for
searching all types of content, including vision, language and speech. It
provides machine learning features to rank search results contextually and is
powered by deep learning models. It can extract and enrich content using
AI-powered algorithms, and different kinds of content can be consolidated to build a single
index.
Full-text search queries are based on Lucene
functionality that has been customized with extensions and lockdowns to enable
core scenarios. Query execution has four stages: query
parsing, lexical analysis, document matching, and scoring. When the query text
comes in, the query parser separates the query terms from the query operators
and creates the query tree to be sent to the search engine. The separated terms
are sent to the analyzers, which perform stemming, canonicalization and
stop-word removal so that the terms can be matched efficiently. The analyzed terms are sent back to
the parser and then proceed to the search engine, which relies on an inverted index that stores and organizes
the searchable terms extracted from indexed documents. This index lives separately
from the documents, and it is easy to regenerate it offline, independently of query
execution. Finally, the search engine retrieves the matching documents from the
inverted index and scores them to display the top matches. A sample program to illustrate this
example is included here.
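The four stages can also be seen in miniature in the toy sketch below. It is not the sample program mentioned above and it is not how Azure Cognitive Search or Lucene is implemented; the stop-word list, the crude "strip a trailing s" stemmer, the "+term" required-term syntax and the TF-IDF-style score are all simplifying assumptions chosen only to make the stages visible.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "in", "to"}


def analyze(text):
    """Lexical analysis: lowercase, tokenize, drop stop words, crude stemming."""
    terms = []
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        if tok in STOP_WORDS:
            continue
        # Very crude stemming: strip a trailing "s" from longer words.
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        terms.append(tok)
    return terms


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(analyze(text)).items():
            index[term][doc_id] = tf
    return index


def parse_query(query):
    """Query parsing: separate the '+' operator from the terms it applies to."""
    required, optional = [], []
    for raw in query.split():
        target = required if raw.startswith("+") else optional
        target.extend(analyze(raw))
    return required, optional


def search(query, docs, index, top=3):
    """Document matching and scoring over the inverted index."""
    required, optional = parse_query(query)
    scores = Counter()
    for term in required + optional:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + len(docs) / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # A document matches only if it contains every required term.
    matches = [d for d in scores if all(d in index.get(t, {}) for t in required)]
    return sorted(matches, key=lambda d: (-scores[d], d))[:top]


DOCS = {
    1: "Spacious rooms with ocean view",
    2: "Budget hotel near the airport",
    3: "Ocean front hotel with spacious suites",
}
INDEX = build_index(DOCS)
print(search("+ocean hotels", DOCS, INDEX))  # prints [3, 1]
```

Document 2 mentions a hotel but is excluded because it lacks the required term, and document 3 ranks first because it matches both terms. The index is built once from the documents and can be rebuilt at any time without touching the query code, which mirrors the separation described above.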
The REST API for Azure Cognitive Search takes a payload
with properties such as “search”, “searchFields”, “searchMode”, “filter”,
“orderby”, and “queryType”. The query is broken down into sub-queries
such as a term query, a phrase query and a prefix query. The search terms can
include wildcards, for example to match every term that shares a prefix. The search engine
scans the fields specified in the searchFields property for documents that
match one or more of the search terms. The resulting set is ordered, and
geography data types make it easy to sort the results by proximity to a
given point.
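As a rough illustration of such a payload, the sketch below builds a search request with the Python standard library and stops short of sending it. The service name, index name, field names (description, title, price, location), the coordinates and the API version are assumptions made for the example; check them against your own index schema and the API version you target.

```python
import json
import urllib.request


def build_search_request(service, index, api_key, api_version="2020-06-30"):
    """Build (but do not send) a POST request for the Search Documents API."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={api_version}"
    )
    body = {
        # One plain term, one prefix (wildcard) term and one required phrase.
        "search": 'spacious air-condition* +"ocean view"',
        # Only these (hypothetical) fields are scanned for matches.
        "searchFields": "description, title",
        # "any": a document matches if any of the optional terms matches.
        "searchMode": "any",
        # OData filter applied to a hypothetical numeric field.
        "filter": "price ge 60 and price lt 300",
        # Sort by distance from a point, using a hypothetical geography field.
        "orderby": "geo.distance(location, geography'POINT(-159.476235 22.227659)')",
        # "full" selects the full Lucene query syntax.
        "queryType": "full",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


request = build_search_request("my-service", "hotels", "<query-key>")
print(request.full_url)
print(json.loads(request.data)["orderby"])
```

Passing the returned object to urllib.request.urlopen would send the query; that step is left out here because it needs a real service and key.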
The search service primarily supports indexing and
querying. Indexing is associated with the input data path to the search
service. It processes the content and converts it to JSON documents. If the
content includes mixed files, searchable text can be extracted from the files.
Heterogeneous content can be consolidated into a private user-defined search
index. Large amounts of data stored in external repositories, including Blob
storage, Cosmos DB or other storage, can now be indexed. The index can be protected against data loss,
corruption and disasters via the same mechanisms that are used for the
content. The index is also independent of
the service, so if one service goes down, another can read the same index.
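To make the idea of consolidating heterogeneous content into JSON documents concrete, here is a minimal sketch that normalizes two kinds of source records into one document shape and wraps them in an upload batch. The shapes of the blob and Cosmos DB records, the field names (id, source, content) and the helper names are hypothetical; only the "value" array and the "@search.action" property follow the document-upload payload of the REST API.

```python
import base64
import json


def blob_to_document(name, text):
    """Normalize a (hypothetical) blob record into the shared document shape."""
    # Blob names can contain characters such as '/' and '.', so the name is
    # URL-safe base64 encoded to obtain a key made of safe characters.
    key = base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")
    return {"id": key, "source": "blob", "content": text}


def cosmos_to_document(item):
    """Normalize a (hypothetical) Cosmos DB item into the shared document shape."""
    return {"id": item["id"], "source": "cosmos", "content": item["description"]}


def to_upload_batch(documents):
    """Wrap normalized documents in an indexing batch with the 'upload' action."""
    actions = []
    for doc in documents:
        action = {"@search.action": "upload"}
        action.update(doc)
        actions.append(action)
    return {"value": actions}


batch = to_upload_batch([
    blob_to_document("reports/q1.pdf", "Quarterly report text extracted from the file"),
    cosmos_to_document({"id": "42", "description": "Ocean view room"}),
])
print(json.dumps(batch, indent=2))
```

Both records end up in the same index schema regardless of where they came from, which is what lets a single index serve queries over mixed content.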
The querying service supports search experiences from a
variety of clients and sits on the outbound path of the search service. The
index and the querying service are separate. In the next article, we will
compare this service with other search
services.