Sunday, April 20, 2025

Continuous indexing

Azure AI Search supports continuous indexing of documents, enabling near-real-time updates to the search index as new data is ingested. It can connect to various data sources, such as Azure Blob Storage, SQL databases, or Cosmos DB, and ingest documents continuously. Indexers are configured to monitor these sources for changes and update the search index accordingly: the indexer scans the data source for new, updated, or deleted documents. The time taken to index new documents depends on factors like the size of the data, the complexity of the schema, and the service tier. For large datasets, indexing may take longer, especially if the indexer is resource-starved. Once documents are indexed, they are available for querying, though query latency can vary with the size of the index, query complexity, and service tier. The minimum interval between scheduled indexer runs is five minutes. If this pull model is not fast enough, individual items can be pushed directly to the index with the index client. Both approaches are shown in the code samples below:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    IndexingSchedule,
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SimpleField,
)

# Replace with your Azure configuration. Data sources, indexes, and
# indexers are data-plane resources, so they are managed with the
# azure-search-documents clients rather than the management client.
search_service_name = ""
blob_container_name = ""
connection_string = ""  # connection string of the blob storage account

endpoint = f"https://{search_service_name}.search.windows.net/"

# Authenticate using DefaultAzureCredential (requires an appropriate
# RBAC role, e.g. Search Service Contributor, on the search service)
credential = DefaultAzureCredential()
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)
indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)

# Define the data source pointing at the blob container
data_source = SearchIndexerDataSourceConnection(
    name="blob-data-source",
    type="azureblob",
    connection_string=connection_string,
    container=SearchIndexerDataContainer(name=blob_container_name),
)

# Create or update the data source in Azure AI Search
indexer_client.create_or_update_data_source_connection(data_source)

# Define the index
index_name = "blob-index"
index = SearchIndex(
    name=index_name,
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SimpleField(name="category", type=SearchFieldDataType.String),
        SimpleField(name="sourcefile", type=SearchFieldDataType.String),
        SimpleField(name="metadata_storage_name", type=SearchFieldDataType.String),
    ],
)

# Create or update the index
index_client.create_or_update_index(index)

# Define the indexer; every 5 minutes (PT5M) is the minimum schedule interval
indexer = SearchIndexer(
    name="blob-indexer",
    data_source_name=data_source.name,
    target_index_name=index_name,
    schedule=IndexingSchedule(interval=timedelta(minutes=5)),
)

# Create or update the indexer
indexer_client.create_or_update_indexer(indexer)

print("Configured continuous indexing from Azure Blob Storage to Azure AI Search!")

import os
import re

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Replace with your Azure credentials and configuration
service_name = ""
admin_key = ""
index_name = "blob-index"

# Initialize the SearchClient for the target index
endpoint = f"https://{service_name}.search.windows.net/"
credential = AzureKeyCredential(admin_key)
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)

# Upload a document to the index by pushing it directly
def index_document(filename):
    print(f"Indexing document '{filename}' into search index '{index_name}'")
    with open(filename, "r") as fin:
        text = fin.read()
    # Documents are dictionaries whose keys match the index schema; the key
    # field may only contain letters, digits, underscores, dashes, and equal
    # signs, so the filename is sanitized before use
    doc_id = re.sub(r"[^A-Za-z0-9_\-=]", "_", os.path.basename(filename))
    batch = [{"id": doc_id, "content": text, "sourcefile": filename}]
    results = search_client.upload_documents(documents=batch)
    succeeded = sum(1 for r in results if r.succeeded)
    print(f"\tIndexed {len(results)} documents, {succeeded} succeeded")

The default rate limit for adding documents to the index varies with the service tier and the number of replicas and partitions. Higher service tiers have higher rate limits; adding replicas increases query throughput, while adding partitions increases indexing throughput. Up to 1,000 documents can be sent in a single batch, and batching optimizes throughput and reduces the likelihood of hitting rate limits.
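
To stay within these limits when ingesting many documents, uploads can be chunked into batches of at most 1,000 and retried with backoff when the service throttles. A minimal sketch, assuming a hypothetical documents list of dictionaries shaped like the index schema; the handling of HTTP 429 via HttpResponseError reflects how the SDK surfaces throttling:

import time

from azure.core.exceptions import HttpResponseError

BATCH_SIZE = 1000  # maximum number of documents per upload batch

def upload_in_batches(search_client, documents, max_retries=5):
    for start in range(0, len(documents), BATCH_SIZE):
        batch = documents[start:start + BATCH_SIZE]
        for attempt in range(max_retries):
            try:
                results = search_client.upload_documents(documents=batch)
                succeeded = sum(1 for r in results if r.succeeded)
                print(f"Batch at offset {start}: {succeeded}/{len(batch)} succeeded")
                break
            except HttpResponseError as error:
                # Back off exponentially on throttling (HTTP 429), then retry
                if error.status_code == 429 and attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                else:
                    raise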

