Continuous indexing
Azure AI Search supports continuous indexing of documents, enabling near-real-time updates to the search index as new data is ingested. It can connect to data sources such as Azure Blob Storage, Azure SQL Database, or Azure Cosmos DB and pull documents from them continuously. Indexers are configured to monitor these sources for changes: each indexer scans its data source for new, updated, or deleted documents and updates the search index accordingly. The time taken to index new documents depends on factors such as the size of the data, the complexity of the schema, and the service tier; large datasets take longer, especially if the indexer is starved of resources. Once documents are indexed, they are available for querying, although query latency varies with index size, query complexity, and service tier. The minimum interval between scheduled indexer runs is 5 minutes. If this pull model is not fast enough, individual documents can be pushed directly into the index using the index client. Both approaches are shown in the code samples below:
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    IndexingSchedule,
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SimpleField,
)

# Replace with your Azure configuration
search_service_name = ""
blob_container_name = ""
connection_string = ""  # connection string of the blob storage account

endpoint = f"https://{search_service_name}.search.windows.net/"

# Authenticate using DefaultAzureCredential (the search service must have
# role-based access control enabled)
credential = DefaultAzureCredential()

# Initialize the clients: SearchIndexClient manages indexes,
# SearchIndexerClient manages data sources and indexers
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)
indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)

# Define the data source
data_source_name = "blob-data-source"
data_source = SearchIndexerDataSourceConnection(
    name=data_source_name,
    type="azureblob",
    connection_string=connection_string,
    container=SearchIndexerDataContainer(name=blob_container_name),
)

# Create or update the data source in Azure AI Search
indexer_client.create_or_update_data_source_connection(data_source)

# Define the index
index_name = "blob-index"
index = SearchIndex(
    name=index_name,
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SimpleField(name="category", type=SearchFieldDataType.String),
        SimpleField(name="sourcefile", type=SearchFieldDataType.String),
        SimpleField(name="metadata_storage_name", type=SearchFieldDataType.String),
    ],
)

# Create or update the index
index_client.create_or_update_index(index)

# Define the indexer, scheduled to run every 5 minutes (the minimum interval)
indexer_name = "blob-indexer"
indexer = SearchIndexer(
    name=indexer_name,
    data_source_name=data_source_name,
    target_index_name=index_name,
    schedule=IndexingSchedule(interval=timedelta(minutes=5)),
)

# Create or update the indexer
indexer_client.create_or_update_indexer(indexer)

print("Configured continuous indexing from Azure Blob Storage to Azure AI Search!")
import os
import re

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Replace with your Azure credentials and configuration
service_name = ""
admin_key = ""
index_name = "blob-index"

# Initialize the SearchClient used to push documents directly into the index
endpoint = f"https://{service_name}.search.windows.net/"
credential = AzureKeyCredential(admin_key)
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)
# Upload a document to the index
def index_document(filename):
    print(f"Indexing document '{filename}' into search index '{index_name}'")
    with open(filename, "r") as fin:
        text = fin.read()
    # Documents must be dicts matching the index schema; document keys may
    # only contain letters, digits, underscores, dashes, and equal signs
    doc_id = re.sub(r"[^A-Za-z0-9_\-=]", "_", os.path.basename(filename))
    batch = [{"id": doc_id, "content": text, "sourcefile": filename}]
    results = search_client.upload_documents(documents=batch)
    succeeded = sum(1 for r in results if r.succeeded)
    print(f"\tIndexed {len(results)} documents, {succeeded} succeeded")
The default rate limit for adding documents to the index varies with service tier, replicas, and partitions. Higher service tiers have higher rate limits; adding replicas increases query throughput, while adding partitions increases indexing throughput. Up to 1,000 documents can be sent in a single batch, and batching optimizes throughput and reduces the likelihood of hitting rate limits.
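For bulk loads, a simple pattern is to chunk documents into batches of at most 1,000 and back off when the service throttles. A minimal sketch, assuming the search_client above and a list of documents already shaped for the index:
import time
from azure.core.exceptions import HttpResponseError

def upload_in_batches(docs, batch_size=1000):
    # Send documents in chunks of at most 1,000 (the per-batch limit)
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        try:
            results = search_client.upload_documents(documents=batch)
            succeeded = sum(1 for r in results if r.succeeded)
            print(f"Batch {i // batch_size + 1}: {succeeded}/{len(results)} succeeded")
        except HttpResponseError as exc:
            if exc.status_code == 503:  # throttled: back off and retry once
                time.sleep(5)
                search_client.upload_documents(documents=batch)
            else:
                raise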