It’s surprising that vector stores make it so hard to export and import vectors, when models, which are themselves collections of vectors, are freely available to download. It is as if the vectors were not portable data at all, and every vector store treated its contents as proprietary, with no support for interoperability as a first-class concern.
The following scripts therefore back up data from an Azure AI Search resource to an Azure storage account. The example index holds about 70,000 entries, each with a 1536-dimension vector field (the raw float vectors alone come to roughly 70,000 × 1536 × 4 bytes ≈ 430 MB), for a total index size of just over a gigabyte.
Step 1. Export the schema:
#! /bin/bash
# Variables
search_service="srch-vision-01"
index_name="index007"
resource_group="rg-ctl-2"
schema_file="index-$index_name-schema.json"
echo $search_service
echo $index_name
echo $resource_group
echo $schema_file
# Get admin key
admin_key=$(az search admin-key show --service-name $search_service --resource-group $resource_group --query primaryKey --output tsv)
echo $admin_key
# Export schema using REST API
curl -X GET "https://$search_service.search.windows.net/indexes/$index_name?api-version=2023-10-01-Preview" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
-o $schema_file
echo "schema exported"
Step 2. Export the data:
#! /bin/bash
# Export one document at a time using REST API and loop
# Variables
search_service="srch-vision-01"
index_name="index007"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
total_docs=27
api_version="2023-10-01-Preview"
echo $search_service
echo $index_name
echo $resource_group
echo $storage_account
echo $container_name
echo $total_docs
# Get admin key
admin_key=$(az search admin-key show --service-name $search_service --resource-group $resource_group --query primaryKey --output tsv)
echo $admin_key
storage_key=$(az storage account keys list \
--account-name $storage_account \
--resource-group $resource_group \
--query "[0].value" --output tsv)
echo $storage_key
for ((i=0; i<$total_docs; i++)); do
file_name="doc_$i.json"
blob_name="indexes/$index_name/data/$file_name"
# Check if blob already exists
exists=$(az storage blob exists \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--query exists --output tsv)
if [ "$exists" == "true" ]; then
echo "Skipping export for doc $i (already exists in blob)"
continue
fi
# Export one document
curl -s -X POST "https://$search_service.search.windows.net/indexes/$index_name/docs/search?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
-d "{\"search\":\"*\",\"top\":1,\"skip\":$i}" \
| jq '.value[0]' > "$file_name"
# Upload to blob
az storage blob upload \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--file $file_name
# Clean up local file
rm "$file_name"
done
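Rather than hard-coding total_docs, the count can be read from the index itself. A small sketch using the $count endpoint (same variables as above; tr strips any stray non-digit bytes from the plain-text response):
total_docs=$(curl -s -X GET "https://$search_service.search.windows.net/indexes/$index_name/docs/\$count?api-version=$api_version" \
-H "api-key: $admin_key" | tr -cd '0-9')
echo $total_docs
Note that top/skip paging runs into the service's $skip cap (100,000 at the time of writing), so for much larger indexes a filter over the key field is the more robust cursor.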
Step 3: Import the schema:
#! /bin/bash
# Variables
search_service="srch-vision-01"
index_name="index007"
dest_index_name="${index_name}copy"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
total_docs=2
api_version="2023-10-01-Preview"
echo $search_service
echo $index_name
echo $dest_index_name
echo $resource_group
echo $storage_account
echo $container_name
echo $total_docs
# Get admin key
admin_key=$(az search admin-key show --service-name $search_service --resource-group $resource_group --query primaryKey --output tsv)
echo $admin_key
storage_key=$(az storage account keys list \
--account-name $storage_account \
--resource-group $resource_group \
--query "[0].value" --output tsv)
echo $storage_key
# Schema blob name must match the upload path used in Step 1
file_name="index-$index_name-schema.json"
blob_name="indexes/$index_name/$file_name"
echo $file_name
exists=$(az storage blob exists \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--query exists --output tsv --only-show-errors)
if [ "$exists" != "true" ]; then
echo "Skipping import for schema $blob_name (blob missing)"
exit 1
fi
# Download blob
az storage blob download \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--file $file_name \
-o none
# Skip if the destination index already exists
schema_exists=$(curl -s -X GET "https://$search_service.search.windows.net/indexes/$dest_index_name?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
| jq -r 'if .error then "false" else "true" end')
if [ "$schema_exists" == "true" ]; then
echo "Skipping import for schema (already exists in index)"
rm "$file_name"
exit 0
fi
sed -i "s/$index_name/$dest_index_name/g" "$file_name"
curl -X PUT "https://$search_service.search.windows.net/indexes/$dest_index_name?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
--data-binary "@$file_name"
echo "schema imported"
Step 4: Import the data:
#! /bin/bash
# Import one document at a time using REST API and loop
# Variables
search_service="srch-vision-01"
index_name="index007"
dest_index_name="${index_name}copy"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
total_docs=27
api_version="2023-10-01-Preview"
echo $search_service
echo $index_name
echo $dest_index_name
echo $resource_group
echo $storage_account
echo $container_name
echo $total_docs
# Get admin key
admin_key=$(az search admin-key show --service-name $search_service --resource-group $resource_group --query primaryKey --output tsv)
echo $admin_key
storage_key=$(az storage account keys list \
--account-name $storage_account \
--resource-group $resource_group \
--query "[0].value" --output tsv)
echo $storage_key
for ((i=0; i<$total_docs; i++)); do
file_name="doc_$i.json"
blob_name="indexes/$index_name/data/$file_name"
# Check if blob exists
exists=$(az storage blob exists \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--query exists --output tsv)
if [ "$exists" != "true" ]; then
echo "Skipping import for doc $i (blob missing)"
continue
fi
# Download blob
az storage blob download \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--file $file_name \
-o none
if [ ! -f "$file_name" ]; then
echo "Skipping import for doc $i (download failed)"
continue
fi
# Extract document ID
doc_id=$(jq -r '.["@search.documentKey"] // .id // .Id // .ID' "$file_name")
if [ -z "$doc_id" ] || [ "$doc_id" == "null" ]; then
echo "Skipping import for doc $i (missing ID)"
rm "$file_name"
continue
fi
echo $doc_id
# Check if document already exists in index
exists_in_index=$(curl -s -X GET "https://$search_service.search.windows.net/indexes/$dest_index_name/docs/$doc_id?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
| jq -r 'if .error then "false" else "true" end')
if [ "$exists_in_index" == "true" ]; then
echo "Skipping import for doc $i (already exists in index)"
rm "$file_name"
continue
fi
# jq 'with_entries(select(.key != "id"))' "$file_name" > "filtered_$file_name"
# Wrap the document for the index operation and drop search metadata such as @search.score
jq '{value: [with_entries(select(.key | startswith("@search.") | not))]}' "$file_name" > "filtered_$file_name"
# Import to index
curl -s -X POST "https://$search_service.search.windows.net/indexes/$dest_index_name/docs/index?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
--data-binary "@filtered_$file_name"
# Clean up local file
rm "filtered_$file_name"
rm "$file_name"
done
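The one-document-per-request loop is easy to resume after an interruption, but it is slow for a large index. The same docs/index operation accepts batches of up to 1,000 actions per request, so once the doc_*.json blobs have been downloaded locally they can be combined into a single payload. A sketch under that assumption:
# Slurp all local documents into one value array, stripping search metadata
jq -s '{value: [.[] | with_entries(select(.key | startswith("@search.") | not))]}' doc_*.json > batch.json
curl -s -X POST "https://$search_service.search.windows.net/indexes/$dest_index_name/docs/index?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
--data-binary "@batch.json"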
Errors encountered that are already addressed by the script:
1. The api-version query string parameter must be valid and consistent across calls:
{"error":{"code":"","message":"Invalid or missing api-version query string parameter."}}
2. The downloaded document carries search metadata, so only the fields inside the value element can be imported. This is evident from the messages seen during import:
a. {"error":{"code":"","message":"The request is invalid. Details: The parameter 'id' in the request payload is not a valid parameter for the operation 'index'."}}
b. {"error":{"code":"","message":"The request is invalid. Details: The parameter 'description' in the request payload is not a valid parameter for the operation 'index'."}}
Conclusion: Moving the data of a roughly 1 GB index from an Azure AI Search resource to a storage account saves about a hundred dollars a month on the bill, not to mention the benefits of aging, tiering, disaster recovery, and more.