Monday, November 17, 2025

It’s surprising that vector stores make it difficult to export and import vectors, while models, which are also composed of vectors, are freely available to download. It is as if vectors were not really data that can be exported and imported, and every vector store treated its contents as proprietary, with no support for interoperability as a first-class concern.

The following scripts therefore back up data from an Azure AI Search resource to an Azure storage account. As a reference workload, consider an index of about 70,000 entries, each with a 1536-dimension vector field, for a total index size of just over a gigabyte.
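
All four scripts assume the Azure CLI, curl, and jq are installed and that you are signed in to the right subscription; a quick preflight check along these lines can save debugging later:

#!/bin/bash
# Preflight: verify tooling and an authenticated Azure CLI session
for tool in az curl jq; do
  command -v "$tool" >/dev/null || { echo "$tool is required"; exit 1; }
done
az account show -o none 2>/dev/null || az login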

Step 1: Export the schema:

#!/bin/bash

# Variables
search_service="srch-vision-01"
index_name="index007"
resource_group="rg-ctl-2"
api_version="2023-10-01-Preview"
schema_file="index-${index_name}-schema.json"

echo "$search_service"
echo "$index_name"
echo "$resource_group"
echo "$schema_file"

# Get admin key
admin_key=$(az search admin-key show --service-name "$search_service" --resource-group "$resource_group" --query primaryKey --output tsv)

echo "$admin_key"

# Export schema using REST API
curl -X GET "https://$search_service.search.windows.net/indexes/$index_name?api-version=$api_version" \
  -H "api-key: $admin_key" \
  -H "Content-Type: application/json" \
  -o "$schema_file"

echo "schema exported"

Step 2: Export the data:

#!/bin/bash

# Export one document at a time using the REST API and a loop

# Variables
search_service="srch-vision-01"
index_name="index007"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
total_docs=27
api_version="2023-10-01-Preview"

echo "$search_service"
echo "$index_name"
echo "$resource_group"
echo "$storage_account"
echo "$container_name"
echo "$total_docs"

# Get admin key
admin_key=$(az search admin-key show --service-name "$search_service" --resource-group "$resource_group" --query primaryKey --output tsv)

echo "$admin_key"

storage_key=$(az storage account keys list \
  --account-name "$storage_account" \
  --resource-group "$resource_group" \
  --query "[0].value" --output tsv)

echo "$storage_key"

for ((i=0; i<total_docs; i++)); do
  file_name="doc_$i.json"
  blob_name="indexes/$index_name/data/$file_name"

  # Check if blob already exists
  exists=$(az storage blob exists \
    --account-name "$storage_account" \
    --account-key "$storage_key" \
    --container-name "$container_name" \
    --name "$blob_name" \
    --query exists --output tsv)

  if [ "$exists" == "true" ]; then
    echo "Skipping export for doc $i (already exists in blob)"
    continue
  fi

  # Export one document
  # Note: skip/top paging assumes a stable default ordering; an explicit orderby on the key field would make it deterministic
  curl -s -X POST "https://$search_service.search.windows.net/indexes/$index_name/docs/search?api-version=$api_version" \
    -H "api-key: $admin_key" \
    -H "Content-Type: application/json" \
    -d "{\"search\":\"*\",\"top\":1,\"skip\":$i}" \
    | jq '.value[0]' > "$file_name"

  # Upload to blob
  az storage blob upload \
    --account-name "$storage_account" \
    --account-key "$storage_key" \
    --container-name "$container_name" \
    --name "$blob_name" \
    --file "$file_name"

  # Clean up local file
  rm "$file_name"
done
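
Rather than hardcoding total_docs, the live count can be read from the index itself via the $count endpoint of the same REST API; a small sketch (tr strips the BOM and whitespace the plain-text response may carry):

# Read the document count from the index instead of hardcoding total_docs
total_docs=$(curl -s "https://$search_service.search.windows.net/indexes/$index_name/docs/\$count?api-version=$api_version" \
  -H "api-key: $admin_key" | tr -cd '0-9')
echo "$total_docs documents to export"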

Step 3: Import the schema:

#!/bin/bash

# Variables
search_service="srch-vision-01"
index_name="index007"
dest_index_name="${index_name}copy"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
api_version="2023-10-01-Preview"

echo "$search_service"
echo "$index_name"
echo "$dest_index_name"
echo "$resource_group"
echo "$storage_account"
echo "$container_name"

# Get admin key
admin_key=$(az search admin-key show --service-name "$search_service" --resource-group "$resource_group" --query primaryKey --output tsv)

echo "$admin_key"

storage_key=$(az storage account keys list \
  --account-name "$storage_account" \
  --resource-group "$resource_group" \
  --query "[0].value" --output tsv)

echo "$storage_key"

# Schema blob path, matching the layout used when the schema was uploaded after Step 1
file_name="index-${index_name}-schema.json"
blob_name="indexes/$index_name/$file_name"

echo "$file_name"

exists=$(az storage blob exists \
  --account-name "$storage_account" \
  --account-key "$storage_key" \
  --container-name "$container_name" \
  --name "$blob_name" \
  --query exists --output tsv --only-show-errors)

if [ "$exists" != "true" ]; then
  echo "Skipping import for schema $blob_name (blob missing)"
  exit 1
fi

# Download blob
az storage blob download \
  --account-name "$storage_account" \
  --account-key "$storage_key" \
  --container-name "$container_name" \
  --name "$blob_name" \
  --file "$file_name" \
  -o none

schema_exists=$(curl -s -X GET "https://$search_service.search.windows.net/indexes/$dest_index_name?api-version=$api_version" \
  -H "api-key: $admin_key" \
  -H "Content-Type: application/json" \
  | jq -r 'if .error then "false" else "true" end')

if [ "$schema_exists" == "true" ]; then
  echo "Skipping import for schema (already exists in index)"
  rm "$file_name"
  exit 0
fi

# Rename the index inside the exported schema before creating the copy
sed -i "s/$index_name/$dest_index_name/g" "$file_name"

curl -X PUT "https://$search_service.search.windows.net/indexes/$dest_index_name?api-version=$api_version" \
  -H "api-key: $admin_key" \
  -H "Content-Type: application/json" \
  --data-binary "@$file_name"

echo "schema imported"

Step 4: Import the data:

#!/bin/bash

# Import one document at a time using the REST API and a loop

# Variables
search_service="srch-vision-01"
index_name="index007"
dest_index_name="${index_name}copy"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
total_docs=27
api_version="2023-10-01-Preview"

echo "$search_service"
echo "$index_name"
echo "$dest_index_name"
echo "$resource_group"
echo "$storage_account"
echo "$container_name"
echo "$total_docs"

# Get admin key
admin_key=$(az search admin-key show --service-name "$search_service" --resource-group "$resource_group" --query primaryKey --output tsv)

echo "$admin_key"

storage_key=$(az storage account keys list \
  --account-name "$storage_account" \
  --resource-group "$resource_group" \
  --query "[0].value" --output tsv)

echo "$storage_key"

for ((i=0; i<total_docs; i++)); do
  file_name="doc_$i.json"
  blob_name="indexes/$index_name/data/$file_name"

  # Check if blob exists
  exists=$(az storage blob exists \
    --account-name "$storage_account" \
    --account-key "$storage_key" \
    --container-name "$container_name" \
    --name "$blob_name" \
    --query exists --output tsv)

  if [ "$exists" != "true" ]; then
    echo "Skipping import for doc $i (blob missing)"
    continue
  fi

  # Download blob
  az storage blob download \
    --account-name "$storage_account" \
    --account-key "$storage_key" \
    --container-name "$container_name" \
    --name "$blob_name" \
    --file "$file_name" \
    -o none

  if [ ! -f "$file_name" ]; then
    echo "Skipping import for doc $i (download failed)"
    continue
  fi

  # Extract document ID ("// empty" leaves doc_id blank, rather than the string "null", when no key is found)
  doc_id=$(jq -r '.["@search.documentKey"] // .id // .Id // .ID // empty' "$file_name")

  if [ -z "$doc_id" ]; then
    echo "Skipping import for doc $i (missing ID)"
    rm "$file_name"
    continue
  fi

  echo "$doc_id"

  # Check if document already exists in index
  exists_in_index=$(curl -s -X GET "https://$search_service.search.windows.net/indexes/$dest_index_name/docs/$doc_id?api-version=$api_version" \
    -H "api-key: $admin_key" \
    -H "Content-Type: application/json" \
    | jq -r 'if .error then "false" else "true" end')

  if [ "$exists_in_index" == "true" ]; then
    echo "Skipping import for doc $i (already exists in index)"
    rm "$file_name"
    continue
  fi

  # Strip the @search.* metadata keys and wrap the document in the batch envelope the index operation expects
  jq '{value: [with_entries(select(.key | startswith("@search") | not))]}' "$file_name" > "filtered_$file_name"

  # Import to index
  curl -s -X POST "https://$search_service.search.windows.net/indexes/$dest_index_name/docs/index?api-version=$api_version" \
    -H "api-key: $admin_key" \
    -H "Content-Type: application/json" \
    --data-binary "@filtered_$file_name"

  # Clean up local files
  rm "filtered_$file_name"
  rm "$file_name"
done
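
Once the loop finishes, comparing document counts between the source and destination indexes is a cheap sanity check; note that freshly indexed documents can take a few seconds to show up in the count:

# Compare document counts between source and destination
for idx in "$index_name" "$dest_index_name"; do
  count=$(curl -s "https://$search_service.search.windows.net/indexes/$idx/docs/\$count?api-version=$api_version" \
    -H "api-key: $admin_key" | tr -cd '0-9')
  echo "$idx: $count documents"
done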

Errors encountered along the way that the scripts above already address:

1. The api-version query string parameter must be present and must be a version the service supports:

{"error":{"code":"","message":"Invalid or missing api-version query string parameter."}}

2. The document downloaded from the search API carries response metadata, so on export the data is taken only from the value field, and on import it must be wrapped back into a {"value": [...]} batch envelope. Posting a bare document instead makes the service read each top-level field as an operation parameter, as these import-time messages show:

a. {"error":{"code":"","message":"The request is invalid. Details: The parameter 'id' in the request payload is not a valid parameter for the operation 'index'."}}

b. {"error":{"code":"","message":"The request is invalid. Details: The parameter 'description' in the request payload is not a valid parameter for the operation 'index'."}}

Conclusion: Moving the data of a 1 GB index out of an Azure AI Search resource and into a storage account saves about a hundred dollars every month in billing, not to mention the benefits of aging, tiering, disaster recovery, and more.

