It’s surprising that vector stores make it so hard to export and import vectors, when models, which are themselves collections of vectors, are freely available to download. It is as if the vectors were not portable data at all, and every vector store treated its contents as proprietary, with no support for interoperability as a first-class concern.
The following scripts therefore back up data from an Azure AI Search resource to an Azure storage account. The example index holds about 70,000 entries, each with a 1536-dimension vector field (the raw float vectors alone come to roughly 70,000 × 1536 × 4 bytes ≈ 430 MB), for a total index size of just over a gigabyte.
Step 1. Export the schema:
#! /bin/bash
# Variables
search_service="srch-vision-01"
index_name="index007"
resource_group="rg-ctl-2"
schema_file="index-$index_name-schema.json"
echo $search_service
echo $index_name
echo $resource_group
echo $schema_file
# Get admin key
admin_key=$(az search admin-key show --service-name $search_service --resource-group $resource_group --query primaryKey --output tsv)
echo $admin_key
# Export schema using REST API
curl -X GET "https://$search_service.search.windows.net/indexes/$index_name?api-version=2023-10-01-Preview" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
-o $schema_file
echo "schema exported"
Step 2. Export the data:
#! /bin/bash
# Export one document at a time using REST API and loop
# Variables
search_service="srch-vision-01"
index_name="index007"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
total_docs=27
api_version="2023-10-01-Preview"
echo $search_service
echo $index_name
echo $resource_group
echo $storage_account
echo $container_name
echo $total_docs
# Get admin key
admin_key=$(az search admin-key show --service-name $search_service --resource-group $resource_group --query primaryKey --output tsv)
echo $admin_key
storage_key=$(az storage account keys list \
--account-name $storage_account \
--resource-group $resource_group \
--query "[0].value" --output tsv)
echo $storage_key
for ((i=0; i<$total_docs; i++)); do
file_name="doc_$i.json"
blob_name="indexes/$index_name/data/$file_name"
# Check if blob already exists
exists=$(az storage blob exists \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--query exists --output tsv)
if [ "$exists" == "true" ]; then
echo "Skipping export for doc $i (already exists in blob)"
continue
fi
# Export one document
curl -s -X POST "https://$search_service.search.windows.net/indexes/$index_name/docs/search?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
-d "{\"search\":\"*\",\"top\":1,\"skip\":$i}" \
| jq '.value[0]' > "$file_name"
# Upload to blob
az storage blob upload \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--file $file_name
# Clean up local file
rm "$file_name"
done
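Rather than hard-coding total_docs, the count can be read from the index itself. A small sketch using the $count endpoint (same variables as above; tr strips any stray non-digit bytes from the plain-text response):
total_docs=$(curl -s -X GET "https://$search_service.search.windows.net/indexes/$index_name/docs/\$count?api-version=$api_version" \
-H "api-key: $admin_key" | tr -cd '0-9')
echo $total_docs
Note that top/skip paging runs into the service's $skip cap (100,000 at the time of writing), so for much larger indexes a filter over the key field is the more robust cursor.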
Step 3: Import the schema:
#! /bin/bash
# Variables
search_service="srch-vision-01"
index_name="index007"
dest_index_name="${index_name}copy"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
total_docs=2
api_version="2023-10-01-Preview"
echo $search_service
echo $index_name
echo $dest_index_name
echo $resource_group
echo $storage_account
echo $container_name
echo $total_docs
# Get admin key
admin_key=$(az search admin-key show --service-name $search_service --resource-group $resource_group --query primaryKey --output tsv)
echo $admin_key
storage_key=$(az storage account keys list \
--account-name $storage_account \
--resource-group $resource_group \
--query "[0].value" --output tsv)
echo $storage_key
# Schema blob name must match the upload path used in Step 1
file_name="index-$index_name-schema.json"
blob_name="indexes/$index_name/$file_name"
echo $file_name
exists=$(az storage blob exists \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--query exists --output tsv --only-show-errors)
if [ "$exists" != "true" ]; then
echo "Skipping import for schema $blob_name (blob missing)"
exit 1
fi
# Download blob
az storage blob download \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--file $file_name \
-o none
# Skip if the destination index already exists
schema_exists=$(curl -s -X GET "https://$search_service.search.windows.net/indexes/$dest_index_name?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
| jq -r 'if .error then "false" else "true" end')
if [ "$schema_exists" == "true" ]; then
echo "Skipping import for schema (already exists in index)"
rm "$file_name"
exit 0
fi
sed -i "s/$index_name/$dest_index_name/g" "$file_name"
curl -X PUT "https://$search_service.search.windows.net/indexes/$dest_index_name?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
--data-binary "@$file_name"
echo "schema imported"
Step 4: Import the data:
#! /bin/bash
# Import one document at a time using REST API and loop
# Variables
search_service="srch-vision-01"
index_name="index007"
dest_index_name="${index_name}copy"
resource_group="rg-ctl-2"
storage_account="sadronevideo"
container_name="metadata"
total_docs=27
api_version="2023-10-01-Preview"
echo $search_service
echo $index_name
echo $dest_index_name
echo $resource_group
echo $storage_account
echo $container_name
echo $total_docs
# Get admin key
admin_key=$(az search admin-key show --service-name $search_service --resource-group $resource_group --query primaryKey --output tsv)
echo $admin_key
storage_key=$(az storage account keys list \
--account-name $storage_account \
--resource-group $resource_group \
--query "[0].value" --output tsv)
echo $storage_key
for ((i=0; i<$total_docs; i++)); do
file_name="doc_$i.json"
blob_name="indexes/$index_name/data/$file_name"
# Check if blob exists
exists=$(az storage blob exists \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--query exists --output tsv)
if [ "$exists" != "true" ]; then
echo "Skipping import for doc $i (blob missing)"
continue
fi
# Download blob
az storage blob download \
--account-name $storage_account \
--account-key $storage_key \
--container-name $container_name \
--name $blob_name \
--file $file_name \
-o none
if [ ! -f "$file_name" ]; then
echo "Skipping import for doc $i (download failed)"
continue
fi
# Extract document ID
doc_id=$(jq -r '.["@search.documentKey"] // .id // .Id // .ID' "$file_name")
if [ -z "$doc_id" ] || [ "$doc_id" == "null" ]; then
echo "Skipping import for doc $i (missing ID)"
rm "$file_name"
continue
fi
echo $doc_id
# Check if document already exists in index
exists_in_index=$(curl -s -X GET "https://$search_service.search.windows.net/indexes/$dest_index_name/docs/$doc_id?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
| jq -r 'if .error then "false" else "true" end')
if [ "$exists_in_index" == "true" ]; then
echo "Skipping import for doc $i (already exists in index)"
rm "$file_name"
continue
fi
# jq 'with_entries(select(.key != "id"))' "$file_name" > "filtered_$file_name"
# Wrap the document for the index operation and drop search metadata such as @search.score
jq '{value: [with_entries(select(.key | startswith("@search.") | not))]}' "$file_name" > "filtered_$file_name"
# Import to index
curl -s -X POST "https://$search_service.search.windows.net/indexes/$dest_index_name/docs/index?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
--data-binary "@filtered_$file_name"
# Clean up local file
rm "filtered_$file_name"
rm "$file_name"
done
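The one-document-per-request loop is easy to resume after an interruption, but it is slow for a large index. The same docs/index operation accepts batches of up to 1,000 actions per request, so once the doc_*.json blobs have been downloaded locally they can be combined into a single payload. A sketch under that assumption:
# Slurp all local documents into one value array, stripping search metadata
jq -s '{value: [.[] | with_entries(select(.key | startswith("@search.") | not))]}' doc_*.json > batch.json
curl -s -X POST "https://$search_service.search.windows.net/indexes/$dest_index_name/docs/index?api-version=$api_version" \
-H "api-key: $admin_key" \
-H "Content-Type: application/json" \
--data-binary "@batch.json"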
Errors encountered that are already addressed by the script:
1. The api-version query string parameter must be valid and consistent across calls:
{"error":{"code":"","message":"Invalid or missing api-version query string parameter."}}
2. The downloaded document carries search metadata, so only the fields inside the value element can be imported. This is evident from the messages seen during import:
a. {"error":{"code":"","message":"The request is invalid. Details: The parameter 'id' in the request payload is not a valid parameter for the operation 'index'."}}
b. {"error":{"code":"","message":"The request is invalid. Details: The parameter 'description' in the request payload is not a valid parameter for the operation 'index'."}}
Conclusion: Moving the data of a roughly 1 GB index from an Azure AI Search resource to a storage account saves about a hundred dollars a month on the bill, not to mention the benefits of aging, tiering, disaster recovery, and more.