Thursday, January 2, 2025

 Serverless SQL in Azure offers a flexible and cost-effective way to manage SQL databases and data processing without the need to manage the underlying infrastructure. Here are some key aspects:

Azure SQL Database Serverless

Autoscaling: Automatically scales compute based on workload demand. It bills for the amount of compute used per second2.

Auto-Pause and Resume: Pauses databases during inactive periods when only storage is billed and resumes when activity returns.

Configurable Parameters: You can configure the minimum and maximum vCores, memory, and IO limits.

Cost-Effective: Ideal for single databases with intermittent, unpredictable usage patterns.

Azure Synapse Analytics Serverless SQL Pool

Query Service: Provides a query service over data in your data lake, allowing you to query data in place without moving it.

T-SQL Support: Uses familiar T-SQL syntax for querying data.

High Reliability: Built for large-scale data processing with built-in query execution fault-tolerance.

Pay-Per-Use: You are only charged for the data processed by your queries.

Benefits

Scalability: Easily scales to accommodate varying workloads.

Cost Efficiency: Only pay for what you use, making it cost-effective for unpredictable workloads.

Ease of Use: No infrastructure setup or maintenance required.

The product Neon Database was launched in 2021 for going serverless on a cloud platform as a relational database. Recently it has become cloud native to Azure just like it has been on AWS. This deeper integration of Neon in Azure facilitates rapid app development because postgres sql is the developers’ choice. Serverless reduces operational overhead and frees the developers to focus on the data model, access and CI/CD integration to suit their needs. In fact, Microsoft’s investments in GitHub, VSCode, TypeScript, OpenAI and Copilot align well with the developers’ agenda.

Even the ask for a vector store from AI can be facilitated within a relational database as both Azure SQL and Neon have demonstrated. The compute seamlessly scale up for expensive index builds and back down for normal queries or RAG queries. Since pause during inacitivity and resume for load is automated in serverless, the cost savings are significant. In addition, both databases focus on data privacy.

The following is a way to test the ai vector cosine similarity in a relational database.

1. Step 1: upload a dataset to a storage account from where it can be accessed easily. This must be a csv file with headers like:

id,url,title,text,title_vector,content_vector,vector_id

Sample uploaded file looks like this:



2. Step 2: Use Azure Portal Query Editor or any client to run the following SQL:

a. 00-setup-blob-access.sql

/*

 Cleanup if needed

*/

if not exists(select * from sys.symmetric_keys where [name] = '##MS_DatabaseMasterKey##')

begin

 create master key encryption by password = 'Pa$$w0rd!'

end

go

if exists(select * from sys.[external_data_sources] where name = 'openai_playground')

begin

 drop external data source [openai_playground];

end

go

if exists(select * from sys.[database_scoped_credentials] where name = 'openai_playground')

begin

 drop database scoped credential [openai_playground];

end

go

/*

 Create database scoped credential and external data source.

 File is assumed to be in a path like:

 https://saravinoteblogs.blob.core.windows.net/playground/wikipedia/vector_database_wikipedia_articles_embedded.csv

 Please note that it is recommened to avoid using SAS tokens: the best practice is to use Managed Identity as described here:

 https://learn.microsoft.com/en-us/sql/relational-databases/import-export/import-bulk-data-by-using-bulk-insert-or-openrowset-bulk-sql-server?view=sql-server-ver16#bulk-importing-from-azure-blob-storage

*/

create database scoped credential [openai_playground]

with identity = 'SHARED ACCESS SIGNATURE',

secret = 'sp=rwdme&st=2024-11-22T03:37:08Z&se=2024-11-29T11:37:08Z&spr=https&sv=2022-11-02&sr=b&sig=EWag2qRCAY7kRsF7LtBRRRExdWgR5h4XWrU%2'; -- make sure not to include the ? at the beginning

go

create external data source [openai_playground]

with

(

 type = blob_storage,

  location = 'https://saravinoteblogs.blob.core.windows.net/playground',

  credential = [openai_playground]

);

Go

b. 01-import-wikipedia.sql:

/*

Create table

*/

drop table if exists [dbo].[wikipedia_articles_embeddings];

create table [dbo].[wikipedia_articles_embeddings]

(

[id] [int] not null,

[url] [varchar](1000) not null,

[title] [varchar](1000) not null,

[text] [varchar](max) not null,

[title_vector] [varchar](max) not null,

[content_vector] [varchar](max) not null,

[vector_id] [int] not null

)

go

/*

Import data

*/

bulk insert dbo.[wikipedia_articles_embeddings]

from 'wikipedia/vector_database_wikipedia_articles_embedded.csv'

with (

data_source = 'openai_playground',

    format = 'csv',

    firstrow = 2,

    codepage = '65001',

fieldterminator = ',',

rowterminator = '0x0a',

    fieldquote = '"',

    batchsize = 1000,

    tablock

)

go

/*

Add primary key

*/

alter table [dbo].[wikipedia_articles_embeddings]

add constraint pk__wikipedia_articles_embeddings primary key clustered (id)

go

/*

Add index on title

*/

create index [ix_title] on [dbo].[wikipedia_articles_embeddings](title)

go

/*

Verify data

*/

select top (10) * from [dbo].[wikipedia_articles_embeddings]

go

select * from [dbo].[wikipedia_articles_embeddings] where title = 'Alan Turing'

go

c. 02-use-native-vectors.sql:

/*

    Add columns to store the native vectors

*/

alter table wikipedia_articles_embeddings

add title_vector_ada2 vector(1536);

alter table wikipedia_articles_embeddings

add content_vector_ada2 vector(1536);

go

/*

    Update the native vectors

*/

update

    wikipedia_articles_embeddings

set

    title_vector_ada2 = cast(title_vector as vector(1536)),

    content_vector_ada2 = cast(content_vector as vector(1536))

go

/*

    Remove old columns

*/

alter table wikipedia_articles_embeddings

drop column title_vector;

go

alter table wikipedia_articles_embeddings

drop column content_vector;

go

/*

Verify data

*/

select top (10) * from [dbo].[wikipedia_articles_embeddings]

go

select * from [dbo].[wikipedia_articles_embeddings] where title = 'Alan Turing'

go

d. 03-store-openai-credentials.sql

/*

    Create database credentials to store API key

*/

if exists(select * from sys.[database_scoped_credentials] where name = 'https://postssearch.openai.azure.com')

begin

drop database scoped credential [https://postssearch.openai.azure.com];

end

create database scoped credential [https://postssearch.openai.azure.com]

with identity = 'HTTPEndpointHeaders', secret = '{"api-key": "7cGuGvTm7FQEJtzFIrZBZpOCJxXbAsGOMDd8uG0RIBivUXIfOUJRJQQJ99AKACYeBjFXJ3w3AAABACOGAL8U"}';

go

e. 04-create-get-embeddings-procedure.sql:

/*

    Get the embeddings for the input text by calling the OpenAI API

*/

create or alter procedure dbo.get_embedding

@deployedModelName nvarchar(1000),

@inputText nvarchar(max),

@embedding vector(1536) output

as

declare @retval int, @response nvarchar(max);

declare @payload nvarchar(max) = json_object('input': @inputText);

declare @url nvarchar(1000) = 'https://postssearch.openai.azure.com/openai/deployments/' + @deployedModelName + '/embeddings?api-version=2023-03-15-preview'

exec @retval = sp_invoke_external_rest_endpoint

    @url = @url,

    @method = 'POST',

    @credential = [https://postssearch.openai.azure.com],

    @payload = @payload,

    @response = @response output;

declare @re nvarchar(max) = null;

if (@retval = 0) begin

    set @re = json_query(@response, '$.result.data[0].embedding')

end else begin

    select @response as 'Error message from OpenAI API';

end

set @embedding = cast(@re as vector(1536));

return @retval

go

f. 05-find-similar-articles.sql:

/*

    Get the embeddings for the input text by calling the OpenAI API

    and then search the most similar articles (by title)

    Note: postssearchembedding needs to be replaced with the deployment name of your embedding model in Azure OpenAI

*/

declare @inputText nvarchar(max) = 'the foundation series by isaac asimov';

declare @retval int, @embedding vector(1536);

exec @retval = dbo.get_embedding 'postssearchembedding', @inputText, @embedding output;

select top(10)

    a.id,

    a.title,

    a.url,

    vector_distance('cosine', @embedding, title_vector_ada2) cosine_distance

from

    dbo.wikipedia_articles_embeddings a

order by

    cosine_distance;

go

3. Finally, manually review the results.




No comments:

Post a Comment