Sunday, June 16, 2024

This is a continuation of a study involving a software application that responds to a chat-like query over the data contained in the ten-year collection of my blog posts from https://ravinote.blogspot.com. Each article on the blog is part of a daily routine and contains mostly unstructured text explaining software engineering practices, along with code samples from personal, enterprise, and cloud computing. The earlier part of the study covered leveraging the Azure OpenAI search service to perform a semantic search based on the chat-like query and produce a generated response. This part of the study follows up on taking the data completely private, so that the model built to respond to the query can be hosted on any lightweight compute, including handheld devices using mobile browsers. The lessons learned in this part now follow:

First, a brief comparison of the two search methodologies:

1. Azure AI Toolkit for VS Code:

o Approach: This simplifies generative AI app development by bringing together AI tools and models from the Azure AI catalog. We specifically use the Phi-2 small language model. The toolkit also helps to fine-tune and deploy models to the cloud.

o Matching: It matches based on similarity between query vectors and content vectors. This enables matching across semantic or conceptual likeness (e.g., “dog” and “canine”). Phi-2 is a 2.7 billion-parameter language model, not on the order of the trillions of parameters in a large language model, but sufficiently compact to demonstrate outstanding reasoning and language understanding capabilities. Phi-2 is a Transformer-based model with a next-word prediction objective that was originally trained on a large mixture of synthetic datasets for NLP and coding. A minimal sketch of running Phi-2 locally follows after this list.

o Scenarios Supported: 

Find a supported model from the Model Catalog.

Test model inference in the Model Playground.

Fine-tune models locally or remotely in Model Fine-tuning.

Deploy fine-tuned models to the cloud via the command palette for the AI Toolkit.

o Integration: Works seamlessly with other Azure services.
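
To make the Phi-2 description above concrete, here is a minimal sketch of testing model inference locally with the Hugging Face transformers library. It assumes the public microsoft/phi-2 checkpoint and a machine with a GPU or enough memory; the prompt and generation settings are illustrative, not part of the toolkit workflow.

# Minimal sketch: local inference with the Phi-2 small language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision keeps the 2.7B-parameter model within a single small GPU
    device_map="auto",
)

prompt = "Summarize the trade-offs between small and large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))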

2. Azure OpenAI Service-Based Search:

o Approach: Uses Azure OpenAI embedding models, together with the GPT models, to convert queries into vector embeddings. GPT models are large language models, while Phi-2 is a small language model. The dataset can include the web when using the Chat Completions API from the Azure OpenAI service.

o Matching: Performs a vector similarity search using the query vector against the vector database, usually returning the top-k matching content above a defined similarity threshold; a minimal sketch of this top-k search follows below.

o Scenarios supported: 

Similarity search: Encode text using embedding models (e.g., OpenAI embeddings) and retrieve documents with encoded queries.

Hybrid search: Execute vector and keyword queries in the same request, merging results.

Filtered vector search: Combine vector queries with filter expressions.

o Cost:

Increases linearly even for infrequent use, at a rate of a few hundred dollars per month. The earlier application leveraging the Completions API had to be taken down for this reason.
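
As a companion to the matching description above, here is a minimal sketch of a top-k vector similarity search using the Azure OpenAI embeddings endpoint. The endpoint, key, deployment name, documents, and threshold are placeholders and assumptions, not values from this study.

# Minimal sketch: top-k vector similarity search with Azure OpenAI embeddings.
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-02-01",
)

def embed(text: str) -> np.ndarray:
    # "model" takes the embedding deployment name in Azure OpenAI; this one is a placeholder.
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

# In practice the content vectors are precomputed for every blog article and stored in a vector database.
documents = ["post one text ...", "post two text ...", "post three text ..."]
content_vectors = np.vstack([embed(d) for d in documents])

def top_k(query: str, k: int = 2, threshold: float = 0.75):
    q = embed(query)
    # Cosine similarity between the query vector and each content vector.
    scores = content_vectors @ q / (np.linalg.norm(content_vectors, axis=1) * np.linalg.norm(q))
    ranked = np.argsort(-scores)[:k]
    return [(documents[i], float(scores[i])) for i in ranked if scores[i] >= threshold]

print(top_k("How do I tune a small language model?"))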

Both approaches leverage vector embeddings for search, but the AI Toolkit with the Phi-2 model is better suited to customization, while the Azure OpenAI Completions API is useful for streamlined applications and quick chatbots.

And now the learnings follow:

- Fine-Tuning: A pretrained transformer can be put to different uses during fine-tuning, such as question-answering, language generation, sentiment analysis, summarization, and others; a sketch of loading different task heads follows this item. Fine-tuning adapts the model to different domains, and Phi-2 behaves remarkably well in this regard. Fine-tuning LLMs is so cost-prohibitive that it is best avoided. On the other hand, small language models are susceptible to overfitting, where the model learns specifics of the training data that do not generalize to the query.
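
To illustrate putting the same pretrained transformer to different uses, the sketch below loads task-specific heads from Hugging Face transformers; the checkpoint names are common public examples, not the models used in this study.

# Sketch: the same pretrained backbone fine-tuned behind different task heads.
from transformers import (
    AutoModelForQuestionAnswering,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
)

# Sentiment analysis: a classification head on top of the encoder.
sentiment = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Question answering: a span-prediction head producing start/end logits.
qa = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

# Summarization: an encoder-decoder with a language-generation head.
summarizer = AutoModelForSeq2SeqLM.from_pretrained("t5-small")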

- Parameter-Efficient Fine-tuning: This is called out here only for rigor. Costs for tuning LLMs can be reduced by fine-tuning only a subset of the model’s parameters, but it might result in “catastrophic forgetting”. These techniques include LoRA, prefix tuning, and prompt tuning. The Azure Toolkit leverages QLoRA. LoRA stands for Low-Rank Adaptation of Large Language Models; it introduces trainable rank-decomposition matrices into each layer of the transformer architecture and reduces the trainable parameters for downstream tasks while keeping the pre-trained weights frozen. QLoRA combines quantization with LoRA, using a quantization data type of either nf4 (4-bit NormalFloat) or fp4 and an adjustable batch size for training and evaluation per GPU. The data type is for compression to 4-bit precision, as opposed to the native 32-bit floating-point precision. A minimal configuration sketch follows.
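
This is a minimal QLoRA configuration sketch using the Hugging Face transformers, bitsandbytes, and peft libraries; the rank, alpha, dropout, and target module names are illustrative assumptions rather than the settings used by the Azure AI Toolkit.

# Sketch: QLoRA = 4-bit (nf4) quantization of the frozen base weights plus LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat, as opposed to fp4
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                # rank of the decomposition matrices (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention projections; names vary by model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()       # only the low-rank adapters are trainable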

- Problem with outliers: As with all neural networks that create embeddings, outliers are significant; while model weights are normally distributed, the inclusion of outliers directly affects the quality of the model. An iterative study involving different ranges of inclusion was too expensive to include in this study. The sketch below shows one way to measure them.
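
One hedged way to gauge how many outlier weights a model carries is to count, per layer, the weights that fall beyond a few standard deviations of that layer's distribution; the three-sigma threshold below is an assumption for illustration.

# Sketch: estimate the fraction of outlier weights (beyond 3 standard deviations) per layer.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float32)

for name, param in model.named_parameters():
    if param.dim() < 2:
        continue  # skip biases and norms; weight matrices matter most for quantization
    w = param.detach()
    mean, std = w.mean(), w.std()
    outliers = ((w - mean).abs() > 3 * std).float().mean().item()
    print(f"{name}: {outliers:.4%} of weights beyond 3 sigma")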

- Dequantization: This is the process of taking the quantized weights, which are frozen and not trained, and restoring them to 32-bit precision. It is helpful because the quantized values and the quantization constant can be used to compute and backpropagate the calculated gradients. A simplified worked sketch follows.
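
The sketch below walks through absmax quantization and dequantization of a small block of weights; it uses a signed 4-bit integer grid instead of the nf4 lookup table for clarity, so it illustrates the idea rather than the exact bitsandbytes implementation.

# Sketch: absmax quantization to a 4-bit grid and dequantization back to 32-bit precision.
import torch

weights = torch.randn(8)                      # a block of frozen fp32 weights

# Quantize: scale to [-7, 7] (a signed 4-bit range) using the block's absolute maximum.
quant_constant = weights.abs().max() / 7.0    # stored alongside the quantized block
quantized = torch.round(weights / quant_constant).to(torch.int8)

# Dequantize: multiply by the quantization constant to recover approximately fp32 values,
# which is what the forward/backward pass uses when computing gradients for the adapters.
dequantized = quantized.float() * quant_constant

print(weights)
print(dequantized)                            # close to the originals, within 4-bit error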

- Paged optimizers are necessary to manage memory usage during training of these language models; a brief example follows. Azure NC4as_T4_v3 family VMs handle this well, but the choice of SKU is an initial decision, not something that can be changed in flight.
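
A brief, hedged example of enabling a paged optimizer through the bitsandbytes library follows; the stand-in model and learning rate are assumptions, and in practice the optimizer would wrap the PEFT-wrapped Phi-2 model from the QLoRA sketch above.

# Sketch: a paged AdamW optimizer pages optimizer state between GPU and CPU memory
# to avoid out-of-memory spikes during training.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(512, 512)   # stand-in; replace with the fine-tuned language model
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)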

- BlobFuseV2, which mounts all the data stored in private storage accounts as a local filesystem, is incredibly slow for reads over this entire dataset; one workaround is sketched below. The Toolkit is more helpful when run on notebooks and laptops with VS Code, a GPU, and a customized local Windows Subsystem for Linux.
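
One workaround for the slow fuse reads, sketched here with the azure-storage-blob Python SDK, is to download the dataset once to local disk before training rather than reading each file through the mount; the container name and local path are placeholders.

# Sketch: copy the blog dataset from a private storage account to local disk once,
# instead of reading every file through the BlobFuse mount during training.
import os
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    container_name="blogposts",             # placeholder container name
)

os.makedirs("data", exist_ok=True)
for blob in container.list_blobs():
    local_path = os.path.join("data", blob.name.replace("/", "_"))
    with open(local_path, "wb") as f:
        f.write(container.download_blob(blob.name).readall())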

