RAG with vector search involves retrieving information from a vector database, augmenting the user’s prompt with that information, and using an LLM to generate a response based on the augmented prompt. Each of these steps can be implemented in a variety of ways, but we will cover the mainstream approaches.
1. Data Preparation: This is all about ingesting data into a vector database, usually with the help of connectors. It isn’t a one-time task: the vector database must be updated regularly so it keeps supplying high-quality information; otherwise the system’s answers can become outdated. A great property of RAG is that the LLM’s weights do not need to be adjusted as data is ingested. Common stages in this step include parsing the input documents; splitting them into chunks, a choice that can noticeably affect output quality; using an embedding model to convert each chunk into a high-dimensional numerical vector; storing and indexing the embeddings, which produces a vector index that speeds up search; and recording metadata that can later be used for filtering. The value of embeddings for RAG is that similarity scores between vectors approximate similarity in meaning between the original texts.
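Below is a minimal sketch of this ingestion stage. It assumes the sentence-transformers and FAISS libraries; the model name, chunk size, and overlap are illustrative choices, not recommendations.

```python
# Ingestion sketch: chunk documents, embed the chunks, and build a vector index.
import faiss  # pip install faiss-cpu
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking; chunk boundaries affect output quality."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

documents = ["...parsed document text...", "...another parsed document..."]
chunks, metadata = [], []
for doc_id, doc in enumerate(documents):
    for chunk in chunk_text(doc):
        chunks.append(chunk)
        metadata.append({"doc_id": doc_id})  # recorded now, usable for filtering later

# Convert each chunk into a high-dimensional numerical vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)

# Index the embeddings; inner product over normalized vectors is cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
```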
2. Retrieval is all about getting the relevant context. After preprocessing the original documents, we have a vector database storing chunks, embeddings, and metadata. In this step, the user provides a prompt, the application uses it to query the vector database, and the relevant results are used to augment the original prompt in the next step. Querying the vector database relies on similarity scores between the vector representing the query and the vectors in the database. There are many ways to improve search results, including hybrid search, reranking, summarized text comparison, contextual chunk retrieval, prompt refinement, and domain-specific tuning.
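A retrieval sketch under the same assumptions, reusing the `model`, `index`, and `chunks` names from the ingestion example above; the sample query is hypothetical.

```python
def retrieve(query: str, k: int = 4) -> list[str]:
    """Embed the query and return the k most similar chunks."""
    query_vec = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(query_vec, k)  # similarity scores and chunk positions
    return [chunks[i] for i in ids[0]]

context_chunks = retrieve("How do I rotate an API key?")
```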
3. Augmenting the prompt with the retrieved context equips the model with both the user’s question and the information needed to answer it. The structure of the new prompt, which combines the retrieved texts with the user’s prompt, can affect the quality of the result.
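One possible prompt structure is sketched below; the template wording and the instruction to admit insufficient context are illustrative assumptions, and other layouts may work better for a given model.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt("How do I rotate an API key?", context_chunks)
```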
4. Generation of the response is done with the help of an LLM and follows the retrieval and augmentation steps. Some LLMs are quite good at following instructions, but many require post-processing. Another important consideration is whether the RAG system should have memory of previous prompts and responses. One way to enhance generation is to add multi-turn conversation ability, which allows the user to ask follow-up questions.
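A sketch of multi-turn generation follows, using an OpenAI-compatible chat API and the `retrieve` and `build_prompt` helpers from the earlier examples; the model name and questions are assumptions, and any chat-completion endpoint would work similarly.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()
history: list[dict] = []  # memory of previous prompts and responses

def answer(question: str) -> str:
    """Retrieve context, augment the question, and generate with conversation memory."""
    prompt = build_prompt(question, retrieve(question))
    history.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=history,  # prior turns let follow-up questions resolve correctly
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

answer("How do I rotate an API key?")
answer("And how often should I do that?")  # follow-up relies on the stored history
```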
Last but not least, RAG performance must be measured. One common approach is to use another LLM to judge response quality with a simple score, such as a rating from 1 to 5. Prompt engineering plays a significant role in such cases, guiding the judge model toward the desired scoring behavior. Fine-tuning can further enhance the model’s expressiveness and accuracy.
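A minimal LLM-as-judge sketch, reusing the `client` from the generation example; the 1–5 rubric wording is an illustrative assumption and would typically be refined through prompt engineering.

```python
def judge(question: str, answer_text: str, context: str) -> str:
    """Ask a second model to score a RAG response on a 1-5 scale."""
    rubric = (
        "Rate the answer from 1 (poor) to 5 (excellent) for faithfulness to "
        "the context and relevance to the question. Reply with the number only.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": rubric}],
    )
    return response.choices[0].message.content.strip()
```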