Measuring RAG performance:
A RAG application has many moving parts that affect retrieval and generation quality, so there must be ways to measure its performance, yet this remains one of the most challenging parts of setting up a RAG application. It is often helpful to evaluate each step of the pipeline independently, since both the model and the knowledge base must be effective.
Evaluation of the retrieval step, for instance, involves identifying the relevant records that should be retrieved for each prompt. Precision- and recall-based metrics such as the F-score are helpful for benchmarking and tracking improvements. The generation step can also be evaluated, to check that the answers to those prompts are free of hallucinations and incorrect responses. Leveraging another LLM to pose the prompts and check the responses is also useful; this technique is known as LLM-as-a-judge. The scores produced this way should be simple and fall in a small range, say 1-5, with a higher rating indicating a response that is faithful to the context.
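As a minimal sketch, assuming the retrieved results and the hand-labeled relevant results for each prompt are tracked as sets of document IDs, the retrieval metrics can be computed as below; the document IDs and the judge prompt are illustrative placeholders, not part of any specific framework.

def retrieval_scores(retrieved_ids, relevant_ids):
    """Precision, recall, and F1 for a single prompt's retrieval."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: the retriever returned doc2 and doc5, but doc2 and doc7 were the relevant records.
print(retrieval_scores(["doc2", "doc5"], ["doc2", "doc7"]))  # (0.5, 0.5, 0.5)

# An LLM-as-a-judge rubric can be expressed as a grading prompt; the 1-5 scale
# mirrors the simple range suggested above.
JUDGE_PROMPT = """You are grading an answer against the retrieved context.
Context: {context}
Question: {question}
Answer: {answer}
Rate faithfulness to the context on a scale of 1-5, where 5 means fully
supported by the context and 1 means contradicted or unsupported.
Respond with a single integer."""

Averaging these per-prompt scores over a labeled evaluation set gives a single number to compare retriever or prompt changes against.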
RAG isn’t the only approach to equipping models with new information, but any approach involves trade-offs between cost, complexity, and expressive power. Cost comes from the inventory and bill of materials needed to build and serve the solution. Complexity is the technical difficulty, usually reflected in the time, effort, and expertise required. Expressiveness refers to the model’s ability to generate diverse, inclusive, meaningful, and useful responses to prompts.
Besides RAG, prompt engineering offers an alternative way to guide a model’s outputs toward a desired result. Large, highly capable models are often required to understand and follow complex prompts, and they carry serving or per-token costs. Prompt engineering is especially useful when public data is sufficient and there is no need for proprietary or recent knowledge.
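For illustration, a prompt-engineering-only approach might look like the sketch below, which steers a general-purpose model with instructions and a worked example instead of retrieved context; the prompt text is invented, and call_model is a hypothetical stand-in for whatever chat-completion client is in use.

# Prompt-engineering sketch: instructions plus a few-shot example, no retrieval.
FEW_SHOT_PROMPT = """You are a support assistant. Answer in two sentences
and say "I don't know" when unsure.

Example
Q: Which plan includes priority support?
A: Priority support is included in the Business and Enterprise plans. Upgrades take effect immediately.

Q: {question}
A:"""

def call_model(prompt: str) -> str:
    """Placeholder for a chat-completion call; replace with your provider's SDK."""
    raise NotImplementedError("wire this to an LLM endpoint")

def answer(question: str) -> str:
    return call_model(FEW_SHOT_PROMPT.format(question=question))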
Improving overall performance may also require the model itself to be fine-tuned. Fine-tuning has a specific meaning in the context of large language models: taking a pretrained model and adapting it to a new task or domain by adjusting some or all of its weights on new data. This is often a necessary step when building a chatbot over, say, medical texts.
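A minimal fine-tuning sketch, assuming the Hugging Face transformers and datasets libraries, with a small base model and a training file chosen purely for illustration, could look like this:

# Fine-tuning sketch: adapt a pretrained causal LM to a domain corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilgpt2"                      # small base model, illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 family has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical plain-text domain corpus, e.g. de-identified medical notes.
dataset = load_dataset("text", data_files={"train": "medical_notes.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="ft-medical",
                         num_train_epochs=1,
                         per_device_train_batch_size=4)
Trainer(model=model, args=args, train_dataset=train_data,
        data_collator=collator).train()

The adjusted weights are what change the model's behavior here, in contrast to RAG, which leaves the weights untouched.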
While RAG infuses data into the overall process, it does not change the model itself. Fine-tuning does change a model’s behavior, so the model need not behave the same as it did originally. It is also not a straightforward process and may not be as reliable as RAG at generating relevant responses.
#codingexercise: https://1drv.ms/w/c/d609fb70e39b65c8/EYKwhcLpZ3tAs0h6tU_RYxwBxeAeg1Vg2DH7deOt-niRhw?e=qbXLag