Tuesday, December 17, 2024


Constant evaluation and monitoring of deployed large language models and generative AI applications are important because both the data and the environment can change: performance and accuracy may drift, and biases may emerge. Continuous monitoring enables early detection and prompt response, which keeps the models' outputs relevant, appropriate, and effective. Benchmarks help to evaluate models, but the variation in results can be large, largely because there is no single ground truth. For example, it is difficult to evaluate summarization models with traditional NLP metrics such as BLEU and ROUGE, because a valid summary may use completely different words or word order than the reference. Comprehensive evaluation standards remain elusive for LLMs, and reliance on human judgment is costly and time-consuming. The emerging trend of "LLM-as-a-judge" still leaves open questions: how well it reflects human preferences for correctness, readability, and comprehensiveness of answers; how reliable and reusable it is across different metrics; how to reconcile the different grading scales used by different frameworks; and whether the same evaluation metric applies across diverse use cases.
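To make this concrete, here is a small sketch using the open-source rouge-score package; the reference and candidate summaries are invented for illustration. The two sentences mean roughly the same thing, yet the overlap-based scores come out low because they share very few words.

# A minimal sketch of why word-overlap metrics undervalue valid paraphrases.
# Assumes the rouge-score package is installed (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The central bank raised interest rates to curb inflation."
candidate = "Borrowing costs were increased by monetary authorities to slow rising prices."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

# Both sentences convey the same meaning, but the n-gram overlap is tiny,
# so ROUGE reports low precision, recall, and F-measure.
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f={score.fmeasure:.2f}")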

Since chatbots are a common application of LLMs, an example of evaluating a chatbot follows. The underlying technique is Retrieval Augmented Generation (RAG), which is quickly becoming the industry standard for building chatbots. As with all LLM and AI systems, a RAG chatbot is only as effective as its data, which in this case is the vector store, also known as the knowledge base. The LLM could be a newer model such as GPT-3.5 or GPT-4 to reduce hallucinations, keep information up to date, and leverage domain-specific knowledge. Evaluating the quality of chatbot responses must therefore take into account both the knowledge base and the model. LLM-as-a-judge fits this bill for automated evaluation but, as noted earlier, it may not be on par with human grading, may require several auto-evaluation samples, and may respond differently to different chatbot prompts; slight variations in the prompt or problem can drastically affect its performance.
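For context, a RAG chatbot's answer-generation step might look like the sketch below. The retrieval function search_knowledge_base is a hypothetical placeholder for whatever vector store backs the knowledge base, and the generation call assumes the OpenAI Python client; the model name and prompt wording are illustrative only.

# A minimal RAG answer-generation sketch; the evaluation that follows
# grades the answers this kind of pipeline produces.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_knowledge_base(question: str, k: int = 3) -> list[str]:
    """Hypothetical retrieval step: return the top-k document chunks
    from the vector store that are most relevant to the question."""
    raise NotImplementedError("wire this to your vector store")

def answer(question: str) -> str:
    chunks = search_knowledge_base(question)
    context = "\n\n".join(chunks)
    response = client.chat.completions.create(
        model="gpt-4",  # or gpt-3.5-turbo; illustrative choice
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content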

RAG-based chatbots can be evaluated with LLM-as-a-judge in a way that agrees with human grading on over 80% of judgements if a few practices are followed: use a 1-5 grading scale; use GPT-3.5 as the judge, to save costs, when you have one grading example per score; and use GPT-4 as the judge when you have no examples to convey the grading rules.
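One hypothetical way to encode those grading rules in a judge prompt is sketched below; the scale descriptions and the single worked example are invented and would need to be adapted to the actual domain and repeated for each factor being graded.

# A hypothetical grading prompt for LLM-as-a-judge using a 1-5 scale.
# The scale wording and the example are illustrative, not prescriptive.
JUDGE_PROMPT = """You are grading a chatbot answer for correctness on a 1-5 scale:
1 - factually wrong or contradicts the provided context
2 - mostly wrong, with only a small correct element
3 - partially correct but missing or confusing key facts
4 - correct with minor omissions
5 - fully correct and grounded in the provided context

Example (score 5):
Question: What is the default port for HTTPS?
Answer: 443
Score: 5

First think step by step about how the answer compares to the context,
then output a final line of the form "Score: <1-5>".

Question: {question}
Context: {context}
Answer: {answer}
"""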

The initial evaluation dataset can be formed from, say, 100 chatbot prompts together with domain context in the form of document chunks that are relevant to each question, selected by a relevance measure such as F-score. Using this evaluation dataset, different language models generate answers, which are stored as question-context-answer triples in a dataset called "answer sheets". Given the answer sheets, various LLMs can then be used to generate grades along with the reasoning behind each grade. Each grade can be a composite score with weighted contributions: correctness carries most of the weight, while comprehensiveness and readability split the remaining weight equally.

A good choice of hyperparameters matters for LLM-as-a-judge just as it does for the chatbot. These include a low temperature, say 0.1, to ensure reproducibility; single-answer grading instead of pairwise comparison; chain-of-thought prompting so the LLM reasons about the grading process before giving its final score; and grading examples for each score value on each of the three factors. Factors such as helpfulness, depth, and creativity remain difficult to measure quantitatively. Reporting the individual correctness, comprehensiveness, and readability scores alongside the composite provides a justification for each grade, which becomes valuable when reviewing results. Whether the judge is GPT-4, GPT-3.5, or a human, the composite scores make it possible to tell results apart quantitatively. The overall workflow for building LLM-as-a-judge is also similar to that of the chatbot itself: data preparation, indexing relevant data, information retrieval, and response generation.
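Putting those pieces together, the grading step for a single answer-sheet row might look like the following sketch. The 60/20/20 weights, the score-parsing regex, and the prompt wording are assumptions made for illustration; the judge call assumes the OpenAI Python client with a low temperature of 0.1 for reproducibility.

# A sketch of composite grading for one answer-sheet row. The weights and the
# score-parsing logic are illustrative assumptions, not a fixed prescription.
import re
from openai import OpenAI

client = OpenAI()

WEIGHTS = {"correctness": 0.6, "comprehensiveness": 0.2, "readability": 0.2}

def judge(factor: str, question: str, context: str, answer: str) -> int:
    """Ask the judge model for a 1-5 score on one factor, with chain of
    thought before the final score and a low temperature for reproducibility."""
    prompt = (
        f"Grade the answer for {factor} on a 1-5 scale. "
        "Reason step by step, then end with a line 'Score: <1-5>'.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4",       # or gpt-3.5-turbo with per-score grading examples
        temperature=0.1,      # low temperature for reproducible grading
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)
    return int(match.group(1)) if match else 1  # conservative fallback

def composite_score(question: str, context: str, answer: str) -> float:
    """Weighted composite: correctness dominates, the rest split equally."""
    return sum(
        weight * judge(factor, question, context, answer)
        for factor, weight in WEIGHTS.items()
    )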
