Constant evaluation and monitoring of deployed large
language models and generative AI applications are important because both the
data and the environment can change over time. Performance and accuracy can
shift, and biases can emerge. Continuous monitoring enables early detection and
a prompt response, which in turn keeps the models' outputs relevant,
appropriate, and effective. Benchmarks help to evaluate models, but the
variation in results can be large because there is often no ground truth to compare against.
For example, it is difficult to evaluate summarization models with traditional NLP metrics such as
BLEU or ROUGE because a generated summary may use completely different
words or word order than the reference while still being correct. Comprehensive evaluation standards
for LLMs remain elusive, and relying on human judgment is costly and time-consuming. The emerging
trend of "LLMs as a judge" still leaves open questions: how well it reflects
human preferences for correctness, readability, and comprehensiveness of
answers; how reliable and reusable its judgments are across different metrics;
how to reconcile the different grading scales used by different frameworks; and whether the same
evaluation metric applies across diverse use cases.
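
To make the limitation of n-gram metrics concrete, here is a minimal sketch using the rouge_score Python package (a tooling choice assumed for illustration, not something prescribed above): a faithful paraphrase of a reference summary still scores poorly because it shares few exact tokens with the reference.

# Minimal sketch of how n-gram overlap penalizes valid paraphrases.
# Assumes the rouge_score package is installed (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The central bank raised interest rates to curb inflation."
# A faithful paraphrase that a human grader would accept as correct.
paraphrase = "Rates were increased by the monetary authority to slow rising prices."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, paraphrase)

for name, score in scores.items():
    # The F-measures stay low because few exact tokens are shared,
    # even though the meaning of the summary is preserved.
    print(f"{name}: f-measure = {score.fmeasure:.2f}")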
Since chatbots are among the most common applications of LLMs, an example of
evaluating a chatbot follows. The underlying principle in such a chatbot is
Retrieval Augmented Generation (RAG), which is quickly becoming the industry standard
for developing chatbots. As with all LLM and AI models, a RAG chatbot is only as effective
as its data, which in this case is the vector store, also known as the knowledge base. The LLM
could be a newer model such as GPT-3.5 or GPT-4 to reduce hallucinations, maintain up-to-date
information, and leverage domain-specific knowledge. Evaluating the quality of
chatbot responses must therefore take into account both the knowledge base and the model
involved. LLM-as-a-judge fits this bill for automated evaluation, but as noted
earlier, it may not be on par with human grading, may require several auto-evaluation
samples, and may respond differently to different chatbot prompts: slight
variations in the prompt or problem can drastically affect its
performance.
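
As a rough sketch of the retrieval step just described, the knowledge base is searched for the chunks most relevant to the question, and those chunks are packed into the prompt sent to the LLM. The TF-IDF retrieval and toy documents below are illustrative stand-ins for a real vector store and embedding model.

# Illustrative RAG retrieval step; TF-IDF stands in for a real vector store.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base: chunks of domain documents.
chunks = [
    "The premium plan includes 24/7 phone support and a 99.9% uptime SLA.",
    "Refunds are processed within 5 business days of a cancellation request.",
    "The API rate limit is 1000 requests per minute per organization.",
]
question = "How long do refunds take?"

# Embed the chunks and the question; production systems use learned embeddings.
vectorizer = TfidfVectorizer().fit(chunks + [question])
chunk_vectors = vectorizer.transform(chunks)
question_vector = vectorizer.transform([question])

# Retrieve the top-k most similar chunks to serve as context.
similarities = cosine_similarity(question_vector, chunk_vectors)[0]
top_k = similarities.argsort()[::-1][:2]
context = "\n".join(chunks[i] for i in top_k)

# The retrieved context is prepended to the question before calling the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)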
RAG-based chatbots can be evaluated with LLM-as-a-judge, and its grades can
agree with human grading on over 80% of judgements, if the following practices are
maintained: use a 1-5 grading scale; use GPT-3.5 as the judge to save costs when you
have at least one grading example per score; and use GPT-4 as the judge when you have
no examples to convey the grading rules.
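
A sketch of how those guidelines could be encoded is below; the rubric wording, the grading examples, and the model names are illustrative assumptions rather than a prescribed implementation.

# Sketch of single-answer grading on a 1-5 scale with one example per score.
def build_judge_prompt(question, answer, examples):
    """examples maps each score (1-5) to a short graded example answer."""
    rubric = (
        "Grade the answer from 1 (wrong or irrelevant) to 5 (correct, complete "
        "and readable). Explain your reasoning before giving the final score."
    )
    shots = "\n".join(f"Example of score {s}: {text}" for s, text in sorted(examples.items()))
    return f"{rubric}\n\n{shots}\n\nQuestion: {question}\nAnswer to grade: {answer}\nGrade:"

def pick_judge_model(examples):
    # With one grading example per score, GPT-3.5 is sufficient and cheaper;
    # with no examples, GPT-4 is better at inferring the grading rules.
    return "gpt-3.5-turbo" if len(examples) >= 5 else "gpt-4"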
The initial evaluation dataset can be formed from, say, 100
chatbot prompts together with context from the domain: chunks of documents
that are relevant to each question, selected by a relevance measure such as an F-score. Using the evaluation
dataset, different language models can be used to generate answers, which are stored
as question-context-answer triples in a dataset called "answer sheets". Then,
given the answer sheets, various LLMs can be used to generate grades along with the
reasoning behind them. Each grade can be a composite score in which correctness
carries most of the weight and comprehensiveness and readability share the remaining weight equally.
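
A composite grade along those lines might be computed as in the sketch below, where the 0.6/0.2/0.2 split is an illustrative assumption about the exact weights.

def composite_grade(correctness, comprehensiveness, readability):
    """Weighted composite of 1-5 sub-scores; correctness dominates and the
    other two factors share the remaining weight equally (assumed 0.6/0.2/0.2)."""
    return 0.6 * correctness + 0.2 * comprehensiveness + 0.2 * readability

# Example: a correct but terse answer.
print(composite_grade(correctness=5, comprehensiveness=3, readability=4))  # 4.4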
A good choice of hyperparameters applies to LLM-as-a-judge as well:
a low temperature of, say, 0.1 to ensure reproducibility;
single-answer grading instead of pairwise comparison; chain-of-thought prompting so that
the LLM reasons about the grading process before giving the final score; and
grading examples for each score value on each of the three factors. Factors
that remain difficult to measure quantitatively include helpfulness, depth,
and creativity. Emitting the individual correctness, comprehensiveness, and
readability scores provides valuable justification for the overall grade. Whether we use
GPT-4, GPT-3.5, or human judgement, the composite scores allow
results to be compared quantitatively. The overall workflow for building an
LLM-as-a-judge also mirrors that of the chatbots themselves: data preparation,
indexing of relevant data, information retrieval, and response generation.
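
Putting the grading settings together, a single answer-sheet row might be graded as sketched below; the use of the OpenAI Python client and the exact prompt wording are assumptions for illustration, not the only way to wire this up.

# Sketch of grading one answer-sheet row with an LLM judge (OpenAI client assumed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_answer(question, answer, model="gpt-4"):
    prompt = (
        "You are grading a chatbot answer on a 1-5 scale for correctness, "
        "comprehensiveness and readability. Think step by step about each "
        "factor, then give the three sub-scores on the last line.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,  # low temperature for reproducible grades
        # Single-answer grading: one answer per call, no pairwise comparison.
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply holds the chain-of-thought reasoning plus the sub-scores;
    # a pipeline would parse them and combine them into the composite grade.
    return response.choices[0].message.content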