Generative AI has created a new class of applications that require a different data architecture than traditional systems, one that spans both structured and unstructured data. Applications such as chatbots perform well only when they can draw on information from diverse data sources. A chatbot pairs an LLM with a knowledge base, typically a vector database; the underlying principle is Retrieval-Augmented Generation (RAG). The LLM could be GPT-3.5 or a newer GPT-4, and grounding its responses in retrieved context reduces hallucinations, keeps information up to date, and leverages domain-specific knowledge.
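The retrieval step of RAG can be sketched with toy components. This is only an illustration, not a real system: `embed` stands in for a proper embedding model, the list-based index stands in for a vector database, and the resulting prompt would normally be sent to GPT-3.5 or GPT-4.

```python
import math

def tokenize(text):
    # Strip common punctuation and lowercase (toy tokenizer).
    return [w.strip(".,?!").lower() for w in text.split()]

def embed(text):
    # Toy bag-of-words embedding over a fixed vocabulary; a real system
    # would call an embedding model instead.
    vocab = ["refund", "policy", "shipping", "days", "password", "reset"]
    words = tokenize(text)
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# In-memory stand-in for the vector database holding the knowledge base.
knowledge_base = [
    "Our refund policy allows returns within 30 days.",
    "Shipping takes 5 business days.",
    "You can reset your password from the account page.",
]
index = [(doc, embed(doc)) for doc in knowledge_base]

def retrieve(query, k=1):
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query):
    # Augment the LLM prompt with the retrieved context.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is the refund policy?"))
```

The key design point is that the chatbot never answers from the model's parameters alone: the retrieved passages are injected into the prompt, so the answer stays tied to the knowledge base.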
As with all LLM deployments, it is important to ensure AI safety and security, to include a diverse set of data, and to enforce the proper separation between the read-write access the model needs and the read-only access the judge needs. Emitting the judge's gradings as telemetry and feeding them back into the loop when deciding the deployment's shape and size, although optional, helps ensure that the imposed constraints are always met.
Evaluating the quality of chatbot responses must take both the knowledge base and the model into account. LLM-as-a-judge evaluates a chatbot's quality as an external entity. It has limitations: it may not be on par with human grading, it may require several auto-evaluation samples, it may respond differently to different chatbot prompts, and slight variations in a prompt or problem can drastically affect its performance. Even so, it can agree with human grading on over 80% of judgements. This is achieved by using a 1-5 grading scale, using GPT-3.5 to save costs when there is at least one grading example per score, and using GPT-4 when there are no examples from which to infer the grading rules.
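The 1-5 scale with a per-score grading example can be sketched as a judge prompt plus a score parser. This is a minimal illustration, not a production evaluator: the actual call to the judge model (GPT-3.5 or GPT-4) is omitted, and the rubric, few-shot example, and `"Score: <n>"` reply format are assumptions for the sketch.

```python
import re

RUBRIC = (
    "Grade the chatbot answer from 1 (wrong) to 5 (fully correct and "
    'grounded in the context). Reply with "Score: <n>" and a short reason.'
)

# One grading example per score would go here; a single score-5 example
# is shown for brevity. With GPT-4, the rubric alone may suffice.
FEW_SHOT = (
    "Example (score 5):\n"
    "Question: What is the return window?\n"
    "Answer: Returns are accepted within 30 days.\n"
    "Score: 5 - matches the knowledge base exactly."
)

def build_judge_prompt(question, answer, context):
    # Assemble rubric, examples, and the item to be graded.
    return "\n\n".join([
        RUBRIC,
        FEW_SHOT,
        f"Context: {context}",
        f"Question: {question}",
        f"Answer: {answer}",
        "Score:",
    ])

def parse_score(judge_reply):
    # Extract the numeric grade from the judge's reply, if present.
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

print(parse_score("Score: 4 - mostly correct but misses the 30-day limit."))
```

Parsing the score into a number is what lets the gradings be aggregated, compared against human labels, or emitted as telemetry.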