Thursday, December 12, 2024


Software-as-a-service LLMs, a.k.a. SaaS LLMs, are considerably more costly than models developed and hosted on foundation models in your own workspaces, whether on-premises or in the cloud, because they must address every use case, including that of a general-purpose chatbot. That generality incurs cost. For a more specific use case, a much smaller prompt suffices, and the model can also be fine-tuned so that the instructions and expected output structure are baked into the model itself. Inference costs also rise with the number of input and output tokens, and SaaS services charge per token. A use-case-specific model can often be built by two engineers in about a month, with a few thousand dollars of compute for training and experimentation, and validated by four human evaluators working from an initial set of evaluation examples.
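To make the per-token cost argument concrete, here is a back-of-the-envelope sketch comparing token-based billing against a flat self-hosted GPU bill. All prices, token counts and request volumes below are assumed placeholders for illustration, not quotes from any provider.

```python
# Back-of-the-envelope cost comparison between a per-token SaaS LLM
# and a self-hosted fine-tuned model. All numbers are illustrative
# assumptions, not actual vendor pricing.

def saas_monthly_cost(requests, in_tokens, out_tokens,
                      price_in_per_1k, price_out_per_1k):
    """Per-token billing: cost grows linearly with prompt and completion size."""
    per_request = (in_tokens / 1000) * price_in_per_1k \
                + (out_tokens / 1000) * price_out_per_1k
    return requests * per_request

def self_hosted_monthly_cost(gpu_hours, price_per_gpu_hour):
    """Self-hosting: cost is dominated by GPU time, independent of token counts."""
    return gpu_hours * price_per_gpu_hour

if __name__ == "__main__":
    # Long general-purpose prompts vs. a fine-tuned model whose
    # instructions are baked into the weights, so prompts stay short.
    general = saas_monthly_cost(requests=1_000_000, in_tokens=2_000, out_tokens=500,
                                price_in_per_1k=0.01, price_out_per_1k=0.03)
    hosted = self_hosted_monthly_cost(gpu_hours=720, price_per_gpu_hour=2.50)
    print(f"SaaS, long prompts : ${general:,.0f}/month")
    print(f"Self-hosted model  : ${hosted:,.0f}/month")
```

The point of the sketch is only that per-token pricing scales with prompt length and traffic, while a hosted fine-tuned model's serving cost does not.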

SaaS LLMs can still be a matter of convenience. Developing a model from scratch involves a significant commitment in both data and compute, particularly for pre-training. Unlike fine-tuning, pre-training trains a language model on a large corpus without reusing any weights or prior knowledge from an existing model. This route makes sense when the data is quite different from what off-the-shelf LLMs were trained on, when the domain is specialized compared to everyday language, when there must be full control over the training data for security, privacy, or fit and finish of the model's foundational knowledge base, or when there are business justifications to avoid the available LLMs altogether.

Organizations must plan for the significant commitment and sophisticated tooling this requires. Libraries like PyTorch FSDP and DeepSpeed provide the distributed training capabilities needed when pre-training an LLM from scratch. Large-scale data preprocessing is also required, involving distributed frameworks and data engineering infrastructure that can handle the scale. Training cannot commence without a well-chosen set of hyperparameters, and since it incurs high costs from long-running GPU jobs, resource utilization must be maximized. Training runs can also be long enough that GPU failures become far more likely than under normal workloads, so close monitoring of the training process is essential. Saving model checkpoints regularly and evaluating on validation sets act as safeguards.
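A minimal sketch of what that scaffolding looks like with PyTorch FSDP, including periodic checkpointing as a safeguard against GPU failures on long runs. The tiny model, random batches and hyperparameters are placeholders; a real pre-training job would add an auto-wrap sharding policy, mixed precision, streamed preprocessed data and validation-set evaluation.

```python
# Minimal scaffolding for distributed pre-training with PyTorch FSDP.
# Launch with: torchrun --nproc_per_node=<gpus> pretrain_sketch.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real run would shard a transformer block by block.
    model = torch.nn.Sequential(
        torch.nn.Embedding(32_000, 512),
        torch.nn.Linear(512, 512),
        torch.nn.GELU(),
        torch.nn.Linear(512, 32_000),
    ).cuda()
    model = FSDP(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(10_000):
        # Placeholder batch; real pre-training streams preprocessed shards.
        tokens = torch.randint(0, 32_000, (8, 128), device="cuda")
        logits = model(tokens[:, :-1])
        loss = loss_fn(logits.reshape(-1, 32_000), tokens[:, 1:].reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Regular checkpoints guard against GPU failures on long runs.
        if step % 1_000 == 0:
            cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
            with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
                state = model.state_dict()
            if dist.get_rank() == 0:
                torch.save(state, f"checkpoint_step{step}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```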

Constant evaluation and monitoring of deployed large language models and generative AI applications are important because both the data and the environment can vary. There can be shifts in performance or accuracy, or even the emergence of biases. Continuous monitoring helps with early detection and prompt response, which in turn keeps the models' outputs relevant, appropriate and effective. Benchmarks help to evaluate models, but the variation in results can be large, and this stems from a lack of ground truth. For example, it is difficult to evaluate summarization models with traditional NLP metrics such as BLEU and ROUGE, because generated summaries may use completely different words or word orderings. Comprehensive evaluation standards remain elusive for LLMs, and reliance on human judgment is costly and time-consuming. The newer trend of LLM-as-a-judge still leaves open questions: whether it reflects human preferences for correctness, readability and comprehensiveness of answers; how reliable and reusable it is across different metrics; the different grading scales used by different frameworks; and whether the same evaluation metric applies across diverse use cases.
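To make the ground-truth problem concrete, here is a small self-contained sketch of a ROUGE-1-style unigram-overlap score. The two summaries are assumed examples that state the same fact in different words, so the overlap score comes out misleadingly low.

```python
# A ROUGE-1-style unigram recall score, computed by hand to show why
# word-overlap metrics penalize valid paraphrases. The reference and
# candidate summaries below are illustrative examples.
from collections import Counter

def unigram_recall(reference: str, candidate: str) -> float:
    """Fraction of reference words that also appear in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

reference = "the quarterly report shows revenue grew sharply in the cloud business"
candidate = "cloud income rose steeply this quarter according to the filing"

print(f"unigram recall: {unigram_recall(reference, candidate):.2f}")
# Prints a low score even though both sentences summarize the same fact,
# which is why human judgment or LLM-as-a-judge is often brought in.
```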

Finally, the system must be simplified with a model-serving layer that manages, governs and exposes models through unified endpoints, so that each specific LLM request is routed to the right model.
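A hypothetical sketch of such a unified endpoint using FastAPI; the route, model names and generate stubs are assumptions standing in for whatever model-serving or gateway product is actually used.

```python
# Hypothetical unified serving endpoint: one route governs access to
# several models and dispatches each request to the right backend.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Placeholder registry; in practice these would be real model clients
# (a fine-tuned in-house model, an external SaaS LLM, etc.).
def summarizer(prompt: str) -> str:
    return f"[summary of] {prompt}"

def chat_model(prompt: str) -> str:
    return f"[chat reply to] {prompt}"

MODEL_REGISTRY = {"summarizer": summarizer, "chat": chat_model}

class GenerateRequest(BaseModel):
    model: str
    prompt: str

@app.post("/v1/generate")
def generate(req: GenerateRequest):
    backend = MODEL_REGISTRY.get(req.model)
    if backend is None:
        raise HTTPException(status_code=404, detail=f"unknown model: {req.model}")
    # A real gateway would also apply auth, rate limits, logging and
    # usage metering here for governance.
    return {"model": req.model, "output": backend(req.prompt)}
```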
