Software-as-a-service (SaaS) LLMs are considerably more costly than models developed and hosted on foundation models in your own workspace, whether on-premises or in the cloud, because they must address every use case, including a general-purpose chatbot. That generality incurs cost. For a narrower use case, a much smaller prompt suffices, and the model can also be fine-tuned so that the instructions and the expected output structure are baked into the model itself. Inference cost also grows with the number of input and output tokens, and SaaS services charge per token. A use-case-specific model can often be built by two engineers in about a month, with a few thousand dollars of compute for training and experimentation, and validated by four human evaluators against an initial set of evaluation examples.
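To make the per-token pricing argument concrete, here is a back-of-the-envelope comparison of per-token SaaS charges against GPU-hour costs for a small self-hosted, fine-tuned model. All prices, token counts, and request volumes below are hypothetical placeholders, not figures from any actual provider.

```python
# Back-of-the-envelope cost comparison (all prices and volumes are
# hypothetical placeholders, not quotes from any provider).

def saas_monthly_cost(requests_per_month: int,
                      input_tokens: int,
                      output_tokens: int,
                      price_per_1k_input: float,
                      price_per_1k_output: float) -> float:
    """Per-token SaaS pricing: cost scales with prompt and completion length."""
    per_request = (input_tokens / 1000) * price_per_1k_input \
                + (output_tokens / 1000) * price_per_1k_output
    return requests_per_month * per_request

def self_hosted_monthly_cost(gpu_hours: float, price_per_gpu_hour: float) -> float:
    """Self-hosted fine-tuned model: cost scales with GPU time, not tokens."""
    return gpu_hours * price_per_gpu_hour

if __name__ == "__main__":
    # A long general-purpose prompt vs. a short prompt baked into a fine-tuned model.
    saas = saas_monthly_cost(500_000, input_tokens=2_000, output_tokens=300,
                             price_per_1k_input=0.01, price_per_1k_output=0.03)
    hosted = self_hosted_monthly_cost(gpu_hours=720, price_per_gpu_hour=2.50)
    print(f"SaaS (per-token):  ${saas:,.0f}/month")
    print(f"Self-hosted (GPU): ${hosted:,.0f}/month")
```

The point is not the specific numbers but the shape of the curves: SaaS cost grows with prompt length and traffic, while a fine-tuned, self-hosted model with a short prompt scales with provisioned GPU time.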
SaaS LLMs can be a matter of convenience. Developing a model from scratch involves a significant commitment in both data and computational resources, particularly for pre-training. Unlike fine-tuning, pre-training trains a language model on a large corpus without starting from the weights of an existing model. This path makes sense when the data differs substantially from what off-the-shelf LLMs are trained on, when the domain is highly specialized compared to everyday language, when there must be full control over the training data for security, privacy, and fit with the model's foundational knowledge base, or when there are business reasons to avoid available LLMs altogether.
Organizations must plan for the significant commitment and sophisticated tooling this requires. Pre-training an LLM from scratch depends on libraries such as PyTorch FSDP and DeepSpeed for distributed training. Large-scale data preprocessing is also needed, which calls for distributed data-engineering frameworks and infrastructure that can handle the volume. Training cannot begin without a set of well-chosen hyperparameters, and because training involves high costs from long-running GPU jobs, resource utilization must be maximized. Training runs can last long enough that GPU failures become likelier than under normal load, so close monitoring of the training process is essential. Saving model checkpoints regularly and evaluating on validation sets act as safeguards.
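The sketch below shows the general shape of such a training loop with PyTorch FSDP: shard the model across ranks and periodically write a consolidated checkpoint so a failed job can resume. The TinyLM model, the random token batches, and the checkpoint interval are placeholders standing in for a real transformer, a preprocessed corpus, and a tuned schedule; a launch via torchrun is assumed so the rank environment variables are set.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

class TinyLM(nn.Module):
    """Placeholder language model; a real run would use a full transformer."""
    def __init__(self, vocab_size=32_000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model), nn.GELU(),
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.layers(self.embed(tokens)))

def main():
    # Launch with torchrun so RANK/LOCAL_RANK/WORLD_SIZE are set.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = FSDP(TinyLM().cuda())        # shard params, grads, optimizer state
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10_000):
        # Random tokens stand in for a real preprocessed corpus.
        tokens = torch.randint(0, 32_000, (8, 256), device="cuda")
        logits = model(tokens[:, :-1])
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Periodic consolidated checkpoint as a safeguard against GPU failures.
        if step % 1000 == 0:
            cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
            with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
                state = model.state_dict()
            if dist.get_rank() == 0:
                torch.save(state, f"checkpoint_step{step}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```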
Constant evaluation and monitoring of deployed large language models and generative AI applications are important because both the data and the environment can change. There can be shifts in performance or accuracy, or even the emergence of biases. Continuous monitoring enables early detection and a prompt response, which in turn keeps the models' outputs relevant, appropriate and effective.
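One simple form such monitoring can take is tracking a rolling average of a per-response quality score and alerting when it drops below an expected baseline. In the sketch below, the scoring function, window size, baseline, and tolerance are all placeholders; in practice the score might come from an automated metric, user feedback, or an LLM judge.

```python
# Minimal sketch of one monitoring safeguard for a deployed LLM: track a
# rolling average of a per-response quality score and flag a sustained drop.
# score_response() and the thresholds are placeholders, not a real scorer.
from collections import deque

class DriftMonitor:
    def __init__(self, window: int = 500, baseline: float = 0.85, tolerance: float = 0.10):
        self.scores = deque(maxlen=window)   # most recent quality scores
        self.baseline = baseline             # expected quality from offline evaluation
        self.tolerance = tolerance           # allowed drop before alerting

    def record(self, score: float) -> bool:
        """Record a new score; return True if the rolling average has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False                     # not enough data yet
        rolling = sum(self.scores) / len(self.scores)
        return rolling < self.baseline - self.tolerance

def score_response(prompt: str, response: str) -> float:
    """Placeholder: could be an automated metric, user feedback, or an LLM judge."""
    return 0.9 if response else 0.0

monitor = DriftMonitor()
if monitor.record(score_response("summarize ...", "A short summary.")):
    print("ALERT: response quality has drifted below the baseline")
```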
Benchmarks help to evaluate models, but the variation in results can be large, largely because there is no single ground truth. For example, it is difficult to evaluate summarization models with traditional NLP metrics such as BLEU and ROUGE because two good summaries may use completely different words or word orderings. Comprehensive evaluation standards remain elusive for LLMs, and reliance on human judgment is costly and time-consuming. The emerging trend of “LLMs as a judge” still leaves open questions: whether the judge reflects human preferences for correctness, readability and comprehensiveness of the answers; how reliable and reusable it is across different metrics; how to reconcile the different grading scales used by different frameworks; and whether the same evaluation metric applies across diverse use cases.
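The ROUGE limitation is easy to see in a few lines using the rouge-score package (pip install rouge-score); the two example sentences below are made up. They state the same fact, yet the overlap-based scores come out low because the surface wording differs.

```python
# Illustration of why lexical-overlap metrics undervalue paraphrased summaries.
from rouge_score import rouge_scorer

reference = "The company reported higher quarterly revenue driven by cloud sales."
candidate = "Cloud demand pushed the firm's earnings up this quarter."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

# Both sentences convey the same fact, yet the overlap-based scores are low
# because they share few surface tokens.
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```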
Finally, the system must be simplified through model serving, so that models can be managed, governed and accessed via unified endpoints that handle specific LLM requests.
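A unified endpoint can be as simple as a thin gateway that routes each request to the appropriate use-case-specific backend. The sketch below uses FastAPI and httpx; the backend URLs, model names and request schema are hypothetical, and a production gateway would also handle authentication, rate limiting and logging for governance.

```python
# Minimal sketch of a unified serving endpoint routing to model backends.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx

app = FastAPI()

# Registry of governed model endpoints (placeholder URLs).
MODEL_BACKENDS = {
    "summarizer": "http://summarizer-svc:8000/generate",
    "support-bot": "http://support-bot-svc:8000/generate",
}

class GenerateRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 256

@app.post("/v1/generate")
async def generate(req: GenerateRequest):
    backend = MODEL_BACKENDS.get(req.model)
    if backend is None:
        raise HTTPException(status_code=404, detail=f"unknown model '{req.model}'")
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(backend, json=req.model_dump())
    resp.raise_for_status()
    return resp.json()
```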