Large Language Models (LLMs) have the potential to significantly improve organizations' workforce and customer experiences. By addressing tasks that currently occupy 60%-70% of employees' time, LLMs can reduce the effort spent on background research, data analysis, and document writing. They can also shorten the time it takes new workers to reach full productivity. However, organizations must first rethink how they manage unstructured information assets and mitigate issues of bias and accuracy. This is why many organizations are focusing on internal applications, where a limited scope allows for better information access and human oversight. Such applications, aligned with core capabilities already within the organization, can deliver real and immediate value while LLMs and their supporting technologies continue to evolve and mature. Examples include automated analysis of product reviews and inventory management, with use cases spanning education, financial services, travel and hospitality, healthcare and life sciences, insurance, technology and manufacturing, and media and entertainment.
The use of structured data can enhance the quality of GenAI applications, as in the case of a travel planning chatbot. Such an application would combine vector search with feature-and-function serving building blocks to serve personalized user preferences along with budget and hotel information, often involving agents for programmatic access to external data sources. Federated and universal access control can be used to expose data and functions as real-time endpoints. Models can be exposed as Python functions that compute features on demand; such functions can be registered with a catalog for access control and encoded in a directed acyclic graph that computes and serves features behind a REST endpoint.
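As a concrete illustration, the sketch below exposes an on-demand feature computation as a REST endpoint using FastAPI. The trip-feature schema and route are invented for this example and do not come from any specific platform; a real deployment would also register the function in a catalog and enforce access control.

```python
# Minimal sketch: serving an on-demand feature computation over REST.
# The schema and route names are assumptions for illustration only.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TripRequest(BaseModel):
    user_id: str
    nightly_budget: float
    nights: int

def compute_trip_features(req: TripRequest) -> dict:
    # On-demand feature: total stay budget derived from the request.
    return {"user_id": req.user_id, "total_budget": req.nightly_budget * req.nights}

@app.post("/features/trip")
def serve_trip_features(req: TripRequest) -> dict:
    # In production this handler would combine precomputed features
    # (e.g., hotel embeddings from a vector index) with on-demand ones.
    return compute_trip_features(req)
```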
To serve structured data to real-time AI applications, precomputed data needs to be deployed to operational databases such as DynamoDB on AWS or Cosmos DB on Azure, which means synchronizing precomputed features into a low-latency data format. Fine-tuning a foundation model allows for more deeply personalized models, but it requires an underlying architecture that ensures secure and accurate data access.
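As an illustrative sketch, precomputed features might be pushed to DynamoDB like this; the table name, key, and feature values are assumptions, and boto3 is used only as one example client.

```python
# Hedged sketch: syncing precomputed features to a low-latency operational store.
# Table "user_features" keyed on "user_id" is an assumed schema.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_features")

precomputed = [
    {"user_id": "u-001", "preferred_city": "Lisbon", "avg_nightly_budget": "180"},
    {"user_id": "u-002", "preferred_city": "Kyoto", "avg_nightly_budget": "240"},
]

with table.batch_writer() as batch:
    for row in precomputed:
        batch.put_item(Item=row)  # upsert keyed on user_id
```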
Most organizations benefit from an intelligence platform that supports model fine-tuning, model registration for access control, secure and efficient data sharing across platforms, clouds, and regions for faster worldwide distribution, and optimized LLM serving for improved performance. Such a platform should provide simple infrastructure for fine-tuning models, ensure traceability from models back to their datasets, and deliver better throughput and latency than traditional LLM serving methods.
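One way to picture the traceability requirement is the hedged sketch below, which uses MLflow as a stand-in for such a platform; the parameter names, dataset path, and registered model name are illustrative assumptions, not a prescribed workflow.

```python
# Hedged sketch: recording lineage from a fine-tuned model back to its dataset,
# then registering the model so access control can be applied.
import mlflow
import mlflow.pyfunc

class EchoModel(mlflow.pyfunc.PythonModel):
    # Stand-in for a fine-tuned LLM wrapper.
    def predict(self, context, model_input):
        return model_input

with mlflow.start_run() as run:
    mlflow.log_param("base_model", "llama-style-7b")                       # assumed base model
    mlflow.log_param("training_dataset", "s3://bucket/travel-reviews-v3")  # assumed dataset path
    mlflow.pyfunc.log_model("model", python_model=EchoModel())

# Registering the logged model makes it governable behind the platform's registry.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "travel-assistant-llm")
```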
Software-as-a-service (SaaS) LLMs are considerably more costly than models developed and hosted from foundation models in an organization's own workspaces, whether on-premises or in the cloud, because they must address every use case, including a general-purpose chatbot, and that generality incurs cost. For a more specific use case, a much smaller prompt suffices, and the instructions and expected output structure can be baked into the model itself through fine-tuning. Inference costs also rise with the number of input and output tokens, and SaaS services charge per token. A model for a specific use case can be built by as few as two engineers in about a month, with a few thousand dollars of compute for training and experimentation, and tested by four human evaluators against an initial set of evaluation examples.
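A rough back-of-the-envelope calculation shows why prompt length matters under per-token pricing. All prices, token counts, and request volumes below are illustrative assumptions, not quotes from any vendor.

```python
# Illustrative cost comparison: a long general-purpose prompt vs. a short prompt
# whose instructions have been baked into a fine-tuned model. Prices are made up.
SAAS_PRICE_PER_1K_INPUT = 0.01   # assumed $/1K input tokens
SAAS_PRICE_PER_1K_OUTPUT = 0.03  # assumed $/1K output tokens

def monthly_cost(requests: int, input_tokens: int, output_tokens: int) -> float:
    return requests * (input_tokens / 1000 * SAAS_PRICE_PER_1K_INPUT
                       + output_tokens / 1000 * SAAS_PRICE_PER_1K_OUTPUT)

general = monthly_cost(requests=1_000_000, input_tokens=2_000, output_tokens=500)
fine_tuned = monthly_cost(requests=1_000_000, input_tokens=200, output_tokens=500)
print(f"general prompt: ${general:,.0f}/month, short fine-tuned prompt: ${fine_tuned:,.0f}/month")
```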
SaaS LLMs can still be a matter of convenience. Developing a model from scratch involves a significant commitment of both data and computational resources, particularly for pre-training. Unlike fine-tuning, pre-training trains a language model on a large corpus of data without using any prior knowledge or weights from an existing model. This scenario makes sense when the data differs substantially from what off-the-shelf LLMs are trained on, when the domain is specialized compared to everyday language, when there must be full control over the training data for security, privacy, and fit of the model's foundational knowledge base, or when there are business justifications for avoiding available LLMs altogether.
Organizations must plan for the significant commitment and sophisticated tooling this requires. Libraries such as PyTorch FSDP and DeepSpeed provide the distributed training capabilities needed when pretraining an LLM from scratch. Large-scale data preprocessing is also required, involving distributed frameworks and infrastructure that can handle data engineering at scale. Training cannot commence without a set of optimal hyperparameters, and because training involves high costs from long-running GPU jobs, resource utilization must be maximized. Training runs can last so long that GPU failures become more likely than under normal load, so close monitoring of the training process is essential, with regular model checkpoints and evaluation on validation sets acting as safeguards.
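The sketch below shows the shape of such a loop with PyTorch FSDP and periodic checkpointing. The toy model, synthetic batches, and checkpoint interval are placeholders; a real pretraining job would shard a full LLM, use a proper data pipeline, and rely on FSDP's sharded state-dict APIs.

```python
# Minimal sketch of sharded training with PyTorch FSDP plus periodic checkpointing.
# Launch with torchrun so each process gets a rank and a GPU.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Toy stand-in for a transformer; placeholder for a real LLM.
model = FSDP(nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).cuda())
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()

for step in range(10_000):
    batch = torch.randn(32, 512, device="cuda")  # synthetic data standing in for token batches
    loss = loss_fn(model(batch), batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    if step % 1_000 == 0:
        state = model.state_dict()               # collective call: every rank participates
        if dist.get_rank() == 0:
            torch.save(state, f"checkpoint_{step}.pt")  # safeguard against GPU failures
```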
Constant evaluation and monitoring of deployed large language models and generative AI applications are important because both the data and the environment can change. There can be shifts in performance, accuracy, or even the emergence of biases. Continuous monitoring enables early detection and prompt response, which keeps the models' outputs relevant, appropriate, and effective. Benchmarks help to evaluate models, but variation in results can be large, stemming from a lack of ground truth. For example, it is difficult to evaluate summarization models with traditional NLP metrics such as BLEU and ROUGE because generated summaries may use completely different words or word order. Comprehensive evaluation standards remain elusive for LLMs, and reliance on human judgment can be costly and time-consuming. The novel trend of using LLMs as judges still leaves unanswered questions about how well they reflect human preferences for correctness, readability, and comprehensiveness of answers, about their reliability and reusability across different metrics, about the different grading scales used by different frameworks, and about whether the same evaluation metric applies across diverse use cases.
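To see the ground-truth problem concretely, the snippet below scores two summaries that mean roughly the same thing but share few words; the sentences are made up, and the rouge-score package is used only as one example of an n-gram metric.

```python
# Illustration of why n-gram metrics can misjudge summaries: equivalent meaning,
# different wording, low ROUGE overlap. Requires the rouge-score package.
from rouge_score import rouge_scorer

reference = "the hotel was inexpensive and close to the beach"
candidate = "an affordable hotel located near the shore"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))
# Low scores despite equivalent meaning, which is why human or LLM-based judging
# is often layered on top of automatic metrics.
```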
Finally, the system must be simplified with model serving that manages, governs, and provides access to models through unified endpoints for handling specific LLM requests.
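In practice that amounts to a single REST call against a serving endpoint; the URL, token, and payload schema below are placeholders rather than any specific vendor's API.

```python
# Hedged sketch: querying a unified model-serving endpoint over REST.
# The endpoint URL, bearer token, and request body are illustrative placeholders.
import requests

resp = requests.post(
    "https://example.com/serving-endpoints/travel-assistant/invocations",
    headers={"Authorization": "Bearer <token>"},
    json={"messages": [{"role": "user", "content": "Find hotels in Lisbon under $200 a night"}]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```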