Sunday, December 22, 2024

AI safety and AI security are distinct concerns. AI safety focuses on preventing AI systems from generating harmful content, from instructions for creating weapons to offensive language and inappropriate images, and from exhibiting bias, misrepresentation and hallucination. It aims to ensure responsible use of AI and adherence to ethical standards. AI security involves testing AI systems with the goal of preventing bad actors from abusing AI to compromise the confidentiality, integrity, or availability of the systems the AI is hosted in.

These are significant concerns that have driven organizations to seek third-party testing. More than half of the AI vulnerability reports submitted pertain to AI safety. These issues often have a lower barrier to entry for valid reporting and present a different risk profile compared to traditional security vulnerabilities. Bounties for AI safety are therefore lower than those for AI security, while the volume of reports is higher; AI safety issues rank among the top five reported vulnerability categories, with business logic errors, prompt injection, sensitive information disclosure, and training data poisoning constituting the remainder. Prompt injection is a newer threat, and contrary to the dismissive view that it merely gets the model to say something it shouldn't, attackers can use it to disclose a victim's entire chat history, files and objects, which leads to far greater security risks.

The recommendations for AI safety and security align with the well-recognized charter of a defense-in-depth strategy: a fortified security posture at every layer and continuous vulnerability testing throughout the software development lifecycle. These include establishing continuous testing, evaluation, verification, and validation throughout the AI model lifecycle; providing regular executive metrics and updates on AI model functionality, security, reliability and robustness; and regularly scanning and updating the underlying infrastructure and software for vulnerabilities. There are also geographical compliance requirements stemming from mandates issued by country-, state- or agency-specific regulators; in these cases, regulations may exist around specific AI features, such as facial recognition and employment-related systems. Organizations do well when they establish an AI governance framework outlining roles, responsibilities, and ethical considerations, including incident response planning and risk management. All employees and users of AI should be trained on ethics, responsibilities, legal issues, AI security risks, and best practices, including warranty, license, and copyright. Ultimately, this strives to establish a culture of open and transparent communication about the organization's use of predictive or generative AI.


Saturday, December 21, 2024

 

Security and Vulnerability Management in Infrastructure Engineering has become more specialized than ever. In this article, we explore some of the contemporary practices and emerging trends.

Cyberthreats have long been alarming as enterprise technology capabilities grow, but with AI deployments and AI-powered threat actors now mainstream, the digital threat landscape is growing and changing faster than ever. Just a few years ago, as chatbots and copilots gained popularity, organizations were contending with the OWASP Top 10 for web-based applications and connectivity from mobile applications; the OWASP Top 10 document identifies the most critical security risks to web applications. With AI being so data-voracious and models now lightweight enough to be hosted even in a browser on a mobile device, researchers are discovering new impact every day. A defense-in-depth strategy, with a fortified security posture at every layer and continuous vulnerability testing throughout the software development lifecycle, has become a mainstream response against these threats.

Human-powered, AI-enabled security testing remains vital where vulnerability scanners fall short. The security researcher community has played a phenomenal role in this area, constantly upgrading its skills, delivering ongoing value and even gaining the trust of risk-averse organizations. Companies, in turn, are defining their vulnerability reporting programs and bug bounty awards in compliance with the Department of Justice safe harbor guidelines.

Researchers and security experts are both aware that generative AI is one of the most significant risks impacting organizations today, particularly when it comes to securing data integrity. As they upskill their AI prowess, AI testing engagements are taking shape. The top vulnerability reported to bug bounty programs is cross-site scripting (XSS), while for penetration testing it is misconfiguration. Researchers typically target bug bounty programs, which focus more on real-world attack vectors, while security experts target penetration tests, which uncover more systemic and architectural vulnerabilities. High-end security initiatives by organizations have well-defined engagements with these information workers, usually involving a broad scope and a select team of trusted researchers. The results speak for themselves, with over 30% of valid vulnerability submissions rated high or critical.

With the recent impact of CrowdStrike's software causing Windows machines all across the world to fail, and companies like Delta Air Lines suing for five hundred million dollars, the effort to reduce common vulnerabilities in production has never been emphasized more. Companies that are technologically savvy are getting far fewer reports for OWASP Top 10 security risks than the industry average.


Friday, December 20, 2024

 

Many of the BigTech companies maintain datacenters for their private use. Changes to data center infrastructure to support AI workloads are not covered in as much detail as they are for the public cloud. Yet advances in machine learning, increased computational power, and the increased availability of and reliance on data are not limited to the public cloud. This article dives into the changes in so-called “on-premises” infrastructure.

AI has been integrated into a diverse range of industries, with key applications standing out in each: personalized learning experiences in education, autonomous vehicle navigation in automotive, predictive maintenance in manufacturing, design optimization in architecture, fraud detection in finance, demand forecasting in retail, improved diagnostics and monitoring in healthcare, and natural language processing in technology.

Most AI deployments can be split into training and inference. Inference depends on training, and while it can cater to mobile and edge computing with new data, training poses heavy demands on both computational power and data ingestion. Lately, data ingestion itself has become so computationally expensive that the hardware needs to evolve. Servers are increasingly being upgraded with GPUs, TPUs, or NPUs alongside traditional CPUs. Graphics processing units (GPUs) were initially developed for high-quality video graphics but are now used for high volumes of parallel computing tasks. A neural processing unit (NPU) provides specialized hardware accelerators for neural network workloads. A tensor processing unit (TPU) is a specialized application-specific integrated circuit designed to increase the efficiency of AI workloads. The more intensive the algorithm or the higher the throughput of data ingestion, the greater the computational demand. This results in about 400 watts of power consumption per server and about 10 kW for high-end performance servers. Data center cooling with traditional fans no longer suffices, and liquid cooling is called for.
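To make those figures concrete, here is a small back-of-the-envelope sketch using the per-server wattages quoted above and an assumed 40 kW rack power budget (the rack budget is a hypothetical figure, not from this article):

```python
# Back-of-the-envelope rack power estimate, using the figures quoted above:
# ~400 W per general-purpose server and ~10 kW per high-end AI server.
GENERAL_SERVER_W = 400        # assumed typical draw per CPU-only server
AI_SERVER_W = 10_000          # assumed draw per high-end accelerator server
RACK_BUDGET_W = 40_000        # hypothetical 40 kW rack power budget

def servers_per_rack(server_watts: int, budget_watts: int = RACK_BUDGET_W) -> int:
    """How many servers of a given draw fit within a rack's power budget."""
    return budget_watts // server_watts

print(servers_per_rack(GENERAL_SERVER_W))  # 100 general-purpose servers per 40 kW rack
print(servers_per_rack(AI_SERVER_W))       # only 4 high-end AI servers per rack
```

The arithmetic simply shows that a rack which once held on the order of a hundred general-purpose servers holds only a handful of accelerator-dense AI servers, which is why power delivery and cooling have to be redesigned.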

Enterprises that want to introduce AI to their on-premises IT infrastructure quickly discover the same ballooning costs that they feared with the public cloud, and these costs come not just from real estate and energy consumption but also from more inventory, both hardware and software. While traditional on-premises data was housed in dedicated storage systems, the rise of AI workloads poses more DevOps requirements and challenges that can only be addressed by an exploding number of data and code pipelines. Colocation of data is a significant challenge, both in terms of networking and the volume of data in transit. The lakehouse architecture is finding popularity on-premises as well. Large-scale colocation sites are becoming more cost-effective than on-site datacenter improvements. The gray area between workload and infrastructure for AI versus consolidation in datacenters is still an ongoing exploration. Colocation datacenters are gaining popularity because they pack the highest physical security, adequate renewable power, scalability when density increases, low-latency connectivity options, dynamic cooling technology, and backup and disaster recovery options; they are even dubbed built-to-suit datacenters. Hyperconvergence in these datacenters is yet to be realized, but there are significant improvements from redesigned rack placement and in-rack cooling efficiency. These efficiencies drive up the mean time between failures.

There is talk of even greater efficiency being required for quantum computing, and these workloads will certainly drive up the demands on computational power, data colocation, networks capable of supporting billions of transactions per hour, and cooling efficiency. Sustainability is also an important perspective, driven and sponsored by both the market and the leadership. Hot air from the datacenter, for instance, finds new applications for comfort heating, as do solar cells and battery storage.

The only constant seems to be the demand on people to understand the infrastructure behind high-performance computation.

Previous article: IaCResolutionsPart219.docx

 

Wednesday, December 18, 2024

From the previous article, RAG-based chatbots can be evaluated by an LLM-as-a-judge that agrees with human grading on over 80% of judgements if the following are maintained: use a 1-5 grading scale, use GPT-3.5 as the judge to save costs when you have one grading example per score, and use GPT-4 as the judge when you have no examples to illustrate the grading rules. Emitting the metrics for correctness, comprehensiveness, and readability provides justification that becomes valuable. Whether we use GPT-4, GPT-3.5 or human judgement, the composite scores can be used to tell results apart quantitatively.

Together with the data and the evaluations, a chatbot can be successfully made ready for production, but that is not all: data and ML pipelines are also required. The full orchestration workflow involves a data ingestion pipeline, data science and data analysis. The lakehouse paradigm combines key capabilities of data lakes and data warehouses, and it expedites the creation of a pipeline from a few weeks to one week at most. The use of data personas on the same lakehouse reduces overhead and improves handoff across disciplines.

Let us explore these stages of the workflow. Data ingestion is best served by a medallion architecture. The raw data is stored in a Bronze table; it is then enriched, and a Delta Lake Silver table is created. This makes it easy to reprocess the data, as Delta guarantees atomicity with every operation. Schema is enforced, and constraints ensure data quality. Filtering out unnecessary data also makes the Silver table more refined.
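A minimal PySpark sketch of that bronze-to-silver hop is shown below, assuming Delta Lake is configured on the cluster; the paths, column names and quality filter are illustrative placeholders rather than the exact pipeline described here.

```python
# Minimal sketch of a bronze-to-silver hop in a medallion layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-ingest").getOrCreate()

# Bronze: land the raw documents as-is.
raw = spark.read.json("/data/raw/chat_logs/")            # hypothetical source path
raw.write.format("delta").mode("append").save("/lake/bronze/chat_logs")

# Silver: clean up, enforce basic quality constraints, and enrich.
bronze = spark.read.format("delta").load("/lake/bronze/chat_logs")
silver = (
    bronze
    .filter(F.col("text").isNotNull())                   # basic quality constraint
    .withColumn("ingested_at", F.current_timestamp())    # enrichment column
    .select("doc_id", "text", "source", "ingested_at")   # keep only what is needed
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/chat_logs")
```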

The data science portion consists of three major parts: exploratory data analysis, the chatbot model and the LLM-as-a-judge model. Model lifecycle management, including experimentation, reproducibility, deployment, and a central model registry, can be accomplished with open-source platforms like MLflow, which offer both programmatic APIs and a user interface. With model lineage, such as the experiment and the run that produced the model, its versioning and stage transitions from staging to production or archiving, and annotations, a complete picture is available in the data science step of the workflow.
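As an illustration, a hedged MLflow sketch of that lifecycle might look like the following; the experiment name, parameters and the toy pyfunc model are placeholders for the real chatbot and judge models:

```python
import mlflow
import mlflow.pyfunc

class EchoModel(mlflow.pyfunc.PythonModel):
    """Toy stand-in for the chatbot or judge model."""
    def predict(self, context, model_input):
        return model_input

mlflow.set_experiment("rag-chatbot")                 # hypothetical experiment name

with mlflow.start_run(run_name="baseline") as run:
    mlflow.log_param("retriever_top_k", 5)           # experiment tracking
    mlflow.log_metric("composite_judge_score", 4.0)  # evaluation result for this run
    mlflow.pyfunc.log_model("model", python_model=EchoModel())

# Register the run's model so versioning and stage transitions are tracked centrally.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "rag-chatbot")
```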

Business intelligence, in terms of correlations, dashboards and reports, can be facilitated with analytical SQL-like queries and materialized data views. For each query, different views of the data can be studied for the best representation, which can then be promoted directly to dashboards; this cuts down on redundancies and cost. Even a library like Plotly is sufficient to produce a comprehensive and appealing visualization of the report. The data analysis layer is inherently read-only and independent of all activities for model training and testing.
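For the visualization piece, a tiny Plotly sketch along these lines is usually enough; the dataframe stands in for a materialized view of graded answer sheets, and the numbers are made up:

```python
import pandas as pd
import plotly.express as px

# Stand-in for a SQL materialized view of graded answer sheets.
scores = pd.DataFrame({
    "model": ["gpt-3.5", "gpt-4", "fine-tuned"],
    "composite_score": [3.8, 4.4, 4.1],
})

# One chart per analytical question, promotable to a dashboard.
fig = px.bar(scores, x="model", y="composite_score",
             title="Composite judge score by model")
fig.show()
```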

A robust data environment with smooth ingestion, processing and retrieval by the whole team, data collection and cleaning pipelines deployed using Delta tables, model development and deployment organized with MLflow, and powerful analytical dashboards complete the gamut for launching the project.


Tuesday, December 17, 2024

 

Constant evaluation and monitoring of deployed large language models and generative AI applications are important because both the data and the environment might vary. There can be shifts in performance, accuracy or even the emergence of biases. Continuous monitoring helps with early detection and prompt response, which in turn keeps the models’ outputs relevant, appropriate, and effective. Benchmarks help to evaluate models, but the variation in results can be large. This stems from a lack of ground truth. For example, it is difficult to evaluate summarization models with traditional NLP metrics such as BLEU and ROUGE, because the summaries generated might use completely different words or word order. Comprehensive evaluation standards are elusive for LLMs, and reliance on human judgment can be costly and time-consuming. The novel trend of “LLM-as-a-judge” still leaves unanswered questions about reflecting human preferences in terms of correctness, readability and comprehensiveness of the answers; reliability and reusability across different metrics; the use of different grading scales by different frameworks; and the applicability of the same evaluation metric across diverse use cases.
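A small illustration of the ground-truth problem: with an n-gram metric such as BLEU, a faithful paraphrase of a reference summary still scores poorly. The sketch below uses NLTK's sentence-level BLEU, and the example sentences are invented:

```python
# Why n-gram metrics undervalue paraphrased summaries.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the service outage was caused by an expired certificate".split()]
candidate = "an expired certificate led to the downtime".split()

smoothie = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smoothie))
# Low score, despite the candidate being a faithful summary in different words.
```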

Since chatbots are a common application of LLMs, an example of evaluating a chatbot follows. The underlying principle in a chatbot is Retrieval Augmented Generation (RAG), which is quickly becoming the industry standard for developing chatbots. As with all LLM and AI models, it is only as effective as its data, which in this case is the vector store, aka the knowledge base. The LLM could be a newer GPT-3.5 or GPT-4 to reduce hallucinations, maintain up-to-date information, and leverage domain-specific knowledge. Evaluating the quality of chatbot responses must take into account both the knowledge base and the model involved. LLM-as-a-judge fits this bill for automated evaluation, but as noted earlier, it may not be on par with human grading, might require several auto-evaluation samples, and may respond differently to different chatbot prompts: slight variations in the prompt or problem can drastically affect its performance.

RAG-based chatbots can be evaluated by an LLM-as-a-judge that agrees with human grading on over 80% of judgements if the following are maintained: use a 1-5 grading scale, use GPT-3.5 as the judge to save costs when you have one grading example per score, and use GPT-4 as the judge when you have no examples to illustrate the grading rules.

The initial evaluation dataset can be formed from, say, 100 chatbot prompts and context from the domain in terms of (chunks of) documents that are relevant to the question based on, say, F-score. Using the evaluation dataset, different language models can be used to generate answers, stored as question-context-answer records in a dataset called “answer sheets”. Then, given the answer sheets, various LLMs can be used to generate grades and the reasoning for those grades. Each grade can be a composite score with a weighted contribution from correctness (mostly), and from comprehensiveness and readability in equal proportions of the remaining weight. A good choice of hyperparameters is equally applicable to the LLM-as-a-judge, and this could include a low temperature of, say, 0.1 to ensure reproducibility; single-answer grading instead of pairwise comparison; chain-of-thought prompting to let the LLM reason about the grading process before giving the final score; and grading examples for each score value on each of the three factors. Factors that are difficult to measure quantitatively include helpfulness, depth, creativity and so on. Emitting the metrics for correctness, comprehensiveness, and readability provides justification that becomes valuable. Whether we use GPT-4, GPT-3.5 or human judgement, the composite scores can be used to tell results apart quantitatively. The overall workflow for creating the LLM-as-a-judge is also similar to that for the chatbots themselves: data preparation, indexing relevant data, information retrieval and response generation.
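A minimal sketch of the composite grading just described follows; the exact weights are an assumption (correctness carrying most of the weight, the other two factors splitting the remainder) and the sample grades are made up:

```python
# Combine per-factor 1-5 judge grades into one weighted composite score.
# The weights are an illustrative assumption, not a prescribed split.
WEIGHTS = {"correctness": 0.6, "comprehensiveness": 0.2, "readability": 0.2}

def composite_score(grades: dict[str, int]) -> float:
    """Weighted sum of the three judge factors."""
    return sum(WEIGHTS[factor] * grade for factor, grade in grades.items())

# One row of an "answer sheet" after the judge LLM has graded it.
print(composite_score({"correctness": 4, "comprehensiveness": 5, "readability": 3}))  # 4.0
```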

Monday, December 16, 2024

Large Language Models (LLMs) have the potential to significantly improve organizations' workforce and customer experiences. By addressing tasks that currently occupy 60%-70% of employees' time, LLMs can greatly reduce the time spent on background research, data analysis, and document writing. These technologies can also shorten the time it takes new workers to achieve full productivity. However, organizations must first rethink the management of unstructured information assets and mitigate issues of bias and accuracy. This is why many organizations are focusing on internal applications, where a limited scope provides opportunities for better information access and human oversight. These applications, aligned with core capabilities already within the organization, have the potential to deliver real and immediate value while LLMs and their supporting technologies continue to evolve and mature. Examples include automated analysis of product reviews and applications in inventory management, education, financial services, travel and hospitality, healthcare and life sciences, insurance, technology and manufacturing, and media and entertainment.

The use of structured data in GenAI applications can enhance their quality, as in the case of a travel planning chatbot. Such an application would use vector search and feature-and-function-serving building blocks to serve personalized user preferences, budget and hotel information, often involving agents for programmatic access to external data sources. To access data and functions as real-time endpoints, federated and universal access control could be used. Models can be exposed as Python functions that compute features on demand. Such functions can be registered with a catalog for access control and encoded in a directed acyclic graph to compute and serve features as a REST endpoint.
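A conceptual sketch of this pattern, not tied to any specific platform API, might look like the following; the vector-search retrieval is stubbed out and the feature function and field names are hypothetical:

```python
# On-demand feature function plus a (stubbed) vector-search lookup feeding the prompt.
from datetime import date

def remaining_budget(total_budget: float, spent: float) -> float:
    """On-demand feature computed at request time rather than precomputed."""
    return max(total_budget - spent, 0.0)

def build_prompt(user_query: str, prefs: dict, hotels: list[str]) -> str:
    budget = remaining_budget(prefs["budget"], prefs["spent"])
    return (
        f"Today is {date.today()}. The traveller prefers {prefs['style']} trips "
        f"and has ${budget:.0f} left to spend.\n"
        f"Candidate hotels: {', '.join(hotels)}\n"
        f"Question: {user_query}"
    )

# In practice the hotels would come from a vector-search index over hotel descriptions.
hotels = ["Harbour View Inn", "Old Town Suites"]          # stubbed retrieval result
print(build_prompt("Plan a 3-day trip to Lisbon",
                   {"style": "budget", "budget": 1200.0, "spent": 350.0}, hotels))
```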

To serve structured data to real-time AI applications, precomputed data needs to be deployed to operational databases such as DynamoDB and Cosmos DB, in the case of the AWS and Azure public clouds respectively. Synchronization of precomputed features to a low-latency data format is required. Fine-tuning a foundation model allows for more deeply personalized models but requires an underlying architecture that ensures secure and accurate data access.
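As a hedged example of that synchronization step, precomputed features could be batch-written to DynamoDB with boto3 along these lines; the table name and schema are assumptions:

```python
# Push offline-computed features into a low-latency operational store (DynamoDB).
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_travel_features")            # hypothetical table

precomputed = [
    {"user_id": "u-001", "preferred_style": "budget", "avg_nightly_rate": 120},
    {"user_id": "u-002", "preferred_style": "luxury", "avg_nightly_rate": 410},
]

# Batch-write the rows so the chatbot can read them at request time.
with table.batch_writer() as writer:
    for row in precomputed:
        writer.put_item(Item=row)
```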

Most organizations do well with an intelligence platform that helps with model fine-tuning, registration for access control, secure and efficient data sharing across different platforms, clouds and regions for faster distribution worldwide, and optimized LLM serving for improved performance. Such a platform should offer simple infrastructure for fine-tuning models, ensure traceability from models to datasets, and enable better throughput and latency than traditional LLM serving methods.

Software-as-a-service LLMs, aka SaaS LLMs, are far more costly than models developed and hosted on foundation models in workspaces, whether on-premises or in the cloud, because they must address every use case, including a general-purpose chatbot. That generality incurs cost. For a more specific use case, a much smaller prompt suffices, and the model can also be fine-tuned by baking the instructions and expected structure into the model itself. Inference costs also rise with the number of input and output tokens, and SaaS services charge per token. A specific-use-case model can even be implemented by two engineers in one month, with a few thousand dollars of compute for training and experimentation, and tested by four human evaluators with an initial set of evaluation examples.
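To see how per-token billing dominates, here is a rough cost sketch; the prices, traffic volume and token counts are hypothetical and chosen only to illustrate the scaling:

```python
# Rough per-token cost comparison under assumed prices and traffic.
PRICE_PER_1K_INPUT = 0.01      # assumed SaaS price, USD per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.03     # assumed SaaS price, USD per 1K output tokens

requests_per_day = 50_000
input_tokens, output_tokens = 1_500, 300   # long general-purpose prompt vs. short answer

daily_saas_cost = requests_per_day * (
    input_tokens / 1000 * PRICE_PER_1K_INPUT
    + output_tokens / 1000 * PRICE_PER_1K_OUTPUT
)
print(f"General prompt, per-token billing: ${daily_saas_cost:,.0f}/day")   # $1,200/day

# A fine-tuned model with instructions baked in needs a far shorter prompt.
short_input = 200
daily_tuned_cost = requests_per_day * (
    short_input / 1000 * PRICE_PER_1K_INPUT
    + output_tokens / 1000 * PRICE_PER_1K_OUTPUT
)
print(f"Shorter, use-case-specific prompt: ${daily_tuned_cost:,.0f}/day")  # $550/day
```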

SaaS LLMs can be a matter of convenience. Developing a model from scratch involves significant commitment in terms of both data and computational resources, such as pre-training. Unlike fine-tuning, pre-training is the process of training a language model on a large corpus of data without using any prior knowledge or weights from an existing model. This scenario makes sense when the data is quite different from what off-the-shelf LLMs are trained on, when the domain is specialized compared to everyday language, when there must be full control over the training data in terms of security, privacy, and fit and finish for the model’s foundational knowledge base, or when there are business justifications to avoid available LLMs altogether.

Organizations must plan for the significant commitment and sophisticated tooling this requires. Libraries like PyTorch FSDP and DeepSpeed are needed for their distributed training capabilities when pretraining an LLM from scratch. Large-scale data preprocessing is required and involves distributed frameworks and infrastructure that can handle scale in data engineering. Training of an LLM cannot commence without a set of optimal hyperparameters. Since training involves high costs from long-running GPU jobs, resource utilization must be maximized. Training runs can be so long that GPU failures become more likely than under normal load, so close monitoring of the training process is essential. Saving model checkpoints regularly and evaluating on validation sets act as safeguards.
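A minimal checkpointing pattern that serves as such a safeguard is sketched below; the tiny model, random batches and checkpoint interval are placeholders for the real distributed setup (in practice FSDP or DeepSpeed would wrap the model and shard the optimizer state):

```python
# Periodic checkpointing for a long-running training job.
import torch
from torch import nn, optim

model = nn.Linear(512, 512)                       # stand-in for the real LLM
optimizer = optim.AdamW(model.parameters(), lr=3e-4)

def save_checkpoint(step: int, path: str = "checkpoint.pt") -> None:
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

for step in range(1, 10_001):
    optimizer.zero_grad()
    batch = torch.randn(8, 512)                   # placeholder batch
    loss = model(batch).pow(2).mean()             # placeholder objective
    loss.backward()
    optimizer.step()
    if step % 1_000 == 0:                         # safeguard against GPU failures
        save_checkpoint(step)
        # run validation here and log metrics for close monitoring
```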

Constant evaluation and monitoring of deployed large language models and generative AI applications are important because both the data and the environment might vary. There can be shifts in performance, accuracy or even the emergence of biases. Continuous monitoring helps with early detection and prompt response, which in turn keeps the models’ outputs relevant, appropriate and effective. Benchmarks help to evaluate models, but the variation in results can be large. This stems from a lack of ground truth. For example, it is difficult to evaluate summarization models with traditional NLP metrics such as BLEU and ROUGE, because the summaries generated might use completely different words or word order. Comprehensive evaluation standards are elusive for LLMs, and reliance on human judgment can be costly and time-consuming. The novel trend of “LLM-as-a-judge” still leaves unanswered questions about reflecting human preferences in terms of correctness, readability and comprehensiveness of the answers; reliability and reusability across different metrics; the use of different grading scales by different frameworks; and the applicability of the same evaluation metric across diverse use cases.

Finally, the system must be simplified for use with model serving to manage, govern and access models via unified endpoints that handle specific LLM requests.
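One hedged way to picture such a unified endpoint is a thin routing layer, sketched here with FastAPI; the in-memory registry is a stand-in for a governed model registry:

```python
# Unified serving endpoint that routes requests to whichever registered model a caller names.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

MODEL_REGISTRY = {                              # hypothetical registered models
    "chatbot": lambda prompt: f"[chatbot answer to] {prompt}",
    "judge": lambda prompt: f"[grade for] {prompt}",
}

class ServeRequest(BaseModel):
    prompt: str

@app.post("/serve/{model_name}")
def serve(model_name: str, req: ServeRequest):
    if model_name not in MODEL_REGISTRY:
        raise HTTPException(status_code=404, detail="model not registered")
    return {"model": model_name, "output": MODEL_REGISTRY[model_name](req.prompt)}
```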