Thursday, November 14, 2024

The drone machine learning experiments from previous articles require two deployment patterns: online inference and batch inference. Both demonstrate MLOps principles and best practices for developing, deploying, and monitoring machine learning models at scale. Development and deployment are distinct concerns: although the model may be containerized and retrieved for execution during deployment, it can be developed independently of how it is deployed. This separates the concerns of model development from the requirements of the online and batch workloads. Regardless of the technology stack and the underlying resources used during these two phases (typically they are created in the public cloud), this distinction serves the needs of the model as well.

For example, developing and training a model might require significant compute, but not as much as executing it for predictions and outlier detection, activities that are hallmarks of production environments. The workloads that consume the model might also vary, not just between batch and online processing but even from one batch processing stack to another. What stays the same are the common operations of collecting MELT data (metrics, events, logs, and traces) and the associated resources. These include a GitHub repository, Azure Active Directory, cost management dashboards, Key Vaults, and, in this case, Azure Monitor. Resources and the practices associated with them for security and performance are left out of this discussion; the standard DevOps guides from the public cloud providers call them out.
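
As a rough sketch of what that common MELT collection can look like in code, the snippet below emits structured logs from a scoring service to Azure Monitor through the opencensus-ext-azure exporter; the Application Insights connection string and the custom dimensions are placeholders:

    import logging
    from opencensus.ext.azure.log_exporter import AzureLogHandler

    logger = logging.getLogger("scoring")
    logger.setLevel(logging.INFO)
    # The Application Insights connection string is a placeholder.
    logger.addHandler(AzureLogHandler(
        connection_string="InstrumentationKey=<your-key>"))

    # Events and metrics ride along as custom dimensions on each record,
    # so one pipeline covers logs, events, and basic metrics alike.
    logger.info("prediction_served", extra={
        "custom_dimensions": {"model_version": "1.2.0", "latency_ms": 37}})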

Online workloads targeting the model via API calls usually require the model to be hosted in a container and exposed via API management services. Batch workloads, on the other hand, require an orchestration tool to coordinate the jobs consuming the model. Within the deployment phase, it is usual practice to host more than one environment, such as stage and production, both served by CI/CD pipelines that flow the model from development to its usage. A manual approval is required to advance the model from the stage environment to production. A well-developed model is usually a composite handling three distinct activities: making the prediction, determining data drift in the features, and detecting outliers in the features. Mature MLOps also includes processes for explainability, performance profiling, versioning, pipeline automation, and the like. Depending on the resources used for DevOps and the environment, typical artifacts include dockerfiles, templates, and manifests.
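
As a minimal sketch of such a composite, the scoring script below follows the Azure ML init()/run() convention; the artifact file names and the drift and outlier detector interfaces are hypothetical:

    import json
    import os
    from pathlib import Path

    import joblib
    import numpy as np

    model = drift_detector = outlier_detector = None

    def init():
        # Runs once when the container starts; AZUREML_MODEL_DIR is set by the runtime.
        # The artifact file names here are hypothetical.
        global model, drift_detector, outlier_detector
        model_dir = Path(os.environ["AZUREML_MODEL_DIR"])
        model = joblib.load(model_dir / "model.pkl")
        drift_detector = joblib.load(model_dir / "drift_detector.pkl")
        outlier_detector = joblib.load(model_dir / "outlier_detector.pkl")

    def run(raw_data):
        # One request exercises all three activities: predict, score drift, flag outliers.
        features = np.array(json.loads(raw_data)["data"])
        return {
            "predictions": model.predict(features).tolist(),
            "drift_score": float(drift_detector.score(features)),
            "outliers": outlier_detector.predict(features).tolist(),
        }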

While parts of this MLOps solution can be internalized by studios and launch platforms, organizations like to invest in compute, storage, and networking specific to their needs. Databricks, Kubernetes, and Azure ML workspaces are used for compute; storage accounts and datastores are used for storage; and diversified subnets are used for networking. Outbound internet connectivity from the code hosted and executed in MLOps is usually not required, but it can be provisioned by adding a NAT gateway to the subnet where the code is hosted.
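
As a rough sketch of that last point, assuming the azure-mgmt-network Python SDK and a pre-existing public IP, virtual network, and subnet (all names and the subscription ID below are placeholders):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    # Placeholder identifiers; substitute real values.
    SUB, RG, LOC = "<subscription-id>", "mlops-rg", "eastus"
    client = NetworkManagementClient(DefaultAzureCredential(), SUB)

    # Create the NAT gateway, attaching an existing standard public IP.
    nat = client.nat_gateways.begin_create_or_update(
        RG, "mlops-nat",
        {
            "location": LOC,
            "sku": {"name": "Standard"},
            "public_ip_addresses": [{
                "id": f"/subscriptions/{SUB}/resourceGroups/{RG}"
                      "/providers/Microsoft.Network/publicIPAddresses/mlops-nat-ip"}],
        },
    ).result()

    # Associate the gateway with the subnet hosting the MLOps compute,
    # giving that subnet outbound-only internet connectivity.
    client.subnets.begin_create_or_update(
        RG, "mlops-vnet", "training-subnet",
        {"address_prefix": "10.0.1.0/24", "nat_gateway": {"id": nat.id}},
    ).result()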

A Continuous Integration / Continuous Deployment (CI/CD) pipeline, ML tests, and model tuning become the responsibility of the development team, even when that team is folded into the business service team for faster turnaround in deploying artificial intelligence models to production. In-house automation and development of machine learning pipelines and monitoring systems do not compare to those from the public clouds, which make automation and programmability easier. That said, certain products remain popular despite the allure of the public cloud, for the following reasons:

First, event processing systems such as Apache Spark and Kafka find it easier to replace the Extract-Transform-Load solutions that proliferate with data warehouses. It is true that much of the training data for ML pipelines comes from a data warehouse, and ETL has worsened data duplication and drift, making it necessary to add workarounds in business logic. With a cleaner event-driven system, it becomes easier to migrate to immutable data, write-once business logic, and real-time data processing. Event processing systems are also easier to develop on-premises, even as staging, before deployment to the cloud is attempted.
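
As a minimal sketch of that event-driven pattern, the PySpark Structured Streaming job below reads an immutable event stream from Kafka and applies write-once business logic; the broker address, topic name, and schema are placeholders, and the spark-sql-kafka connector must be on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("drone-telemetry").getOrCreate()

    # Hypothetical schema for a drone telemetry event.
    schema = StructType([
        StructField("drone_id", StringType()),
        StructField("altitude", DoubleType()),
        StructField("velocity", DoubleType()),
    ])

    # Read the immutable event stream instead of an ETL extract.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "drone-telemetry")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Write-once business logic applied continuously to the live stream.
    query = (events.filter(col("altitude") > 0)
             .writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()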

Second, machine learning models are end products. They can be hosted in a variety of environments, not just the cloud. Some ML users would like to load the model into client applications, including those on mobile devices. The model-as-a-service option is rather narrow, and the model does not have to be made available over the internet in all cases, especially when the network hop is costly to real-time processing systems. IoT experts agree that streaming data from edge devices can be heavy enough in traffic that an online, on-premises system will outperform any public-cloud option. Internet TCP relays are on the order of 250-300 milliseconds, whereas the ingestion rate for real-time analysis can be upwards of thousands of events per second.
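
As a minimal sketch of the client-side option, assuming the model has been exported to ONNX (the file name and input shape are hypothetical), ONNX Runtime loads and scores it locally with no network hop:

    import numpy as np
    import onnxruntime as ort

    # Load the exported model locally; no service endpoint or internet hop is involved.
    session = ort.InferenceSession("drone_model.onnx")
    input_name = session.get_inputs()[0].name

    # Score a single feature vector on-device (the shape is hypothetical).
    features = np.random.rand(1, 8).astype(np.float32)
    outputs = session.run(None, {input_name: features})
    print(outputs[0])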

A workspace is needed to develop machine learning models regardless of the storage, compute, and other accessories. Azure Machine Learning provides an environment to create and manage the end-to-end life cycle of machine learning models. Its compatibility with open-source frameworks and platforms like PyTorch and TensorFlow makes it an effective all-in-one platform for integrating and handling data and models, which greatly relieves the onus on the business to develop new capabilities. Azure Machine Learning is designed for all skill levels, with advanced MLOps features as well as simple no-code model creation and deployment.
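
As a minimal sketch, the snippet below connects to such a workspace with the Azure ML Python SDK v2 (azure-ai-ml); the subscription, resource group, and workspace names are placeholders:

    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    # Connect to an existing workspace; identifiers below are placeholders.
    ml_client = MLClient(
        credential=DefaultAzureCredential(),
        subscription_id="<subscription-id>",
        resource_group_name="mlops-rg",
        workspace_name="drone-ml-workspace",
    )

    # List registered models to confirm the connection works.
    for m in ml_client.models.list():
        print(m.name, m.version)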


#codingexercise: CodingExercise-11-14-2024.docx

