Friday, November 15, 2024

The previous article discussed a specific use case: coordinating a UAV swarm to transition through virtual structures, with the suggestion that the structures need not be input by humans. They can be detected as objects in images from a library, then extracted and scaled. These objects form a sequence that can be passed along to the UAV swarm. This article explains the infrastructure needed to design a pipeline for UAV swarm control in this way, so that drones make continuous, smooth transitions from one meaningful structure to another, as if enacting animation flashcards.

A cloud infrastructure architecture to handle the above use case is designed with a layered approach and dedicated components for device connectivity, data ingestion, processing, storage, and analytics, utilizing features like scalable cloud services, edge computing, data filtering, and optimized data pipelines to efficiently manage the high volume and velocity of IoT data.

Compute, networking, and storage must be set up properly. For example, gateway devices must be used for data aggregation and filtering; reliable network connectivity with robust security mechanisms must be provided to secure data in transit; and load balancing must be used to distribute traffic across the cloud infrastructure. Availability zones, redundancy, and multiple regions might be leveraged for availability, business continuity, and disaster recovery. High-throughput data pipelines to receive large volumes of data from devices facilitate ingestion. Scalable storage solutions (like data lakes or databases) that handle large data volumes, data aging, and durability embody storage best practices. Advanced analytics tools for real-time insights and historical data analysis help with processing and analytics. Edge computing helps with the preparation or pre-processing of data closer to the source on edge devices to reduce bandwidth usage and improve response time. This also calls for mechanisms to filter out irrelevant data at the edge or upon ingestion to minimize data transfer to the cloud. Properly partitioning data to optimize query performance over large datasets tunes up the analytical stacks and pipelines. Selected cloud services for hosting the code, such as function apps, app services, and Kubernetes containers, can be used with elastic scaling capabilities to handle fluctuating data volumes. Finally, a security hardening review should implement robust security measures throughout the architecture, including device authentication, data encryption, and access control.

An Azure cloud infrastructure architecture blueprint for handling large volume IoT traffic typically includes: Azure IoT Hub as the central communication hub, Azure Event Hubs for high-throughput data ingestion, Azure Stream Analytics for real-time processing, Azure Data Explorer for large-scale data storage and analysis, and Azure IoT Edge for edge computing capabilities, all while incorporating robust security measures and proper scaling mechanisms to manage the high volume of data coming from numerous IoT devices.

A simplified organization to illustrate the flow might look like:

IoT Devices -> Azure IoT Hub -> Azure Event Hubs -> Azure Data Lake Storage -> Azure Machine Learning -> Azure Kubernetes Service (AKS) -> Azure API Management -> IoT Devices

Here, the drones act as the IoT devices and can include anything from sensors to cameras. They act as producers of real-time data and as consumers of predictions and recommendations. Secure communication protocols like MQTT and CoAP might be leveraged to stream the data from edge computing data senders and relayers. Also, device management and provisioning services are required to maintain the inventory of IoT devices.

An Azure Device Provisioning Service (DPS) can enable zero-touch provisioning of new devices added to the IoT Hub, simplifying device onboarding.

The Azure IoT Hub acts as the central message hub for bi-directional communication between IoT applications and the drones it manages. It can handle millions of messages per second from multiple devices.
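
A minimal device-side sketch, assuming the azure-iot-device Python SDK and a hypothetical connection string, might send telemetry to the IoT Hub (over MQTT by default) like this:

import json
import time
from azure.iot.device import IoTHubDeviceClient, Message

# Hypothetical connection string, obtained from DPS or the IoT Hub portal
CONN_STR = "HostName=<hub>.azure-devices.net;DeviceId=drone-001;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONN_STR)
client.connect()
try:
    for _ in range(10):
        telemetry = {"lat": 47.64, "lon": -122.13, "alt": 120.5, "battery": 0.87}
        msg = Message(json.dumps(telemetry))
        msg.content_type = "application/json"
        msg.content_encoding = "utf-8"
        client.send_message(msg)  # device-to-cloud message
        time.sleep(1)
finally:
    client.shutdown()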

Azure Event Hubs is used for ingesting large amounts of data from IoT devices. It can process and store large streams of data, which can then be fed into Azure Machine Learning for processing.
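
A consumer-side sketch, assuming the azure-eventhub Python package and hypothetical namespace and hub names, could read the ingested stream as follows:

from azure.eventhub import EventHubConsumerClient

# Hypothetical namespace connection string and hub name
CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>"

def on_event(partition_context, event):
    # Each event carries one telemetry payload from a drone
    print(partition_context.partition_id, event.body_as_str())
    partition_context.update_checkpoint(event)

consumer = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name="drone-telemetry")
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")  # "-1" = from the beginning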

Azure Machine Learning is where machine learning models are trained and deployed at scale.

Azure Data Lake Storage is used to store and organize large volumes of data until it is needed. The storage cost is low, but certain features can accrue cost on an hourly basis when turned on, such as the SFTP-enabled feature, even if they are never used. With proper care, Azure Data Lake Storage can act as a little-to-no-cost sink for all the streams of data, with convenient access for all analytical stacks and pipelines.
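
For illustration, a small sketch using the azure-storage-file-datalake package (account name, key, and paths are hypothetical) can organize per-drone folders and land telemetry files:

from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account URL and key; hierarchical namespace assumed enabled
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net", credential="<key>")
fs = service.get_file_system_client("fleet-data")  # file system assumed to exist

# One folder per drone per day keeps the fleet's activities organized
directory = fs.get_directory_client("drone-001/2024-11-15")
directory.create_directory()
file_client = directory.get_file_client("telemetry.json")
file_client.upload_data(b'{"alt": 120.5}', overwrite=True)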

Azure Kubernetes Service is used to deploy and manage containerized applications, including machine learning models. It provides a scalable and flexible environment for running the models.

Azure API Management is used to expose the machine learning models as APIs, making it easy for IoT devices to interact with them.
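
A caller-side sketch, assuming a hypothetical APIM-fronted scoring URL and subscription key, might look like:

import json
import requests

url = "https://<apim-name>.azure-api.net/score"  # hypothetical endpoint path
headers = {
    "Ocp-Apim-Subscription-Key": "<key>",  # APIM subscription key header
    "Content-Type": "application/json",
}
payload = {"positions": [[1.0, 2.0, 3.0]]}  # illustrative input shape
resp = requests.post(url, headers=headers, data=json.dumps(payload), timeout=10)
print(resp.json())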

Azure Monitor and Azure Log Analytics are used to monitor the performance and health of the IoT devices, data pipelines, and machine learning models.

#codingexercise: Codingexercise-11-15-2024.docx

Thursday, November 14, 2024

The drone machine learning experiments from previous articles require deployment patterns of two types: online inference and batch inference. Both demonstrate MLOps principles and best practices for developing, deploying, and monitoring machine learning models at scale. Development and deployment are distinct from one another, and although the model may be containerized and retrieved for execution during deployment, it can be developed independently of how it is deployed. This separates the concerns of model development from the requirements of the online and batch workloads. Regardless of the technology stack and the underlying resources used during these two phases (typically, they are created in the public cloud), this distinction serves the needs of the model as well.

For example, developing and training a model might require significant compute, but not as much as executing it for predictions and outlier detection, activities that are hallmarks of production environments. Even the workloads that use the model may vary, not just between batch and online processing but from one batch processing stack to another; yet the common operations of collecting MELT data (metrics, events, logs, and traces) and the associated resources stay the same. These include a GitHub repository, Azure Active Directory, cost management dashboards, Key Vaults, and, in this case, Azure Monitor. Resources and the practices associated with them for security and performance are left out of this discussion; the standard DevOps guides from the public cloud providers call them out.

Online workloads targeting the model via API calls usually require the model to be hosted in a container and exposed via API management services. Batch workloads, on the other hand, require an orchestration tool to coordinate the jobs consuming the model. Within the deployment phase, it is usual practice to host more than one environment, such as stage and production, both served by CI/CD pipelines that flow the model from development to usage. A manual approval is required to advance the model from the stage to the production environment. A well-developed model is usually a composite handling three distinct activities: making the prediction, determining data drift in the features, and determining outliers in the features. Mature MLOps also includes processes for explainability, performance profiling, versioning, pipeline automation, and the like. Depending on the resources used for DevOps and the environment, typical artifacts include dockerfiles, templates, and manifests.
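
As a sketch of the online path, assuming the azure-ai-ml (SDK v2) package, an MLflow-format model, and hypothetical workspace names, a managed online endpoint and deployment could be created like this:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, Model
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(),
                     "<subscription-id>", "<resource-group>", "<workspace>")

# Create the endpoint that API Management would front
endpoint = ManagedOnlineEndpoint(name="drone-scoring", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# MLflow-format model, so no scoring script or environment is needed
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="drone-scoring",
    model=Model(path="./model", type="mlflow_model"),  # hypothetical local folder
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()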

While parts of the solution for this MLOps can be internalized by studios and launch platforms, organizations like to invest in specific compute, storage, and networking for their needs. Databricks/Kubernetes, Azure ML workspaces and such are used for compute, storage accounts and datastores are used for storage, and diversified subnets are used for networking. Outbound internet connectivity from the code hosted and executed in MLOps is usually not required but it can be provisioned with the addition of a NAT gateway within the subnet where it is hosted.

A Continuous Integration / Continuous Deployment (CI/CD) pipeline, ML tests, and model tuning become responsibilities of the development team, even though they are folded into the business service team for faster turnaround in deploying artificial intelligence models to production. In-house automation and development of machine learning pipelines and monitoring systems do not compare to those from the public clouds, which make automation and programmability easier. That said, certain products remain popular despite the allure of the public cloud, for the following reasons:

First, event processing systems such as Apache Spark and Kafka find it easier to replace the Extract-Transform-Load solutions that proliferate with data warehouses. It is true that much of the training data for ML pipelines comes from a data warehouse, and ETL worsened data duplication and drift, making it necessary to add workarounds in business logic. With a cleaner event-driven system, it becomes easier to migrate to immutable data, write-once business logic, and real-time data processing. Event processing systems are also easier to develop on-premises, even as staging, before deployment to the cloud is attempted.
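
A minimal Spark Structured Streaming sketch, assuming a hypothetical Kafka broker and topic, shows the write-once, event-driven pattern:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

spark = SparkSession.builder.appName("drone-events").getOrCreate()

schema = StructType([
    StructField("drone_id", StringType()),
    StructField("alt", DoubleType()),
])

# Read the event stream instead of running batch ETL against a warehouse
events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
    .option("subscribe", "drone-telemetry")
    .load())

parsed = events.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Append-only parquet output: immutable data, written once
query = (parsed.writeStream.format("parquet")
    .option("path", "/data/telemetry")
    .option("checkpointLocation", "/data/chk")
    .start())
query.awaitTermination()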

Second, machine learning models are end products. They can be hosted in a variety of environments, not just the cloud. Some ML users would like to load the model into client applications, including those on mobile devices. The model-as-a-service option is rather narrow and does not have to be made available over the internet in all cases, especially when the network hop is costly to real-time processing systems. Many IoT practitioners and experts agree that the streaming data from edge devices can be heavy enough in traffic that an online on-premises system will outperform any public-cloud option. Internet TCP relays are on the order of 250-300 milliseconds, whereas the ingestion rate for real-time analysis can be upwards of thousands of events per second.

A workspace is needed to develop machine learning models, regardless of the storage, compute, and other accessories. Azure Machine Learning provides an environment to create and manage the end-to-end life cycle of machine learning models. Its compatibility with open-source frameworks and platforms like PyTorch and TensorFlow makes it an effective all-in-one platform for integrating and handling data and models, which tremendously relieves the onus on the business to develop new capabilities. Azure Machine Learning is designed for all skill levels, with advanced MLOps features as well as simple no-code model creation and deployment.
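
A short sketch, assuming the azure-ai-ml package and hypothetical workspace, compute, and code-folder names, connects to the workspace and submits a training job:

from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(),
                     "<subscription-id>", "<resource-group>", "<workspace>")

job = command(
    code="./src",                 # hypothetical folder containing train.py
    command="python train.py",
    # A curated environment name; substitute one available in your workspace
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",        # hypothetical compute target
)
ml_client.jobs.create_or_update(job)  # queues the training run in the workspace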


#codingexercise: CodingExercise-11-14-2024.docx


Wednesday, November 13, 2024

The previous article discussed a specific use case: coordinating a UAV swarm to transition through virtual structures, with the suggestion that the structures need not be input by humans. They can be detected as objects in images from a library, then extracted and scaled. These objects form a sequence that can be passed along to the UAV swarm. This article describes the data processing pipeline for UAV swarm control in this way, so that drones make continuous, smooth transitions from one meaningful structure to another, as if enacting animation flashcards.

The data processing begins with the user uploading images to cloud storage, say a data lake, which also stores all the data from the drones as necessary. This feeds an Event Grid so that suitably partitioned processing, say one partition per drone in the fleet, can crunch the current and desired positions in each epoch, along with recommendations from a machine learning model to correct and reduce the sum of squared errors in the overall smoothness of the structure transitions. The results are then vectorized, saved in a vector store, and utilized with a monitoring stack to track performance with key metrics and ensure the overall system remains continuously healthy to control the UAV swarm.
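
A toy numpy sketch of the per-epoch objective (positions here are random stand-ins) computes the sum of squared errors that the model would try to reduce:

import numpy as np

# Current and desired positions for N drones in one epoch, shape (N, 3)
current = np.random.rand(8, 3) * 10
desired = np.random.rand(8, 3) * 10

errors = desired - current
sse = float(np.sum(errors ** 2))            # objective to minimize across epochs
per_drone = np.linalg.norm(errors, axis=1)  # remaining distance for each drone
print(sse, per_drone)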

This makes the processing stack look something like this:

[User Uploads Image] -> [Azure Blob Storage] -> [Azure Event Grid] -> [Azure Functions] -> [Azure Machine Learning] -> [Azure Cosmos DB] -> [Monitoring]

where the infrastructure consists of:

Azure Blob Storage: Stores raw image data and processed results. When the hierarchical namespace is enabled, folders come in helpful to organize the fleet, their activities, and feedback.

Azure Functions: Serverless functions handle image processing tasks. The idea here is to define pure logic that is partitioned on the data and can scale to arbitrary loads (a minimal trigger sketch follows this list).

Azure Machine Learning: Manages machine learning models and deployments. The Azure Machine Learning Studio allows us to view the pipeline graph, check its output, and debug it. The logs and outputs of each component are available for study. Optionally, components can be registered to the workspace so they can be shared and reused. A pipeline draft connects the components, and a pipeline run can be submitted using the resources in the workspace. Training pipelines can be converted to inference pipelines, and pipelines can be published so that a new run can be submitted with different parameters and datasets. A training pipeline can be reused for different models, and a batch inference pipeline can be used to make predictions on new data.

Azure Event Grid: Triggers events based on image uploads, user directives, or drone feedback.

Azure Cosmos DB: Stores metadata and processed results, and makes them suitable for vector search.

Azure API Management: Manages incoming image upload requests and outgoing processed results, with OWASP protection.

Azure Monitor: Tracks performance metrics and logs events for troubleshooting.
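
As referenced above, a minimal Azure Functions sketch, assuming the Python v2 programming model and an Event Grid subscription on the blob container, might look like:

import json
import azure.functions as func

app = func.FunctionApp()

@app.event_grid_trigger(arg_name="event")
def process_image(event: func.EventGridEvent):
    # Fired when a new image lands in Blob Storage (subscription assumed configured)
    payload = event.get_json()
    blob_url = payload.get("url")
    # ... download the blob, run object detection, write results to Cosmos DB ...
    print(json.dumps({"processed": blob_url}))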


Tuesday, November 12, 2024

Among the control methods for UAV swarms, Dynamic Formation Changes is the one holding the most promise for morphing from one virtual structure to another. When there is no outside influence or data-driven flight management, coming up with the next virtual structure is an easier articulation for the swarm pilot.

It is usually helpful to plan out two or three virtual structures in advance for a UAV swarm to seamlessly morph from one holding position to another. These macro and micro movements can even be delegated to humans and the UAV swarm respectively, because given initial and final positions, the autonomous UAVs can make tactical moves efficiently while the humans generate the overall workflow in the absence of a three-dimensional GPS-based map.

Virtual structure generation can even be synthesized from images, with object detection and appropriate scaling, so virtual structures are not necessarily input by humans. In a perfect world, UAV swarms launch from a packed formation, take positions in a matrix in the air, and then morph from one position to another given the signals they receive.
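
A hedged sketch, assuming OpenCV and a hypothetical input image, extracts an object outline and scales it into drone waypoints:

import cv2
import numpy as np

img = cv2.imread("shape.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
_, thresh = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Take the largest detected object's outline as the virtual structure
outline = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(float)

# Scale pixel coordinates to a 100 m x 100 m flight area
outline -= outline.min(axis=0)
outline *= 100.0 / outline.max()

# Subsample the outline to one waypoint per drone
num_drones = 20
idx = np.linspace(0, len(outline) - 1, num_drones).astype(int)
waypoints = outline[idx]
print(waypoints)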

There are several morphing algorithms that reduce the distances between initial and final positions of the drones during transition between virtual structures. These include but are not limited to:

1. Thin-plate splines, aka the TPS algorithm: adapts to minimize deformation of the swarm's formation while avoiding obstacles. It uses a non-rigid mapping function to reduce lag caused by maneuvers.

2. Non-rigid mapping function: this function helps reduce the lag caused by maneuvers, making the swarm more responsive and energy efficient.

3. Distributed assignment and optimization protocol: this protocol enables UAV swarms to construct and reconfigure formations dynamically as the number of UAVs changes.

4. Consensus-based algorithms: these algorithms allow UAVs to agree on specific parameters such as position, velocity, or direction, ensuring cohesive movement as a unit.

5. Leader-follower method: this method involves a designated leader UAV guiding the formation, with the other UAVs following its path.

The essential idea behind the transition can be listed as the following steps:

1. Select random control points

2. Create a grid and use TPS to interpolate values on this grid

3. Visualize the original control points and the interpolated surface.

A sample Python implementation might look like so:

import numpy as np
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt

# Define the control points
x = np.random.rand(10) * 10
y = np.random.rand(10) * 10
z = np.sin(x) + np.cos(y)

# Create the TPS interpolator
tps = Rbf(x, y, z, function='thin_plate')

# Define a grid for interpolation
x_grid, y_grid = np.meshgrid(np.linspace(0, 10, 100), np.linspace(0, 10, 100))
z_grid = tps(x_grid, y_grid)

# Plot the original points and the TPS interpolation
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, color='red', label='Control Points')
ax.plot_surface(x_grid, y_grid, z_grid, cmap='viridis', alpha=0.6)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
ax.legend()
plt.show()

Reference: previous post

Monday, November 11, 2024

In infrastructure engineering, the control plane and data plane serve different purposes, and engineers often want to manage only those entities that are finite, bounded, and open to management and monitoring. If the number far exceeds what can be managed, it is better to separate resources from data. For example, when there are several drones to be inventoried and managed for interaction with cloud services, it is not necessary to create a pseudo-resource representing each drone. Instead, a composite cloud resource set representing a management object can be created for the fleet, and almost all of the drones can be kept as data in a corresponding database maintained by that object. Let us go deep into this example for controlling UAV swarm movement via cloud resources.
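
As an illustrative sketch, assuming the azure-cosmos Python package and hypothetical account credentials, the fleet is the one managed container while each drone is just an item of data:

from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical endpoint and key; one container backs the whole fleet resource
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists(id="fleet")
container = db.create_container_if_not_exists(
    id="drones", partition_key=PartitionKey(path="/fleetId"))

# Each drone is a row of data, not a cloud resource of its own
container.upsert_item({
    "id": "drone-001",
    "fleetId": "swarm-alpha",
    "model": "quadX",
    "status": "active",
})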

First, an overview of the control methods is necessary. There are several, such as leader-follower, virtual structure, behavior-based, consensus-based, artificial potential field, and advanced AI-based methods (like artificial neural networks and deep reinforcement learning).

Each approach has advantages and limitations: conventional methods offer reliability and simplicity, while AI-based strategies provide adaptability and sophisticated optimization capabilities.

There is a critical need for innovative solutions and interdisciplinary approaches combining conventional and AI methods to overcome existing challenges and fully exploit the potential of UAV swarms in various applications. The infrastructure and solution accelerator stacks must therefore enable switching from one AI model to another, or even changing direction from one control strategy to another.

Real-world applications for UAV swarms are not only present; they are the future. This case study is thus justified by wide-ranging applications across fields such as military affairs, agriculture, search and rescue operations, environmental monitoring, and delivery services.

Now, for a little more detail on the control methods to select one that we can leverage for a cloud representation. These control methods include:

Leader-Follower Method: This method involves a designated leader UAV guiding the formation, with other UAVs following its path. It's simple and effective but can be limited by the leader's capabilities.

Virtual Structure Method: UAVs maintain relative positions to a virtual structure, which moves according to the desired formation. This method is flexible but requires precise control algorithms.

Behavior-Based Method: UAVs follow simple rules based on their interactions with neighboring UAVs, mimicking natural swarm behaviors. This method is robust but can be unpredictable in complex scenarios.

Consensus-Based Method: UAVs communicate and reach a consensus on their positions to form the desired shape. This method is reliable and scalable but can be slow in large swarms.

Artificial Potential Field Method: UAVs are guided by virtual forces that attract them to the desired formation and repel them from obstacles. This method is intuitive but can suffer from local minima issues.

Artificial Neural Networks (ANN): ANN-based methods use machine learning to adaptively control UAV formations. These methods are highly adaptable but require significant computational resources.

Deep Reinforcement Learning (DRL): DRL-based methods use advanced AI techniques to optimize UAV swarm control. These methods are highly sophisticated and can handle complex environments but are computationally intensive.

Out of these, the virtual structure method inherently leverages both the drones' capability to find appropriate positions on the virtual structure and their ability to limit their movements in reaching their final position and orientation.

Some specific examples and details include:

Example 1: Circular Formation

Scenario: UAVs need to form a circular pattern.

Method: A virtual structure in the shape of a circle is defined. Each UAV maintains a fixed distance from this virtual circle, effectively forming a circular formation around it.

Advantages: This method is simple and intuitive, making it easy to implement and control.

Example 2: Line Formation

Scenario: UAVs need to form a straight line.

Method: A virtual structure in the shape of a line is defined. Each UAV maintains a fixed distance from this virtual line, forming a straight line formation.

Advantages: This method is effective for tasks requiring linear arrangements, such as search and rescue operations.

Example 3: Complex Shapes

Scenario: UAVs need to form complex shapes like a star or polygon.

Method: A virtual structure in the desired complex shape is defined. Each UAV maintains a fixed distance from this virtual structure, forming the complex shape.

Advantages: This method allows for the creation of intricate formations, useful in tasks requiring precise positioning.

Example 4: Dynamic Formation Changes

Scenario: UAVs need to change formations dynamically during a mission.

Method: The virtual structure is updated in real-time according to the mission requirements, and UAVs adjust their positions accordingly.

Advantages: This method provides flexibility and adaptability, essential for dynamic and unpredictable environments.
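
To make the transition step concrete, here is a small sketch (names and toy data are illustrative) that assigns each drone to a slot on the next virtual structure using scipy's Hungarian-algorithm solver, minimizing total squared travel:

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_targets(current, target):
    """Match each drone to a slot on the next virtual structure,
    minimizing the total squared travel distance."""
    # cost[i, j] = squared distance from drone i to slot j
    cost = ((current[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return target[cols]  # goal position for each drone, in drone order

# Toy example: morph 12 drones from a circle to a line at 30 m altitude
n = 12
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
circle = np.c_[np.cos(theta) * 20, np.sin(theta) * 20, np.full(n, 30.0)]
line = np.c_[np.linspace(-20, 20, n), np.zeros(n), np.full(n, 30.0)]
goals = assign_targets(circle, line)
print(goals)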


Sunday, November 10, 2024

 Chief executives are increasingly demanding that their technology investments, including data and AI, work harder and deliver more value to their organizations. Generative AI offers additional tools to achieve this, but it adds complexity to the challenge. CIOs must ensure their data infrastructure is robust enough to cope with the enormous data processing demands and governance challenges posed by these advances. Technology leaders see this challenge as an opportunity for AI to deliver considerable growth to their organizations, both in terms of top and bottom lines. While many business leaders have stated in public that it is especially important for AI projects to help reduce costs, they also say that it is important that these projects enable new revenue generation. Gartner forecasts worldwide IT spending to grow by 4.3% in 2023 and 8.8% in 2024 with most of that growth concentrated in the software category that includes spending on data and AI. AI-driven efficiency gains promise business growth, with 81% expecting a gain greater than 25% and 33% believing it could exceed 50%. CDOs and CTOs are echoing: “If we can automate our core processes with the help of self-learning algorithms, we’ll be able to move much faster and do more with the same amount of people.” and “Ultimately for us it will mean automation at scale and at speed.”

Organizations are increasingly prioritizing AI projects due to economic uncertainty and the increasing popularity of generative AI. Technology leaders are focusing on longer-range projects that will have significant impact on the company, rather than pursuing short-term projects. This is due to the rapid pace of use cases and proofs of concept coming at businesses, making it crucial to apply existing metrics and frameworks rather than creating new ones. The key is to ensure that the right projects are prioritized, considering the expected business impact, complexity, and cost of scaling.

Data infrastructure and AI systems are becoming increasingly intertwined due to the enormous demands placed on data collection, processing, storage, and analysis. As financial services companies like Razorpay grow, their data infrastructure needs to be modernized to accommodate the growing volume of payments and the need for efficient storage. Advances in AI capabilities, such as generative AI, have increased the urgency to modernize legacy data architectures. Generative AI and the LLMs that support it will multiply workload demands on data systems and make tasks more complex. The implications of generative AI for data architecture include feeding unstructured data into models, storing long-term data in ways conducive to AI consumption, and putting adequate security around models. Organizations supporting LLMs need a flexible, scalable, and efficient data infrastructure. Many have claimed success with the adoption of Lakehouse architecture, combining features of data warehouse and data lake architecture. This architecture helps scale responsibly, with a good balance of cost versus performance. As one data leader observed “I can now organize my law enforcement data, I can organize my airline checkpoint data, I can organize my rail data, and I can organize my inspection data. And I can at the same time make correlations and glean understandings from all of that data, separate and together.”

Data silos are a significant challenge for data and technology executives, as they result from the disparate approaches taken by different parts of organizations to store and protect data. The proliferation of data, analytics, and AI systems has added complexity, resulting in a myriad of platforms, vast amounts of data duplication, and often separate governance models. Most organizations employ fewer than 10 data and AI systems, but the proliferation is most extensive in the largest ones. To simplify, organizations aim to consolidate the number of platforms they use and seamlessly connect data across the enterprise. Companies like Starbucks are centralizing data by building cloud-centric, domain-specific data hubs, while GM's data and analytics team is focusing on reusable technologies to simplify infrastructure and avoid duplication. Additionally, organizations need space to innovate, which can be achieved by having functions that manage data and those that involve greenfield exploration.

#codingexercise: CodingExercise-11-10-2024.docx


Saturday, November 9, 2024

One of the fundamentals of parallel processing in computer science is the separation of tasks per worker to reduce contention. When the worker is an autonomous drone with minimal coordination with the other members of its fleet, an independent task might look like installing a set of solar panels, in an industry estimated at 239 GW of global solar-powered renewable energy in 2023, a 45% increase over the previous year. As the industry expands, drones are employed for their speed. Drones aid in every stage of a plant's lifecycle from planning to maintenance. They can assist in topographic surveys during planning, monitor construction progress, conduct commissioning inspections, and perform routine asset inspections for operations and maintenance. Drone data collection is not only comprehensive and expedited but also accurate.

During planning for solar panels, drones can conduct aerial surveys to assess topography, suitability, and potential obstacles, create accurate 3D maps to aid in designing and optimizing solar farm layouts, and analyze shading patterns to optimize panel placement and maximize energy production. During construction, drones provide visual updates on construction progress, and track and manage inventory of equipment, tools, and materials on-site. During maintenance, drones can perform close-up inspections of solar panels to identify defects, damage, or dirt buildup, monitor equipment for wear and tear, detect hot spots in panels with thermal imaging, identify and manage vegetation growth that might reduce the efficiency of solar panels and enhance security by patrolling the perimeter and alerting to unauthorized access.

When drones become autonomous, these activities go to the next level. The dependency on human pilots has always limited the frequency of flights. Autonomous drones, on the other hand, boost efficiency, shorten fault detection times, and optimize outcomes during O&M site visits. Finally, they help increase the power output yield of solar farms. The sophistication of the drones, in terms of hardware and software, increases from remote-controlled drones to autonomous drones. Field engineers might suggest the selection of an appropriate drone as well as the position of docking stations, the payload such as a thermal camera, and other capabilities. A drone data platform that seamlessly facilitates data capture, ensures safe flight operations with minimal human intervention, prioritizes data security, and meets compliance requirements becomes essential at this stage. This platform must also support integration with third-party data processing and analytics applications and with reporting stacks that publish various charts and graphs. As usual, a separation between data processing and data analytics helps, just as much as a unified layer for programmability and user interaction with API, SDK, UI, and CLI. While the platform can be sold separately as a product, leveraging a cloud-based SaaS service reduces the cost at the edge.

There is still another improvement possible over this: the formation of dynamic squadrons, consensus protocols, and distributed processing with hash stores. While there are existing applications that improve IoT data streaming at the edges and cloud processing via stream stores and analytics, with the simplicity of SQL-based querying and programmability, a cloud service that installs and operates a deployment stamp as a solution accelerator and as a citizen resource of a public cloud helps bring in the best practices of storage engineering and data engineering, letting businesses stay focused.
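
As a toy illustration of the hash-store idea (all names hypothetical), a consistent-hash ring can pin each drone's stream to a squadron processor while keeping most assignments stable as processors join or leave:

import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring mapping drone ids to squadron processors."""
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, node) virtual points
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Virtual replicas spread each node evenly around the ring
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    def get(self, key):
        # Walk clockwise to the first virtual point at or after the key's hash
        h = self._hash(key)
        keys = [k for k, _ in self.ring]
        idx = bisect.bisect(keys, h) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["proc-a", "proc-b", "proc-c"])
print(ring.get("drone-001"))  # the processor responsible for this drone's stream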