Thursday, November 14, 2024

The drone machine learning experiments from previous articles require two deployment patterns – online inference and batch inference. Both demonstrate MLOps principles and best practices for developing, deploying, and monitoring machine learning models at scale. Development and deployment are distinct phases: although the model may be containerized and retrieved for execution during deployment, it can be developed independently of how it is deployed. This separates the concerns of model development from the requirements of the online and batch workloads. This distinction also serves the needs of the model regardless of the technology stack and underlying resources used in the two phases, which are typically created in the public cloud.

For example, developing and training a model might require significant compute, but not as much as executing it for predictions and outlier detection, activities that are hallmarks of production environments. The workloads that make use of the model may also vary, not just between batch and online processing but even from one batch processing stack to another, yet the common operations of collecting MELT data – named after metrics, events, logs, and traces – and the associated resources stay the same. These include a GitHub repository, Azure Active Directory, cost management dashboards, Key Vaults, and, in this case, Azure Monitor. Resources and the practices associated with them for security and performance are left out of this discussion; the standard DevOps guides from the public cloud providers call them out.

Online workloads targeting the model via API calls usually require the model to be hosted in a container and exposed via API management services. Batch workloads, on the other hand, require an orchestration tool to co-ordinate the jobs consuming the model. Within the deployment phase, it is usual practice to host more than one environment, such as stage and production – both served by CI/CD pipelines that flow the model from development to its usage. A manual approval is required to advance the model from the stage environment to production. A well-developed model is usually a composite handling three distinct activities – making the prediction, determining data drift in the features, and detecting outliers in the features. Mature MLOps also includes processes for explainability, performance profiling, versioning, pipeline automation, and the like. Depending on the resources used for DevOps and the environment, typical artifacts include dockerfiles, templates, and manifests.
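As a sketch of what such a composite model might look like, the class below bundles prediction, drift detection, and outlier detection behind one scoring entry point. The names and the z-score heuristics are illustrative assumptions, not a specific Azure ML pattern:

```python
import numpy as np

# Hypothetical composite scorer: class and method names are illustrative.
class CompositeModel:
    def __init__(self, weights, train_mean, train_std):
        self.weights = np.asarray(weights)        # stand-in for a trained model
        self.train_mean = np.asarray(train_mean)  # feature stats captured at training time
        self.train_std = np.asarray(train_std)

    def predict(self, features):
        return float(np.dot(features, self.weights))

    def drift_score(self, features):
        # z-score of incoming features against training statistics;
        # large values hint at data drift
        return float(np.max(np.abs((features - self.train_mean) / self.train_std)))

    def is_outlier(self, features, threshold=3.0):
        return self.drift_score(features) > threshold

    def score(self, features):
        # single entry point covering all three model activities
        features = np.asarray(features, dtype=float)
        return {
            "prediction": self.predict(features),
            "drift": self.drift_score(features),
            "outlier": self.is_outlier(features),
        }

model = CompositeModel(weights=[0.5, -0.2], train_mean=[1.0, 2.0], train_std=[0.5, 0.5])
result = model.score([1.2, 1.8])
```

A real deployment would swap the weights for a registered model and the z-score checks for dedicated drift and outlier detectors, but the composite shape of the response stays the same.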

While parts of this MLOps solution can be internalized by studios and launch platforms, organizations like to invest in specific compute, storage, and networking for their needs. Databricks, Kubernetes, Azure ML workspaces, and the like are used for compute; storage accounts and datastores for storage; and diversified subnets for networking. Outbound internet connectivity from the code hosted and executed in MLOps is usually not required, but it can be provisioned by adding a NAT gateway to the subnet where the code is hosted.

A Continuous Integration / Continuous Deployment (CI/CD) pipeline, ML tests, and model tuning become responsibilities of the development team, even though they are folded into the business service team for faster turnaround in deploying artificial intelligence models to production. In-house automation and development of machine learning pipelines and monitoring systems do not compare to those from the public clouds, which make automation and programmability easier. That said, certain products remain popular for specific reasons despite the allure of the public cloud:

First, event processing systems such as Apache Spark and Kafka find it easier to replace the Extract-Transform-Load solutions that proliferate around data warehouses. It is true that much of the training data for ML pipelines comes from a data warehouse, and ETL worsened data duplication and drift, making workarounds in business logic necessary. With a cleaner event-driven system, it becomes easier to migrate to immutable data, write-once business logic, and real-time data processing. Event processing systems are also easier to develop on-premises as staging before they are deployed to the cloud.
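The immutable-data, write-once style described above can be illustrated with a minimal append-only event log; the event shape and class names are hypothetical:

```python
from dataclasses import dataclass
from typing import List

# Events are immutable records; state is always derived, never updated in place.
@dataclass(frozen=True)
class Event:
    entity_id: str
    kind: str
    value: float

class EventLog:
    def __init__(self):
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        # append-only: no update or delete path exists
        self._events.append(event)

    def current_state(self, entity_id: str) -> float:
        # write-once business logic: fold over the full immutable history
        return sum(e.value for e in self._events if e.entity_id == entity_id)

log = EventLog()
log.append(Event("sensor-1", "reading", 2.0))
log.append(Event("sensor-1", "reading", 3.5))
balance = log.current_state("sensor-1")
```

In a real event-driven stack the log would be a Kafka topic or a stream store, but the pattern – derive state from an immutable history rather than mutating rows – is the same one that removes the ETL workarounds mentioned above.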

Second, machine learning models are end products. They can be hosted in a variety of environments, not just the cloud. Some ML users would like to load the model into client applications, including those on mobile devices. The model-as-a-service option is rather narrow and does not have to be made available over the internet in all cases, especially when the network hop is costly to real-time processing systems. Many experts agree that streaming data from IoT edge devices can be heavy enough in traffic that an online on-premises system will outperform any public-cloud option: internet TCP relays are on the order of 250-300 milliseconds, whereas the ingestion rate for real-time analysis can be upwards of thousands of events per second.

A workspace is needed to develop machine learning models regardless of the storage, compute, and other accessories. Azure Machine Learning provides an environment to create and manage the end-to-end life cycle of machine learning models. Its compatibility with open-source frameworks and platforms like PyTorch and TensorFlow makes it an effective all-in-one platform for integrating and handling data and models, which tremendously relieves the onus on the business to develop new capabilities. Azure Machine Learning is designed for all skill levels, with advanced MLOps features as well as simple no-code model creation and deployment.


#codingexercise: CodingExercise-11-14-2024.docx


Wednesday, November 13, 2024

The previous article discussed a specific use case of coordinating a UAV swarm to transition through virtual structures, with the suggestion that the structures need not be input by humans. They can be detected as objects in images from a library, extracted, and scaled. These objects form a sequence that can be passed along to the UAV swarm. This article explains the infrastructure needed to design a pipeline for UAV swarm control in this way, so that drones form continuous and smooth transitions from one meaningful structure to another, as if enacting an animation flashcard.

The data processing begins with a user uploading images to cloud storage, say a data lake, which also stores all the data from the drones as necessary. This feeds an Event Grid so that suitably partitioned processing, say one partition per drone in the fleet, can compute the current and desired positions in each epoch, along with recommendations from a machine learning model to correct and reduce the sum of squares of errors from the overall smoothness of the structure transitions. The result is then vectorized, saved in a vector store, and used with a monitoring stack to track performance with key metrics and ensure that the overall system remains continuously healthy to control the UAV swarm.
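The per-epoch error that this pipeline tries to reduce can be made concrete with a small sketch; the array shapes and function name are assumptions:

```python
import numpy as np

# For each drone, the squared distance between its current and desired position;
# the aggregate sum of squares is the quantity the pipeline tries to reduce.
def transition_error(current, desired):
    current = np.asarray(current, dtype=float)   # shape (n_drones, 3)
    desired = np.asarray(desired, dtype=float)   # shape (n_drones, 3)
    per_drone = np.sum((current - desired) ** 2, axis=1)
    return per_drone, float(np.sum(per_drone))

current = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
desired = [[0.0, 0.0, 1.0], [1.0, 2.0, 0.0]]
per_drone, total = transition_error(current, desired)
```

Because the computation is independent per drone, it partitions naturally across the per-drone processing that Event Grid fans out to.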

This makes the processing stack look something like this:

[User Uploads Image] -> [Azure Blob Storage] -> [Azure Event Grid] -> [Azure Functions] -> [Azure Machine Learning] -> [Azure Cosmos DB] -> [Monitoring]

where the infrastructure consists of:

Azure Blob Storage: Stores raw image data and processed results. When the hierarchical namespace is enabled, folders come in helpful to organize the fleet, their activities, and feedback.

Azure Functions: Serverless functions handle image processing tasks. The idea here is to define pure logic that is partitioned on the data and one that can scale to arbitrary loads.

Azure Machine Learning: Manages machine learning models and deployments. The Azure Machine Learning Studio allows us to view the pipeline graph, check its output, and debug it. The logs and outputs of each component are available for study. Optionally, components can be registered to the workspace so they can be shared and reused. A pipeline draft connects the components, and a pipeline run can be submitted using the resources in the workspace. Training pipelines can be converted to inference pipelines, and pipelines can be published so that a new run can be submitted with different parameters and datasets. A training pipeline can be reused for different models, and a batch inference pipeline can be used to make predictions on new data.

Azure Event Grid: Triggers events based on image uploads, user directives, or drone feedback.

Azure Cosmos DB: Stores metadata and processed results and makes them suitable for vector search.

Azure API Gateway: Manages incoming image upload requests and outgoing processed results with OWASP protection.

Azure Monitor: Tracks performance metrics and logs events for troubleshooting.
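The role the vector store plays above can be mimicked in miniature: store embeddings and rank them by cosine similarity. The class below is a toy in-memory stand-in, not the Cosmos DB API:

```python
import numpy as np

# Minimal in-memory vector store: upsert embeddings, rank by cosine similarity.
class VectorStore:
    def __init__(self):
        self._ids, self._vecs = [], []

    def upsert(self, doc_id, vector):
        self._ids.append(doc_id)
        self._vecs.append(np.asarray(vector, dtype=float))

    def search(self, query, top_k=1):
        query = np.asarray(query, dtype=float)
        sims = [
            float(np.dot(v, query) / (np.linalg.norm(v) * np.linalg.norm(query)))
            for v in self._vecs
        ]
        order = np.argsort(sims)[::-1][:top_k]  # highest similarity first
        return [(self._ids[i], sims[i]) for i in order]

store = VectorStore()
store.upsert("formation-circle", [1.0, 0.0])
store.upsert("formation-line", [0.0, 1.0])
best = store.search([0.9, 0.1], top_k=1)
```

In production the embeddings would come from the ML model's vectorization step and the similarity search would be delegated to the database's native vector index.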


Tuesday, November 12, 2024

Among the control methods for UAV swarms, Dynamic Formation Changes is the one holding the most promise for morphing from one virtual structure to another. When there is no outside influence or data-driven flight management, coming up with the next virtual structure is an easier articulation for the swarm pilot.

It is usually helpful to plan two or three virtual structures in advance for a UAV swarm to seamlessly morph from one holding position to another. The macro and micro movements can even be delegated to humans and the UAV swarm respectively: given initial and final positions, the autonomous UAVs can make tactical moves efficiently, while the humans generate the overall workflow in the absence of a three-dimensional GPS-based map.

Virtual structure generation can even be synthesized from images with object detection and appropriate scaling, so virtual structures are not necessarily input by humans. In a perfect world, UAV swarms launch from a packed formation to take positions in a matrix in the air and then morph from one position to another given the signals they receive.

There are several morphing algorithms that reduce the distances between the initial and final positions of the drones during transitions between virtual structures. These include, but are not limited to:

1. Thin-plate splines (TPS) algorithm: This algorithm adapts to minimize deformation of the swarm’s formation while avoiding obstacles. It uses a non-rigid mapping function to reduce lag caused by maneuvers.

2. Non-rigid Mapping function: This function helps reduce the lag caused by maneuvers, making the swarm more responsive and energy efficient.

3. Distributed assignment and optimization protocol: This protocol enables UAV swarms to construct and reconfigure formations dynamically as the number of UAVs changes.

4. Consensus-based algorithms: These algorithms allow UAVs to agree on specific parameters such as position, velocity, or direction, ensuring cohesive movement as a unit.

5. Leader-follower method: This method involves a designated leader UAV guiding the formation, with other UAVs following its path.

The essential idea behind the transition can be listed as the following steps:

1. Select random control points

2. Create a grid and use TPS to interpolate values on this grid

3. Visualize the original control points and the interpolated surface.

A sample Python implementation might look like this:

import numpy as np
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt

# Define the control points
x = np.random.rand(10) * 10
y = np.random.rand(10) * 10
z = np.sin(x) + np.cos(y)

# Create the TPS interpolator
tps = Rbf(x, y, z, function='thin_plate')

# Define a grid for interpolation
x_grid, y_grid = np.meshgrid(np.linspace(0, 10, 100), np.linspace(0, 10, 100))
z_grid = tps(x_grid, y_grid)

# Plot the original points and the TPS-interpolated surface
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, color='red', label='Control Points')
ax.plot_surface(x_grid, y_grid, z_grid, cmap='viridis', alpha=0.6)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
ax.legend()
plt.show()

Reference: previous post

Monday, November 11, 2024

In infrastructure engineering, the control plane and data plane serve different purposes, and engineers often want to manage only those entities that are finite, bounded, and open to management and monitoring. If the number far exceeds what can be managed, it is better to separate resources from data. For example, when there are several drones to be inventoried and managed for interaction with cloud services, it is not necessary to create a pseudo-resource representing each drone. Instead, a composite cloud resource set representing a management object can be created for the fleet, and almost all of the drones can be kept as data in a corresponding database maintained by that object. Let us go deeper into this example of controlling UAV swarm movement via cloud resources.

First, an overview of the control methods is necessary. There are several, such as leader-follower, virtual structure, behavior-based, consensus-based, artificial potential field, and advanced AI-based methods (like artificial neural networks and deep reinforcement learning).

Each approach has advantages and limitations: conventional methods offer reliability and simplicity, while AI-based strategies provide adaptability and sophisticated optimization capabilities.

There is a critical need for innovative solutions and interdisciplinary approaches combining conventional and AI methods to overcome existing challenges and fully exploit the potential of UAV swarms in various applications, so the infrastructure and solution accelerator stacks must enable switching from one AI model to another, or even changing from one control strategy to another.

Real-world applications for UAV swarms are not only present, they are the future. So this case study is justified by wide-ranging applications across fields such as military affairs, agriculture, search and rescue operations, environmental monitoring, and delivery services.

Now, for a little more detail on the control methods to select one that we can leverage for a cloud representation. These control methods include:

Leader-Follower Method: This method involves a designated leader UAV guiding the formation, with other UAVs following its path. It's simple and effective but can be limited by the leader's capabilities.

Virtual Structure Method: UAVs maintain relative positions to a virtual structure, which moves according to the desired formation. This method is flexible but requires precise control algorithms.

Behavior-Based Method: UAVs follow simple rules based on their interactions with neighboring UAVs, mimicking natural swarm behaviors. This method is robust but can be unpredictable in complex scenarios.

Consensus-Based Method: UAVs communicate and reach a consensus on their positions to form the desired shape. This method is reliable and scalable but can be slow in large swarms.

Artificial Potential Field Method: UAVs are guided by virtual forces that attract them to the desired formation and repel them from obstacles. This method is intuitive but can suffer from local minima issues.

Artificial Neural Networks (ANN): ANN-based methods use machine learning to adaptively control UAV formations. These methods are highly adaptable but require significant computational resources.

Deep Reinforcement Learning (DRL): DRL-based methods use advanced AI techniques to optimize UAV swarm control. These methods are highly sophisticated and can handle complex environments but are computationally intensive.

Out of these, the virtual structure method inherently leverages both the drones’ ability to find appropriate positions on the virtual structure and their ability to limit their movements to reach the final position and orientation.

Some specific examples and details include:

Example 1: Circular Formation

Scenario: UAVs need to form a circular pattern.

Method: A virtual structure in the shape of a circle is defined. Each UAV maintains a fixed distance from this virtual circle, effectively forming a circular formation around it.

Advantages: This method is simple and intuitive, making it easy to implement and control.
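As a small illustration of Example 1, the target slots on a circular virtual structure can be computed directly; the function name and parameters below are illustrative:

```python
import numpy as np

# Evenly spaced target positions on a circle of a given radius;
# each UAV would then hold one slot relative to the virtual structure.
def circle_formation(n_uavs, radius, center=(0.0, 0.0)):
    angles = np.linspace(0.0, 2.0 * np.pi, n_uavs, endpoint=False)
    cx, cy = center
    return np.column_stack((cx + radius * np.cos(angles),
                            cy + radius * np.sin(angles)))

slots = circle_formation(4, radius=10.0)  # 4 slots, 90 degrees apart
```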

Example 2: Line Formation

Scenario: UAVs need to form a straight line.

Method: A virtual structure in the shape of a line is defined. Each UAV maintains a fixed distance from this virtual line, forming a straight line formation.

Advantages: This method is effective for tasks requiring linear arrangements, such as search and rescue operations.

Example 3: Complex Shapes

Scenario: UAVs need to form complex shapes like a star or polygon.

Method: A virtual structure in the desired complex shape is defined. Each UAV maintains a fixed distance from this virtual structure, forming the complex shape.

Advantages: This method allows for the creation of intricate formations, useful in tasks requiring precise positioning.

Example 4: Dynamic Formation Changes

Scenario: UAVs need to change formations dynamically during a mission.

Method: The virtual structure is updated in real-time according to the mission requirements, and UAVs adjust their positions accordingly.

Advantages: This method provides flexibility and adaptability, essential for dynamic and unpredictable environments.
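The dynamic formation change above can be sketched with a simple assignment step: map each UAV to a slot in the next virtual structure so that the total squared travel distance is minimized. This uses the Hungarian algorithm via SciPy and is only an illustration, not a specific distributed protocol:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Assign each UAV to a slot in the next virtual structure, minimizing the
# total squared travel distance across the swarm.
def assign_slots(current_positions, target_slots):
    current = np.asarray(current_positions, dtype=float)
    targets = np.asarray(target_slots, dtype=float)
    # cost[i, j] = squared distance from UAV i to slot j
    cost = np.sum((current[:, None, :] - targets[None, :, :]) ** 2, axis=2)
    uav_idx, slot_idx = linear_sum_assignment(cost)
    return dict(zip(uav_idx.tolist(), slot_idx.tolist()))

current = [[0.0, 0.0], [5.0, 5.0]]
targets = [[6.0, 5.0], [0.0, 1.0]]
assignment = assign_slots(current, targets)  # each UAV takes its nearest slot
```

Re-running this assignment every time the virtual structure is updated gives the real-time position adjustments the method calls for.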


Sunday, November 10, 2024

Chief executives are increasingly demanding that their technology investments, including data and AI, work harder and deliver more value to their organizations. Generative AI offers additional tools to achieve this, but it adds complexity to the challenge. CIOs must ensure their data infrastructure is robust enough to cope with the enormous data processing demands and governance challenges posed by these advances. Technology leaders see this challenge as an opportunity for AI to deliver considerable growth to their organizations, in terms of both top and bottom lines. While many business leaders have stated in public that it is especially important for AI projects to help reduce costs, they also say that it is important that these projects enable new revenue generation. Gartner forecasts worldwide IT spending to grow by 4.3% in 2023 and 8.8% in 2024, with most of that growth concentrated in the software category that includes spending on data and AI. AI-driven efficiency gains promise business growth, with 81% expecting a gain greater than 25% and 33% believing it could exceed 50%. CDOs and CTOs echo this: “If we can automate our core processes with the help of self-learning algorithms, we’ll be able to move much faster and do more with the same amount of people.” and “Ultimately for us it will mean automation at scale and at speed.”

Organizations are increasingly prioritizing AI projects due to economic uncertainty and the increasing popularity of generative AI. Technology leaders are focusing on longer-range projects that will have significant impact on the company, rather than pursuing short-term projects. This is due to the rapid pace of use cases and proofs of concept coming at businesses, making it crucial to apply existing metrics and frameworks rather than creating new ones. The key is to ensure that the right projects are prioritized, considering the expected business impact, complexity, and cost of scaling.

Data infrastructure and AI systems are becoming increasingly intertwined due to the enormous demands placed on data collection, processing, storage, and analysis. As financial services companies like Razorpay grow, their data infrastructure needs to be modernized to accommodate the growing volume of payments and the need for efficient storage. Advances in AI capabilities, such as generative AI, have increased the urgency to modernize legacy data architectures. Generative AI and the LLMs that support it will multiply workload demands on data systems and make tasks more complex. The implications of generative AI for data architecture include feeding unstructured data into models, storing long-term data in ways conducive to AI consumption, and putting adequate security around models. Organizations supporting LLMs need a flexible, scalable, and efficient data infrastructure. Many have claimed success with the adoption of Lakehouse architecture, combining features of data warehouse and data lake architecture. This architecture helps scale responsibly, with a good balance of cost versus performance. As one data leader observed “I can now organize my law enforcement data, I can organize my airline checkpoint data, I can organize my rail data, and I can organize my inspection data. And I can at the same time make correlations and glean understandings from all of that data, separate and together.”

Data silos are a significant challenge for data and technology executives, as they result from the disparate approaches taken by different parts of organizations to store and protect data. The proliferation of data, analytics, and AI systems has added complexity, resulting in a myriad of platforms, vast amounts of data duplication, and often separate governance models. Most organizations employ fewer than 10 data and AI systems, but the proliferation is most extensive in the largest ones. To simplify, organizations aim to consolidate the number of platforms they use and seamlessly connect data across the enterprise. Companies like Starbucks are centralizing data by building cloud-centric, domain-specific data hubs, while GM's data and analytics team is focusing on reusable technologies to simplify infrastructure and avoid duplication. Additionally, organizations need space to innovate, which can be achieved by having functions that manage data and those that involve greenfield exploration.

#codingexercise: CodingExercise-11-10-2024.docx


Saturday, November 9, 2024

One of the fundamentals of parallel processing in computer science is the separation of tasks per worker to reduce contention. When the worker is treated as an autonomous drone with minimal co-ordination with the rest of its fleet, an independent task might be installing a set of solar panels – in an industry whose global solar-powered renewable energy capacity was estimated at 239 GW in 2023, a 45% increase over the previous year. As the industry expands, drones are employed for their speed. Drones aid in every stage of a plant’s lifecycle from planning to maintenance. They can assist in topographic surveys during planning, monitor construction progress, conduct commissioning inspections, and perform routine asset inspections for operations and maintenance. Drone data collection is not only comprehensive and expedited but also accurate.

During planning for solar panels, drones can conduct aerial surveys to assess topography, suitability, and potential obstacles, create accurate 3D maps to aid in designing and optimizing solar farm layouts, and analyze shading patterns to optimize panel placement and maximize energy production. During construction, drones provide visual updates on construction progress, and track and manage inventory of equipment, tools, and materials on-site. During maintenance, drones can perform close-up inspections of solar panels to identify defects, damage, or dirt buildup, monitor equipment for wear and tear, detect hot spots in panels with thermal imaging, identify and manage vegetation growth that might reduce the efficiency of solar panels and enhance security by patrolling the perimeter and alerting to unauthorized access.
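As a toy illustration of the hot-spot detection mentioned above, a thermal image can be thresholded against the panel's mean temperature. Real inspections use calibrated radiometric data and more robust statistics, so the 3-sigma threshold here is only an assumption:

```python
import numpy as np

# Flag pixels that are unusually hot relative to the panel's mean temperature.
def find_hot_spots(thermal_image, sigma=3.0):
    img = np.asarray(thermal_image, dtype=float)
    mean, std = img.mean(), img.std()
    mask = img > mean + sigma * std
    return np.argwhere(mask)  # (row, col) coordinates of hot pixels

panel = np.full((8, 8), 35.0)  # nominal panel temperature in Celsius
panel[2, 5] = 90.0             # a single overheated cell
hot = find_hot_spots(panel, sigma=3.0)
```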

When drones become autonomous, these activities go to the next level. The dependency on human pilots has always limited the frequency of flights. Autonomous drones, on the other hand, boost efficiency, shorten fault detection times, and optimize outcomes during O&M site visits. Finally, they help increase the power output yield of solar farms. The sophistication of drones in terms of hardware and software increases from remote-controlled drones to autonomous ones. Field engineers might suggest the selection of an appropriate drone as well as the position of docking stations, the payload such as a thermal camera, and capabilities. A drone data platform that seamlessly facilitates data capture, ensures safe flight operations with minimal human intervention, prioritizes data security, and meets compliance requirements becomes essential at this stage. Finally, this platform must also support integration with third-party data processing and analytics applications and with reporting stacks that publish various charts and graphs. As usual, a separation between data processing and data analytics helps, as does a unified layer for programmability and user interaction with API, SDK, UI, and CLI. While the platform can be sold separately as a product, leveraging a cloud-based SaaS service reduces the cost at the edge.

There is still another improvement possible over this: the formation of dynamic squadrons, consensus protocols, and distributed processing with hash stores. While there are existing applications that improve IoT data streaming at the edge and cloud processing via stream stores and analytics, with the simplicity of SQL-based querying and programmability, a cloud service that installs and operates a deployment stamp with a solution accelerator, as a citizen resource of a public cloud, helps bring in the best practices of storage engineering and data engineering and enables businesses to stay focused.


Friday, November 8, 2024

This is a summary of the book titled “The Leader’s Guide to Managing Risk,” written by risk consultant K. Scott Griffith and published by HarperCollins Leadership in 2023. Griffith served as American Airlines’ chief safety officer, and he draws on that experience to mitigate and manage risk. People are focused on success, and they put their minds to surviving and thriving despite risk. But when things go awry, risk management is the only thing that can prevent some consequences, if we understand the underlying causes of bad outcomes. Sustained success over time demonstrates reliability, and risk management teaches reliability and resilience. Organizational reliability takes it to the next level with a combination of human and systemic factors. Global risk is the largest in scope, and while it can be daunting, it can be met. People and organizations can both achieve sustained success.

Reliability is essential for businesses to demonstrate their worth to shareholders and consumers. High performance must be sustainable over time; since failure is inevitable, reliability cannot mean perfect performance. Leaders must manage risk based on solid science and understand that companies are groups of people working within systems.

Reliability theory teaches that events always emerge within a probability distribution, and leaders must pay attention to each level of operations to perceive and understand risks.

To manage risk effectively, leaders must develop "risk intelligence," which is the ability to perceive a risk's severity and potential harmful results. This intelligence is derived from their experience and history with their organization but is subject to psychological and cognitive limitations. Leaders must also be aware of their personal risk tolerance, which should be rational and fact-based.

A system's design determines its reliability, and leaders must understand how it works both when functioning properly and when it fails. Systems come in various forms, from large systems like transportation infrastructure to small personal ones. To predict and manage system performance, leaders must understand the influences that determine the results systems produce. Human reliability maintains system reliability, and systems should optimize core functions while heeding less important functions. Systems can become unreliable due to factors other than flaws or design limitations, such as lack of maintenance, user overload, environmental factors, or human incompetence.

System design is crucial for maintaining system functionality, resilience, and trust. Engineers and designers can create obstacles, redundancies, and recovery mechanisms. Organizations should prioritize training their leaders to manage human and operational risk factors. Human performance is predictable, and organizations can manage and minimize human failures by understanding how performance works and what motivates and shapes it. Inclusive, collaborative work encourages reliable performance.

Organizational reliability is achieved through a combination of human and systemic factors, including effective leadership and various internal and external factors. Leaders must manage risk by inspiring employees to work towards achieving the organization's goals, balancing competing priorities, and understanding the risk of error. They should ensure the company has reliable systems, appropriate resources, and effective working methods.

Leaders can predict human and organizational reliability by assessing future risks through methods like investigations, audits, inspections, employee reports, and predictive risk modeling. These methods have limitations and are subject to human interpretation bias. Predictive risk analysis helps leaders anticipate future risks and mitigate or prevent them, such as preventing accidents or errors in hospitals or transportation systems.

Global risks, such as climate change, can be mitigated by individuals and organizations. Nuclear capabilities can deter war, but cooperation among nations is necessary to address and mitigate global risk. Even parenting involves addressing risks, like toxic household chemicals and car accidents.

Overall, understanding and managing these risks is crucial for maintaining organizational reliability.

Between 1995 and 2005, the aviation industry in the United States reduced fatal airline accidents by 78%. This improvement was not due to a single policy but rather the Commercial Aviation Safety Team, which examined various aviation risk dimensions. To manage risk optimally, organizations should follow a "Sequence of Reliability" framework, which starts with understanding the implications of risk and moves to managing systems, organizations, and employees. The best approach varies depending on the specific risk faced.