Cluster computing

Tuesday, April 23, 2024

This is a continuation of previous articles on IaC shortcomings and resolutions. IaC code can deterministically and repeatedly create, update, and delete cloud resources, but some dependencies are managed by the resources themselves and become a concern for the end user when they are not properly cleaned up. Take, for instance, the load balancers that are automatically provisioned when compute instances and clusters are created in an Azure Machine Learning workspace. The purpose of this load balancer is to manage traffic to the compute, and it stays in place even when the compute instance or cluster is stopped. It distributes requests evenly across the available compute resources, improving performance and availability. Each compute instance has one load balancer associated with it, and a compute cluster is billed for one standard load balancer per 50 nodes. Each load balancer is billed at approximately $0.33 per day, so with multiple compute instances the charge applies once per instance, while for a compute cluster it depends on the total number of nodes. One way to avoid load balancer costs on stopped compute instances and clusters is to delete the compute resources when they are not in use. IaC can help with deleting the resources, but whether the action is automated or manual, it is contingent on the deletion of the load balancers. When that deletion fails, for reasons such as locks on the load balancers, the user is left with a troublesome situation.
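The billing rule above can be turned into a rough estimate. The following is a minimal sketch, not an official pricing calculator: it uses the approximate $0.33 per day figure quoted above and assumes that a cluster's node count is rounded up to the next multiple of 50 and that a cluster with zero nodes incurs no charge. The function names are illustrative.

```python
import math
from typing import Sequence

LB_DAILY_COST_USD = 0.33  # approximate per-load-balancer figure quoted above
NODES_PER_LB = 50         # one standard load balancer billed per 50 cluster nodes


def billed_load_balancers(compute_instances: int, cluster_node_counts: Sequence[int]) -> int:
    """Count billed load balancers: one per compute instance, plus one per
    50 nodes for each cluster (rounding up is an assumption of this sketch)."""
    per_cluster = sum(math.ceil(nodes / NODES_PER_LB) for nodes in cluster_node_counts)
    return compute_instances + per_cluster


def estimated_cost_usd(compute_instances: int, cluster_node_counts: Sequence[int], days: int = 30) -> float:
    """Estimate the load balancer charge over a number of days."""
    count = billed_load_balancers(compute_instances, cluster_node_counts)
    return round(count * LB_DAILY_COST_USD * days, 2)


if __name__ == "__main__":
    # Three compute instances, one 120-node cluster, and one 50-node cluster.
    print(billed_load_balancers(3, [120, 50]))
    print(estimated_cost_usd(3, [120, 50], days=30))
```

Even at a few cents per day, the charge accumulates across many idle instances, which is why cleanup of stopped compute matters.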

An understanding of the load balancer might help put things in perspective, especially when trying to find one to unlock or delete. Many cloud resources, as well as the Azure Batch service, create load balancers, and the way to distinguish them varies: by resource group, by tags, or by properties. These load balancers distribute network traffic evenly across multiple compute resources to optimize performance and ensure high availability. Load balancers in general use algorithms such as round-robin, least connections, or source IP affinity to distribute incoming traffic to the available compute resources. This helps maintain a balanced workload and prevents any single resource from being overwhelmed. They also contribute to high availability by continuously sending health probes to the compute resources to check their responsiveness. If a resource fails the health probe, it is marked as unhealthy, and traffic is automatically redirected to the remaining healthy resources. Combined with compute that scales up or down based on predefined rules or metrics, they allow an increase in traffic to be handled efficiently. Load balancing rules determine how traffic is distributed and can be configured by protocol, port, or other attributes so that traffic is routed correctly. Azure Machine Learning workspaces support both internal and public load balancers: internal load balancers handle traffic within a virtual network, while public load balancers handle traffic from the internet. They integrate with other Azure services, such as virtual networks, virtual machines, and container services, to build scalable and highly available machine learning solutions.
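To make the interplay between a distribution algorithm and health probes concrete, here is a conceptual sketch of round-robin selection that skips backends failing a probe. It illustrates the general technique only and is not how Azure implements its load balancers; the backend names and the `probe` callable are hypothetical.

```python
from itertools import cycle
from typing import Callable, Iterator, Sequence


def healthy_round_robin(backends: Sequence[str], probe: Callable[[str], bool]) -> Iterator[str]:
    """Yield backends in round-robin order, skipping any that fail the probe.

    Raises RuntimeError if a full pass over the backends finds none healthy.
    """
    if not backends:
        raise ValueError("at least one backend is required")
    ring = cycle(backends)
    while True:
        # Try each backend at most once per selection.
        for _ in range(len(backends)):
            candidate = next(ring)
            if probe(candidate):
                yield candidate
                break
        else:
            raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    # Hypothetical health state; a real probe would issue a TCP or HTTP check.
    health = {"node-a": True, "node-b": False, "node-c": True}
    picker = healthy_round_robin(list(health), health.get)
    for _ in range(4):
        print(next(picker))
```

Because the probe is consulted on every selection, a backend that recovers is picked up again on the next pass without any extra bookkeeping.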

Creating the compute with node public IP set to false and local auth disabled can prevent load balancers from being created, but if endpoints are involved, the Azure Batch service will still create them. Load balancers, public IP addresses, and associated dependencies are created in the resource group of the virtual network, not in the resource group of the machine learning workspace. Finding the load balancers and taking appropriate action on them, such as removing locks, allows the compute resources to be cleaned up. This can be done on an ad hoc or a scheduled basis.
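One way to approach the find-and-act step is to export the load balancers and locks of the virtual network's resource group as JSON (for example with `az network lb list --resource-group <vnet-rg>` and `az lock list --resource-group <vnet-rg>`) and build a cleanup plan from them. The sketch below only generates the commands for review and does not run anything. The `is_target` predicate and the `owner` tag in the usage example are hypothetical, since how to recognize the relevant load balancers depends on the environment. It also assumes a lock applied directly to a load balancer has an ID that extends the load balancer's resource ID, so locks inherited from the resource group or subscription would not be matched.

```python
from typing import Callable, Dict, List

# Segment that follows the resource ID in a lock applied directly to it.
LOCK_MARKER = "/providers/microsoft.authorization/locks/"


def plan_cleanup(
    load_balancers: List[Dict],
    locks: List[Dict],
    is_target: Callable[[Dict], bool],
) -> List[str]:
    """Return Azure CLI commands that remove locks from, and then delete,
    every load balancer selected by is_target. Nothing is executed here."""
    commands: List[str] = []
    for lb in load_balancers:
        if not is_target(lb):
            continue
        lb_id = lb["id"]
        prefix = (lb_id + LOCK_MARKER).lower()
        for lock in locks:
            # Resource IDs are compared case-insensitively.
            if lock["id"].lower().startswith(prefix):
                commands.append(f"az lock delete --ids {lock['id']}")
        commands.append(f"az network lb delete --ids {lb_id}")
    return commands


if __name__ == "__main__":
    # Hypothetical sample data shaped like the CLI's JSON output.
    rg = "/subscriptions/0000/resourceGroups/vnet-rg/providers/Microsoft.Network/loadBalancers"
    lbs = [
        {"id": f"{rg}/lb-one", "tags": {"owner": "ml"}},
        {"id": f"{rg}/lb-two", "tags": {}},
    ]
    lks = [{"id": f"{rg}/lb-one/providers/Microsoft.Authorization/locks/do-not-delete"}]
    for cmd in plan_cleanup(lbs, lks, lambda lb: (lb.get("tags") or {}).get("owner") == "ml"):
        print(cmd)
```

Generating the plan separately from executing it leaves room for a manual review on an ad hoc run, while a scheduled job could pipe the same output to a shell once the selection rule is trusted.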
