Wednesday, May 29, 2024

 This is a continuation of articles on IaC shortcomings and resolutions. In this section too, we focus on the deployment of Azure Machine Learning workspaces with virtual network peering and on securing them with proper connectivity. When peerings are established between virtual networks and the AZ ML workspace is secured with a subnet dedicated to the creation of compute, improper configuration of private and service endpoints, firewalls, NSGs, and user-defined routes can cause quite a few surprises in the normal functioning of the workspace. For example, data scientists may encounter an error such as: “Performing interactive authentication. Please follow the instructions on the terminal. To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXYYZZAA to authenticate.” Even if they complete the device login, the resulting message will tell them they cannot be authenticated at this time. Proper configuration of the workspace and its traffic is essential to overcome this error.

One of the main obstacles to completing pass-through authentication is the resolution of DNS names to the IP addresses needed to route the return traffic. Since public-plane connectivity is terminated at the workspace, traffic to and from the compute goes over the private plane. A private DNS lookup is required to obtain the IP address of the workspace's private endpoint. When the private endpoint is created, the DNS zone registrations for the predetermined domain prefixes and their corresponding private IP addresses, as determined by the private endpoint, must be made. These records are auto-registered when the endpoint is suitably created; otherwise they must be added manually.
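As a rough sketch of what that registration amounts to, the following Python models a private DNS zone as a dictionary and registers a hypothetical A record for a workspace private endpoint. The GUID, IP address, and zone contents are illustrative placeholders, and no Azure SDK is involved; only the zone and name patterns come from the article:

```python
# Sketch: model the private DNS zone registrations a workspace private
# endpoint needs. Zone names follow the privatelink pattern; the GUID
# and private IPs below are hypothetical placeholders.
zones = {
    "privatelink.api.azureml.ms": {},
    "privatelink.notebooks.azure.net": {},
}

def register(zone, name, ip):
    """Add an A record; this happens automatically when the endpoint is suitably created."""
    zones[zone][name] = ip

def resolve(fqdn):
    """Return the private IP registered for an FQDN, or None if it was never added."""
    for zone, records in zones.items():
        if fqdn.endswith(zone):
            return records.get(fqdn)
    return None

workspace_fqdn = ("abcd1234-0000-0000-0000-000000000000"
                  ".workspace.eastus.privatelink.api.azureml.ms")
register("privatelink.api.azureml.ms", workspace_fqdn, "10.0.1.5")

print(resolve(workspace_fqdn))  # 10.0.1.5
```

If a name was never registered, `resolve` returns None, which is the manual-addition case the paragraph above warns about.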

With just the compute and the clusters having private IP connectivity to the subnet, outbound IP connectivity can be established through the workspace in an unrestricted setting, or with a firewall in a conditional-egress setting. The subnet that the compute and clusters are provisioned from must have connectivity to the subnet that hosts the storage account, key vault, and Azure container registry internal to the workspace. A subnet can even have its own NAT gateway so that all outbound access gets the same IP address prefix, which makes it easy to secure incoming traffic at the destination with a single IP rule for that prefix. The storage account and key vault can gain access to the compute and cluster's private IP addresses via their service endpoints, while the container registry must have a private endpoint for private-plane connectivity to the compute. A dedicated image-build compute can be created for designated image-building activities.
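The value of routing all egress through one NAT gateway prefix can be shown mechanically. This sketch, with hypothetical prefix values, uses Python's ipaddress module to check that a single destination-side IP rule covers every address the gateway can present:

```python
import ipaddress

# Hypothetical NAT gateway outbound prefix and the matching IP rule
# configured at the destination (e.g., a storage account firewall).
nat_prefix = ipaddress.ip_network("20.40.10.0/30")
allowed = ipaddress.ip_network("20.40.10.0/30")

def egress_allowed(source_ip: str) -> bool:
    # Traffic is admitted only when the observed source IP falls in the allowed prefix.
    return ipaddress.ip_address(source_ip) in allowed

# Every host address the NAT gateway can use is covered by the one rule.
print(all(egress_allowed(str(ip)) for ip in nat_prefix.hosts()))  # True
```

Without the NAT gateway, egress could come from arbitrary addresses and the destination would need broader, weaker rules.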

User-defined routing and the local hosts file become pertinent when a firewall is used to secure outbound traffic. A hosts-file entry mapping the private IP address of the compute to a name like ‘mycomputeinstance.eastus.instances.azureml.ms’ is one way to connect to the virtual network that contains the workspace. It is also important to set user-defined routing when a firewall is used, and the default route must use ‘0.0.0.0/0’ to send all outbound internet traffic to the private IP address of the firewall as the next hop. This allows the firewall to inspect all outbound traffic, and security policies can then allow or deny traffic selectively.
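A route table applies these rules by longest-prefix match, so the ‘0.0.0.0/0’ default only catches traffic that no more specific route claims. A small illustrative sketch, with a hypothetical VNet range and a hypothetical firewall private IP:

```python
import ipaddress

# Hypothetical route table: the mandatory default route sends internet-bound
# traffic to the firewall's private IP as next hop, while a more specific
# route keeps intra-VNet traffic local.
routes = [
    ("10.0.0.0/16", "VnetLocal", None),
    ("0.0.0.0/0", "VirtualAppliance", "10.0.2.4"),  # firewall private IP
]

def next_hop(dest_ip: str):
    """Pick the most specific (longest-prefix) route that matches, as Azure does."""
    matches = [(ipaddress.ip_network(prefix), kind, hop)
               for prefix, kind, hop in routes
               if ipaddress.ip_address(dest_ip) in ipaddress.ip_network(prefix)]
    net, kind, hop = max(matches, key=lambda m: m[0].prefixlen)
    return kind, hop

print(next_hop("52.100.1.1"))  # ('VirtualAppliance', '10.0.2.4')
print(next_hop("10.0.1.5"))    # ('VnetLocal', None)
```

Internet-bound traffic lands on the firewall for inspection; traffic inside the VNet is untouched by the default route.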


Tuesday, May 28, 2024

 This is a summary of the book titled “The AI Playbook: Mastering the Art of Machine Learning Deployment” written by Eric Siegel and published by MIT Press in 2024. Prof. Siegel urges business and tech leaders to come out of their silos and collaborate to harness the full potential of the machine learning models that will transform their organizations and optimize their operations. He provides a step-by-step framework for doing so, which includes establishing a value-driven deployment goal by leveraging “backward planning,” collaborating on a specific prediction goal, finding the right evaluation metrics, preparing the data to achieve desired outcomes, training the model to detect patterns, deploying the model so that there is full-stack buy-in from stakeholder departments in the organization, and committing to a strong ethical compass for maintaining the models.

Machine Learning (ML) opportunities require collaboration between business and data professionals. Business professionals need a holistic understanding of the ML process, including models, metrics, and data collection. Data professionals must broaden their perspective on ML to understand its potential to transform the entire business. BizML, a six-step business approach, bridges gaps between the business and data ends of an organization. It focuses on organizational execution and complements the Cross Industry Standard Process for Data Mining (CRISP-DM). Successful ML and AI projects require "backward planning" to establish a value-driven deployment goal. ML's applications extend beyond predicting business outcomes, addressing social issues like abuse or neglect. After choosing how to apply ML, stakeholders with decision-making power should approve it, focusing on the gains ML can make rather than fixating on the technology.

Business and tech leaders should collaborate to specify a prediction goal for machine learning (ML) projects. This involves defining the goal in detail, identifying viable prediction goals, and adhering to the "Law of ML Planning." Ensure that deployment, and how the predictions will shape business operations, stay at the forefront of the project. Consider potential ethical issues, such as the potential for predictive policing models to inflate the likelihood of Black parolees being rearrested.

For new ML projects, consider creating a binary model or binary classifier that makes predictions by answering yes/no questions. Other predictive models, such as numerical or continuous models, can also be used.

Evaluating the model's performance is crucial to determining its success, and accuracy is not the best way to measure it. A model can post high accuracy while performing only marginally better than random guessing, so metrics such as "lift" and "cost" should be used to evaluate performance instead.
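As an illustration of lift (the numbers below are invented, not from the book), the metric compares the response rate in the model's top-ranked slice against the overall base rate:

```python
def lift(labels_sorted_by_score, top_fraction=0.1):
    """Lift = positive rate in the top-scored slice / overall positive rate.

    labels_sorted_by_score: 1/0 outcomes ordered from highest to lowest
    model score. The data below is illustrative only.
    """
    n = len(labels_sorted_by_score)
    k = max(1, int(n * top_fraction))
    top_rate = sum(labels_sorted_by_score[:k]) / k
    base_rate = sum(labels_sorted_by_score) / n
    return top_rate / base_rate

# 20 customers, 4 responders overall; the model ranks 2 responders at the top.
labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(lift(labels, 0.1))  # 5.0: the top 10% responds at 5x the base rate
```

A lift of 1.0 means the model ranks no better than random selection, which is exactly the failure high accuracy can hide.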

To train a machine learning (ML) model, ensure that the data is long, wide, and labeled. This will help the model accurately predict outcomes and identify patterns. The data may be structured or unstructured, and be wary of "noise" or corrupt data that may be causing issues.

Teach the ML model to detect patterns in a sensible way, as ML algorithms learn from your data and use patterns to make predictions. Understanding your model is not always straightforward, but if the patterns your model detects and uses to make predictions are reliable, you don't necessarily need to establish causation.

Familiarize yourself with different modeling methods, such as decision trees, linear regression, and logistic regression. Investigate your models to ensure they don't contain bugs, as some models may combine input variables in problematic ways. For example, a model meant to distinguish huskies from wolves in images may turn out to be labeling all images with snow as "wolves" and all images without snow as "huskies."

To deploy an AI model, it's crucial to gain full-stack cooperation and buy-in from all team members within your organization. Building trust in the model is essential, as it can automate decision-making processes. Humans still play a role in some processes, and a "human-in-the-loop" approach lets them make operational decisions after integrating data from the model. Deployment risk can be mitigated by using a control group or incremental deployment. Maintaining the model is essential to prevent model drift, which occurs when the data the model sees degrades or shifts away from what it was trained on. To avoid discrimination, ensure the model doesn't operate in a discriminatory way, aiming to represent different groups equally and avoid inferring sensitive attributes. Aspire to use data ethically and responsibly, based on empathy.


Monday, May 27, 2024

 This is a continuation of articles on IaC shortcomings and resolutions. In this section too, we focus on the deployment of Azure Machine Learning workspaces with virtual network peering and on securing them with proper connectivity. When peerings are established, traffic from any source in one virtual network can flow to any destination in another. This is very helpful when egress must be from one virtual network. Any number of virtual networks can be peered in a hub-and-spoke model or as transit, but each topology has its drawbacks and advantages. The impact this has on the infrastructure for AZ ML deployments is usually not called out, and there can be quite a few surprises in the normal functioning of the workspace. The previous article focused on DNS name resolution and the appropriate names and IP addresses to use with A records. This article focuses on private and service endpoints, firewalls, NSGs, and user-defined routing.

The workspace and the compute can have public and private IP addresses, and when a virtual network is used, the intent is to isolate and secure the connectivity. This can be done in one of two ways: a managed virtual network, or a customer-specified virtual network for the compute instances and clusters. Either way, the workspace can retain public IP connectivity while the compute instances and clusters choose public or private connectivity independently. The latter can be provisioned with public IP connectivity disabled, using only private IP addresses from a subnet in the virtual network. It is worth noting that the workspace's IP connectivity can be independent of that of the compute and clusters, because this affects the end-user experience. The workspace can retain both a public and a private IP address simultaneously, but if it were made entirely private, then a jump server and a bastion would be needed to interact with the workspace, including its notebooks, datastores, and compute. With just the compute and the clusters having private IP connectivity to the subnet, outbound IP connectivity can be established through the workspace in an unrestricted setting, or with a firewall in a conditional-egress setting. The subnet that the compute and clusters are provisioned from must have connectivity to the subnet that hosts the storage account, key vault, and Azure container registry internal to the workspace. A subnet can even have its own NAT gateway so that all outbound access gets the same IP address prefix, which makes it easy to secure incoming traffic at the destination with a single IP rule for that prefix. The storage account and key vault can gain access to the compute and cluster's private IP addresses via their service endpoints, while the container registry must have a private endpoint for private-plane connectivity to the compute. A dedicated image-build compute can be created for designated image-building activities.
On the other hand, if the compute and cluster were assigned public IP connectivity, the Azure Batch service would need to be involved, and these services would reach the compute and cluster's IP addresses via a load balancer. If created without a public IP, we get a private link service to accept inbound access from the Azure Batch Service and the Azure Machine Learning Service without a public IP address. A hosts-file entry mapping the private IP address of the compute to a name like ‘mycomputeinstance.eastus.instances.azureml.ms’ is one way to connect to the virtual network that contains the workspace. It is also important to set user-defined routing when a firewall is used, and the default route must use ‘0.0.0.0/0’ to send all outbound internet traffic to the private IP address of the firewall as the next hop. This allows the firewall to inspect all outbound traffic, and security policies can then allow or deny traffic selectively.
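A hosts-file entry of this shape can be generated mechanically. In this sketch the IP address and instance name are hypothetical placeholders; only the name pattern comes from the article:

```python
def hosts_entry(private_ip: str, instance_name: str, region: str) -> str:
    """Build the local hosts-file line for a no-public-IP compute instance.

    The name follows the <instance>.<region>.instances.azureml.ms pattern;
    the IP address passed in is whatever the private endpoint assigned.
    """
    return f"{private_ip}\t{instance_name}.{region}.instances.azureml.ms"

print(hosts_entry("10.0.1.7", "mycomputeinstance", "eastus"))
# 10.0.1.7	mycomputeinstance.eastus.instances.azureml.ms
```

Appending such a line to /etc/hosts (or C:\Windows\System32\drivers\etc\hosts) lets a client resolve the compute without a private DNS zone, at the cost of manual upkeep.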

Previous article: IaCResolutionsPart126.docx


Sunday, May 26, 2024

 This is a continuation of IaC shortcomings and resolutions. In this section, we focus on the deployment of Azure Machine Learning workspaces with virtual network peerings. When peerings are established, traffic from any source in one virtual network can flow to any destination in another. This is very helpful when egress must be from one virtual network. Any number of virtual networks can be peered in a hub-and-spoke model or as transit, but each topology has its drawbacks and advantages. The impact this has on the infrastructure for AZ ML deployments is usually not called out, and there can be quite a few surprises in the normal functioning of the workspace. This article explains these.

First, the Azure Machine Learning workspace requires certain hosts and ports to reach it, and they are maintained by Microsoft. For example, the hosts login.microsoftonline.com and management.azure.com are necessary for Microsoft Entra ID, the Azure Portal, and Azure Resource Manager to respond to the workspace. Users of the AZ ML workspace might encounter an error such as: “Performing interactive authentication. Please follow the instructions on the terminal. To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXYYZZAA to authenticate.” Such a direction does not result in a successful authentication and leads to the dreaded You-cannot-access-this-right-now with the detailed message “Your sign-in was successful but does not meet the criteria to access this resource”. To resolve this error, ensure that the workspace can be reached back from these hosts. If the compute attached to the workspace has public IP connectivity, the host can reach it back, but if the compute was created with no public IP and was deployed to a subnet, then the reaching back occurs by name resolution. Consequently, the private endpoint associated with the workspace must be linked to the virtual networks that must have access, and the following DNS names must be registered with those zones: <workspace-identifier-guid>.workspace.<region>.privatelink.api.azureml.ms, <workspace-identifier-guid>.workspace.<region>.cert.privatelink.api.azureml.ms, *.<workspace-identifier-guid>.inference.<region>.privatelink.api.azureml.ms, and ml-<workspace-name>-<region>-<workspace-identifier-guid>.<region>.privatelink.notebooks.azure.net. Their corresponding private IP addresses can be found from the private endpoint associated with the workspace, where workspace-identifier-guid is specific to a workspace and the region, such as ‘centralus’, is where the workspace is deployed. With peered networks, the private DNS zones in those networks must allow these names to resolve.
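Those name patterns can be captured in a small helper. The GUID and workspace name below are hypothetical placeholders, and the templates simply restate the patterns listed above:

```python
def workspace_fqdns(guid: str, region: str, workspace_name: str):
    """Return the private-link FQDNs that must resolve for a workspace.

    guid, region, and workspace_name are placeholders for the values
    specific to each deployment.
    """
    return [
        f"{guid}.workspace.{region}.privatelink.api.azureml.ms",
        f"{guid}.workspace.{region}.cert.privatelink.api.azureml.ms",
        f"*.{guid}.inference.{region}.privatelink.api.azureml.ms",
        f"ml-{workspace_name}-{region}-{guid}.{region}.privatelink.notebooks.azure.net",
    ]

for name in workspace_fqdns("abcd1234-0000-0000-0000-000000000000",
                            "centralus", "my-workspace"):
    print(name)
```

Each of these names should map to a private IP taken from the workspace's private endpoint; any peered network that needs access must be able to resolve all four.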

Second, Network Watcher or similar tools must be used to verify that traffic can reach the public network addresses registered with Microsoft, which are typically well advertised in both documentation and APIs from Azure. These include CIDR ranges like 13.0.0.0/8, 51.0.0.0/8, 52.0.0.0/8, 20.0.0.0/8, and 40.0.0.0/8; more specific ranges can be obtained via the CLI/API.
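A first-pass check of whether an address falls in one of those broad ranges can be done with Python's ipaddress module. Note that these /8 prefixes are deliberately coarse; the narrower, authoritative ranges should come from the service-tag CLI/API:

```python
import ipaddress

# Broad Azure prefixes from the article; treat membership here as a hint
# only, since the authoritative ranges are narrower and change over time.
azure_prefixes = [ipaddress.ip_network(p) for p in
                  ("13.0.0.0/8", "20.0.0.0/8", "40.0.0.0/8",
                   "51.0.0.0/8", "52.0.0.0/8")]

def looks_like_azure(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in azure_prefixes)

print(looks_like_azure("52.168.1.1"))   # True
print(looks_like_azure("192.168.1.1"))  # False
```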

Previous articles: IaCResolutionsPart125.docx


Saturday, May 25, 2024

 

This is a continuation of previous articles on IaC shortcomings and resolutions. In this section, we focus on automation involving external tools and APIs. Almost all mature DevOps pipelines rely on some automation that is facilitated by scripts and executables rather than IaC resources. The home for these scripts usually turns out to be the pipelines themselves, or it gravitates to centralized, one-point-maintenance destinations such as Azure Automation Account runbooks or Azure DevOps, depending on scope and reusability.

While deciding where to store automation logic, some considerations often get ignored. For example, runbooks run either in a sandbox environment or on a Hybrid Runbook Worker.

When the executables are downloadable from the internet, either can be used, since internet connectivity is available in both. But when local resources need to be managed, such as an Azure storage account or an on-premises store, they must be managed via a Hybrid Runbook Worker. The Hybrid Runbook Worker enables us to manage local resources that are not necessarily native to the cloud and bridges the gap between cloud-based automation and on-premises or hybrid scenarios. There are two installation platforms for the Hybrid Runbook Worker: extension-based (v2) and agent-based (v1). The former is the recommended approach because it simplifies installation and management by using a VM extension. It does not rely on the Log Analytics agent and reports directly to an Azure Monitor Log Analytics workspace. The v1 approach requires the Log Analytics agent to be installed first. Both v1 and v2 can coexist on the same machine. Beyond these choices lie only limitations, and other options such as Azure DevOps might be considered instead. Webhooks and APIs are left out of this discussion, but they provide the advantage that authentication and encryption become part of each request.
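The placement rule in this paragraph can be summarized as a tiny decision helper. The return labels are made-up strings for illustration; this restates the guidance above rather than calling any Azure API:

```python
def runbook_target(needs_local_resources: bool, internet_only: bool) -> str:
    """Encode the placement rule: sandbox runbooks suffice for internet-reachable
    work; local or on-premises resources require a Hybrid Runbook Worker
    (extension-based v2 being the recommended install)."""
    if needs_local_resources:
        return "hybrid-runbook-worker-v2"
    if internet_only:
        return "azure-sandbox"
    return "review-requirements"

print(runbook_target(needs_local_resources=True, internet_only=False))
# hybrid-runbook-worker-v2
```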

 

Azure DevOps, aka ADO, is a cloud-based service, and it does not place restrictions on its elasticity. The DevOps-based approach is critical to rapid software development cycles. The Azure DevOps project is the fundamental container where data is stored when added to Azure DevOps. Since it is a repository for packages and a place for users to plan, track progress, and collaborate on building workflows, it must scale with the organization. When a project is created, a team is created with the same name. For enterprises, it is better to use a collection-project-team structure, which gives teams a high level of autonomy and supports administrative tasks at the appropriate level.

Some tenets for organization from ADO have parallels in Workflow management systems:

·       Projects can be added to support different business units 

·       Within a project, teams can be added 

·       Repositories and branches can be added for a team 

·       Agents, agent pools, and deployment pools to support continuous integration and deployment 

·       Many users can be managed using the Azure Active Directory. 

It might be tempting to use GitOps and third-party automation solutions including Jenkins-based automation, but they only introduce more variety. Consolidating resources and automation in the public cloud is the way to go.

As with all automation, it is important to keep these scripts in source control so that their maintenance becomes easy. It is also important to secure the credentials with which they run. Finally, locking down all resources in terms of network access and private planes is just as important as keeping them accessible for automation.

 

Previous articles: https://1drv.ms/w/s!Ashlm-Nw-wnWhO4RqzMcKLnR-r_WSw?e=kTQwQd 





Thursday, May 23, 2024

 This is a summary of the book titled “Nonviolent Communication: A Language of Life” written by Marshall B. Rosenberg and published by PuddleDancer Press in 2003. The author explains how to express needs and feelings in ways that promote respectful, empathic interpersonal communication. This is not about conflict resolution alone but about compassionate communication. It provides a framework for human needs and emotions and ultimately leads to clearer communication, mindfulness, better relationships, and personal growth. Imperfect communication causes misunderstandings and frustrations. NVC is based on language “from the heart.” It has four components: observations, feelings, needs, and requests. We can practice it first by observing without judgment or evaluation. We express our needs without letting our feelings be manipulated by environmental factors; too often we blame those external factors for our feelings, but we must begin to take ownership of our needs ourselves before looking to others. When we express requests, we can include both needs and feelings, but not demands. Checking whether the message behind our requests sank in is good practice. Applying NVC practices can help in dealing with emotions and resolving conflicts; simple substitutions such as “I choose to” instead of “I have to” help in this regard.

Nonviolent Communication (NVC) is a method of communication that promotes interpersonal connection and empathy. It consists of four components: observations, feelings, needs, and requests. NVC is applied by observing what is happening, sharing how it makes us feel and what we need, and asking for specific actions. NVC can be applied to personal relationships, family, business, and societal conflicts.


Observation should be specific to a time and context, and evaluation should be specific to the behavior observed. Identifying and expressing feelings is crucial, but people may not always support it. It can be improved by distinguishing between emotions and thoughts, and focusing on what is enriching or not enriching our life.


Feelings result from how we receive others' actions and statements, which is a choice made in combination with our needs and expectations. If someone says something negative to us, we have four response options: blaming ourselves, blaming others, paying attention to what we feel and need, or paying attention to what others feel and need. This helps us become aware of what's happening, what people are feeling, and why.

Identifying needs is crucial for emotional liberation, as it helps individuals recognize their physical, spiritual, autonomy, and interdependence needs. This process involves three stages: emotional slavery, where one feels responsible for others' feelings, the obnoxious stage, where one rejects responsibility, and the third stage, emotional liberation, where one takes responsibility for their actions.


NVC's fourth component is requesting, which involves asking others for things that would enrich one's life. Active language is used when making requests, and specific, positive actions are requested. Emphasizing empathy and asking listeners to reflect back on their responses can make requests seem less like demands. It is important to present requests as requests rather than demands, as people may view those who make a demand as criticizing or making them feel guilty. The goal is to build a relationship based on honesty and empathy, rather than presenting a demand.

NVC principles emphasize self-expression and empathy in interactions with others. Listening with our whole being, letting go of preconceptions, and focusing on what people feel and need is crucial. Empathy can be achieved by paraphrasing what we think we've heard, correcting our understanding if we're wrong, and empathizing when someone stays silent. NVC can help develop compassion for oneself, helping to grow rather than reinforcing self-hatred. It helps connect with feelings or needs arising from past actions, allowing for self-forgiveness.


NVC also helps in expressing anger by breaking the link between others' actions and our feelings. Instead of blaming others, we look inside ourselves to identify unmet needs. Making requests in clear, positive, concrete action language reveals what we really want. When angry, we choose to stop and take a breath, identify our judgments, and express our feelings and needs. To get someone to listen, we need to listen to them.

NVC-style conflict resolution focuses on establishing a connection between parties, allowing productive communication and understanding of each other's perspectives. It emphasizes listening to needs, providing empathy, and proposing strategies. Mediation should not be solely intellectual, but also involve playing different roles and avoiding punishment. It helps individuals recognize their feelings and needs and avoid repeating negative judgments. NVC also encourages expressing appreciation without unconscious judgment, avoiding negative compliments that can alienate. Instead, it encourages celebrating actions that enhance well-being and identifying the needs fulfilled by others. This approach helps to move people out of fixed positions and promotes a more positive and productive resolution.