Saturday, July 22, 2023

 

Ha ha - my post would not publish because genuine IaC conflicts with Google's policies, cheers.

Friday, July 21, 2023

 

Considerations when using private endpoint for communication between Azure Application Gateway and Azure App Services.

An App Service has a public, internet-facing endpoint. An Application Gateway can communicate with an app service over a service endpoint, which allows traffic from a specific subnet within the Azure Virtual Network to which the gateway is deployed and blocks everything else. Enabling the service endpoint on that subnet and setting an access restriction on the app service ensures that the communication is private and secure.

The above configuration can be achieved using various tools such as the CLI, the portal, and an SDK. A sample command would look like this:

az webapp config access-restriction add --resource-group myRG --name myWebApp --rule-name AppGwSubnet --priority 200 --subnet mySubNetName --vnet-name myVnetName

The service endpoint tags all the traffic leaving the subnet toward the app service with the specific subnet id.

A similar effect can be achieved with an alternative that uses a private endpoint, but there are a few things we need to establish first. We need to ensure that the application gateway can resolve the private IP address of the backend pool via DNS and that the hostname is overridden in the HTTP settings.

The gateway caches DNS lookup results, so if we use FQDNs and rely on the DNS lookup to get the private IP address, then we may need to restart the Application Gateway if the DNS record update or the link to the Azure private DNS zone was made after the backend pool was configured. The application gateway can be restarted by stopping and starting the instance with the commands shown below.

az network application-gateway stop --resource-group myRG --name myAppGw

az network application-gateway start --resource-group myRG --name myAppGw

Since it is customary for resources to be described succinctly in IaC for deployment, the following section illustrates how to do that.

Wednesday, July 19, 2023

 

Previous articles in this regard have discussed resolutions for shortcomings in the use of Infrastructure-as-Code (IaC) in various scenarios. This section discusses the resolution for the case where there are cascading resource locks.

Resources can be locked to prevent unexpected changes. A subscription, resource group, or resource can be locked to prevent other users from accidentally deleting or modifying critical resources. The lock overrides any permissions the users may have. The lock level can be set to CannotDelete or ReadOnly, with ReadOnly being more restrictive. When a lock is applied at a parent scope, all resources within that scope inherit the same lock. Some considerations still apply after locking. For example, a CannotDelete lock on a storage account does not prevent data within that account from being deleted. A ReadOnly lock on an application gateway prevents you from getting the backend health of the application gateway, because that operation uses POST. Only members of the Owner and User Access Administrator roles are granted access to Microsoft.Authorization/locks/* actions.

When the IaC is applied, it can be quite frustrating to find the resources locked in the public cloud, preventing the IaC actions from completing. For example, a resource might have a private endpoint, which in turn might be associated with a DNS zone and have a private network interface, and these sub-resources might be locked in a way that prevents the private endpoint from being deleted, which in turn fails the IaC run. The resolution for the owner of the subscription is to delete the lock from the said resource via the Azure Portal or the command-line interface and then proceed to apply the IaC, iterating over the 'apply' and 'unlock' steps until there are no further obstructions.

While this works for a role with elevated privileges, many developers using the credentials of the CI/CD pipeline to make changes to the subscription do not have that privilege and might find the experience harrowing to resolve without external intervention. One way they overcome this is by applying the unlock commands in a pipeline step prior to the application of the IaC. Fortunately, there are ways to unlock at the subscription-level scope rather than resource by resource. Even so, it might not be clear when the locks reappear, and the unlocking might need to be repeated. Checking the policies to make sure that locking is not enforced automatically, which in turn would interfere with infrastructure changes by code, is a good practice and one that can potentially reveal the intent behind the locking. If the locking were simply to prevent accidental deletions against a broad range of resources, then unlocking before applying the changes is straightforward.
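The iterate-over-apply-and-unlock workflow described above can be sketched as a small retry loop. In this sketch the `apply_iac` and `remove_lock` callables are hypothetical stand-ins for the pipeline's own deployment step and its lock-deletion step (e.g. a call to `az lock delete`); they are not a real API.

```python
# Sketch of the unlock-and-retry loop: attempt the deployment, and
# whenever it is blocked by a lock, remove that lock and try again.

class LockedResourceError(Exception):
    """Raised by the (hypothetical) apply step when a lock obstructs it."""
    def __init__(self, lock_id):
        super().__init__(f"blocked by lock {lock_id}")
        self.lock_id = lock_id

def apply_with_unlock(apply_iac, remove_lock, max_rounds=10):
    """Iterate over 'apply' and 'unlock' until no obstruction remains."""
    for _ in range(max_rounds):
        try:
            apply_iac()                   # attempt the IaC deployment
            return True                   # deployment completed
        except LockedResourceError as err:
            remove_lock(err.lock_id)      # clear the obstruction, retry
    return False                          # gave up after max_rounds
```

Bounding the number of rounds matters here: if a policy keeps reapplying the locks, the loop would otherwise never terminate.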


The above resolutions are easy when the error messages are descriptive and indicate that the failure of the IaC is exclusively due to locks. There are other forms of errors where the cause may not be straightforward. In such cases, the activity log on the resources or at the subscription level can be quite helpful, because the JSON content of a logged event explains exactly what happened. This feature is also helpful for knowing whether something transpired through actions other than the deployment of the infrastructure changes.

Tuesday, July 18, 2023

 IaC Resolutions Part 8: 

Previous articles in this regard have discussed resolutions for shortcomings in the use of Infrastructure-as-Code (IaC) in various scenarios. This section discusses the resolution for the case when changes to resources involve breaking a deadlock in state awareness between, say, a pair of resources.

Let us make a specific association between, say, a firewall and a network resource such as a gateway. The firewall must be associated with the gateway to screen traffic flowing through that appliance. While they remain associated, they remember the identifier and the state of each other. Initially, the firewall may remain in detection mode, where it is merely passive; it becomes active in prevention mode. When an attempt is made to toggle the modes, the association prevents it. Neither end of the association can tell what state to be in without exchanging information, and when they are deployed or updated in place, neither knows about nor informs the other.

There are two ways to overcome this limitation.  

First, a direction is established between the resources such that an update to one forcibly updates the state of the other. The gateway supports this when it allows the information to be written through by an update to the state of one resource.

Second, the changes are made by the IaC provider first to one resource and then to the other, so that the update to the second picks up the state of the first during its change. In this mode, the firewall can be activated after the gateway knows that there is such a firewall.

If the IaC tries to form an association while updating the state of only one, the other might end up with an inconsistent state. Either of the two resolutions above mitigates this.
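The second resolution, ordered updates, can be sketched as two update functions whose order of invocation matters. The dictionaries and field names here are illustrative stand-ins for the real resource states, not an actual provider API:

```python
# Sketch of the ordered-update resolution: the gateway must learn
# about the firewall before the firewall may leave passive detection
# mode. Resource states are modeled as plain dicts for illustration.

def update_gateway(gateway, firewall):
    # First update: the gateway records the firewall's identifier.
    gateway["firewall_id"] = firewall["id"]

def update_firewall(gateway, firewall):
    # Second update: the firewall reads the gateway's state and
    # activates only if the association is already in place;
    # otherwise it stays in passive detection mode.
    if gateway.get("firewall_id") == firewall["id"]:
        firewall["mode"] = "prevention"
```

Running `update_firewall` before `update_gateway` leaves the firewall passive, which is exactly the inconsistent state the ordering is meant to avoid.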

This is easy when there is a one-to-one relationship between resources. Sometimes there are one-to-many relationships. For example, a gateway might have more than a dozen app services as its backend members, and each member might allow public access. If the gateway must consolidate access to all the app services, then changes are required on the gateway to route traffic to each app service as intended by the client, and a restriction is required on the app services to allow only private access from the gateway.

Consider the sequence in which these changes must be made, given that the final operational state of the gateway is only acceptable when all the app services, barring none, remain reachable for a client through the gateway.

If the app services toggle their access from public to gateway-only before the gateway becomes operational, they incur some downtime, and the duration is not necessarily bounded if one app service fails to accept traffic from the gateway. The correct sequence involves first making the change on the gateway to set up proper routing, then restricting the app services to accept only the gateway. Finally, the gateway validates all the app service flows from a client before enabling them.

Each app service might have nuances about whether the gateway can reach it one way or another. Usually, if they are part of the same virtual network, this is not a concern; otherwise peering might be required. Even if peering is available, routing by address, resolution by name, or both might be required unless the services are universally reachable on the public internet. If public access is disabled, then private links must be established, and this might require action from both the gateway and the app service. Lastly, with each change, an app service must maintain its inbound and outbound rules properly for bidirectional communication, so some vetting is required on the app service side independent of the gateway.

Putting this all together via IaC requires that the changes be made in stages and each stage validated independently.
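The staged rollout above can be sketched as a function that configures gateway routing for every backend, validates each flow, and only then restricts the app services to gateway-only access. The names, the `validate` callable, and the string states are illustrative assumptions, not a real IaC provider API:

```python
# Sketch of the staged one-to-many rollout: route first, validate
# every flow, and restrict public access last, so that a validation
# failure aborts the rollout before any backend incurs downtime.

def staged_rollout(gateway_routes, app_services, validate):
    # Stage 1: the gateway routes traffic to every app service.
    for app in app_services:
        gateway_routes[app] = "configured"
    # Stage 2: validate every flow before restricting anything.
    if not all(validate(app) for app in app_services):
        return False                      # abort: no downtime incurred
    # Stage 3: restrict each app service to gateway-only access.
    return {app: "gateway-only" for app in app_services}
```

Ordering the restriction last is the point of the sketch: a backend that fails validation leaves all the others still publicly reachable.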

Monday, July 17, 2023

 Your guide to sustainability with cloud solutions.

Industry solutions that are newly implemented benefit from a set of principles that provide prescriptive guidance for improving the quality of their deployments. As the industry moves from digital adoption to digital transformation to digital acceleration, the sustainability journey requires a strong digital foundation. It is the best preparation for keeping pace with this rapid change.

This is true for meeting new sustainability requirements, avoiding the worst impacts of climate change, and addressing other business priorities such as driving growth, adapting to industry shifts, and navigating energy consumption and economic conditions. It helps to track and manage data at scale, unifying data and improving visibility across the organization.

As an aside, the well-architected framework consists of five pillars: reliability (REL), security (SEC), cost optimization (COST), operational excellence (OPS), and performance efficiency (PERF). The elements that support these pillars are a review; a cost and optimization advisor; documentation; patterns, support, and service offers; reference architectures; and design principles.

Sustainability is a journey. Cloud solutions can be developed by fostering growth and controlling costs while contributing to sustainability goals. The journey can be shaped by building resilience with an ESG strategy, resizing opportunities to control costs, improving efficiency with energy reduction, and tracking progress for environmental impact.

The well-architected framework helps to reliably report your sustainability impact, driving meaningful progress and finding gaps where the most impact can be delivered, but a leader's insights and perspectives can tune the sustainability investments to create opportunities that align with other goals. While some industry-tested learnings are brought into this article, meeting sustainability goals is a big win for an organization of any size. That is why a tiny country like Bhutan can claim to be a leader with its carbon-negative footprint. Companies that meet sustainability goals are favored by investors; consumer satisfaction and employee retention also improve, with clear indicators such as customers being willing to pay more for sustainable options. Knowing where we are today, setting future goals, and making data-driven decisions to create steps toward realizing those dreams are invaluable exercises for your blue ocean strategy.

Eventually, there will be regulations mandating similar work, but a determination of ESG metrics provides competitive differentiation for your solutions. Granted, there might be a cultural shift involved, not unlike the one experienced in adopting the cloud, but the potential for dramatic discoveries and strong storytelling can unite people behind the vision. Some tools, such as Microsoft Sustainability Manager, can be instrumental in realizing this vision, as they unify the data needed to monitor and manage environmental impact.

An ideal tool and central vantage point can assess the value of corporate sustainability. But an even more important consideration is that the scope for similar assessment and impact can be delegated to different organizational units and departments, assisting with incremental progress on the sustainability journey as those participants learn to self-evaluate their paths and metrics.

Sustainability is also about consumption, efficiency, and digitizing the supply chain. Transparency and tighter upstream and downstream collaboration are needed if cyclical efficiencies are to be discovered in products and processes. A single unified platform can help with visualizing the data from disparate devices and systems. Adopting recyclable and repairable software goes a long way toward sustainability, just as much as devices do. A strong digital foundation can drive both sustainability and transformational goals. The urgency, scope, and scale of the task cited in this article can help you with the journey from pledges to progress.

References: IaC Shortcomings and resolutions.

Sunday, July 16, 2023

 

Some methods of organization for large scale Infrastructure-as-a-Code deployments.

The purpose of IaC is to provide a dynamic, reliable, and repeatable infrastructure suitable for cases where manual approaches and management practices cannot keep up. When automation grows to the point of becoming a cloud-based service responsible for deploying cloud resources and stamps that provision other services that are diverse, consumer-facing, and generally available in the public cloud, some learnings can be called out that apply universally across a large spectrum of industry clouds.

A service that deploys other services must accept IaC deployment logic with templates, intrinsics, and deterministic execution, working much like any other workflow-management system. This helps determine the order in which to run the tasks and how to retry them. The tasks are self-described. The automation consists of a scheduler to trigger scheduled workflows and submit tasks to the executor, an executor to run the tasks, a web server for a management interface, a folder for the directed acyclic graphs representing the deployment-logic artifacts, and a metadata database to store state. The workflows don't restrict what can be specified as a task: it can be an Operator (a predefined task using, say, Python), a Sensor (which is entirely about waiting for an external event to happen), or a custom task specified via a Python function decorated with @task.
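The scheduler-and-executor arrangement above can be sketched as a toy model: tasks form a directed acyclic graph, run in dependency order, and are retried on failure. This is a minimal illustration using the standard library's `graphlib`, not a real workflow engine such as Airflow; the task and dependency names are hypothetical.

```python
# Toy model of deterministic DAG execution with retries: tasks run
# in topological (dependency) order, and each task is retried a
# fixed number of times before the run is abandoned.

from graphlib import TopologicalSorter

def run_dag(tasks, deps, retries=2):
    """tasks: name -> callable; deps: name -> set of prerequisite names."""
    completed = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(retries + 1):
            try:
                tasks[name]()              # execute the task
                completed.append(name)
                break
            except Exception:
                if attempt == retries:
                    raise                  # retries exhausted; abort the run
    return completed
```

In a real system the metadata database, not an in-memory list, would record the completed state so that a restarted run can resume where it left off.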

The organization of such artifacts poses two necessities. First, to leverage the built-in templates and deployment capabilities of the target IaC provider, as well as their packaging in the format suitable to the automation, which demands that certain declarations, phases, and sequences be called out. Second, the coordination of context-management switches between the automation service and the IaC provider. This involves a preamble and an epilogue for each context switch, for bookkeeping and state reconciliation.

This taught us that large IaC efforts are best served by uniform, consistent, and global naming conventions; registries that can be published by the system for cross-subscription and cross-region lookups; diligent parametrization at every scope, including hierarchies; leveraging dependency declarations; and reducing the need for scriptability in favor of system- and user-defined organizational units of templates. Supporting operations via read-only stores and frequently publishing continuous, up-to-date information on the rollout helps separate operations from the design and development of IaC.

IaC writers frequently find themselves in positions where the separation between pipeline automation and IaC declarations is not clean or self-contained, or requires extensive customization. One approach that has worked on this front is to make multiple passes over the development: one pass provides initial deployment capability, and another consolidates and applies best practices via refactoring and reusability. Making the development pass DevOps-based, feature-centric, and agile helps converge to a working solution, with learnings carried from iteration to iteration. The refactoring pass is more generational in nature; it provides cross-cutting perspectives and non-functional guarantees.

A library of routines, operators, data types, global parameters, and registries is almost inevitable in large-scale IaC deployments, but unlike the support for programming-language packages, these are often organically curated and self-maintained. By leveraging the tracking and versioning support of source control, it's possible to maintain compatibility as capabilities are made native to the IaC provider or automation service.

Reference: IaC shortcomings and resolutions.

Saturday, July 15, 2023

 

Improve workloads and solution deployments:

Industry solutions that are newly implemented benefit from a set of principles that provide prescriptive guidance for improving the quality of their deployments. As the industry moves from digital adoption to digital transformation to digital acceleration, the sustainability journey requires a strong digital foundation. It is the best preparation for keeping pace with this rapid change.

This is true for meeting new sustainability requirements, avoiding the worst impacts of climate change, and addressing other business priorities such as driving growth, adapting to industry shifts, and navigating energy consumption and economic conditions. It helps to track and manage data at scale, unifying data and improving visibility across the organization. This helps to reliably report your sustainability impact, driving meaningful progress and finding gaps where the most impact can be delivered.

The well-architected framework consists of five pillars: reliability (REL), security (SEC), cost optimization (COST), operational excellence (OPS), and performance efficiency (PERF). The elements that support these pillars are a review; a cost and optimization advisor; documentation; patterns, support, and service offers; reference architectures; and design principles.

This guidance provides a summary of how these principles apply to the management of the data workloads.
 
Cost optimization is one of the primary benefits of using the right tool for the right solution. It helps to analyze spend over time as well as the effects of scaling out and scaling up. An advisor can help improve reusability, on-demand scaling, and reduced data duplication, among many other things.

Performance is usually based on external factors and is very close to customer satisfaction. Continuous telemetry and reactiveness are essential to well-tuned performance. The shared environment controls for management and monitoring create alerts, dashboards, and notifications specific to the performance of the workload. Performance considerations include storage and compute abstractions, dynamic scaling, partitioning, storage pruning, enhanced drivers, and multilayer caching.

Operational excellence comes with security and reliability. Security and data management must be built into the system at every layer, for every application and workload. The data management and analytics scenario focuses on establishing a foundation for security. Although workload-specific solutions might be required, the foundation for security is built with the Azure landing zones and managed independently from the workload. Confidentiality and integrity of data, including privilege management, data privacy, and appropriate controls, must be ensured. Network isolation and end-to-end encryption must be implemented. SSO, MFA, conditional access, and managed service identities help secure authentication. Separation of concerns between the Azure control plane and data plane, as well as RBAC, must be enforced.

The key considerations for reliability are how to detect change and how quickly the operations can be resumed. The existing environment should also include auditing, monitoring, alerting and a notification framework.

In addition to all the above, some consideration may be given to improving individual service level agreements, redundancy of workload specific architecture, and processes for monitoring and notification beyond what is provided by the cloud operations teams.

Each pillar contains questions whose answers relate to technical and organizational decisions that are not directly related to the features of the software to be deployed. For example, software that allows people to post comments must honor use cases where some people can write and others can read. But the system developed must also be safe and sound enough to handle all the traffic and should incur reasonable costs.

Since the most crucial pillars are OPS and SEC, they should never be traded off to get more out of the other pillars.

The security pillar consists of Identity and access management, detective controls, infrastructure protection, data protection and incident response. Three questions are routinely asked for this pillar: How is the access controlled for the serverless api? How are the security boundaries managed for the serverless application? How is the application security implemented for the workload?

The operational excellence pillar is made up of four parts: organization, preparation, operation, and evolution. The questions that drive the decisions for this pillar include: How is the health of the serverless application known? How is the application lifecycle management approached?

The reliability pillar is made of three parts: foundations, change management, and failure management. The questions asked for this pillar include: How are the inbound request rates regulated? How is resiliency built into the serverless application?

The cost optimization pillar consists of five parts: cloud financial management practice, expenditure and usage awareness, cost-effective resources, demand management and resources supply, and optimizations over time. The questions asked for cost optimization include: How are the costs optimized?

The performance efficiency pillar is composed of four parts: selection, review, monitoring and tradeoffs. The questions asked for this pillar include:  How is the performance optimized for the serverless application?

In addition to these questions, there are quite a lot of opinionated and even authoritative perspectives on the appropriateness of a framework, often referred to as lenses. With these forms of guidance, a well-architected framework moves closer to an optimized realization.