Tuesday, July 18, 2023

 IaC Resolutions Part 8: 

Previous articles in this series have discussed resolutions for shortcomings in the use of Infrastructure-as-Code (IaC) in various scenarios. This section discusses the resolution for the case where changes to resources require breaking a deadlock in state awareness between, say, a pair of resources. 

Let us make a specific association between, say, a firewall and a network resource such as a gateway. The firewall must be associated with the gateway so that it can inspect and, when needed, prevent traffic flowing through that appliance. While they remain associated, they remember each other's identifier and state. Initially, the firewall may remain in detection mode, where it is merely passive; it becomes active in prevention mode. When an attempt is made to toggle the modes, the association prevents it. Neither end of the association can tell what state to be in without exchanging that information, and when they are deployed or updated in place, neither knows about or informs the other. 

There are two ways to overcome this limitation.  

First, a direction can be established between the resources so that an update to one forcibly updates the state of the other. The gateway supports this when it allows the information to be written through as part of an update to the state of one resource.  

Second, the IaC provider makes the changes first to one resource and then to the other, so that the update to the second picks up the state of the first during its change. In this mode, the firewall can be activated only after the gateway knows that such a firewall exists.

If the IaC provider tries to form the association while updating the state of only one resource, the other might end up with an inconsistent state. Either of the two resolutions above mitigates this. 
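
The second resolution amounts to an explicit ordering of the two updates, sketched below. The update_gateway and update_firewall callables are hypothetical placeholders for whatever the IaC provider or automation actually exposes; only the ordering is the point.

# hypothetical sketch of the ordered pairwise update described above
def reconcile_pair(update_gateway, update_firewall):
    # step 1: update the gateway first so that it records the firewall association
    gateway_state = update_gateway(firewall_attached=True)
    # step 2: only then toggle the firewall from detection to prevention,
    # handing it the gateway state so neither side is left inconsistent
    update_firewall(mode="prevention", gateway_state=gateway_state)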

This is easy when there is a one-to-one relationship between resources. Sometimes there are one-to-many relationships. For example, a gateway might have more than a dozen app services as its backend members, and each member might allow public access. If the gateway must consolidate access to all the app services, then changes are required on the gateway to route traffic to each app service as intended by the client, along with a restriction on the app services to allow only private access from the gateway. 

Consider the sequence in which these changes must be made, given that the final operational state of the gateway is acceptable only when all of the app services, barring none, remain reachable for a client through the gateway. 

If the app services toggle their access from public to gateway-only before the gateway becomes operational, they incur some downtime, and its duration is not necessarily bounded if one app service fails to listen to the gateway. The correct sequence is to first make the change on the gateway to set up proper routing, then restrict the app services to accept traffic only from the gateway, and finally have the gateway validate all the app service flows from a client before enabling them. 

Each app service might have nuances about whether the gateway can reach it one way or another. Usually, if they are part of the same vnet, this is not a concern; otherwise, peering might be required. Even if peering is made available, routing by address, resolution by name, or both might be required unless the services are universally known on the world wide web. If public access is disabled, then private links must be established, and this might require changes on both the gateway and the app service. Lastly, with each change, an app service must maintain its inbound and outbound configuration properly for bidirectional communication, so some vetting is required on the app service side independent of the gateway. 

Putting this all together via IaC requires that the changes be made in stages, with each stage validated independently. 
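
The staged rollout can be sketched as below. The configure_route, restrict_to_gateway, validate_flow, and enable_flow calls are hypothetical placeholders rather than a real SDK; only the ordering and the per-stage validation mirror the sequence described above.

# hypothetical staging sketch for the one-to-many gateway and app service case
def staged_rollout(gateway, app_services):
    # stage 1: the gateway learns how to route to every backend member
    for app in app_services:
        gateway.configure_route(app)
    # stage 2: each app service switches from public to gateway-only access
    for app in app_services:
        app.restrict_to_gateway(gateway)
    # stage 3: validate each client flow through the gateway before enabling it
    for app in app_services:
        if not gateway.validate_flow(app):
            raise RuntimeError(f"flow validation failed for {app}")
        gateway.enable_flow(app)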

Monday, July 17, 2023

 Your guide to sustainability with cloud solutions.

Solutions that are newly implemented for the industry benefit from a set of principles that provide prescriptive guidance for improving the quality of their deployments. As the industry moves from digital adoption to digital transformation to digital acceleration, the sustainability journey requires a strong digital foundation. It is the best preparation for keeping pace with this rapid change.

This is true for meeting new sustainability requirements, avoiding the worst impacts of climate change, and addressing other business priorities such as driving growth, adapting to industry shifts, and navigating energy consumption and economic conditions. A strong digital foundation helps to track and manage data at scale, unifying data and improving visibility across the organization.

As an aside, the well-architected framework consists of five pillars. These are reliability (REL), security (SEC), cost optimization (COST), operational excellence (OPS), and performance efficiency (PERF). The elements that support these pillars are a review; a cost and optimization advisor; documentation; patterns, support, and service offers; reference architectures; and design principles.

Sustainability is a journey. Cloud solutions can be developed by fostering growth and controlling costs while contributing to sustainability goals. The journey can be shaped by building resilience with an ESG strategy, resizing opportunities to control costs, improving efficiency with energy reduction, and tracking progress for environmental impact.

The well-architected framework helps to reliably report your sustainability impact, driving meaningful progress and finding gaps where the most impact can be delivered, but a leader's insights and perspectives can tune the sustainability investments to create opportunities that align with other goals. While some industry-tested learnings are brought into this article, meeting sustainability goals is a big win for an organization of any size. That is why a tiny country like Bhutan can claim to be a leader with its carbon-negative footprint. Companies that meet sustainability goals are favored by investors, and consumer satisfaction and employee retention also improve, with clear indicators such as customers being willing to pay more for sustainable options. Knowing where we are today, setting future goals, and making data-driven decisions to create steps towards realizing those dreams are invaluable exercises for your blue ocean strategy.

Eventually, there will be regulations mandating similar work, but a determination of the ESG metrics provides competitive differentiation for your solutions. Granted, there might be a cultural shift involved, not unlike the one experienced in adopting the cloud, but the potential for dramatic discoveries and strong storytelling can unite people behind the vision. Tools such as the Microsoft Sustainability Manager can be instrumental in realizing this vision, as they unify the data needed to monitor and manage environmental impact.

An ideal tool and central vantage point can assess the value of corporate sustainability. An even more important consideration is that the scope for similar assessment and impact can be delegated to different organizational units and departments, assisting with incremental progress on the sustainability journey as those participants learn to self-evaluate their path and metrics.

Sustainability is also about consumption, efficiency, and digitizing the supply chain. Transparency and tighter upstream and downstream collaboration are needed if cyclical efficiencies are to be discovered in products and processes. A single unified platform can help with visualizing the data from disparate devices and systems. Adopting recyclable and repairable software goes a long way towards sustainability, just as much as devices do. A strong digital foundation can drive both sustainability and transformational goals. The urgency, scope, and scale of the task cited in this article can help you with the journey from pledges to progress.

References: IaC Shortcomings and resolutions.

Sunday, July 16, 2023

 

Some methods of organization for large-scale Infrastructure-as-Code (IaC) deployments.

The purpose of IaC is to provide a dynamic, reliable, and repeatable infrastructure suitable for cases where manual approaches and management practices cannot keep up. When automation grows to the point of becoming a cloud-based service responsible for deploying cloud resources and stamps that provision other services, which are diverse, consumer-facing, generally available public cloud services, some learnings can be called out that apply universally across a large spectrum of industry clouds.

A service that deploys other services must accept IaC deployment logic with templates, intrinsics, and deterministic execution, and it works much like any other workflow management system. This helps determine the order in which to run the tasks and how to retry them. The tasks are self-described. The automation consists of a scheduler to trigger scheduled workflows and submit tasks to the executor, an executor to run the tasks, a web server for a management interface, a folder for the directed acyclic graphs (DAGs) representing the deployment logic artifacts, and a metadata database to store state. The workflows do not restrict what can be specified as a task, which can be an Operator or a predefined task using, say, Python; a Sensor, which is entirely about waiting for an external event to happen; or a custom task specified via a Python function decorated with @task.
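
This description closely matches Apache Airflow, so a minimal sketch in the style of its TaskFlow API is shown below. The deployment steps are hypothetical placeholders for calls to the IaC provider, and the snippet assumes a recent Airflow 2.x installation.

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 7, 1), catchup=False)
def deploy_stamp():
    @task
    def deploy_gateway() -> str:
        # hypothetical call to the IaC provider; returns the gateway identifier
        return "gw-001"

    @task
    def deploy_app_service(gateway_id: str) -> str:
        # the data dependency on gateway_id lets the scheduler order the tasks
        return f"app-behind-{gateway_id}"

    @task
    def validate(app_id: str) -> None:
        print(f"validating routing for {app_id}")

    # chaining the calls declares the DAG: gateway -> app service -> validation
    validate(deploy_app_service(deploy_gateway()))

deploy_stamp()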

The organization of such artifacts posed two necessities. The first is to leverage the built-in templates and deployment capabilities of the target IaC provider, as well as their packaging in the format suitable for the automation, which demands certain declarations, phases, and sequences to be called out. The second is the coordination of context switches between the automation service and the IaC provider. This involves a preamble and an epilogue to each context switch for bookkeeping and state reconciliation.

This taught us that authors of large IaC deployments are best served by uniform, consistent, and global naming conventions; registries that can be published by the system for cross-subscription and cross-region lookups; diligent parametrization at every scope, including hierarchies; dependency declarations; and a reduced need for scriptability in favor of system- and user-defined organizational units of templates. Leveraging supportability via read-only stores and frequently publishing continuous, up-to-date information on the rollout helps relieve operations from the design and development of IaC.
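
As a small illustration of the global-naming and parametrization idea, a single helper applied at every scope keeps names predictable and makes cross-subscription or cross-region registry lookups mechanical. The scopes and abbreviations below are hypothetical examples, not an established convention of any particular provider.

def resource_name(org: str, env: str, region: str, workload: str, rtype: str) -> str:
    # one uniform pattern applied everywhere, fed entirely by scope parameters
    return f"{org}-{env}-{region}-{workload}-{rtype}".lower()

print(resource_name("contoso", "prod", "eastus2", "payments", "agw"))  # contoso-prod-eastus2-payments-agw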

IaC writers frequently find themselves in positions where the separation between pipeline automation and IaC declarations is not clean or self-contained, or requires extensive customization. One approach that has worked on this front is to make multiple passes on the development, with one pass providing initial deployment capability and another pass consolidating and providing best practices via refactoring and reusability. Making the development pass DevOps-based, feature-centric, and agile helps converge on a working solution, with learnings carried from iteration to iteration. The refactoring pass is more generational in nature; it provides cross-cutting perspectives and non-functional guarantees.

A library of routines, operators, data types, global parameters, and registries is almost inevitable with large-scale IaC deployments, but unlike packages for programming languages, these are often organically curated and self-maintained. By leveraging the tracking and versioning support of source control, it is possible to preserve compatibility as capabilities are made native to the IaC provider or the automation service.

Reference: IaC shortcomings and resolutions.

Saturday, July 15, 2023

 

Improve workloads and solution deployments:

Solutions that are newly implemented for the industry benefit from a set of principles that provide prescriptive guidance for improving the quality of their deployments. As the industry moves from digital adoption to digital transformation to digital acceleration, the sustainability journey requires a strong digital foundation. It is the best preparation for keeping pace with this rapid change.

This is true for meeting new sustainability requirements, avoiding the worst impacts of climate change, and addressing other business priorities such as driving growth, adapting to industry shifts, and navigating energy consumption and economic conditions. A strong digital foundation helps to track and manage data at scale, unifying data and improving visibility across the organization. This helps to reliably report your sustainability impact, driving meaningful progress and finding gaps where the most impact can be delivered.

The well-architected framework consists of five pillars. These are reliability (REL), security (SEC), cost optimization (COST), operational excellence (OPS), and performance efficiency (PERF). The elements that support these pillars are a review; a cost and optimization advisor; documentation; patterns, support, and service offers; reference architectures; and design principles.

This guidance provides a summary of how these principles apply to the management of the data workloads.
 
Cost optimization is one of the primary benefits of using the right tool for the right solution. It helps to analyze the spend over time as well as the effects of scale-out and scale-up. An advisor can help improve reusability, on-demand scaling, and reduced data duplication, among many other factors.

Performance is usually driven by external factors and is closely tied to customer satisfaction. Continuous telemetry and reactiveness are essential to well-tuned performance. The shared environment controls for management and monitoring create alerts, dashboards, and notifications specific to the performance of the workload. Performance considerations include storage and compute abstractions, dynamic scaling, partitioning, storage pruning, enhanced drivers, and multilayer caching.

Operational excellence comes with security and reliability. Security and data management must be built into the system at every layer, for every application and workload. The data management and analytics scenario focuses on establishing a foundation for security. Although workload-specific solutions might be required, the foundation for security is built with the Azure landing zones and managed independently from the workload. Confidentiality and integrity of data, including privilege management, data privacy, and appropriate controls, must be ensured. Network isolation and end-to-end encryption must be implemented. SSO, MFA, conditional access, and managed service identities are involved in securing authentication. Separation of concerns between the Azure control plane and data plane, as well as RBAC access control, must be used.

The key considerations for reliability are how to detect change and how quickly the operations can be resumed. The existing environment should also include auditing, monitoring, alerting and a notification framework.

In addition to all the above, some consideration may be given to improving individual service level agreements, redundancy of workload specific architecture, and processes for monitoring and notification beyond what is provided by the cloud operations teams.

Each pillar contains questions whose answers relate to technical and organizational decisions that are not directly related to the features of the software to be deployed. For example, software that allows people to post comments must honor use cases where some people can write and others can read. But the system must also be safe and sound enough to handle all the traffic and should incur reasonable costs.

Since the most crucial pillars are OPS and SEC, they should never be traded away to get more out of the other pillars.

The security pillar consists of identity and access management, detective controls, infrastructure protection, data protection, and incident response. Three questions are routinely asked for this pillar: How is access controlled for the serverless API? How are the security boundaries managed for the serverless application? How is application security implemented for the workload?

The operational excellence pillar is made up of four parts: organization, preparation, operation, and evolution. The questions that drive the decisions for this pillar include: How is the health of the serverless application known? How is the application lifecycle management approached?

The reliability pillar is made of three parts: foundations, change management, and failure management. The questions asked for this pillar include: How are the inbound request rates regulated? How is resiliency built into the serverless application?

The cost optimization pillar consists of five parts: cloud financial management practice, expenditure and usage awareness, cost-effective resources, demand management and resources supply, and optimizations over time. The questions asked for cost optimization include: How are the costs optimized?

The performance efficiency pillar is composed of four parts: selection, review, monitoring and tradeoffs. The questions asked for this pillar include:  How is the performance optimized for the serverless application?

In addition to these questions, there are quite a lot of opinionated and even authoritative perspectives on the appropriateness of a framework, and these are often referred to as lenses. With these forms of guidance, a well-architected framework moves closer to an optimized realization.

 

 

 

Thursday, July 13, 2023

Path-based routing and rewrite rule sets in Application Gateway.

 


One of the frequently encountered issues with an app service is incorrect redirect URLs. This might mean that the app service URL is exposed in a browser when there is a redirection, or that the authentication and authorization for the app service are broken because of a redirect with the wrong hostname. The root cause of this symptom is a setup that overrides the hostname used by the application gateway towards the app service into a different hostname from the one seen by the browser. This hostname is typically overridden to the default “azurewebsites.net” domain. In such cases, the application gateway must either have “pick hostname from backend address” enabled in its HTTP settings or have “Override with specific domain name” set to a value different from what the browser request has.

The typical resolution for this symptom is to rewrite the location header. The application gateway’s domain name is set as the host name in the location header. A rewrite rule is added to do just that and specified with a condition that evaluates if the location header in the response contains azurewebsites.net.

When the Azure App Service sends a redirection response, it uses the same hostname in the location header of its response as the one in the request it receives from the application gateway. This makes a client redirect to contoso.azurewebsites.net/path2 instead of going through the application gateway. This bypass is not desirable. Setting the location header to the application gateway's domain name resolves this situation.

The steps for replacing the hostname are as follows:

First, a rewrite rule is written with a condition that evaluates if the location header in the response contains azurewebsites.net. The pattern for this can be specified in PCRE format as (https?):\/\/.*azurewebsites\.net(.*)$

Second, the location header is rewritten to include the application gateway's hostname. This is done by entering {http_resp_Location_1}://contoso.com{http_resp_Location_2} as the header value. Alternatively, we can also use the server variable ‘host’ to set the hostname to match the original request.

This manifests as an If-Then condition-action pair. The If section specifies the type of variable to check as an HTTP header, the header type as Response, the header name as a Common Header, the name of the common header as Location, case sensitivity turned off, the operator to evaluate as equals, and the pattern to match as (https?):\/\/.*azurewebsites\.net(.*)$. The Then section specifies the action type as Set, the header type as Response, the header name as Common Header, the name of the common header as Location, and the header value as {http_resp_Location_1}://contoso.com{http_resp_Location_2}.
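
As a quick local check, plain Python (not gateway configuration) can show how the two capture groups in this pattern map to {http_resp_Location_1} and {http_resp_Location_2}; the contoso names are the same illustrative values used above.

import re

pattern = re.compile(r"(https?):\/\/.*azurewebsites\.net(.*)$", re.IGNORECASE)
location = "https://contoso.azurewebsites.net/path2"

match = pattern.match(location)
if match:
    scheme, path = match.group(1), match.group(2)  # capture groups 1 and 2
    rewritten = f"{scheme}://contoso.com{path}"    # mirrors the Then action's header value
    print(rewritten)                               # https://contoso.com/path2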

This completes the rewrite. Care must be taken to ensure the appropriate use of rewrite versus redirect.

A URL is rewritten by the application gateway before it is sent to the backend. It won’t change what the user sees in the browser because the changes are hidden from the user. A URL is redirected by the application gateway to a new URL and sent as a response to the client. The URL that the user sees in the browser will update to a new URL.

Wednesday, July 12, 2023

 

Overwatch implementation involves cluster configuration, Overwatch configuration, and a run. The steps outlined in this article are a guide to realizing cluster utilization reports from an Azure Databricks instance. It starts with some concepts as an overview and the context in which the steps are performed, followed by the listing of the steps, and closes with running and viewing the Overwatch jobs and reports. This is a continuation of the previous article on similar topics.

Overwatch can be taken as an analytics project over Databricks. It collects data from multiple data sources such as APIs and cluster logs, enriches and aggregates the data, and comes with little or no cost. The audit logs and cluster logs are the primary data sources, but the cluster logs are crucial to getting the cluster utilization data. Overwatch requires a dedicated storage account for these, and a time-to-live must be enabled so that retention does not grow to incur unnecessary costs. The cluster logs must not be stored on DBFS directly but can reside on an external store. When there are different Databricks workspaces, numbered say 1 to N, each workspace pushes its diagnostic data to the EventHubs and writes its cluster logs to the per-region dedicated storage account. One of the Databricks workspaces is chosen to deploy Overwatch. The Overwatch jobs read the storage account and the EventHub diagnostic data to create bronze, silver, and gold data pipelines, which can be read from anywhere for the reports.

The steps involved in the Overwatch configuration include the following:

1. Create a storage account.
2. Create an Azure Event Hub namespace.
3. Store the Event Hub connection string in a KeyVault.
4. Enable diagnostic settings in the Databricks instance for the event hub.
5. Store the Databricks PAT token in the KeyVault.
6. Create a secret scope.
7. Use the Databricks Overwatch notebook from the link and replace the parameters.
8. Configure the storage account within the workspace.
9. Create the cluster, add the Maven libraries com.databricks.labs:overwatch and com.microsoft.azure:azure-eventhubs-spark to the cluster, and run the Overwatch notebook.

There are a few elaborations to the above steps that can be called out; otherwise the steps are routine. All the Azure resources can be created with default settings. The connection string for the EventHub is stored in the KeyVault as a secret. The personal access token, aka PAT token, created from Databricks is also stored in the KeyVault as a secret. The PAT token is created from the user settings of the Azure Databricks instance. A secret scope is created to import the token back from the KeyVault into Databricks. A cluster is created to run the Databricks job. The two Maven libraries are added to the Databricks cluster's libraries. The logging tab of the advanced options in the cluster's configuration allows us to specify a DBFS location pertaining to the external storage account we created to store the cluster logs. The Azure navigation bar for the Azure Databricks instance allows the diagnostic settings data to be sent to the EventHub.

The notebook that describes the Databricks jobs for Overwatch takes the above configuration as parameters, including the DBFS location for the cluster logs target, the Extract-Transform-Load (ETL) database name that stores the tables used for the dashboard, the consumer database name, the secret scope, the secret key for the PAT token, the secret key for the EventHub, the topic name in the EventHub, the primordial date to start Overwatch, the maximum number of days to bound the data, and the scopes.
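
As a hedged sketch, the parameter values might be assembled in the notebook as below. The dbutils.widgets and dbutils.secrets utilities are standard in Databricks notebooks, but the widget names, secret names, and paths shown here are hypothetical placeholders; the exact parameter names should be taken from the Overwatch notebook itself.

# runs inside a Databricks notebook, where dbutils is available
etl_db = dbutils.widgets.get("etlDatabaseName")                       # e.g. "overwatch_etl" (hypothetical)
consumer_db = dbutils.widgets.get("consumerDatabaseName")             # e.g. "overwatch" (hypothetical)
secret_scope = dbutils.widgets.get("secretScope")                     # scope backed by the KeyVault
pat_token = dbutils.secrets.get(scope=secret_scope, key="overwatch-pat")            # PAT token secret
eh_connection = dbutils.secrets.get(scope=secret_scope, key="eventhub-connection")  # EventHub connection string
cluster_log_path = "dbfs:/mnt/cluster-logs"                           # external storage path for cluster logs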

Overwatch provides both summary and drill-down options to understand the operations of a Databricks instance. It has two primary modes: Historical and Real-time. It coalesces all the logs produced by Spark and Databricks via a periodic job run and then enriches this data through various API calls. The jobs from the notebook create the configuration string with OverwatchParams. Most functionalities can be realized by instantiating the workspace object with these OverwatchParams. It provides two tables, the dbuCostDetails table and the instanceDetails table, which can then be used for reports.
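
As a minimal reporting sketch, the two tables named above can be read from the consumer database in a Databricks notebook; the database name here is a hypothetical example, and the actual name is whatever was supplied as the consumer database parameter.

# inside a Databricks notebook, where spark and display are available
consumer_db = "overwatch"  # hypothetical consumer database name
dbu_costs = spark.table(f"{consumer_db}.dbuCostDetails")    # DBU cost details produced by Overwatch
instances = spark.table(f"{consumer_db}.instanceDetails")   # instance details produced by Overwatch
display(dbu_costs.limit(10))
display(instances.limit(10))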

Tuesday, July 11, 2023

The previous article discussed Azure Databricks and compute cluster usage monitoring. Several ways were suggested.

First, the user interface provided a way to view usage by cluster by navigating to the CPU utilization metrics reported for each compute resource. It was based on the total CPU seconds' cost. The metric was averaged over the chosen time interval, which had a default value of an hour.

Second, there is a PowerShell scripting library that allows the use of the REST APIs to query the Databricks workspace directly. For this, the suggested access token was the user's personal access token (PAT), and the cluster's ID had to be passed in as a request parameter. The response of this API was on par with the information from the user interface and provided a scriptable approach.

Since this might not provide visibility across the Databricks instance, the use of REST APIs directly against the Azure resource was suggested. The authentication for this API required the Microsoft login servers to issue a token. While this provided properties and metadata for the resource and supported general manageability, the information pertaining to cluster usage was rather limited. Still, it was possible to query the last-used time to give an indication of whether the compute can be shut down for a workspace user.

Databricks itself provides a set of REST APIs, and these give information on the last-used time; if the duration since the last usage exceeds a certain threshold, the compute can be stopped. This threshold is referred to as the auto-termination time, and values on the order of minutes are sufficient. The default value is 72 hours, but that may be quite extravagant.
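
A hedged sketch of calling the Databricks Clusters REST API with a PAT token to list clusters and inspect their auto-termination settings is shown below. The workspace URL and token are placeholders, and the availability of some fields, such as last_activity_time, can vary with the API version.

import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                                     # placeholder PAT token

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for cluster in resp.json().get("clusters", []):
    print(
        cluster.get("cluster_id"),
        cluster.get("state"),
        cluster.get("autotermination_minutes"),  # 0 means auto termination is disabled
        cluster.get("last_activity_time"),       # epoch milliseconds, when present
    )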

Third, there is an SDK available from Databricks that can be called from Python to make the above step easier. This option makes it tidy to create charts and graphs from the cluster usage data and works well for reporting purposes. Samples can include listing the budgets, downloading billable usage, or directly querying the clusters:

from databricks.sdk import AccountClient

# account-level client; credentials are picked up from the environment or a config profile
a = AccountClient()

# list the budgets configured for the account
budgets = a.budgets.list()

# download the billable usage report for a range of months
a.billable_usage.download(start_month="2023-06", end_month="2023-07")

One of the benefits of using the SDK is that we are no longer required to concern ourselves with setting up the Unity Catalog for querying the system tables ourselves. The SDK can provide authoritative information.

Similarly, another convenience for querying cluster usage is Overwatch. It provides both summary and drill-down options to understand the operations of a Databricks instance. It has two primary modes: Historical and Real-time. It coalesces all the logs produced by Spark and Databricks via a periodic job run and then enriches this data through various API calls. Most functionalities can be realized by instantiating the workspace object with suitable OverwatchParams. It provides two tables, the dbuCostDetails table and the instanceDetails table.