Monday, September 11, 2023

 

Table of Contents:

Overview

Layout

Usage:

Result:

Overview

Overwatch reporting involves cluster configuration, Overwatch configuration, and job runs. The steps outlined in this article are a guide to realizing cluster utilization reports from an Azure Databricks instance. It starts with an overview of the concepts and the context in which the steps are performed, followed by the steps themselves, and closes with the running and viewing of the Overwatch jobs and reports.

Overwatch can be thought of as an analytics project over Databricks. It collects data from multiple sources such as APIs and cluster logs, enriches and aggregates that data, and comes with little or no cost. The audit logs and cluster logs are the primary data sources, and the cluster logs in particular are crucial for cluster utilization data. Overwatch requires a dedicated storage account for these, and a time-to-live must be enabled so that retention does not grow and incur unnecessary costs. The cluster logs must not be stored on DBFS directly but can reside on an external store. When there are several Databricks workspaces, numbered say 1 to N, each workspace pushes its diagnostic data to Event Hubs and writes its cluster logs to the per-region dedicated storage account. One of the Databricks workspaces is chosen to deploy Overwatch. The Overwatch jobs read the storage account and the Event Hub diagnostic data to create bronze, silver and gold data pipelines which can be read from anywhere for the reports.

The steps involved in the Overwatch configuration include the following: 

1. Create a storage account 

2. Create an Azure Event Hub namespace 

3. Store the Event Hub connection string in a KeyVault 

4. Enable Diagnostic settings in the Databricks instance for the event hub 

5. Store the Databricks PAT token in the KeyVault 

6. Create a secret scope 

7. Use the Databricks Overwatch notebook from https://databrickslabs.github.io/overwatch/deployoverwatch/runningoverwatch/notebook/ and replace the parameters 

8. Configure the storage account within the workspace. 

9. Create the cluster, add the Maven libraries com.databricks.labs:overwatch and com.microsoft.azure:azure-eventhubs-spark to the cluster, and run the Overwatch notebook.

There are a few elaborations to the above steps worth calling out; otherwise the steps are routine. All the Azure resources can be created with default settings. The connection string for the Event Hub is stored in the KeyVault as a secret. The personal access token, aka PAT token, is created from the user settings of the Azure Databricks instance and is also stored in the KeyVault as a secret. A secret scope is created to import the token back from the KeyVault into Databricks. A cluster is created to run the Databricks job, and the two Maven libraries are added to that cluster's libraries. The logging tab of the advanced options in the cluster's configuration allows us to specify a DBFS location pertaining to the external storage account we created to store the cluster logs. The Diagnostic settings blade of the Azure Databricks instance allows the diagnostic data to be sent to the Event Hub.
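
For reference, steps 1 through 6 can also be scripted. The following is a minimal sketch using the Azure CLI and the legacy Databricks CLI; every name, region, identifier and secret value below is a placeholder, and exact flags can vary by CLI version.

# Sketch of steps 1-6 with the Azure CLI; names and IDs are placeholders.
az group create --name rg-overwatch --location eastus

# 1. Dedicated storage account for the cluster logs and Overwatch output
az storage account create --name stoverwatchlogs --resource-group rg-overwatch --sku Standard_LRS

# 2. Event Hubs namespace and hub for the Databricks diagnostic stream
az eventhubs namespace create --name eh-overwatch-ns --resource-group rg-overwatch --location eastus
az eventhubs eventhub create --name databricks-diagnostics --namespace-name eh-overwatch-ns --resource-group rg-overwatch

# 3 and 5. Store the Event Hub connection string and the Databricks PAT token as Key Vault secrets
az keyvault create --name kv-overwatch --resource-group rg-overwatch
az keyvault secret set --vault-name kv-overwatch --name eh-connection-string --value "<event-hub-connection-string>"
az keyvault secret set --vault-name kv-overwatch --name databricks-pat --value "<databricks-pat-token>"

# 4. Send the Databricks diagnostic categories to the Event Hub
az monitor diagnostic-settings create \
  --name overwatch-diagnostics \
  --resource "<databricks-workspace-resource-id>" \
  --event-hub databricks-diagnostics \
  --event-hub-rule "<event-hub-namespace-authorization-rule-id>" \
  --logs '[{"category":"clusters","enabled":true},{"category":"jobs","enabled":true},{"category":"accounts","enabled":true}]'

# 6. Key Vault-backed secret scope (legacy Databricks CLI syntax)
databricks secrets create-scope --scope overwatch \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id "<key-vault-resource-id>" \
  --dns-name "https://kv-overwatch.vault.azure.net/"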

The notebook that describes the Databricks jobs for Overwatch takes the above configuration as parameters. These include the DBFS location that is the target for the cluster logs, the Extract-Transform-Load (ETL) database name which stores the tables used for the dashboard, the consumer database name, the secret scope, the secret key for the PAT token, the secret key for the Event Hub connection string, the topic name in the Event Hub, the primordial date from which Overwatch starts, the maximum number of days to bound the data, and the scopes.
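
As a hedged illustration of steps 8 and 9 together with these parameters, the cluster, the Maven libraries, the cluster-log location and the notebook parameters can also be expressed as a job specification and submitted with the legacy Databricks CLI (Jobs API 2.0 format). The library versions, notebook path and parameter names below are placeholders that mirror the list above; consult the Overwatch notebook for the exact widget names.

# Hypothetical job spec for the Overwatch runner notebook; all names, versions and paths are placeholders.
cat > overwatch-job.json <<'EOF'
{
  "name": "overwatch-nightly",
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "cluster_log_conf": { "dbfs": { "destination": "dbfs:/cluster-logs" } }
  },
  "libraries": [
    { "maven": { "coordinates": "com.databricks.labs:overwatch_2.12:<version>" } },
    { "maven": { "coordinates": "com.microsoft.azure:azure-eventhubs-spark_2.12:<version>" } }
  ],
  "notebook_task": {
    "notebook_path": "/Shared/OverwatchRunner",
    "base_parameters": {
      "etlDBName": "overwatch_etl",
      "consumerDBName": "overwatch",
      "secretsScope": "overwatch",
      "dbPATKey": "databricks-pat",
      "ehKey": "eh-connection-string",
      "ehName": "databricks-diagnostics",
      "primordialDateString": "2023-01-01",
      "maxDaysToLoad": "60",
      "scopes": "all"
    }
  }
}
EOF
databricks jobs create --json-file overwatch-job.json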

Overwatch provides both summary and drill-down options to understand the operations of a Databricks instance. It has two primary modes: Historical and Real-time. It coalesces all the logs produced by Spark and Databricks via a periodic job run and then enriches this data through various API calls. The jobs from the notebook create the configuration string with OverwatchParams, and most functionality can be realized by instantiating the workspace object with these OverwatchParams. Overwatch provides two tables, the dbuCostDetails table and the instanceDetails table, which can then be used for reports.

Layout

(Screenshot: Overwatch deployment layout)

Usage:

(Screenshot: Overwatch usage)

Result:

Overwatch creates two tables as shown:

(Screenshot: the dbuCostDetails and instanceDetails tables)

 

These tables can be read from the CLI, SDKs, and a variety of clients.
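
As a hedged example, the tables can be queried over the Databricks SQL Statement Execution REST API; the host, token, warehouse id and database name below are placeholders, and the sketch assumes a SQL warehouse is available in the workspace.

# Query the Overwatch output tables over the SQL Statement Execution API (placeholders throughout).
curl -s -X POST "https://<databricks-instance>/api/2.0/sql/statements" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "warehouse_id": "<sql-warehouse-id>",
        "statement": "SELECT * FROM overwatch_etl.dbuCostDetails LIMIT 10",
        "wait_timeout": "30s"
      }'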

 

Reference: demonstration with deployable IaC: dbx-overwatch.zip

 

 

Sunday, September 10, 2023

 

A sample GitHub Actions workflow that can run periodically to detect state drift:

name: 'Terraform Configuration Drift Detection'
on:
  workflow_dispatch:
  schedule:
    - cron: '00 2 * * *' # runs nightly at 2:00 am
permissions:
  id-token: write
  contents: read
  issues: write
env:
  ARM_CLIENT_ID: "${{ secrets.AZURE_CLIENT_ID }}"
  ARM_SUBSCRIPTION_ID: "${{ secrets.AZURE_SUBSCRIPTION_ID }}"
  ARM_TENANT_ID: "${{ secrets.AZURE_TENANT_ID }}"
jobs:
  terraform-plan:
    name: 'Terraform Plan'
    runs-on: ubuntu-latest
    env:
      ARM_SKIP_PROVIDER_REGISTRATION: true
    outputs:
      tfplanExitCode: ${{ steps.tf-plan.outputs.exitcode }}
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v2
      with:
        terraform_wrapper: false
    - name: Terraform Init
      run: terraform init
    - name: Terraform Plan
      id: tf-plan
      run: |
        export exitcode=0
        terraform plan -detailed-exitcode -no-color -out tfplan || export exitcode=$?
        echo "exitcode=$exitcode" >> $GITHUB_OUTPUT
        if [ $exitcode -eq 1 ]; then
          echo Terraform Plan Failed!
          exit 1
        else
          exit 0
        fi
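
With -detailed-exitcode, an exit code of 2 means the plan found changes, i.e. drift. A possible follow-up, sketched here as the shell body of a later step or job that consumes the tfplanExitCode output, would be to open an issue with the gh CLI; the variable name and the issue text are illustrative only.

# Illustrative drift handling; assumes GH_TOKEN is available to the runner and TFPLAN_EXITCODE carries the plan job's output.
if [ "$TFPLAN_EXITCODE" -eq 2 ]; then
  gh issue create \
    --title "Terraform configuration drift detected" \
    --body "The scheduled drift detection plan found differences between the state and the deployed infrastructure."
fi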

Friday, September 8, 2023

 

This article talks about the deployment to Azure Infrastructure with GitHub Actions.

The benefits of using IaC are best realized when CI/CD pipelines deploy it in an automated and repeatable fashion. A structured solution for deployment is required to meet the needs of automation, and its benefits include the following:

1. A declarative form for defining, deploying, versioning and reviewing the infrastructure; IaC also prevents drift in the configuration.

2. Consistency, in that an IaC process ensures the whole organization follows a standard, well-established method to deploy the infrastructure.

3. Security, for hardening and approving cloud operations and meeting internal standards.

4. Self-service, so that teams are empowered to deploy their own infrastructure.

5. Improved productivity, by provisioning new environments and leveraging standard templates.

The control and data planes are separate in this solution. The data flow involves creating a new branch and checking in the needed IaC modifications. A pull request is created to merge the changes into the environment. A GitHub Actions workflow is triggered to ensure the code is well-formatted, internally consistent, and produces secure infrastructure. A Terraform plan, a what-if analysis, can be run to preview the changes. Once this is reviewed, the change can be merged into the main branch. Another GitHub Actions workflow triggers from the main branch and applies the changes. A regularly scheduled workflow also runs to look for any configuration drift in the environment and creates a new issue if changes are detected.
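
Reduced to the commands a contributor would run, the same data flow might look like the following sketch; the branch name and pull request title are illustrative.

# Illustrative contributor flow for an IaC change (names are placeholders).
git checkout -b feature/add-storage-account
git add . && git commit -m "Add storage account module"
git push --set-upstream origin feature/add-storage-account
gh pr create --base main --title "Add storage account module"   # triggers the validation and plan workflow
# after review and approval, merging to main triggers the apply workflow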

GitHub environments and secrets store the Azure identity information and set up an approval process for deployments. For example, a production environment will have a protection rule and any required approvers needed to sign off on production deployments. It can also limit deployments for that environment to the main branch. An Azure application identity with read/write permissions to the Azure subscription is required. The GitHub secrets usually include the client id, client secret, subscription id and tenant id. A state file location is needed to persist the state between different runs of the workflow; this location is saved to the IaC backend configuration block. Once the environment and the identity are created, Reader and Data Access permissions are required on the storage account where the state resides. Separate read/write and read-only identities are helpful for separating access on a per-environment and per-activity basis.
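
With the azurerm backend, for example, the state location is typically supplied at init time; the resource group, storage account, container and key below are placeholders.

# Point Terraform at the remote state location (azurerm backend; placeholder values).
terraform init \
  -backend-config="resource_group_name=rg-tfstate" \
  -backend-config="storage_account_name=sttfstate001" \
  -backend-config="container_name=tfstate" \
  -backend-config="key=prod.terraform.tfstate"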

Two forms of validation are popular with infrastructure deployments: unit tests and what-if deployments. Unit tests on the infrastructure code are possible with the help of options from the compiler, linter and formatter that catch errors in syntax, declarations and typos. For example, terraform fmt and terraform validate are commonly used. There is also an open-source static code analysis tool called Checkov that can be run to detect security and compliance issues, and results from these tests can be uploaded to GitHub. The what-if stage of the workflow, also called the dry run, is used to understand the impact of the IaC changes. Typically, this is triggered by a push to the main branch. It is followed by the deployment stage, which is accompanied by a manual approval. The drift detection workflow, if present, can run periodically and independently of the changes contributed to the IaC.
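
A minimal validation sequence, assuming Terraform and Checkov are installed locally or on the runner, could look like this:

# Formatting and internal consistency checks
terraform fmt -check -recursive
terraform validate

# Static analysis for security and compliance findings
checkov -d .

# What-if / dry run; -detailed-exitcode returns 2 when changes would be made
terraform plan -detailed-exitcode -no-color -out tfplan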

Thursday, September 7, 2023

 

Access control is notorious in IaC for quickly growing out of bounds and becoming rather unstable. Between declarative forms and scripts, role assignments multiply with the number of related resources.

A role assignment consists, at a minimum, of a role, an assignee, and a scope. The role can be either a fully qualified identifier to its definition or just its name. The assignee, on the other hand, can be one of many types such as a user, group, ServicePrincipal or ForeignGroup. With each of these types, the role assignment must specify the principal identifier, which is a GUID; it can also simply replace the type and the identifier with a name for the assignee. A scope must be specified, and it is preferable to give the entire resource id. A role assignment seems simple, with values passed for three parameters, but the string of cryptic errors encountered in practice makes it brittle.
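
In Azure CLI terms, the minimal shape is something like the following sketch, where the role, object id and scope are placeholders.

# A role, an assignee and a scope are the minimum inputs (placeholder values).
az role assignment create \
  --role "Storage Blob Data Reader" \
  --assignee-object-id "<principal-object-id-guid>" \
  --assignee-principal-type ServicePrincipal \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"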

For example, one of the ways the assignee is obtained during IaC is by provisioning the resource with a system-managed identity and subsequently retrieving its identifier by querying the provisioned resource. A simple variable substitution is then attempted in the role assignment, and it fails with the message that the principal id provided as the assignee must be a valid GUID. Manually inspecting the resource and making sure that the object id, and not the app id, was used is required. But the source of the error might be something inconspicuous: double quotes on either end that are added as part of the query results and escape one's attention. Stripping the quotes around the principal id is required for the role assignment command to recognize it.
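
One way around the quoting pitfall, sketched here with placeholder identifiers, is to request tab-separated output so that no quotes are emitted, or to strip them explicitly, before passing the principal id to the role assignment.

# Retrieve the system-assigned identity of a provisioned resource (placeholder resource id).
# -o tsv avoids the surrounding double quotes that break the assignment; alternatively pipe through tr -d '"'.
principalId=$(az resource show \
  --ids "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Web/sites/<app>" \
  --query "identity.principalId" -o tsv)

az role assignment create \
  --role "Key Vault Secrets User" \
  --assignee-object-id "$principalId" \
  --assignee-principal-type ServicePrincipal \
  --scope "<key-vault-resource-id>"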

Similarly, scope is often constructed from elements but in fact it is best treated as opaque and taken directly from the source. Any kind of parsing or reconstruction is prone to script errors and grammar errors.

More importantly, the number of role assignments can be reduced by targeting a higher scope, but the diligence required to group and organize the role assignments so that they can be avoided or replaced with higher or more specific assignments can be daunting. It is precisely when proper organization is overlooked by multiple participants that technical debt is incurred, and as role assignments proliferate, this debt comes back to haunt quickly.

Finally, role assignments and network rules are hard to debug when they go missing, and it is in the best interest of the code maintainers to specify the associations right at the time of creation. The symptoms manifested by missing rules and assignments are not only difficult to diagnose but also tend to work their way backwards from the customers and end users. A properly applied role assignment might still return the dreaded 403 Forbidden HTTP status code and message when the root cause was merely cross-network permissions that went missing when the resources were created.

Authentication, authorization, and auditing are the final proof of which declarations work and which do not. One must remove the unnecessary assignments just as much as the incorrect ones.

A special mention about IaC state must be made, because state sits between the code as the cause and the resources as the effect. Changes must be fully traversed in both directions to keep the code, the state and the resources in sync: forward, by propagating changes from the code, writing through the state, and applying them to the resources; and backward, by importing modified resources into the state and updating the IaC. The changes made to keep all three in sync are often spread out over time and distributed among authors, which becomes a source of errors and discrepancies. Establishing a baseline combination of state, IaC and corresponding resources is necessary to make incremental changes, and it is just as important to keep them in sync going forward. The best way to do this is to close the gap by enumerating all discrepancies to establish a baseline, and then to have the process and the practice in place to enforce that they do not drift apart again.
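
The backward direction, importing an out-of-band change into the state and reconciling the code, can be sketched as follows; the resource address and id are placeholders.

# Bring an out-of-band resource under IaC control, then confirm code, state and resource agree.
terraform state list          # inspect what the state currently tracks
terraform import azurerm_storage_account.logs \
  "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
terraform plan                # should show no changes once the code matches the imported state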



Tuesday, September 5, 2023

 

Improve workloads and solution deployments:

Newly implemented industry solutions benefit from a set of principles that provide prescriptive guidance for improving the quality of their deployments. As the industry moves from digital adoption to digital transformation to digital acceleration, the sustainability journey requires a strong digital foundation, which is the best preparation for keeping pace with this rapid change.

This is true for meeting new sustainability requirements and avoiding the worst impacts of climate change, as well as for other business priorities such as driving growth, adapting to industry shifts, and navigating energy consumption and economic conditions. It helps to track and manage data at scale, unifying data and improving visibility across the organization. This, in turn, helps to reliably report sustainability impact, driving meaningful progress and finding the gaps where the most impact can be delivered.

The well-architected framework consists of five pillars. These are reliability (REL), security (SEC), cost optimization (COST), operational excellence (OPS) and performance efficiency (PERF). The elements that support these pillars are a review, a cost and optimization advisor, documentation, patterns-support-and-service offers, reference architectures and design principles.

This guidance provides a summary of how these principles apply to the management of the data workloads.
 
Cost optimization is one of the primary benefits of using the right tool for the right solution. It helps to analyze the spend over time as well as the effects of scale-out and scale-up. An advisor can help improve reusability, on-demand scaling, and reduced data duplication, among many other things.

Performance is usually based on external factors and is remarkably close to customer satisfaction. Continuous telemetry and reactiveness are essential to keeping performance tuned. The shared environment controls for management and monitoring create alerts, dashboards, and notifications specific to the performance of the workload. Performance considerations include storage and compute abstractions, dynamic scaling, partitioning, storage pruning, enhanced drivers, and multilayer caching.

Operational excellence comes with security and reliability. Security and data management must be built into the system in layers for every application and workload. The data management and analytics scenario focuses on establishing a foundation for security. Although workload-specific solutions might be required, the foundation for security is built with the Azure landing zones and managed independently of the workload. Confidentiality and integrity of data, including privilege management, data privacy and appropriate controls, must be ensured. Network isolation and end-to-end encryption must be implemented. SSO, MFA, conditional access and managed service identities are involved in securing authentication. Separation of concerns between the Azure control plane and data plane, as well as RBAC access control, must be used.

The key considerations for reliability are how to detect change and how quickly the operations can be resumed. The existing environment should also include auditing, monitoring, alerting and a notification framework.

In addition to all the above, some consideration may be given to improving individual service level agreements, redundancy of workload specific architecture, and processes for monitoring and notification beyond what is provided by the cloud operations teams.

Each pillar contains questions whose answers relate to technical and organizational decisions that are not directly about the features of the software to be deployed. For example, software that allows people to post comments must honor use cases where some people can write and others can read, but the system developed must also be safe enough to handle all the traffic and should incur reasonable costs.

Since the most crucial pillars are OPS and SEC, they should never be traded away to get more out of the other pillars.

The security pillar consists of Identity and access management, detective controls, infrastructure protection, data protection and incident response. Three questions are routinely asked for this pillar: How is the access controlled for the serverless api? How are the security boundaries managed for the serverless application? How is the application security implemented for the workload?

The operational excellence pillar is made up of four parts: organization, preparation, operation, and evolution. The questions that drive the decisions for this pillar include: How is the health of the serverless application known? How is the application lifecycle management approached?

The reliability pillar is made of three parts: foundations, change management, and failure management. The questions asked for this pillar include: How are the inbound request rates regulated? How is resiliency built into the serverless application?

The cost optimization pillar consists of five parts: cloud fiscal management practice, expenditure and usage awareness, cost-effective resources, demand management and resources supply, and optimizations over time. The questions asked for cost optimization include: How are the costs optimized?

The performance efficiency pillar is composed of four parts: selection, review, monitoring, and tradeoffs. The questions asked for this pillar include: How is the performance optimized for the serverless application?

In addition to these questions, there are quite a lot of opinionated and even authoritative perspectives on the appropriateness of a framework, often referred to as lenses. With these forms of guidance, a well-architected framework moves closer to an optimized realization.