Thursday, September 14, 2023

Some of the learnings from deploying Overwatch on Databricks

This is a continuation of previous articles on Overwatch, which can be considered an analytics project over Databricks. It collects data from multiple data sources such as APIs and cluster logs, enriches and aggregates the data, and comes with little or no cost. This section describes some considerations when deploying Overwatch that might not be obvious from the public documentation but help with optimizing the deployments.

Overwatch deployments must include an EventHub as well as a storage account. The EventHub receives diagnostic data from the target Databricks workspaces. Usually, a single EventHub namespace suffices for an Overwatch deployment, but it will contain 1 to N Event Hubs within that namespace, one for each monitored workspace. When the Event Hubs and their namespace are created, the workspaces must be associated with them; this does not alter a workspace that already exists. The association appears on the workspace in the diagnostic settings under the monitoring section of that instance.

Unlike the EventHub, which receives the diagnostic data, the storage account serves as a working directory for the Overwatch instance, where it writes out the reports from the calculations it makes. These reports may be in binary format, but the aggregated information on a DBU-cost basis as well as an instance-level basis is available to view in two independent tables in the Overwatch database on the workspace where it is deployed. Other artifacts are also stored on this storage account, such as the parameters for the Overwatch deployment and its incremental computations, so the entire account can be dedicated to Overwatch as a working directory. Because the storage account is dedicated to Overwatch, the compute logs from the workspaces are also archived here; the locality of the data enables the Overwatch jobs to read the logs at minimum cost.

This is another diagnostic setting for a workspace, and it might be additional if the logs from the workspace were already being sent elsewhere, either via an EventHub or to a different storage account. Separating the logs read by Overwatch from those used for other purposes helps Overwatch stay performant and reliable by maintaining isolation. The compute logs are only read by Overwatch, so they need not be retained longer than necessary and are intended only for Overwatch's computations.

Both the EventHub and the storage account can be regional, because cross-region data transfer can be expensive; deciding what data is sent to Overwatch and keeping it local reduces the cost significantly. Rather than trying to eliminate storage costs, it is better to exercise control over what and how much data is sent for Overwatch to perform its calculations. Having multiple diagnostic settings on the Databricks workspace helps with this.

Lastly, it must be noted that cluster logs are different from compute logs in that the former are emitted by the clusters spun up by users on a Databricks workspace, while the latter are written out by the Databricks workspace itself. All jobs, regardless of whether they are user jobs or Overwatch jobs, access the data over HTTPS or via mounts. The HTTPS way of accessing data uses the abfss://<container-name>@<storage-account-name>.dfs.core.windows.net qualifier, while mounts can be set up via:

configs = {"fs.azure.account.auth.type": "OAuth",

          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",

          "fs.azure.account.oauth2.client.id": "<application-id>",

          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),

          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

 

dbutils.fs.mount(

  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",

  mount_point = "/mnt/<mount-name>",

  extra_configs = configs)

When a cluster is created, its logging destination must be set to this mount; the setting is found under the advanced configuration section.
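If the cluster is created programmatically instead of through the UI, the same logging destination can be supplied through the cluster_log_conf field of the Databricks Clusters API. The following is only a minimal sketch; the workspace URL, node type, Spark version, and the DATABRICKS_TOKEN environment variable are illustrative assumptions:

import os
import requests

# Illustrative values; replace with your workspace URL and cluster sizing.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "overwatch-monitored-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Equivalent to the Logging tab under Advanced Options in the UI:
    # cluster logs are delivered to the mounted storage account.
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/mnt/<mount-name>/cluster-logs"}},
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])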

This summarizes the capture and analysis by Overwatch deployments.

Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWhM9v1fn0cHGer-BjQg?e=ke3d50


Wednesday, September 13, 2023

Overwatch organization

 

Overwatch can be taken as an analytics project over Databricks. It collects data from multiple data sources such as APIs and cluster logs, enriches and aggregates the data, and comes with little or no cost. Audit logs and cluster logs are the primary data sources. Databricks monitors and logs cluster metrics such as CPU utilization, memory usage, network I/O and storage; job-related telemetry such as scheduled jobs, run history, execution times and resource utilization; notebook execution metrics such as execution time, data read/write and memory usage for individual notebook runs; logging and metrics export, where data flows to application monitoring tools like DataDog or New Relic to gain deeper insights into performance alongside other applications and services; and SQL Analytics monitoring, including query performance and resource utilization.

The Deployment runners used for Overwatch take the following parameters:

ETL Storage prefix

ETL database name

Consumer DB Name

Secret Scope

Secret Key for Databricks PAT Token

Secret Key for EventHub

Event Hub Topic Name

Primordial Date

Max Days

Scopes

These parameters are stored in a CSV file in the deployment folder of the storage account associated with Overwatch and mounted via the ETL storage prefix.
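As a quick sanity check, that deployment CSV can be read back from the mounted storage prefix before launching the runner. This is a minimal sketch meant to run in a Databricks notebook (where spark is predefined); the mount path and file name below are illustrative, not the exact Overwatch layout:

# Read the deployment parameters back from the mounted ETL storage prefix.
# The path and file name are assumptions for illustration.
etl_storage_prefix = "dbfs:/mnt/overwatch"
config_path = f"{etl_storage_prefix}/deployment/overwatch_deployment_config.csv"

config_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(config_path)
)

# One row per monitored workspace: storage prefix, database names,
# secret scope/keys, EventHub topic, primordial date, max days, scopes.
config_df.show(truncate=False)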

So it would seem that the storage account used with the Overwatch notebook jobs serves both reads and writes: it collects the cluster logs for reading, say from a cluster-logs directory, and receives the corresponding calculations, say in a reports folder within the same account, as <etl_storage_prefix>/cluster-logs and <etl_storage_prefix>/reports. However, the Overwatch jobs, which run for a long time and parse large and plentiful logs, run in a dedicated manner, and it is possible to serve the reads from a location different from the writes by injecting the separate locations into the jobs' JSON configuration. The default locations of the storage-account-qualified cluster-logs folder and the reports folder are configurable.

With newer versions, the etl_storage_prefix has been renamed to storage_prefix to indicate that it is just the working directory for Overwatch, and all the logs are accessed via the mount_mapping_path variable, which lists the remote locations of log storage as paths different from the ones the storage_prefix points to. Therefore, the reports are written to a location such as abfss://<container-name>@<storage-account-name>.dfs.core.windows.net on Azure Data Lake Storage, while the cluster logs can be read from mounts such as dbfs:/mnt/logs.
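A quick way to confirm this separation from the Overwatch workspace is to list both locations. The sketch below assumes a notebook context with dbutils available and uses placeholder container, account, and mount names:

# Working directory (writes): the storage_prefix on ADLS, addressed via abfss.
storage_prefix = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/overwatch"

# Log locations (reads): mounts such as those listed via mount_mapping_path.
cluster_log_mount = "dbfs:/mnt/logs"

display(dbutils.fs.ls(storage_prefix))     # reports and deployment artifacts
display(dbutils.fs.ls(cluster_log_mount))  # archived cluster logs per workspace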



Tuesday, September 12, 2023

This is a continuation of a series of articles on the shortcomings and resolutions of Infrastructure-as-Code (IaC). One commonly encountered situation is when the settings for a resource must preserve the old configuration as well as the new configuration, but the resource allows only one.

Let us take the example of the monitoring section of compute resources like Databricks that host a wide variety of analytical applications. Given the long-running jobs on a Databricks instance, diagnosability is critical to ensure incremental progress and completion of the computation involved. All forms of logs, including those from the log categories of Databricks File System, Clusters, Accounts, Jobs, Notebook, SSH, Workspace, Secrets, SQLPermissions, Instance Pools, SQL Analytics, Genie, Global Init Scripts, IAM Role, MLFlow Experiment, Feature Store, Remote History Service, MLFlow Acled Artifact, DatabricksSQL, Delta Pipelines, Repos, Unity Catalog, Git Credentials, Web Terminal, Serverless Real-Time Inference, Cluster Libraries, Partner Hub, Clam AV Scan, and Capsule 8 Container Security Scanning Reports, must be sent to a destination where they can be read and analyzed. Typically, this means sending the logs to a Log Analytics workspace, archiving them to a storage account, or streaming them to an event hub.

Since the same treatment is needed for all Databricks instances in one or more subscriptions, the strategy to collect and analyze logs is centralized and consolidated with a directive. Just like log4j for an application, that directive to send the logs to a destination might name an EventHub as the destination so that the logging events can be forwarded to multiple listeners. Such a directive requires the namespace of an EventHub and a queue within it.

Now the Databricks instance might require introspection of its logs to detect usage patterns on the clusters in the instance and to determine their cost. This is the technique used by the Overwatch feature, which is a log reader and calculator; it requires an EventHub to collate logs from multiple workspaces so that they can be analyzed centrally within one dedicated workspace.

The trouble arises because a diagnostic setting for the Databricks instance can specify only one EventHub, yet one EventHub is now required for the centralized organizational logging best practices and another for Overwatch. The latter might not be able to reuse the former's EventHub because the central one might include many more workspaces than those intended for cost-savings analysis with Overwatch. Performance also suffers when the queues cannot be separated.

The resolution in this case might then be to send the data to another log account and then use a filter to forward only the relevant logs to another storage account and EventHub combination so that they can be analyzed by Overwatch in a performant manner. 

This calls for a Databricks diagnostic setting like so:

data "azurerm_eventhub_namespace_authorization_rule" "eventhub_rule" { 

  name                = "RootManageSharedAccessKey" 

  resource_group_name = "central-logging" 

  namespace_name      = “central-logging-namespace” 

} 

resource "azurerm_monitor_diagnostic_setting" "lpcl_eventhub" { 

  name                           = "central-logging-eventhub-setting" 

  target_resource_id             = “/subscriptions/…/resourceGroups/…/instance” 

  eventhub_authorization_rule_id = data.azurerm_eventhub_namespace_authorization_rule.eventhub_rule.id 

  eventhub_name                  = "log-consolidator" 

: 

} 

The use of a relay or a consolidator is one of the ways in which this situation can be resolved.

Monday, September 11, 2023

 

Table of Contents:

Overview

Layout

Usage:

Result:

Overview

Overwatch reporting involves cluster configuration, Overwatch configuration, and job runs. The steps outlined in this article are a guide to realizing cluster utilization reports from an Azure Databricks instance. It starts with some concepts as an overview and the context in which the steps are performed, followed by the listing of the steps, and closes with the running and viewing of the Overwatch jobs and reports.

Overwatch can be taken as an analytics project over Databricks. It collects data from multiple data sources such as APIs and cluster logs, enriches and aggregates the data, and comes with little or no cost. The audit logs and cluster logs are the primary data sources, but the cluster logs are crucial to get the cluster utilization data. Overwatch requires a dedicated storage account for these, and a time-to-live must be enabled so that retention does not grow to incur unnecessary costs. The cluster logs must not be stored on DBFS directly but can reside on an external store. When there are several Databricks workspaces, numbered say 1 to N, each workspace pushes its diagnostic data to the Event Hubs and writes its cluster logs to the per-region dedicated storage account. One of the Databricks workspaces is chosen to deploy Overwatch. The Overwatch jobs read the storage account and the EventHub diagnostic data to create bronze, silver and gold data pipelines, which can be read from anywhere for the reports.

The steps involved in the Overwatch configuration include the following:

1. Create a storage account 

2. Create an Azure Event Hub namespace 

3. Store the Event Hub connection string in a KeyVault 

4. Enable Diagnostic settings in the Databricks instance for the event hub 

5. Store the Databricks PAT token in the KeyVault 

6. Create a secret scope 

7. Use the Databricks overwatch notebook from [link](https://databrickslabs.github.io/overwatch/deployoverwatch/runningoverwatch/notebook/) and replace the parameters 

8. Configure the storage account within the workspace. 

9. Create the cluster and add the Maven libraries to the cluster com.databricks.labs:overwatch and com.microsoft.azure:azure-eventhubs-spark and run the Overwatch notebook.

There are a few elaborations to the above steps worth calling out; otherwise, the steps are routine. All the Azure resources can be created with default settings. The connection string for the EventHub is stored in the KeyVault as a secret. The personal access token, aka PAT token, created from Databricks is also stored in the KeyVault as a secret; the PAT token is created from the user settings of the Azure Databricks instance. A scope is created to import the token back from the KeyVault into Databricks. A cluster is created to run the Databricks job, and the two Maven libraries are added to that cluster's libraries. The Logging tab of the advanced options in the cluster's configuration allows us to specify a DBFS location pertaining to the external storage account we created to store the cluster logs. The Azure navigation bar for the Azure Databricks instance allows the diagnostic settings data to be sent to the EventHub.
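Once the scope is in place, the secrets can be verified from a notebook before wiring them into the runner. This is a minimal sketch; the scope name and key names are illustrative choices made at KeyVault setup time, not fixed Overwatch names:

# Pull the secrets back through the Databricks secret scope.
# Scope and key names below are assumptions for illustration.
pat_token = dbutils.secrets.get(scope="overwatch-scope", key="databricks-pat-token")
eh_connection = dbutils.secrets.get(scope="overwatch-scope", key="eventhub-connection-string")

# Secret values are redacted when printed from a notebook, but a length
# check confirms the scope is wired to the KeyVault correctly.
print(len(pat_token) > 0, len(eh_connection) > 0)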

The notebook that describes the Databricks jobs for Overwatch takes the above configuration as parameters, including the DBFS location for the cluster logs target, the Extract-Transform-Load (ETL) database name which stores the tables used for the dashboard, the consumer database name, the secret scope, the secret key for the PAT token, the secret key for the EventHub, the topic name in the EventHub, the primordial date to start Overwatch from, the maximum number of days to bound the data, and the scopes.
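Collected together, those parameters amount to a small configuration payload for the runner. The sketch below is only a hedged illustration in Python: the key names mirror the list above rather than the exact OverwatchParams schema used by the notebook, and the values are placeholders.

import json

# Illustrative parameter payload; names follow the list above, not the
# exact OverwatchParams schema consumed by the Overwatch runner.
overwatch_params = {
    "storage_prefix": "dbfs:/mnt/overwatch",
    "etl_database_name": "overwatch_etl",
    "consumer_database_name": "overwatch",
    "secret_scope": "overwatch-scope",
    "databricks_pat_key": "databricks-pat-token",
    "eventhub_secret_key": "eventhub-connection-string",
    "eventhub_topic_name": "overwatch-workspace-1",
    "primordial_date": "2023-01-01",
    "max_days": 60,
    "scopes": ["audit", "clusters", "clusterEvents", "sparkEvents", "jobs", "notebooks", "pools"],
}

print(json.dumps(overwatch_params, indent=2))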

Overwatch provides both summary and drill-down options to understand the operations of a Databricks instance. It has two primary modes: Historical and Real-time. It coalesces all the logs produced by Spark and Databricks via a periodic job run and then enriches this data through various API calls. The jobs from the notebook create the configuration string with OverwatchParams, and most functionality can be realized by instantiating the workspace object with these OverwatchParams. It provides two tables, the dbuCostDetails table and the instanceDetails table, which can then be used for reports.

Layout


Usage:


Result:

Overwatch creates two tables as shown:

[Screenshot of the dbuCostDetails and instanceDetails tables]

 

These tables can be read from CLI, SDK, and a variety of clients.
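For example, from a notebook on the Overwatch workspace the two tables can be queried directly out of the consumer database. This is a minimal sketch, assuming the consumer database is named overwatch (substitute the consumer DB name supplied to the runner):

# Query the two consumer tables produced by Overwatch.
# The database name "overwatch" is an assumption; use your consumer DB name.
dbu_costs = spark.sql("SELECT * FROM overwatch.dbuCostDetails")
instances = spark.sql("SELECT * FROM overwatch.instanceDetails")

display(dbu_costs.limit(10))
display(instances.limit(10))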

 

Reference: demonstration with deployable IaC: dbx-overwatch.zip

 

 

Sunday, September 10, 2023

 

A sample GitHub Action to detect Terraform state drift that can run periodically:

name: 'Terraform Configuration Drift Detection'
on:
  workflow_dispatch:
  schedule:
    - cron: '00 2 * * *' # runs nightly at 2:00 am
permissions:
  id-token: write
  contents: read
  issues: write
env:
  ARM_CLIENT_ID: "${{ secrets.AZURE_CLIENT_ID }}"
  ARM_SUBSCRIPTION_ID: "${{ secrets.AZURE_SUBSCRIPTION_ID }}"
  ARM_TENANT_ID: "${{ secrets.AZURE_TENANT_ID }}"
jobs:
  terraform-plan:
    name: 'Terraform Plan'
    runs-on: ubuntu-latest
    env:
      ARM_SKIP_PROVIDER_REGISTRATION: true
    outputs:
      tfplanExitCode: ${{ steps.tf-plan.outputs.exitcode }}
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v2
      with:
        terraform_wrapper: false
    - name: Terraform Init
      run: terraform init
    - name: Terraform Plan
      id: tf-plan
      run: |
        # -detailed-exitcode returns 0 for no changes, 1 for errors,
        # and 2 when the plan contains changes (i.e., drift is detected).
        export exitcode=0
        terraform plan -detailed-exitcode -no-color -out tfplan || export exitcode=$?
        echo "exitcode=$exitcode" >> $GITHUB_OUTPUT
        if [ $exitcode -eq 1 ]; then
          echo Terraform Plan Failed!
          exit 1
        else
          exit 0
        fi
