Tuesday, September 12, 2023

 This is a continuation of a series of articles on the shortcomings and resolutions of Infrastructure-as-Code (IaC). A commonly encountered situation is one where the settings for a resource must preserve both the old configuration and the new configuration, but the resource allows only one. 

Let us take the example of the monitoring section of compute resources like Azure Databricks that host a wide variety of analytical applications. Because jobs on a Databricks instance are long-running, diagnosability is critical to ensuring incremental progress and completion of the computation involved. All forms of logs, including those from the log categories of Databricks File System, Clusters, Accounts, Jobs, Notebook, SSH, Workspace, Secrets, SQLPermissions, Instance Pools, SQL Analytics, Genie, Global Init Scripts, IAM Role, MLFlow Experiment, Feature Store, Remote History Service, MLFlow Acled Artifact, DatabricksSQL, Delta Pipelines, Repos, Unity Catalog, Git Credentials, Web Terminal, Serverless Real-Time Inference, Cluster Libraries, Partner Hub, Clam AV Scan, and Capsule 8 Container Security Scanning Reports, must be sent to a destination where they can be read and analyzed. Typically, this means sending the logs to a Log Analytics workspace, archiving them to a storage account, or streaming them to an EventHub. 

Since the same treatment is needed for all Databricks instances in one or more subscriptions, the strategy to collect and analyze logs is centralized and consolidated by a directive. Just as log4j does for an application, that directive might name an EventHub as the destination so that the logging events can be forwarded to multiple listeners. Such a directive requires the namespace of an EventHub and a queue within it. 
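A minimal sketch of provisioning that namespace and queue in Terraform might look as follows. The names, location, sku, and sizing values here are illustrative assumptions, not prescribed values:

```hcl
# Hypothetical central EventHub namespace for consolidated logging.
resource "azurerm_eventhub_namespace" "central_logging" {
  name                = "central-logging-namespace"
  location            = "eastus"
  resource_group_name = "central-logging"
  sku                 = "Standard"
}

# The queue (event hub) within the namespace that receives the log events.
resource "azurerm_eventhub" "log_consolidator" {
  name                = "log-consolidator"
  namespace_name      = azurerm_eventhub_namespace.central_logging.name
  resource_group_name = "central-logging"
  partition_count     = 4
  message_retention   = 7
}
```

Multiple consumer groups can then be attached to the hub so that several listeners read the same stream independently.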

Now the Databricks instance might require introspection of its logs to detect usage patterns on the clusters in the instance and to determine their cost. This is the technique used by the Overwatch feature, which is a log reader and calculator; it requires an EventHub to collate logs from multiple workspaces and analyze them centrally within one dedicated workspace.  

The trouble arises when the diagnostic settings of the cluster can specify only one EventHub, yet one EventHub is required for the centralized organizational logging best practices and another for Overwatch. The latter might not be able to reuse the former's EventHub because the central hub might include many more workspaces than those intended for cost analysis with Overwatch. Performance also suffers when the queues cannot be separated. 

The resolution in this case might then be to send the data to the central logging account and then use a filter to forward only the relevant logs to a separate storage account and EventHub combination so that Overwatch can analyze them in a performant manner. 

This calls for a Databricks diagnostic setting like so: 

data "azurerm_eventhub_namespace_authorization_rule" "eventhub_rule" {
  name                = "RootManageSharedAccessKey"
  resource_group_name = "central-logging"
  namespace_name      = "central-logging-namespace"
}

resource "azurerm_monitor_diagnostic_setting" "lpcl_eventhub" {
  name                           = "central-logging-eventhub-setting"
  target_resource_id             = "/subscriptions/…/resourceGroups/…/instance"
  eventhub_authorization_rule_id = data.azurerm_eventhub_namespace_authorization_rule.eventhub_rule.id
  eventhub_name                  = "log-consolidator"

  # At least one enabled_log (or metric) block is required;
  # repeat for each log category that should be forwarded, e.g.:
  enabled_log {
    category = "clusters"
  }
}
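Because Azure permits multiple diagnostic settings on the same resource, one way to sketch the resolution is a second setting on the same workspace that forwards only the categories Overwatch reads to a dedicated hub. The setting name, hub name, and the choice of categories below are illustrative assumptions; the authorization rule could equally point at a separate namespace:

```hcl
# Hypothetical second diagnostic setting: same workspace, different destination.
# Only the categories relevant to Overwatch are enabled, keeping its queue lean.
resource "azurerm_monitor_diagnostic_setting" "overwatch_eventhub" {
  name                           = "overwatch-eventhub-setting"
  target_resource_id             = "/subscriptions/…/resourceGroups/…/instance"
  eventhub_authorization_rule_id = data.azurerm_eventhub_namespace_authorization_rule.eventhub_rule.id
  eventhub_name                  = "overwatch-consolidator"

  enabled_log {
    category = "clusters"
  }

  enabled_log {
    category = "jobs"
  }
}
```

Separating the settings this way keeps the organizational consolidator and the Overwatch queue independent, so neither destination's volume degrades the other.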

The use of a relay or a consolidator is one of the ways in which this situation can be resolved.  
