Thursday, September 14, 2023

Some of the learnings from deploying Overwatch on Databricks

This is a continuation of previous articles on Overwatch, which can be thought of as an analytics project over Databricks. It collects data from multiple sources such as APIs and cluster logs, then enriches and aggregates that data at little or no cost. This section of the article describes some considerations when deploying Overwatch that might not be obvious from the public documentation but help with optimizing the deployments.

Overwatch deployments must include an Event Hub as well as a storage account. The Event Hub receives the diagnostics data emitted by the target Databricks workspaces. Usually only one Event Hub namespace is required for an Overwatch deployment, but it will contain 1 to N Event Hubs, one for each workspace monitored. Once the Event Hubs and their namespace are created, the workspaces must be associated with them; this association does not alter a workspace that already exists. The association appears on the workspace in the diagnostic settings under the Monitoring section of that instance.
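As a rough illustration, such a diagnostic setting can be created with the Azure Monitor SDK for Python. This is only a minimal sketch: the resource IDs, setting name, and log categories below are placeholders and must be adapted to the actual workspace, namespace, and the categories Overwatch expects.

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

# Placeholder identifiers -- substitute your own subscription, workspace and namespace.
subscription_id = "<subscription-id>"
workspace_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Databricks/workspaces/<workspace-name>"
)
eventhub_auth_rule_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.EventHub/namespaces/<namespace>"
    "/authorizationRules/RootManageSharedAccessKey"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# One diagnostic setting per workspace, pointing at that workspace's Event Hub.
client.diagnostic_settings.create_or_update(
    resource_uri=workspace_id,
    name="overwatch-diagnostics",
    parameters={
        "event_hub_authorization_rule_id": eventhub_auth_rule_id,
        "event_hub_name": "<eventhub-for-this-workspace>",
        # Example categories only; enable the ones Overwatch needs.
        "logs": [
            {"category": "clusters", "enabled": True},
            {"category": "jobs", "enabled": True},
            {"category": "accounts", "enabled": True},
        ],
    },
)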

Unlike the Event Hub, which receives the diagnostic data, the storage account is required as a working directory for the Overwatch instance so that it can write out the reports from the calculations it makes. These reports may be in binary format, but the aggregated information on a DBU-cost basis as well as on an instance-level basis is available in two independent tables in the Overwatch database on the workspace where it is deployed. Other artifacts are also stored on this storage account, such as the deployment parameters for Overwatch and the state for incremental computations, so the entire account can be dedicated to Overwatch as a working directory. It is because the storage account is dedicated to Overwatch that the compute logs from the workspaces are also archived here; the locality of the data enables the Overwatch jobs to read the logs at minimum cost.
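For example, once an Overwatch run completes, the aggregates can be inspected with a simple Spark SQL query on the workspace. The database and table names below are hypothetical placeholders; the actual names depend on how the Overwatch deployment was configured.

# Hypothetical database and table names -- check your Overwatch configuration for the real ones.
dbu_costs = spark.sql("""
  SELECT workspace_id, date, SUM(dbu_cost) AS total_dbu_cost
  FROM overwatch.dbu_cost_summary
  GROUP BY workspace_id, date
  ORDER BY date DESC
""")
display(dbu_costs)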

This is another diagnostic setting for the workspace, and it may be an additional one if the logs from the workspace were already being sent elsewhere, either via Event Hub or via a different storage account. Separating the logs read by Overwatch from those kept for other purposes helps Overwatch stay performant as well as reliable by maintaining isolation. Since the compute logs in this account are read only by Overwatch, they need not be retained longer than its computations require.
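Because these logs serve only Overwatch, a short retention can be enforced on the storage account, for instance with a lifecycle management rule. The sketch below uses the Azure Storage management SDK for Python; the resource group, account name, container prefix, and 30-day window are assumptions to adapt to your environment.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

storage_client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Delete archived compute logs once Overwatch no longer needs them (30 days assumed here).
storage_client.management_policies.create_or_update(
    resource_group_name="<resource-group>",
    account_name="<storage-account-name>",
    management_policy_name="default",
    properties={
        "policy": {
            "rules": [
                {
                    "name": "expire-overwatch-compute-logs",
                    "enabled": True,
                    "type": "Lifecycle",
                    "definition": {
                        "filters": {
                            "blob_types": ["blockBlob"],
                            # Hypothetical prefix where the compute logs land.
                            "prefix_match": ["cluster-logs/"],
                        },
                        "actions": {
                            "base_blob": {
                                "delete": {"days_after_modification_greater_than": 30}
                            }
                        },
                    },
                }
            ]
        }
    },
)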

Both the Event Hub and the storage account can be regional, because cross-region transfer of data can be expensive; deciding what data is sent to Overwatch and keeping it local reduces the cost significantly. Rather than trying to eliminate storage costs, it is better to exercise control over what and how much data is sent to Overwatch for its calculations. Having multiple diagnostic settings on the Databricks workspace helps with this.

Lastly, it must be noted that cluster logs are different from compute logs: the former are emitted by the clusters spun up by users on a Databricks workspace, while the latter are written out by the Databricks workspace itself. All jobs, whether user jobs or Overwatch jobs, access the data over https or via mounts. The https way of accessing data uses the abfss://<container>@<storage-account>.dfs.core.windows.net qualifier, while mounts can be set up via

configs = {"fs.azure.account.auth.type": "OAuth",

          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",

          "fs.azure.account.oauth2.client.id": "<application-id>",

          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),

          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

 

dbutils.fs.mount(

  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",

  mount_point = "/mnt/<mount-name>",

  extra_configs = configs)
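Alternatively, the same storage can be read directly over the abfss scheme without a mount by setting the OAuth properties on the Spark session. A minimal sketch, assuming the same service principal and placeholder names as above; the cluster-logs path is hypothetical.

# Per-account OAuth settings for direct abfss:// access (no mount required).
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")

# Read the archived logs directly over https/abfss.
logs = spark.read.json("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/cluster-logs/")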

When a cluster is created, its logging destination must be set to this mount; the setting is found under the Advanced options section of the cluster configuration.
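The same destination can also be set programmatically when creating the cluster through the Databricks Clusters API. The sketch below assumes a workspace URL, a personal access token stored in a secret scope, and a hypothetical log path under the mount created earlier.

import requests

host = "https://<databricks-instance>"  # workspace URL, e.g. https://adb-<digits>.<n>.azuredatabricks.net
token = dbutils.secrets.get(scope="<scope-name>", key="<databricks-pat-key>")

cluster_spec = {
    "cluster_name": "overwatch-monitored-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Write cluster logs onto the mount that points at the Overwatch storage account.
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/mnt/<mount-name>/cluster-logs"}},
}

response = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json())  # contains the new cluster_id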

This summarizes the capture and analysis of data by Overwatch deployments.

