Tuesday, July 11, 2023

The previous article discussed monitoring compute cluster usage in Azure Databricks and suggested several approaches.

First, the user interface provides a way to view usage per cluster by navigating to the CPU utilization metrics reported for each compute resource. The metric is based on total CPU seconds consumed and is averaged over the chosen time interval, which defaults to an hour.

Second, there is a PowerShell scripting library that uses the REST APIs to query the Databricks workspace directly. The suggested access token is the user's Personal Access Token (PAT), and the cluster's id must be passed as a request parameter. The response of this API is on par with the information from the user interface and provides a scriptable approach.
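
For illustration, a minimal Python equivalent of that REST call might look like the following. This is a sketch rather than the PowerShell library itself; the workspace URL, token, and cluster id are placeholders, and the endpoint shown is the Clusters Get API:

import requests

# Placeholders: substitute your own workspace URL, PAT, and cluster id.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
PAT = "dapi-your-token"
CLUSTER_ID = "0711-123456-abcdefgh"

# Query the Clusters Get API with the PAT as a bearer token.
resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {PAT}"},
    params={"cluster_id": CLUSTER_ID},
)
resp.raise_for_status()
cluster = resp.json()
print(cluster.get("state"), cluster.get("last_activity_time"))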

Since this might not provide visibility across the entire Databricks instance, calling the REST APIs directly against the Azure resource was also suggested. Authentication for this API requires a token issued by the Microsoft login servers. While this returns properties and metadata for the resource and supports general manageability, the information pertaining to cluster usage is rather limited. Still, it is possible to query the last used time as an indication of whether a workspace user's compute can be shut down.
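
As a sketch of that approach, the following Python obtains a token from the Microsoft login servers with azure-identity and reads the workspace resource through Azure Resource Manager. The subscription, resource group, workspace name, and api-version are assumptions to be replaced with your own values:

import requests
from azure.identity import DefaultAzureCredential

# Placeholders: substitute your own subscription, resource group, and workspace.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "my-resource-group"
WORKSPACE = "my-databricks-workspace"

# Acquire a token for Azure Resource Manager.
credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

# Read the workspace resource; this returns properties and metadata,
# but only limited cluster usage information.
url = (
    "https://management.azure.com"
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Databricks/workspaces/{WORKSPACE}"
)
resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {token}"},
    params={"api-version": "2023-02-01"},
)
resp.raise_for_status()
print(resp.json()["properties"])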

Databricks itself provides a set of REST APIs that report the last used time, and if the duration since the last usage exceeds a certain threshold, the compute can be stopped. This threshold is referred to as the auto-termination time, and values on the order of minutes are usually sufficient. The default value is 72 hours, which may be quite extravagant.
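
A hedged sketch of that check against the Clusters API follows: it reads last_activity_time from clusters/get and, if the idle time exceeds a threshold, terminates the cluster through clusters/delete. The names and the thirty-minute threshold are illustrative choices, not defaults:

import time
import requests

# Placeholders: substitute your own workspace URL, PAT, and cluster id.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
PAT = "dapi-your-token"
CLUSTER_ID = "0711-123456-abcdefgh"
IDLE_THRESHOLD_MINUTES = 30  # illustrative; tune to your workloads

headers = {"Authorization": f"Bearer {PAT}"}

# last_activity_time is reported in epoch milliseconds.
cluster = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/get",
    headers=headers,
    params={"cluster_id": CLUSTER_ID},
).json()

idle_ms = time.time() * 1000 - cluster.get("last_activity_time", 0)
if cluster.get("state") == "RUNNING" and idle_ms > IDLE_THRESHOLD_MINUTES * 60 * 1000:
    # clusters/delete terminates the cluster without removing its definition.
    requests.post(
        f"{WORKSPACE_URL}/api/2.0/clusters/delete",
        headers=headers,
        json={"cluster_id": CLUSTER_ID},
    )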

Third, there is an SDK available from Databricks that can be called from Python to make the above step easier. This option makes it tidy to create charts and graphs from cluster usage and works well for reporting purposes. Samples include listing the budgets or downloading billable usage, and the clusters themselves can also be queried directly, as shown after the sample below:

# Account-level client; credentials are read from the environment or config profile.
from databricks.sdk import AccountClient

a = AccountClient()

# List the budgets configured for the account.
budgets = a.budgets.list()

# Download billable usage for a range of months.
a.billable_usage.download(start_month="2023-06", end_month="2023-07")
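
The same SDK can also enumerate the clusters in a workspace, which is the raw material for the charts and reports mentioned above. A minimal sketch, assuming workspace credentials are already configured in the environment:

from collections import Counter

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Count clusters by state as a simple input for a report or chart.
states = Counter(c.state.value for c in w.clusters.list() if c.state)
for state, count in states.items():
    print(state, count)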

One of the benefits of using the SDK is that we are no longer required to set up Unity Catalog to query the system tables ourselves; the SDK can provide authoritative information on its own.
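
That said, where Unity Catalog and the system tables are already available, the same usage can be read with a query. A sketch, assuming the system.billing.usage table has been enabled and that this runs in a Databricks notebook where spark is predefined:

# Aggregate DBU consumption per day and SKU from the billing system table.
usage = spark.sql("""
    SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= '2023-06-01'
    GROUP BY usage_date, sku_name
    ORDER BY usage_date
""")
usage.show()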

Similarly, another convenient way to query cluster usage is with Overwatch. It provides both summary and drill-down options for understanding the operations of a Databricks instance, and it has two primary modes: Historical and Real-time. It coalesces all the logs produced by Spark and Databricks via a periodic job run and then enriches this data through various API calls. Most functionality can be realized by instantiating the workspace object with suitable OverwatchParams. It provides two tables: the dbuCostDetails table and the instanceDetails table.
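
Once an Overwatch run has completed, those tables can be queried like any others. A sketch, assuming the consumer database was configured under the name overwatch in the OverwatchParams and that this runs in a notebook where spark is predefined:

# Read the DBU cost details produced by Overwatch; the database name
# `overwatch` is an assumption taken from the OverwatchParams configuration.
dbu_costs = spark.sql("SELECT * FROM overwatch.dbuCostDetails")
dbu_costs.show()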
