This is a continuation of the previous articles on Azure Databricks usage and Overwatch analysis. While those articles covered the configuration and deployment of Overwatch, the data ingested for analysis was assumed to come from the Event Hub, which in turn collects it from the Azure Databricks resource. This article discusses the collection of cluster logs, including the output of the logging and print statements in the notebooks that run on those clusters.
The default cluster log directory is ‘dbfs:/cluster-logs’. The Databricks instance delivers logs to it every five minutes and archives them every hour. The Spark driver logs are saved in this directory. The location is managed by Databricks, and each cluster’s logs are written to a sub-directory named after that cluster. When a cluster is created to attach a notebook to, the user sets the cluster’s logging destination to dbfs:/cluster-logs under the advanced configuration section of the cluster creation parameters.
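The same logging destination can also be set programmatically when the cluster is created through the Clusters API. The snippet below is a minimal sketch: the workspace URL, token, and cluster sizing are placeholders, and the cluster_log_conf block is what the advanced configuration section fills in on the user’s behalf.

# Sketch: create a cluster with its log destination preset via the Clusters API.
# Workspace URL and token are placeholders.
import requests

workspace_url = "https://<databricks-instance>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                                     # placeholder

cluster_spec = {
    "cluster_name": "analytics-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Same setting as the "Logging" destination in the cluster creation UI:
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id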
The policy under which the cluster gets created is also determined by the users. This policy can also be administered so that users only create clusters compliant with a policy. In such a policy, the logging destination option can be preset to a path like ‘dbfs:/cluster-logs’. It can also be substituted with a path like ‘/mnt/externalstorageaccount/path/to/folder’ if a remote storage location is provided, but it is preferable to use the built-in location.
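As a sketch of such a policy, the definition below fixes the logging destination attribute so that every cluster created under the policy writes to dbfs:/cluster-logs. The policy name, workspace URL, and token are placeholders; the policy is registered through the Cluster Policies API.

# Sketch: register a cluster policy that pins the cluster log destination.
import json
import requests

policy_definition = {
    "cluster_log_conf.dbfs.destination": {
        "type": "fixed",
        "value": "dbfs:/cluster-logs",
        "hidden": False,
    }
}

resp = requests.post(
    "https://<databricks-instance>.azuredatabricks.net/api/2.0/policies/clusters/create",  # placeholder URL
    headers={"Authorization": "Bearer <personal-access-token>"},                            # placeholder token
    json={"name": "cluster-logs-policy", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()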
The Azure Databricks instance transmits the cluster logs, along with all other opted-in log categories, to the Event Hub; for that it requires a diagnostic setting specifying the Event Hubs namespace and the event hub to send to. Overwatch can read this Event Hub data, but reading from the dbfs:/cluster-logs location is not covered in the documentation.
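A sketch of that diagnostic setting, created with the Azure CLI invoked from Python, is shown below. The resource IDs, event hub name, and the list of log categories are placeholders to be adapted to the environment.

# Sketch: stream Databricks diagnostic log categories to an Event Hub.
import json
import subprocess

workspace_id = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<ws>"  # placeholder
eventhub_rule_id = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.EventHub/namespaces/<ns>/authorizationRules/RootManageSharedAccessKey"  # placeholder

# Example categories only; opt in to the categories your analysis needs.
logs = [{"category": c, "enabled": True} for c in ["clusters", "jobs", "notebook"]]

subprocess.run(
    [
        "az", "monitor", "diagnostic-settings", "create",
        "--name", "overwatch-diagnostics",
        "--resource", workspace_id,
        "--event-hub", "<event-hub-name>",       # placeholder
        "--event-hub-rule", eventhub_rule_id,
        "--logs", json.dumps(logs),
    ],
    check=True,
)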
There are a couple of ways to have Overwatch read that location. First, the cluster log destination can be specified via the mapped-path parameter in the Overwatch deployment csv, so that the deployment knows this additional location to read data from. Although the documentation suggests that the parameter was introduced to cover workspaces that have more than fifty external storage accounts, it is possible to include just the one location that Overwatch needs to read from. This option is convenient for reading the default location, but again the customers or the administrator must ensure that the clusters are created to send their logs to that location.
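A minimal sketch of such a mapping file is shown below; the column names and the output path are assumptions for illustration and should be checked against the Overwatch deployment documentation for the version in use.

# Sketch: write the mapping csv referenced by the mapped-path parameter.
# Column names (mountPoint, source) and the file location are assumptions.
import csv

rows = [
    {"mountPoint": "dbfs:/cluster-logs", "source": "dbfs:/cluster-logs"},
]

with open("/dbfs/overwatch/cluster_logs_mapping.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["mountPoint", "source"])
    writer.writeheader()
    writer.writerows(rows)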
While the above works for new clusters, the second option works for both new and existing clusters: a dedicated Databricks job is created to read the cluster log locations and transmit the files to the location that Overwatch reads from. This job would use a shell command such as ‘rsync’ or ‘rclone’ to perform a copy that can resume after intermittent network failures and indicate progress. When this job runs periodically, the clusters are unaffected, and alongside the Overwatch jobs, it ensures that all the relevant logs not covered by the streams to the Event Hub are also read by Overwatch.
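A sketch of that copy task follows, assuming rclone is installed on the job cluster and that a remote named ‘overwatchstore’ has been configured for the destination storage account; both the source path and the remote are placeholders.

# Sketch: mirror the cluster-log folders into the location Overwatch reads from.
# rclone copy skips files that are already present, so periodic runs effectively
# resume after interruptions, and --progress reports copy status in the job output.
import subprocess

source = "/dbfs/cluster-logs"                # default log destination via the DBFS fuse mount
destination = "overwatchstore:cluster-logs"  # placeholder rclone remote for the storage account

subprocess.run(
    ["rclone", "copy", source, destination, "--progress"],
    check=True,
)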
Finally, the dashboards that report the analysis performed by Overwatch, which are available out-of-the-box, can be scheduled to run nightly so that they periodically include all the logs collected and analyzed.
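As a sketch, that nightly schedule can be set when the job wrapping the dashboard (or Overwatch) notebook is created through the Jobs API; the notebook path, cluster id, workspace URL, token, and cron expression below are placeholders.

# Sketch: create a nightly-scheduled job for the dashboard refresh via the Jobs API.
import requests

job_spec = {
    "name": "overwatch-dashboards-nightly",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every night at 02:00
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "refresh_dashboards",
            "notebook_task": {"notebook_path": "/Overwatch/Dashboards/Refresh"},  # placeholder path
            "existing_cluster_id": "<cluster-id>",                                 # placeholder
        }
    ],
}

resp = requests.post(
    "https://<databricks-instance>.azuredatabricks.net/api/2.1/jobs/create",  # placeholder URL
    headers={"Authorization": "Bearer <personal-access-token>"},               # placeholder token
    json=job_spec,
)
resp.raise_for_status()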