This is a continuation of the previous articles on Azure Databricks usage and Overwatch analysis. While those articles covered the configuration and deployment of Overwatch, the data ingested for analysis was assumed to come from the Event Hub, which in turn collects it from the Azure Databricks resource. This article discusses the collection of cluster logs, including the output of the logging and print statements in the notebooks that run on those clusters.
The default cluster log directory is ‘dbfs:/cluster-logs’. The Databricks instance delivers logs to it every five minutes and archives them every hour. The Spark driver logs are saved in this directory. The location is managed by Databricks, and each cluster’s logs are written to a sub-directory named after that cluster. When a cluster is created to attach a notebook to, the user sets the cluster’s logging destination to dbfs:/cluster-logs under the advanced configuration section of the cluster creation parameters.
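The same logging destination can also be set programmatically when the cluster is created through the Clusters API. The snippet below is a minimal sketch: the workspace URL, token, and cluster sizing are placeholders, and the cluster_log_conf block is what the advanced configuration section fills in on the user’s behalf.

# Sketch: create a cluster with its log destination preset via the Clusters API.
# Workspace URL and token are placeholders.
import requests

workspace_url = "https://<databricks-instance>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                                     # placeholder

cluster_spec = {
    "cluster_name": "analytics-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Same setting as the "Logging" destination in the cluster creation UI:
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id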
The policy under which the cluster gets created is also determined by the users. This policy can also be administered so that users only create clusters compliant with a policy. In such a policy, the logging destination option can be preset to a path like ‘dbfs:/cluster-logs’. It can also be substituted with a path like ‘/mnt/externalstorageaccount/path/to/folder’ if a remote storage location is provided, but it is preferable to use the built-in location.
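As a sketch of such a policy, the definition below fixes the logging destination attribute so that every cluster created under the policy writes to dbfs:/cluster-logs. The policy name, workspace URL, and token are placeholders; the policy is registered through the Cluster Policies API.

# Sketch: register a cluster policy that pins the cluster log destination.
import json
import requests

policy_definition = {
    "cluster_log_conf.dbfs.destination": {
        "type": "fixed",
        "value": "dbfs:/cluster-logs",
        "hidden": False,
    }
}

resp = requests.post(
    "https://<databricks-instance>.azuredatabricks.net/api/2.0/policies/clusters/create",  # placeholder URL
    headers={"Authorization": "Bearer <personal-access-token>"},                            # placeholder token
    json={"name": "cluster-logs-policy", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()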
The Azure Databricks instance transmits the cluster logs, along with all other opted-in log categories, to the Event Hub; for that it requires a diagnostic setting specifying the Event Hubs namespace and the event hub to send to. Overwatch can read this Event Hub data, but reading from the dbfs:/cluster-logs location is not covered in the documentation.
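A sketch of that diagnostic setting, created with the Azure CLI invoked from Python, is shown below. The resource IDs, event hub name, and the list of log categories are placeholders to be adapted to the environment.

# Sketch: stream Databricks diagnostic log categories to an Event Hub.
import json
import subprocess

workspace_id = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<ws>"  # placeholder
eventhub_rule_id = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.EventHub/namespaces/<ns>/authorizationRules/RootManageSharedAccessKey"  # placeholder

# Example categories only; opt in to the categories your analysis needs.
logs = [{"category": c, "enabled": True} for c in ["clusters", "jobs", "notebook"]]

subprocess.run(
    [
        "az", "monitor", "diagnostic-settings", "create",
        "--name", "overwatch-diagnostics",
        "--resource", workspace_id,
        "--event-hub", "<event-hub-name>",       # placeholder
        "--event-hub-rule", eventhub_rule_id,
        "--logs", json.dumps(logs),
    ],
    check=True,
)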
There are a couple of ways to have Overwatch read that location. First, the cluster log destination can be specified via the mapped-path parameter in the Overwatch deployment csv, so that the deployment knows this additional location to read data from. Although the documentation suggests that the parameter was introduced to cover workspaces that have more than fifty external storage accounts, it is possible to include just the one location that Overwatch needs to read from. This option is convenient for reading the default location, but again the customers or the administrator must ensure that the clusters are created to send their logs to that location.
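A minimal sketch of such a mapping file is shown below; the column names and the output path are assumptions for illustration and should be checked against the Overwatch deployment documentation for the version in use.

# Sketch: write the mapping csv referenced by the mapped-path parameter.
# Column names (mountPoint, source) and the file location are assumptions.
import csv

rows = [
    {"mountPoint": "dbfs:/cluster-logs", "source": "dbfs:/cluster-logs"},
]

with open("/dbfs/overwatch/cluster_logs_mapping.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["mountPoint", "source"])
    writer.writeheader()
    writer.writerows(rows)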
While the above works for new clusters, the second option works for both new and existing clusters: a dedicated Databricks job is created to read the cluster log locations and transmit the files to the location that Overwatch reads from. This job would use a shell command such as ‘rsync’ or ‘rclone’ to perform a copy that can resume after intermittent network failures and indicate progress. When this job runs periodically, the clusters are unaffected, and alongside the Overwatch jobs, it ensures that all the relevant logs not covered by the streams to the Event Hub are also read by Overwatch.
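A sketch of that copy task follows, assuming rclone is installed on the job cluster and that a remote named ‘overwatchstore’ has been configured for the destination storage account; both the source path and the remote are placeholders.

# Sketch: mirror the cluster-log folders into the location Overwatch reads from.
# rclone copy skips files that are already present, so periodic runs effectively
# resume after interruptions, and --progress reports copy status in the job output.
import subprocess

source = "/dbfs/cluster-logs"                # default log destination via the DBFS fuse mount
destination = "overwatchstore:cluster-logs"  # placeholder rclone remote for the storage account

subprocess.run(
    ["rclone", "copy", source, destination, "--progress"],
    check=True,
)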
Finally, the dashboards that report the analysis performed by Overwatch, which are available out-of-the-box, can be scheduled to run nightly so that they periodically include all the logs collected and analyzed.
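As a sketch, that nightly schedule can be set when the job wrapping the dashboard (or Overwatch) notebook is created through the Jobs API; the notebook path, cluster id, workspace URL, token, and cron expression below are placeholders.

# Sketch: create a nightly-scheduled job for the dashboard refresh via the Jobs API.
import requests

job_spec = {
    "name": "overwatch-dashboards-nightly",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every night at 02:00
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "refresh_dashboards",
            "notebook_task": {"notebook_path": "/Overwatch/Dashboards/Refresh"},  # placeholder path
            "existing_cluster_id": "<cluster-id>",                                 # placeholder
        }
    ],
}

resp = requests.post(
    "https://<databricks-instance>.azuredatabricks.net/api/2.1/jobs/create",  # placeholder URL
    headers={"Authorization": "Bearer <personal-access-token>"},               # placeholder token
    json=job_spec,
)
resp.raise_for_status()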