Monday, September 18, 2023

Overwatch deployment issues and resolutions continued.

  

·        Issue #6) Dropping the database does not work

The workspace configuration for the ETL database might be hardcoded, and the file location for the database can linger even after the database and its files have been dropped. Overwatch provides a cleanup script that explains the correct order of operations and even supports a dry run.
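As an illustration only, and not the official cleanup script, a minimal manual cleanup might look like the sketch below when run from a Databricks notebook where spark and dbutils are already available; the database names and the storage prefix are hypothetical placeholders for a specific deployment.

etl_db = "overwatch_etl"           # hypothetical ETL database name
consumer_db = "overwatch"          # hypothetical consumer database name
storage_prefix = "/mnt/overwatch"  # hypothetical storage prefix (working directory)

# Drop the databases first, then remove any lingering files at the old location
# so that a stale path is not picked up by the next deployment run.
spark.sql(f"DROP DATABASE IF EXISTS {consumer_db} CASCADE")
spark.sql(f"DROP DATABASE IF EXISTS {etl_db} CASCADE")
dbutils.fs.rm(storage_prefix, True)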

·        Issue #7) There are stale entries for locations of the etl or consumer databases or there are intermittent errors when reading the data.

A location specified as a mount is accessible only through a service account or a dbx connector; it does not use the credentials of the logged-in user. Access to the remote storage for Overwatch must consistently maintain both the account and the access control, so switching between credentials will not help in this case. The preferred arrangement is for Overwatch to continue running with admin credentials while the data is accessed with the storage-access token.

·        Issue #8) DB Name is not unique or the locations do not match.

The primordial date must be specified in the form yyyy-MM-dd. Excel, however, saves the date in a different format, and while the value may look consistent to the user, the error manifests in different forms, mostly as complaints about the database name and location. Specifying the date correctly, making sure the validations pass, and confirming that the databases are created correctly helps smooth out Overwatch operations.
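One way to normalize the value before pasting it into the deployment parameters is to parse whatever the spreadsheet exported and reformat it; the input value below is a hypothetical Excel-style date.

from datetime import datetime

raw = "09/01/2023"  # hypothetical value exported by Excel in MM/DD/YYYY form
primordial_date = datetime.strptime(raw, "%m/%d/%Y").strftime("%Y-%m-%d")
print(primordial_date)  # 2023-09-01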

Sunday, September 17, 2023

Drone Formation

 

Teaching a drone to fly is different from teaching a swarm of drones to fly. A central controller can issue a group command, and when each drone executes the command, the formation flies. If the formation is unchanged, the group command is merely relayed to the group members; for the purpose of relaying the same command, the drones act as a single group. When the fleet changes formation, the command changes for individual members: each unit moves from one position to another without colliding with the others.

A unit's movement has as many degrees of freedom as a particle. A drone is often represented as a volumetric pixel, or voxel for short. An altimeter and a GPS co-ordinate are sufficient to let the unit maintain its position. When the group command is issued, the movement of the group is specified. Consensus algorithms help with the group behavior without worrying about the exact position of each unit in the group. The flight of any one unit can be written in the form of a unicycle model with u1 as the velocity and u2 as the change in heading, the angle relative to the Cartesian co-ordinate axes. The term unicycle refers to the cosine and sine of the heading giving the x- and y-axis displacements. Unicycle consensus algorithms can help the group achieve the intended formation.
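The following is a small, self-contained sketch, not drawn from any particular drone framework, of what a unicycle update with a heading-consensus term could look like; the gain k, the neighbour lists, and the fixed forward velocity u1 are illustrative assumptions.

import math

def step(states, neighbours, u1=1.0, k=0.5, dt=0.1):
    """states: list of (x, y, theta); neighbours[i]: indices of unit i's neighbours."""
    new_states = []
    for i, (x, y, theta) in enumerate(states):
        # u2 is the change in heading, driven here by consensus with the neighbours.
        u2 = -k * sum(theta - states[j][2] for j in neighbours[i])
        new_states.append((x + u1 * math.cos(theta) * dt,   # x displacement via cosine
                           y + u1 * math.sin(theta) * dt,   # y displacement via sine
                           theta + u2 * dt))
    return new_states

states = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, -0.5)]
neighbours = [[1, 2], [0, 2], [0, 1]]
for _ in range(100):
    states = step(states, neighbours)
print([round(theta, 2) for _, _, theta in states])  # headings converge toward a common value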

One of the most used drone fleet navigation techniques is the Simultaneous Localization and Mapping (SLAM) algorithm, which provides a framework within which the drones can plan their paths. A drone only needs to know its location, build or acquire a map of its surroundings, and plan a path as a series of positions, if not just the next linear displacement. Consensus helps determine that the planned paths do not conflict. Without imminent collision, units can take their time to arrive at their final formation.
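As a toy illustration of such a conflict check, rather than an actual SLAM implementation, the sketch below compares synchronized waypoints pairwise and flags any step at which two units would come within an assumed safety radius.

from itertools import combinations

def paths_conflict(paths, min_separation=1.0):
    """paths: dict of unit -> list of (x, y, z) waypoints, all lists of equal length."""
    steps = len(next(iter(paths.values())))
    for t in range(steps):
        for a, b in combinations(paths, 2):
            ax, ay, az = paths[a][t]
            bx, by, bz = paths[b][t]
            dist = ((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2) ** 0.5
            if dist < min_separation:
                return True  # imminent collision between a and b at step t
    return False

# Two units swapping corners along straight lines meet in the middle, so this prints True.
print(paths_conflict({
    "d1": [(0, 0, 1), (1, 1, 1), (2, 2, 1)],
    "d2": [(2, 2, 1), (1, 1, 1), (0, 0, 1)],
}))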

Conditions are not always ideal even for the most direct displacements. Wind and obstruction are some of the challenges encountered. A unit might not have the flexibility to move in any direction and must co-ordinate the movement of its moving parts to achieve the intended effect. When the current position is hard to maintain and movement to the final position is thrown off by external influence, the path can be adjusted to modify positions so that the sum of squared errors to the designated position is reduced. As a combination of the external influence and the internal drive to reduce the errors, the points along the alternate path can be determined. An obstruction to a linear displacement would then yield a path whose positions trace a rough semicircle around the obstruction.

Depth estimation is another navigation technique, in which the unit's sensors are enhanced to give it a better reference for its surroundings and the flight path is then optimized. The term comes from traditional image processing, where it refers to the task of measuring the distance of each pixel relative to the camera. Depth is extracted from either monocular or stereo images. Multi-view geometry helps find the relationships between images.

A cost function helps to minimize the error between the current and the final location, which is not a predesignated point but an iterative transition state determined by steepest gradient descent.
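A minimal sketch of that idea, assuming a simple quadratic cost 0.5 * ||p - goal||^2 whose gradient is (p - goal), follows; the step size and iteration count are arbitrary choices.

def descend(position, goal, rate=0.2, steps=20):
    # Steepest descent: repeatedly move against the gradient of the squared error.
    x, y = position
    gx, gy = goal
    for _ in range(steps):
        x -= rate * (x - gx)
        y -= rate * (y - gy)
    return x, y

print(descend((0.0, 0.0), (3.0, 4.0)))  # approaches (3.0, 4.0)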

Saturday, September 16, 2023

Overwatch deployment issues and resolutions:

 


·        Issue #1) Parameter names have changed

The ETL_STORAGE_PREFIX used to point to the location where the ETL database and the consumer database were stored. However, since the underlying storage account is used for a wide variety of tasks, including calculations and report generation, this parameter has changed to STORAGE_PREFIX. Earlier, the value would typically be a dbfs file location or a /mnt/folder; the parameter now also allows values following the abfss://<container>@<storage-account>.dfs.core.windows.net convention for locating reports and deployment directories. The /mnt/folder is still the best route to go with Overwatch jobs, although the use of mounts is being deprecated in Databricks.
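For illustration, the rename amounts to something like the following; both values are hypothetical placeholders rather than an actual deployment configuration.

ETL_STORAGE_PREFIX = "/mnt/overwatch"  # older parameter name with a mount-style value
STORAGE_PREFIX = "abfss://overwatch@<storage-account>.dfs.core.windows.net/overwatch"  # newer name with an abfss-style value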

·        Issue #2) Location migrations with different versions of the Overwatch deployment notebook

Occasionally, the 70 version of the Overwatch deployment notebook is run before the 71 version, and even the location specified for the storage prefix might change as users become aware of the different ways in which the notebook deploys the schema. The two deployments are independent, but the first location is what the hive_metastore will show. Although the table names remain the same between the notebook versions, consistency between the notebook version, the databases, and the dashboards is still a requirement.

·        Issue #3) Missing tables or the generic table or view not found error is encountered when using Overwatch

Even though the results from the notebook execution might appear successful, there may be messages about the validations that were performed. A false value for any validation indicates that the database tables will not be as pristine as they would be if all the rules had passed. Also, some executions do not create all the tables in the consumer database, so repeated runs of the deployment notebook are required whenever there are warnings or messages. If all warnings and errors cannot be cleared, it is better to drop and recreate the databases.

·        Issue #4) There are stale entries for locations of the etl or consumer databases or there are intermittent errors when reading the data.

A location specified as a mount is accessible only through a service account or a dbx connector; it does not use the credentials of the logged-in user. Access to the remote storage for Overwatch must consistently maintain both the account and the access control, so switching between credentials will not help in this case. The preferred arrangement is for Overwatch to continue running with admin credentials while the data is accessed with the storage-access token.

·        Issue #5) DB Name is not unique or the locations do not match.

The primordial date must be specified in the form yyyy-MM-dd. Excel, however, saves the date in a different format, and while the value may look consistent to the user, the error manifests in different forms, mostly as complaints about the database name and location. Specifying the date correctly, making sure the validations pass, and confirming that the databases are created correctly helps smooth out Overwatch operations.

Friday, September 15, 2023

 

This is a summary of the book “Win from Within: Build Organizational Culture for Competitive Advantage,” written by James Heskett, professor emeritus of Business Logistics at Harvard Business School. The book was published by Columbia Business School Publishing in 2022. It provides an applicable overview with concrete examples.

The book details 16 steps to change your culture on the premise that evidence does not support most of the common wisdom about organizational culture. An effective culture boosts the bottom line and fosters flexibility, innovation, and learning. Responsibility rests with the leaders to engage and retain employees and an organization’s policies must reflect its values. High-engagement workplaces share several crucial characteristics and experimentation improves your likelihood of success. There might be some challenges presented by remote work, but they are not insurmountable. The risk associated with good cultures going bad is that change becomes difficult.

A strong culture does not imply marketplace success and is not necessarily a winning asset. It could even be toxic. But leaders can shift the culture in a matter of months. The steps listed here are useful to everyone involved in managing organizations.

Culture and strategy are complementary. For example, Satya Nadella simultaneously healed Microsoft’s dysfunctional culture and led a major strategic shift from Windows to cloud computing. By contrast, resisting new ideas on the assumption that what worked in the past will continue to work is one of the most common pitfalls.

An effective culture boosts the bottom line and fosters flexibility, innovation, and learning. The competitive advantage of an effective culture can outlive that of any strategy. Organizations that put their employees first gained long-term market share and later rewarded their shareholders handsomely. Analysts can predict a company’s relative profitability by studying just its culture. There can even be a virtuous feedback loop between cultural changes and impact on profit. For example, Ritz-Carlton vets its hires thoroughly and empowers almost any employee to spend up to $2,000 to redress a guest’s problem. It emphasizes attitude and empathy.

Leaders must engage and retain employees, and culture can be a tiebreaker in attracting talent. Organizations with effective cultures can win that tiebreak, but they can also become pressure cookers. Discontent stems from a lack of training and a lack of acknowledgment.

Companies known for highly engaged employees train their recruiters in employee engagement as a competitive advantage. They seek people with complementary viewpoints and empower them with the necessary skills. The US Marine Corps, the Mayo Clinic and Harvard Business School have all sustained high engagement beyond their founding generation and leverage a team-based structure to maintain the culture. Similarly, Southwest Airlines views a late departure as a team failure, not an individual one. This results in a top on-time record.

Experimentation is key to success. Booking.com authorizes any staffer to run a test without advance approval. Testing is taught, and test evidence overrides executive judgment. Failed tests provide lessons. The author asserts that measurement without action is a great way to scuttle the success of all the effort that precedes it.

Sometimes, a toxic culture has devastating results. After two Boeing 737 MAX planes crashed, a whistleblower said management had rejected an engineer’s request for a safety measure. Employees feared retaliation for bringing problems to management’s attention. Similarly, the O-Ring failure destroyed the Challenger space shuttle, and the case of Volkswagen’s emissions-testing imbroglio is well-known.

Remote work presents cultural challenges and the best that the leaders of increasingly remote workforces can hope for may be hiring advantages and modest increases in productivity.

James Heskett lists the following steps to accomplish culture change:

1.       Leaders acknowledge the need for culture change – Leaders must take note of the metrics and messages emerging from the “shadow culture.”

2.       Use discontent with the status quo as a spur for change – Drastic steps might be needed to crystallize that discontent and to alleviate the concerns people have about change.

3.       Share the message of change – Communications must be ongoing, clear, and simple. Listen to the reactions. Repeat.

4.       Designate a change team – A team can be tasked with driving cultural change: codifying values, gathering input, meeting deadlines, and maintaining the impetus for change.

5.       Install the best leaders – Bring the right people to the fore; tell the wrong people good-bye. Your goal is alignment around change.

6.       Generate and maintain urgency – Culture change should take six to 12 months. As John Doerr said, “Time is the enemy of transformation.” Build in a sense of drive.

7.       Draft a culture charter – Articulate what must change and how. For example, Microsoft spurred change to empower people “to achieve more.” Compare the current state to the desired future.

8.       Promulgate a change statement that involves the whole organization – Communication is crucial. Gather comments; include or reject them; document the outcome.

9.       Set up a “monitor team” – This team tracks relevant measurements, checks progress, and ensures that communication continues.

10.   Align everything – Changes must align with corporate values. Reward what matters.

11.   Put changes into motion – Leaders must walk the talk. McKinsey found that change is more than five times likelier when leaders act the way they want their employees to act.

12.   Teach people at every level how to implement change – Training must be imparted.

13.   Measure new behaviors – Align your metrics with your new expectations and handle troubles.

14.   Acknowledge progress – Milestones are just as much reason to celebrate as the goal.

15.   Give big changes time to unfold – Long range habits take time to reach the customer.

16.   Keep reminding yourself what culture change requires – This is an ongoing evolution. Frequent check-ins with everyone on the team and recalibrations help.

Thursday, September 14, 2023

Some of the learnings from deploying Overwatch on Databricks

This is a continuation of previous articles on Overwatch, which can be considered an analytics project over Databricks. It collects data from multiple data sources, such as APIs and cluster logs, enriches and aggregates the data, and comes with little or no cost. This section of the article describes some considerations for deploying Overwatch that might not be obvious from the public documentation but help with optimizing the deployments.

Overwatch deployments must include an EventHub as well as a storage account. The EventHub receives diagnostic data emitted by the target Databricks workspaces. Usually only one EventHub namespace is required for an Overwatch deployment, but it will contain 1 to N Event Hubs, one for each workspace monitored. Once the Event Hubs and their namespace are created, the workspaces must be associated with them, which does not alter a workspace that already exists. The association appears on the workspace in the diagnostic settings under the monitoring section of that instance.

Unlike the EventHub that receives the diagnostic data, the storage account is required as a working directory for the Overwatch instance so that it may write out the reports from the calculations it makes. These reports could be in binary format, but the aggregated information on a dbu-cost basis as well as on an instance-level basis is available to view in two independent tables in the Overwatch database on the workspace where it is deployed. Other artifacts are also stored in this storage account, such as the parameters for the deployment of Overwatch and incremental computations, but the entire account can be dedicated to Overwatch as a working directory. It is because the storage account is dedicated to Overwatch that the compute logs from the workspaces are also archived here; the locality of the data enables the Overwatch jobs to read the logs at minimum cost.

This is another diagnostic setting for a workspace, and it might be in addition to settings that already send the workspace logs elsewhere, either via an EventHub or via a different storage account. Separating the logs read by Overwatch from those kept for other purposes helps Overwatch remain performant and reliable by maintaining isolation. The compute logs are read only by Overwatch, so they need not be retained longer than necessary and are intended only for Overwatch’s computations.

Both the EventHub and the storage account can be regional, because cross-region transfer of data can be expensive; being able to decide what data is sent to Overwatch, and keeping it local, reduces the cost significantly. Instead of trying to eliminate storage costs, it is better to exercise control over what and how much data is sent to Overwatch for its calculations. Having multiple diagnostic settings on the Databricks workspace helps with this.

Lastly, it must be noted that cluster logs are different from compute logs: the former are emitted by the clusters spun up by users on a Databricks workspace, while the latter are written out by the Databricks workspace itself. All jobs, whether user jobs or Overwatch jobs, access the data over https or via mounts. The https way of accessing data uses the abfss://<container>@<storage-account>.dfs.core.windows.net qualifier; mounts can be set up via

configs = {
  # OAuth settings for Azure Data Lake Storage Gen2; the placeholders identify the
  # service principal that has been granted access to the storage account.
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Mount the container so that jobs can address it as /mnt/<mount-name>.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

When a cluster is created, its logging destination must be set to this mount; the setting is found under the advanced configuration section.
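A hedged sketch of setting that destination programmatically through the Databricks Clusters API is shown below; the workspace URL, token, and cluster sizing values are placeholders, and the same destination can simply be entered in the Logging tab under Advanced options in the UI.

import requests

cluster_spec = {
    "cluster_name": "overwatch-monitored",
    "spark_version": "13.3.x-scala2.12",      # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",        # placeholder Azure node type
    "num_workers": 2,
    # Deliver the cluster logs to the mount created above.
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/mnt/<mount-name>/cluster-logs"}},
}
resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
print(resp.json())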

This summarizes the capture and analysis by Overwatch deployments.

Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWhM9v1fn0cHGer-BjQg?e=ke3d50


Wednesday, September 13, 2023

Overwatch organization

 

Overwatch can be taken as an analytics project over Databricks. It collects data from multiple data sources, such as APIs and cluster logs, enriches and aggregates the data, and comes with little or no cost. Audit logs and cluster logs are the primary data sources. Databricks monitors and logs cluster metrics such as CPU utilization, memory usage, network I/O, and storage; job-related telemetry such as scheduled jobs, run history, execution times, and resource utilization; notebook execution metrics for individual notebook runs, including execution time, data read/write, and memory usage; logging and metrics export, including data sent to application monitoring tools like Datadog or New Relic to gain deeper insights into performance alongside other applications and services; and SQL Analytics monitoring, including query performance and resource utilization.

The Deployment runners used for Overwatch take the following parameters:

ETL Storage prefix

ETL database name

Consumer DB Name

Secret Scope

Secret Key for Databricks PAT Token

Secret Key for EventHub

Event Hub Topic Name

Primordial Date

Max Days

AT Scopes

These parameters are stored in a csv file in the deployment folder of the storage account associated with Overwatch and mounted via the ETL storage prefix.
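For example, the parameter file can be inspected from a notebook with a read such as the one below; the path and file name are illustrative and not the exact Overwatch layout.

params = (spark.read
          .option("header", "true")
          .csv("<etl_storage_prefix>/deployment/overwatch_deployment_config.csv"))
params.show(truncate=False)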

So it would seem that the storage account used with the Overwatch notebook jobs serves both reads and writes: it collects the cluster logs for reading, say from a cluster-logs directory, and receives the corresponding calculations in, say, a reports folder within the same account, as <etl_storage_prefix>/cluster-logs and <etl_storage_prefix>/reports. However, the Overwatch jobs that run for a long time and parse large and plentiful logs run in a dedicated manner. It is possible to serve the reads from a location different from the writes, which involves injecting the separate locations into the Overwatch jobs. The default locations, the storage-account-qualified cluster-logs folder and the reports folder, are configurable.

With the newer versions, the etl_storage_prefix has been renamed to storage_prefix to indicate that it is just the working directory for Overwatch; all the logs are accessed via the mount_mapping_path variable, which lists the remote locations of log storage as paths different from the ones the storage_prefix points to. Therefore, the reports are written to an abfss://<container>@<storage-account> location on Azure Data Lake Storage, while the cluster logs can be read from mounts such as dbfs:/mnt/logs.



Tuesday, September 12, 2023

 This is a continuation of a series of articles on the shortcomings and resolutions of Infrastructure-as-Code (IaC). One of the commonly encountered situations is when settings for a resource must preserve the old configuration as well as the new configuration but the resource only allows one way. 

Let us take the example of the monitoring section of compute resources like Databricks that host a wide variety of analytical applications. Given the nature of the long-running jobs on a Databricks instance, diagnosability is critical to ensure incremental progress and completion of the computation involved. All forms of logs, including those from the log categories of Databricks File System, Clusters, Accounts, Jobs, Notebook, SSH, Workspace, Secrets, SQLPermissions, Instance Pools, SQL Analytics, Genie, Global Init Scripts, IAM Role, MLFlow Experiment, Feature Store, Remote History Service, MLFlow Acled Artifact, DatabricksSQL, Delta Pipelines, Repos, Unity Catalog, Git Credentials, Web Terminal, Serverless Real-Time Inference, Cluster Libraries, Partner Hub, Clam AV Scan, and Capsule 8 Container Security Scanning Reports, must be sent to a destination where they can be read and analyzed. Typically, this means sending these logs to a Log Analytics workspace, archiving them to a storage account, or streaming them to an event hub.

Since the same treatment is needed for all Databricks instances in one or more subscriptions, the strategy to collect and analyze logs is centralized and consolidated with a directive. Just like log4j for an application, that directive to send the logs to a destination might mean an EventHub as the destination so that the logging events can be forwarded to multiple listeners. Such a directive will require the namespace of an EventHub and a queue within it. 

Now the Databricks instance might require introspective analysis of its logs to detect usage patterns on the clusters in the instance and to determine their cost. This is the technique used by the Overwatch feature, which is a log reader and a calculator and requires an EventHub to collate logs from multiple workspaces and analyze them centrally within one dedicated workspace.

The trouble arises because each diagnostic setting of the instance can specify only one EventHub, yet one EventHub is now required for the centralized organizational logging best practice and another for Overwatch. The latter might not be able to reuse the former’s EventHub for its purpose, because the former might include many more workspaces than those intended for analysis with Overwatch for cost savings. Performance also suffers when the queues cannot be separated.

The resolution in this case might then be to send the data to another log account and then use a filter to forward only the relevant logs to another storage account and EventHub combination so that they can be analyzed by Overwatch in a performant manner. 

This calls for a databricks diagnostic setting like so: 

data "azurerm_eventhub_namespace_authorization_rule" "eventhub_rule" { 

  name                = "RootManageSharedAccessKey" 

  resource_group_name = "central-logging" 

  namespace_name      = “central-logging-namespace” 

} 

resource "azurerm_monitor_diagnostic_setting" "lpcl_eventhub" { 

  name                           = "central-logging-eventhub-setting" 

  target_resource_id             = “/subscriptions/…/resourceGroups/…/instance” 

  eventhub_authorization_rule_id = data.azurerm_eventhub_namespace_authorization_rule.eventhub_rule.id 

  eventhub_name                  = "log-consolidator" 

: 

} 

The use of a relay or a consolidator is one of the ways in which this situation can be resolved.