Thursday, September 21, 2023

 

The following is a list of some errors and resolutions encountered while deploying Overwatch dashboards:

1.       Overwatch dashboard fails with errors mentioning missing tables.

The databases that Overwatch needs are the consumer database, usually named overwatch, and the ETL database, usually named overwatch_etl. These databases are deployed with the Overwatch notebook runners, which come in two versions, 70 and 71. The latter requires a storage account to be created and a CSV to be uploaded to the deployment folder within the overwatch container or bucket in a public cloud storage account. The CSV requires a mount location, referred to as the storage prefix, where all the files associated with creating and using the databases are kept. Two files reside there, one each for the overwatch consumer database and the overwatch_etl database, and they persist the databases outside the catalog of the Databricks instance.

When the notebook runs, the tables are created within the catalog along with the associated files on the storage account. Over sixty jobs run to create these tables, and eventually all the tables appear in the catalog. Because of the high number of jobs, failures are common and not all the tables get populated. Rerunning the notebook a few times helps close the gap toward a complete database.
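A quick way to gauge how far a run has progressed is to list what has actually landed in the two databases. The following is a minimal sketch, assuming the default database names overwatch and overwatch_etl; the expected table set varies by Overwatch version, so treat this as a progress check rather than a definitive validation:

# Count and list the tables the deployment has created so far.
for db in ["overwatch", "overwatch_etl"]:
    tables = sorted(t.name for t in spark.catalog.listTables(db))
    print(f"{db}: {len(tables)} tables")
    for name in tables:
        print(f"  {name}")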

 

2.       Overwatch has mismatching files and/or databases and must be redeployed, but the starting point is not clean

Given the different notebook versions in use and the intermittent failures from executing any one of them, it is quite likely that a redeploy from a clean slate will be required. Deleting just the persistence files from the storage account will not help, because the catalog in the Databricks instance might retain references to stale configuration. Although a cleanup script is available alongside the Overwatch deployment notebooks, the speediest resolution is to execute the following commands:

DROP DATABASE overwatch_etl CASCADE;
DROP DATABASE overwatch CASCADE;
-- CLEAR CACHE;

This deletes the associated files from the storage account as well. If Overwatch is being upgraded, even from a stale deployment, it is also advisable to follow up by recreating the storage account container and mounting it on the Databricks cluster.
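Before redeploying, it is worth confirming that nothing lingers under the storage prefix. A minimal sketch, assuming the prefix is mounted at the hypothetical path /mnt/overwatch-wd:

# Check whether the storage prefix still holds any files;
# dbutils.fs.ls raises an exception if the path itself is gone.
try:
    leftovers = dbutils.fs.ls("/mnt/overwatch-wd")
    print(f"{len(leftovers)} entries still present")
except Exception as e:
    print("Path no longer exists:", e)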

 

3.       When the storage prefix refers to the remote location via the abfss://container@storageaccount.dfs.core.windows.net naming scheme, an unauthorized error frequently appears.

Although mounts are deprecated and direct abfss access is the newer approach, creating a mount up front helps avoid repeated credential resolution on every name lookup. This can be done with the following script:

configs = {"fs.azure.account.auth.type": "OAuth",

          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",

          "fs.azure.account.oauth2.client.id": "<application-id>,

          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-key>",key="<key-name>"),

          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"}

#dbutils.fs.unmount("/mnt/overwatch-wd")

dbutils.fs.mount(

  source = "abfss://container@storageaccountname.dfs.core.windows.net/",

  mount_point = "/mnt/overwatch-wd",

  extra_configs = configs)

 

dbutils.fs.ls("/mnt/overwatch-wd")

Wednesday, September 20, 2023

 

This is a continuation of the articles on infrastructure deployments. One of the popular instruments for exercising governance over Azure resources is Azure Policy. A policy consists of a definition and an assignment. Assignments determine which resources are evaluated against which policies or initiatives. The assignment can also supply the parameter values for that group of resources at assignment time, which makes a definition reusable across different compliance needs.

Among the properties of the assignment, the enforcement mode stands out. This property lets customers test the outcome of a policy on existing resources without initiating the policy effect or triggering entries in the Azure Activity Log. It is also referred to as the “What If” scenario and aligns with safe deployment practices. When the mode is set to Enabled, the JSON value is ‘Default’ and the policy effect is enforced during resource creation or update. When the mode is set to Disabled, the JSON value is ‘DoNotEnforce’ and the policy effect is not enforced during resource creation or update. If the enforcement mode is not specified, the value ‘Default’ applies.

The scope of the assignment includes all child resource containers and child resources. If a child resource container or child resource should not have the definition applied, it can be excluded from evaluation by adding it to notScopes, which defaults to an empty array [].
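Putting these assignment properties together, the sketch below expresses the body of an assignment as a Python dictionary mirroring the JSON one would submit through the ARM REST API or an SDK; the display name, definition ID, excluded resource group, and parameter are illustrative placeholders:

# Illustrative assignment body: parameters are bound at assignment time,
# enforcementMode is set to "DoNotEnforce" for a What-If style rollout,
# and one child resource group is excluded via notScopes.
assignment = {
    "properties": {
        "displayName": "Require a costCenter tag (audit only)",
        "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/<definition-id>",
        "enforcementMode": "DoNotEnforce",
        "notScopes": ["/subscriptions/<subscription-id>/resourceGroups/<excluded-rg>"],
        "parameters": {"tagName": {"value": "costCenter"}},
    }
}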

The effects currently supported in a policy definition include Append, Audit, AuditIfNotExists, Deny, DenyAction, DeployIfNotExists, Disabled, Manual, and Modify. When the effect is Modify, Append, or DeployIfNotExists, Policy alters the request or adds to it. When the effect is Audit or AuditIfNotExists, Policy creates an Activity log entry for new and updated resources. And when the effect is Deny or DenyAction, Policy stops the creation or alteration of the request. Effects must always be tried out: validating a policy ensures that non-compliant resources are correctly reported and that false positives are excluded. The recommended approach to validating a new policy definition is to follow these steps: tightly define the policy, audit existing resources, audit new or updated resource requests, deploy the policy to resources, and monitor continuously.

A differentiation between Audit and AuditIfNotExists must be called out. Audit generates a warning event in the activity log when the evaluated resource itself is non-compliant, but it does not fail the request. AuditIfNotExists generates a warning event in the activity log if a resource related to the one being evaluated does not exist. Its if condition evaluates a field, so a value must be provided for the name of the field; it references fields on the resources that are being evaluated.
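As an illustration, here is a minimal sketch of an AuditIfNotExists rule, again written as a Python dictionary mirroring the definition JSON; the resource types and the publisher value are representative placeholders, not a recommended production policy:

# Illustrative AuditIfNotExists rule: flag virtual machines that lack a
# monitoring extension. The if block matches VMs; the details block names
# the related resource type and the existence condition to satisfy.
policy_rule = {
    "if": {"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
    "then": {
        "effect": "auditIfNotExists",
        "details": {
            "type": "Microsoft.Compute/virtualMachines/extensions",
            "existenceCondition": {
                "field": "Microsoft.Compute/virtualMachines/extensions/publisher",
                "equals": "Microsoft.Azure.Monitor",
            },
        },
    },
}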

 

Tuesday, September 19, 2023

Pure and mixed templates:

Infrastructure-as-code is a declarative paradigm: a language for describing infrastructure and the state it must achieve. The service that understands this language supports tags, RBAC, declarative syntax, locks, policies, and logs for the resources and their create, update, and delete operations, all of which can be exposed via the command-line interface, scripts, web requests, and the user interface. The declarative style also helps boost agility, productivity, and quality of work within organizations.

Template providers often go to great lengths to define the conventions, syntax, and semantics that authors can use to describe the infrastructure to be set up. Many provide common forms of expressing infrastructure, with near-equivalents across providers. Authors, however, rely on tools to import and export infrastructure, and consequently they must mix and match templates.

One such template provider is AWS CloudFormation. Terraform is the open-source equivalent that helps users set up and provision datacenter infrastructure independently of any one cloud. These cloud configuration files can be shared among team members, treated as code, edited, reviewed, and versioned.

Terraform allows JSON and YAML to be included in templates and state files using the built-in functions jsonencode and yamlencode respectively. With tools that export templates in one of these two well-known forms, importing into Terraform becomes easy with these two built-in functions. Terraform can also read and export existing cloud infrastructure in its own syntax, but the export often arrives in an ugly, compressed, hard-to-read format, and the same two functions allow a multi-line rendering of the content that is far more readable.

AWS CloudFormation has a certain appeal for being AWS native with a common language to model and provision AWS and third-party resources. It abstracts the nuances in managing AWS resources and their dependencies making it easier for creating and deleting resources in a predictable manner. It makes versioning and iterating of the infrastructure more accessible. It supports iterative testing as well as rollback.  

Terraform’s appeal is that it can be used for multi-cloud deployment. For example, it can deploy serverless functions with AWS Lambda, manage Microsoft Azure Active Directory resources, and provision a load balancer in Google Cloud.

Both facilitate state management. With CloudFormation, users can perform drift detection on all of their assets and get notifications when something changes. It also determines dependencies and performs certain validations before a delete command is honored. Terraform stores the state of the infrastructure on the provisioning computer or at a remote site, in proprietary JSON that serves to describe and configure the resources. CloudFormation manages state automatically with no user involvement, whereas Terraform requires you to specify a remote store or fall back to the local disk to save state.

Both have their unique ways for addressing flexibility for changing requirements. Terraform has modules which are containers for multiple resources that are used together and CloudFormation utilizes a system called “nested stacks” where templates can be called from within templates. A benefit of Terraform is increased flexibility over CloudFormation regarding modularity.  

They also differ in how they handle configuration and parameters. Terraform uses provider-specific data sources, and the implementation is modular, allowing data to be fetched and reused. CloudFormation allows up to 60 parameters per template, each of which must be of a type that CloudFormation understands and must be declared or retrieved from the Systems Manager Parameter Store for use within the template.
Both are powerful cloud infrastructure management tools, but Terraform is the more favorable choice for cloud-agnostic support. It also ties in very well with DevOps automations such as GitLab. Finally, having an abstraction over cloud lock-in might also benefit the organization in the long run.


Monday, September 18, 2023

Overwatch deployment issues and resolutions continued.

  

·        Issue #6) Dropping the database does not work

The workspace configuration for the etl database might be hardcoded, and the file location for the database might linger even after the db file and the database were dropped. There is a cleanup script available from Overwatch that explains the correct order of operations and even comes with a dry run.


Sunday, September 17, 2023

Drone Formation

 

Teaching a drone to fly is different from teaching a swarm of drones to fly. A central controller can issue a group command, and when each of the drones executes the command, the formation flies. If the formation is unchanged, the group command is merely relayed across to the group members; the fleet acts as one group for the purpose of relaying the same command. When the fleet changes formation, the command changes for individual members, and each unit moves from one position to another without colliding with the others.

The movement has as many degrees of freedom as a particle. A drone is often represented as a volumetric pixel, or voxel for short. An altimeter and a GPS coordinate are sufficient to let the unit maintain its position. When the group command is issued, the movement of the group is specified. Consensus algorithms help with the group behavior without worrying about the exact position of each unit in the group. The flight of any one unit can be written in the form of the unicycle model, with u1 as the velocity and u2 as the change in the heading, or angle relative to Cartesian coordinates; the term unicycle refers to the cosine and sine of the heading giving the x- and y-axis displacements. Unicycle consensus algorithms can help the group achieve the intended formation.
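In standard form, the unicycle kinematics for a unit at position (x, y) with heading θ are:

\dot{x} = u_1 \cos\theta, \qquad \dot{y} = u_1 \sin\theta, \qquad \dot{\theta} = u_2

A common heading-consensus law, one variant among several, sets u_{2,i} = \sum_{j \in N_i} (\theta_j - \theta_i) for each unit i over its neighbor set N_i, driving the headings toward agreement so the formation moves as one.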

One of the most used drone fleet navigation methods is the Simultaneous Localization and Mapping (SLAM) algorithm, which provides a framework within which the drones can plan their paths. A drone only needs to know its location, build or acquire a map of its surroundings, and plan a path as a series of positions, if not as the next linear displacement. Consensus helps determine that paths do not conflict. Without imminent collision, units can take their time arriving at their final formation.

Conditions are not always ideal, even for the most direct displacements. Wind and obstructions are some of the challenges encountered. A unit might not have the flexibility to move in any direction and must coordinate the movement of its parts to achieve the intended effect. When the current position is hard to maintain and movement toward the final position is pushed off course by external influence, the path can be amended with intermediate positions chosen to reduce the sum of squared errors on the way to the designated position. As a combination of external influence and the internal drive to reduce those errors, the points along the alternate path can be determined; an obstruction to a linear displacement would then yield a path of positions along a rough semicircle around the obstruction.

Depth estimation is another navigation technique, in which the unit’s sensors are enhanced to give it a better reference for its surroundings before the flight path is optimized. The term comes from traditional image processing, where it refers to the task of measuring the distance of each pixel relative to the camera. Depth is extracted from either monocular or stereo images, and multi-view geometry helps find the relationships between images.

A cost function helps minimize the error between the current and the final location, which is not a predesignated point but an iterative transition state determined by steepest gradient descent.
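A minimal numerical sketch of that idea, assuming a squared-error cost toward a target waypoint plus a Gaussian penalty around a known obstacle; the penalty shape, weights, and step size are illustrative choices, not part of any particular flight controller:

import numpy as np

# Gradient of ||p - target||^2 plus a Gaussian repulsion around an obstacle.
def grad(p, target, obstacle, repulse=4.0):
    d = p - obstacle
    return 2.0 * (p - target) - 2.0 * repulse * d * np.exp(-d @ d)

p = np.array([0.0, 0.1])        # start slightly off-axis to break symmetry
target = np.array([5.0, 0.0])
obstacle = np.array([2.5, 0.0])
alpha = 0.05                    # illustrative step size

for _ in range(200):            # steepest descent toward the target
    p = p - alpha * grad(p, target, obstacle)

print(p)                        # lands near the target after skirting the obstacle

The intermediate positions traced by the loop form exactly the kind of rough semicircular detour around an obstruction described above.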

Saturday, September 16, 2023

Overwatch deployment issues and resolutions:

 


·        Issue #1) Parameter names have changed

The ETL_STORAGE_PREFIX used to point to the location where the ETL database and the consumer database were stored. However, since the underlying storage account is used for a wide variety of tasks, including calculations and report generation, this parameter has changed to STORAGE_PREFIX. Earlier, the value would typically be a dbfs file location or a /mnt folder; the parameter now also accepts the abfss://container@storage_account.dfs.core.windows.net convention for locating reports and deployment directories. The /mnt folder is still the best route to go for Overwatch jobs, although the use of mounts is being deprecated in Databricks.

·        Issue #2) Location migrations with different versions of the Overwatch deployment notebook

Occasionally, the 70 version of the Overwatch deployment notebook is run before the 71 version, and even the location specified for the storage prefix might change as users become aware of the different ways in which each notebook deploys the schema. The two locations are independent, but the first one is what the hive_metastore will show. Although the table names remain the same between the notebook versions, version consistency between the notebook, the databases, and the dashboards is still a requirement.

·        Issue #3) Missing tables or the generic table or view not found error is encountered when using Overwatch

Even though the results of the notebook execution might appear to show that it was successful, there may be messages within it about the validations that were performed. A false value for any validation indicates that the database tables will not be as pristine as they would be if all the rules had passed. Also, some executions do not create all the tables in the consumer database, so repeated runs of the deployment notebook are required whenever there are warnings or messages. If all warnings and errors cannot be eliminated, it is better to drop and recreate the databases.

·        Issue #4) There are stale entries for locations of the etl or consumer databases or there are intermittent errors when reading the data.

The location that was specified as a mount is only accessible by using a service account or a dbx connector; it does not use the same credentials as the logged-in user. Access to the remote storage for the purposes of Overwatch must always maintain both the account and the access control. Switching between credentials will not help in this case. It is preferable for Overwatch to continue running with admin credentials while the data is accessed with the token for storage access.

·        Issue #5) DB Name is not unique or the locations do not match.

The primordial date must be specified in the form yyyy-MM-dd. Excel saves dates in a different format, and while this may appear consistent to the user, the error manifests in different forms, with complaints mostly about name and location. Specifying the date correctly, and making sure the validations pass and the databases are correctly created, helps smooth out Overwatch operations.
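A minimal sketch for normalizing whatever a spreadsheet produced into the required form; the input formats tried here are illustrative:

from datetime import datetime

def normalize_primordial(raw):
    # Try a few common spreadsheet formats and emit the yyyy-MM-dd
    # form that the Overwatch primordial date requires.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw}")

print(normalize_primordial("9/16/2023"))  # -> 2023-09-16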

Friday, September 15, 2023

 

This is a summary of the book “Win from Within: Build Organizational Culture for Competitive Advantage,” written by James Heskett, a professor emeritus of Business Logistics at the Harvard Business School. The book was published by Columbia Business School Publishing in 2022 and provides an applicable overview with concrete examples.

The book details 16 steps to change your culture, on the premise that evidence does not support most of the common wisdom about organizational culture. An effective culture boosts the bottom line and fosters flexibility, innovation, and learning. Responsibility rests with leaders to engage and retain employees, and an organization’s policies must reflect its values. High-engagement workplaces share several crucial characteristics, and experimentation improves your likelihood of success. Remote work presents some challenges, but they are not insurmountable. The risk with good cultures going bad is that change becomes difficult.

A strong culture does not imply marketplace success and is not necessarily a winning asset. It could even be toxic. But leaders can shift the culture in a matter of months. The steps listed here are useful to everyone involved in managing organizations.

Culture and strategy are complementary. For example, Satya Nadella simultaneously healed Microsoft’s dysfunctional culture and led a major strategic shift from Windows to cloud computing. By contrast, resisting new ideas on the assumption that what worked in the past will continue to work is one of the most common pitfalls.

An effective culture boosts the bottom line and fosters flexibility, innovation, and learning. The competitive advantage of an effective culture can outlive that of any strategy. Organizations that put their employees first gained long-term market share and later rewarded their shareholders handsomely. Analysts can predict a company’s relative profitability by studying its culture alone. There can even be a virtuous feedback loop between cultural changes and their impact on profit. For example, Ritz-Carlton vets its hires thoroughly and empowers almost anyone to spend up to $2,000 to redress a guest’s problem. It emphasizes attitude and empathy.

Leaders must engage and retain employees, and culture can be a tiebreaker in attracting talent. Organizations with effective cultures can win those tiebreaks, but they can also be pressure cookers. Discontent stems from a lack of training and a lack of acknowledgment.

Companies known for highly engaged employees train their recruiters to treat employee engagement as a competitive advantage. They seek people with complementary viewpoints and empower them with the necessary skills. The US Marine Corps, the Mayo Clinic, and Harvard Business School have all sustained high engagement beyond their founding generations and leverage a team-based structure to maintain the culture. Similarly, Southwest Airlines views a late departure as a team failure, not an individual one; the result is a top on-time record.

Experimentation is key to success. Booking.com authorizes any staffer to run a test without advance approval. Testing is taught, and test evidence overrides executive judgment; failed tests provide lessons. The author asserts that measurement without action is a great way to scuttle the success of all the effort that precedes it.

Sometimes a toxic culture has devastating results. After two Boeing 737 MAX planes crashed, a whistleblower said management had rejected an engineer’s request for a safety measure; employees feared retaliation for bringing problems to management’s attention. Similarly, the O-ring failure that destroyed the Challenger space shuttle and Volkswagen’s emissions-testing imbroglio are well-known cases.

Remote work presents cultural challenges, and the best that leaders of increasingly remote workforces can hope for may be hiring advantages and modest increases in productivity.

James Heskett lists the following steps to accomplish culture change:

1.       Leaders acknowledge the need for culture change – Leaders must take note of the metrics and messages emerging from the “shadow culture.”

2.       Use discontent with the status quo as a spur for change – Drastic steps might be needed to crystallize discontent and alleviate the concerns people have about change.

3.       Share the message of change – Communications must be ongoing, clear, and simple. Listen to the reactions. Repeat.

4.       Designate a change team – A team can be tasked with cultural change codifying values, gathering input, meeting deadlines, and maintaining the impetus for change.

5.       Install the best leaders – Bring the right people to the fore; tell the wrong people good-bye. Your goal is alignment around change.

6.       Generate and maintain urgency – Culture change should take six to 12 months. As John Doerr said, “Time is the enemy of transformation.” Build in a sense of drive.

7.       Draft a culture charter – Articulate what must change and how. For example, Microsoft spurred change to empower people “to achieve more.” Compare the current state to the desired future.

8.       Promulgate a change statement that involves the whole organization – Communication is crucial. Gather comments; include or reject them; document the outcome.

9.       Set up a “monitor team” – This team tracks relevant measurements, checks progress, and ensures that communication continues.

10.   Align everything – Changes must align with corporate values. Reward what matters.

11.   Put changes into motion – Leaders must walk the talk. McKinsey found that change is more than five times likelier when leaders act the way they want their employees to act.

12.   Teach people at every level how to implement change – Training must be imparted.

13.   Measure new behaviors – Align your metrics with your new expectations and handle troubles.

14.   Acknowledge progress – Milestones are just as much reason to celebrate as the goal.

15.   Give big changes time to unfold – Long range habits take time to reach the customer.

16.   Keep reminding yourself what culture change requires – This is an ongoing evolution. Frequent check-ins with everyone on the team and recalibrations help.