Monday, September 25, 2023

AI and Product development - Part 1.

 


This article focuses on the role of Artificial Intelligence in product development. In both business and engineering, new product development covers the complete process of taking a concept to realization and introducing it to the market. Getting a product, and thereby a venture, off the ground takes many interdisciplinary endeavors. A central aspect of this process is product design, which involves various business considerations and is broadly described as the transformation of a market opportunity into a product available for sale. A product is meant to generate income, and technology companies leverage innovation to do so in a rapidly changing market. Cost, time, and quality are the main variables that shape customer needs. Business and technology professionals consider product-market fit one of the most challenging aspects of starting a business, and startups often lack the resources to sustain this long and expensive process. This is where Artificial Intelligence holds promise for startups and SMBs.

Since product design involves predicting the right product to build and investing in prototypes, experimentation, and testing, Artificial Intelligence can help us be smarter about navigating the product development course. Research studies cite that roughly 35% of SMBs and startups fail because there is no market need for what they build. AI-powered data analysis can help them be more accurate, with a well-rounded view of the quantitative and qualitative data, in determining whether the product will meet customer needs or even whether the right audience has been selected in the first place. Collecting and analyzing data are strengths of AI, and in this case they help teams connect with customers at a deeper level. One such technique, latent semantic analysis, helps surface what customers really need from large volumes of unstructured feedback. Techniques like latent semantic analysis and softmax-based classification remained niche outside research until the machine-learning boom of the early 2010s brought them into mainstream product work. The traditional, technology-driven way of creating software products contributed to the high failure rate. This is an opportunity to correct that.
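As a rough illustration of how such an analysis can be automated, the following sketch applies latent semantic analysis to a handful of hypothetical feedback snippets using scikit-learn; the sample data and topic count are placeholders rather than part of any real product study.

# Minimal latent semantic analysis (LSA) sketch over customer feedback.
# Assumes scikit-learn is installed; the feedback list is hypothetical sample data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

feedback = [
    "The setup wizard is confusing and takes too long",
    "Love the reporting dashboard, but exports are slow",
    "Exports to CSV time out for large reports",
    "Onboarding documentation is hard to follow",
]

# Represent the feedback as TF-IDF vectors, then reduce them to a few latent topics.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(feedback)
svd = TruncatedSVD(n_components=2, random_state=42)
svd.fit(tfidf)

# Print the terms that weigh most heavily on each latent topic.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top_terms = [terms[j] for j in component.argsort()[::-1][:3]]
    print(f"Topic {i}: {', '.join(top_terms)}")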

Second, AI shortens iteration and time-to-market cycles by plugging into CI/CD pipelines and their reports. Mockups and prototypes often take at least a few weeks as teams work through friction and unexplored territory, which is a fairly long time for all participants in the process to see the same outcome. The time and money spent to create and test a prototype can put the whole initiative at risk. If this period can be collapsed by better insight into what works and what does not, by reprioritizing the efforts to realize the product, by aligning with a strategy that has a better chance of success, and by avoiding avenues of waste or unsatisfactory returns, the net result is shorter, faster product innovation cycles.

One specific capability of AI deserves attention in this regard. Generative AI can create content from scratch with high speed and, increasingly, accuracy. This ability is easily seen in copywriting, a content production strategy whose goal is to convince the reader to take a specific action, achieved through its persuasive character and triggers that arouse readers’ interest, generate conversations, and drive sales. Copywriting is also an essential part of a digital marketing strategy, with potential to increase brand awareness, generate higher-quality leads, and acquire new customers. Good copywriting articulates the brand’s messaging and image while tuning into the target audience. This is a process that has parallels to product development. AI has demonstrated the potential to generate content from scratch; the gap between generic content writing and persuasive copywriting remains for product developers to close.

Sunday, September 24, 2023

 

Apache Cassandra is an open-source NoSQL distributed database that is trusted by thousands of companies for scalability and high availability without compromising performance. Its linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data.

Azure Managed Instance for Apache Cassandra provides this database as a managed service that automates the deployment, management (patching and node health), and scaling of nodes within an Apache Cassandra cluster. It also provides the capability for hybrid clusters, so Apache Cassandra datacenters deployed in Azure can join an existing on-premises or third-party hosted Cassandra ring. The service is deployed using Azure Virtual Machine Scale Sets.

However, Cassandra is not limited to any one form of compute platform. For example, Kubernetes runs distributed applications, and Cassandra and Kubernetes can be run together; one advantage is the use of containers, another is interactive management of Cassandra from the command line. The Azure Managed Instance for Apache Cassandra is notable for allowing only a limited form of connection and interactivity for managing the Cassandra instance. Most database administration options are limited to the Azure command-line interface, which uses the invoke-command option to pass the actual commands to the Cassandra instance. There is no native invocation of commands by reaching a node's IP address directly, because the Azure Managed Instance for Apache Cassandra does not create nodes with public IP addresses; to connect to a newly created Cassandra cluster, one will need to create another resource inside the VNet. This could be an application, or a virtual machine with Apache's open-source query tool CQLSH installed. The Azure Portal may also provide connection strings that have all the necessary credentials to connect to the instance with this tool. Native access to Cassandra is therefore not limited to the nodetool and sstable commands permitted via the Azure CLI command options. CQLSH is a command-line shell for interacting with Cassandra using CQL (the Cassandra Query Language). It ships with every Cassandra package and can be found in the bin/ directory. It is implemented with the Python native protocol driver and connects to a single specified node, which greatly reduces the overhead of managing the Cassandra control and data planes.
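As a hedged sketch, the same connectivity can be exercised from a VM or notebook inside the VNet with the open-source Python driver that CQLSH itself is built on; the contact point, credentials, and TLS handling below are placeholders and should be replaced with the values from the portal's connection details.

# Minimal sketch: connect to the managed Cassandra cluster from inside the VNet
# using the open-source Python driver (pip install cassandra-driver).
# The contact point and credentials are placeholders.
import ssl
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

auth = PlainTextAuthProvider(username="cassandra-admin", password="<password>")
ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE  # illustration only; validate certificates in practice

cluster = Cluster(["10.0.0.5"], port=9042, auth_provider=auth, ssl_context=ssl_context)
session = cluster.connect()
for row in session.execute("SELECT keyspace_name FROM system_schema.keyspaces"):
    print(row.keyspace_name)
cluster.shutdown()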

The use of containers is a blessing for developers deploying applications to the cloud, and Kubernetes helps with container orchestration. Unlike Azure's managed Kubernetes service, where a client can populate the kubeconfig file with connection configuration using az aks get-credentials and switch contexts with kubectl config use-context, the Azure Managed Instance for Apache Cassandra does not offer kubectl commands. Adding or removing nodes in the Cassandra cluster is managed with the help of the cassandra.yaml file, which can be found in the /etc/cassandra folder within the node. One cannot access the node directly from the Azure Managed Instance for Cassandra, so a shell prompt on the node is out of the question, and the nodetool option to bootstrap is also not available via invoke-command, but it is possible to edit this file. One of the most important properties in this file is the seed provider for existing datacenters. This setting allows a new node to become ready quickly by importing the necessary information from the existing datacenter. The seed provider must not point to the new node itself but to the existing nodes.
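A minimal sketch of that edit, assuming PyYAML and the conventional cassandra.yaml layout (the file path and seed addresses are placeholders), might look like this:

# Sketch: point the seed_provider at existing datacenter nodes in cassandra.yaml.
# Assumes PyYAML is available and the file follows the standard layout; note that
# rewriting the file this way drops comments and may reorder keys.
import yaml

CASSANDRA_YAML = "/etc/cassandra/cassandra.yaml"
EXISTING_SEEDS = "10.0.0.4,10.0.0.5"  # addresses of nodes already in the ring

with open(CASSANDRA_YAML) as f:
    config = yaml.safe_load(f)

# The standard layout is a list with one SimpleSeedProvider entry whose
# parameters hold a comma-separated "seeds" string.
config["seed_provider"][0]["parameters"][0]["seeds"] = EXISTING_SEEDS

with open(CASSANDRA_YAML, "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)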

The Cassandra service on a node must be stopped prior to the execution of some commands and restarted afterwards, and the database must also be set to read-write for certain commands to execute. These options can be passed as command-line parameters to the Azure CLI's managed-cassandra set of commands.

Saturday, September 23, 2023

 

This is a continuation of the previous articles on Azure Databricks usage and Overwatch analysis. While those articles covered the configuration and deployment of Overwatch, the data ingested for analysis was assumed to come from the Event Hub, which in turn collects it from the Azure Databricks resource. This article discusses the collection of the cluster logs, including the output of the logging and print statements in the notebooks that run on the clusters.

The default cluster log directory is ‘dbfs:/cluster-logs’; the Databricks instance delivers logs to it every five minutes and archives them every hour, and the Spark driver logs are saved there as well. The location is managed by Databricks, and each cluster's logs are saved in a sub-directory named after that cluster. When a cluster is created to attach a notebook to, the user sets the cluster's logging destination to dbfs:/cluster-logs under the advanced configuration section of the cluster creation parameters.

The policy under which the cluster gets created is also determined by the users. This policy could also be administered so that users can only create clusters compliant with a policy. In such a policy, the logging destination option can be preset to a path like ‘dbfs:/cluster-logs.’ It can also be substituted with a path like ‘/mnt/externalstorageaccount/path/to/folder’ if a remote storage location is provided, but it is preferable to use the built-in location.
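As a hedged sketch of how such a policy could be created programmatically (the workspace URL and token are placeholders, and the attribute paths assume the standard cluster-policy definition schema), a policy that pins the logging destination might look like this:

# Sketch: create a Databricks cluster policy that fixes the cluster log destination.
# The workspace URL and token are placeholders.
import json
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

policy_definition = {
    "cluster_log_conf.type": {"type": "fixed", "value": "DBFS"},
    "cluster_log_conf.path": {"type": "fixed", "value": "dbfs:/cluster-logs"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "enforce-cluster-logs", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()
print(resp.json())  # returns the policy_id of the new policy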

The Azure Databricks instance transmits its opted-in diagnostic logs to the Event Hub, and for that it requires a diagnostic setting specifying the namespace and Event Hub to send to. Overwatch can read this Event Hub data, but reading from the dbfs:/cluster-logs location is not covered in the documentation.

There are a couple of ways to do that. First, the cluster log destination can be specified in the mapped-path parameter of the Overwatch deployment csv, so that the deployment knows about this additional location to read data from. Although the documentation suggests that the parameter was introduced to cover workspaces that have more than fifty external storage accounts, it is possible to include just the one location that Overwatch needs to read from. This option is convenient for reading the default location, but again the customers or the administrator must ensure that clusters are created to send their logs to that location.

While the above works for new clusters, the second option works for both new and existing clusters: a dedicated Databricks job is created to read the cluster log locations and copy the logs to the location that Overwatch reads from. This job would use a shell command such as ‘rsync’ or ‘rclone’ to perform a copy that can resume across intermittent network failures and indicate progress. When this job runs periodically, the clusters are unaffected, and alongside the Overwatch jobs it makes sure that all the relevant logs not covered by the streaming to the Event Hub are also read by Overwatch.
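As a hedged sketch, such a job could shell out to rclone from a notebook or job task; the source and destination paths below are placeholders, and the sketch assumes rclone is installed on the driver (rclone handles plain local-to-local copies without any remote configuration).

# Sketch of a periodic copy job: mirror cluster logs to the folder Overwatch reads.
# The paths are placeholders; rclone skips files it has already copied, which
# gives the job its resumable behavior.
import subprocess

SOURCE = "/dbfs/cluster-logs"                        # default cluster log location
DESTINATION = "/dbfs/mnt/overwatch-wd/cluster-logs"  # location Overwatch reads from

result = subprocess.run(
    ["rclone", "copy", SOURCE, DESTINATION, "--progress"],
    capture_output=True,
    text=True,
)
print(result.stdout)
result.check_returncode()  # fail the job run if the copy did not succeed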

Finally, the dashboards that report the analysis performed by Overwatch, which are available out-of-the-box, can be scheduled to run nightly so that all the logs collected and analyzed are reflected on a regular cadence.


Friday, September 22, 2023

 

This is a continuation of the previous articles on Azure Databricks and Overwatch analysis. This section focuses on the role-based access control required for the setup and deployment of Overwatch.

The use of a storage account as a working directory for Overwatch implies that it will need to be accessed from the Databricks workspace. There are two ways to do this: one involves Azure Active Directory credentials passthrough with direct ‘abfss://container@storageaccount.dfs.core.windows.net’ name resolution, and the other mounts the remote storage account as a folder on the local file system.

The former requires that the cluster be enabled for Active Directory credentials passthrough and works for directly resolving the deployment and reports folders, but for contents whose layout is determined dynamically, the resolution is expensive each time. The abfss scheme also fails with a 403 error when tokens are demanded for certain activities. The second way, mounting, is instead a one-time setup. The mount is created with the help of a service principal that obtains OAuth tokens from Active Directory, and the mount point becomes the prefix for all the temporary files and folders.
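For reference, the direct-access pattern can be configured per storage account with Spark settings like the following hedged sketch; the storage account name, secret scope, and key names are placeholders, and the keys shown are the standard ADLS Gen2 OAuth settings.

# Sketch: per-storage-account OAuth configuration for direct abfss:// access,
# as an alternative to mounting. The account name, secret scope, and key names
# are placeholders.
storage_account = "storageaccountname"
tenant_id = "<tenant-id>"
client_id = "<application-id>"
client_secret = dbutils.secrets.get(scope="<scope-key>", key="<key-name>")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# After this, abfss:// paths against the account resolve directly from the notebook.
display(dbutils.fs.ls(f"abfss://container@{storage_account}.dfs.core.windows.net/"))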

Using the credentials with Azure Active Directory only works when there are corresponding role assignments and container/blob access control lists. The role assignments for the control plane differ from those of the data plane, so there are roles for both. This separation of roles allows access to certain containers and blobs without necessarily allowing changes to the storage account or to the container organization and management. With ACLs applied to individual files/blobs and folders/containers, authentication, authorization, and auditing are completely covered and scoped at the finest granularity.

Then queries like the following can be very helpful:

1.       Frequent operations can be queried with: 

StorageBlobLogs 

| where TimeGenerated > ago(3d) 

| summarize count() by OperationName 

| sort by count_ desc 

| render piechart  

2.       High latency operations can be queried with: 

StorageBlobLogs 

| where TimeGenerated > ago(3d) 

| top 10 by DurationMs desc 

| project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, ClientLatencyMs = DurationMs - ServerLatencyMs 

3.       Operations causing the most errors can be queried with: 

  StorageBlobLogs 

| where TimeGenerated > ago(3d) and StatusText !contains "Success" 

| summarize count() by OperationName 

| top 10 by count_ desc 

4.       The number of read transactions and bytes read on each container can be queried with:

StorageBlobLogs

| where OperationName  == "GetBlob"

| extend ContainerName = split(parse_url(Uri).Path, "/")[1]

| summarize ReadSize = sum(ResponseBodySize), ReadCount = count() by tostring(ContainerName)

Thursday, September 21, 2023

 

The following is a list of some errors and resolutions encountered with deploying Overwatch dashboards:

1.       Overwatch dashboard fails with errors mentioning missing tables.

The databases that Overwatch needs are the consumer database, usually named overwatch, and the ETL database, usually named overwatch_etl. These databases are deployed with the Overwatch notebook runners, of which there are two versions, 70 and 71. The latter version requires a storage account to be created and a csv to be uploaded to the deployment folder within the overwatch container or bucket in a public cloud storage account. The csv requires a mount location, referred to as the storage prefix, where all the files associated with the creation and use of the databases are kept. There are two files there, one each for the overwatch consumer database and the overwatch_etl database, which persist the databases outside the catalog of the Databricks instance.

When the notebook runs, the tables are created within the catalog along with the associated files on the storage account. Over sixty jobs run to create these tables, and eventually all the tables appear in the catalog. Because of the high number of jobs, failures are common and not all of the tables get populated. Rerunning the notebook a few times helps close the gap toward a complete database, and a quick check like the sketch below shows how far along each database is.
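A hedged sketch of that check, run from a notebook cell, is shown here; the database names are the defaults mentioned above.

# Sketch: list the tables that exist so far in the Overwatch databases to judge
# whether another rerun of the notebook is needed. Database names are the defaults.
for db in ["overwatch", "overwatch_etl"]:
    tables = [row.tableName for row in spark.sql(f"SHOW TABLES IN {db}").collect()]
    print(f"{db}: {len(tables)} tables")
    for name in sorted(tables):
        print(f"  {name}")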

 

2.       Overwatch has mismatched files and/or databases and must be redeployed, but the starting point is not clean

Due to the different versions of the notebook used and the intermittent failures from executing any one of them, it is quite likely that a redeployment from a clean slate is required. Deleting just the persistence files from the storage account will not help because the catalog and the Databricks instance might keep references to stale configuration. Although a cleanup script is available along with the Overwatch deployment notebooks, it is best to execute the following commands for a speedy resolution:

DROP DATABASE overwatch_etl CASCADE;

DROP DATABASE overwatch CASCADE;

-- CLEAR CACHE;

This will delete the associated files from the storage account as well. If Overwatch is being upgraded, even over a stale deployment, it is also advisable to follow up by recreating the storage account container and mounting it on the Databricks cluster.

 

3.       When the storage prefix refers to the remote location via the abfss://container@storageaccount.dfs.core.windows.net naming scheme, an unauthorized error is frequently displayed.

Although mounts are deprecated and abfss is the newer approach, creating a mount up front avoids repeated resolution on every name lookup. This can be done with the following script:

configs = {"fs.azure.account.auth.type": "OAuth",

          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",

          "fs.azure.account.oauth2.client.id": "<application-id>,

          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-key>",key="<key-name>"),

          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"}

#dbutils.fs.unmount("/mnt/overwatch-wd")

dbutils.fs.mount(

  source = "abfss://container@storageaccountname.dfs.core.windows.net/",

  mount_point = "/mnt/overwatch-wd",

  extra_configs = configs)

 

dbutils.fs.ls("/mnt/overwatch-wd")

Wednesday, September 20, 2023

 

This is a continuation of the articles on infrastructure deployments. One of the popular instruments for exercising governance on Azure resources is Azure Policy. A policy consists of a definition and an assignment. Assignments define which resources are evaluated against which policy definitions or initiatives. The assignment can set the values of the definition's parameters for that group of resources at assignment time, which makes the definition reusable for different compliance needs.

Among the properties of the assignment, the enforcement mode stands out. This property gives customers the ability to test the outcome of a policy on existing resources without initiating the policy effect or triggering entries in the Azure Activity Log. It is also referred to as the “What If” scenario and aligns with safe deployment practices. When the mode is set to Enabled, the JSON value is ‘Default’ and the policy effect is enforced during resource creation or update. When the mode is set to Disabled, the JSON value is ‘DoNotEnforce’ and the policy effect is not enforced during resource creation or update. If the enforcement mode is not specified, the value ‘Default’ applies.

The scope of the assignment includes all child resource containers and child resources. If a child resource container or child resource should not have the definition applied, it can be excluded from evaluation by setting notScopes, which defaults to an empty array [].
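As a hedged illustration, an assignment body combining these properties might look like the following sketch; the definition ID, subscription, and excluded resource group are placeholders, and the body would be submitted through ARM templates, the REST API, or the CLI.

# Sketch: shape of a policy assignment body with enforcementMode and notScopes.
# All identifiers are placeholders.
assignment = {
    "properties": {
        "displayName": "Audit example assignment",
        "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/<definition-id>",
        "enforcementMode": "DoNotEnforce",  # 'Default' enforces; 'DoNotEnforce' only evaluates
        "notScopes": [
            "/subscriptions/<subscription-id>/resourceGroups/<excluded-resource-group>"
        ],
        "parameters": {
            "effect": {"value": "Audit"}
        },
    }
}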

The effects currently supported in a policy definition are Append, Audit, AuditIfNotExists, Deny, DenyAction, DeployIfNotExists, Disabled, Manual, and Modify. When the policy definition effect is Modify, Append, or DeployIfNotExists, Policy alters the request or adds to it. When the effect is Audit or AuditIfNotExists, Policy causes an Activity log entry to be created for new and updated resources. And when the effect is Deny or DenyAction, Policy stops the creation or alteration of the request. The effects must always be tried out: validation of a policy ensures that non-compliant resources are correctly reported and that false positives are excluded. The recommended approach to validating a new policy definition is to follow these steps: tightly define the policy, audit existing resources, audit new or updated resource requests, deploy the policy to resources, and monitor continuously.

A differentiation between Audit and AuditIfNotExists must be called out. Audit generates a warning event in the activity log for a non-compliant resource but does not fail the request. AuditIfNotExists generates a warning event in the activity log if a resource related to the matched resource does not exist. Its if condition evaluates a field, so a value must be provided for the name of the field; it references fields on the resources that are being evaluated.
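As a hedged sketch of the shape of such a rule (the resource types and the alias in the existence condition are illustrative placeholders, not a specific built-in definition):

# Sketch: shape of an auditIfNotExists policy rule. The resource types and the
# alias in the existence condition are illustrative placeholders.
policy_rule = {
    "if": {
        "field": "type",
        "equals": "Microsoft.Storage/storageAccounts"
    },
    "then": {
        "effect": "auditIfNotExists",
        "details": {
            "type": "Microsoft.Insights/diagnosticSettings",
            "existenceCondition": {
                "field": "Microsoft.Insights/diagnosticSettings/logs.enabled",
                "equals": "true"
            }
        }
    }
}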

 

Tuesday, September 19, 2023

 Pure and mixed templates: 

Infrastructure-as-code is a declarative paradigm: a language for describing infrastructure and the state it must achieve. The service that understands this language supports tags, RBAC, declarative syntax, locks, policies, and logs for the resources and their create, update, and delete operations, which can be exposed via the command-line interface, scripts, web requests, and the user interface. The declarative style also helps boost agility, productivity, and quality of work within organizations.

Template providers often go to great lengths to define the conventions, syntax, and semantics that authors can use to describe the infrastructure to be set up. Many provide common forms of expressing infrastructure, and the equivalents are similar across providers. Authors, however, rely on tools to import and export infrastructure, and consequently they must mix and match templates.

One such template provider is AWS CloudFormation. Terraform is the open-source equivalent that helps users set up and provision datacenter infrastructure independently of any one cloud. These cloud configuration files can be shared among team members, treated as code, edited, reviewed, and versioned.

Terraform allows JSON and YAML to be included in templates and state files using the built-in functions jsonencode and yamlencode respectively. With tools that export templates in one of these two well-known forms, it becomes easy to bring them into Terraform with these built-in functions. Terraform can also be used to read and export existing cloud infrastructure in its own syntax, but the output is often compressed and hard to read; these two functions allow a multi-line display of the same content, which makes it more readable.

AWS CloudFormation has a certain appeal for being AWS-native, with a common language to model and provision AWS and third-party resources. It abstracts the nuances of managing AWS resources and their dependencies, making it easier to create and delete resources in a predictable manner. It makes versioning and iterating on the infrastructure more accessible, and it supports iterative testing as well as rollback.

Terraform’s appeal is that it can be used for multi-cloud deployments. For example, it can deploy serverless functions with AWS Lambda, manage Microsoft Azure Active Directory resources, and provision a load balancer in Google Cloud.

Both facilitate state management. With CloudFormation, users can perform drift detection on all of their assets and get notifications when something changes. It also determines dependencies and performs certain validations before a delete command is honored. Terraform stores the state of the infrastructure on the provisioning computer or at a remote site, in a proprietary JSON format that describes and configures the resources. CloudFormation handles state management automatically with no user involvement, whereas Terraform requires you to specify a remote store or fall back to local disk to save state.

Both have their own ways of addressing flexibility for changing requirements. Terraform has modules, which are containers for multiple resources that are used together, and CloudFormation uses a system called “nested stacks,” where templates can be called from within templates. A benefit of Terraform is increased flexibility over CloudFormation with regard to modularity.

They also differ in how they handle configuration and parameters. Terraform uses provider-specific data sources, and the implementation is modular, allowing data to be fetched and reused. CloudFormation allows up to 60 parameters per template, each of which must be of a type that CloudFormation understands and must be declared in the template or retrieved from the Systems Manager Parameter Store.
Both are powerful cloud infrastructure management tools, but Terraform is more favorable where cloud-agnostic support matters. It also ties in very well with DevOps automation such as GitLab. Finally, having an abstraction over cloud lock-in might also benefit the organization in the long run.