Sunday, September 3, 2023

Applications and APIs for Insurance Administration and Payment Analytics:

 


In line with the broad industry trend toward rapid application development, microservices and single-page applications are abundant across healthcare administration and analytics business functions. The promise of microservices is the separation of concerns among business functions, with deep isolation when necessary, including of the data stores. They are also independently testable and provide a medium for continuous delivery to stakeholders. The promise of single-page applications is the simplicity of describing modular components within web pages and their reusability across workflows. Together they empower a variety of scenarios spanning the spectrum from compute-intensive to data-intensive capabilities.

We leave the infrastructure provisioning and the associated operational services such as logging, registry, and monitoring out of this discussion and focus instead on the development of applications and API services. Although the choice of infrastructure and the development of the application are not completely divorced from one another and must take each other into consideration, it suffices to say that the boundary between business capability and application development is customer facing, while the boundary between infrastructure provisioning and application development is backend facing.

Among the several aspects of application development, dedicated data services such as catalog or inventory can be separated from the rest of the capabilities such as claim analytics, COB rules, and commercial lines of business. The API services, by virtue of their number, often lack the consistency and framework discipline that the infrastructure demands, since they are developed in house by those business divisions. They also become bloated as divisions tend to take on solutions to common problems that do not align cleanly with their business capabilities.

The same can be said about the components in the single-page applications. Many applications rediscover the same browsing, filtering, and editing capabilities that do not necessarily pertain to a line of business. This leads the applications to develop a common repository for reusable modules, which becomes more of a limitation than a facilitator of consistency and capability. If attributes are left out of the common definitions and the derived instances cannot add them, those instances can no longer use the common definitions and must be written from scratch.

The single-page applications essentially display tabular data. They are not data-entry intensive, nor do they require complex long-running calculations. This keeps user workflows short in duration but interactive. Some of the workflows are read-only operations, often checking on status or on model predictions that run independently. This implies that the analytical queries and logic are kept external to the applications and sometimes external to the API. Queries are also dedicated to a single business purpose and often require little or no grouping. This leads to a different set of requirements on the analytics and reporting side than on the application and processing side.

Finally, the applications require modernization as much as the legacy platforms do. For example, the dominant statistical platform has long been SAS, and it is now increasingly being replaced by Python and R packages.

#codingexercise 

public static int[] canonballsIterative(int[] A, int[] B) {
        // A[i] is the height of wall i; B[j] is the height at which cannonball j flies.
        // Each ball drops just in front of the first wall that is at least as tall as it,
        // raising that spot by one; a ball stopped by the very first wall falls out of range.
        for (int j = 0; j < B.length; j++) {
            int h = B[j];
            for (int i = 0; i < A.length; i++) {
                if (A[i] >= h) {
                    if (i == 0) { break; }   // no slot in front of the first wall
                    A[i - 1] += 1;           // the ball lands just before wall i
                    System.out.println("h=" + h + " i=" + i + " A: " + java.util.Arrays.toString(A));
                    break;
                }
            }
        }
        return A;
    }

Saturday, September 2, 2023

As with any digital asset, Infrastructure-as-Code requires the same level of monitoring as the resources it deploys in the public cloud. Changes to those resources are as important to know about beforehand as after the fact. Consequently, subscriptions and notifications play an important role in the pipelines that deploy the infrastructure.

There are several ways to set up alerts and notifications, and they mostly differ in the delivery path rather than the content.

The first method is to send out notifications from the pipeline as the code is compiled and executed. This can be done from the repository itself, either with GitHub Actions or through the repository settings. The latter sends notifications as emails by merely specifying the email addresses. The former supports more involved notifications, such as making HTTP POST requests to webhook URLs, as in the case of posting a message to a Teams channel. Either way, the payload for commit notifications includes information such as the name of the repository, the branch a commit was made in, the SHA1 of the commit and its link to the diff in GitHub, the author of the commit, the date when the commit was made, the files that were changed as part of the commit, and the commit message.

Notifications can also be expanded to include a conversation in a specific issue, pull request, or gist; all activity in a repository; CI activity such as the status of workflows in repositories with GitHub Actions; and repository issues, pull requests, releases, security alerts, or discussions if enabled.

The notification via a Teams channel requires a step in the GitHub Actions workflow and the MS_Teams_WebHook_URI for the dedicated Microsoft Teams channel. The webhook URI is saved as a secret in the GitHub repository's settings. The step is executed only on the events specified, and these can include a wide variety, with pull_request, push, and deployment as the most common ones. The workflow typically checks out the repository with actions/checkout@v2 and then runs a notification step on a runner such as ubuntu-latest, supplying a GitHub token for reading the repository, the webhook URI read from the secrets, and the notification summary, color, and timezone. Emoji support isn't great for incoming webhooks on Microsoft Teams yet, but it can be worked around with hex codes.
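For illustration, here is a minimal sketch of that HTTP POST path, assuming the secret is surfaced to the job as an environment variable named MS_Teams_WebHook_URI and using a simple text payload; the repository and commit values shown are placeholders rather than the exact fields GitHub sends.

import json
import os
import urllib.request

def notify_teams(repo, branch, sha, author, message):
    # The incoming-webhook URI comes from the repository secret surfaced as an
    # environment variable; Teams incoming webhooks accept a JSON body.
    webhook_uri = os.environ["MS_Teams_WebHook_URI"]
    payload = {"text": f"Commit {sha[:7]} by {author} on {repo}@{branch}: {message}"}
    request = urllib.request.Request(
        webhook_uri,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # 200 indicates the message was accepted

notify_teams("owner/repo", "main", "0123abcdef", "octocat", "fix: pipeline alert")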

The Microsoft Teams channel, on its end, can have a GitHub application added or a bot created to display the messages. The webhook URL must be added, configured, and saved. Channels that add the GitHub application can also use canned commands to help set this up end to end. For example, subscribe owner/repo workflows:{name:"your workflow name" event:"workflow event" branch:"branch name" actor:"actor name"} filters the notifications down to the passed-in values of those parameters.

These are some of the ways alerts and notifications can be set up for IaC.


Friday, September 1, 2023

 

The Hidden Factor:

Introduction: Authors of software CI/CD pipelines often overlook a critical component when automating IaC deployments. This factor, called state, is declared, easy to locate, and even well documented, but its role in traditional code pipelines often escapes attention.

The trio of portal, state, and IaC must be kept in sync; otherwise, one of the most perplexing errors that appears is that changes pushed through the pipeline break unrelated resources.

This article suggests how these three components must be maintained.

 

Priority:

1. Keep the IaC and state in sync with the portal without touching resources.

2. The pipeline must not show conflicts for unrelated changes; if it does, edit the state.

3. Follow up on any state edits with changes to the IaC for the resources impacted.

Severity:

1. Maintain associations when adding subnets or virtual networks, and allow access to related resources.

2. When version increases occur, include them in the portal, the state, and the code.

Best Practice:

1. Add optional attributes to the IaC.

2. Ensure unrelated changes do not run into conflicts.

3. Follow up on any state edits, such as a version bump or an increased count, with matching IaC changes.

4. Keep the plan and apply stages showing the same conflicts, or none.

Process:

1. Forward write-through –

a. Create new resources and complete all associations.

b. Introduce the state of the new resources.

c. Create the resources in the portal.

d. Indicate blockers or announce your changes when important.

2. Backward propagate changes from the portal –

a. Capture the changes in the state.

b. Capture the changes in the IaC.

c. Go through step 1 to check that it is a no-op.

3. Establish a baseline and make incremental updates, where after each update all three are in sync.

4. Add enforcements, detect changes, and send notifications when things change (see the sketch after this list).
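A minimal sketch of the enforcement in step 4, assuming the Terraform CLI is on the PATH and the working directory holds the IaC with its backend already configured: terraform plan -detailed-exitcode exits with 0 when portal, state, and IaC agree and with 2 when there is drift to report.

import subprocess
import sys

def detect_drift(workdir="."):
    # Refresh providers and backend, then ask Terraform whether a plan would change anything.
    subprocess.run(["terraform", "init", "-input=false"], cwd=workdir, check=True)
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    if result.returncode == 2:
        print(result.stdout)      # drift detected: hand this off to the notification path
        return True
    if result.returncode != 0:
        sys.exit("terraform plan failed: " + result.stderr)
    return False                  # portal, state, and IaC are in sync

if detect_drift():
    print("State, IaC, and portal are out of sync; follow the process above.")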

 

 

Finally, the changes being made to keep all three in sync were often spread out over time and distributed among authors, which led to errors and discrepancies. Establishing a baseline combination of state, IaC, and corresponding resources is necessary to make incremental changes. It is also important to keep them in sync going forward. The best way to do this is to close the gap by enumerating all discrepancies to establish a baseline and then have the process and the practice to enforce that they do not get out of sync.

 

Thursday, August 31, 2023

 

Recently, I came across a situation where CI/CD pipelines were making unintended changes to resources in the Azure public cloud. The IaC was written in Terraform and the resource provider was Azure. The symptom was that when a code change was pushed through the pipeline, settings on unrelated resources would fall off. This impacted the uptime of those resources, and business continuity suffered until those settings were restored. In addition, it was getting hard to tell which resources were going to be affected, since the author of a change often had nothing to do with the resources that broke. The team responsible for the IaC is referred to here as the infrastructure team.

There is also some context to this situation that came before these symptoms manifested. First, the subscription where these resources were impacted had long been a shared subscription and one of the first to be tried out. Consequently, there were proofs of concept, multiple versions, and many stakeholders, some even with contributor access to update their specific resources. The sheer number of resource groups, subnets, and virtual networks had grown quite large and, in a few cases, neglected. The resources most affected were the app services, and it just so happened that the application engineering team had started requiring changes more often than ever before for an improvement they owned.

One specific example of changes that accumulated in the portal was virtual network integration for these resources; whenever these settings fell off, connectivity was disrupted, resulting in some downtime. While this applied to the outbound traffic from the resource, similar discrepancies were noticed on the inbound side, where access restrictions were lost on the resource. Since the inbound and outbound traffic settings were maintained by the infrastructure team, they were supposed to be captured in the IaC. Some of these definitions did appear in the IaC, but on closer inspection they turned out to be improper or even to be failing enforcement. Other settings were specific to that resource only and closely linked to the code or container image being deployed to it; the application engineering team managed these.

Another source of errors was attributed to the Terraform state. Irrespective of the resources in the portal or their definitions in the IaC, the state was maintained and even updated without corresponding changes elsewhere. This was done to overcome the conflicts that were found during the compile or the execution of the IaC but it resulted in other sets of conflicts found when the pipeline ran. Consequently, resources were even destroyed during the execution of the pipeline. It is not wrong to edit the state file, but it is usually done to keep it in sync with both the portal and the IaC. Keeping it in sync with the portal first and then backpropagating the changes to the IaC is one direction of the edits. The other direction is to write through the state with the changes in the IaC and then push it to the resources in the portal. Both Non-prod and prod resources must have their own sets of IaC, state and actual resources and must also be kept separate.
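As a rough sketch of the first of these directions (capturing a portal-created resource in the state and then in the IaC), assuming the Terraform CLI and an azurerm-based configuration; the subnet address and resource ID below are hypothetical placeholders, not values from the incident.

import subprocess

def backpropagate(address, azure_resource_id, workdir="."):
    # terraform import expects a matching (possibly skeletal) resource block in the
    # IaC, so declare the address first, then bring the portal-created resource
    # under Terraform's management.
    subprocess.run(["terraform", "import", address, azure_resource_id],
                   cwd=workdir, check=True)
    # Fill in the block's attributes until the plan is a no-op (exit code 0);
    # check=True raises if drift remains (exit code 2) or the plan errors out (1).
    subprocess.run(["terraform", "plan", "-detailed-exitcode"], cwd=workdir, check=True)

# Hypothetical example values; the real resource ID comes from the portal.
backpropagate(
    "azurerm_subnet.app_integration",
    "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>",
)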

Lastly, the changes being made to keep all three in sync were often spread out over time and distributed among authors, which led to errors and discrepancies. Establishing a baseline combination of state, IaC, and corresponding resources is necessary to make incremental changes. It is also important to keep them in sync going forward. The best way to do this is to close the gap by enumerating all discrepancies to establish a baseline and then have the process and the practice to enforce that they do not get out of sync.
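As one way to enumerate such discrepancies for a baseline, the addresses Terraform tracks in its state can be compared with the resource addresses declared in the IaC. A rough sketch follows, under the assumption that the configuration uses plain resource blocks in top-level .tf files (no modules, count, or for_each) and that the Terraform CLI is available.

import re
import subprocess
from pathlib import Path

def baseline_discrepancies(workdir="."):
    # Addresses Terraform currently tracks in the state.
    state = subprocess.run(["terraform", "state", "list"], cwd=workdir,
                           capture_output=True, text=True, check=True)
    in_state = set(state.stdout.split())
    # Addresses declared in the .tf files, e.g. azurerm_subnet.example.
    declared = set()
    for tf_file in Path(workdir).glob("*.tf"):
        for rtype, rname in re.findall(r'resource\s+"([^"]+)"\s+"([^"]+)"', tf_file.read_text()):
            declared.add(f"{rtype}.{rname}")
    return sorted(in_state - declared), sorted(declared - in_state)

only_in_state, only_in_code = baseline_discrepancies()
print("Tracked in state but missing from IaC:", only_in_state)
print("Declared in IaC but missing from state:", only_in_code)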

References: Earlier articles on IaC shortcomings and resolutions: IacResolutionsPart21.docx

Wednesday, August 30, 2023

Databricks and Active Directory passthrough authentication.


Azure Databricks is used to process, store, clean, share, analyze, model, and monetize datasets with solutions from Business Intelligence to machine learning. It is used to build and deploy data engineering workflows, machine learning models, analytics dashboards and more.

It connects to different external storage locations, including Azure Data Lake Storage. Users logged in to the Azure Databricks instance can execute Python code and use the Spark platform to view tabular representations of the data stored in various formats on the external storage accounts. When they refer to a file on the external storage account, they need not specify credentials to connect; their logged-in credential can be passed through to the remote storage account. For example: spark.read.format("parquet").load("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")

This feature required two settings:

1. When a compute cluster is created to execute the Python code, the checkbox to pass through credentials must be checked.

2. The cluster must also have the flag spark.databricks.passthrough.adls set to true.

Until recently, the cluster's Spark configuration UI allowed this flag to be set, but the configuration for passthrough changed with the new UI that facilitates Unity Catalog, a unified access control mechanism. Passthrough credentials and Unity Catalog are mutually exclusive. In most cases the flag can no longer be set when creating new clusters with the new UI, and this affected the implicit login required to authenticate the current user to the remote storage. The token provider used earlier was spark.databricks.passthrough.adls.gen2.tokenProviderClassName, and with the new UI the login requires more elaborate configuration. The error code encountered by users when using the earlier clusters with the newer version of the Databricks UI is 403.

The newer configuration is the following:

spark.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>

spark.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth

spark.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider

spark.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/yoursecretscope/yoursecretname}}

spark.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net https://login.microsoftonline.com/<tenant>/oauth2/token

This requires a secret to be created, but that is possible via the https://<databricks-instance>#secrets/createScope URL. The value used would be {{secrets/yoursecretscope/yoursecretname}}.
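Alternatively, the equivalent settings can be applied per notebook session with spark.conf.set, reading the client secret from the secret scope via dbutils. The sketch below assumes it runs in a Databricks notebook, where spark and dbutils are predefined; the storage account, tenant, client id, scope, and secret names are placeholders matching the listing above.

# Session-level alternative to the cluster Spark config above (placeholders as in the listing).
service_credential = dbutils.secrets.get(scope="yoursecretscope", key="yoursecretname")

storage_account = "<datalake>"
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<sp client id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", service_credential)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant>/oauth2/token")

# The original read should then succeed without the passthrough flag:
df = spark.read.format("parquet").load(
    "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")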

Finally, the 403 error also requires that the networking be checked. If the Databricks workspace and the storage account are in different virtual networks, the storage account's network rules must allow-list both the private and public subnets of the Databricks instance.


Tuesday, August 29, 2023

 

Azure Managed Instance for Apache Cassandra provides managed Apache Cassandra, an open-source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data. This is a distributed database environment, but the data can be replicated to other environments, including Azure Cosmos DB for use with the Cassandra API.

The Database Migration Assistant has a preview feature to help with this database migration. The Azure Cosmos DB Cassandra connector helps with the live migration of data from existing native Apache Cassandra workloads running on-premises or in the Azure public cloud to Azure Cosmos DB with zero application downtime. It does this with the help of a replication agent that moves data from Apache Cassandra to Cosmos DB. The replication agent is a Java process that runs on the native Cassandra host(s) and uploads data from Cassandra via a managed pipeline. Customers need only download the agent on the source Cassandra nodes and configure the target Azure Cosmos DB Cassandra API account information.

The replication agent runs on the native Cassandra cluster. Once it is installed, it takes a snapshot of the cluster and uploads the requisite files. After the initial snapshot, continuous ingestion commences in the following manner. First, the agent connects to the replication metadata endpoint of the Cosmos DB Cassandra account and fetches replication component information. Then it sends the commit logs to the replication component. Finally, mutations are replicated to the Cosmos DB Cassandra endpoint by the replication component.

Customers can begin using the data in the Azure Cosmos DB Cassandra API account by first verifying the supported Cassandra features and estimating the request units required. This can be calculated even at the granularity of each operation, which helps with the planning.

The benefits of this data migration from native Cassandra clusters to a Cosmos DB Cassandra API account include no downtime, no code changes, and no manual data migration. The configuration is simple and the replication is fast. It is also completely transparent to Cassandra and to the other workloads on the cluster.

The Cosmos DB Cassandra API account normalizes the cost of all database operations using Request Units. This is a performance currency abstracting the system resources, such as CPU, IOPS, and memory, that are required to perform the database operations, and it helps with cost estimation in dollars by virtue of a unit price.
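As a rough illustration of that estimation, assuming a purely illustrative unit price of $0.008 per 100 RU/s per hour (actual rates vary by region and offer), a provisioned throughput figure translates into a monthly dollar cost as follows.

# Illustrative Request Unit cost estimate; the rate below is an assumed example
# price, not an official quote.
HOURS_PER_MONTH = 730

def monthly_cost_usd(provisioned_ru_per_sec, rate_per_100_ru_per_hour=0.008):
    blocks_of_100 = provisioned_ru_per_sec / 100          # billing granularity
    return blocks_of_100 * rate_per_100_ru_per_hour * HOURS_PER_MONTH

# Example: a workload sized at 1,000 RU/s from per-operation estimates
# (say, 200 operations/sec at ~5 RU each) comes to about $58.40 per month.
print(f"${monthly_cost_usd(1000):.2f}")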

Reference: This article is a continuation of articles on Azure Resources with the last one describing Cassandra Configuration: CassandraConnectivity-2.docx