Thursday, September 7, 2023

 

Access control is notorious in IaC for quickly growing out of bounds and becoming unstable. Between declarative templates and scripts, role assignments multiply much faster than the number of related resources.

A role assignment consists, at a minimum, of a role, an assignee, and a scope. The role can be either the fully qualified identifier of its definition or just its name. The assignee, on the other hand, can be one of several types such as a user, group, ServicePrincipal, or ForeignGroup; with each of these types, the role assignment must specify the principal identifier, which is a GUID. Alternatively, the type and the identifier can be replaced with a name for the assignee. A scope must be specified, and it is preferable to give the entire resource id. A role assignment therefore seems simple, with values passed for three parameters, but the string of cryptic errors encountered in its usage makes it brittle.
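As an illustration, and not tied to any particular pipeline, the three required elements line up with one common form of the Azure CLI command for creating a role assignment. The sketch below in Java merely assembles the argument list; the flag names and values are shown for orientation rather than as a definitive recipe.

import java.util.List;

public class RoleAssignmentArgs {
    // Assembles the arguments for "az role assignment create" from the three required elements.
    public static List<String> build(String role, String principalId, String principalType, String scope) {
        return List.of("az", "role", "assignment", "create",
                "--role", role,                             // role name or fully qualified definition id
                "--assignee-object-id", principalId,        // the principal identifier, a GUID
                "--assignee-principal-type", principalType, // User, Group, ServicePrincipal, ForeignGroup
                "--scope", scope);                          // preferably the entire resource id
    }
}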

For example, one of the ways the assignee is obtained during IaC is by provisioning the resource with a system-assigned managed identity and subsequently retrieving its identifier by querying the provisioned resource. A simple variable substitution is then attempted in the role assignment, and it fails with the message that the principal id provided as the assignee must be a valid GUID. Manually inspecting the resource to make sure the object id, and not the app id, was used is required. But the source of the error might be something as inconspicuous as a double quote on either end, added as part of the query results, that escapes one’s attention. Stripping the quotes around the principal id is required for the role assignment command to recognize it.
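A minimal sketch of the fix, assuming the principal id was captured as raw text from a query against the provisioned resource; the helper name is hypothetical. It strips the surrounding quotes that query output often carries and confirms the value parses as a GUID before it is handed to the role assignment command.

import java.util.UUID;

public class PrincipalIdNormalizer {
    // Query output frequently wraps the identifier in double quotes, and the
    // role assignment command then rejects the quoted value as an invalid GUID.
    public static String normalize(String raw) {
        String id = raw.trim();
        if (id.length() > 1 && id.startsWith("\"") && id.endsWith("\"")) {
            id = id.substring(1, id.length() - 1);
        }
        UUID.fromString(id); // throws IllegalArgumentException if the value is not a GUID at all
        return id;
    }
}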

Similarly, the scope is often constructed from its elements, but it is best treated as opaque and taken directly from the source. Any kind of parsing or reconstruction is prone to scripting and grammar errors.

More importantly, the number of role assignments can be reduced by targeting a higher scope, but the diligence required to group and organize role assignments so that they can be avoided or replaced with higher-level or more specific assignments can be daunting. It is precisely because proper organization is overlooked by multiple participants that technical debt is incurred, and as role assignments proliferate, this debt comes back to haunt quickly.

Finally, role assignments and network rules are hard to debug when they go missing, and it is in the best interest of the code maintainers to specify these associations right at the time of creation. The symptoms of missing rules and assignments are not only difficult to diagnose but also tend to work their way backwards from the customers and end users. A properly applied role assignment might still return the dreaded 403 Forbidden HTTP status code and message when the root cause is merely cross-network permissions that went missing when the resources were created.

Authentication, authorization, and auditing are the final proof of which declarations work and which do not. One must remove the unnecessary assignments just as much as the incorrect ones.

A special mention must be made of IaC state, because state stands apart from both the code, which is the cause, and the resources, which are the effect. Changes must be propagated in both directions: forward from the code, writing through the state to apply the changes to the resources, and backward from modified resources, importing them into the state and updating the IaC, so that the code, the state, and the resources stay in sync. The changes made to keep all three in sync are often spread out over time and distributed among authors, which becomes a source of errors and discrepancies. Establishing a baseline combination of state, IaC, and corresponding resources is necessary before making incremental changes, and it is just as important to keep them in sync going forward. The best way to do this is to close the gap by enumerating all discrepancies to establish a baseline, and then have the process and practice in place to enforce that they do not get out of sync.
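To illustrate what enumerating the discrepancies for a baseline can look like, the sketch below compares flat name-to-value maps standing in for the declared view (IaC and state) and the live view (portal); real resources have nested properties, so this is a simplification, and the method name is made up for the example.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DriftBaseline {
    // Returns a human-readable list of discrepancies between the declared and live views.
    public static List<String> findDiscrepancies(Map<String, String> declared, Map<String, String> live) {
        List<String> discrepancies = new ArrayList<>();
        for (var entry : declared.entrySet()) {
            String liveValue = live.get(entry.getKey());
            if (liveValue == null) {
                discrepancies.add("declared but missing in the portal: " + entry.getKey());
            } else if (!liveValue.equals(entry.getValue())) {
                discrepancies.add("drifted from the declaration: " + entry.getKey());
            }
        }
        for (String name : live.keySet()) {
            if (!declared.containsKey(name)) {
                discrepancies.add("present in the portal but not in the IaC: " + name);
            }
        }
        return discrepancies;
    }
}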



Tuesday, September 5, 2023

 

Improve workloads and solution deployments:

Industry solutions that are newly implemented benefit from a set of principles that provide prescriptive guidance for improving the quality of their deployments. As the industry moves from digital adoption to digital transformation to digital acceleration, the sustainability journey requires a strong digital foundation, which is the best preparation for keeping pace with this rapid change.

This is true for meeting new sustainability requirements and avoiding the worst impacts of climate change, as well as for other business priorities such as driving growth, adapting to industry shifts, and navigating energy consumption and economic conditions. A strong foundation helps to track and manage data at scale, unifying data and improving visibility across the organization. That, in turn, helps to reliably report sustainability impact, drive meaningful progress, and find the gaps where the most impact can be delivered.

The well-architected framework consists of five pillars: reliability (REL), security (SEC), cost optimization (COST), operational excellence (OPS), and performance efficiency (PERF). The elements that support these pillars are a review, a cost and optimization advisor, documentation, patterns, support and service offers, reference architectures, and design principles.

This guidance provides a summary of how these principles apply to the management of the data workloads.
 
Cost optimization is one of the primary benefits of using the right tool for the right solution. It helps to analyze the spend over time as well as the effects of scaling out and scaling up. An advisor can help improve reusability, enable on-demand scaling, and reduce data duplication, among many other improvements.

Performance is usually based on external factors and is remarkably close to customer satisfaction. Continuous telemetry and reactiveness are essential to tuning performance. The shared environment controls for management and monitoring create alerts, dashboards, and notifications specific to the performance of the workload. Performance considerations include storage and compute abstractions, dynamic scaling, partitioning, storage pruning, enhanced drivers, and multilayer caching.

Operational excellence comes with security and reliability. Security and data management must be built into the system at every layer, for every application and workload. The data management and analytics scenario focuses on establishing a foundation for security. Although workload-specific solutions might be required, the foundation for security is built with the Azure landing zones and managed independently from the workload. Confidentiality and integrity of data, including privilege management, data privacy, and appropriate controls, must be ensured. Network isolation and end-to-end encryption must be implemented. SSO, MFA, conditional access, and managed service identities are involved in securing authentication. Separation of concerns between the Azure control plane and the data plane, as well as role-based access control (RBAC), must be used.

The key considerations for reliability are how to detect change and how quickly the operations can be resumed. The existing environment should also include auditing, monitoring, alerting and a notification framework.

In addition to all the above, some consideration may be given to improving individual service level agreements, redundancy of workload specific architecture, and processes for monitoring and notification beyond what is provided by the cloud operations teams.

Each pillar contains questions whose answers relate to technical and organizational decisions that are not directly about the features of the software to be deployed. For example, an application that allows people to post comments must honor use cases where some people can write and others can read, but the system must also be safe enough to handle all the traffic and should incur reasonable costs.

Since the most crucial pillars are OPS and SEC, they should never be traded away to get more out of the other pillars.

The security pillar consists of identity and access management, detective controls, infrastructure protection, data protection, and incident response. Three questions are routinely asked for this pillar: How is access controlled for the serverless API? How are the security boundaries managed for the serverless application? How is application security implemented for the workload?

The operational excellence pillar is made up of four parts: organization, preparation, operation, and evolution. The questions that drive the decisions for this pillar include: How is the health of the serverless application known? How is the application lifecycle management approached?

The reliability pillar is made of three parts: foundations, change management, and failure management. The questions asked for this pillar include: How are the inbound request rates regulated? How is resiliency built into the serverless application?

The cost optimization pillar consists of five parts: cloud fiscal management practice, expenditure and usage awareness, cost-effective resources, demand management and resource supply, and optimization over time. The questions asked for cost optimization include: How are the costs optimized?

The performance efficiency pillar is composed of four parts: selection, review, monitoring, and tradeoffs. The questions asked for this pillar include: How is the performance optimized for the serverless application?

In addition to these questions, there are quite a few opinionated and even authoritative perspectives on the appropriateness of a framework, and these are often referred to as lenses. With these forms of guidance, a well-architected framework moves closer to an optimized realization.

 

 

 

Monday, September 4, 2023

 

Azure Managed Instance for Apache Cassandra provides a managed deployment of Apache Cassandra, the open-source NoSQL distributed database that is trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. This is a distributed database environment, but the data can be replicated to other environments, including Azure Cosmos DB for use with the Cassandra API.

The Database Migration Assistant has a preview feature to help with this database migration. The Azure Cosmos DB Cassandra connector helps with live data migration from existing native Apache Cassandra workloads running on-premises or in the Azure public cloud to Azure Cosmos DB with zero application downtime. It does this with the help of a replication agent that moves data from Apache Cassandra to Cosmos DB. The replication agent is a Java process that runs on the native Cassandra host(s) and uploads data from Cassandra via a managed pipeline. Customers need only download the agent onto the source Cassandra nodes and configure the target Azure Cosmos DB Cassandra API account information.

The replication agent runs on the native Cassandra cluster. Once it is installed, it takes a snapshot of the cluster and uploads the requisite files. After the initial snapshot, continuous ingestion commences in the following manner. First, the agent connects to the replication metadata endpoint of the Cosmos DB Cassandra API account and fetches replication component information. Then it sends the commit logs to the replication component. Finally, mutations are replicated to the Cosmos DB Cassandra endpoint by the replication component.

Customers can begin using the data in the Azure Cosmos DB Cassandra API account by first verifying the supported Cassandra features in the documentation and estimating the request units required. This can be calculated even at the granularity of each operation, which helps with planning.

The benefits of this data migration from native Cassandra clusters to a Cosmos DB Cassandra API account include no downtime, no code changes, and no manual data migration. The configuration is simple, and the replication is fast. It is also completely transparent to Cassandra and to the other workloads on the cluster.

The Cosmos DB Cassandra API account normalizes the cost of all database operations using Request Units. This is a performance currency that abstracts the system resources, such as CPU, IOPS, and memory, required to perform the database operations, and it helps with cost estimation in dollars by virtue of a unit price.
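As a back-of-the-envelope illustration of that estimation, the sketch below converts a provisioned throughput into a monthly dollar figure; the unit price used here is a placeholder assumption, not the current Azure rate, and should be replaced with the published price.

public class RequestUnitCostEstimate {
    public static void main(String[] args) {
        double requestUnitsPerSecond = 400.0;   // provisioned throughput for the account
        double hoursPerMonth = 730.0;           // average hours in a month
        double pricePer100RUsPerHour = 0.008;   // placeholder unit price per 100 RU/s per hour
        double monthlyCost = (requestUnitsPerSecond / 100.0) * pricePer100RUsPerHour * hoursPerMonth;
        System.out.printf("Estimated monthly cost: $%.2f%n", monthlyCost); // about $23.36 with these inputs
    }
}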

Sunday, September 3, 2023

Applications and APIs for insurance administration and payment analytics:

 


As with the broad industry trend across rapid application development scenarios, microservices and single-page applications are abundant across healthcare administration and analytics business purposes. The promise of microservices is the separation of concerns among business purposes, with deep isolation when necessary, including of the data stores. They are also independently testable and provide a medium for continuous deliverables to stakeholders. The promise of single-page applications is the simplicity of describing modular components within the web pages and their reusability across workflows. Together they empower a variety of scenarios spanning the spectrum from compute-intensive to data-intensive capabilities.

We leave the infrastructure provisioning and the associated operational services, such as logging, registry, and monitoring, out of this discussion and focus instead on the development of the applications and API services. Although the choice of infrastructure and the development of the application are not completely divorced from one another and must have mutual considerations, it suffices to say that the boundary between business capability and application development is customer facing, while the boundary between infrastructure provisioning and application development is backend facing.

Among the several aspects of application development, dedicated data services such as catalog or inventory can be separated from the rest of the capabilities such as claim analytics, COB rules, and commercial lines of business. The API services, by virtue of their number, often lack the consistency and framework discipline that the infrastructure demands, since they are developed in house by the respective business divisions. They also become bloated as divisions tend to take on solutions to common problems that do not necessarily align with their business capabilities.

The same can be said about the components in the single-page applications. Many applications rediscover the same browsing, filtering, and editing capabilities that do not necessarily pertain to a line of business. This leads the applications to develop a common repository of reusable modules that becomes more of a limitation than a facilitator of consistency and capability. If attributes are left out of the common definitions and a derived instance cannot add them, it can no longer use the common definitions and must write its own from scratch.

The single-page applications essentially display tabular data. They are not data-entry intensive, nor do they require complex long-running calculations. This keeps the user workflows short in duration but more interactive. Some of the workflows are read-only operations, often checking on a status or on model predictions that run independently. This implies that the analytical queries and logic are also kept external to the applications and sometimes external to the API. Queries are dedicated to a business purpose and often require little or no grouping. This leads to a different set of requirements on the analytics and reporting side than on the application and processing side.

Finally, the applications require modernization just as much as the legacy platforms do. For example, the dominant statistical platform has been SAS, and it is now widely being replaced by Python and R packages.

#codingexercise 

public static int[] canonballsIterative(int[] A, int[] B) {
    // A holds the terrain heights; each cannonball of height h in B flies until it
    // meets the first position whose height is at least h and then lands in the
    // cell just before it, raising that cell by one.
    for (int j = 0; j < B.length; j++) {
        int h = B[j];
        for (int i = 0; i < A.length; i++) {
            if (A[i] >= h) {
                if (i == 0) { break; } // blocked at the very first cell, so the ball is lost
                A[i - 1] += 1;
                System.out.println("h=" + h + " i=" + i + " A: " + java.util.Arrays.toString(A));
                break;
            }
        }
    }
    return A;
}

Saturday, September 2, 2023

As with any digital asset, Infrastructure-as-Code requires the same level of monitoring as the other resources in the public cloud. Changes to the resources are as important to know about beforehand as after the fact. Consequently, subscriptions and notifications play an important role in the pipelines that deploy the infrastructure.

There are several ways to setup alerts and notifications, and they mostly have to do with the path rather than the content.

The first method is to send out notifications from the pipeline as the code is compiled and executed. There are ways to do this from the repository, with the help of GitHub Actions or the repository settings. The latter is used to send out notifications in the form of emails by merely specifying the email addresses. The former is used for more involved notifications, such as making HTTP POST requests to webhook URLs, as in the case of posting a message in a Teams channel. Either way, the payload for commit notifications includes information such as the name of the repository, the branch a commit was made in, the SHA1 of the commit and a link to its diff on GitHub, the author of the commit, the date when the commit was made, the files that were changed as part of the commit, and the commit message. Notifications can also be expanded to include a conversation in a specific issue, pull request or gist, all activity in a repository, CI activity such as the status of workflows with GitHub Actions, and repository issues, pull requests, releases, security alerts, or discussions if enabled. Notification via a Teams channel requires a step in the GitHub Actions workflow and the MS_Teams_WebHook_URI for the dedicated Microsoft Teams channel. The webhook URI is saved as a secret in the GitHub repository’s settings. The step itself is executed only on the events specified, and these can include a wide variety, with the pull_request, push, and deployment events being the most common ones. The workflow typically checks out the repository with actions/checkout@v2 and then invokes a Teams notification action, and the job requires parameters such as the operating system of the runner, say ubuntu-latest, a GitHub token that is used for reading the repository, the webhook URI read from the secrets, the notification summary, color, and timezone. Emoji support isn’t great for incoming webhooks on Microsoft Teams yet, but it can be approximated through HEX codes.
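For illustration, the sketch below shows the kind of HTTP POST such a notification step performs against a Teams incoming webhook; it assumes the webhook URI is exposed to the job as the environment variable MS_TEAMS_WEBHOOK_URI, and the commit details in the payload are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TeamsCommitNotification {
    public static void main(String[] args) throws Exception {
        // Read the webhook URI from the environment, mirroring how it is kept as a repository secret.
        String webhookUri = System.getenv("MS_TEAMS_WEBHOOK_URI");
        // Minimal payload accepted by a Teams incoming webhook; richer cards add color, sections, and facts.
        String payload = "{\"text\": \"Commit abc123 pushed to main by octocat\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(webhookUri))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Webhook responded with HTTP " + response.statusCode());
    }
}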

The Microsoft Teams channel on its end can have a GitHub application added or a bot created to display the messages. The webhook URL must be added, configured, and saved. Channels that can add a GitHub application have the option of sending canned commands to help set this up end to end. For example, the command subscribe owner/repo workflows:{name:"your workflow name" event:"workflow event" branch:"branch name" actor:"actor name"} will filter the notifications to the values passed for those parameters.

These are some of the ways that alerts and notifications can be set up for IaC.


Friday, September 1, 2023

 

The Hidden Factor:

Introduction: Authors of software CI/CD pipelines often miss a critical component when it comes to automating IaC deployments. This factor, called state, is declared, easy to locate, and even well documented, but its role in traditional code pipelines often escapes attention.

The trio of portal, state, and IaC must be kept in sync; otherwise, one of the most perplexing errors appears: changes pushed through the pipeline break unrelated resources.

This article suggests how these three components must be maintained.

 

Priority:

1.      Keep the IaC and state in sync with the portal without touching resources.

2.      The pipeline must not show conflicts for unrelated changes; edit the state when it does.

3.      Follow up on any state edits with changes to the IaC for the resources impacted.

Severity:

1.      Maintain associations when adding subnets or virtual networks, and allow access to related resources.

2.      When version increases occur, include them in the portal, the state, and the code.

Best Practice

1.      Add optional attributes to the IaC.

2.      Ensure unrelated changes do not run into conflicts.

3.      Follow up on any state edits, such as a version bump or an increased count, with matching changes to the IaC.

4.      Keep the plan and apply stages showing similar or no conflicts.

Process:

1.      Forward write-through –

a.       Create new resources – complete all associations.

b.      Introduce the state of the new resources.

c.       Create the resources in the portal.

d.      Indicate blockers or announce your changes, when important.

2.      Backward-propagate changes from the portal

a.        Capture the changes in state

b.       Capture the changes in IaC

c.       Go through step 1 to check that it is a no-op

3.      Establish baseline and make incremental updates where after each update all three are in sync

4.      Add enforcements, detect changes, and send notifications when things change

 

 

Finally, the changes made to keep all three in sync are often spread out over time and distributed among authors, which becomes a source of errors and discrepancies. Establishing a baseline combination of state, IaC, and corresponding resources is necessary to make incremental changes. It is also important to keep them in sync going forward. The best way to do this is to close the gap by enumerating all discrepancies to establish a baseline, and then have the process and practice in place to enforce that they do not get out of sync.