Saturday, February 11, 2023

A conjecture about Artificial Intelligence:

ChatGPT is increasingly in the news for its ability to mimic a human conversationalist and for its versatility. The GPT-3 family it builds on introduces a significant improvement over sequential encoding by making learning parallelizable. This article wonders whether the encoded state can be considered to be in summation form, so that as the text continues to be encoded, the overall state is continuously accumulated in a streaming manner. But first, a general introduction to the subject.
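
To make the conjecture concrete, here is a minimal sketch, in Python with NumPy, of what a summation-form state could look like: each new token's embedding is added into a running state vector, so the overall state accumulates in a streaming manner. The vocabulary, embedding table, and dimensions are invented for illustration; this is not a description of how ChatGPT actually maintains state.

```python
import numpy as np

# Hypothetical illustration of a summation-form (streaming) state.
# The vocabulary, embedding table, and dimensions are made up.
rng = np.random.default_rng(0)
vocab = {"the": 0, "dog": 1, "chased": 2, "cat": 3}
embed = rng.normal(size=(len(vocab), 8))   # token embeddings, d = 8

state = np.zeros(8)                        # overall state starts empty
for token in "the dog chased the cat".split():
    state += embed[vocab[token]]           # accumulate as the text streams in

print(state)  # the running summary after the whole sentence
```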

ChatGPT is based on a type of neural network called a transformer. These are models that can translate text, write poems and op-eds, and even generate code. Newer natural language processing (NLP) models like BERT, GPT-3, or T5 are all based on transformers. Transformers have been incredibly impactful compared to their predecessors, which were also based on neural networks. Just to recap, neural networks are models for analyzing complicated data by discovering hidden layers that represent the latent semantics in the data; these are often referred to as the hidden layers between an input layer of data and an output layer of encodings. Neural networks can handle a variety of data including images, video, audio, and text. There are different types of neural networks optimized for different types of data. If we are analyzing images, we would typically use a convolutional neural network (CNN), so called because it applies convolutional filters to the data; a typical pipeline begins with embedding the original data into a common shared space before it undergoes synthesis, training, and testing. The embedding space is constructed from an initial collection of representative data. For example, it could refer to a collection of 3D shapes drawn from real-world images. The space organizes latent objects between images and shapes, which are complete representations of objects. A CNN thus focuses on the salient, invariant embedded objects rather than the noise.

CNNs worked great for detecting objects in images but did not do as well for language tasks such as summarizing or generating text. The Recurrent Neural Network (RNN) was introduced to address this for language tasks such as translation, where an RNN would take a sentence from the source language and translate it to a destination language one word at a time, generating the translation sequentially. Sequential processing is important because word order matters for the meaning of a sentence, as in “The dog chased the cat” versus “The cat chased the dog”. RNNs had a few problems. They could not handle long sequences of text, like long paragraphs. They were also not fast enough to train on large data because they were sequential. The ability to train on large data is considered a competitive advantage when it comes to NLP tasks because the models become better tuned.
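
To illustrate why sequential processing limits speed, the toy RNN step below (Python with NumPy, with made-up weights and sizes) must consume tokens one at a time because each hidden state depends on the previous one, so the work cannot be parallelized across the sequence.

```python
import numpy as np

# Toy RNN step: h_t depends on h_{t-1}, so tokens must be processed in order.
# Weights and sizes are illustrative only.
rng = np.random.default_rng(1)
d_in, d_hidden = 8, 16
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))

def run_rnn(token_embeddings):
    h = np.zeros(d_hidden)
    for x in token_embeddings:             # inherently sequential loop
        h = np.tanh(W_x @ x + W_h @ h)     # each step waits on the previous one
    return h

sentence = rng.normal(size=(5, d_in))      # five token embeddings
print(run_rnn(sentence))
```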

(to be continued...)

Friday, February 10, 2023

 

The previous post discussed moving enterprise builds and deployments to the cloud for massive repositories. This article explores the other significant challenges that must also be overcome to do so.

Environment provisioning and manual testing of the environment are critical for DevOps processes to gain adoption. As with any DevOps practice, the principles on which they are founded must always include a focus on people, process, and technology. With the help of Infrastructure-as-Code and blueprints, resources, policies, and accesses can be packaged together to become a unit of environment provisioning. Not all such deployments will cater to all the teams. Instead of allowing teams to customize arbitrarily, there must be some T-shirt sizing and pre-defined flavors for the environment. Common examples of flavors are development, stage, and production, while sizes might be small, medium, or large. Roles could be administrator, contributor, and reader, while naming conventions can include well-understood prefixes and/or suffixes. When organizations look for environments to deploy code, they might not plan for all of the resources or the best practices for their configuration. Having well-thought-out environments and pre-defined templates allows them to focus more on their business objectives. Unfortunately, the task of creating environments might be daunting for a centralized, organization-wide IT staff, especially when there are cross-cutting concerns. In such cases, it is better to delegate the choices to the customers and then use a consensus to prepare the environments. When the environment choices and configurations are baked in, they can be made reusable, and a registry could be used to maintain their versions and share them out.
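
As a small illustration of T-shirt sizing and naming conventions, the sketch below encodes hypothetical flavors, sizes, roles, and a naming helper in Python; the specific names and quotas are assumptions rather than any standard.

```python
# Hypothetical environment catalog: flavors, T-shirt sizes, roles, and a naming helper.
FLAVORS = {"dev", "stage", "prod"}
SIZES = {
    "small":  {"vcpus": 2, "memory_gb": 8},
    "medium": {"vcpus": 4, "memory_gb": 16},
    "large":  {"vcpus": 8, "memory_gb": 32},
}
ROLES = {"administrator", "contributor", "reader"}

def environment_name(team: str, flavor: str, size: str) -> str:
    """Build a well-understood name such as 'payments-dev-small'."""
    if flavor not in FLAVORS or size not in SIZES:
        raise ValueError("unknown flavor or size")
    return f"{team}-{flavor}-{size}"

print(environment_name("payments", "dev", "small"))
```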

Container images, source code repositories, and release repositories are all examples of stores whose artifacts can be reused. The public cloud comes with many automation capabilities, such as functions, runbooks, and logic apps, that can help ease the customizations that individual teams need. Creating a pipeline often involves programmability for these teams to leverage, but it is not possible to anticipate all their logic or provide a library of common routines up front. Only as usage increases over time does it become possible to curate scripts.

In build and development, there are always cyclical dependencies encountered at various scopes and levels. These must be broken by forming an initial artifact or logic with which the subsequent cycles can work. The creation of a dependency container, for example, might require the creation of a dependency, which in turn might require the container to package the dependency. This can be broken by creating a first version of the dependency and then allowing subsequent builds to use iterative versions. Cyclical dependencies are a symptom of trouble and must be actively eliminated with more streamlined processes so that no remediation is required.
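
As a sketch of how such cycles can be surfaced before they are broken with a seed version, the Python snippet below runs a depth-first search over a made-up dependency graph and reports the first cycle it finds.

```python
# Detect cycles in a made-up build dependency graph with a depth-first search.
graph = {
    "container": ["dependency"],   # the container packages the dependency
    "dependency": ["container"],   # the dependency needs the container to build
    "app": ["dependency"],
}

def find_cycle(graph):
    visiting, visited = set(), set()
    def dfs(node, path):
        visiting.add(node)
        for nxt in graph.get(node, []):
            if nxt in visiting:                      # back edge means a cycle
                return path[path.index(nxt):] + [nxt]
            if nxt not in visited:
                cycle = dfs(nxt, path + [nxt])
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        return None
    for node in graph:
        if node not in visited:
            cycle = dfs(node, [node])
            if cycle:
                return cycle
    return None

print(find_cycle(graph))  # e.g. ['container', 'dependency', 'container']
```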

The DevOps adoption roadmap has evolved over time. What used to be Feature Driven Development around 1999 gave way to Lean thinking and Lean software development around 2003, which was followed by product development flow in 2009 and Continuous Integration/Delivery in 2010. The DevOps Handbook and the DevOps Adoption Playbook are recent as of the last five to six years. Principles that inform practices that resolve challenges also align accordingly. For example, the elimination of risk happens with automated testing and deployments, which resolves the challenges of manual testing, processes, deployments, and releases.

Thursday, February 9, 2023

Massive Code Repositories, scalable builds and DevOps Processes hosted in the cloud.

As enterprises and organizations survey their applications and assets to be moved to the cloud, one of the often-overlooked processes involves the entrenched and almost boutique build systems that they have invested in over the years. The public clouds advocate the use of cloud-native DevOps pipelines and automations that work well for new repositories and small projects, but when it comes to billion-dollar-plus revenue-generating source code assets, the transition of build and deployment to the cloud becomes surprisingly challenging.

New code projects and businesses can start out with a code repository in GitHub or GitHub Enterprise with files conforming to the 100MB limit and repository sizes conforming to the 5GB limit. When we start clean on cloud-based DevOps, managing the inventory to retain only text in the source and move the binaries to object storage is easy. When enterprises have accrued massive repositories over time, even a copy operation becomes difficult to automate. What used to be a robocopy on Windows involving large payloads must now become a transfer over S3.
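
As a minimal sketch of taking inventory before such a transfer, the Python snippet below walks a working copy and lists files that exceed the 100MB limit mentioned above; the paths and threshold are illustrative.

```python
import os

LIMIT_BYTES = 100 * 1024 * 1024   # the per-file limit mentioned above

def oversized_files(repo_root: str, limit: int = LIMIT_BYTES):
    """Walk a working copy and yield files that should move to object storage."""
    for dirpath, dirnames, filenames in os.walk(repo_root):
        dirnames[:] = [d for d in dirnames if d != ".git"]   # skip git metadata
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                yield path, size

for path, size in oversized_files("."):
    print(f"{size / 2**20:8.1f} MB  {path}")
```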

One of the first challenges in moving build and capacity-planning infrastructure to the cloud is preparing the migration. External dependencies and redundancies can cause these repositories to become very large, not to mention branches and versions. Using a package manager, or its equivalent, to separate out the dependencies into their own packages can help with their reusability. Bundler, Node’s package manager, and Maven are testament to this. Object storage, Artifactory, and their equivalents can store binary data and executables. Backup and restore can easily be added via cloud services when they are not already configurable on the respective resources.
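
A hedged sketch of pushing such binaries to object storage with boto3 follows; the bucket name and key prefix are placeholders, and upload_file transparently uses multipart transfer for large payloads.

```python
import os
import boto3

# Illustrative only: bucket name and key prefix are placeholders.
s3 = boto3.client("s3")
BUCKET = "example-build-artifacts"

def upload_binary(local_path: str, prefix: str = "release-binaries/") -> str:
    """Upload one binary and return the object key to record in place of the file."""
    key = prefix + os.path.basename(local_path)
    s3.upload_file(local_path, BUCKET, key)   # multipart transfer handled automatically
    return key

# Usage: key = upload_binary("build/output/service.tar.gz")
```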

Another challenge is the proper mapping of infrastructure to handle the large processes involved in Continuous Integration and Continuous Deployment. GitHub Enterprise can provide up to 32 cores and 50,000 minutes/month for public repositories of sizes up to 50GB. The cloud, on the other hand, offers virtually limitless compute, storage, and networking, all with the convenience of pay-as-you-go billing. If there is an effective transformation of the DevOps automations, both the infrastructure required and the automations they support become easier to host in the cloud. As with the first challenge, taking stock of the inventory of infrastructure resources and automation logic can become daunting. Consequently, some form of organization and nomenclature to divide up the inventory into sizeable chunks can help with the transformation and even parallelization.
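
One way to divide the inventory into sizeable chunks for parallel migration is a simple greedy bucketing by size, sketched below in Python with invented repository sizes.

```python
# Greedy bucketing of repositories (name, size_gb) into roughly equal chunks
# so that migration or build transformation can proceed in parallel. Data is made up.
repos = [("core", 48), ("web", 12), ("tools", 5), ("docs", 1), ("services", 30)]

def chunk_by_size(items, num_chunks=3):
    chunks = [[] for _ in range(num_chunks)]
    totals = [0] * num_chunks
    for name, size in sorted(items, key=lambda x: -x[1]):   # largest first
        i = totals.index(min(totals))                        # least-loaded chunk
        chunks[i].append(name)
        totals[i] += size
    return list(zip(chunks, totals))

for names, total in chunk_by_size(repos):
    print(total, names)
```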

A third challenge is environment provisioning and manual testing. Subscriptions, resource groups, regions, and configurations proliferate in the cloud when such DevOps processes are transformed and migrated. This infrastructure and its state become veritable assets to guard in just the same way as the source code that is delivered with DevOps. Importing and exporting these Infrastructure-as-Code templates along with their states, and forming blueprints that can include policies and reconcile state, become a necessity. A proper organization and naming convention are needed for these as well.

Other miscellaneous challenges include, but are not limited to, forming best practices and centers of excellence, creating test data, providing manual deployments and overrides, ensuring suppliers, determining governance, integrating an architecture for tools, say in the form of runbooks, manual releases, determination of telemetry, determining teams and managing accesses, supporting regulatory compliance, providing service virtualization, and providing education for special skillsets. In addition, managing size and inconsistencies, maintaining sanctity as a production-grade system, providing an escalation path for feedback, and garnering collaboration across a landscape of organizations and teams must be dealt with.

Finally, people, process and technology must come together in a planned and streamlined manner to make this happen. These provide a glimpse of the roadmap towards the migration of build and deployments to the cloud.

Wednesday, February 8, 2023

 

Public clouds provide an adoption framework for businesses that helps to create an overall cloud adoption plan that guides programs and teams in their digital transformation. The plan methodology provides templates to create backlogs and plans to build necessary skills across the teams. It helps rationalize the data estate, prioritize the technical efforts, and identify the data workloads. It’s important to adhere to a set of architectural principles which help guide development and optimization of the workloads. A well-architected framework stands on five pillars of architectural excellence which include:

-          Reliability (REL)

-          Security (SEC)

-          Cost Optimization (COST)

-          Operational Excellence (OPS)

-          Performance efficiency (PERF)

The elements that support these pillars are a review, a cost and optimization advisor, documentation, patterns, support and service offers, reference architectures, and design principles.

This guidance provides a summary of how these principles apply to the management of the data workloads.
 
Cost optimization is one of the primary benefits of using the right tool for the right solution. It helps to analyze the spend over time as well as the effects of scale-out and scale-up. An advisor can help improve reusability and on-demand scaling and reduce data duplication, among many other things.

Performance is usually based on external factors and is very closely tied to customer satisfaction. Continuous telemetry and reactiveness are essential to tuning performance. The shared environment controls for management and monitoring create alerts, dashboards, and notifications specific to the performance of the workload. Performance considerations include storage and compute abstractions, dynamic scaling, partitioning, storage pruning, enhanced drivers, and multilayer caching.

Operational excellence comes with security and reliability. Security and data management must be built right into the system, in layers, for every application and workload. The data management and analytics scenario focuses on establishing a foundation for security. Although workload-specific solutions might be required, the foundation for security is built with the Azure landing zones and managed independently from the workload. Confidentiality and integrity of data, including privilege management, data privacy, and appropriate controls, must be ensured. Network isolation and end-to-end encryption must be implemented. SSO, MFA, conditional access, and managed service identities are used to secure authentication. Separation of concerns between the Azure control plane and the data plane, as well as RBAC access control, must be maintained.

The key considerations for reliability are how to detect change and how quickly the operations can be resumed. The existing environment should also include auditing, monitoring, alerting and a notification framework.

In addition to all the above, some consideration may be given to improving individual service level agreements, redundancy of workload specific architecture, and processes for monitoring and notification beyond what is provided by the cloud operations teams.

Each pillar contains questions whose answers relate to technical and organizational decisions that are not directly related to the features of the software to be deployed. For example, software that allows people to post comments must honor use cases where some people can write and others can read. But the system developed must also be safe and sound enough to handle all the traffic and should incur reasonable cost.

Since the most crucial pillars are OPS and SEC, they should never be traded off to get more out of the other pillars.

The security pillar consists of Identity and access management, detective controls, infrastructure protection, data protection and incident response. Three questions are routinely asked for this pillar:

1.       How is the access controlled for the serverless api?

2.       How are the security boundaries managed for the serverless application?

3.       How is the application security implemented for the workload?

The operational excellence pillar is made up of four parts: organization, preparation, operation, and evolution. The questions that drive the decisions for this pillar include:

1.       How is the health of the serverless application known?

2.       How is the application lifecycle management approached?

The reliability pillar is made of three parts: foundations, change management, and failure management. The questions asked for this pillar include:

1.       How are the inbound request rates regulated?

2.       How is resiliency built into the serverless application?

The cost optimization pillar consists of five parts: cloud financial management practice, expenditure and usage awareness, cost-effective resources, demand management and resources supply, and optimizations over time. The questions asked for cost optimization include:

1.       How are the costs optimized?

The performance efficiency pillar is composed of four parts: selection, review, monitoring and tradeoffs. The questions asked for this pillar include:

1.       How is the performance optimized for the serverless application?

In addition to these questions, there are quite a lot of opinionated and even authoritative perspectives on the appropriateness of a framework, and they are often referred to as lenses. With these forms of guidance, a well-architected framework moves closer to reality.

 

Tuesday, February 7, 2023

One of the architectural patterns for application migration is about managing AWS Service Catalog products in multiple AWS accounts and AWS Regions. AWS Service Catalog is used to create, share, organize, and govern curated IaC templates. Governance and distribution of infrastructure are simplified and accelerated. AWS uses CloudFormation templates to define a collection of AWS resources, aka a stack, required for a solution or a product. StackSets extend this functionality by enabling us to create, update, or delete stacks across multiple accounts and AWS Regions with a single operation.
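
As a hedged sketch of that single operation, the boto3 calls below create a stack set and then stack instances across accounts and Regions; the template file, account IDs, and Regions are placeholders rather than values from this pattern.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Placeholder template body, accounts, and Regions for illustration only.
TEMPLATE_BODY = open("vpc-baseline.yaml").read()

# One stack set defined in the administrator account...
cfn.create_stack_set(
    StackSetName="vpc-baseline",
    TemplateBody=TEMPLATE_BODY,
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# ...then stack instances rolled out across accounts and Regions in one call.
cfn.create_stack_instances(
    StackSetName="vpc-baseline",
    Accounts=["111111111111", "222222222222"],
    Regions=["us-east-1", "us-west-2"],
)
```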

If a CloudFormation template must be made available to other AWS accounts or organizational units, then the portfolio is typically shared. A portfolio is a container that includes one or more products.   

On the other hand, this architectural pattern is an alternative approach that is based on AWS CloudFormation StackSets. Instead of sharing the portfolio, we use StackSet constraints to set the AWS Regions and accounts where the resources can be deployed and used. This approach helps provision the Service Catalog products in multiple accounts, OUs, and AWS Regions, managed from a central location, which meets governance requirements.

The benefits of this approach are the following:

  1. The product is provisioned and managed from a primary account, and not shared with other accounts.

  2. This approach provides a consolidated view of all provisioned products (stacks) that are based on a specific set of templates.

  3. The use of a primary account makes the configuration with the AWS Service Management Connector easier.

  4. It is easier to query and use products from the AWS Service Catalog.

The architecture involves an AWS management account and a target organizational unit (OU) account. The CloudFormation template and the Service Catalog product are in the management account. The CloudFormation stack and its resources are in the target OU account. The user creates an AWS CloudFormation template, in JSON or YAML format, to provision AWS resources. The CloudFormation template creates a product in AWS Service Catalog, which is added to a portfolio. The user then creates a provisioned product, which creates CloudFormation stacks in the target accounts. Each stack provisions the resources specified in the CloudFormation template.

The steps to provision products across accounts include the following (a minimal sketch follows the list):

1.       Create a portfolio, say with the AWS Command Line Interface.

2.       Create the template that describes the resources.

3.       Create a product with a version title and description.

4.       Apply constraints to the portfolio to configure product deployment options such as multiple AWS accounts, Regions, and permissions.

5.       Provide permissions to users so that they can launch the products in the portfolio.
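
The boto3 sketch below walks through these steps; the names, template URL, account IDs, and role ARNs are placeholders, and the exact JSON schema of the STACKSET constraint parameters should be verified against the AWS Service Catalog documentation.

```python
import json
import boto3

sc = boto3.client("servicecatalog", region_name="us-east-1")

# 1. Create the portfolio (display and provider names are placeholders).
portfolio = sc.create_portfolio(DisplayName="shared-infra", ProviderName="platform-team")
portfolio_id = portfolio["PortfolioDetail"]["Id"]

# 2./3. Create a product from a CloudFormation template with a version title and description.
product = sc.create_product(
    Name="vpc-baseline",
    Owner="platform-team",
    ProductType="CLOUD_FORMATION_TEMPLATE",
    ProvisioningArtifactParameters={
        "Name": "v1.0",
        "Description": "Initial version",
        "Type": "CLOUD_FORMATION_TEMPLATE",
        "Info": {"LoadTemplateFromURL": "https://example-bucket.s3.amazonaws.com/vpc.yaml"},
    },
)
product_id = product["ProductViewDetail"]["ProductViewSummary"]["ProductId"]
sc.associate_product_with_portfolio(ProductId=product_id, PortfolioId=portfolio_id)

# 4. Constrain deployment to specific accounts and Regions with a STACKSET constraint.
#    The parameter document below is an assumption; check the schema in the AWS docs.
sc.create_constraint(
    PortfolioId=portfolio_id,
    ProductId=product_id,
    Type="STACKSET",
    Parameters=json.dumps({
        "Version": "2.0",
        "Properties": {
            "AccountList": ["111111111111"],
            "RegionList": ["us-east-1", "us-west-2"],
            "AdminRole": "arn:aws:iam::999999999999:role/StackSetAdministrationRole",
            "ExecutionRole": "StackSetExecutionRole",
        },
    }),
)

# 5. Allow a principal (an IAM role here) to launch products in the portfolio.
sc.associate_principal_with_portfolio(
    PortfolioId=portfolio_id,
    PrincipalARN="arn:aws:iam::999999999999:role/DevOpsEngineers",
    PrincipalType="IAM",
)
```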

Monday, February 6, 2023

 

Pure and mixed templates:

Infrastructure-as-Code is a declarative paradigm: a language for describing infrastructure and the state that it must achieve. The service that understands this language supports tags, RBAC, declarative syntax, locks, policies, and logs for the resources and their create, update, and delete operations, which can be exposed via the command-line interface, scripts, web requests, and the user interface. The declarative style also helps to boost agility, productivity, and quality of work within organizations.

Template providers often go to great lengths to determine the conventions, syntax, and semantics that authors can use to describe the infrastructure to be set up. Many provide common forms of expressing infrastructure and equivalents that are similar across providers. Authors, however, rely on tools to import and export infrastructure. Consequently, they must mix and match templates.

One such template provider is AWS cloud’s CloudFormation. Terraform is the open-source equivalent that helps the users with the task of setting up and provisioning datacenter infrastructure independent of clouds. These cloud configuration files can be shared among team members, treated as code, edited, reviewed and versioned.

Terraform allows including JSON and YAML in the templates and state files using built-in functions called jsonencode and yamlencode respectively. With tools that export templates in one of these two well-known forms, it becomes easy to import them into Terraform with these two built-in functions. Terraform can also be used to read and export existing cloud infrastructure in its own syntax, but often such exports arrive in an ugly, compressed, hard-to-read format; these two built-in functions allow a multi-line display of the same content, which makes it more readable.
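
Outside of Terraform itself, the readability point can be illustrated with a short Python sketch that reformats a compressed JSON export into indented JSON and YAML; the template content is a stand-in and the PyYAML package is assumed to be installed.

```python
import json
import yaml   # PyYAML, assumed to be installed

# A stand-in for a template exported in a compressed, hard-to-read form.
compressed = '{"resource":{"aws_s3_bucket":{"logs":{"bucket":"example-logs"}}}}'

parsed = json.loads(compressed)
print(json.dumps(parsed, indent=2))              # multi-line JSON
print(yaml.safe_dump(parsed, sort_keys=False))   # the same content as YAML
```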

AWS CloudFormation has a certain appeal for being AWS native with a common language to model and provision AWS and third-party resources. It abstracts the nuances in managing AWS resources and their dependencies making it easier for creating and deleting resources in a predictable manner. It makes versioning and iterating of the infrastructure more accessible. It supports iterative testing as well as rollback.

Terraform’s appeal is that it can be used for multi-cloud deployment. For example, it can deploy serverless functions with AWS Lambda, manage Microsoft Azure Active Directory resources, and provision a load balancer in Google Cloud.

Both facilitate state management. With CloudFormation, users can perform drift detection on all of their assets and get notifications when something changes. It also determines dependencies and performs certain validations before a delete command is honored. Terraform stores the state of the infrastructure on the provisioning computer or at a remote site, in proprietary JSON that serves to describe and configure the resources. State management is done automatically with no user involvement by CloudFormation, whereas Terraform requires you to specify a remote store or fall back to the local disk to save state.

Both have their unique ways for addressing flexibility for changing requirements. Terraform has modules which are containers for multiple resources that are used together and CloudFormation utilizes a system called “nested stacks” where templates can be called from within templates. A benefit of Terraform is increased flexibility over CloudFormation regarding modularity.

They also differ in how they handle configuration and parameters. Terraform uses provider-specific data sources. The implementation is modular, allowing data to be fetched and reused. CloudFormation allows up to 60 parameters per template, which must be of a type that CloudFormation understands and must be declared or retrieved from the Systems Manager Parameter Store and used within the template.

Both are powerful cloud infrastructure management tools, but Terraform is more favorable for cloud-agnostic support. It also ties in very well with DevOps automations such as GitLab. Finally, having an abstraction over cloud lock-ins might also be beneficial to the organization in the long run.

Sunday, February 5, 2023

Extending datacenters to the public cloud:

A specific pattern used toward hybrid computing involves extending datacenters to the public cloud. Many companies have significant investments in their immovable datacenters, and while they can create a private cloud, such as a VMWare cloud, within the public cloud, they might find it costly to maintain both an on-premises cloud and one in the public cloud. A reasonable approach between these choices is to extend the existing datacenters to the public cloud. This article explores this pattern.

 

Although technology products are usually not referred to by their brand or product names in a technical discussion of an architectural pattern, doing so simplifies this narrative by providing a specific example of the technology discussed. Since many technological innovations are patented, it is hard to refer to them without using product names. In this case, we use the example of a private cloud with VMWare Cloud and refer to its products for manageability. VMWare vCenter is a centralized management utility that can manage virtual machines, hosts, and dependent components. VMWare vSphere is VMWare’s virtualization platform, which transforms datacenters into aggregated computing infrastructures that include CPU, storage, and networking resources.

The pattern to extend the datacenter to VMWare Cloud on AWS uses Hybrid Linked Mode. Inventories in both places can be managed through a single VMWare vSphere Client interface. This ensures consistent operations and simplified administration and uses a VMWare Cloud Gateway Appliance. It can be used to manage both applications and virtual machines that are on-premises.

There are two mutually exclusive options for configuration. The first option installs the Cloud Gateway Appliance and uses it to link from the on-premises vCenter Server to the cloud SDDC. The second option configures Hybrid Linked Mode from the cloud SDDC. Hybrid Linked Mode can only connect one on-premises vCenter Server Enhanced Linked Mode domain and supports on-premises vCenter Servers running recent versions. When a cloud gateway appliance is connected via Hybrid Linked Mode, there can be multiple vCenter Servers connected to the appliance, but when the cloud SDDC is directly connected via Hybrid Linked Mode, there can be only one vCenter Server.

Different workloads can be migrated using either a cold migration or a live migration with VMWare vSphere vMotion. Factors that must be considered when choosing the migration method include virtual switch type and version, the connection type to the cloud SDDC, and the virtual hardware version.

A cold migration is appropriate for virtual machines that can tolerate downtime. These virtual machines can be shut down, migrated, and then powered back on. The migration time is faster because there is no need to copy active memory. This holds true for applications as well. A live migration, on the other hand, uses vMotion to perform a rolling migration without downtime and is advisable for mission-critical applications. The idea behind vMotion is that a destination instance is prepared and made ready, and the switch from source to destination happens near instantaneously.

This pattern promotes the visibility of existing infrastructure to the cloud.